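Today is the last day of the data analysis section; tomorrow we move into the world of the web. Let's take it one step at a time and get today's API material done first. The content of these last two days looks simple, but when I actually need to do this for real I will probably have to come back and reread it. Otherwise I still won't know where to start.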
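17.1 Using a Web API
A Web API is a part of a website designed to interact with programs that use specific URLs to request particular information. Such a request is called an API call.
17.1.1 Git and GitHub
GitHub takes its name from Git, a distributed version control system that helps people manage the work they do on a project, so that one person's changes don't interfere with changes made by others.
17.1.2 Requesting Data Using an API Call
To see what an API call looks like, enter the following address in your browser's address bar and press Enter: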
https://api.github.com/search/repositories?q=language:python&sort=stars
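(胭惜雨: typing URLs by hand is really tiring...)
This call returns the number of Python projects currently hosted on GitHub, along with information about the most popular Python repositories.
The first part, https://api.github.com/, directs the request to the part of GitHub that responds to API calls. The next part, search/repositories, tells the API to search through all repositories on GitHub.
The question mark after repositories signals that we're about to pass an argument. The q stands for query, and the equals sign lets us begin specifying the query. We use language:python to indicate that we want information only on repositories whose primary language is Python. The final part, &sort=stars, sorts the projects by the number of stars they've received.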
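As a quick check of the URL anatomy described above, the following sketch splits the same URL with Python's standard urllib.parse module. It makes no network request, and the variable names are only for illustration.

```python
from urllib.parse import urlsplit, parse_qs

url = "https://api.github.com/search/repositories?q=language:python&sort=stars"

parts = urlsplit(url)
params = parse_qs(parts.query)  # each value comes back as a list of strings

print(f"Host:  {parts.netloc}")   # where the request is sent
print(f"Path:  {parts.path}")     # which part of the API handles it
print(f"Query: {params}")         # the arguments after the question mark
```

Seeing the query as a dictionary makes it clear that q and sort are two separate arguments joined by the ampersand.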
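17.1.3 Installing Requests
The Requests package lets a Python program easily request information from a website and examine the response that's returned. To install Requests, use pip: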
$ python -m pip install --user requests
17.1.4 Processing an API Response
import requests
# Make an API call and store the response.
url = 'https://api.github.com/search/repositories?q=language:python&sort=stars'
headers = {'Accept': 'application/vnd.github.v3+json'}
r = requests.get(url, headers=headers)
print(f"Status code: {r.status_code}")
# Store the API response in a variable.
response_dict = r.json()
# Process results.
print(response_dict.keys())
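To experiment with the response structure without spending an API call, here is a minimal offline sketch. The sample text is a made-up, heavily trimmed stand-in for a real response; it assumes the search endpoint returns the top-level keys total_count, incomplete_results, and items.

```python
import json

# Hypothetical, trimmed stand-in for the body of a real search response.
sample_text = '{"total_count": 3, "incomplete_results": false, "items": []}'

# r.json() performs the equivalent decoding on the response body.
response_dict = json.loads(sample_text)

print(sorted(response_dict.keys()))

# A true value here would mean GitHub could not finish the search in time.
if response_dict["incomplete_results"]:
    print("Warning: the results are partial.")
else:
    print(f"Complete results, {response_dict['total_count']} repositories matched.")
```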
17.1.5 Working with the Response Dictionary
import requests
# Make an API call and store the response.
url = 'https://api.github.com/search/repositories?q=language:python&sort=stars'
headers = {'Accept': 'application/vnd.github.v3+json'}
r = requests.get(url, headers=headers)
print(f"Status code: {r.status_code}")
# Store the API response in a variable.
response_dict = r.json()
# Process results.
print(f"Total repositories: {response_dict['total_count']}")
# Explore information about the repositories.
repo_dicts = response_dict['items']
print(f"Repositories returned: {len(repo_dicts)}")
# Examine the first repository.
repo_dict = repo_dicts[0]
print("\nSelected information about first repository:")
print(f"Name: {repo_dict['name']}")
print(f"Owner: {repo_dict['owner']['login']}")
print(f"Stars: {repo_dict['stargazers_count']}")
print(f"Repository: {repo_dict['html_url']}")
print(f"Created: {repo_dict['created_at']}")
print(f"Updated: {repo_dict['updated_at']}")
print(f"Description: {repo_dict['description']}")
Glossary (English = Chinese):
name = 名字
owner = 主人、主人翁
stars = 星
repository = 存储库
created = 创建
updated = 更新
description = 说明
layout = 布局
17.1.6 Summarizing the Top Repositories
import requests
# Make an API call and store the response.
url = 'https://api.github.com/search/repositories?q=language:python&sort=stars'
headers = {'Accept': 'application/vnd.github.v3+json'}
r = requests.get(url, headers=headers)
print(f"Status code: {r.status_code}")
# Store the API response in a variable.
response_dict = r.json()
# Process results.
print(f"Total repositories: {response_dict['total_count']}")
# Explore information about the repositories.
repo_dicts = response_dict['items']
print(f"Repositories returned: {len(repo_dicts)}")
print("\nSelected information about each repository:")
for repo_dict in repo_dicts:
    print(f"\nName: {repo_dict['name']}")
    print(f"Owner: {repo_dict['owner']['login']}")
    print(f"Stars: {repo_dict['stargazers_count']}")
    print(f"Repository: {repo_dict['html_url']}")
    print(f"Created: {repo_dict['created_at']}")
    print(f"Updated: {repo_dict['updated_at']}")
    print(f"Description: {repo_dict['description']}")
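Some repositories have no description, in which case the value comes back as None. A minimal sketch of a summary helper that tolerates this is shown below; summarize_repo and the sample entries are hypothetical, shaped like the dictionaries in the items list.

```python
def summarize_repo(repo_dict):
    """Return a one-line summary, tolerating a missing or null description."""
    description = repo_dict.get("description") or "(no description)"
    owner = repo_dict["owner"]["login"]
    stars = repo_dict["stargazers_count"]
    return f"{repo_dict['name']} by {owner}: {stars} stars. {description}"


# Made-up sample entries, not real API output.
sample_repos = [
    {"name": "alpha", "owner": {"login": "alice"},
     "stargazers_count": 120, "description": "Demo project"},
    {"name": "beta", "owner": {"login": "bob"},
     "stargazers_count": 95, "description": None},
]

for repo in sample_repos:
    print(summarize_repo(repo))
```

Pulling the formatting into one function also keeps the loop short when more fields are added later.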
17.1.7 Monitoring API Rate Limits
Enter https://api.github.com/rate_limit in your browser to see the API's rate limits.
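The rate_limit page returns JSON, so it can also be read from a program. The sketch below works on a made-up sample and assumes the response contains a resources entry with a search section holding limit, remaining, and reset (a Unix timestamp). The numbers are invented for illustration.

```python
from datetime import datetime, timezone

# Illustrative values only; the real response lists more resource categories.
sample_rate_limit = {
    "resources": {
        "search": {"limit": 10, "remaining": 8, "reset": 1550039643},
    }
}

search = sample_rate_limit["resources"]["search"]

# 'reset' is seconds since the Unix epoch; convert it to a readable UTC time.
reset_time = datetime.fromtimestamp(search["reset"], tz=timezone.utc)

print(f"Search requests left: {search['remaining']} of {search['limit']}")
print(f"Quota resets at: {reset_time.isoformat()}")
```

If remaining reaches zero, waiting until the reset time before calling the search API again avoids failed requests.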
17.2 Visualizing Repositories Using Plotly
import requests
from plotly import offline
# Make an API call and store the response.
url = 'https://api.github.com/search/repositories?q=language:python&sort=stars'
headers = {'Accept': 'application/vnd.github.v3+json'}
r = requests.get(url, headers=headers)
print(f"Status code: {r.status_code}")
# Process results.
response_dict = r.json()
repo_dicts = response_dict['items']
repo_names, stars = [], []
for repo_dict in repo_dicts:
    repo_names.append(repo_dict['name'])
    stars.append(repo_dict['stargazers_count'])
# Make the visualization.
data = [{'type': 'bar', 'x': repo_names, 'y': stars}]
my_layout = {
    'title': 'Most-Starred Python Projects on GitHub',
    'xaxis': {'title': 'repository'},
    'yaxis': {'title': 'stars'},
}
fig = {'data': data, 'layout': my_layout}
offline.plot(fig, filename='python_repos.html')
17.2.1 Refining Plotly Charts
import requests
from plotly import offline
# Make an API call and store the response.
url = 'https://api.github.com/search/repositories?q=language:python&sort=stars'
headers = {'Accept': 'application/vnd.github.v3+json'}
r = requests.get(url, headers=headers)
print(f"Status code: {r.status_code}")
# Process results.
response_dict = r.json()
repo_dicts = response_dict['items']
repo_names, stars = [], []
for repo_dict in repo_dicts:
    repo_names.append(repo_dict['name'])
    stars.append(repo_dict['stargazers_count'])
# Make the visualization; 'marker' sets the bar color, outline, and opacity.
data = [{
    'type': 'bar',
    'x': repo_names,
    'y': stars,
    'marker': {
        'color': 'rgb(60,100,150)',
        'line': {'width': 1.5, 'color': 'rgb(25,25,25)'},
    },
    'opacity': 0.6,
}]
my_layout = {
    'title': 'Most-Starred Python Projects on GitHub',
    'xaxis': {'title': 'repository'},
    'yaxis': {'title': 'stars'},
}
fig = {'data': data, 'layout': my_layout}
offline.plot(fig, filename='python_repos.html')
# The same chart again, now with larger fonts for the title, axis titles, and tick labels.
import requests
from plotly import offline
# Make an API call and store the response.
url = 'https://api.github.com/search/repositories?q=language:python&sort=stars'
headers = {'Accept': 'application/vnd.github.v3+json'}
r = requests.get(url, headers=headers)
print(f"Status code: {r.status_code}")
# Process results.
response_dict = r.json()
repo_dicts = response_dict['items']
repo_names, stars = [], []
for repo_dict in repo_dicts:
    repo_names.append(repo_dict['name'])
    stars.append(repo_dict['stargazers_count'])
# Make the visualization.
data = [{
    'type': 'bar',
    'x': repo_names,
    'y': stars,
    'marker': {
        'color': 'rgb(60,100,150)',
        'line': {'width': 1.5, 'color': 'rgb(25,25,25)'},
    },
    'opacity': 0.6,
}]
# The nested {'text': ..., 'font': ...} form replaces the 'titlefont' key,
# which is deprecated in recent Plotly versions.
my_layout = {
    'title': {'text': 'Most-Starred Python Projects on GitHub', 'font': {'size': 28}},
    'xaxis': {
        'title': {'text': 'repository', 'font': {'size': 24}},
        'tickfont': {'size': 14},
    },
    'yaxis': {
        'title': {'text': 'stars', 'font': {'size': 24}},
        'tickfont': {'size': 14},
    },
}
fig = {'data': data, 'layout': my_layout}
offline.plot(fig, filename='python_repos.html')
17.2.2 Adding Custom Tooltips
import requests
from plotly import offline
# Make an API call and store the response.
url = 'https://api.github.com/search/repositories?q=language:python&sort=stars'
headers = {'Accept': 'application/vnd.github.v3+json'}
r = requests.get(url, headers=headers)
print(f"Status code: {r.status_code}")
# Process results.
response_dict = r.json()
repo_dicts = response_dict['items']
repo_names, stars, labels = [], [], []
for repo_dict in repo_dicts:
    repo_names.append(repo_dict['name'])
    stars.append(repo_dict['stargazers_count'])
    # Build the hover text: owner on one line, description on the next.
    owner = repo_dict['owner']['login']
    description = repo_dict['description']
    label = f"{owner}<br />{description}"
    labels.append(label)
# Make the visualization; 'hovertext' supplies the custom tooltips.
data = [{
    'type': 'bar',
    'x': repo_names,
    'y': stars,
    'hovertext': labels,
    'marker': {
        'color': 'rgb(60,100,150)',
        'line': {'width': 1.5, 'color': 'rgb(25,25,25)'},
    },
    'opacity': 0.6,
}]
# The nested {'text': ..., 'font': ...} form replaces the 'titlefont' key,
# which is deprecated in recent Plotly versions.
my_layout = {
    'title': {'text': 'Most-Starred Python Projects on GitHub', 'font': {'size': 28}},
    'xaxis': {
        'title': {'text': 'repository', 'font': {'size': 24}},
        'tickfont': {'size': 14},
    },
    'yaxis': {
        'title': {'text': 'stars', 'font': {'size': 24}},
        'tickfont': {'size': 14},
    },
}
fig = {'data': data, 'layout': my_layout}
offline.plot(fig, filename='python_repos.html')
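When a repository has no description, the label built above shows the literal text None, and long descriptions produce very wide tooltips. The following is one possible sketch of a label helper that handles both cases; make_label, its width parameter, and the sample inputs are hypothetical.

```python
import textwrap


def make_label(owner, description, width=40):
    """Build hover text: owner on the first line, wrapped description below."""
    text = description or "No description provided"
    # Plotly hover text accepts <br /> as a line break.
    wrapped = "<br />".join(textwrap.wrap(text, width=width))
    return f"{owner}<br />{wrapped}"


print(make_label("alice", None))
print(make_label("bob", "A made-up description that is long enough to wrap onto a second line"))
```

In the loop, labels.append(make_label(owner, description)) would take the place of the f-string.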
17.2.3 Adding Clickable Links to the Chart
import requests
from plotly import offline
# Make an API call and store the response.
url = 'https://api.github.com/search/repositories?q=language:python&sort=stars'
headers = {'Accept': 'application/vnd.github.v3+json'}
r = requests.get(url, headers=headers)
print(f"Status code: {r.status_code}")
# Process results.
response_dict = r.json()
repo_dicts = response_dict['items']
repo_links, stars, labels = [], [], []
for repo_dict in repo_dicts:
    # Wrap each repository name in an HTML anchor so the x-axis label is clickable.
    repo_name = repo_dict['name']
    repo_url = repo_dict['html_url']
    repo_link = f"<a href='{repo_url}'>{repo_name}</a>"
    repo_links.append(repo_link)
    stars.append(repo_dict['stargazers_count'])
    owner = repo_dict['owner']['login']
    description = repo_dict['description']
    label = f"{owner}<br />{description}"
    labels.append(label)
# Make the visualization, using the links as x-axis labels.
data = [{
    'type': 'bar',
    'x': repo_links,
    'y': stars,
    'hovertext': labels,
    'marker': {
        'color': 'rgb(60,100,150)',
        'line': {'width': 1.5, 'color': 'rgb(25,25,25)'},
    },
    'opacity': 0.6,
}]
# The nested {'text': ..., 'font': ...} form replaces the 'titlefont' key,
# which is deprecated in recent Plotly versions.
my_layout = {
    'title': {'text': 'Most-Starred Python Projects on GitHub', 'font': {'size': 28}},
    'xaxis': {
        'title': {'text': 'repository', 'font': {'size': 24}},
        'tickfont': {'size': 14},
    },
    'yaxis': {
        'title': {'text': 'stars', 'font': {'size': 24}},
        'tickfont': {'size': 14},
    },
}
fig = {'data': data, 'layout': my_layout}
offline.plot(fig, filename='python_repos.html')
17.3 Hacker News API
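The following call returns information about a single article, identified by its ID (19155826 here), which was one of the most popular articles on Hacker News:
https://hacker-news.firebaseio.com/v0/item/19155826.json
(胭惜雨: this URL would not open for me at all until I went through a proxy.)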
{"by":"jimktrains2","descendants":221,"id":19155826,"kids":[19156572,19158857,19156773,19157251,19156415,19159820,19157154,19156385,19156489,19158522,19156755,19156974,19158319,19157034,19156935,19158935,19157531,19158638,19156466,19156758,19156565,19156498,19156335,19156041,19156704,19159047,19159127,19156217,19156375,19157945],"score":728,"time":1550085414,"title":"Nasa’s Mars Rover Opportunity Concludes a 15-Year Mission","type":"story","url":"https://www.nytimes.com/2019/02/13/science/mars-opportunity-rover-dead.html"}
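To make the fields in this response easier to read, here is a small offline sketch using values copied from the output above (the long kids list is omitted). It assumes time is a Unix timestamp in seconds and descendants is the total number of comments.

```python
from datetime import datetime, timezone

# A few fields copied from the response shown above.
item = {
    "by": "jimktrains2",
    "descendants": 221,
    "id": 19155826,
    "score": 728,
    "time": 1550085414,
    "type": "story",
}

# Convert the Unix timestamp to a timezone-aware UTC datetime.
posted = datetime.fromtimestamp(item["time"], tz=timezone.utc)

print(f"Posted by {item['by']} on {posted:%Y-%m-%d} (UTC)")
print(f"Score: {item['score']}, comments: {item['descendants']}")
print(f"Discussion: https://news.ycombinator.com/item?id={item['id']}")
```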
import json
import requests
# Make an API call and store the response.
url = 'https://hacker-news.firebaseio.com/v0/item/19155826.json'
r = requests.get(url)
print(f"Status code: {r.status_code}")
# Explore the structure of the data by writing it to a readable file.
response_dict = r.json()
readable_file = 'readable_hn_data.json'
with open(readable_file, 'w') as f:
    json.dump(response_dict, f, indent=4)
# Next script: fetch the current top stories and list them by number of comments.
from operator import itemgetter
import requests
# Make an API call and store the response.
url = 'https://hacker-news.firebaseio.com/v0/topstories.json'
r = requests.get(url)
print(f"Status code: {r.status_code}")
# Process information about each submission.
submission_ids = r.json()
submission_dicts = []
for submission_id in submission_ids[:10]:
    # Make a separate API call for each submission.
    url = f"https://hacker-news.firebaseio.com/v0/item/{submission_id}.json"
    r = requests.get(url)
    print(f"id: {submission_id}\tstatus: {r.status_code}")
    response_dict = r.json()
    # Build a dictionary for each article.
    # Some items (job postings, for example) have no 'descendants' key, so default to 0.
    submission_dict = {
        'title': response_dict['title'],
        'hn_link': f"https://news.ycombinator.com/item?id={submission_id}",
        'comments': response_dict.get('descendants', 0),
    }
    submission_dicts.append(submission_dict)
# Sort by comment count, most-discussed first.
submission_dicts = sorted(submission_dicts, key=itemgetter('comments'),
                          reverse=True)
for submission_dict in submission_dicts:
    print(f"\nTitle: {submission_dict['title']}")
    print(f"Discussion link: {submission_dict['hn_link']}")
    print(f"Comments: {submission_dict['comments']}")
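The sorting step can be tried without any network access. The sketch below uses made-up items, one of which has no descendants key (as might happen with a job posting), and falls back to 0 comments in that case so the sort does not fail.

```python
from operator import itemgetter

# Made-up items; the second one has no 'descendants' key.
raw_items = [
    {"id": 1, "title": "First story", "descendants": 12},
    {"id": 2, "title": "Hiring post"},
    {"id": 3, "title": "Second story", "descendants": 40},
]

# dict.get() supplies a default of 0 when the key is absent.
submissions = [
    {"title": item["title"], "comments": item.get("descendants", 0)}
    for item in raw_items
]

# itemgetter('comments') pulls the sort key from each dictionary.
submissions.sort(key=itemgetter("comments"), reverse=True)

for submission in submissions:
    print(f"{submission['title']}: {submission['comments']} comments")
```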
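That wraps up Chapter 17. After these two projects I have one takeaway: the main purpose of reading through the book once and then typing everything out once is to know where to look when you can't remember how to write something. Expecting a single project from the book to let you write a small project on your own, without the book open, is basically unrealistic.
If you really want to learn this, either grind away at one type of project, building it again and again until you can produce a data analysis chart without looking at the book, or read a lot, until you have those few functions and structures memorized.
Either way, it isn't easy.
胭惜雨
March 3, 2021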