Installing the Required Libraries
    pip install requests
    pip install bs4
requests: used to request web page data.
bs4.BeautifulSoup: used to process that page data conveniently.
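As a quick check that the installation worked, the sketch below reports which of the two packages cannot be imported. The helper name `missing_packages` is made up for this illustration; it relies only on the standard library's `importlib.util.find_spec`.

```python
import importlib.util


def missing_packages(names):
    """Return the names in `names` that cannot be found as importable modules."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# "requests" and "bs4" are the two import names used throughout this post.
missing = missing_packages(["requests", "bs4"])
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All required libraries are installed.")
```

If anything is listed as not installed, rerunning the corresponding `pip install` command should fix it.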
Requesting Web Page Data
Use requests.get to fetch the data, for example:
    import requests

    response = requests.get("http://books.toscrape.com/")

    # response.ok is True when the status code is below 400
    if response.ok:
        print(response.text)
    else:
        print("Request failed with status", response.status_code)
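The request above can also fail before any response arrives, for instance on a malformed URL or a dropped connection. A minimal sketch of one way to handle that, assuming a helper named `fetch_html` (hypothetical, not part of requests) and an arbitrary 10-second timeout:

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page body as text, or None if the request fails."""
    try:
        response = requests.get(url, timeout=timeout)
        # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses
        response.raise_for_status()
    except requests.RequestException as exc:
        # RequestException is the base class of the exceptions requests raises
        print(f"Request to {url} failed: {exc}")
        return None
    return response.text
```

Calling `fetch_html("http://books.toscrape.com/")` would then return the HTML on success and `None` otherwise, so the caller only has to test for `None`.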
A custom header can make the request look as if it came from a browser, which avoids being blocked by some websites, for example:
    import requests

    # Pretend to be a desktop browser
    head = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
    }
    response = requests.get("http://books.toscrape.com/", headers=head)

    if response.ok:
        print(response.text)
    else:
        print("Request failed with status", response.status_code)
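When several requests share the same header, one option is to store it on a `requests.Session` so it does not have to be passed every time. The sketch below builds such a session and inspects the prepared request to confirm the header is attached, without sending anything over the network. The helper name `build_session` is an assumption made for this example.

```python
import requests

HEAD = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}


def build_session(headers):
    """Create a Session whose default headers include `headers`."""
    session = requests.Session()
    session.headers.update(headers)
    return session


session = build_session(HEAD)

# prepare_request merges the session defaults into the request; nothing is sent.
prepared = session.prepare_request(
    requests.Request("GET", "http://books.toscrape.com/")
)
print(prepared.headers["User-Agent"])
```

After this, `session.get(url)` would send the custom User-Agent on every call.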
Data Processing
Use bs4.BeautifulSoup to process the data conveniently.
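Calling the soup object's findAll method (named find_all in current BeautifulSoup releases) finds every element with the specified tag and attributes, for example: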
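    from bs4 import BeautifulSoup

    content = response.text  # response comes from the earlier requests.get call
    soup = BeautifulSoup(content, "html.parser")
    # find_all is the current name of findAll
    all_title = soup.find_all("p", attrs={"class": "title"})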
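To see what this kind of lookup matches without fetching a page, here is a small self-contained sketch that runs the same query on an inline HTML string. The sample markup and the helper name `extract_titles` are invented for illustration only.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two matching <p> tags, one <p> with another class,
# and one <span> that has the right class but the wrong tag name.
SAMPLE_HTML = """
<div>
  <p class="title">First title</p>
  <p class="note">Not a title</p>
  <p class="title">Second title</p>
  <span class="title">Wrong tag</span>
</div>
"""


def extract_titles(html):
    """Return the text of every <p class="title"> element in `html`."""
    soup = BeautifulSoup(html, "html.parser")
    matches = soup.find_all("p", attrs={"class": "title"})
    return [tag.get_text(strip=True) for tag in matches]


titles = extract_titles(SAMPLE_HTML)
print(titles)
```

Both the tag name and the class have to match, so the `<span>` and the `note` paragraph are left out.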
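With these basic operations we can, for example, collect the Chinese titles of all the movies in the top250 list on the website http://books.toscrape.com/, as follows: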
    import requests
    from bs4 import BeautifulSoup

    head = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
    }

    # Request the pages with start=0, 25, ..., 225
    for start_num in range(0, 250, 25):
        # f-string, so that start_num is actually substituted into the URL
        response = requests.get(f"http://books.toscrape.com/top250?start={start_num}", headers=head)

        if response.ok:
            content = response.text
            soup = BeautifulSoup(content, "html.parser")
            all_title = soup.find_all("p", attrs={"class": "title"})
            for title in all_title:
                title_string = title.string
                # title.string is None when the tag contains nested markup;
                # skip those, and skip any title that contains "/"
                if title_string and "/" not in title_string:
                    print(title_string)
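The two pieces of logic in this script, building the paginated URLs and filtering the titles, can be separated from the network call so they are easy to check on their own. The sketch below is one way to do that. The function names `page_urls` and `primary_titles` and the sample HTML are assumptions made for this example, and the URL pattern is simply the one used in the script above.

```python
from bs4 import BeautifulSoup

BASE_URL = "http://books.toscrape.com/top250"  # URL pattern taken from the script above


def page_urls(total=250, page_size=25):
    """Build one URL per page: start=0, 25, ..., up to but excluding `total`."""
    return [f"{BASE_URL}?start={start}" for start in range(0, total, page_size)]


def primary_titles(html):
    """Return the text of <p class="title"> tags, skipping any that contain "/"."""
    soup = BeautifulSoup(html, "html.parser")
    result = []
    for tag in soup.find_all("p", attrs={"class": "title"}):
        text = tag.string  # None when the tag holds nested markup
        if text and "/" not in text:
            result.append(text)
    return result


# Hypothetical page fragment used only to exercise the filter.
SAMPLE_HTML = """
<p class="title">Primary Title</p>
<p class="title"> / Alternate Title</p>
<p class="title">Mixed <b>markup</b></p>
"""

urls = page_urls()
print(len(urls), urls[0], urls[-1])
print(primary_titles(SAMPLE_HTML))
```

Each URL from `page_urls()` could then be fetched with `requests.get(url, headers=head)` and the response text passed to `primary_titles`.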