Web Crawler


URL Manager

Manages the URLs waiting to be crawled

Handles prioritization and avoids crawl loops and duplicate crawling

Web Page Parser

Extracts the valuable data

Extracts new URLs to be crawled

Web Page Downloader

Downloads the page content
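A rough sketch of how the three components fit together; the names here are illustrative placeholders, and the concrete versions appear in the projects below.

# Illustrative crawl loop tying the URL manager, downloader and parser together.
# The names (crawl, download, parse, urls) are placeholders, not an existing API.
def crawl(seed_url, urls, download, parse):
    results = []
    urls.add_new_url(seed_url)            # URL manager: seed URL
    while urls.has_new_url():
        url = urls.get_url()              # URL manager: hand out the next URL
        html = download(url)              # downloader: e.g. requests.get(url).text
        data, new_links = parse(html)     # parser: extract data and new URLs
        for link in new_links:
            urls.add_new_url(link)        # newly found URLs go back into the manager
        results.append(data)
    return results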

requests


The page can only be parsed correctly with the right encoding.

r.url shows the percent-encoded URL,

which may differ from the URL you typed in.

r.text returns the body as a str;

r.content returns it as bytes.
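A quick sketch of the difference (the query parameter here is just an illustrative example):

import requests

# a non-ASCII query parameter gets percent-encoded in the URL actually requested
r = requests.get("http://www.baidu.com/s", params={"wd": "爬虫"})
print(r.url)             # percent-encoded URL, different from what was typed in
print(type(r.text))      # <class 'str'>   - body decoded using r.encoding
print(type(r.content))   # <class 'bytes'> - raw response body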

>>> import requests
>>> url="http://www.crazyant.net"
>>> r=requets.get(url)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'requets' is not defined
>>> r=requests.get(url)
>>> r.status_code
200
>>> r.headers
{'Server': 'nginx', 'Date': 'Fri, 11 Aug 2023 14:36:41 GMT', 'Content-Type': 'text/html; charset=UTF-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Vary': 'Accept-Encoding', 'X-UA-Compatible': 'IE=edge', 'Link': '<http://www.crazyant.net/wp-json/>; rel="https://api.w.org/"', 'Content-Encoding': 'gzip'}
>>> r.cookies
<RequestsCookieJar[]>


>>> url="http://www.baidu.com"
>>> r=requests.get(url)
>>> r.status_code
200
>>> r.encoding
'ISO-8859-1'
>>> r.headers
{'Cache-Control': 'private, no-cache, no-store, proxy-revalidate, no-transform', 'Connection': 'keep-alive', 'Content-Encoding': 'gzip', 'Content-Type': 'text/html', 'Date': 'Fri, 11 Aug 2023 14:51:59 GMT', 'Last-Modified': 'Mon, 23 Jan 2017 13:27:36 GMT', 'Pragma': 'no-cache', 'Server': 'bfe/1.0.8.18', 'Set-Cookie': 'BDORZ=27315; max-age=86400; domain=.baidu.com; path=/', 'Transfer-Encoding': 'chunked'}
>>> r.cookies
<RequestsCookieJar[Cookie(version=0, name='BDORZ', value='27315', port=None, port_specified=False, domain='.baidu.com', domain_specified=True, domain_initial_dot=True, path='/', path_specified=True, secure=False, expires=1691851919, discard=False, comment=None, comment_url=None, rest={}, rfc2109=False)]>

The Content-Type header comes back without a charset,

so requests falls back to its default of ISO-8859-1,

and the Chinese text in r.text is garbled.

Check the page source to find its real encoding,

set r.encoding = "utf-8",

then look at r.text again:

the Chinese text now comes back correctly.
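Putting that together, a minimal sketch:

import requests

r = requests.get("http://www.baidu.com")
print(r.encoding)        # 'ISO-8859-1', guessed because the header has no charset
r.encoding = "utf-8"     # override with the encoding declared in the page source
print(r.text[:200])      # Chinese characters now decode correctly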

URL Manager

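The screenshot here showed the URL manager code; below is a minimal sketch of the utils/url_manager.py module used in the projects further down, reconstructed from how it is called there (the method names match the later code, the rest is an assumption):

# utils/url_manager.py - minimal sketch; implementation details are assumptions
class UrLManager:
    """Keeps track of URLs waiting to be crawled and URLs already crawled."""

    def __init__(self):
        self.new_urls = set()   # URLs not yet crawled
        self.old_urls = set()   # URLs already handed out

    def add_new_url(self, url):
        if not url:
            return
        if url in self.new_urls or url in self.old_urls:
            return              # avoid duplicate crawling
        self.new_urls.add(url)

    def add_new_urls(self, urls):
        for url in urls:
            self.add_new_url(url)

    def get_url(self):
        url = self.new_urls.pop()
        self.old_urls.add(url)
        return url

    def has_new_url(self):
        return len(self.new_urls) > 0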

HTML Basics


BeautifulSoup (bs4)

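The snippets below parse a local test.html. The file itself is not shown, but reconstructing it from the output further down, it was probably something like this (written out here so the examples can be run; the exact markup is an assumption):

# Recreate a test.html matching the structure implied by the output below.
html = """<html>
<body>
<div class="default" id="content">
<p>passage</p>
<a href="baidu.com"> baidu</a><br/>
<a href="bilibili.com"> bilibili</a><br/>
<a href="fffjay.fun">fffjay</a><br/>
</div>
</body>
</html>"""

with open("./test.html", "w", encoding="utf-8") as f:
    f.write(html)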

from bs4 import BeautifulSoup

with open("./test.html", encoding="utf-8") as fin:
    html_doc = fin.read()

soup = BeautifulSoup(html_doc, "html.parser")

links = soup.find_all("a")
for link in links:
    print(link.name, link["href"], link.get_text())
    # tag name; attributes such as id, class, href; text content
a baidu.com  baidu
a bilibili.com bilibili
a fffjay.fun fffjay

The id attribute is unique within a page,

so you can first locate the enclosing element by id, which narrows the search scope and speeds up the lookup.

For example:

from bs4 import BeautifulSoup

with open("./test.html", encoding="utf-8") as fin:
    html_doc = fin.read()

soup = BeautifulSoup(html_doc, "html.parser")
div_node = soup.find("div", id="content")
print(div_node)
print("#" * 30)
links = div_node.find_all("a")
for link in links:
    print(link.name, link["href"], link.get_text())
<div class="default" id="content">
<p>passage</p>
<a href="baidu.com"> baidu</a><br/>
<a href="bilibili.com"> bilibili</a><br/>
<a href="fffjay.fun">fffjay</a><br/>
</div>
##############################
a baidu.com baidu
a bilibili.com bilibili
a fffjay.fun fffjay

Analyzing the Target

1. View the page source.

2. Inspect with the browser DevTools.

The Elements panel shows the markup as the user sees it, i.e. after JavaScript has run.

For this kind of dynamic page, one option is Selenium.

The Network panel works like a packet capture; filter by the Doc type.

Project 1: Crawling the tags from my own blog

url = "http://fffjay.fun/"
import requests
r = requests.get(url)
if r.status_code != 200:
    raise Exception()
html_doc = r.text
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, "html.parser")
h2_node = soup.find_all(class_="menus_item")

for nodes in h2_node:
    link = nodes.find("a")
    print(link["href"], link.get_text)
/ <bound method PageElement.get_text of <a class="site-page" href="/"><i class="fa-fw fas fa-home"></i><span> 首页</span></a>>
/archives/ <bound method PageElement.get_text of <a class="site-page" href="/archives/"><i class="fa-fw fas fa-archive"></i><span> 时间轴</span></a>>
/tags/ <bound method PageElement.get_text of <a class="site-page" href="/tags/"><i class="fa-fw fas fa-tags"></i><span> 标签</span></a>>
/categories/ <bound method PageElement.get_text of <a class="site-page" href="/categories/"><i class="fa-fw fas fa-folder-open"></i><span> 分类</span></a>>
/link/ <bound method PageElement.get_text of <a class="site-page" href="/link/"><i class="fa-fw fas fa-link"></i><span> 友链</span></a>>
/about/ <bound method PageElement.get_text of <a class="site-page" href="/about/"><i class="fa-fw fas fa-heart"></i><span> 关于</span></a>>
/ <bound method PageElement.get_text of <a class="site-page" href="/"><i class="fa-fw fas fa-home"></i><span> 首页</span></a>>
/archives/ <bound method PageElement.get_text of <a class="site-page" href="/archives/"><i class="fa-fw fas fa-archive"></i><span> 时间轴</span></a>>
/tags/ <bound method PageElement.get_text of <a class="site-page" href="/tags/"><i class="fa-fw fas fa-tags"></i><span> 标签</span></a>>
/categories/ <bound method PageElement.get_text of <a class="site-page" href="/categories/"><i class="fa-fw fas fa-folder-open"></i><span> 分类</span></a>>
/link/ <bound method PageElement.get_text of <a class="site-page" href="/link/"><i class="fa-fw fas fa-link"></i><span> 友链</span></a>>
/about/ <bound method PageElement.get_text of <a class="site-page" href="/about/"><i class="fa-fw fas fa-heart"></i><span> 关于</span></a>>

Not sure whether this counts as success... (the output shows bound-method reprs because get_text was referenced without being called, i.e. link.get_text instead of link.get_text()).

url = "http://fffjay.fun/"
import requests
r = requests.get(url)
if r.status_code != 200:
    raise Exception()
html_doc = r.text
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, "html.parser")
h2_node = soup.find_all(class_="menus_item")

for nodes in h2_node:
    link = nodes.find("a")
    print("http://fffjay.fun/" + link["href"])

After the change:

Success! (The base URL already ends with a slash and each href starts with one, hence the doubled // in the output; see the urljoin sketch after the output.)


http://fffjay.fun//
http://fffjay.fun//archives/
http://fffjay.fun//tags/
http://fffjay.fun//categories/
http://fffjay.fun//link/
http://fffjay.fun//about/
http://fffjay.fun//
http://fffjay.fun//archives/
http://fffjay.fun//tags/
http://fffjay.fun//categories/
http://fffjay.fun//link/
http://fffjay.fun//about/
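A small sketch of how urllib.parse.urljoin would avoid the doubled slash:

from urllib.parse import urljoin

base = "http://fffjay.fun/"
for href in ["/", "/archives/", "/tags/"]:
    print(urljoin(base, href))   # http://fffjay.fun/, .../archives/, .../tags/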

Project 2: Crawling the full article list of a blog site

Still targeting my own blog at first


In the regex, \d matches a single digit,

and \d+ matches one or more digits.
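A quick check of the pattern used below (the article numbers are made up for illustration):

import re

pattern = r'^http://www.crazyant.net/\d+.html$'
print(bool(re.match(pattern, "http://www.crazyant.net/2512.html")))  # True  - looks like an article page
print(bool(re.match(pattern, "http://www.crazyant.net/about")))      # False - not an article URL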

from utils import url_manager
import requests
from bs4 import BeautifulSoup
import re

url = "http://www.crazyant.net/"
urls = url_manager.UrLManager()
urls.add_new_url(url)
fout = open("craw_all_pages.txt", "w")
while urls.has_new_url():
    curr_url = urls.get_url()
    r = requests.get(curr_url, timeout=3)  # give up on a page that takes more than 3 seconds
    if r.status_code != 200:
        print("error")
        continue
    soup = BeautifulSoup(r.text, "html.parser")
    title = soup.title.string

    fout.write("%s\t%s\n" % (curr_url, title))
    fout.flush()
    print("hehe: %s,%s,%d" % (curr_url, title, len(urls.new_urls)))

    links = soup.find_all("a")
    for link in links:
        href = link["href"]
        pattern = r'^http://www.crazyant.net/\d+.html$'
        if re.match(pattern, href):
            urls.add_new_url(href)
fout.close()

My own blog needs a different approach, so back to crawling the course author's site:

from utils import url_manager
import requests
from bs4 import BeautifulSoup
import re

url = "http://www.crazyant.net/"
urls = url_manager.UrLManager()
urls.add_new_url(url)
fout = open("craw_all_pages.txt", "w")
while urls.has_new_url():
    curr_url = urls.get_url()
    r = requests.get(curr_url, timeout=3)  # give up on a page that takes more than 3 seconds
    if r.status_code != 200:
        print("error")
        continue
    soup = BeautifulSoup(r.text, "html.parser")
    title = soup.title.string

    fout.write("%s\t%s\n" % (curr_url, title))
    fout.flush()
    print("hehe: %s,%s,%d" % (curr_url, title, len(urls.new_urls)))

    links = soup.find_all("a")  # in effect this only matters once: the first page pushes all the needed article URLs into the manager
    for link in links:
        href = link["href"]
        pattern = r'^http://www.crazyant.net/\d+.html$'
        if re.match(pattern, href):
            urls.add_new_url(href)
fout.close()

Success.

Project 3: Douban Top Movies

1. Fetch the pages

2. Parse the data with BeautifulSoup

3. Write the data out to Excel with pandas

Abandoned: I had only just started learning, and it turns out the site has added anti-crawling measures.

Project 4: Historical weather data

https://tianqi.2345.com/wea_history/54511.htm

No matter which year you pick,

the URL stays the same, which means the data is loaded dynamically,

so we need to capture and inspect the network requests.

Right-click, choose Inspect, and watch the Network panel.

It turns out the site rejects requests without a browser User-Agent (UA-based anti-crawling).

url = "https://tianqi.2345.com/Pc/GetHistory"
payload = {
    "areaInfo[areaId]": 54511,
    "areaInfo[areaType]": 2,
    "date[year]": 2015,
    "date[month]": 6
}
headers = {
    "User-Agent": '''Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36 Edg/115.0.1901.203'''
}  # triple quotes so any double quotes inside the UA string cannot clash
import requests
resp = requests.get(url, headers=headers, params=payload)
print(resp.status_code)
print(resp.text)

First attempt works:

200
{"code":1,"msg":"","data":"<ul class=\"history-msg\">\n ....
url = "https://tianqi.2345.com/Pc/GetHistory"
payload = {
    "areaInfo[areaId]": 54511,
    "areaInfo[areaType]": 2,
    "date[year]": 2015,
    "date[month]": 6
}
headers = {
    "User-Agent": '''Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36 Edg/115.0.1901.203'''
}  # triple quotes so any double quotes inside the UA string cannot clash
import requests
import pandas as pd
resp = requests.get(url, headers=headers, params=payload)
print(resp.status_code)
data = resp.json()["data"]
# parse the returned HTML fragment into a DataFrame
df = pd.read_html(data)
print(df)
[               日期  最高温  最低温     天气             风力风向
0 2015-06-01 周一 34° 23° 多云~阴 无持续风向~北风微风~3-4级
1 2015-06-02 周二 33° 17° 晴 北风~无持续风向4-5级~微风
2 2015-06-03 周三 32° 22° 多云~阴 无持续风向微风
3 2015-06-04 周四 23° 17° 雷雨~阴 无持续风向微风
4 2015-06-05 周五 32° 20° 多云~晴 无持续风向微风
5 2015-06-06 周六 32° 18° 雷雨~阴 无持续风向微风
6 2015-06-07 周日 29° 17° 雷雨~晴 无持续风向微风
7 2015-06-08 周一 32° 20° 晴~多云 无持续风向微风
8 2015-06-09 周二 29° 22° 阵雨~雷雨 无持续风向微风
9 2015-06-10 周三 25° 19° 雷雨 无持续风向微风
10 2015-06-11 周四 30° 18° 多云 北风~无持续风向3-4级~微风
11 2015-06-12 周五 29° 19° 多云 北风~无持续风向4-5级~微风
12 2015-06-13 周六 28° 17° 阵雨~多云 北风~无持续风向3-4级~微风
13 2015-06-14 周日 32° 19° 晴 无持续风向微风
14 2015-06-15 周一 32° 22° 多云 无持续风向微风
15 2015-06-16 周二 32° 22° 雷雨 无持续风向微风
16 2015-06-17 周三 32° 20° 雷雨~晴 无持续风向微风
17 2015-06-18 周四 32° 19° 多云~雷雨 北风~无持续风向3-4级~微风
18 2015-06-19 周五 26° 17° 雷雨~多云 无持续风向微风
19 2015-06-20 周六 32° 19° 晴 无持续风向微风
20 2015-06-21 周日 32° 21° 多云 无持续风向微风
21 2015-06-22 周一 31° 22° 多云~阵雨 无持续风向微风
22 2015-06-23 周二 29° 22° 雷雨~阴 无持续风向微风
23 2015-06-24 周三 29° 22° 阵雨~阴 无持续风向微风
24 2015-06-25 周四 28° 21° 雷雨 无持续风向微风
25 2015-06-26 周五 28° 21° 雷雨~阴 无持续风向微风
26 2015-06-27 周六 31° 22° 阴~多云 无持续风向微风
27 2015-06-28 周日 30° 24° 多云~阴 无持续风向微风
28 2015-06-29 周一 28° 21° 阵雨 北风~无持续风向3-4级~微风
29 2015-06-30 周二 28° 18° 阴~晴 无持续风向微风]

Success.

Next, wrap this up as a function so we can crawl more months.

url = "https://tianqi.2345.com/Pc/GetHistory"

headers = {
    "User-Agent": '''Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36 Edg/115.0.1901.203'''
}  # triple quotes so any double quotes inside the UA string cannot clash
import requests
import pandas as pd

def pa(year, month):
    # takes the year and month to fetch
    payload = {
        "areaInfo[areaId]": 54511,
        "areaInfo[areaType]": 2,
        "date[year]": year,
        "date[month]": month
    }
    resp = requests.get(url, headers=headers, params=payload)
    print(resp.status_code)
    data = resp.json()["data"]
    # parse into a DataFrame
    df = pd.read_html(data)
    return df

df = pa(2013, 10)
print(df)

Damn it. Crawling a single month worked fine, and it still worked after wrapping it in a function, but as soon as I tried to combine the tables it broke, and it took ages to figure out why.

The cause: pd.read_html returns a list of DataFrames, so df_list was collecting lists, and pd.concat failed on them. The fix:

df = pd.read_html(data)[0]  # parse the HTML table and take the first DataFrame from the returned list
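A tiny illustration of why the [0] is needed:

import pandas as pd

# read_html always returns a list, even when the HTML contains a single table
tables = pd.read_html("<table><tr><th>a</th><th>b</th></tr><tr><td>1</td><td>2</td></tr></table>")
print(type(tables))      # <class 'list'>
print(type(tables[0]))   # <class 'pandas.core.frame.DataFrame'>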

url = "https://tianqi.2345.com/Pc/GetHistory"

headers = {
    "User-Agent": '''Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36 Edg/115.0.1901.203'''
}  # triple quotes so any double quotes inside the UA string cannot clash
import requests
import pandas as pd

def pa(year, month):
    # takes the year and month to fetch
    payload = {
        "areaInfo[areaId]": 54511,
        "areaInfo[areaType]": 2,
        "date[year]": year,
        "date[month]": month
    }
    resp = requests.get(url, headers=headers, params=payload)
    print(resp.status_code)
    data = resp.json()["data"]
    # read_html returns a list of DataFrames; take the first one
    df = pd.read_html(data)[0]
    return df

df_list = []
for year in range(2011, 2012):
    for month in range(1, 13):
        print("crawling", year, month)
        df = pa(year, month)
        df_list.append(df)

pd.concat(df_list).to_excel("beijing.xlsx", index=False)


My favorite part: crawling a novel

import requests
from bs4 import BeautifulSoup

def get_novel_topic():
    root_url = "https://m.bbtxt8.com/book/89536/"
    r = requests.get(root_url)
    r.encoding = "utf-8"
    soup = BeautifulSoup(r.text, "html.parser")
    for dd in soup.find_all("dd"):
        link = dd.find("a")
        if not link:
            continue
        print(link)

get_novel_topic()

The chapter titles were crawled successfully.

import requests
from bs4 import BeautifulSoup
import os

# get the list of chapter links and titles
def get_novel_topic():
    root_url = "http://www.msxsw.com/35_35948/"
    r = requests.get(root_url)
    r.encoding = "gbk"
    soup = BeautifulSoup(r.text, "html.parser")
    data = []
    for dd in soup.find_all("dd"):
        link = dd.find("a")
        data.append(("http://www.msxsw.com%s" % link["href"], link.get_text()))  # list of (url, title) pairs
    return data

# fetch a chapter's content, with basic error handling
def get_chapter(url):
    r = requests.get(url)
    r.encoding = "gbk"
    soup = BeautifulSoup(r.text, "html.parser")
    content_div = soup.find("div", id="content")
    if content_div:
        return content_div.get_text()
    else:
        return "Chapter content not found"

# set up the output subdirectory
output_dir = os.path.join(os.path.dirname(__file__), "夜的命名树")
if not os.path.exists(output_dir):
    os.makedirs(output_dir)

# crawl each chapter and save it to its own file
for chapter in get_novel_topic():
    url, title = chapter
    text = get_chapter(url).encode("utf-8")
    with open(os.path.join(output_dir, "%s.txt" % title), "wb") as font:
        font.write(text)

Successfully crawled the novel from the biquge-style site.

The tricky part was the encoding step: text = get_chapter(url).encode("utf-8")

get_chapter(url) returns a str, but the output file is opened in binary mode ("wb"),

so the text has to be encoded to UTF-8 bytes before it can be written.
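A minimal sketch of two equivalent ways to write the chapter text (the chapter string here is made up):

text = "第一章 示例内容"  # hypothetical chapter text

# what the script above does: encode to UTF-8 bytes and write in binary mode
with open("chapter_a.txt", "wb") as f:
    f.write(text.encode("utf-8"))

# equivalent: open in text mode with an explicit encoding and write the str directly
with open("chapter_b.txt", "w", encoding="utf-8") as f:
    f.write(text)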

Project 6: Crawling images