Scraping Baidu Hot Searches with Python

# Scrape the top 6 Baidu hot-search entries
import requests
import datetime
import win32api,win32con
from bs4 import BeautifulSoup
import re

headers = {'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/515.66 (KHTML, like Gecko) Chrome/80.0.3789.163 Safari/515.66"}
response = requests.get("https://www.baidu.com/", headers=headers)
# Decode the response body as UTF-8
response.encoding="utf8"
bs = BeautifulSoup(response.text,'lxml')

def formatGMTime(timestamp):
    # Parse the HTTP Date header (GMT) and shift it to Beijing time (UTC+8)
    GMT_FORMAT = '%a, %d %b %Y %H:%M:%S GMT'
    a = datetime.datetime.strptime(timestamp, GMT_FORMAT) + datetime.timedelta(hours=8)
    return a

resDate = response.headers.get('Date')
resDate = formatGMTime(resDate)
nameList = bs.find_all("li",attrs={"class":{"hotsearch-item odd","hotsearch-item even"}})
tests = []
for name in nameList:
    tests.append(name.getText())
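# Append the timestamp and the ranked entries to 2.txt
tests.sort()
with open('2.txt','a+',encoding="utf8") as fspath:
    print(resDate,file=fspath)
    for news in tests:
        # Each entry starts with its single-digit rank; put a colon after it
        news=news[0]+':'+news[1:]
        print(news,file=fspath)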
win32api.MessageBox(0,"Task complete","",win32con.MB_OK)

Running the code produces the following output:

2021-08-07 17:08:56
1:#中国包揽男子10米台跳水冠亚军#沸
2:#晓舟一叶梦圆东京#
3:#苏炳添和老婆爱情保鲜秘诀#
4:林书豪确诊感染新冠热
5:南京2岁男童做核酸检测被感染
6:中国队一出场水花都变得害羞
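
The timestamp on the first line of the output comes from the response's Date header, shifted by eight hours. As a possible alternative to the strptime-based helper, the sketch below uses the standard library's email.utils.parsedate_to_datetime and returns a timezone-aware value. The function name http_date_to_beijing and the sample header string are illustrative assumptions, not part of the original script.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime

# Fixed UTC+8 offset for Beijing time (no daylight saving adjustment).
BEIJING = timezone(timedelta(hours=8), name="CST")


def http_date_to_beijing(http_date: str) -> datetime:
    """Parse an HTTP Date header value and return it as an aware UTC+8 datetime."""
    parsed = parsedate_to_datetime(http_date)
    if parsed.tzinfo is None:
        # Treat a zone-less value as UTC, which is what HTTP dates use.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(BEIJING)


if __name__ == "__main__":
    # Illustrative header value, chosen to match the first run shown above.
    sample = "Sat, 07 Aug 2021 09:08:56 GMT"
    print(http_date_to_beijing(sample).strftime("%Y-%m-%d %H:%M:%S"))
```

With requests, the input would be response.headers.get('Date'); a missing header returns None, so a real script would want to check for that before parsing.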

Modified code, which reads the hidden hotsearch_data textarea to get all 30 entries:

# Scrape all 30 Baidu hot-search entries
import requests
import datetime
import win32api,win32con
from bs4 import BeautifulSoup
import re

headers = {'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/515.66 (KHTML, like Gecko) Chrome/80.0.3789.163 Safari/515.66"}
response = requests.get("https://www.baidu.com/", headers=headers)
# Decode the response body as UTF-8
response.encoding="utf8"
bs = BeautifulSoup(response.text,'lxml')

def formatGMTime(timestamp):
    # Parse the HTTP Date header (GMT) and shift it to Beijing time (UTC+8)
    GMT_FORMAT = '%a, %d %b %Y %H:%M:%S GMT'
    a = datetime.datetime.strptime(timestamp, GMT_FORMAT) + datetime.timedelta(hours=8)
    return a

resDate = response.headers.get('Date')
resDate = formatGMTime(resDate)
nameList = bs.find_all("textarea", id="hotsearch_data",style="display:none;")
tests = []
for name in nameList:
    tests.append(name.getText())

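# Split the textarea content on ; , and " and keep only the pieces
# that contain Chinese characters (the titles)
res=re.split(';|,|"',tests[0])
pat=re.compile('[\u4e00-\u9fa5]+')
# Append the timestamp and the numbered entries to 3.txt
with open('3.txt','a+',encoding="utf8") as fspath:
    print(resDate,file=fspath)
    j=1
    for i in res:
        if pat.search(i):
            print(str(j)+':'+i,file=fspath)
            j+=1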
win32api.MessageBox(0,"Task complete","",win32con.MB_OK)

After the change, the output is:

2021-08-07 17:09:28
1:#中国包揽男子10米台跳水冠亚军#
2:#晓舟一叶梦圆东京#
3:#苏炳添和老婆爱情保鲜秘诀#
4:林书豪确诊感染新冠
5:南京2岁男童做核酸检测被感染
6:中国队一出场水花都变得害羞
7:#谷红获得拳击女子沉量级银牌#
8:重庆大学通报女副教授坠亡
9:因极特殊原因需进返京怎么办?
10:美国男篮险胜法国夺奥运4连冠
11:货拉拉跳车事件司机妻子发声
12:#曝梅西即将加盟巴黎圣日耳曼#
13:#国乒男团卫冕夺第35金#
14:湖北累计报告本土确诊31例
15:奥恰洛夫把铜牌借给替补队友合影
16:决赛前刘国梁教队员展示国旗
17:全红婵爸爸拒收20万慰问金
18:#龙蟒组合时隔五年的十指相扣#
19:#立秋到了#
20:曹缘第一跳满分
21:北京:学科类培训机构暑期不再开课
22:澳方称不接受重启对话条件中方回应
23:新冠疫苗对德尔塔变异株还有用吗
24:南京第1例重型患者出院
25:女子造谣央美教授出轨被抓
26:水谷隼宣布打算退役
27:巴赫称赞东京奥运会很成功
28:中国队金牌总数已追平伦敦奥运
29:湖南新增本土确诊9例 张家界6例
30:#许昕说把金牌送给快出生的小棉袄#
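
The regex split works for the run above, but it would drop a title that contains no Chinese characters and would cut a title that contains a comma into two entries. If the textarea content turns out to be JSON, parsing it directly avoids both problems. The sketch below rests on that assumption: the key names hotsearch and pure_title, the function name extract_titles, and the sample payload are all hypothetical and should be checked against the actual page content.

```python
import json


def extract_titles(raw, title_key="pure_title"):
    """Return titles from a JSON payload assumed to look like
    {"hotsearch": [{"pure_title": "..."}, ...]}; entries without a title are skipped."""
    payload = json.loads(raw)
    entries = payload.get("hotsearch", [])
    return [entry[title_key] for entry in entries if entry.get(title_key)]


if __name__ == "__main__":
    # Hypothetical payload built from two titles in the output above.
    sample = json.dumps(
        {"hotsearch": [
            {"pure_title": "#立秋到了#"},
            {"pure_title": "曹缘第一跳满分"},
        ]},
        ensure_ascii=False,
    )
    for rank, title in enumerate(extract_titles(sample), start=1):
        print(f"{rank}:{title}")
```

In the script, the input would be tests[0], the text of the hotsearch_data textarea; json.loads raises an exception if that text is not valid JSON, which is a quick way to find out whether the assumption holds.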