Scraping uniform buyer-show photos with Python
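Someone posted this site on 水漫金山 about a month ago, and I wrote a crawler for it the same day. Today I had some free time and ran it again to see whether anything had been updated, but it came back empty: the site's content had changed.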
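So I just rewrote it as a whole-site crawler with scrapy, but I am still not going to post that version, so that nobody crawls the site to death.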
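What follows is copied out of it and cut down to a single-category crawler. If you want to scrape the rest, change it yourself!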
[Python]
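# Optional proxy support (ip_proxy is the author's own module, not included here):
# import random
# from ip_proxy import ips
# ip_add = random.choice(ips())
import os
import re
from urllib.parse import urljoin

import requests
from lxml import etree

BASE_URL = 'https://www.ikmjx.com'
SAVE_DIR = './zhifu'
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'
}

# Create the output directory if it does not exist yet.
os.makedirs(SAVE_DIR, exist_ok=True)

# Crawl list pages 1-3 of a single category (id=3).
for i in range(1, 4):
    url = BASE_URL + '/index.php?g=portal&m=list&a=index&id=3&p=' + str(i)
    r = requests.get(url=url, headers=headers, timeout=10).text
    tree = etree.HTML(r)
    # As in the original, drop the first and last <div> of the list container.
    div_list = tree.xpath('/html/body/main/div/div[2]/div')[1:-1]
    for div in div_list:
        hrefs = div.xpath('./div[2]/a/@href')
        titles = div.xpath('./div[2]/a/@title')
        if not hrefs or not titles:
            continue  # skip entries without a link or a title
        src = urljoin(BASE_URL, hrefs[0])
        # Strip characters that are invalid in file names (the original removed only '?').
        title = re.sub(r'[\\/:*?"<>|]', '', titles[0]).strip()
        req = requests.get(url=src, headers=headers, timeout=10).text
        tree1 = etree.HTML(req)
        a = 0  # image counter for the current post
        for p in tree1.xpath('/html/body/main/div/div/div/div[3]/p[2]'):
            for img in p.xpath('./img/@src'):
                a += 1
                # urljoin leaves absolute image URLs unchanged and resolves relative ones.
                img_url = urljoin(src, img)
                img_data = requests.get(url=img_url, headers=headers, timeout=10).content
                img_path = os.path.join(SAVE_DIR, title + '_' + str(a) + '.jpg')
                with open(img_path, 'wb') as fp:
                    fp.write(img_data)
                # print(img_path, 'download complete!')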
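If you do adapt the script to other categories, a rough sketch like the one below might be a starting point. It builds the list-page URL from a category id and page number, and pauses between pages so the site is not hammered. The helper names `list_page_url` and `iter_list_pages` and the 2-second default delay are my own assumptions for illustration. Only `id=3` appears in the post, so any other category id is hypothetical and needs to be checked against the site.

```python
import time
from urllib.parse import urlencode

INDEX_URL = 'https://www.ikmjx.com/index.php'  # same site as the script above


def list_page_url(category_id, page):
    """Build the list-page URL for one category and page number."""
    # The parameter names g, m, a, id and p are copied from the URL in the script above.
    query = urlencode({'g': 'portal', 'm': 'list', 'a': 'index',
                       'id': category_id, 'p': page})
    return INDEX_URL + '?' + query


def iter_list_pages(category_ids, pages, delay=2.0):
    """Yield list-page URLs, sleeping `delay` seconds before every URL after the first."""
    first = True
    for category_id in category_ids:
        for page in pages:
            if not first:
                time.sleep(delay)  # throttle so the site is not flooded
            first = False
            yield list_page_url(category_id, page)
```

To use it, replace the outer `for i in range(1, 4)` loop with something like `for url in iter_list_pages([3], range(1, 4)):` and keep the rest of the loop body as it is.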