This site was shared about a month ago by a forum member over in 水漫金山, and I wrote a crawler for it that same day. Today I went back to crawl it again to check for updates and got nothing back: the site's content had changed.
So I just rewrote a full-site crawler for it with scrapy, but I'm not posting that one, to keep everyone from crawling the site to death.
What I'm posting is a copy cut down to a single-category crawler. If you want to crawl the rest, modify it yourself!!!
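As a sketch of what "modify it yourself" means: the category is picked by the `id` query parameter in the listing URL (the script hard-codes `id=3`), and the page by `p`. A small helper like the hypothetical `listing_urls` below (the name is mine, not from the script) builds the listing URLs for any category, assuming the other query parameters stay the same:

```python
# Hypothetical helper, not part of the original script: build the listing
# URLs for a given category id and page count. The fixed query parameters
# (g=portal&m=list&a=index) mirror the ones hard-coded in the crawler.
def listing_urls(category_id, pages):
    base = 'https://www.ikmjx.com/index.php?g=portal&m=list&a=index'
    return [base + '&id=' + str(category_id) + '&p=' + str(p)
            for p in range(1, pages + 1)]
```

Swap the hard-coded `url = ...` line in the loop for one of these and the rest of the crawler should work unchanged, assuming the other categories share the same page layout.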

[Python]
# from ip_proxy import ips
import requests, os, re, random
from lxml import etree
# ip_add = random.choice(ips())

# create the download folder if it does not exist yet
if not os.path.exists('./zhifu'):
    os.mkdir('./zhifu')

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'
}

# pages 1-3 of the listing for category id=3
for i in range(1, 4):
    url = 'https://www.ikmjx.com/index.php?g=portal&m=list&a=index&id=3&p=' + str(i)
    r = requests.get(url=url, headers=headers).text
    tree = etree.HTML(r)
    # item cards on the listing page; drop the first and last divs
    div_list = tree.xpath('/html/body/main/div/div[2]/div')[1:-1]
    for li in div_list:
        a = 0  # image counter for this item
        src = 'https://www.ikmjx.com' + li.xpath('./div[2]/a/@href')[0]
        titles = li.xpath('./div[2]/a/@title')[0]
        title = titles.replace('?', '')  # '?' is not allowed in Windows file names
        req = requests.get(url=src, headers=headers).text
        tree1 = etree.HTML(req)
        div1_list = tree1.xpath('/html/body/main/div/div/div/div[3]/p[2]')
        for p in div1_list:
            src_path = p.xpath('./img/@src')
            # print(src_path)
            for img in src_path:
                a = a + 1
                img_data = requests.get(url=img, headers=headers).content
                img_path = './zhifu/' + title + '_' + str(a) + '.jpg'
                with open(img_path, 'wb') as fp:
                    fp.write(img_data)
                    # print(img_path, 'download finished!!!')