[python每日一练]--0013:爬百度贴吧图片

题目链接:https://github.com/Show-Me-the-Code/show-me-the-code
代码github链接:https://github.com/wjsaya/python_spider_learn/tree/master/python_daily
个人博客地址:https://wjsaya.github.io

第 0013 题: 用 Python 写一个爬图片的程序,爬 这个链接 里的日本妹子图片 :-)


思路:

  1. 加载帖子url
  2. 解析返回的网页结构,获取所有图片的id
  3. 根据图片id下载对应的预览图和大图
    1. 下载的预览图存放路径为./pic/XXX.jpg
    2. 大图路径为./picBig/XXX.jpg

代码:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# @Author: wjsaya(http://www.wjsaya.top)
# @Date: 2018-08-15 23:55:53
# @Last Modified by: wjsaya(http://www.wjsaya.top)
# @Last Modified time: 2018-08-16 02:10:36
import os
import requests
from lxml import etree
head = {
"Host" : "tieba.baidu.com",
"User-Agent" : "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:60.0) Gecko/20100101 Firefox/60.0",
"Accept" : "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Language" : "zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2",
"Accept-Encoding" : "gzip, deflate"
}
def main():
'''主函數 \n
下载的帖子预览图在pic目录 \n
高清图在picBig目录 \n
'''
createDir("pic")
createDir("picBig")
rsp = requests.get("http://tieba.baidu.com/p/2166231880?traceid=", headers=head).content.decode('utf-8')
a = etree.HTML(rsp)
rsp = a.xpath("//img[@class='BDE_Image']/@src")
for url in rsp:
get_pic(url)
get_big_pic(url)
def get_pic(url):
'''下载预览图
'''
head['host'] = "imgsrc.baidu.com"
fileName = "./pic/" + url.split("=")[1].split("/")[1]
rsp = requests.get(url, headers=head).content
with open (fileName, 'wb') as f:
f.write(rsp)
def get_big_pic(url):
'''下载高清图
'''
head['host'] = "imgsrc.baidu.com"
id = url.split("=")[1].split("/")[1][:-4]
fileName = "./picBig/" + id + ".jpg"
fileUrl = "http://imgsrc.baidu.com/forum/pic/item/" + id + ".jpg"
rsp = requests.get(fileUrl, headers=head).content
with open (fileName, 'wb') as f:
f.write(rsp)
print(fileName + "下载完毕")
def createDir(dirName):
'''传入目录 \n
目录不存在,创建之 \n
存在,删除其下所有文件 \n
'''
if dirName not in os.listdir():
os.mkdir(dirName)
else:
for i in os.listdir(dirName):
fileName = ".\\" + dirName + "\\" + i
os.remove(fileName)
if __name__ == '__main__':
main()

效果图:

0013_run
0013_files

总结:

  爬虫自己写过一些了,这个题目的需求还是比较简单的。只是简单的获取一下网页上的图片而已,不需要解析网页加载过程。前段时间把贴吧的自动签到功能实现了,后续还实现了模拟手机app签到以获得双倍经验。有空了把那份代码整理整理发出来。。。

0%