[python每日一练]--0013:爬百度贴吧图片

题目链接：https://github.com/Show-Me-the-Code/show-me-the-code
代码github链接：https://github.com/wjsaya/python_spider_learn/tree/master/python_daily
个人博客地址：https://wjsaya.github.io

第 0013 题： 用 Python 写一个爬图片的程序，爬 这个链接 里的日本妹子图片 :-)

思路：

加载帖子url
解析返回的网页结构，获取所有图片的id
根据图片id下载对应的预览图和大图
1. 下载的预览图存放路径为./pic/XXX.jpg
2. 大图路径为./picBig/XXX.jpg

代码：

#!/usr/bin/env python3
# -*- coding: utf-8 -*- 
# @Author:	wjsaya(http://www.wjsaya.top) 
# @Date:	2018-08-15 23:55:53 
# @Last Modified by:	wjsaya(http://www.wjsaya.top) 
# @Last Modified time:	2018-08-16 02:10:36 
import os
import requests
from lxml import etree
head = {
    "Host" : "tieba.baidu.com",
    "User-Agent" : "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:60.0) Gecko/20100101 Firefox/60.0",
    "Accept" : "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language" : "zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2",
    "Accept-Encoding" : "gzip, deflate"
}
def main():
    '''主函數 \n
    下载的帖子预览图在pic目录 \n
    高清图在picBig目录 \n
    '''
    createDir("pic")
    createDir("picBig")
    rsp = requests.get("http://tieba.baidu.com/p/2166231880?traceid=", headers=head).content.decode('utf-8')
    a = etree.HTML(rsp)
    rsp = a.xpath("//img[@class='BDE_Image']/@src")
    for url in rsp:
        get_pic(url)
        get_big_pic(url)
def get_pic(url):
    '''下载预览图
    '''
    head['host'] = "imgsrc.baidu.com"
    fileName = "./pic/" + url.split("=")[1].split("/")[1]
    rsp = requests.get(url, headers=head).content
    with open (fileName, 'wb') as f:
        f.write(rsp)
def get_big_pic(url):
    '''下载高清图
    '''
    head['host'] = "imgsrc.baidu.com"
    id = url.split("=")[1].split("/")[1][:-4]
    fileName = "./picBig/" + id + ".jpg"
    fileUrl = "http://imgsrc.baidu.com/forum/pic/item/" + id + ".jpg"
    rsp = requests.get(fileUrl, headers=head).content
    with open (fileName, 'wb') as f:
        f.write(rsp)
    print(fileName + "下载完毕")
    
def createDir(dirName):
    '''传入目录 \n
    目录不存在，创建之 \n
    存在，删除其下所有文件 \n
    '''
    if dirName not in os.listdir():
        os.mkdir(dirName)
    else:
        for i in os.listdir(dirName):
            fileName =  ".\\" + dirName + "\\" + i 
            os.remove(fileName)
if __name__ == '__main__':
    main()

效果图：

总结:

爬虫自己写过一些了，这个题目的需求还是比较简单的。只是简单的获取一下网页上的图片而已，不需要解析网页加载过程。前段时间把贴吧的自动签到功能实现了，后续还实现了模拟手机app签到以获得双倍经验。有空了把那份代码整理整理发出来。。。