Python Crawler: Simple Image Scraping

Source: Internet. Editor: 程序博客网. Date: 2024/05/02 05:07

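Here is a rough outline of the approach and the steps.

I. Page Analysis

1. Searching for a keyword returns images in a waterfall (masonry) layout. Those thumbnails are not what we want to scrape; we want the 960*720 version shown after clicking through, so we also need the page for each individual image.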

Looking at an image address, for example https://cdn.pixabay.com/photo/2017/06/04/12/31/sea-2370936_960_720.jpg, shows that only the segment 2017/06/04/12/31/sea-2370936 needs to be matched.

Locating a single image
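As a rough sketch of this idea, the snippet below keeps only the stable part of a CDN address and appends the 960*720 suffix. The helper name to_full_size and the pattern CDN_PATTERN are made up for illustration and are not part of the original script; the __340 thumbnail suffix is taken from the regular expressions used later in this post.

```python
import re

# Hypothetical pattern: capture everything between "/photo/" and the size suffix.
# It accepts both thumbnail ("__340.jpg") and full-size ("_960_720.jpg") addresses.
CDN_PATTERN = re.compile(r'https://cdn\.pixabay\.com/photo/(.+?)_+\d+(?:_\d+)?\.jpg')


def to_full_size(url):
    """Return the 960*720 address for a CDN image URL, or None if it does not match."""
    match = CDN_PATTERN.match(url)
    if match is None:
        return None
    return 'https://cdn.pixabay.com/photo/' + match.group(1) + '_960_720.jpg'


thumbnail = 'https://cdn.pixabay.com/photo/2017/06/04/12/31/sea-2370936__340.jpg'
print(to_full_size(thumbnail))
```

Only the date path and the name-plus-id part survive the match, so the same helper works no matter which size suffix the input carries.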

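2. In the source of one results page, use regular expressions to match the details of each image. The first dozen or so image tags in the source differ from the later ones, so the two cases have to be matched separately.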

re.compile(r'<img srcset="https://cdn.pixabay.com/photo(.*?)-(.*?)__340.*?',re.S)

re.compile(r'data-lazy-srcset="https://cdn.pixabay.com/photo(.*?)-(.*?)__340.*?',re.S)
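To see the two patterns at work without requesting the site, here is a small sketch run against made-up markup. SAMPLE_HTML and find_photos are hypothetical, and the real page source may look different; the two patterns themselves are copied from above.

```python
import re

# The two patterns from the post, unchanged.
EAGER = re.compile(r'<img srcset="https://cdn.pixabay.com/photo(.*?)-(.*?)__340.*?', re.S)
LAZY = re.compile(r'data-lazy-srcset="https://cdn.pixabay.com/photo(.*?)-(.*?)__340.*?', re.S)

# Invented markup imitating the two kinds of <img> tag (an assumption, not real page source).
SAMPLE_HTML = '''
<img srcset="https://cdn.pixabay.com/photo/2017/06/04/12/31/sea-2370936__340.jpg 1x" alt="sea">
<img src="/static/img/blank.gif" data-lazy-srcset="https://cdn.pixabay.com/photo/2016/01/08/11/57/butterfly-1127666__340.jpg 1x" alt="butterfly">
'''


def find_photos(html):
    """Return (path_prefix, photo_id) tuples found by either pattern."""
    return EAGER.findall(html) + LAZY.findall(html)


for prefix, photo_id in find_photos(SAMPLE_HTML):
    # The first group already begins with '/', so it is appended directly after "photo".
    print('https://cdn.pixabay.com/photo' + prefix + '-' + photo_id + '_960_720.jpg')
```

Note that the first captured group starts with a slash, which matters when the full address is rebuilt.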

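II. What the Code Does

1. Automatically search for images from a keyword entered by the user

2. Display the total number of images and the total number of pages for that keyword

3. Let the user choose the starting page and the number of pages to download

4. After downloading, report how many images succeeded and how many failed (I used global variables for this)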
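The script in this post keeps its counts in global variables. As one possible alternative, the sketch below tracks them in a small object and computes the page range from the two user inputs. DownloadStats and page_range are names invented here and do not appear in the original code.

```python
from dataclasses import dataclass


@dataclass
class DownloadStats:
    """Hypothetical counter object standing in for the two global variables."""
    succeeded: int = 0
    failed: int = 0

    def record(self, ok):
        # Count one finished download as either a success or a failure.
        if ok:
            self.succeeded += 1
        else:
            self.failed += 1


def page_range(start_page, page_count):
    """Pages to visit: page_count pages beginning at start_page."""
    return range(start_page, start_page + page_count)


stats = DownloadStats()
for outcome in (True, True, False):
    stats.record(outcome)
print(stats.succeeded, stats.failed, list(page_range(3, 2)))
```

Passing such an object into the download function avoids the global statements, at the cost of one extra parameter.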

III. Saving Files

This is ordinary file handling.

Images are binary data, so the file has to be opened in 'wb' mode for writing.
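A minimal sketch of the binary write, using stand-in bytes and a temporary folder instead of a real download. The helper save_bytes and the file name demo.jpg are assumptions made for this example.

```python
import os
import tempfile


def save_bytes(data, path):
    """Write raw bytes to path in 'wb' mode; skip the write if the file exists."""
    if os.path.exists(path):
        return False
    with open(path, 'wb') as f:
        f.write(data)
    return True


# Stand-in payload: a few bytes, not a real JPEG.
fake_image = b'\xff\xd8\xff\xe0 placeholder bytes'

with tempfile.TemporaryDirectory() as folder:
    target = os.path.join(folder, 'demo.jpg')
    first = save_bytes(fake_image, target)    # file is created
    second = save_bytes(fake_image, target)   # file already exists, nothing written
    with open(target, 'rb') as f:
        roundtrip = f.read()

print(first, second, roundtrip == fake_image)
```

With requests, the bytes to pass in would be the response's content attribute rather than its text.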

My explanation may not be very clear.

I recommend reading the original author's article: https://zhuanlan.zhihu.com/p/26354353

The original author's Zhihu column: https://zhuanlan.zhihu.com/Waking-up

import os
import re

import requests

# Module-level counters for successful and failed downloads.
T_download_num = 0
F_download_num = 0

# Folder where the images are saved.
SAVE_DIR = 'E:/python/Photo/'


def getSource(url):
    """Fetch a page and return the Response, or None if the request fails."""
    try:
        r = requests.get(url, timeout=10)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r
    except requests.RequestException as err:
        print('Request failed:', err)
        return None


def getPhotoSource(url):
    """Fetch a single image and return the Response, or None on failure."""
    try:
        r = requests.get(url, timeout=10)
        r.raise_for_status()
        return r
    except requests.RequestException:
        return None


def getPage_data(url):
    """Print the image and page totals for the search."""
    result = getSource(url)
    if result is None:
        return
    # The pattern must follow the order in which the elements appear in the page source.
    pattern = re.compile(r'<input name="pagi" type="text" value="1" style="width:30px">.*?(\d+).*?<h1 class="hide-xs" style="font-size:13px;color:#bbb;margin:0 19px">.*?(\d+).*?', re.S)
    items = pattern.findall(result.text)
    if not items:
        print('Could not read the image and page totals.')
        return
    print('\n\n\nYour search matched %d images across %d pages' % (int(items[0][0]), int(items[0][1])))


def getOnePagePhoto(url):
    """Download every image found on one page of search results."""
    result = getSource(url)
    if result is None:
        return
    # The first dozen or so images use srcset; the later ones are lazy-loaded.
    pattern1 = re.compile(r'<img srcset="https://cdn.pixabay.com/photo(.*?)-(.*?)__340.*?', re.S)
    pattern2 = re.compile(r'data-lazy-srcset="https://cdn.pixabay.com/photo(.*?)-(.*?)__340.*?', re.S)
    for item in pattern1.findall(result.text) + pattern2.findall(result.text):
        try:
            # item[0] already starts with '/', so no extra slash is added after "photo".
            Photo_url = 'https://cdn.pixabay.com/photo' + item[0] + '-' + item[1] + '_960_720.jpg'
            DownLoadOnePhoto(Photo_url, item)
        except Exception as err:
            # An exception raised inside DownLoadOnePhoto ends that call early and
            # lands here, so the loop simply moves on to the next image.
            print('Skipping image:', err)
            continue


def DownLoadOnePhoto(Photo_url, item):
    """Download one image and update the success and failure counters."""
    global T_download_num, F_download_num
    fpath = os.path.join(SAVE_DIR, item[1] + '.jpg')
    if os.path.exists(fpath):
        print('Image already exists')
        return
    print('Downloading image......')
    result = getPhotoSource(Photo_url)
    if result is None:
        F_download_num += 1
        print('Image download failed!')
        return
    try:
        # Images are binary data, so the file is opened in 'wb' mode.
        with open(fpath, 'wb') as f:
            f.write(result.content)
    except OSError:
        F_download_num += 1
        print('Image download failed!')
        return
    # Count the download as successful only after the file has been written.
    T_download_num += 1
    print('Download succeeded!')
    print(F_download_num, T_download_num)


def main():
    key = input('Enter a search keyword (in English): ')
    url = 'https://pixabay.com/zh/photos/?min_height=&image_type=&cat=&q=' + key + '&min_width=&pagi='
    # Show the totals first so the user can pick a sensible page range.
    getPage_data(url)
    num = int(input('Enter the number of pages to download: '))
    start_page = int(input('Enter the page to start from: '))
    os.makedirs(SAVE_DIR, exist_ok=True)
    for i in range(start_page, start_page + num):
        getOnePagePhoto(url + str(i))
    print('\n\n\nSuccessfully downloaded %d images!' % T_download_num)
    print('%d images could not be downloaded' % F_download_num)


if __name__ == '__main__':
    main()

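Summary:

While writing this I ran into a problem: after f.write raised an exception, the code following it did not run.

After some digging I found the cause: the exception propagated up into getOnePagePhoto(url), whose except block executed continue, so everything after the failing line was skipped.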
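The behavior can be reproduced in a few lines. This is a stripped-down illustration rather than the script itself: write_or_raise and process are hypothetical names, and a plain string stands in for a failed download, since a str has no content attribute.

```python
def write_or_raise(payload, log):
    """Mimic the download step: fails partway through when payload is a str."""
    log.append('before write')
    payload.content  # a str has no .content, so this raises AttributeError
    log.append('after write')  # never reached once the line above raises


def process(items):
    """Mimic the caller's loop: catch the error and continue with the next item."""
    log = []
    for item in items:
        try:
            write_or_raise(item, log)
        except AttributeError:
            log.append('skipped')
            continue
    return log


print(process(['Photo wrong']))
```

Once the exception leaves the inner function, control jumps straight to the caller's except block, which is why the statements after the failing line never run.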

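I had not written a crawler for over a month (blame those dreadful final exams). Picking it up again recently, I found I was a bit rusty, so I need to keep writing and practicing.

I have also been reading projects written by other people lately; many of them are well written and fun. I want to gradually learn to write my own. This crawler is itself modeled on someone else's work, and it still felt good to get it working.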
