Scraping phone images from JD.com (京东商城)

Source: Internet | Editor: 程序博客网 | Time: 2024/04/29 10:18

The code below follows the example in Wei Wei's (韦玮) book:

import re
import urllib.request

def craw(url, page):
    html = urllib.request.urlopen(url).read()
    html = str(html)
    pat1 = '<div id="plist".+? <div class="page clearfix">'
    result1 = re.compile(pat1).findall(html)
    result1 = result1[0]
    pat2 = '<img width="220" height="220" data-img="1" src="//(.+?\.jpg)">'
    imagelist = re.compile(pat2).findall(html)
    x = 1
    for imageurl in imagelist:
        imagename = "D://spider/img1/" + str(page) + str(x)  + ".jpg"
        imageurl = "http://" + imageurl
        try:
            urllib.request.urlretrieve(imageurl, filename=imagename)
        except urllib.error.URLError as e:
            if hasattr(e, "code"):
                print(e.code)
                x+=1
            if hasattr(e, "reason"):
                print(e.reason)
                x+=1
        x+=1

for i in range(1, 79):
    url="https://list.jd.com/list.html?cat=9987653,655&page=" + str(i)
    craw(url, i)

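How the code works:
1. Read the source of the target page with urllib.request.urlopen(url).read()
2. Apply pat1 as a first filter
3. Apply pat2 as a second filter and store the image URLs in a list
4. Save each image locally with urllib.request.urlretrieve(imageurl, filename=imagename)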
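The two filtering steps can be tried offline with a small sketch. The HTML string below is made up for illustration (the real JD markup may differ), and the patterns are written as raw strings. Note that the code quoted above runs pat2 against html rather than result1, so there the first filter has no effect on the result; this sketch passes the filtered region to the second pattern to show what the two-step approach is presumably meant to do.

```python
import re

# Made-up miniature of a listing page; real JD markup may differ.
sample_html = (
    '<img width="220" height="220" data-img="1" src="//img.example.com/banner.jpg">'
    '<div id="plist"><ul>'
    '<img width="220" height="220" data-img="1" src="//img.example.com/phone1.jpg">'
    '<img width="220" height="220" data-img="1" src="//img.example.com/phone2.jpg">'
    '</ul></div> <div class="page clearfix">'
)

pat1 = r'<div id="plist".+? <div class="page clearfix">'
pat2 = r'<img width="220" height="220" data-img="1" src="//(.+?\.jpg)">'

# No capture group in pat1: findall returns each whole match as a string.
region = re.compile(pat1).findall(sample_html)[0]

# One capture group in pat2: findall returns only the text inside the ().
one_pass = re.compile(pat2).findall(sample_html)  # searches the whole page
two_pass = re.compile(pat2).findall(region)       # searches only the product list

print(one_pass)
print(two_pass)
```

In this sample the single pass also picks up banner.jpg, which sits outside the product list, while the two-step version returns only the two phone images. So the first filter is about narrowing the search area, not about speed.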

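A few questions:
1. Is filtering twice more efficient than filtering once directly? If not, is the point to avoid picking up other images on the page?
2. The .+? in pat1 is not wrapped in (), while the one in pat2 is. Why? (Solved: with (), findall returns only the content inside the (); without (), it returns the whole match, including the text outside the ().)
3. In the URLError handling:
       if hasattr(e, "code"):
           print(e.code)
           x+=1
       if hasattr(e, "reason"):
           print(e.reason)
           x+=1
   when the error is an HTTPError, both checks are True. Does that effectively turn into x+=2, so that some images get skipped?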
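Question 3 can be checked without any network access. The sketch below builds the two exception types by hand and replays the counter logic from the quoted code. The helper names count_increments, numbered_names and fake_fetch are made up for this illustration, and the page_x.jpg naming in the second half is my own assumption, not the book's.

```python
import urllib.error
from email.message import Message

def count_increments(err):
    """Replay the except branch plus the trailing x+=1 from the quoted code."""
    x = 0
    if hasattr(err, "code"):
        x += 1
    if hasattr(err, "reason"):
        x += 1
    x += 1  # the unconditional x+=1 at the end of the loop body
    return x

http_err = urllib.error.HTTPError(
    "http://img.example.com/a.jpg", 404, "Not Found", Message(), None
)
url_err = urllib.error.URLError("connection refused")

print(count_increments(http_err))  # 3: HTTPError has both code and reason
print(count_increments(url_err))   # 2: a plain URLError only has reason

def numbered_names(image_urls, page, fetch):
    """Number files by position with enumerate so a failure cannot shift x.

    fetch is a hypothetical stand-in for urllib.request.urlretrieve.
    """
    saved = []
    for x, imageurl in enumerate(image_urls, start=1):
        name = f"{page}_{x}.jpg"  # assumed naming scheme
        try:
            fetch(imageurl, name)
            saved.append(name)
        except urllib.error.URLError as e:
            print(getattr(e, "code", None), e.reason)
    return saved

def fake_fetch(url, filename):
    # Pretend that one particular URL returns 404.
    if url.endswith("bad.jpg"):
        raise urllib.error.HTTPError(url, 404, "Not Found", Message(), None)

saved = numbered_names(
    ["http://img.example.com/a.jpg",
     "http://img.example.com/bad.jpg",
     "http://img.example.com/c.jpg"],
    1,
    fake_fetch,
)
print(saved)  # ['1_1.jpg', '1_3.jpg']
```

So an HTTPError does advance x by 3 in total rather than 1. Since x is only used to build the file name and the loop still visits every URL in imagelist, no image is skipped as far as I can tell; the file numbering just ends up with gaps.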

I'm a beginner programmer, so if any experts happen to pass by, I'd really appreciate your advice.
