<4>. Python crawler: scraping product information from a shopping site (images, prices, names)
Source: Internet  Published: 人民网软件下载  Editor: 程序博客网  Time: 2024/04/19 22:14
This post draws on: Python crawler beginner tutorial http://blog.csdn.net/wxg694175346/article/category/1418998
Crawling web page images with a Python crawler http://www.cnblogs.com/abelsu/p/4540711.html
1. Project analysis
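To bulk-load product data into my experimental online store, I need to fetch a large number of product names, prices, and images from the web automatically, save them locally, and then upload them to my own web application for later experiments.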
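After reading the reference posts above you can basically get started. One thing to watch out for: many examples online are written for Python 2.x, while today's environments are usually Python 3.x, so some code needs adjusting and some of the imported packages are different.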
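One Python 3 difference that matters for this project: the response body is `bytes`, so it has to be decoded before a `str` regular expression can be applied to it. The sketch below illustrates this without touching the network; the sample page and the helper names `extract_title` and `str_pattern_on_bytes_fails` are made up for illustration.

```python
import re

# Stand-in for the bytes that urllib.request.urlopen(url).read() returns in Python 3.
# (In Python 2 the equivalent call lived in the urllib2 module.)
raw = '<html><title>示例商品</title></html>'.encode('utf-8')


def extract_title(raw_bytes, encoding='utf-8'):
    """Decode the response body, then search it with a str pattern."""
    text = raw_bytes.decode(encoding)
    match = re.search(r'<title>(.*?)</title>', text)
    return match.group(1) if match else None


def str_pattern_on_bytes_fails(raw_bytes):
    """Return True if applying a str pattern to undecoded bytes raises TypeError."""
    try:
        re.search(r'<title>(.*?)</title>', raw_bytes)
    except TypeError:
        return True
    return False


print(extract_title(raw))
print(str_pattern_on_bytes_fails(raw))
```

Decoding once, right after the download, keeps every later regular expression working on plain text.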
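The project uses neither scrapy nor bs4; it is plain, bare-bones Python. The hardest part is analyzing the page source and extracting URLs with regular expressions. Two kinds of unexpected result can occur here: the pattern matches nothing, or it matches too much. Either way the regular expressions need careful checking.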
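Both failure modes can be reproduced on a small fragment of the navigation bar. This is a minimal sketch; the variable names (`greedy`, `lazy`, `too_strict`) are illustrative only, and the HTML is a shortened copy of two category links from the site.

```python
import re

html = ('<a name="categCode" href="/pc/column/products-1-0.html#refresh" value="1">大家电</a>'
        '<a name="categCode" href="/pc/column/products-2-0.html#refresh" value="2">电脑数码</a>')

# Matches too much: the greedy ".+" runs from the first link through to the last ".html".
greedy = re.findall(r'href="(/pc/column/products.+\.html)#refresh"', html)

# The lazy ".+?" stops at the first ".html", giving one match per link.
lazy = re.findall(r'href="(/pc/column/products.+?\.html)#refresh"', html)

# Matches nothing: this pattern demands a three-digit sub-category id,
# but the top-level links end in "-0".
too_strict = re.findall(r'href="(/pc/column/products-\d{1,3}-\d\d\d\.html)', html)

print(len(greedy), greedy)
print(len(lazy), lazy)
print(len(too_strict))
```

Printing the number of matches next to the matches themselves, as the crawler code does with its `print` calls, is a cheap way to notice either problem early.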
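I chose to crawl 大聚惠, the group-deal section of 苏宁易购 (Suning), which is similar to 聚划算. The starting point is https://ju.suning.com/, and the URL of each category is easy to find in the page source: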
    <!--商品列表一级导航栏 [[ -->
    <div class="ju-nav-wrapper"><div class="ju-nav"><table><tr>
    <td class="active"><a name="columnId" id="0" value="1" href="/pc/new/home.html" name1="mps_index_qbsp_qb">全部商品</a></td>
    <td><a name="categCode" href="/pc/column/products-1-0.html#refresh" value="1" name1="mps_index_qbsp_spml1">大家电</a></td>
    <td><a name="categCode" href="/pc/column/products-2-0.html#refresh" value="2" name1="mps_index_qbsp_spml2">电脑数码</a></td>
    <td><a name="categCode" href="/pc/column/products-17-0.html#refresh" value="17" name1="mps_index_qbsp_spml3">生活家电</a></td>
    <td><a name="categCode" href="/pc/column/products-733-0.html#refresh" value="733" name1="mps_index_qbsp_spml4">手机</a></td>
    <td><a name="categCode" href="/pc/column/products-81-0.html#refresh" value="81" name1="mps_index_qbsp_spml5">车品</a></td>
    <td><a name="categCode" href="/pc/column/products-11-0.html#refresh" value="11" name1="mps_index_qbsp_spml6">居家日用</a></td>
    <td><a name="categCode" href="/pc/column/products-10-0.html#refresh" value="10" name1="mps_index_qbsp_spml7">食品</a></td>
    <td><a name="categCode" href="/pc/column/products-8-0.html#refresh" value="8" name1="mps_index_qbsp_spml8">美妆</a></td>
    <td><a name="categCode" href="/pc/column/products-9-0.html#refresh" value="9" name1="mps_index_qbsp_spml9">母婴</a></td>
    <td><a name="categCode" href="/pc/column/products-464-0.html#refresh" value="464" name1="mps_index_qbsp_spml10">服饰鞋包</a></td>
    <td><a name="categCode" href="/pc/column/products-468-0.html#refresh" value="468" name1="mps_index_qbsp_spml11">纸品洗护</a></td>
    <td><a name="categCode" href="/pc/column/products-125-0.html#refresh" value="125" name1="mps_index_qbsp_spml12">家装</a></td>
    </tr></table></div></div>

Changing /pc/column/products-1-0.html to https://ju.suning.com/pc/column/products-1-0.html gives the listing page of the 大家电 (major appliances) category. Analyzing that page's source in turn:
    <a href="/pc/column/products-1-0.html#refresh" value="0" class="active" name1="mps_1_qbsp_ejqb">全 部</a>
    <input type="hidden" value="P" id="secCategCodeBrand"/>
    <a href="/pc/column/products-1-.html#P" value="P" class="floor" name1="mps_1_qbsp_ejml1">精选品牌</a>
    <a href="/pc/column/products-1-139.html#139" value="139" class="floor" name1="mps_1_qbsp_ejml1">厨卫</a>
    <a href="/pc/column/products-1-137.html#137" value="137" class="floor" name1="mps_1_qbsp_ejml2">冰箱</a>
    <a href="/pc/column/products-1-191.html#191" value="191" class="floor" name1="mps_1_qbsp_ejml3">彩电影音</a>
    <a href="/pc/column/products-1-138.html#138" value="138" class="floor" name1="mps_1_qbsp_ejml4">空调</a>
    <a href="/pc/column/products-1-410.html#410" value="410" class="floor" name1="mps_1_qbsp_ejml5">热水器</a>
    <a href="/pc/column/products-1-409.html#409" value="409" class="floor" name1="mps_1_qbsp_ejml6">洗衣机</a>
    <a href="/pc/column/products-1-552.html#552" value="552" class="floor" name1="mps_1_qbsp_ejml7">净水设备</a>
    <a href="/pc/column/products-1-617.html#617" value="617" class="floor" name1="mps_1_qbsp_ejml8">爆款预订</a>

This yields the URLs of the level-2 categories. Changing /pc/column/products-1-.html to https://ju.suning.com/pc/column/products-1-.html gives the 精选品牌 (featured brands) page. Analyzing its source in turn:
    <!-- 精选品牌列表 -->
    <h5 id ="P" class="ju-prodlist-head"><span>精选品牌</span></h5>
    <ul class="ju-prodlist-floor1 ju-prodlist-lazyBrand clearfix">
    <li class="ju-brandlist-item" name="brandCollect" value="100036641">
      <a href="/pc/brandComm-100036641-1.html" title="帅康(sacon)" expotype="2" expo="mps_1_qbsp_jxpp1:帅康(sacon)" name1="mps_1_qbsp_jxpp1" target="_blank" shape="" class="brand-link"></a>
      <img orig-src-type="1-4" orig-src="//image3.suning.cn/uimg/nmps/PPZT/1000592621751_2_390x195.jpg" width="390" height="195" class="brand-pic lazy-loading" alt="帅康(sacon)">
      <div class="sale clearfix"><span class="brand-countdown ju-timer" data-time-now="" name="dateNow" data-time-end="2017-09-06 23:59:57.0"></span><span class="brand-buynum" id="100036641"></span></div>
      <div class="border"></div>
    </li>
    <li class="ju-brandlist-item" name="brandCollect" value="100036910">
      <a href="/pc/brandComm-100036910-1.html" title="富士通(FUJITSU)" expotype="2" expo="mps_1_qbsp_jxpp2:富士通(FUJITSU)" name1="mps_1_qbsp_jxpp2" target="_blank" shape="" class="brand-link"></a>
      <img orig-src-type="1-4" orig-src="//image1.suning.cn/uimg/nmps/PPZT/1000601630663_2_390x195.jpg" width="390" height="195" class="brand-pic lazy-loading" alt="富士通(FUJITSU)">
      <div class="sale clearfix"><span class="brand-countdown ju-timer" data-time-now="" name="dateNow" data-time-end="2017-09-06 23:59:55.0"></span><span class="brand-buynum" id="100036910"></span></div>
      <div class="border"></div>
    </li>

Changing /pc/brandComm-100036641-1.html to https://ju.suning.com/pc/brandComm-100036641-1.html gives the page that lists all products of the 帅康 (sacon) brand. Analyzing its source in turn:
    <li class="ju-prodlist-item" id="6494577">
      <div class="item-wrap">
        <a title="帅康(sacon)烟灶套餐TE6789W+35C欧式不锈钢油烟机灶具套餐" expotype="1" expo="mpsblist_100036641_ppsp_mrsp1:0070068619|126962539" name1="mpsblist_100036641_ppsp_mrsp1" href="/pc/jusp/product-00010641eb93529d.html" target="_blank" shape="" class="prd-link"></a>
        <img class="prd-pic lazy-loading" orig-src-type="0-1" orig-src="//image4.suning.cn/uimg/nmps/ZJYDP/100059262126962539picA_1_392x294.jpg" width="390" height="292">
        <div class="detail">
          <p class="prd-name fixed-height-name">帅康(sacon)烟灶套餐TE6789W+35C欧式不锈钢油烟机灶具套餐</p>
          <p class="prd-desp-items fixed-height-desp"><span>17大吸力</span><span>销量TOP</span><span>一级能效</span><span>限时抢烤箱!</span></p>
        </div>
        <div class="sale clearfix">
          <div class="prd-price clearfix"><div class="sn-price"></div><div class="discount"><p class="full-price"></p></div></div>
          <div class="prd-sale"><p class="prd-quan" id="000000000126962539-0070068619"></p><p class="sale-amount"></p></div>
        </div>
      </div>
      <div class="border"></div>
    </li>

Changing /pc/jusp/product-00010641eb93529d.html to https://ju.suning.com/pc/jusp/product-00010641eb93529d.html gives the page of the product 帅康(sacon)烟灶套餐TE6789W+35C欧式不锈钢油烟机灶具套餐. Analyzing that page's source is enough to extract the product information. The code is described next.
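Each step of the analysis above turns a site-relative link into an absolute URL by hand. As a hedged alternative to string replacement, the standard library's `urllib.parse.urljoin` and `urldefrag` can do the same job; the helper name `absolutize` and the constant `BASE` below are illustrative, not part of the crawler described in this post.

```python
from urllib.parse import urldefrag, urljoin

BASE = "https://ju.suning.com/"


def absolutize(href, base=BASE):
    """Resolve a link found in the page source against the site root
    and drop any '#fragment' part such as '#refresh'."""
    url, _fragment = urldefrag(urljoin(base, href))
    return url


print(absolutize("/pc/column/products-1-0.html#refresh"))
print(absolutize("/pc/brandComm-100036641-1.html"))
# Scheme-relative image links ("//host/path") inherit the scheme of the base URL.
print(absolutize("//image4.suning.cn/uimg/nmps/ZJYDP/100059262126962539picA_1_392x294.jpg"))
```

The same helper covers category, brand, product, and image links, so the crawler needs only one place that knows how URLs are built.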
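2. Implementation
My first version followed the page-source analysis above exactly, with one function per level, each calling the next: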

    import queue
    import re
    import threading
    import time
    import urllib.request

    BASE = "http://ju.suning.com"
    q = queue.Queue()   # log of every URL queued for crawling
    urls = []           # URLs already visited, used to skip duplicates


    def to_absolute(href):
        """Turn a matched 'href="/pc/..."' fragment into an absolute URL."""
        href = href.replace('"', '').replace('href=', '').replace('#refresh', '')
        return href.replace('/pc', BASE + '/pc', 1)


    def download_page(url):
        """Return the raw bytes of a page or an image."""
        with urllib.request.urlopen(url) as response:
            return response.read()


    def fetch_text(url):
        """Download a page and decode it as UTF-8."""
        return download_page(url).decode('utf-8')


    # Level 4: products
    def save_products_from_url(contents):
        category_products = re.findall(r'href="/pc/jusp/product.+?\.html"', contents, re.S)
        print('All level-4 links (products)')
        print(category_products)
        for url_product in category_products:
            url_product = to_absolute(url_product)
            if url_product in urls:
                continue
            urls.append(url_product)   # remember it so a product is not downloaded twice
            get_image(download_page(url_product))
            # Sleep between requests; otherwise the site treats the crawl as abusive and cuts it off
            time.sleep(1)


    # Level 3: brands
    def save_brand_from_url(contents):
        category_brand = re.findall(r'href="/pc/brandComm.+?\.html"', contents, re.S)
        print('All level-3 links (brands)')
        print(category_brand)
        for url_brand in category_brand:
            url_brand = to_absolute(url_brand)
            if url_brand in urls:
                continue
            urls.append(url_brand)
            q.put(url_brand)
            print('Level-3 link (a brand, e.g. 海信):')
            print(url_brand)
            contents = fetch_text(url_brand)
            time.sleep(1)
            save_products_from_url(contents)


    # Level 2: sub-categories, e.g. 空调 (air conditioners)
    def save_contents_from_url(contents):
        category_two = re.findall(r'href="/pc/column/products-\d{1,3}-\d\d\d\.html', contents)
        print('All level-2 links')
        print(category_two)
        for url_two in category_two:
            url_two = to_absolute(url_two)
            if url_two in urls:
                continue
            urls.append(url_two)
            q.put(url_two)
            print('Level-2 link (e.g. 空调):')
            print(url_two)
            contents = fetch_text(url_two)
            time.sleep(1)
            save_brand_from_url(contents)


    # Level 1: top categories, e.g. 大家电 (major appliances)
    def set_urls_from_contents(contents):
        g = re.findall(r'href="/pc/column/products.+?\.html#refresh"', contents, re.S)
        print('All level-1 links')
        print(g)
        for url in g:
            url = to_absolute(url)
            print('Level-1 link (e.g. 大家电):')
            print(url)
            if url in urls:
                continue
            urls.append(url)
            q.put(url)
            contents = fetch_text(url)
            time.sleep(1)
            save_contents_from_url(contents)


    def save_contents():
        url = "https://ju.suning.com/"
        print('Home page')
        print(url)
        set_urls_from_contents(fetch_text(url))


    # Download the product images and record price, title and item id
    def get_image(html):
        html = html.decode('utf-8')
        # Price: the number assigned to sn.gbPrice in the page script
        prices = re.findall(r'sn\.gbPrice ="(\d+(?:\.\d+)?)";', html)
        newprice = prices[-1] if prices else ''
        print('price', newprice)
        # Title: the text between <title> and the first '【'
        titles = re.findall(r'<title>(.*?)【.*?苏宁大聚惠</title>', html)
        newtitle = titles[-1] if titles else ''
        print('title', newtitle)
        images = re.findall(r'orig-src="(//image\d\.suning\.cn/uimg/nmps/ZJYDP/[^"\s]*\.jpg)', html)
        item_id = ''
        for num, img in enumerate(images, start=1):
            img = 'http:' + img
            print(img)
            index = img.index('picA')
            item_id = img[index - 18:index]   # the 18 characters before 'picA' serve as the item id
            name = item_id + '.jpg'
            print(name)
            with open(name, 'wb') as fp:
                fp.write(download_page(img))
            print('Downloaded image %s' % num)
        # Append price, title and item id to items.txt
        with open('items.txt', 'ab') as files:
            items = '|' + newprice + '|' + newtitle + '|' + item_id + '\r\n'
            files.write(items.encode('utf-8'))
        time.sleep(1)


    q.put("https://ju.suning.com/")
    t = threading.Thread(target=save_contents)
    t.start()

Written this way the code is clear and easy to follow, but it is clumsy: it never uses the recursion that crawlers typically rely on. So I later revised it. The difficulty is that each level needs a different regular expression; the revised version finally uses recursion:

    import queue
    import re
    import threading
    import time
    import urllib.request

    BASE = "http://ju.suning.com"
    q = queue.Queue()   # log of every URL queued for crawling
    urls = []           # URLs already visited, used to skip duplicates

    # Mapping from level to the regular expression that finds its links
    LEVEL_PATTERNS = {
        1: re.compile(r'href="/pc/column/products.+?\.html#refresh"', re.S),   # level-1 categories
        2: re.compile(r'href="/pc/column/products-\d{1,3}-\d\d\d\.html'),      # level-2 categories
        3: re.compile(r'href="/pc/brandComm.+?\.html"', re.S),                 # level 3: brands
        4: re.compile(r'href="/pc/jusp/product-.+?\.html"', re.S),             # level 4: products
    }


    def to_absolute(href):
        """Turn a matched 'href="/pc/..."' fragment into an absolute URL."""
        href = href.replace('"', '').replace('href=', '').replace('#refresh', '')
        return href.replace('/pc', BASE + '/pc', 1)


    # Download one page (or image) and return its raw bytes
    def download_page(url):
        with urllib.request.urlopen(url) as response:
            return response.read()


    def fetch_text(url):
        """Download a page and decode it as UTF-8."""
        return download_page(url).decode('utf-8')


    def find_links(contents):
        """Try the patterns from level 4 (products) up to level 1 and return
        (level, matches) for the first pattern that matches, or (0, [])."""
        for level in (4, 3, 2, 1):
            matches = LEVEL_PATTERNS[level].findall(contents)
            if matches:
                return level, matches
            print('No level-%d links on this page' % level)
        return 0, []


    def set_urls_from_contents(contents, category=0, categorysed=0):
        level, links = find_links(contents)
        print('All links found (level %d)' % level)
        print(links)
        for href in links:
            ids = re.search(r'products-(\d+)-(\d*)\.html', href)
            if level == 1 and ids:
                category = ids.group(1)
                print('Level-1 category id', category)
            elif level == 2 and ids:
                categorysed = ids.group(2)
                print('Level-2 category id', categorysed)
            url = to_absolute(href)
            print(url)
            if url in urls:
                continue
            urls.append(url)
            q.put(url)
            if level == 4:
                # A product page: extract its data
                get_image(download_page(url), category, categorysed)
            else:
                # A category or brand page: recurse into it
                time.sleep(0.1)
                set_urls_from_contents(fetch_text(url), category, categorysed)


    def save_contents():
        url = "https://ju.suning.com/"
        print('Home page')
        print(url)
        set_urls_from_contents(fetch_text(url))


    # Download the product images and record categories, price, title and item id
    def get_image(html, category, categorysed):
        html = html.decode('utf-8')
        # Price: the number assigned to sn.gbPrice in the page script
        prices = re.findall(r'sn\.gbPrice ="(\d+(?:\.\d+)?)";', html)
        newprice = prices[-1] if prices else ''
        print('price', newprice)
        # Title: the text between <title> and the first '【'
        titles = re.findall(r'<title>(.*?)【.*?苏宁大聚惠</title>', html)
        newtitle = titles[-1] if titles else ''
        print('title', newtitle)
        images = re.findall(r'orig-src="(//image\d\.suning\.cn/uimg/nmps/ZJYDP/[^"\s]*\.jpg)', html)
        item_id = ''
        for num, img in enumerate(images, start=1):
            img = 'http:' + img
            print(img)
            index = img.index('pic')
            item_id = img[index - 18:index]   # the 18 characters before 'pic' serve as the item id
            name = item_id + '.jpg'
            print(name)
            with open(name, 'wb') as fp:
                fp.write(download_page(img))
            print('Downloaded image %s' % num)
        with open('items.txt', 'ab') as files:
            items = (str(category) + '|' + str(categorysed) + '|' + newprice + '|'
                     + newtitle + '|' + item_id + '\r\n')
            files.write(items.encode('utf-8'))
        time.sleep(1)


    # Entry point: start from the home page
    q.put("https://ju.suning.com/")
    t = threading.Thread(target=save_contents)
    t.start()

I started the program before going to bed, and by the next day it had crawled nearly three thousand records. The program had not reported any error; the crawl most likely stopped because the computer went to sleep and the network connection dropped. The data was already more than enough, though.
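If the crawl really did stall on a dropped connection (that is only a guess), a request timeout plus a few retries would make such stalls visible and recoverable. The following is a minimal sketch under that assumption, not part of the program above: `fetch`, `fetch_with_retry`, and their parameters are hypothetical names, and the `fetcher` and `sleep` arguments exist only so the retry logic can be tried without network access.

```python
import time
import urllib.request


def fetch(url, timeout=10):
    """Download a URL, giving up if the server does not respond within `timeout` seconds."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_with_retry(url, retries=3, delay=1.0, fetcher=fetch, sleep=time.sleep):
    """Call fetcher(url) up to `retries` times, waiting longer after each failure."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return fetcher(url)
        except OSError as exc:   # urllib's URLError and socket timeouts are OSError subclasses
            last_error = exc
            print('attempt %d for %s failed: %s' % (attempt, url, exc))
            if attempt < retries:
                sleep(delay * attempt)
    raise last_error
```

Swapping `download_page` for a helper like this would turn a silent hang into either a successful retry or an explicit exception that shows where the crawl stopped.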