A Beginner Scrapes Phone Specs and Prices from ZOL (Zhongguancun Online) Detail Pages
Source: Internet | Editor: 程序博客网 | Time: 2024/04/26 00:45
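Background:
- The requirement is to scrape, from every phone detail page, the price range and specs of each phone model;
- I had no scraping experience, and code copied from online tutorials kept throwing errors. After some soul-searching I decided to look up the scraping tools myself, follow the official documentation, work out the patterns in the page source, and design a scraping plan that suited me;
- The tools I found online were scrapy and bs4 (BeautifulSoup). bs4 seemed easier to pick up and was enough for this job, so I chose bs4;
- If you are interested, take a look at scrapy too; its recursive crawling looks quite powerful;
- Enough talk, let's start scraping~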
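The plan is simple and comes down to three steps:
- Step 1: scrape the product ids from all list pages (inspecting the site showed that the first 104 pages contain every valid product id and all pages after 104 are empty, so a for loop over the first 104 pages collects them all);
- Step 2: substitute each product id into the detail-page URL to get all the detail-page links;
- Step 3: loop again to fetch the source of every detail page, parse out the required fields, and save them to CSV.
The plan rests on one observation about how list pages and detail pages relate: the number in a list-page href such as /cell_phone/index375437.shtml is the same number that appears in the detail-page URL. So the number can be pulled out as the product id and placed into the detail-page URL, which gives the detail link http://detail.zol.com.cn/cell_phone/index375437.shtml.
Note: of course, there was a fair amount of debugging and parsing work along the way, which took some effort, but it has all been sorted out.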
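The three steps above hinge on turning a list-page href into a product id and back into a detail URL. A minimal sketch of that mapping in Python 3 follows (the article's own code targets Python 2). The two URL patterns are taken from the article; the helper names `list_page_urls`, `extract_product_id`, and `detail_url` are hypothetical and only for illustration.

```python
import re

# URL patterns as they appear in the article
LIST_URL = "http://detail.zol.com.cn/cell_phone_index/subcate57_0_list_1_0_9_2_0_{page}.html"
DETAIL_URL = "http://detail.zol.com.cn/cell_phone/index{pid}.shtml"
ID_PATTERN = re.compile(r"/cell_phone/index(\d+)\.shtml")


def list_page_urls(last_page=104):
    """List-page URLs for pages 1..last_page (hypothetical helper)."""
    return [LIST_URL.format(page=p) for p in range(1, last_page + 1)]


def extract_product_id(href):
    """Return the numeric id inside a detail href, or None if it is not one."""
    match = ID_PATTERN.search(href or "")
    return match.group(1) if match else None


def detail_url(pid):
    """Build the detail-page URL for a product id."""
    return DETAIL_URL.format(pid=pid)


# hrefs as they might come back from a list page: duplicates, other links, missing href
hrefs = [
    "/cell_phone/index375437.shtml",
    "/cell_phone_index/subcate57_0_list_1_0_9_2_0_2.html",
    None,
    "/cell_phone/index375437.shtml",
]
ids = []
for href in hrefs:
    pid = extract_product_id(href)
    if pid and pid not in ids:  # keep first occurrence only
        ids.append(pid)

print([detail_url(pid) for pid in ids])
```

Anchoring the pattern on `/cell_phone/index` and escaping the dot keeps unrelated links out without a separate filtering pass.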
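The scrape collects 16 detail fields (title, brand name (e.g. 中兴, ZTE), Chinese name, alias (including the English name), release date, screen size, manufacturer's guide price, price range, RAM, storage, core count, main screen, front camera, rear camera, battery capacity, battery type) plus 1 more (the detail-page URL, for cross-checking).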
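As a rough illustration of how these 17 columns could land in a CSV file, here is a small Python 3 sketch using only the standard library. The column names are borrowed from the identifiers in the article's code; mapping them one-to-one onto the 17 fields listed above is my assumption, and `rows_to_csv` is a hypothetical helper. The sample values come from the page used in the article's test code.

```python
import csv
import io

# 16 detail fields + the detail-page URL (names borrowed from the article's code)
COLUMNS = [
    "url", "Name", "title", "c_name", "other_name", "showdate",
    "screen_size", "guide_price", "price_range", "ROM", "RAM", "core_num",
    "main_screen", "camera_front", "camera_back", "e_capacity", "battery_type",
]


def rows_to_csv(rows):
    """Serialize a list of dicts to CSV text; missing fields become empty strings."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=COLUMNS, restval="")
    writer.writeheader()
    for row in rows:
        writer.writerow(row)
    return buf.getvalue()


sample = {
    "url": "http://detail.zol.com.cn/cell_phone/index1174169.shtml",
    "Name": "中兴",
    "guide_price": "1399",
    "price_range": "1259-1329",
}
csv_text = rows_to_csv([sample])
print(csv_text)
```

Using `restval=""` mirrors the article's habit of storing an empty string whenever a field cannot be parsed.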
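References
Here comes the scraping code! To make testing easier, test code that scrapes a single page is included at the end~
Reading the code side by side with the page source makes the parsing much easier to follow. Reference URLs: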
https://www.crummy.com/software/BeautifulSoup/bs3/documentation.zh.html
http://www.jb51.net/article/99453.htm
Scraping code
# -*- coding: utf-8 -*-
# Authoritative version: scrape specs, prices and other details from ZOL phone detail pages.
# Python 2 code (urllib2, unicode, reload(sys)).
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
# a = soup.get_text().encode('utf-8')
import os
import re
import urllib2

import pandas as pd
from bs4 import BeautifulSoup

os.chdir('/Users/wyy/Downloads/')
print(os.getcwd())

# Browser-like User-Agent so the site does not block the scraper
HEADERS = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/601.7.7 (KHTML, like Gecko) Version/9.1.2 Safari/601.7.7'}


def fetch_soup(url):
    # Download url and parse it with the lxml parser
    request = urllib2.Request(url=url, headers=HEADERS)
    response = urllib2.urlopen(request)
    content = response.read()
    return BeautifulSoup(content, "lxml")


# ---------- Step 1: scrape the list pages and collect the hrefs that carry product ids
all_pkg = []
for i in range(1, 105):  # list pages 1..104; later pages are empty
    url = 'http://detail.zol.com.cn/cell_phone_index/subcate57_0_list_1_0_9_2_0_' + str(i) + '.html'
    soup = fetch_soup(url)
    for j in soup.find_all("a"):
        pkg_en = j.get('href')  # None when the <a> has no href
        # pkg = pkg_en.split('/')[-1]
        all_pkg.append([pkg_en])

all_pkg1 = pd.DataFrame(all_pkg, columns=['url1'])
# dropna: axis=0 works on rows (1 on columns); how='any' drops a row if any value is NA,
# 'all' only if every value is NA; the result is a new DataFrame, the original is untouched
t1 = all_pkg1.dropna(axis=0, how='any')
t2 = t1.loc[t1['url1'].str.contains('/cell_phone/index')]  # keep only the useful hrefs
t3 = t2.drop_duplicates().copy()  # 4913 rows (product ids) remain after de-duplication
# Removing empty strings from a list, the simplest way: new_list = [x for x in li if x != '']
# t3.to_csv('phone_url.csv', encoding='utf-8')
# Stripping whitespace: s = ' rtrt3434'; s.strip()


# ---------- Step 2: turn each href into a product id
def map1(x):
    t = str(x.url1)
    s = re.findall(r"index(.+?)\.shtml", t)  # the part between "index" and ".shtml"
    return s[0]  # first element of the match list


t3['url2'] = t3.apply(map1, axis=1)  # axis=1 applies per row; the default is per column
t3.head()
# Out[23]:
#    Unnamed: 0                            url1     url2
# 0          35  /cell_phone/index1164015.shtml  1164015
# 1          36   /cell_phone/index375437.shtml   375437
# 2          37  /cell_phone/index1164296.shtml  1164296
# 3          38  /cell_phone/index1175015.shtml  1175015
# 4          39  /cell_phone/index1158842.shtml  1158842
# t4 = pd.read_csv('phone_id.csv')

# ---------- Step 3: fetch every detail page, then parse the fields
pag = []
for i in t3['url2']:
    url = 'http://detail.zol.com.cn/cell_phone/index' + str(i) + '.shtml'
    pag.append([url, fetch_soup(url)])
pag1 = pd.DataFrame(pag, columns=['url', 'pag'])
# pag1.to_csv('phone_zol_origin.csv', encoding='utf-8')


def child(soup, tag, cls, idx=0, n=0):
    # n-th child node of the idx-th <tag class="cls"> element
    return [x for x in soup.find_all(tag, {"class": cls})[idx]][n]


def span_text(soup, cls, idx):
    # Layout 1: text of the idx-th <span class="cls">
    return [x.split('>')[0].strip() for x in soup.find_all("span", {"class": cls})[idx]][0]


def em_text(soup, url, idx):
    # Layout 2: <em> text of the idx-th link that points back to the page itself, e.g.
    # <a href="http://detail.zol.com.cn/cell_phone/index1144202.shtml" style="width: 228.15px"><em>6.44英寸</em></a>
    try:
        return re.findall(r"<em>(.+?)</em>", unicode(soup.find_all(href=re.compile(re.escape(url)))[idx]))[0]
    except Exception:
        return ''


def get_guide_price(soup):
    # Manufacturer's guide price; the <b> element uses one of two class values
    for cls in ("price-type price-retain", "price-type price-"):
        try:
            return re.findall(r'\d+', str(child(soup, "b", cls)))[-1]
        except Exception:
            pass
    return ''


data = []
for url, soup in pag:
    try:
        # Brand name, e.g. 中兴 (ZTE): take the manuName assignment from the inline script,
        # strip() the whitespace and use [1:-1] to drop the quotes
        Name = [x.split('=')[1].strip() for x in soup.findAll(text=re.compile("manuName"))[0].split(';') if re.findall('manuName', x)][0][1:-1]
    except Exception:
        Name = ''
    try:
        showdate = child(soup, "span", "showdate")  # release date
    except Exception:
        showdate = ''
    try:
        price_range = re.findall(r'\d+', str(child(soup, "span", "merchant-price-range")))[-2:]  # price range
    except Exception:
        price_range = ''
    try:
        c_name = re.findall(r"<h1>(.+?)</h1>", unicode(child(soup, "div", "page-title clearfix", n=1)))[0]  # full Chinese name
    except Exception:
        c_name = ''
    try:
        other_name = re.findall(r"<h2>(.+?)</h2>", unicode(child(soup, "div", "page-title clearfix", n=4)))[0]  # alias / English name
    except Exception:
        other_name = ''
    try:
        title = [x for x in soup.title][0]  # page title
    except Exception:
        title = ''
    guide_price = get_guide_price(soup)
    extend = ''  # only layout 2 provides this field
    try:
        # Layout 1: specs live in <span class="param-value ..."> elements
        ROM = child(soup, "span", "price-status")[1:-1]  # storage capacity
        screen_size = span_text(soup, "param-value low", 0)  # screen size
        RAM = span_text(soup, "param-value highest", 0)  # RAM
        core_num = span_text(soup, "param-value low", 1)  # core count
        main_screen = span_text(soup, "param-value middle", 0)  # main screen resolution
        camera_back = span_text(soup, "param-value middle", 1)  # rear camera
        e_capacity = span_text(soup, "param-value middle", 2)  # battery capacity
        camera_front = span_text(soup, "param-value", 3)  # front camera
        battery_type = span_text(soup, "param-value", 5)  # battery type
    except Exception:
        # Layout 2: some pages follow a different pattern, so fall back to it when layout 1 fails.
        # One column changes: battery_type is replaced by extend.
        screen_size = em_text(soup, url, 1)  # e.g. <em>6.44英寸</em>
        main_screen = em_text(soup, url, 3)  # e.g. <em>342ppi</em>
        e_capacity = em_text(soup, url, 5)  # e.g. <em>4850mAh</em>
        camera_front = em_text(soup, url, 7)  # e.g. <em>1600万</em>
        camera_back = em_text(soup, url, 9)  # e.g. <em>500万</em>
        ROM = em_text(soup, url, 12)  # e.g. <em>64GB</em>
        extend = em_text(soup, url, 13)  # new field, whether storage is expandable, e.g. <em>可扩展</em>
        RAM = em_text(soup, url, 15)  # RAM
    list1 = [url, Name, title, c_name, other_name, showdate, screen_size, guide_price, price_range, ROM, RAM, main_screen, camera_front, camera_back, e_capacity, extend]  # core_num and battery_type are left out
    data.append(list1)

clean = pd.DataFrame(data, columns=['url', 'Name', 'title', 'c_name', 'other_name', 'showdate', 'screen_size', 'guide_price', 'price_range', 'ROM', 'RAM', 'main_screen', 'camera_front', 'camera_back', 'e_capacity', 'extend'])  # 'core_num' and 'battery_type' are left out
# clean.to_csv('2017.7.30phone_zol.csv', encoding='gbk')
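The scraping code relies heavily on one idea: try one page layout, and if parsing raises, fall back to another layout or to an empty string. A minimal Python 3 sketch of that pattern as a reusable helper is given here. `first_successful` and the `page` dictionary are hypothetical stand-ins for illustration, not part of the article's code or of bs4.

```python
def first_successful(extractors, default=""):
    """Call each extractor in order and return the first result that does not raise."""
    for extract in extractors:
        try:
            return extract()
        except Exception:
            continue  # this layout did not match, try the next one
    return default


# Stand-in for a parsed page that only follows the second layout
page = {"layout2": {"screen_size": "6.44英寸"}}

screen_size = first_successful([
    lambda: page["layout1"]["screen_size"],  # raises KeyError for this page
    lambda: page["layout2"]["screen_size"],
])
core_num = first_successful([
    lambda: page["layout1"]["core_num"],
    lambda: page["layout2"]["core_num"],
])

print(repr(screen_size), repr(core_num))
```

Handling each field separately this way means one missing value does not force every other field onto the fallback layout, which is a design choice that differs from the single large try block in the article.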
Test code
# -*- coding: utf-8 -*-
# Python 2 code: the same parsing, tried on a single detail page.
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
# a = soup.get_text().encode('utf-8')
import os
import re
import urllib2

import pandas as pd
from bs4 import BeautifulSoup

os.chdir('/Users/wyy/Downloads/')
print(os.getcwd())

# -------------- Test on one page
url = 'http://detail.zol.com.cn/cell_phone/index1174169.shtml'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/601.7.7 (KHTML, like Gecko) Version/9.1.2 Safari/601.7.7'}
request = urllib2.Request(url=url, headers=headers)
response = urllib2.urlopen(request)
content = response.read()
soup = BeautifulSoup(content, "lxml")

# Fields to extract: title + brand name + model + alias + prices + release date + page URL
# + specs (main screen, size, cameras, memory)
soup.find_all("span", {"class": "showdate"})  # release date
# <span class="showdate">上市时间:2017年06月22日</span>
# (Python 2 shows the text as \u escapes; print u'...' gives 上市时间:2017年06月22日)
soup.find_all("span", {"class": "merchant-price-range"})  # price range
# <span class="merchant-price-range"><a href="/1175/1174169/price.shtml">¥1259<i> 至 </i>1329</a></span>
soup.find_all("div", {"class": "page-title clearfix"})  # Chinese model name + alias
# <div class="page-title clearfix">
# <h1>中兴小鲜5(4GB RAM/全网通)</h1><span class="num"><a href="/series/57/22380_1.html">( 系列共2款 )</a></span> <h2>别名:ZTE V0840,中兴 V0840</h2><div class="subtitle">2.5D弧面玻璃,后置双摄,指纹解锁,23种语言实时翻译</div>
# <!-- 当排行是 1-10 的情况,会给a加一个class lt10 -->
# </div>
soup.find_all("b", {"class": "price-type price-retain"})  # manufacturer's guide price
# <b chart-data='[["7.22","7.23","7.24","7.25","7.28"],[1399,1399,1399,1399,1399],1399,1399]' class="price-type price-retain">1399<i class="icon"></i></b>
soup.find_all("span", {"class": "price-status"})  # ROM
# <span class="price-status">[北京 32GB厂商指导价]</span>
soup.title  # page title
# <title>【中兴小鲜5 4GB RAM/全网通】报价_参数_图片_论坛_ZTE ZTE V0840,中兴 V0840中兴手机报价-ZOL中关村在线</title>
soup.findAll(text=re.compile("manuName"))  # inline script with manuName = '中兴'; manuId = '642'
# var pageType = 'Detail';
# var subPageType = 'Detail';
# var subcateId = '57';
# var manuId = '642';
# var proId = '1174169';
# var seriesId = '22380';
# var subcateName = '手机';
# var manuName = '中兴';
# var pv_subcatid = subcateId;
# var requestTuanFlag = '0';
# var ewImg = 'http://qr.fd.zol-img.com.cn/qrcode/qrcodegen.php?sizeNum=2&logotype=pure&url=http%3A%2F%2Fwap.zol.com.cn%2F1175%2F1174169%2Findex.html%3Ffrom%3Dqrcode&token=66fde63e76';
# var tplType = '';
# var dataFrom = '0';
soup.find_all("span", {"class": "param-value middle"})
# <span>主屏分辨率:</span> (main screen resolution)
# <span>后置摄像头:</span> (rear camera)
# <span>电池容量:</span> (battery capacity)
soup.find_all("span", {"class": "param-value low"})
# <span>主屏尺寸:</span> (screen size)
# <span>核心数:</span> (core count)
soup.find_all("span", {"class": "param-value highest"})
# <span>内存:</span> (RAM)
soup.find_all("span", {"class": "param-value"})
# <span>主屏尺寸:</span>
# <span>主屏分辨率:</span>
# <span>后置摄像头:</span>
# <span>前置摄像头:</span>  (front camera, index 3)
# <span>电池容量:</span>
# <span>电池类型:</span>  (battery type, index 5)
# <span>核心数:</span>
# <span>内存:</span>

# Caveats: discontinued phones have no reference price, unreleased phones have no price range,
# some pages have no alias (the English name sits inside the Chinese name), and some pages
# (new or discontinued models) have a reference price but no manufacturer's guide price.


def child(soup, tag, cls, idx=0, n=0):
    # n-th child node of the idx-th <tag class="cls"> element
    return [x for x in soup.find_all(tag, {"class": cls})[idx]][n]


def span_text(soup, cls, idx):
    # Layout 1: text of the idx-th <span class="cls">
    return [x.split('>')[0].strip() for x in soup.find_all("span", {"class": cls})[idx]][0]


def em_text(soup, url, idx):
    # Layout 2: <em> text of the idx-th link that points back to the page itself, e.g.
    # <a href="http://detail.zol.com.cn/cell_phone/index1144202.shtml" style="width: 228.15px"><em>6.44英寸</em></a>
    try:
        return re.findall(r"<em>(.+?)</em>", unicode(soup.find_all(href=re.compile(re.escape(url)))[idx]))[0]
    except Exception:
        return ''


def get_guide_price(soup):
    # Manufacturer's guide price; the <b> element uses one of two class values
    for cls in ("price-type price-retain", "price-type price-"):
        try:
            return re.findall(r'\d+', str(child(soup, "b", cls)))[-1]
        except Exception:
            pass
    return ''


try:
    # Brand name, e.g. 中兴 (ZTE): strip() removes whitespace, [1:-1] drops the quotes
    Name = [x.split('=')[1].strip() for x in soup.findAll(text=re.compile("manuName"))[0].split(';') if re.findall('manuName', x)][0][1:-1]
except Exception:
    Name = ''
try:
    showdate = child(soup, "span", "showdate")  # release date
except Exception:
    showdate = ''
try:
    price_range = re.findall(r'\d+', str(child(soup, "span", "merchant-price-range")))[-2:]  # price range
except Exception:
    price_range = ''
try:
    c_name = re.findall(r"<h1>(.+?)</h1>", unicode(child(soup, "div", "page-title clearfix", n=1)))[0]  # full Chinese name
except Exception:
    c_name = ''
try:
    other_name = re.findall(r"<h2>(.+?)</h2>", unicode(child(soup, "div", "page-title clearfix", n=4)))[0]  # alias / English name
except Exception:
    other_name = ''
title = [x for x in soup.title][0]  # page title
guide_price = get_guide_price(soup)
core_num = battery_type = extend = ''  # each is filled by only one of the two layouts
try:
    # Layout 1: specs live in <span class="param-value ..."> elements
    ROM = child(soup, "span", "price-status")[1:-1]  # storage capacity
    screen_size = span_text(soup, "param-value low", 0)  # screen size
    RAM = span_text(soup, "param-value highest", 0)  # RAM
    core_num = span_text(soup, "param-value low", 1)  # core count
    main_screen = span_text(soup, "param-value middle", 0)  # main screen resolution
    camera_back = span_text(soup, "param-value middle", 1)  # rear camera
    e_capacity = span_text(soup, "param-value middle", 2)  # battery capacity
    camera_front = span_text(soup, "param-value", 3)  # front camera
    battery_type = span_text(soup, "param-value", 5)  # battery type
except Exception:
    # Layout 2: some pages follow a different pattern, so fall back to it when layout 1 fails.
    # One column changes: battery_type is replaced by extend.
    screen_size = em_text(soup, url, 1)  # e.g. <em>6.44英寸</em>
    main_screen = em_text(soup, url, 3)  # e.g. <em>342ppi</em>
    e_capacity = em_text(soup, url, 5)  # e.g. <em>4850mAh</em>
    camera_front = em_text(soup, url, 7)  # e.g. <em>1600万</em>
    camera_back = em_text(soup, url, 9)  # e.g. <em>500万</em>
    ROM = em_text(soup, url, 12)  # e.g. <em>64GB</em>
    extend = em_text(soup, url, 13)  # new field, whether storage is expandable, e.g. <em>可扩展</em>
    RAM = em_text(soup, url, 15)  # e.g. <em>3GB</em>

rows = []
list1 = [url, Name, title, c_name, other_name, showdate, screen_size, guide_price, price_range, ROM, RAM, core_num, main_screen, camera_front, camera_back, e_capacity, battery_type, extend]
rows.append(list1)
clean = pd.DataFrame(rows, columns=['url', 'Name', 'title', 'c_name', 'other_name', 'showdate', 'screen_size', 'guide_price', 'price_range', 'ROM', 'RAM', 'core_num', 'main_screen', 'camera_front', 'camera_back', 'e_capacity', 'battery_type', 'extend'])
# clean.to_csv('2017.7.30phone_zol.csv', encoding='gbk')
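The values scraped above are all stored as raw strings. As a possible follow-up, the Python 3 sketch below turns three of the snippets quoted in the test code into typed values using only the standard library. The function names `parse_price_range`, `parse_showdate`, and `parse_manu_name` are hypothetical, and the date pattern assumes the release date always has the 年/月/日 form seen on the sample page.

```python
import re
from datetime import date


def parse_price_range(html):
    """Last two integers in the price-range markup, as (low, high)."""
    numbers = re.findall(r"\d+", html)[-2:]
    return tuple(int(n) for n in numbers)


def parse_showdate(text):
    """Convert a string like '上市时间:2017年06月22日' into a date, or None."""
    match = re.search(r"(\d{4})年(\d{1,2})月(\d{1,2})日", text)
    if not match:
        return None
    year, month, day = (int(g) for g in match.groups())
    return date(year, month, day)


def parse_manu_name(script):
    """Value assigned to manuName in the page's inline script, or ''."""
    match = re.search(r"manuName\s*=\s*'([^']*)'", script)
    return match.group(1) if match else ""


# Snippets copied from the sample page quoted in the article's test code
price_html = '<span class="merchant-price-range"><a href="/1175/1174169/price.shtml">¥1259<i> 至 </i>1329</a></span>'
showdate_text = "上市时间:2017年06月22日"
script_text = "var manuId = '642'; var proId = '1174169'; var manuName = '中兴';"

print(parse_price_range(price_html))
print(parse_showdate(showdate_text))
print(parse_manu_name(script_text))
```

Taking only the last two numbers matters here because the href inside the markup also contains digits.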