Scraping Fanli's "Worth Buying Today" (返利网 今日值得买) Data with Python

Source: Internet | Posted by: 网络机顶盒系统升级 | Editor: 程序博客网 | Date: 2024/04/29 16:19

Double Eleven has barely wound down and Double Twelve is already here. The data on Fanli's "Worth Buying Today" page keeps updating nonstop...

1. The goal: scrape each item's name, category, recommender, and its upvote and downvote counts from Fanli.

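2. The item list keeps updating, and the page source only contains the first few items that are shown when the page opens.

The loading pattern is that each scroll down loads 5 more items, and one page holds 50 items.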

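So, as usual, open Chrome, press F12, and scroll down until all the data has finished loading.

At first I had no idea where the data lived, so I clicked through the requests one by one looking for it.

The requests marked with the red box in the screenshot turned out to be the ones needed; they are the ajaxGetItem requests used in the code below.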

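3. Next, work out the pattern in the URLs.

Each page turns out to consist of 10 small modules. The first module is already present in the page source when the page opens, so it does not show up in the request list, but its URL can be opened the same way. The 11th module is not needed. In the page parameter, the first number is the page and the second number is which module of that page is being loaded.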

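Opening the first module with a URL that follows the same pattern (page=1-1) returns its items.

Opening the 11th module (page=1-11) returns no data.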
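To make the pattern concrete, here is a minimal sketch that builds the URL for one module. The helper name build_url is made up for illustration, and it is written for Python 3 (urllib.parse) rather than the Python 2 used in the rest of the post; the query parameters are copied from the URL that appears in the code in step 4.

```python
from urllib.parse import urlencode

BASE_URL = "http://zhide.fanli.com/index/ajaxGetItem"


def build_url(page, module):
    """Return the ajaxGetItem URL for one module (5 items) of one page."""
    params = {
        "cat_id": 0,
        "tag": "",
        # first number: page, second number: module within that page
        "page": "{}-{}".format(page, module),
        "area": 0,
        "tag_id": 0,
        "shop_id": 0,
    }
    return BASE_URL + "?" + urlencode(params)


print(build_url(1, 1))
```

Only the page value changes between requests; the other parameters stay at the values seen in the browser.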

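4. Start by scraping the data of a single module.

With the URL in hand, right-click and view the page source; the content can be extracted with the BeautifulSoup module.

Looking at the markup, every item is a div, and its details sit inside the various a tags.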

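My first idea was to grab all the a tags and pick values out of the list by index. That does work, but only for the first item in a module. For example:

The first div contains 10 a tags, and the wanted data can be picked by index, say a[0], a[3], a[5].

If every following div also had 10 a tags in the same order, the indices could simply be stepped by 10, so the second item would be a[10], a[13], a[15].

But the source shows that the number of a tags per div is not constant: some have 10, some have 12. So I dropped this approach halfway through writing it.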
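A small sketch of why the fixed step breaks. The tag counts below are invented (the page only tells us that some divs have 10 a tags and some have 12), and each tag is represented by a plain tuple instead of a real BeautifulSoup element.

```python
# Hypothetical number of <a> tags in each item div.
a_tags_per_item = [10, 12, 10]

# Flatten into one list, labelling every tag with (item index, position in item).
all_a = [(item, pos)
         for item, count in enumerate(a_tags_per_item)
         for pos in range(count)]

# Fixed step of 10: assume item k starts at index 10 * k.
for k in range(len(a_tags_per_item)):
    item, pos = all_a[10 * k]
    print("expected item", k, "-> got item", item, "at position", pos)
```

The first two lookups land on the start of the right item, but the third lands inside the second item, because that div contributed 12 tags instead of 10.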

Reading further, I noticed that every a tag holding data I wanted has a class, so the next idea was to match the a tags by class.

So the code is (Python 2):

#coding:utf-8
import urllib
import re
from bs4 import BeautifulSoup
def getHtml(url):
    page=urllib.urlopen(url)
    html=page.read()
    return html
def getItems(html):
    # Some of the wanted a tags also carry this class name, which breaks the
    # class match, so strip it out first; then all of them match.
    rep=re.compile(' J_tklink_tmall')
    data=rep.sub('',html)
    soup=BeautifulSoup(data)
    name_list=soup.find_all('a',class_='J-item-track nodelog')  # item name
    fenlei_list=soup.find_all('a',class_='nine')  # category
    usr_list=soup.find_all('div', class_='item-user')  # recommender
    yes_list=soup.find_all('a',class_='l item-vote-yes J-item-vote-yes')  # upvotes
    no_list=soup.find_all('a',class_='l item-vote-no J-item-vote-no')  # downvotes
    for i in range(0,5):  # one module holds 5 items
        print name_list[i].get_text(strip=True).encode("utf-8")+'|'\
              +fenlei_list[i].get_text(strip=True).encode("utf-8")+'|'\
              +usr_list[i].get_text(strip=True).encode("utf-8")+'|'+\
              yes_list[i].get_text(strip=True).encode("utf-8")+'|'+\
              no_list[i].get_text(strip=True).encode("utf-8")+'|'+'\n'
url='http://zhide.fanli.com/index/ajaxGetItem?cat_id=0&tag=&page=1-1&area=0&tag_id=0&shop_id=0'
html=getHtml(url)
getItems(html)
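A note on why the J_tklink_tmall substitution is needed: when class_ is given a string containing several class names, BeautifulSoup compares it against the exact value of the class attribute, so a tag with one extra class no longer matches. The sketch below shows only the normalization step, using the standard re module on a made-up two-link snippet (the real markup is longer).

```python
import re

# Made-up markup in the shape the post describes: the second link carries
# an extra class, so its class attribute string differs from the first.
html = (
    '<a class="J-item-track nodelog" href="#">item A</a>'
    '<a class="J-item-track nodelog J_tklink_tmall" href="#">item B</a>'
)

# Same substitution as in the post: drop the extra class and its leading space.
normalized = re.sub(' J_tklink_tmall', '', html)

# After normalization both links carry the identical class attribute string.
class_values = re.findall(r'class="([^"]*)"', normalized)
print(class_values)
```

An alternative that should avoid the substitution altogether is a CSS selector such as soup.select('a.J-item-track.nodelog'), which matches tags that have both classes regardless of any others; I have not tried it against this page.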

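5. Scrape multiple pages, starting with the first 50.

The URL contains two variables, so it behaves like a two-dimensional array and needs two nested for loops.

Code: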
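As a quick check of how many requests this means, here is a sketch that enumerates every page-module pair. It uses itertools.product, which yields the same order as two nested for loops; the variable names are my own.

```python
from itertools import product

PAGES = range(1, 51)     # pages 1..50
MODULES = range(1, 11)   # modules 1..10 on each page; module 11 is empty

# Every value of the page= query parameter, page by page.
page_params = ["{}-{}".format(page, module)
               for page, module in product(PAGES, MODULES)]

print(len(page_params), page_params[0], page_params[-1])
```

That is 500 requests of 5 items each, 2500 items in total for the first 50 pages.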

#coding:utf-8
import urllib2
import re
from bs4 import BeautifulSoup
class fanli():
    def __init__(self):
        self.usr_agent='Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36'
        self.headers={'usr_agent',self.usr_agent}
        self.p=1
        self.pageIndex=1
    def getHtml(self,p,pageIndex):
        try:
            # To fetch only page 1, the URL needs just the p parameter, i.e.:
            #url='http://zhide.fanli.com/index/ajaxGetItem?cat_id=0&tag=&page=1-'+str(p)+'&area=0&tag_id=0&shop_id=0'
            url='http://zhide.fanli.com/index/ajaxGetItem?cat_id=0&tag=&page='+str(pageIndex)+'-'+str(p)+'&area=0&tag_id=0&shop_id=0'
            request=urllib2.Request(url)
            page=urllib2.urlopen(request)
            html=page.read()
            return html
        except urllib2.URLError,e:
            if hasattr(e,'reason'):
                print u'连接失败',e.reason  # "connection failed"
                return None
    def getItems(self):
        with open('fanli.txt','a') as f:
            f.write('商品名称'+'|'+'分类'+'|'+'推荐人'+'|'+'好评数'+'|'+'差评数'+'\n')  # header: name|category|recommender|upvotes|downvotes
        for pageIndex in range(1,51):  # IndexError: list index out of range shows up at page 11
            for p in range(1,11):
                html=self.getHtml(pageIndex,p)
                rep=re.compile(' J_tklink_tmall')
                data=rep.sub('',html)
                soup=BeautifulSoup(data)
                name_list=soup.find_all('a',class_='J-item-track nodelog')  # item name
                fenlei_list=soup.find_all('a',class_='nine')  # category
                usr_list=soup.find_all('div', class_='item-user')  # recommender
                yes_list=soup.find_all('a',class_='l item-vote-yes J-item-vote-yes')  # upvotes
                no_list=soup.find_all('a',class_='l item-vote-no J-item-vote-no')  # downvotes
                f=open('fanli.txt','a')
                for i in range(0,5):
                    f.write(name_list[i].get_text(strip=True).encode("utf-8")+'|'\
                        +fenlei_list[i].get_text(strip=True).encode("utf-8")+'|'\
                        +usr_list[i].get_text(strip=True).encode("utf-8")+'|'\
                        +yes_list[i].get_text(strip=True).encode("utf-8")+'|'\
                        +no_list[i].get_text(strip=True).encode("utf-8")+'|'+'\n')
                f.close()
spider=fanli()
spider.getItems()

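The code looked right, but it raised an error somewhere below line 27: some list index was out of range, and I could not tell which list. Changing the for loop on line 27 (the outer pageIndex loop) to range(1,11) made it run normally. As soon as the upper bound became 12, that is, once the crawl reached page 11, the error came back. So I took the single-page code and scraped page 11 on its own, and it came out complete, with no error at all!!

On an expert's advice I wrapped the code in try...except to catch the exception and then continue...

The result was still wrong!! Maybe I was using try...except incorrectly!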

Uh-oh...

=============================== a fancy divider line ===============================
Finally figured out where it went wrong!!!!

Python passes positional arguments in order!! I had completely forgotten about that...

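The function that fetches the page declares its parameters as getHtml(self,p,pageIndex), module first and page second.

The call, however, passes them the other way round: self.getHtml(pageIndex,p). With the arguments swapped, the outer loop's value ends up in the module slot of the URL, so once it reaches 11 the request asks for an 11th module, which is empty, and indexing the empty result lists raises the IndexError.

So swapping the order is all it takes, and the code then runs normally.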

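If the parameter list of getHtml() is left as it was, there are two options: swap the order at the call site, or name the arguments explicitly in the call, i.e. self.getHtml(p=p,pageIndex=pageIndex).

Either works.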
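The mix-up is easy to reproduce in isolation. In this sketch page_param is a made-up stand-in for getHtml() that only builds the page value of the URL, with the same parameter order as the buggy version.

```python
def page_param(p, pageIndex):
    """Mimic the buggy signature: module number first, page number second."""
    return "{}-{}".format(pageIndex, p)


page, module = 11, 1   # the outer loop reaches 11 for the first time

# Positional call in loop order (page, module): the values land in the wrong slots.
print(page_param(page, module))               # asks for module 11 of page 1

# Keyword call: the order at the call site no longer matters.
print(page_param(pageIndex=page, p=module))   # asks for module 1 of page 11
```

This also explains why range(1,11) seemed fine: while both loop values stay between 1 and 10, the swapped request is still a valid page and module, just not the one intended.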

So the final, correct code is:

#coding:utf-8
import urllib2
import re
from bs4 import BeautifulSoup
class fanli():
    def __init__(self):
        self.usr_agent='Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36'
        self.headers={'usr_agent',self.usr_agent}  # a set, not a dict; never passed to Request below
        self.p=1
        self.pageIndex=1
    def getHtml(self,pageIndex,p):
        try:
            url='http://zhide.fanli.com/index/ajaxGetItem?cat_id=0&tag=&page='+str(pageIndex)+'-'+str(p)+'&area=0&tag_id=0&shop_id=0'
            request=urllib2.Request(url)
            page=urllib2.urlopen(request)
            html=page.read()
            return html
        except urllib2.URLError,e:
            if hasattr(e,'reason'):
                print u'连接失败',e.reason  # "connection failed"
                return None
    def getItems(self):
        with open('fanli.txt','a') as f:
            f.write('商品名称'+'|'+'分类'+'|'+'推荐人'+'|'+'好评数'+'|'+'差评数'+'\n')  # header: name|category|recommender|upvotes|downvotes
        for pageIndex in range(1,51):
            for p in range(1,11):
                html=self.getHtml(pageIndex,p)
                if html is None:  # request failed; skip this module
                    continue
                rep=re.compile(' J_tklink_tmall')
                data=rep.sub('',html)
                soup=BeautifulSoup(data)
                name_list=soup.find_all('a',class_='J-item-track nodelog')  # item name
                fenlei_list=soup.find_all('a',class_='nine')  # category
                usr_list=soup.find_all('div', class_='item-user')  # recommender
                yes_list=soup.find_all('a',class_='l item-vote-yes J-item-vote-yes')  # upvotes
                no_list=soup.find_all('a',class_='l item-vote-no J-item-vote-no')  # downvotes
                f=open('fanli.txt','a')
                for i in range(0,5):
                    f.write(name_list[i].get_text(strip=True).encode("utf-8")+'|'\
                        +fenlei_list[i].get_text(strip=True).encode("utf-8")+'|'\
                        +usr_list[i].get_text(strip=True).encode("utf-8")+'|'\
                        +yes_list[i].get_text(strip=True).encode("utf-8")+'|'\
                        +no_list[i].get_text(strip=True).encode("utf-8")+'|'+'\n')
                f.close()
spider=fanli()
spider.getItems()
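The writing loop indexes range(0,5), which assumes every module returns exactly five of each field. One possible way to make that step more forgiving, sketched here in Python 3 with invented data, is to zip the lists together and let the csv module handle the | separator. The lists below are plain strings standing in for the get_text() results, and io.StringIO stands in for the output file.

```python
import csv
import io

# Stand-in values for one module; every value here is made up.
names = ["item A", "item B", "item C"]
categories = ["digital", "food", "home"]
users = ["user1", "user2", "user3"]
upvotes = ["12", "3", "40"]
downvotes = ["1", "0"]          # deliberately one entry short

out = io.StringIO()             # stands in for open('fanli.txt', 'a')
writer = csv.writer(out, delimiter="|", lineterminator="\n")
writer.writerow(["name", "category", "recommender", "upvotes", "downvotes"])

# zip stops at the shortest list, so a short or empty module never raises
# IndexError; it simply yields fewer (or zero) rows.
for row in zip(names, categories, users, upvotes, downvotes):
    writer.writerow(row)

print(out.getvalue())
```

The trade-off is that zip drops rows silently when one list comes up short, so it hides exactly the kind of mismatch that exposed the argument-order bug; comparing the list lengths first would keep that signal.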

One more thing to note: this time urllib2.Request() is called without the headers argument, because passing it raised an error:

AttributeError: 'set' object has no attribute 'items'

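Dropping headers made the code run normally. The underlying cause is that self.headers is written with a comma, {'usr_agent',self.usr_agent}, which makes it a set rather than a dict, and urllib2.Request() calls headers.items(), which a set does not have. Written as a dict with a colon, and with the header name User-Agent, it can be passed in.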
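The same error can be reproduced without touching the network, since building a Request object does not send anything. This sketch uses Python 3's urllib.request, the successor of urllib2, so the import differs from the code above; the URL and User-Agent string are the ones from the post.

```python
from urllib.request import Request

URL = "http://zhide.fanli.com/index/ajaxGetItem?cat_id=0&tag=&page=1-1&area=0&tag_id=0&shop_id=0"
USER_AGENT = ("Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36")

# A comma between the braces builds a set, which has no .items() method.
try:
    Request(URL, headers={"usr_agent", USER_AGENT})
    error = None
except AttributeError as exc:
    error = str(exc)
print(error)

# A colon builds a dict, and the header name is "User-Agent".
request = Request(URL, headers={"User-Agent": USER_AGENT})
print(request.get_header("User-agent"))   # Request stores header names capitalized like this
```

With the dict form the request object carries the browser-like User-Agent and can be handed to urlopen() as before.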

2 0