Python Web Scraping Study Notes 4

Source: Internet | Published by: 对淘宝客服的理解 | Editor: 程序博客网 | Time: 2024/06/13 13:34

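Task: scrape the regular listings on 58.com and, from each listing's detail page, collect the category, title, post time, price, condition, area, and view count.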

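Note: the site has anti-scraping measures, so send browser-like request headers. Also check whether the seller is an individual or a merchant (the 0 or 1 in the URL).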
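As a small illustration of the 0/1 convention above, the list-page URL can be built from the seller flag and the page number. This is only a sketch: the URL pattern is the one used in the code in this post, while the names `build_list_url` and `seller_label` are hypothetical helpers introduced here.

```python
# URL pattern taken from the scraper in this post; 0 = individual, 1 = merchant.
LIST_URL = 'http://bj.58.com/pingbandiannao/{who_sells}/pn{page}/'


def build_list_url(who_sells, page=1):
    """Return the list-page URL for a seller type and page number."""
    if who_sells not in (0, 1):
        raise ValueError('who_sells must be 0 (individual) or 1 (merchant)')
    if page < 1:
        raise ValueError('page numbers start at 1')
    return LIST_URL.format(who_sells=who_sells, page=page)


def seller_label(who_sells):
    """Translate the 0/1 flag into a readable label."""
    return 'individual' if who_sells == 0 else 'merchant'


if __name__ == '__main__':
    for flag in (0, 1):
        print(seller_label(flag), build_list_url(flag, page=2))
```

Keeping the flag validation in one place avoids silently requesting a URL that the site would not recognize.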

Title taken from the detail page

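To be improved:

'http://bj.58.com/pingbandiannao/{}/pn2' is the second page. The detail-page links on the first page cannot be scraped yet. The problem lies in the detail links collected from the first page: they are formatted differently from those on later pages, and the later pages are not all alike either. In every case, pay attention to which marker you use when cutting the ID out of the link.

The view count does not work yet.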
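One way to make the ID extraction independent of the link format is to read the `entinfo` query parameter by name instead of splitting on a fixed character. The sketch below uses only the standard library; `extract_info_id` is a hypothetical helper name, and it assumes (based on the two sample links quoted in the code in this post) that the ID is the part of `entinfo` before the underscore.

```python
from urllib.parse import urlparse, parse_qs


def extract_info_id(href):
    """Return the listing ID from a detail link, or None if entinfo is absent."""
    query = parse_qs(urlparse(href).query)
    values = query.get('entinfo')
    if not values:
        return None
    # entinfo looks like '<id>_<flag>'; keep only the ID part.
    return values[0].split('_')[0]


if __name__ == '__main__':
    short_link = ('http://bj.58.com/pingbandiannao/29292515289012x.shtml'
                  '?psid=159059162195147270219031821&entinfo=29292515289012_0')
    long_link = ('http://bj.58.com/pingbandiannao/26213204416840x.shtml'
                 '?psid=159059162195147270219031821&entinfo=26213204416840_0'
                 '&iuType=p_1&PGTID=0d305a36-0000-1fe0-6998-6f094bc793e8&ClickID=1')
    print(extract_info_id(short_link))
    print(extract_info_id(long_link))
    # Splitting on the last '=' only works for the short form:
    print(long_link.split('=')[-1].split('_')[0])
```

For the longer link, splitting on the last `=` returns the `ClickID` value rather than the listing ID, which may be one reason some links fail.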

from bs4 import BeautifulSoup
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'
}

# url = 'http://bj.58.com/pingbandiannao/26062681492781x.shtml'
# wb_data = requests.get(url)
# soup = BeautifulSoup(wb_data.text, 'lxml')


def get_links_from(who_sells):
    # Collect the detail-page links from the list page.
    urls = []
    url = 'http://bj.58.com/pingbandiannao/{}x.shtml'
    list_view = 'http://bj.58.com/pingbandiannao/{}/pn1/'.format(str(who_sells))
    wb_data = requests.get(list_view, headers=headers)
    soup = BeautifulSoup(wb_data.text, 'lxml')
    for link in soup.select('td.t a.t'):
        href = link.get('href') or ''
        # The first page and later pages format their links differently
        # (see the two samples below), so cut on what they share: the
        # entinfo parameter. Splitting on the last '=' would return the
        # ClickID value for the longer form.
        if 'entinfo=' not in href:
            continue
        urls.append(url.format(href.split('entinfo=')[-1].split('_')[0]))
    return urls


# Sample links:
# http://bj.58.com/pingbandiannao/29292515289012x.shtml?psid=159059162195147270219031821&entinfo=29292515289012_0
# http://bj.58.com/pingbandiannao/26213204416840x.shtml?psid=159059162195147270219031821&entinfo=26213204416840_0&iuType=p_1&PGTID=0d305a36-0000-1fe0-6998-6f094bc793e8&ClickID=1


def get_views_from(url):
    # The listing ID is what uniquely identifies each detail page.
    # str.strip('x.shtml') removes a set of characters, not a suffix,
    # so cut the suffix off explicitly instead.
    info_id = url.split('/')[-1].split('x.shtml')[0]
    api = 'http://jst1.58.com/counter?infoid={}'.format(info_id)
    js = requests.get(api, headers=headers)
    views = js.text.split('=')[-1]
    return views


def get_item_info(who_sells=1):
    # Scrape the detail pages; who_sells = 1 means merchant, 0 means individual.
    urls = get_links_from(who_sells)
    for url in urls:
        wb_data = requests.get(url, headers=headers)
        soup = BeautifulSoup(wb_data.text, 'lxml')
        title = soup.title.text  # read the title straight from the <title> tag
        # Selectors copied from the browser are too complex on a large site,
        # so use simplified ones; the detail page is simple. '#' marks an id.
        price = soup.select('span.price')
        date = soup.select('.time')
        # view = soup.select('em#totalcount')
        # area = soup.select('span.c_25d')
        data = {
            # Zhuanzhuan pages use different fields; they are not handled for now.
            'title': title,
            # select() returns a list of tags; take the single element and read its text.
            'price': price[0].text if price else None,
            'date': date[0].text if date else None,
            'area': list(soup.select('.c_25d')[0].stripped_strings) if soup.find_all('span', 'c_25d') else None,
            'cata': 'individual' if who_sells == 0 else 'merchant',  # seller category
            'views': get_views_from(url)  # view count, fetched through the JS counter request
        }
        print(data)


# get_item_info(url)
# get_links_from(1)
# get_views_from(url)
get_item_info()
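A note on pulling the ID out of the detail-page URL: `str.strip('x.shtml')` removes any of the characters `x . s h t m l` from both ends rather than removing the suffix `x.shtml`. It happens to work for all-digit IDs, but an explicit suffix check is safer. The sketch below is illustrative only; `info_id_from_url` is a hypothetical helper, and the second sample filename is made up to show the pitfall.

```python
SUFFIX = 'x.shtml'


def info_id_from_url(url):
    """Return the listing ID from a URL whose filename is '<id>x.shtml'."""
    # Drop any query string, then keep the last path segment.
    filename = url.split('?')[0].rsplit('/', 1)[-1]
    if not filename.endswith(SUFFIX):
        raise ValueError('unexpected detail-page URL: {}'.format(url))
    return filename[:-len(SUFFIX)]


if __name__ == '__main__':
    real = 'http://bj.58.com/pingbandiannao/26062681492781x.shtml'
    print(info_id_from_url(real))

    # Hypothetical filename showing why strip() is fragile:
    made_up = 'shtml42x.shtml'
    print(made_up.strip('x.shtml'))     # characters are stripped from both ends
    print(made_up[:-len(SUFFIX)])       # only the suffix is removed
```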


 
                                                     0        0           	
					
					   Python 爬虫学习4
	  	   python爬虫学习--pixiv爬虫(4)--代码优化
	  	   学习python爬虫
	  	   python 爬虫学习一
	  	   Python爬虫学习
	  	   Python爬虫学习
	  	   python爬虫学习
	  	   Python学习--爬虫
	  	   python学习爬虫
	  	   Python 爬虫学习1
	  	   Python 爬虫学习2
	  	   python简单爬虫学习
	  	   Python简单爬虫学习
	  	   Python爬虫学习系列
	  	   python +  机器学习 + 爬虫
	  	   python 爬虫 学习
	  	   python爬虫基础学习
	  	   Python爬虫学习总结
	     		  
	  	   直接插入排序算法
	  	   史上最详尽的平衡树(splay)讲解与模板
	  	   redis持久化
	  	   android 串口通讯 JNI
	  	   Codeforces 777E/778C 题解 （贪心）
	  	   Python 爬虫学习4
	  	   POJ 3580 SuperMemo（splay成段更新、区间最小值、反转、插入和删除、区间搬移）
	  	   使用Masonry还是storyboard？
	  	   看门狗框架的原理
	  	   Winform应用程序实现通用遮罩层二
	  	   JSP标准标签库——JSTL
	  	   【LeetCode题解】56. Merge Intervals
	  	   C++Primer第五版 第十三章习题答案（21~30）
	  	   C/S架构的简单文件传输系统的实现