Python web crawler
Source: Internet. Editor: 程序博客网. Time: 2024/06/05 10:46
A crawler that scrapes jokes from Qiushibaike (糗事百科), Python 3.5 environment:
1. Download the page
2. Parse it (XPath method)
# -*- coding: utf-8 -*-
import io
import sys
import urllib.error
import urllib.request
from urllib.parse import urljoin

from lxml import etree

# Change the default encoding of standard output
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='gb18030')


def download(origin_url, p):
    url = str(origin_url) + str(p)
    print(url)
    print(p)
    # Add request headers
    headers = {'User-Agent': r'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
               'Connection': 'keep-alive'}
    # Create the opener; addheaders must be a list of (name, value) tuples
    opener = urllib.request.build_opener()
    opener.addheaders = list(headers.items())
    try:
        page = opener.open(url).read().decode('utf-8')
    except urllib.error.HTTPError as e:
        print(e.reason)
        return None  # no page was downloaded
    return page


def parser(or_url, p):
    results = []
    page = download(or_url, p)
    if not page:
        print("Page download failed")
        return None
    content = etree.HTML(page)
    readline = content.xpath('//div[@id="content-left"]/div')
    for line in readline:
        try:
            # Username
            user = line.xpath('./div[@class="author clearfix"]//h2/text()')[0]
            # URL of the post, made absolute
            link = line.xpath('./a[@class="contentHerf"]/@href')[0]
            link = urljoin(or_url, link)
            # Body text
            detail = line.xpath('./a[1]/div/span/text()')[0]
            # Stats
            stats = line.xpath('./div[@class="stats"]/span[1]/i/text()')[0]
            # Number of comments
            dash = line.xpath('./div[@class="stats"]/span[2]//i/text()')[0]
        except IndexError:
            # A field is missing from this entry, so skip it
            continue
        results.append(dict(user=user, detail=detail, stats=stats,
                            dash=dash, link=link))
    return results


if __name__ == "__main__":
    k = 0
    result = {}
    url = r'http://www.qiushibaike.com/hot/page/'
    for i in range(1, 3):
        temp = parser(url, i)
        if temp is None:
            continue
        print(len(temp))
        for dic in temp:
            result[k] = {'user': dic['user'], 'detail': dic['detail'],
                         'stats': dic['stats'], 'dash': dic['dash'],
                         'link': dic['link']}
            k += 1
    for k, v in result.items():
        print(k, v)
Points to watch when writing a crawler:
1. Request headers: headers
2. Creating an opener
3. Cookies (used for logging in)
4. Catching exceptions
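The four points above can be combined in one small sketch. The helper names `make_opener` and `fetch` and the `timeout` value are hypothetical and do not come from the original script. The original does not show a login flow, so the cookie jar here only stores whatever cookies the server sends.

```python
import http.cookiejar
import urllib.error
import urllib.request


def make_opener():
    """Build an opener that keeps cookies and sends custom headers."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))
    # addheaders must be a list of (name, value) tuples, not a dict
    opener.addheaders = [
        ('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'),
        ('Connection', 'keep-alive'),
    ]
    return opener, jar


def fetch(opener, url, timeout=10):
    """Return the decoded page text, or None if the request fails."""
    try:
        with opener.open(url, timeout=timeout) as resp:
            return resp.read().decode('utf-8')
    except urllib.error.HTTPError as e:
        # The server answered with an error status (4xx or 5xx)
        print('HTTP error:', e.code, e.reason)
    except urllib.error.URLError as e:
        # The server could not be reached at all
        print('Connection error:', e.reason)
    return None
```

Because `HTTPError` is a subclass of `URLError`, it has to be caught first. Cookies set by one response stay in `jar` and are sent again on later requests made through the same opener, which is what a login session relies on.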
Key points:
1. Joining URLs:
from urllib.parse import urljoin
link = urljoin(url, link)
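A short illustration of how `urljoin` resolves different kinds of links against a base URL. The article path `/article/123` and the `example.com` address are made-up values for the example, not real pages.

```python
from urllib.parse import urljoin

base = 'http://www.qiushibaike.com/hot/page/'

# A link starting with "/" replaces the whole path of the base URL
absolute_path = urljoin(base, '/article/123')

# A relative link is appended when the base ends with "/"
relative = urljoin(base, '2')

# Without the trailing "/", the last path segment of the base is replaced
no_slash = urljoin('http://www.qiushibaike.com/hot/page', '2')

# A link that is already a full URL is returned unchanged
full = urljoin(base, 'http://example.com/a')

print(absolute_path)  # http://www.qiushibaike.com/article/123
print(relative)       # http://www.qiushibaike.com/hot/page/2
print(no_slash)       # http://www.qiushibaike.com/hot/2
print(full)           # http://example.com/a
```

This is why the crawler can pass the listing-page URL as the base: the `href` values taken from the page are turned into absolute links whichever form they take.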