Qiushibaike crawler for the site as of 2017-10-01 (Python 3.x)

Source: Internet. Editor: 程序博客网. Time: 2024/05/21 06:32

Learned from and improved upon http://cuiqingcai.com/990.html

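1. First, download the page

a. A plain page download, with no custom headers, fails with the following error:

http.client.RemoteDisconnected: Remote end closed connection without response

This is probably because the request does not send browser-like headers.

b. The request needs the browser's User-Agent string. To find it, enter about:version in the browser's address bar.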
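The fix for the error in step a can be sketched as follows. This is a minimal example, assuming the missing User-Agent is what makes the server drop the connection. `build_request` is a hypothetical helper name, and the User-Agent value is only an example; use the one your own browser reports.

```python
import urllib.request

# Assumption: replace this with the value shown by about:version in your browser.
USER_AGENT = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) "
              "AppleWebKit/537.36 (KHTML, like Gecko) "
              "Chrome/56.0.2924.87 Safari/537.36")


def build_request(url, user_agent=USER_AGENT):
    """Return a Request that carries a browser-like User-Agent header."""
    # The header value is the bare string; "User-Agent" is only the dict key.
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


request = build_request("https://www.qiushibaike.com/hot/page/1")
# urllib stores header names capitalized, so the lookup key is "User-agent".
print(request.get_header("User-agent"))
```

Passing the returned object to `urllib.request.urlopen` then sends the header with the request.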

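2. Page parser

a. Parsing is done with regular expressions. Note that when an element may be absent (an image, for example), capture the region between the markup before and after it, then test the captured text.

b. The captured text carries a lot of surrounding whitespace; a.strip() removes leading and trailing spaces and newlines.

c. When the list page shows only part of a post, follow the link to the post's own page and extract the full text from there. Posts that contain images are still not shown in this case.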
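The three points above can be illustrated on a small, made-up HTML fragment. This is only a sketch: `SAMPLE_HTML`, the `post` and `thumb` class names, and `parse_posts` are assumptions for illustration and do not reproduce the real page markup.

```python
import re

# Hypothetical, simplified markup; the real page structure differs.
SAMPLE_HTML = """
<div class="post"><h2>
  alice
</h2><span>
  short joke
</span><div class="thumb"></div></div>
<div class="post"><h2>bob</h2><span>joke with picture</span><div class="thumb"><img src="a.jpg"></div></div>
<div class="post"><h2>carol</h2><span>long joke</span><span class="more">查看全文</span><div class="thumb"></div></div>
"""

POST_RE = re.compile(
    r'<div class="post">.*?<h2>(.*?)</h2>'     # author
    r'.*?<span>(.*?)</span>'                   # post text
    r'(.*?)<div class="thumb">(.*?)</div>',    # tail region, then image region
    re.S,  # let "." match newlines, since a post spans several lines
)


def parse_posts(html):
    """Extract posts, testing the captured regions for optional elements."""
    posts = []
    for author, text, tail, thumb in POST_RE.findall(html):
        posts.append({
            "author": author.strip(),              # strip spaces and newlines
            "text": text.strip(),
            "truncated": "查看全文" in tail,        # "read full text" link present
            "has_image": re.search("img", thumb) is not None,
        })
    return posts


posts = parse_posts(SAMPLE_HTML)
text_only = [p for p in posts if not p["has_image"]]
for post in text_only:
    print(post["author"], post["text"], post["truncated"])
```

A post flagged as `truncated` is the case from point c: its full text would have to be fetched from the post's own page.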

3. Basic code:

# -*- coding: utf-8 -*-
import re
import urllib.error
import urllib.request

__author__ = "muzp"

page = 2
url = 'https://www.qiushibaike.com/hot/page/' + str(page)
# Header value only: the "User-Agent" name belongs in the dict key below.
user_agent = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) '
              'AppleWebKit/537.36 (KHTML, like Gecko) '
              'Chrome/56.0.2924.87 Safari/537.36')
headers = {'User-Agent': user_agent}

try:
    request = urllib.request.Request(url, headers=headers)
    response = urllib.request.urlopen(request)
    content = response.read().decode("utf-8")
    # Groups: 0 author, 1 post link, 2 text, 3 tail after the text,
    # 4 image region, 5 like count.
    # "图片或gif" ("image or gif") is a literal HTML comment on the page.
    pattern = re.compile('''<div class="author clearfix">.*?<h2>(.*?)</h2>''' +
                         '''.*?<a href="(.*?)"''' +
                         '''.*?<span>(.*?)</span>''' +
                         '''(.*?)</div>''' +
                         '''.*?<!-- 图片或gif -->(.*?)<div class="stats">''' +
                         '''.*?<i.*?number">(.*?)</i>''', re.S)
    items = re.findall(pattern, content)
    for item in items:
        haveImg = re.search("img", item[4])
        # "查看全文" ("read full text") marks a post truncated on the list page.
        havere = re.search("查看全文", item[3])
        temp = ""
        if havere:
            # Fetch the post's own page to get the full text.
            url1 = "https://www.qiushibaike.com" + item[1]
            print(url1)
            request1 = urllib.request.Request(url1, headers=headers)
            response1 = urllib.request.urlopen(request1)
            content1 = response1.read().decode("utf-8")
            pattern1 = re.compile('<div class="content">(.*?)</div>(.*?)</div>', re.S)
            items1 = re.findall(pattern1, content1)
            for item1 in items1:
                haveImg1 = re.search("img", item1[1])
                if not haveImg1:
                    haveImg = None
                    temp = item1[0]
                else:
                    haveImg = True
        # Only posts without images are printed.
        if not haveImg:
            print("Author: " + item[0].strip())
            if not havere:
                print("Content: " + item[2].strip())
            else:
                print("Content: " + temp.strip())
            print("Likes: " + item[5].strip() + "\n")
except urllib.error.URLError as e:
    # HTTPError (a subclass of URLError) also carries a status code.
    if hasattr(e, "code"):
        print(e.code)
    if hasattr(e, 'reason'):
        print(e.reason)
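The `hasattr` checks in the `except` clause work because `urllib.error.HTTPError` is a subclass of `urllib.error.URLError` that adds a `code` attribute, while a plain `URLError` has only `reason`. A minimal sketch of that behaviour follows; `describe_error` is a hypothetical helper, and the two errors are constructed by hand so the example runs without network access.

```python
import urllib.error
from email.message import Message


def describe_error(error):
    """Build a short message from whichever attributes the error carries."""
    parts = []
    if hasattr(error, "code"):      # set on HTTPError only
        parts.append("code=%s" % error.code)
    if hasattr(error, "reason"):    # set on both URLError and HTTPError
        parts.append("reason=%s" % error.reason)
    return ", ".join(parts)


# Hand-built errors (assumed values) standing in for real failures.
http_error = urllib.error.HTTPError(
    "https://www.qiushibaike.com/hot/page/2", 404, "Not Found", Message(), None)
url_error = urllib.error.URLError("connection refused")

print(describe_error(http_error))
print(describe_error(url_error))
```

Catching `URLError` alone is therefore enough to handle both kinds of failure.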
