Qiushibaike (糗事百科) crawler for the site as of 2017-10-01, Python 3.x


Learned from, and improved upon, http://cuiqingcai.com/990.html

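1. First, download the basic page content

a. A plain page download with no custom headers fails with the following error:

       http.client.RemoteDisconnected: Remote end closed connection without response

The likely cause is that the request does not imitate a browser's headers.

b. You need the browser's User Agent string. To find it, type about:version into the browser's address bar.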
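The fix from step 1 can be sketched as follows. This is a minimal example, not a definitive implementation: the function names `build_request` and `fetch` and the `timeout` parameter are illustrative assumptions, and the User-Agent value should be replaced with the one your own browser reports.

```python
import urllib.request

# Assumption: this value is copied from about:version; use your own browser's string.
USER_AGENT = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) '
              'AppleWebKit/537.36 (KHTML, like Gecko) '
              'Chrome/56.0.2924.87 Safari/537.36')
BASE_URL = 'https://www.qiushibaike.com/hot/page/'


def build_request(page):
    """Return a Request for one listing page with a browser-like User-Agent."""
    return urllib.request.Request(BASE_URL + str(page),
                                  headers={'User-Agent': USER_AGENT})


def fetch(page, timeout=10):
    """Download one listing page and return it as text (performs network I/O)."""
    with urllib.request.urlopen(build_request(page), timeout=timeout) as response:
        return response.read().decode('utf-8')
```

Note that the header name goes in the dictionary key and only the value goes in the string; calling `fetch(2)` would then download page 2 of the hot list.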

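2. Page parser

a. Regular expressions are used here. Note that when a piece of markup may be absent, capture the span between the markers before and after it, then test the captured text.

b. The captured text has a lot of surrounding whitespace; a.strip() removes leading and trailing spaces and newlines.

c. Some posts are shown only in part. In that case, go to the post's own page and extract the text from there. Note that posts containing an image are still not displayed.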
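The points in step 2 can be illustrated on a small sample. This is a sketch under stated assumptions: `SAMPLE` is hypothetical, heavily simplified markup (the real page is more complex), and `PATTERN` and `parse` are illustrative names. It captures the span that may or may not hold an image, tests it afterwards, and strips whitespace from each field.

```python
import re

# Hypothetical, simplified markup; not the real page structure.
SAMPLE = '''
<div class="post">
  <h2>
    alice
  </h2>
  <span>

    first joke

  </span>
  <div class="thumb"><img src="a.jpg"></div>
  <div class="stats">12</div>
</div>
<div class="post">
  <h2> bob </h2>
  <span> second joke </span>
  <div class="stats">7</div>
</div>
'''

# re.S lets "." match newlines, so one pattern can span several lines.
# The third group captures whatever sits between the text and the stats,
# which is where an optional image would appear.
PATTERN = re.compile(
    r'<h2>(.*?)</h2>'
    r'.*?<span>(.*?)</span>'
    r'(.*?)<div class="stats">(.*?)</div>',
    re.S,
)


def parse(html):
    """Return (author, text, likes) for every post that has no image."""
    posts = []
    for author, text, between, likes in PATTERN.findall(html):
        if re.search('img', between):
            continue  # the optional span contained an image, so skip the post
        posts.append((author.strip(), text.strip(), likes.strip()))
    return posts


print(parse(SAMPLE))
```

Capturing the optional region as its own group keeps the main pattern from failing when the image is missing, which is why the check is done on the captured text rather than inside the pattern.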

3. Basic code:

# -*- coding: utf-8 -*-
import re
import urllib.error
import urllib.request

__author__ = "muzp"

page = 2
url = 'https://www.qiushibaike.com/hot/page/' + str(page)
# Only the header value goes here; the header name is the dict key below.
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'
headers = {'User-Agent': user_agent}

try:
    request = urllib.request.Request(url, headers=headers)
    response = urllib.request.urlopen(request)
    content = response.read().decode("utf-8")
    # Groups: 0 author, 1 link to the post, 2 text, 3 rest of the content block,
    # 4 image area, 5 like count. "图片或gif" is the "image or gif" comment in the page source.
    pattern = re.compile('<div class="author clearfix">.*?<h2>(.*?)</h2>'
                         '.*?<a href="(.*?)"'
                         '.*?<span>(.*?)</span>'
                         '(.*?)</div>'
                         '.*?<!-- 图片或gif -->(.*?)<div class="stats">'
                         '.*?<i.*?number">(.*?)</i>', re.S)
    items = re.findall(pattern, content)
    for item in items:
        haveImg = re.search("img", item[4])
        # "查看全文" ("view full text") marks a post that is truncated on the listing page.
        havere = re.search("查看全文", item[3])
        temp = ""
        if havere:
            # Fetch the post's own page to get the full text.
            url1 = "https://www.qiushibaike.com" + item[1]
            print(url1)
            request1 = urllib.request.Request(url1, headers=headers)
            response1 = urllib.request.urlopen(request1)
            content1 = response1.read().decode("utf-8")
            pattern1 = re.compile('<div class="content">(.*?)</div>(.*?)</div>', re.S)
            items1 = re.findall(pattern1, content1)
            for item1 in items1:
                haveImg1 = re.search("img", item1[1])
                if not haveImg1:
                    haveImg = None
                    temp = item1[0]
                else:
                    haveImg = True
        # Posts with an image are skipped.
        if not haveImg:
            print("Author: " + item[0].strip())
            if not havere:
                print("Content: " + item[2].strip())
            else:
                print("Content: " + temp.strip())
            print("Likes: " + item[5].strip() + "\n")
except urllib.error.URLError as e:
    if hasattr(e, "code"):
        print(e.code)
    if hasattr(e, 'reason'):
        print(e.reason)
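The error handling at the end of the listing checks for `code` and `reason` because `urllib.error.HTTPError` (a subclass of `URLError`) carries an HTTP status code, while a plain `URLError` carries only a reason. A minimal sketch of that check, with `describe_error` as a hypothetical helper name:

```python
import urllib.error


def describe_error(exc):
    """Build a short message from whichever attributes the exception has."""
    parts = []
    if hasattr(exc, "code"):      # present on HTTPError only
        parts.append("HTTP status %s" % exc.code)
    if hasattr(exc, "reason"):    # present on URLError and HTTPError
        parts.append("reason: %s" % exc.reason)
    return "; ".join(parts)


# Exceptions are constructed by hand here so the example needs no network access.
print(describe_error(urllib.error.URLError("timed out")))
print(describe_error(urllib.error.HTTPError(
    "https://www.qiushibaike.com/hot/page/2", 404, "Not Found", {}, None)))
```

In a crawler this helper would be called from the `except urllib.error.URLError as e:` branch.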


 

 
