A Simple Scraper for Qiushibaike (糗事百科)

Source: Internet | Published by: java程序员接私活经验 | Editor: 程序博客网 | Time: 2024/05/19 04:27

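I have only just started with web scraping and still have a lot to work through. There are many features I haven't mastered yet, but simply getting the data down was a great feeling, and I'll keep working at it. I have added comments at the places where I made mistakes. I'm still learning, so let's improve together.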

Along the way I learned a new function, which I'd like to share here:

strip()

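This is how it is described online:

Note that the argument is treated as a set of characters, not as a substring. Python removes every matching character from both ends of the string until it meets a character that is not in the set. For example: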

theString = 'saaaay yes no yaaaass'
print theString.strip('say')

Output (the space before "yes" and the space after "no" are kept, so the exact result is ' yes no '):
 yes no

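Here "both ends" means the two ends of the whole string theString, that is, the words saaaay and yaaaass. Characters on the outer edge that are one of "s", "a" or "y" are removed one by one, and stripping stops at the first character outside that set, so the "yes" in the middle is not affected.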

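If no argument is given, leading and trailing whitespace is removed: when the argument (called rm in the description I found) is empty, strip() deletes whitespace characters by default (including '\n', '\r', '\t', ' ').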
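The two behaviours described above can be checked with a short sketch. It is written for Python 3 (print is a function), and the variable names are only illustrative.

```python
the_string = 'saaaay yes no yaaaass'

# The argument is a set of characters, not a substring: every leading and
# trailing 's', 'a' or 'y' is removed until a different character is met.
stripped = the_string.strip('say')
print(repr(stripped))  # ' yes no '

# With no argument, strip() removes leading and trailing whitespace.
padded = '\n\t  hello world \r\n'
print(repr(padded.strip()))  # 'hello world'
```

Using repr() makes the surviving spaces visible, which a plain print would hide.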

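So to get rid of strings such as <br/>, one option is to first replace them with whitespace using re.sub and then remove the whitespace left at the two ends with strip().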
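A minimal sketch of that two-step cleanup, assuming Python 3 and a made-up fragment (the string raw is hypothetical, not taken from the real page):

```python
import re

# Hypothetical fragment, shaped like the content of one post
raw = '\n\n first line<br/>second line<br/>  \n'

# Step 1: turn every <br/> tag into a newline (a whitespace character).
# Step 2: strip() trims whatever whitespace is left at both ends.
text = re.sub(r'<br\s*/?>', '\n', raw).strip()
print(repr(text))  # 'first line\nsecond line'
```

The newline between the two lines survives because strip() only touches the ends of the string.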

import urllib2
import re

url = 'http://www.qiushibaike.com/hot/page/'

def get_url(url):
    # Send a browser-like User-Agent; remember the quotes around key and value
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
    req = urllib2.Request(url, headers=headers)
    response = urllib2.urlopen(req)
    html = response.read().decode('utf-8')
    return html

def get_info(url):
    html = get_url(url)
    re_info = r'h2\>(.+?)\</h2\>.+?\<div class="content"\>(.+?)\</div\>.+?\<i class="number"\>(.+?)\</i'
    info_compile = re.compile(re_info, re.S)  # re.S lets . match across lines
    info = re.findall(info_compile, html)
    # info is a list of tuples: (author, content, likes)
    for story in info:
        re_text = re.compile('<br/>')
        text = re.sub(re_text, '', story[1])
        # \t is a horizontal tab; do not forget the u prefix on the string
        print u"发布人：%s\t 赞：%s\n%s" % (story[0], story[2], text)

get_info(url)
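To see what findall returns without touching the network, the extraction step can be tried on a small hand-written page. This is only a sketch for Python 3: sample_html is invented markup that imitates the structure the pattern expects (the live site may differ), and the pattern is a slightly tidied variant of the one above.

```python
import re

# Hypothetical markup imitating the structure the pattern expects
sample_html = '''
<h2>Alice</h2>
<div class="content">
first joke<br/>with two lines
</div>
<i class="number">120</i>
<h2>Bob</h2>
<div class="content">
second joke
</div>
<i class="number">45</i>
'''

pattern = re.compile(
    r'<h2>(.+?)</h2>.+?<div class="content">(.+?)</div>.+?<i class="number">(.+?)</i>',
    re.S,  # let . match newlines so one match can span several lines
)

# With three groups in the pattern, findall returns a list of 3-tuples
stories = pattern.findall(sample_html)
for author, content, likes in stories:
    text = content.replace('<br/>', '\n').strip()
    print('%s (%s likes): %s' % (author, likes, text))
```

Each tuple holds the author, the raw content and the like count, in the order of the groups in the pattern.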

Next comes the looping part. All it takes is removing the get_info(url) call at the end and adding a few lines:

def get_all(pages):
    for i in range(1, pages):
        url = start_url + str(i)
        get_info(url)

get_all(3)
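One detail of this loop is easy to miss: range(1, pages) stops before pages, so get_all(3) visits pages 1 and 2 only. The sketch below (Python 3) shows the URLs that get built; page_urls is a hypothetical helper introduced just for this illustration.

```python
start_url = 'http://www.qiushibaike.com/hot/page/'

def page_urls(pages):
    # range(1, pages) yields 1 .. pages - 1, so the last page is excluded
    return [start_url + str(i) for i in range(1, pages)]

urls = page_urls(3)
print(urls)
```

If the page numbered pages should be included as well, range(1, pages + 1) would be the usual adjustment.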

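Next, the page number is added to the output and one joke is printed per press of Enter. Again it only takes a small change: raw_input is added to check what the user typed.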

def get_info(url, page):
    html = get_url(url)
    re_info = r'h2\>(.+?)\</h2\>.+?\<div class="content"\>(.+?)\</div\>.+?\<i class="number"\>(.+?)\</i'
    info_compile = re.compile(re_info, re.S)
    info = re.findall(info_compile, html)
    # Prompt: "Reading; press Enter to view, Q to quit"
    print u'正在读取，回车查看，Q退出'
    for story in info:
        key = raw_input()  # wait for Enter (or Q) before each joke
        if key == 'Q':
            return
        re_text = re.compile('<br/>')
        text = re.sub(re_text, '', story[1])
        print u"第%d页\t发布人：%s\t 赞：%s\n%s" % (page, story[0], story[2], text)

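If the module-level variable url is not renamed, get_all fails with the following error, because assigning to url inside the function makes it a local variable that is read before it has a value: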

Traceback (most recent call last):
  File "<ipython-input-17-70dae055a5ee>", line 38, in <module>
    get_all(3)
  File "<ipython-input-17-70dae055a5ee>", line 34, in get_all
    url = url + str(i)
UnboundLocalError: local variable 'url' referenced before assignment
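The error can be reproduced in isolation. In this Python 3 sketch, broken and fixed are hypothetical functions written only to show the scoping rule; they are not part of the scraper.

```python
url = 'http://www.qiushibaike.com/hot/page/'
start_url = url  # the same value under a distinct module-level name

def broken(i):
    # Assigning to url anywhere in the function makes url local to it,
    # so the read on the right-hand side happens before any assignment.
    url = url + str(i)
    return url

def fixed(i):
    # The local url is now built from a different, module-level name.
    url = start_url + str(i)
    return url

try:
    broken(1)
except UnboundLocalError as err:
    error_name = type(err).__name__
    print(error_name)

print(fixed(1))
```

Python decides at compile time which names are local to a function, which is why the failure occurs even though a module-level url exists.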

To tell the two apart, the module-level url is renamed to start_url. The complete code is:

import urllib2
import re

start_url = 'http://www.qiushibaike.com/hot/page/'

def get_url(url):
    # Send a browser-like User-Agent; remember the quotes around key and value
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
    req = urllib2.Request(url, headers=headers)
    response = urllib2.urlopen(req)
    html = response.read().decode('utf-8')
    return html

def get_info(url, page):
    html = get_url(url)
    re_info = r'h2\>(.+?)\</h2\>.+?\<div class="content"\>(.+?)\</div\>.+?\<i class="number"\>(.+?)\</i'
    info_compile = re.compile(re_info, re.S)
    info = re.findall(info_compile, html)
    # Prompt: "Reading; press Enter to view, Q to quit"
    print u'正在读取，回车查看，Q退出'
    for story in info:
        key = raw_input()  # wait for Enter (or Q) before each joke
        if key == 'Q':
            return
        re_text = re.compile('<br/>')
        text = re.sub(re_text, '', story[1])
        print u"第%d页\t发布人：%s\t 赞：%s\n%s" % (page, story[0], story[2], text)

def get_all(pages):
    for i in range(1, pages):
        url = start_url + str(i)
        get_info(url, i)

get_all(3)

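Argh. After writing this down I found a bug in the loop: Q has to be entered once on every page before the program stops. The quit check therefore also has to be made at the page level, before the next page is read. The functions change once more as follows; return is used to leave the whole process.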

def get_info(url, page):
    html = get_url(url)
    re_info = r'h2\>(.+?)\</h2\>.+?\<div class="content"\>(.+?)\</div\>.+?\<i class="number"\>(.+?)\</i'
    info_compile = re.compile(re_info, re.S)
    info = re.findall(info_compile, html)
    for story in info:
        key = raw_input()
        if key == 'Q':
            # A flag set here would be local to get_info and invisible to
            # get_all, so report the quit request through the return value
            return False
        re_text = re.compile('<br/>')
        text = re.sub(re_text, '', story[1])
        print u"第%d页\t发布人：%s\t 赞：%s\n%s" % (page, story[0], story[2], text)
    return True

def get_all(pages):
    # Prompt: "Reading; press Enter to view, Q to quit"
    print u'正在读取，回车查看，Q退出'
    for i in range(1, pages):
        url = start_url + str(i)
        if not get_info(url, i):
            return  # the user pressed Q: stop reading further pages
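The idea of stopping every page with a single Q can be tried without the network or a keyboard. A flag assigned inside one function is local to that function, so the caller never sees the change; returning a value is one simple way to pass the quit request upward. The sketch below assumes Python 3, and show_page, show_all, fake_pages and the scripted key presses are all hypothetical stand-ins for the real functions and input.

```python
def show_page(page, stories, read_key):
    """Print stories one by one; return False as soon as the user quits."""
    for story in stories:
        if read_key() == 'Q':
            return False
        print('page %d: %s' % (page, story))
    return True

def show_all(pages, read_key):
    """Walk through the pages in order and stop at the first quit request."""
    visited = []
    for page, stories in sorted(pages.items()):
        visited.append(page)
        if not show_page(page, stories, read_key):
            break
    return visited

# Scripted key presses stand in for keyboard input: Enter, Enter, then Q
keys = iter(['', '', 'Q'])
fake_pages = {1: ['joke a', 'joke b'], 2: ['joke c', 'joke d'], 3: ['joke e']}
visited = show_all(fake_pages, lambda: next(keys))
print(visited)
```

Here Q is typed at the start of page 2, so page 3 is never opened and no further key press is needed.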

0 0