scrapy爬取天涯帖子内容

来源：互联网发布：核聚变知乎编辑：程序博客网时间：2024/06/07 04:42

主要代码

import scrapyfrom scrapy import Selectorfrom first.items import TYItemimport reclass TianyaSpider(scrapy.Spider):    name = 'tianya'    allowed_domains=['tianya.cn']    # 对请求的返回进行处理的配置    meta = {        'dont_redirect': True,  # 禁止网页重定向        'handle_httpstatus_list': [301, 302]  # 对哪些异常返回进行处理    }    cookies={}    header={ 'User-Agent': 'Mozilla / 5.0(X11;Linux x86_64) AppleWebKit /537.36(KHTML, likeGecko) Chrome / 54.0.2840.71Safari / 537.36'}    start_urls=[                "http://bbs.tianya.cn/post-funinfo-7049670-1.shtml",                "http://bbs.tianya.cn/post-16-1632189-1.shtml",                "http://bbs.tianya.cn/post-16-1642094-1.shtml",                ]    def get_url(self,pre_url):        url_head = pre_url.split('-')[0:-1]        page = int(pre_url.split('-')[-1].split('.')[0])+1        return '-'.join(url_head)+'-'+str(page)+'.shtml'    def start_requests(self):        for url in self.start_urls:            yield scrapy.Request(url=url,callback=self.parse,headers=self.header,meta=self.meta)    reg = re.compile(r'时间：(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})')    def parse(self, response):        if response.status != 200:            return        item = TYItem()        item['title'] = Selector(response).xpath('//*[@id="post_head"]/h1/span[1]/span/text()').extract()[0]        content_div = Selector(response).xpath('//div[@class="atl-main"]/div[starts-with(@class,"atl-item")]')        for each in content_div:            try:                item['author'] = each.xpath('.//div[@class="atl-info"]/span/a/text()').extract()[0]            except:                item['author'] = ""            item['content'] = each.xpath('.//div[starts-with(@class,"bbs-content")]').xpath('string(.)').extract()[                0]. \                replace('\r\n', '').replace('\t', '').replace('\u3000', '').replace('\n', "")            try:                post_time = each.xpath('.//div[@class="atl-info"]/span[2]/text()').extract()[0]                item['post_time'] = re.findall(self.reg, post_time)[0]            except IndexError:                item['post_time'] = ""            try:                item['floor'] = each.xpath(                    './/div[starts-with(@class,"bbs-content")]/following-sibling::div[1]/span/text()').extract()[0][                                0:-1]            except IndexError:                item['floor'] = ''            yield item        next_url = self.get_url(response.url)        yield scrapy.Request(url=next_url, callback=self.parse, headers=self.header,                             cookies=self.cookies, meta=self.meta)

items.py

class TYItem(scrapy.Item):    title=scrapy.Field()#帖子标题    author = scrapy.Field()#楼主和层主    content=scrapy.Field()#内容,层中层的内容没有爬取    post_time=scrapy.Field()#发表时间    floor=scrapy.Field()#楼数

在爬取帖子内容content的时候,会出现比较奇怪的字符\u3000,或者\xa0等等,可以用replace替换成空格
天涯帖子的格式还是比较规范的,xpath选择起来不是太费事(特此一体,某数字论坛,年代久远,帖子内容首页,前2页,最后几页样式不一样,坑!)
分析天涯帖子的网址,比如
http://bbs.tianya.cn/post-funinfo-7049670-1.shtml
其中7049670可以认为是文章的id,1是页数,值得一提的是,天涯在处理超过页数的请求时,会返回301,
Status Code:301 Moved Permanently重定向到最后一页,我们可以根据response.status判断是否帖子已经结束
至于为什么在parse函数里面有try…exception…因为帖子的第一个content,就是楼主的主帖html样式和回复贴不太一样,根据同样的xpath选取不到,会跑出IndexError,没什么好的解决办法,只能这么的简单粗暴,捕获异样,把内容设置为空字符串
pipelines.py

import pymongofrom datetime import datetimeimport threadingclient=pymongo.MongoClient(host='127.0.0.1',port=27017)db=client['scrapy_data']['tianya']class TYPipeline(object):    def process_item(self,item ,spider):        with open('tianya4.txt','a') as f:            f.write(str(item))            f.write('\n')            # if item['post_time'] !='':            #     item['post_time']=datetime.strptime(item['post_time'],'%Y-%m-%d %H:%M:%S')            # if item['floor'] !='':            #     item['floor']=int(item['floor'])            # self.db.insert(dict(item))        # return item

注释掉的部分是存入mongodb的代码部分,把格式转化是为了在mongodb中便于查询.
讲道理的话可以根据上层的的网址,比如莲蓬鬼话http://bbs.tianya.cn/list-16-1.shtml
爬取到每个帖子的路径,href属性

<a href="/post-16-1632076-1.shtml" target="_blank">亲身经历病房灵异，直接崩溃无神论者！！<span class="art-ico art-ico-3" title="内有7张图片"></span></a>

下一页的超链接“`
/list.jsp?item=16&nextid=1478769822000

用不同的parse函数处理即可,这样理论上可以爬取一个板块所有的帖子,由于天涯板块还有子版块,所以仅仅这样只能爬取某个板块.处理帖子的parse函数可以通用,例子中start_urls三个网址属于两个不同的板块,最后爬取的文件基本是这样的(懒得截图,就这么看把)

{‘author’:” ”,
‘content’: ‘大家。…….家看看就行了。’,
‘floor’: “”,
‘post_time’: ”,
‘title’: ‘[天涯头……..’}
{‘author’: ‘小鉴定师大宝’,
‘content’: ‘@小鉴定师大宝 2016-08-21 ’
‘15:23:16大家好，我是一个DNA鉴定师……’,
‘floor’: ‘2’,
‘post_time’: ‘2016-08-21 15:24:10’,
‘title’: ‘[天涯头条]我是一……..’}
“`
感觉还有待完善

0 0