Scraping Tencent recruitment data with Scrapy and storing it in MongoDB (a quick walkthrough)


Define the target:

Data to scrape: job title, headcount, category, location, and publish date from the list page, plus the job responsibilities and requirements from each detail page.

1. Configure items.py

Now that the target is defined, start by writing items.py:

import scrapy


class TtspiderItem(scrapy.Item):
    mc = scrapy.Field()  # job title
    lb = scrapy.Field()  # category
    rs = scrapy.Field()  # headcount
    dd = scrapy.Field()  # location
    sj = scrapy.Field()  # publish date
    zz = scrapy.Field()  # responsibilities
    yq = scrapy.Field()  # requirements
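For reference, a scrapy.Item is filled in just like a dict. A minimal sketch of populating the item (the ttspider.items module path and the sample values here are assumptions for illustration):

from ttspider.items import TtspiderItem  # assumed project module path

item = TtspiderItem()
item['mc'] = 'Backend Engineer'  # hypothetical sample value
item['dd'] = 'Shenzhen'          # hypothetical sample value
print(dict(item))                # {'mc': 'Backend Engineer', 'dd': 'Shenzhen'}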

2. Configure settings.py (enable the pipeline and add the database settings)

ITEM_PIPELINES = {
    'ttspider.pipelines.TtspiderPipeline': 300,
}
LOG_LEVEL = "WARNING"

# MongoDB configuration
MONGODB_HOST = '127.0.0.1'
MONGODB_PORT = 27017
MONGODB_DBNAME = 'tencent'
MONGODB_COLLECTION = 'zhaopin'

3. Configure pipelines.py

# -*- coding: utf-8 -*-
from pymongo import MongoClient
from scrapy.conf import settings

"""
Notes:
    1. Import MongoClient from pymongo.
    2. Import the settings: from scrapy.conf import settings
        2.1 The MongoDB connection info is defined in settings.py, so it has to be imported here:
            MONGODB_HOST = '127.0.0.1'
            MONGODB_PORT = 27017    # must be an int, not a string
            MONGODB_DBNAME = 'tencent'
            MONGODB_COLLECTION = 'zhaopin'
        2.2 Alternatively, the values could be hard-coded directly in the pipeline.
"""


class TtspiderPipeline(object):
    def __init__(self):
        # connect to the MongoDB server
        con = MongoClient(settings.get('MONGODB_HOST'), settings.get('MONGODB_PORT'))
        # select the database
        db = con[settings.get('MONGODB_DBNAME')]
        # select the collection
        self.collection = db[settings.get('MONGODB_COLLECTION')]

    def process_item(self, item, spider):
        # insert the scraped item
        self.collection.insert(item)
        print(item)
        return item
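Note that from scrapy.conf import settings only exists in older Scrapy releases; newer versions removed it. A sketch of the same pipeline using the from_crawler hook instead (same setting names as above, and pymongo's insert_one in place of the deprecated insert):

from pymongo import MongoClient


class TtspiderPipeline(object):
    def __init__(self, host, port, dbname, collection):
        self.client = MongoClient(host, port)
        self.collection = self.client[dbname][collection]

    @classmethod
    def from_crawler(cls, crawler):
        # read the MongoDB settings from settings.py via the crawler
        return cls(
            host=crawler.settings.get('MONGODB_HOST', '127.0.0.1'),
            port=crawler.settings.getint('MONGODB_PORT', 27017),
            dbname=crawler.settings.get('MONGODB_DBNAME', 'tencent'),
            collection=crawler.settings.get('MONGODB_COLLECTION', 'zhaopin'),
        )

    def process_item(self, item, spider):
        # dict(item) works for both plain dicts and scrapy.Item instances
        self.collection.insert_one(dict(item))
        return item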

4. The spider

# -*- coding: utf-8 -*-
import scrapy


class TencentspiderSpider(scrapy.Spider):
    name = 'tencentSpider'
    allowed_domains = ['hr.tencent.com']
    start_urls = ['http://hr.tencent.com/position.php']

    def parse(self, response):
        # each job posting is a table row with class "even" or "odd"
        tr_list = response.xpath('//tr[@class="even" or @class="odd"]')
        for tr in tr_list:
            item = {}
            item['mc'] = tr.xpath('./td[1]/a/text()').extract_first()
            item['lb'] = tr.xpath('./td[2]/text()').extract_first()
            item['rs'] = tr.xpath('./td[3]/text()').extract_first()
            item['dd'] = tr.xpath('./td[4]/text()').extract_first()
            item['sj'] = tr.xpath('./td[5]/text()').extract_first()
            # follow the link in the first column to the detail page,
            # passing the partially filled item along in meta
            yield scrapy.Request(
                url='http://hr.tencent.com/' + tr.xpath('./td[1]/a/@href').extract_first(),
                callback=self.parse_detail,
                meta={'item': item}
            )
        # pagination: on the last page the "next" link is disabled (javascript:;)
        next_url = response.xpath('//a[@id="next"]/@href').extract_first()
        if next_url != 'javascript:;':
            next_url = 'http://hr.tencent.com/' + next_url
            yield scrapy.Request(
                next_url,
                callback=self.parse
            )

    def parse_detail(self, response):
        item = response.meta['item']
        ul_list = response.xpath('//ul[@class="squareli"]')
        # some postings only have one list (responsibilities but no requirements)
        if len(ul_list) > 1:
            item['zz'] = ul_list[0].xpath('./li/text()').extract()
            item['yq'] = ul_list[1].xpath('./li/text()').extract()
        else:
            item['zz'] = ul_list[0].xpath('./li/text()').extract()
            item['yq'] = None
        yield item
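The spider is normally started from the project directory with scrapy crawl tencentSpider. As a sketch, it can also be launched from a plain Python script, assuming the standard layout created by scrapy startproject:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# load settings.py (ITEM_PIPELINES, MongoDB config, LOG_LEVEL, ...)
process = CrawlerProcess(get_project_settings())
process.crawl('tencentSpider')  # the spider's name attribute
process.start()                 # blocks until the crawl finishes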

With that, the spider is basically complete. Cookies and proxy IPs are left out of this example; I'll practice those separately.


Then query the database to confirm the data was written.
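A quick sketch of checking the result with pymongo (same host, port, and names as in settings.py; count_documents requires pymongo 3.7+):

from pymongo import MongoClient

client = MongoClient('127.0.0.1', 27017)
collection = client['tencent']['zhaopin']

print(collection.count_documents({}))  # number of job postings stored
print(collection.find_one())           # one sample document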
