Scraping Lagou Job Listings with Scrapy

Source: Internet | Published by: 保险从业网络继续教育 | Editor: 程序博客网 | Date: 2024/05/22 04:31

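Many websites use a technique called Ajax (asynchronous loading). Such a page first shows you part of its content when it opens and then loads the rest gradually, which is known as partial loading. That is why, on many pages, the URL in the browser's address bar stays the same while the data on the page still updates. This affects how we crawl the data: we have to work out the real target URL before we can scrape anything.

The site we are going to crawl today is exactly this kind of site. The target URL is: https://www.lagou.com/zhaopin/

[Screenshot: the Lagou job listing page]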

I. Target URL

Following the introduction in the previous article, we can easily set up a crawler skeleton for the target URL above.

My spider file:

# -*- coding: utf-8 -*-
import scrapy


class PositionSpider(scrapy.Spider):
    name = "position"
    # allowed_domains = ["lagou.com"]
    start_urls = ['https://www.lagou.com/zhaopin/']

    def parse(self, response):
        # response.body is bytes, so the file is written in binary mode
        with open("lagou.html", 'wb') as file:
            file.write(response.body)
        print(response.body)

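Then open lagou.html. The page looks rather crude, but that does not matter as long as we can make out some of the information.
[Screenshot: lagou.html opened in a browser]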

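The job listings here match those shown in the screenshot above. Is scraping really that simple? Yes, the home page can be scraped the way we did before, but that is not the data we want. We want the job listings for specific filter conditions.

First, open the browser's developer tools.

Once we select a filter condition, the address above no longer gives us the data, and the URL in the address bar changes as well: https://www.lagou.com/jobs/list_?px=new&city=%E6%9D%AD%E5%B7%9E&district=%E8%A5%BF%E6%B9%96%E5%8C%BA#filterBox. Selecting further conditions after that, however, does not change it again.

[Screenshot: the filtered listing page and its address bar]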

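So it is natural to guess that these requests are sent through JavaScript's Ajax.

In the Network panel, try typing json into the filter box to filter the requests.
[Screenshot: the Network panel filtered by json]

We find two resources that look promising, and one of them even has position in its name. Right-click it and choose Open link in new tab to see what it returns.

[Screenshot: the JSON response opened in a new tab]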

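Comparing the content here with what the web page shows, the two match. We can now conclude that the URL we need is:
http://www.lagou.com/jobs/positionAjax.json. The following parameters can be appended to it:

gj=应届毕业生&xl=大专&jd=成长型&hy=移动互联网&px=new&city=上海

By changing these parameters we can fetch different job listings.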
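As an illustration of how such a query string can be assembled, here is a minimal sketch using only the Python 3 standard library. The helper name build_search_url is hypothetical, and the sketch only builds the URL without sending any request. The Chinese values are percent-encoded as UTF-8, which is the same encoding seen in the address bar URL above.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.lagou.com/jobs/positionAjax.json"


def build_search_url(**filters):
    """Append the given filters to the endpoint as a percent-encoded query string."""
    return BASE_URL + "?" + urlencode(filters)


# Hypothetical usage: newest jobs in the Xihu district of Hangzhou
url = build_search_url(px="new", city="杭州", district="西湖区")
print(url)
```

Changing or adding keyword arguments (for example gj, xl or hy) produces a URL for a different set of filters.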

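Note: the URL here is fairly simple. Some sites build far more complex URLs, which often contain something like id=... whose meaning is not obvious. In that case the possible values of the id may live in another file that you have to track down, or they may sit somewhere in the page source.

In another common case you will see something like time=..., which is a timestamp. It has to be built with the time function. In short, each case needs its own analysis.

import time
time.time()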
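A small sketch of how a timestamp parameter might be added to a query. The parameter name time and the choice between seconds and milliseconds are assumptions; the actual name and unit depend on the site being analysed.

```python
import time
from urllib.parse import urlencode


def with_timestamp(params, key="time", millis=True):
    """Return a copy of params with the current Unix timestamp added under key."""
    now = time.time()  # seconds since the epoch, as a float
    stamp = int(now * 1000) if millis else int(now)
    stamped = dict(params)
    stamped[key] = str(stamp)
    return stamped


# Hypothetical usage: add a millisecond timestamp to an existing filter
query = urlencode(with_timestamp({"px": "new"}))
print(query)
```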

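II. Writing the crawler

1. Crawl the first page

Let's look at the structure of the returned JSON data:

[Screenshot: structure of the returned JSON data]

We follow the hierarchy shown here when writing the code that parses the JSON data.

First, import the json module:

import json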
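To make the hierarchy concrete, here is a sketch that parses a hand-written sample. The sample values are invented for illustration; only the nesting content -> positionResult -> result and the field names mirror what the spider code in this article reads.

```python
import json

# Hypothetical sample that imitates the nesting content -> positionResult -> result
sample = """
{
  "content": {
    "pageSize": 15,
    "positionResult": {
      "totalCount": 2,
      "result": [
        {"city": "杭州", "positionName": "Python Engineer", "salary": "10k-20k"},
        {"city": "杭州", "positionName": "Data Analyst", "salary": "8k-15k"}
      ]
    }
  }
}
"""


def extract_positions(text):
    """Walk down the nested dictionaries and return the list of job records."""
    data = json.loads(text)
    return data["content"]["positionResult"]["result"]


positions = extract_positions(sample)
for job in positions:
    print(job["city"], job["positionName"], job["salary"])
```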

Spider file:

# -*- coding: utf-8 -*-
import json

import scrapy


class PositionSpider(scrapy.Spider):
    name = "position"
    # allowed_domains = ["lagou.com"]
    start_urls = ['https://www.lagou.com/jobs/positionAjax.json?px=new&city=%E6%9D%AD%E5%B7%9E&district=%E8%A5%BF%E6%B9%96%E5%8C%BA&needAddtionalResult=false']

    def parse(self, response):
        # Walk down the JSON hierarchy: content -> positionResult -> result
        jdict = json.loads(response.body)
        jcontent = jdict["content"]
        jposresult = jcontent["positionResult"]
        jresult = jposresult["result"]
        for each in jresult:
            print(each['city'])
            print(each['companyFullName'])
            print(each['companySize'])
            print(each['positionName'])
            print(each['secondType'])
            print(each['salary'])
            print('')

Run it and see the result:

[Screenshot: output of the crawl in the console]

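2. Crawl more pages

We can now crawl the first page of data. Next, let's look at the details of this request:

[Screenshot: request details in the developer tools]

The information shown by the browser's tools tells us that this is a POST request that submits its parameters as form data. We need to simulate this kind of request.

Override the Spider's start_requests method and use FormRequest to issue the POST requests. By changing the range of the loop we can download the data of any chosen range of pages. The code is as follows: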

# -*- coding: utf-8 -*-
import json

import scrapy


class PositionSpider(scrapy.Spider):
    name = "position"
    # allowed_domains = ["lagou.com"]
    city = '杭州'
    district = '西湖区'
    url = 'https://www.lagou.com/jobs/positionAjax.json'

    def start_requests(self):
        # The Host header is left out: it is derived from the request URL
        headers = {
            'Content-Type': 'application/x-www-form-urlencoded;charset=UTF-8',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'
        }
        # One POST request per page; change the range to fetch other pages.
        # Alternatively, collect the FormRequest objects in a list and return it.
        for num in range(1, 5):
            form_data = {'pn': str(num), 'city': self.city, 'district': self.district}
            yield scrapy.FormRequest(self.url, formdata=form_data, headers=headers, callback=self.parse)

    def parse(self, response):
        # Walk down the JSON hierarchy: content -> positionResult -> result
        jdict = json.loads(response.body)
        jcontent = jdict["content"]
        jposresult = jcontent["positionResult"]
        jresult = jposresult["result"]
        for each in jresult:
            print(each['city'])
            print(each['companyFullName'])
            print(each['companySize'])
            print(each['positionName'])
            print(each['secondType'])
            print(each['salary'])
            print('')

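Running the program, we successfully fetch all job listings on pages 1 to 4.

No screenshot of the data is provided here, because the data changes frequently. If you test it yourself, your data will certainly differ from mine.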
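To show what a form-encoded POST request looks like underneath, here is a sketch that uses only the Python 3 standard library instead of Scrapy. It builds the requests for pages 1 to 4 without sending them. The helper name build_page_request is hypothetical; the form fields pn, city and district are the ones used in the spider above.

```python
from urllib.parse import urlencode
from urllib.request import Request

URL = "https://www.lagou.com/jobs/positionAjax.json"


def build_page_request(page, city, district):
    """Build (but do not send) a form-encoded POST request for one result page."""
    form = {"pn": str(page), "city": city, "district": district}
    # urlencode percent-encodes the values, so the body is plain ASCII
    body = urlencode(form).encode("ascii")
    headers = {"Content-Type": "application/x-www-form-urlencoded;charset=UTF-8"}
    return Request(URL, data=body, headers=headers, method="POST")


requests_to_send = [build_page_request(n, "杭州", "西湖区") for n in range(1, 5)]
for req in requests_to_send:
    print(req.get_method(), req.data.decode("ascii"))
```

Each request differs only in the pn field, which is why a simple loop over page numbers is enough.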

3. Automatic paging

# -*- coding: utf-8 -*-
import json

import scrapy


class PositionSpider(scrapy.Spider):
    name = "position"
    # allowed_domains = ["lagou.com"]
    totalPageCount = 0
    curpage = 1
    city = '杭州'
    district = '西湖区'
    url = 'https://www.lagou.com/jobs/positionAjax.json'
    # Set a download delay
    # download_delay = 10

    def start_requests(self):
        # Request only the first page here; parse() requests the following ones
        return [scrapy.FormRequest(self.url, formdata={'pn': str(self.curpage), 'city': self.city, 'district': self.district}, callback=self.parse)]

    def parse(self, response):
        print(str(self.curpage) + "page")
        jdict = json.loads(response.body)
        jcontent = jdict['content']
        jposresult = jcontent["positionResult"]
        pageSize = jcontent["pageSize"]
        jresult = jposresult["result"]
        # Round up so that a partial last page is counted exactly once
        self.totalPageCount = (jposresult['totalCount'] + pageSize - 1) // pageSize
        for each in jresult:
            print(each['city'])
            print(each['companyFullName'])
            print(each['companySize'])
            print(each['positionName'])
            print(each['secondType'])
            print(each['salary'])
            print('')
        # Keep requesting the next page until the last one has been fetched
        if self.curpage < self.totalPageCount:
            self.curpage += 1
            yield scrapy.FormRequest(self.url, formdata={'pn': str(self.curpage), 'city': self.city, 'district': self.district}, callback=self.parse)
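The paging logic comes down to a ceiling division and a stop condition. The sketch below isolates both; the function names are hypothetical. Note that computing the page count as totalCount / pageSize + 1 overcounts by one whenever totalCount is an exact multiple of pageSize, which is why the ceiling form is used here.

```python
def total_pages(total_count, page_size):
    """Number of pages needed to show total_count items, page_size per page."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    return (total_count + page_size - 1) // page_size  # ceiling division


def next_page(current, total_count, page_size):
    """Return the next page number, or None when current is the last page."""
    if current < total_pages(total_count, page_size):
        return current + 1
    return None


print(total_pages(45, 15), total_pages(46, 15))
print(next_page(2, 45, 15), next_page(3, 45, 15))
```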

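Finally, if you want to save the data, refer to the previous article.

A few measures against anti-crawler mechanisms are also applied here, such as using a pool of user agents. The user agent used by a request can be checked as follows.

[Screenshot: launching the Scrapy shell]

Once the shell has loaded, you get a local response variable holding the response data, as well as a request variable. Entering response.body prints the body of the response, and entering request.headers shows the request headers.

[Screenshot: request.headers output in the shell]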

Ways to deal with anti-crawler measures:

Set download_delay
Disable cookies
Use a pool of user agents
Use a pool of IP addresses
Crawl in a distributed fashion
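The first three items can be sketched as follows. DOWNLOAD_DELAY and COOKIES_ENABLED are standard Scrapy settings; the middleware class, its name and the two user agent strings are illustrative assumptions rather than the code used in this project.

```python
import random

# Settings that would go in a Scrapy project's settings.py
DOWNLOAD_DELAY = 2        # seconds to wait between requests to the same site
COOKIES_ENABLED = False   # do not store or send cookies

# Hypothetical pool; in practice it would hold many real browser strings
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Firefox/102.0",
]


class RandomUserAgentMiddleware:
    """Downloader middleware that picks a random User-Agent for every request."""

    def process_request(self, request, spider=None):
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
        return None  # let Scrapy continue processing the request
```

To take effect, such a middleware has to be registered in the DOWNLOADER_MIDDLEWARES setting, for example under a module path like myproject.middlewares.RandomUserAgentMiddleware (a hypothetical path) with a priority such as 543.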

The source code for this project has been uploaded to GitHub.
