[Python] Scraping Lagou (拉勾网) for Python job postings nationwide

Source: Internet  Editor: 程序博客网  Date: 2024/04/27 15:39

Analysis: finding the target URL

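Open Firebug and switch to the XHR panel.
On the Lagou home page, search for the keyword python and set the region to nationwide (全国).
The XHR panel then shows the request (a screenshot in the original post).
The target URL is: http://www.lagou.com/jobs/positionAjax.json?px=default
The data sent with the POST is: first, kd, pn (kd carries the search keyword, python here).
Clicking "next page" changes pn to 2, so pn is the current page number.
The search results show 30 pages in total, so the request for every page can be built by varying pn from 1 to 30.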
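
The paging scheme above can be sketched as a small helper that builds one POST payload per page. This is only an illustration: the name `build_payloads` is hypothetical, and it assumes, as the article's own requests do, that `first` stays `'true'` on every page.

```python
def build_payloads(keyword, total_pages):
    """Return one POST payload dict per result page (pages are 1-based)."""
    return [
        {'first': 'true', 'kd': keyword, 'pn': str(page)}
        for page in range(1, total_pages + 1)
    ]


# 30 pages for the keyword "python", as observed in the search results
payloads = build_payloads('python', 30)
print(len(payloads))
print(payloads[0])
print(payloads[-1])
```

Each dict can be passed as the `data` argument of a POST to the target URL; only `pn` differs between pages.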

Sending the request with requests and analyzing the returned data

# Python 2 (note the print statement)
import requests

post_data = {'first': 'true', 'kd': 'python', 'pn': '1'}
r = requests.post("http://www.lagou.com/jobs/positionAjax.json?px=default", data=post_data)
print r.text

The returned JSON contains the fields we want:
- positionName
- companyShortName
- city
- workYear
- positionAdvantage
- salary
- education
- financeStage
The key that separates one posting from the next is positionId.
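
As an alternative to scanning the raw text, the response can be decoded with the standard `json` module and the fields read by name. The sketch below is not the article's method. Because the exact nesting of the real response is not shown here, it does not assume any particular layout: it simply walks the decoded data and collects every object that has a `positionId` key. The helper names and the sample data (including its wrapper keys) are made up for illustration.

```python
import json

FIELDS = ('positionName', 'companyShortName', 'city', 'workYear',
          'positionAdvantage', 'salary', 'education', 'financeStage')


def find_positions(node):
    """Recursively collect every dict that carries a 'positionId' key."""
    found = []
    if isinstance(node, dict):
        if 'positionId' in node:
            found.append(node)
        else:
            for value in node.values():
                found.extend(find_positions(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_positions(item))
    return found


def extract_records(json_text):
    """Decode json_text and return one dict of the wanted fields per posting."""
    data = json.loads(json_text)
    return [
        dict((name, position.get(name, '')) for name in FIELDS)
        for position in find_positions(data)
    ]


# Made-up sample: the wrapper keys and all values are hypothetical
sample = json.dumps({'wrapper': {'items': [
    {'positionId': 1, 'positionName': 'Python Developer',
     'companyShortName': 'ExampleCo', 'city': 'Beijing',
     'workYear': '1-3', 'positionAdvantage': 'flexible hours',
     'salary': '10k-20k', 'education': 'Bachelor', 'financeStage': 'A'},
    {'positionId': 2, 'positionName': 'Backend Engineer', 'city': 'Shanghai'},
]}})

records = extract_records(sample)
print(len(records))
print(records[0]['city'])
```

Fields missing from a posting come back as empty strings, so a layout change shows up as blank values rather than as text sliced from the wrong place.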

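Building the requests for all pages and starting the scrape

The site changes frequently, so the extraction rules below have to be updated whenever the response format changes.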
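
One way to notice such a change early is to check each extracted record for missing or empty fields before writing it out. This is a minimal sketch, not part of the original scripts; `missing_fields` is a hypothetical helper and the records are made-up test data.

```python
EXPECTED_FIELDS = ('positionName', 'companyShortName', 'city', 'workYear',
                   'positionAdvantage', 'salary', 'education', 'financeStage')


def missing_fields(record):
    """Return the expected field names that are absent or empty in record."""
    return [name for name in EXPECTED_FIELDS if not record.get(name)]


# Made-up records: one complete, one with 'city' absent and 'salary' empty
complete = dict((name, 'x') for name in EXPECTED_FIELDS)
partial = dict(complete, salary='')
del partial['city']

print(missing_fields(complete))
print(missing_fields(partial))
```

If many records suddenly report the same missing field, the response format has probably changed and the extraction rules need another look.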

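# spider.py (Python 2)
# -*- coding: utf-8 -*-
import sys

import requests

import tools

# Python 2 workaround: make UTF-8 the default encoding so that the unicode
# response text can be written to the byte-mode file opened in tools.py
reload(sys)
sys.setdefaultencoding('utf8')

# Request every result page (30 in total) and hand each response to tools.py
for i in range(1, 31):
    post_data = {'first': 'true', 'kd': 'python', 'pn': i}
    r = requests.post("http://www.lagou.com/jobs/positionAjax.json?px=default", data=post_data)
    html = r.text
    tools.fetch_content(html)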

# tools.py (Python 2)
# -*- coding: utf-8 -*-

# Output file; opened once when the module is imported
f = open('data.txt', 'wb')


def write(positionName, companyShortName, city, workYear,
          positionAdvantage, salary, education, financeStage):
    # One field per line, then a blank line between postings
    for field in (positionName, companyShortName, city, workYear,
                  positionAdvantage, salary, education, financeStage):
        f.write(field)
        f.write('\r\n')
    f.write('\r\n')


def fj_function(url_content, beg_str, end_str, lengths):
    # Find beg_str, then return (remaining text, value up to end_str).
    # If beg_str is not found, the text is returned unchanged with an empty value.
    str_len = len(beg_str)
    start = url_content.find(beg_str, 0, lengths)
    obj = ''
    if start >= 0:
        content = url_content[start + str_len:lengths]
        if end_str != '':
            end = content.find(end_str, 0, lengths)
            obj = content[0:end]
            content = content[end:lengths]
    else:
        content = url_content
    return content, obj


def fetch_content(url_content):
    lengths = len(url_content)
    while 1:
        # Each posting starts at a "positionId" key
        beg_str = '"positionId"'
        str_len = len(beg_str)
        start = url_content.find(beg_str, 0, lengths)
        if start >= 0:
            url_content = url_content[start + str_len:lengths]
            end_str = '"positionId"'
            end = url_content.find(end_str, 0, lengths)
            if end < 0:
                # Last posting: no following marker, so take the rest
                end = len(url_content)
            obj_content = url_content[:end]
            # Pick out the individual fields, in the order they appear
            obj_content, positionName = fj_function(obj_content, '"positionName":"', '"', lengths)
            obj_content, companyShortName = fj_function(obj_content, 'companyShortName":"', '"', lengths)
            obj_content, city = fj_function(obj_content, '"city":"', '"', lengths)
            obj_content, workYear = fj_function(obj_content, 'workYear":"', '"', lengths)
            obj_content, positionAdvantage = fj_function(obj_content, 'positionAdvantage":"', '"', lengths)
            obj_content, salary = fj_function(obj_content, 'salary":"', '"', lengths)
            obj_content, education = fj_function(obj_content, 'education":"', '"', lengths)
            obj_content, financeStage = fj_function(obj_content, 'financeStage":"', '"', lengths)
            # Write the posting to the file
            write(positionName, companyShortName, city, workYear,
                  positionAdvantage, salary, education, financeStage)
        else:
            break
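
The marker-scanning idea behind fj_function can be seen in isolation with a simplified helper. This sketch is an illustration rather than a drop-in replacement: `take_between` is a hypothetical name, and the fragment is made-up data shaped like one posting.

```python
def take_between(text, begin, end):
    """Return (rest, value): value is the text between begin and end,
    rest is what remains of text from the end marker onward.
    If either marker is missing, text is returned unchanged with ''."""
    start = text.find(begin)
    if start < 0:
        return text, ''
    start += len(begin)
    stop = text.find(end, start)
    if stop < 0:
        return text, ''
    return text[stop:], text[start:stop]


# Made-up fragment of one posting
fragment = '"positionName":"Python Dev","city":"Beijing","salary":"10k-20k"'

rest, name = take_between(fragment, '"positionName":"', '"')
rest, city = take_between(rest, '"city":"', '"')
rest, salary = take_between(rest, '"salary":"', '"')
print(name)
print(city)
print(salary)

# The scan only moves forward, so a field requested after a later one is missed
_, late_city = take_between(rest, '"city":"', '"')
print(repr(late_city))
```

Because each call continues from where the previous one stopped, the fields must be requested in the order they appear in the response. A reordering on the site's side is enough to break the extraction.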

Final result

1 0