Scraping Lagou (拉勾网) job data with Python and saving it to a MySQL database
Source: Internet   Published by: java专业技能   Editor: 程序博客网   Date: 2024/05/16 07:23
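Environment: Python 3
Packages: requests, json, pymysql
Approach: 1. Use Chrome DevTools (F12) to find the endpoint that Lagou's job search calls, and inspect each parameter of the request
2. Imitate a browser and send the same request to the Lagou endpoint
3. Parse the returned JSON and extract the list of job postings from it
4. Use the pymysql module for the database operations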
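Step 2 depends on sending the city name percent-encoded in the query string. The following is a minimal sketch of how the URL and form body could be assembled with the standard library, without touching the network. The helper names `build_search_url` and `build_form_data` are hypothetical; the endpoint and parameter names are the ones used in this post's request.

```python
from urllib.parse import urlencode

# Endpoint used in this post; the site may have changed it since.
BASE_URL = "https://www.lagou.com/jobs/positionAjax.json"


def build_search_url(city):
    """Return the search URL with the city name percent-encoded as UTF-8."""
    query = urlencode({
        "city": city,
        "needAddtionalResult": "false",  # spelling as the endpoint expects it
        "isSchoolJob": 0,
    })
    return BASE_URL + "?" + query


def build_form_data(page, keyword):
    """Return the POST body: page number and search keyword."""
    return {"first": "true", "pn": page, "kd": keyword}


url = build_search_url("北京")
form = build_form_data(1, "hadoop")
print(url)
print(form)
```

The resulting `url` and `form` can be passed to `requests.post(url, data=form, headers=...)`, which is the call the full script makes.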
#coding:utf-8
import json
import random
import urllib.parse  # a plain "import urllib" does not reliably expose urllib.parse

import pymysql
import requests

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1"
]

# Pick a random browser User-Agent to send with the request
def get_random_userAgent():
    userAgent = random.choice(USER_AGENTS)
    return userAgent

# Call the Lagou endpoint and return the raw JSON text of the response
def get_job_all_json(pn=1, kd='python', city='上海'):
    headers = {
        'User-Agent': get_random_userAgent(),
        'Referer': 'https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput=',
        'Cookie': 'JSESSIONID=ABAAABAAADEAAFID589F81DDA4B135EA73D59382D94193B; _gat=1; user_trace_token=20170918201032-5e70e65e-9c6a-11e7-9196-5254005c3644; PRE_UTM=; PRE_HOST=; PRE_SITE=; PRE_LAND=https%3A%2F%2Fwww.lagou.com%2F; LGUID=20170918201032-5e70e916-9c6a-11e7-9196-5254005c3644; index_location_city=%E5%8C%97%E4%BA%AC; TG-TRACK-CODE=index_search; _gid=GA1.2.1042499452.1505736518; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1505736518; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1505736559; _ga=GA1.2.2038003268.1505736518; LGSID=20170918201032-5e70e7a7-9c6a-11e7-9196-5254005c3644; LGRID=20170918201112-76a14753-9c6a-11e7-9196-5254005c3644; SEARCH_ID=23d97ca16048467a93241983f07b9f32'
    }
    data = {
        'first': 'true',
        'pn': pn,  # page number
        'kd': kd   # search keyword
    }
    # Percent-encode the city name so it is safe to put in the query string
    city = urllib.parse.quote(city)
    res = requests.post(
        'https://www.lagou.com/jobs/positionAjax.json?'
        'city={0}&'
        'needAddtionalResult=false&'
        'isSchoolJob=0'.format(city),
        data=data, headers=headers)
    print('status_code:', res.status_code)
    print('text:', res.text)
    return res.text

# Open a database connection
def get_db_conn():
    # password/database are the current keyword names for the older passwd/db
    conn = pymysql.connect(host='localhost', user='root', password='admin',
                           database='lagou', port=3306, charset='utf8')
    return conn

# Insert the jobs into the database, then close the connection
def insert_into_db(conn, jobs):
    cur = conn.cursor()
    #cur.execute('truncate spider')  # clear the existing rows
    # Let the driver quote the values instead of concatenating them into the
    # statement; this also stores None as NULL and survives quotes in the data.
    sql = ('insert into spider(positionName, salary, education, companyFullName, '
           'workYear, companyLabelList, companySize) '
           'values(%s, %s, %s, %s, %s, %s, %s)')
    for job in jobs:
        positionName = job['positionName']
        salary = job['salary']
        education = job['education']
        companyFullName = job['companyFullName']
        workYear = job['workYear']
        # The label list is stored as text, e.g. [label1, label2]
        companyLabelList = str(job['companyLabelList']).replace('\'', '')
        companySize = job['companySize']
        params = (positionName, salary, education, companyFullName,
                  workYear, companyLabelList, companySize)
        print('sql params:', params)
        cur.execute(sql, params)
    conn.commit()
    cur.close()
    conn.close()

# Extract the list of job postings from the parsed response.
# json.loads has already produced ordinary Python objects, so the list can be
# returned as is; converting it to a string and patching the quotes breaks as
# soon as a value contains a quote character or a boolean.
def get_job_result_json(jsonString):
    job_result = jsonString['content']['positionResult']['result']  # list
    return job_result

if __name__ == '__main__':
    job = 'hadoop'
    city = '北京'
    for i in range(1, 11):
        pn = i
        jsonString = json.loads(get_job_all_json(pn, job, city))
        jobs = get_job_result_json(jsonString)
        conn = get_db_conn()
        insert_into_db(conn, jobs)
    print("done ...")
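Step 3 of the approach is about getting the job list out of the response. Here is a small sketch, using an invented sample payload shaped like the `content` / `positionResult` / `result` structure the script reads, of why `json.loads` on the response text is sufficient and why rebuilding JSON from `str()` of the parsed list is fragile. The sample values and the helper names `extract_jobs` and `fragile_roundtrip` are assumptions made for illustration.

```python
import json

# Invented sample; real responses carry many more fields per posting.
SAMPLE_RESPONSE = '''
{"content": {"positionResult": {"result": [
  {"positionName": "Python Engineer", "salary": "15k-25k", "education": null,
   "companyLabelList": ["Women's day gift", "Stock options"]}
]}}}
'''


def extract_jobs(response_text):
    """Parse the response text once and return the list of postings."""
    payload = json.loads(response_text)
    return payload["content"]["positionResult"]["result"]


def fragile_roundtrip(jobs):
    """Turn the list into a string, patch the quotes, and parse it again."""
    patched = str(jobs).replace("'", '"').replace("None", '"None"')
    return json.loads(patched)


jobs = extract_jobs(SAMPLE_RESPONSE)
print(jobs[0]["positionName"], jobs[0]["education"])

try:
    fragile_roundtrip(jobs)
    roundtrip_ok = True
except ValueError:
    # The apostrophe in "Women's" becomes a stray double quote.
    roundtrip_ok = False
print("round trip succeeded:", roundtrip_ok)
```

Parsing once keeps JSON `null` as Python `None`, which a database driver can then store as `NULL` rather than as the text "None".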
The data in the database looks like this (screenshot):
Database table structure:
/*
Navicat MySQL Data Transfer

Source Server         : mysql
Source Server Version : 50022
Source Host           : localhost:3306
Source Database       : lagou

Target Server Type    : MYSQL
Target Server Version : 50022
File Encoding         : 65001

Date: 2017-10-05 10:34:57
*/

SET FOREIGN_KEY_CHECKS=0;

-- ----------------------------
-- Table structure for spider
-- ----------------------------
DROP TABLE IF EXISTS `spider`;
CREATE TABLE `spider` (
  `id` int(11) NOT NULL auto_increment,
  `positionName` varchar(255) collate utf8_bin default NULL,
  `salary` varchar(255) collate utf8_bin default NULL,
  `education` varchar(255) collate utf8_bin default NULL,
  `companyFullName` varchar(255) collate utf8_bin default NULL,
  `workYear` varchar(255) collate utf8_bin default NULL,
  `companyLabelList` varchar(255) collate utf8_bin default NULL,
  `companySize` varchar(255) collate utf8_bin default NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;
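Rows for this table are best written with a parameterized statement, so that quotes in the scraped text cannot break or alter the SQL. The sketch below swaps in the standard library's `sqlite3` with an in-memory table so that it runs without a MySQL server; that substitution is an assumption made only for the demonstration. With pymysql the same pattern applies, but the placeholder is `%s` instead of `?`. The helper names `job_to_row` and `insert_jobs` and the sample job are hypothetical.

```python
import sqlite3

COLUMNS = ("positionName", "salary", "education", "companyFullName",
           "workYear", "companyLabelList", "companySize")


def job_to_row(job):
    """Map one job dict to a tuple in column order; lists become text."""
    row = []
    for col in COLUMNS:
        value = job.get(col)
        if isinstance(value, list):
            value = ", ".join(str(v) for v in value)
        row.append(value)
    return tuple(row)


def insert_jobs(conn, jobs, placeholder="?"):
    """Insert all jobs with one parameterized statement (use '%s' for pymysql)."""
    marks = ", ".join([placeholder] * len(COLUMNS))
    sql = "INSERT INTO spider ({}) VALUES ({})".format(", ".join(COLUMNS), marks)
    cur = conn.cursor()
    cur.executemany(sql, [job_to_row(job) for job in jobs])
    conn.commit()
    cur.close()


# In-memory stand-in for the spider table, with the same column names.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE spider (id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "positionName TEXT, salary TEXT, education TEXT, companyFullName TEXT, "
    "workYear TEXT, companyLabelList TEXT, companySize TEXT)"
)

sample_jobs = [{
    "positionName": "O'Reilly data engineer",  # the quote is stored unchanged
    "salary": "20k-30k",
    "education": None,                         # stored as NULL
    "companyFullName": "Example Co.",
    "workYear": "3-5",
    "companyLabelList": ["Stock options", "Flexible hours"],
    "companySize": "50-150",
}]
insert_jobs(conn, sample_jobs)
rows = conn.execute(
    "SELECT positionName, education, companyLabelList FROM spider"
).fetchall()
print(rows)
```

Passing the values separately from the statement leaves quoting to the driver, which is why the apostrophe in the sample position name needs no manual escaping.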