Python in Practice Series (2): Scraping Data with Python (Part 1)
Source: Internet  Editor: 程序博客网  Date: 2024/06/06 13:25
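Goal of this series:
The plan is to start from the GitHub project huatian-funny, use Python to scrape data on registered users of Huatian (花田, love.163.com) as a small experiment, and then upload my modified version of huatian-funny.
The huatian-funny project page gives the following instructions: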
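1. Preparation
Requirements:
requests>=2.7.0, pymongo>=3.2.2, matplotlib>=1.4.3, Pillow>=3.2.0
(1) Install requests (2.7.0 or later)
requests is an HTTP client library for Python.
It can be installed from source, or with pip or easy_install:
>pip install requests
The version installed here was 2.10.0.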
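Version 2.10.0 is newer than the required 2.7.0, even though "2.10" sorts before "2.7" as plain text, so a minimum-version check has to compare the numeric parts. A minimal sketch, assuming plain dotted numeric versions; the helper names `parse_version` and `meets_minimum` are made up for this illustration:

```python
def parse_version(text):
    """Turn a dotted version such as '2.10.0' into a tuple of ints: (2, 10, 0)."""
    return tuple(int(part) for part in text.split('.'))


def meets_minimum(installed, minimum):
    """Return True if the installed version is at least the minimum version."""
    return parse_version(installed) >= parse_version(minimum)


# The installed requests version checked against the project's requirement
print(meets_minimum('2.10.0', '2.7.0'))   # True
# Comparing the raw strings gives the wrong answer
print('2.10.0' >= '2.7.0')                # False
```

Version strings with suffixes such as "rc1" are not handled by this sketch.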
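(2) Install matplotlib
See section 4, "Installing matplotlib", of Python in Practice: Preparation (1). It is not repeated here.
(3) Install Pillow
>pip install pillow
(4) Install MongoDB
It can be downloaded from the MongoDB download page.
Once the download finishes, run mongodb-win32-x86_64-2008plus-ssl-3.2.6-signed.msi and accept the defaults through to the end.
MongoDB is installed to C:\Program Files\MongoDB by default.
On Windows, MongoDB's default data directory is C:\data\db. Create this directory before starting the server.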
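The data directory can also be created from Python instead of by hand. A minimal sketch using only the standard library; `ensure_data_dir` is a name invented here:

```python
import os


def ensure_data_dir(path):
    """Create the MongoDB data directory if it does not exist yet and return it."""
    os.makedirs(path, exist_ok=True)  # no error if the directory is already there
    return path


# Usage on Windows, for the default data directory:
# ensure_data_dir(r'C:\data\db')
```

Intermediate directories (here C:\data) are created as needed.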
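· Start the mongod service: double-click mongod.exe, or start it with extra options:
mongod.exe --journal --rest
To use a directory other than the default C:\data\db, pass the --dbpath option when starting the server, for example:
mongod.exe --dbpath yourpath
Startup options include: --dbpath, the database directory; --logpath, the log file path; --journal, which enables journaling; --rest, which allows clients to access the MongoDB server through the REST API.
After startup, the command window looks like the figure below:
The last line shows that the server is waiting for connections.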
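Whether mongod is really listening can also be checked from Python by trying to open a TCP connection to its port. A minimal sketch using only the standard library; `is_port_open` is a name invented here, and 27017 is MongoDB's default port:

```python
import socket


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host not reachable
        return False


# Prints True only while mongod is running locally on the default port
print(is_port_open('localhost', 27017))
```

This only shows that something accepts connections on the port. It does not verify that the listener is MongoDB.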
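· Connect
Double-click mongo.exe, or open another command prompt and run mongo.exe.
This connects to the database, as shown in the figure.
Some commands you can run (search for more as needed):
show dbs
show databases
Both commands list all databases.
Back in the mongod.exe window, the connection count is now 1, as shown in the figure.
(5) Install pymongo
The scraped data is stored in MongoDB through pymongo, the Python driver for MongoDB.
Install pymongo:
>pip install pymongo
Upgrade pymongo:
>pip install --upgrade pymongo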
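Once pymongo is installed, storing a scraped record comes down to an upsert keyed on the user id. The sketch below is only an illustration under assumptions: the database name 'huatian' and collection name 'users' are placeholders, and the project's real collection object comes from its own extension.py, which is not shown in this post.

```python
def get_collection(host='localhost', port=27017):
    """Open a collection on a local mongod.

    The database name 'huatian' and the collection name 'users' are
    placeholders chosen for this sketch, not the names used by the project.
    """
    from pymongo import MongoClient  # imported here so save_user works without it
    client = MongoClient(host, port)
    return client['huatian']['users']


def save_user(collection, user):
    """Insert the user document, or replace the stored one with the same id."""
    collection.replace_one({'id': user['id']}, user, upsert=True)


# Usage, with mongod running:
# users = get_collection()
# save_user(users, {'id': 1, 'age': 24})
# print(users.find_one({'id': 1}))
```

Running the scraper repeatedly then refreshes existing records instead of creating duplicates.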
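(6) Install a MongoDB GUI tool: Robomongo
Robomongo is a GUI management tool for MongoDB.
It can be downloaded from the Robomongo site. I downloaded robomongo-0.9.0-rc8-windows-x86_64-c113244.exe. Double-click to run it and choose an install directory (mine is D:\softwares_diy\Robomongo 0.9.0-RC8\). The installer takes only a few steps. At the end, choose to launch Robomongo immediately. In the window that appears (figure below), click Create to set up a new connection. Make sure the mongod service is running (mongod.exe has been started) before clicking Test:
The last line in the figure above shows mongod waiting for connections on port 27017. Back in Robomongo, click Test:
The connection succeeds. For a local MongoDB instance, just click "Close" and then "Save".
In the Robomongo main window, click File -> Connect to see the connection just created:
Select the connection and click "Connect" to manage it:
For a MongoDB instance that is not local, connect through SSH by entering the IP address, user name, and password: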
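2. Scraping the data
All the required components are now installed and the MongoDB connection is open.
Download the huatian-funny project from GitHub, unzip it, and put it in a directory. Mine is D:\pythonExperiments\huatian-funny-master.
My changes:
spider.py and mark.py
My environment is Python 3.4, while the project's author used Python 2.x. Syntax and library names differ between the two, so I made small changes to spider.py, mark.py, and other .py files so that they run correctly. The author's spider.py finishes one scraping pass quickly and then stops. After my changes, spider.py runs automatically every 5 minutes, so it keeps scraping data continuously.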
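The modified spider.py uses APScheduler for the 5-minute interval. The idea of repeating a job at a fixed interval can also be sketched with the standard library alone. Note that this is a plain sleep loop, not APScheduler, and `run_every` and its parameters are names invented for this sketch:

```python
import time


def run_every(job, interval_seconds, max_runs=None, sleep=time.sleep):
    """Call job(), wait interval_seconds, and repeat.

    max_runs limits the number of calls (None means run until interrupted).
    Returns the number of times job() was called.
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        job()
        runs += 1
        # Only wait if another run will follow
        if max_runs is None or runs < max_runs:
            sleep(interval_seconds)
    return runs


# Example: three runs, 0.1 seconds apart
run_every(lambda: print('scraping pass'), 0.1, max_runs=3)
```

For a 5-minute interval, pass 300 as `interval_seconds`. Unlike a scheduler, this loop measures the wait from the end of one run to the start of the next, so long-running jobs stretch the period.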
Modified spider.py, the scraping program:
# -*- coding: utf-8 -*-
import os
import urllib.parse

from apscheduler.schedulers.blocking import BlockingScheduler
from requests import Session

from extension import mongo_collection

session = Session()

LOGIN_HEADERS = {
    'Host': 'reg.163.com',
    'Connection': 'keep-alive',
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,'
              'image/webp,*/*;q=0.8',
    'Origin': 'http://love.163.com',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/49.0.2623.110 Safari/537.36',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Referer': 'http://love.163.com/',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'zh-CN,zh;q=0.8,en;q=0.6,zh-TW;q=0.4',
    'Cookie': '_ntes_nnid=d53195032b58604628528cd6a374d63f,1460206631682; '
              '_ntes_nuid=d53195032b58604628528cd6a374d63f',
}

SEARCH_HEADERS = {
    'Accept': '*/*',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'zh-CN,zh;q=0.8,en;q=0.6,zh-TW;q=0.4',
    'Cache-Control': 'no-cache',
    'Connection': 'keep-alive',
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'Host': 'love.163.com',
    'Origin': 'http://love.163.com',
    'Pragma': 'no-cache',
    'Referer': 'http://love.163.com/search/user',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/49.0.2623.110 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest',
}


def login():
    """Log in to Huatian."""
    data = {
        'username': 'your_account@163.com',  # placeholder: use your own account
        'password': 'your_password',         # placeholder: use your own password
        'url': 'http://love.163.com/?checkUser=1&vendor=love.pLogin',
        'product': 'ht',
        'type': '1',
        'append': '1',
        'savelogin': '1',
    }
    response = session.post('https://reg.163.com/logins.jsp',
                            headers=LOGIN_HEADERS,
                            data=urllib.parse.urlencode(data))
    assert response.ok


def search():
    """Search by Shanghai district and age range."""
    for city in range(1, 20):
        for age in range(22, 27, 2):
            data = {
                'province': '2',
                'city': str(city),
                'age': '{}-{}'.format(age, age + 1),
                'condition': '1',
            }
            response = session.post('http://love.163.com/search/user/list',
                                    headers=SEARCH_HEADERS,
                                    data=urllib.parse.urlencode(data))
            if not response.ok:
                # In Python 3, format the string itself, not print()'s return value
                print('city:{} age:{} failed'.format(city, age))
                continue
            users = response.json()['list']
            for user in users:
                # Upsert keyed on the user id. update() is deprecated since
                # pymongo 3.0; replace_one() is the equivalent call here.
                mongo_collection.replace_one({'id': user['id']}, user, upsert=True)


def loginAndSearch():
    login()
    search()


if __name__ == '__main__':
    # Run every 5 minutes; change the interval as needed.
    scheduler = BlockingScheduler()
    scheduler.add_job(loginAndSearch, 'interval', minutes=5)
    print('Press Ctrl+{0} to exit'.format('Pause/Break' if os.name == 'nt' else 'C'))
    try:
        scheduler.start()
    except (KeyboardInterrupt, SystemExit):
        scheduler.shutdown()
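To see what search() actually sends, the form bodies for its two nested loops can be generated without touching the network. A minimal sketch using only the standard library; `build_search_forms` is a name invented here, while the field names and ranges are copied from search():

```python
import urllib.parse


def build_search_forms(cities=range(1, 20), ages=range(22, 27, 2)):
    """Return the URL-encoded form body for every (city, age range) pair."""
    forms = []
    for city in cities:
        for age in ages:
            fields = [
                ('province', '2'),
                ('city', str(city)),
                ('age', '{}-{}'.format(age, age + 1)),
                ('condition', '1'),
            ]
            # A list of pairs keeps the field order fixed on every Python 3 version
            forms.append(urllib.parse.urlencode(fields))
    return forms


forms = build_search_forms()
print(len(forms))   # 57: 19 city codes x 3 age ranges
print(forms[0])     # province=2&city=1&age=22-23&condition=1
```

Each scheduled pass therefore issues 57 search requests, covering the age ranges 22-23, 24-25, and 26-27 for city codes 1 through 19.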
Modified mark.py, the subjective scoring program:
# -*- coding: utf-8 -*-
"""Scoring program.

The UI strings are kept in Chinese as in the project; comments give the English meaning.
"""
import io
from urllib import request
from tkinter import messagebox, Tk, font, Label, Button, Radiobutton, IntVar

from PIL import Image, ImageTk

from extension import mongo_collection, BUY_HOUSE, BUY_CAR, \
    EDUCATION, INDUSTRY, SALARY, POSITION

master = None
tk_image = None
title_font = None
label_font = None
offset = 0
user, photo, url, buy_house, buy_car, age, height, salary, education, company, \
    industry, school, position, satisfy, appearance = [None] * 15


def get_user(offset=0):
    """Read one user record from MongoDB."""
    global user
    user = mongo_collection.find_one({}, skip=offset, limit=1,
                                     sort=[('url', -1)])


def init_master():
    """Initialize the main window."""
    global master, title_font, label_font
    master = Tk()
    master.title(u'花田')  # Huatian
    master.geometry(u'630x530')
    master.resizable(width=False, height=False)
    # Font objects need an existing Tk root, so they are created here
    title_font = font.Font(size=20, weight='bold')
    label_font = font.Font(size=15)


def place_image(image_url):
    """Download the user's avatar and keep a reference to it in tk_image."""
    global tk_image
    image_bytes = request.urlopen(image_url).read()
    data_stream = io.BytesIO(image_bytes)
    pil_image = Image.open(data_stream)
    tk_image = ImageTk.PhotoImage(pil_image)


def set_appearance():
    """Store the avatar score."""
    mongo_collection.update_one({'url': user['url']},
                                {'$set': {'appearance': appearance.get()}})


def set_satisfy():
    """Store whether the user is marked as satisfactory."""
    mongo_collection.update_one({'url': user['url']},
                                {'$set': {'satisfy': satisfy.get()}})


def update():
    """Refresh the page with the current user."""
    image_url = u'{}&quality=85&thumbnail=410y410'.format(user['avatar'])
    place_image(image_url)
    print(offset)
    photo['image'] = tk_image
    url['text'] = user['url']
    buy_house['text'] = BUY_HOUSE.get(user['house']) or user['house']
    buy_car['text'] = BUY_CAR.get(user['car']) or user['car']
    age['text'] = user['age']
    height['text'] = user['height']
    salary['text'] = SALARY.get(user['salary']) or user['salary']
    education['text'] = EDUCATION.get(user['education']) or user['education']
    company['text'] = user['company'] if user['company'] else u'--'
    industry['text'] = INDUSTRY.get(user['industry']) or user['industry']
    school['text'] = user['school'] if user['school'] else u'--'
    # Set the label text; assigning to `position` itself would replace the widget
    position['text'] = POSITION.get(user['position']) or user['position']
    satisfy.set(int(user.get(u'satisfy', -1)))
    appearance.set(int(user.get(u'appearance', -1)))


def init():
    """Initialize the page."""
    global photo, url, buy_house, buy_car, age, height, salary, \
        education, company, industry, school, position, satisfy, appearance
    get_user(offset)
    image_url = u'{}&quality=85&thumbnail=410y410'.format(user['avatar'])
    place_image(image_url)
    photo = Label(master, image=tk_image)
    photo.place(anchor=u'nw', x=10, y=40)
    url = Label(master, font=title_font, text=user['url'])
    url.place(anchor=u'nw', x=10, y=5)
    buy_house = Label(master, text=BUY_HOUSE.get(user['house']) or user['house'])
    buy_house.place(anchor=u'nw', x=490, y=50)
    buy_car = Label(master, text=BUY_CAR.get(user['car']) or user['car'])
    buy_car.place(anchor=u'nw', x=490, y=75)
    age = Label(master, text=user['age'])
    age.place(anchor=u'nw', x=490, y=100)
    height = Label(master, text=user['height'])
    height.place(anchor=u'nw', x=490, y=125)
    salary = Label(master, text=SALARY.get(user['salary']) or user['salary'])
    salary.place(anchor=u'nw', x=490, y=150)
    education = Label(master, text=EDUCATION.get(user['education']) or user['education'])
    education.place(anchor=u'nw', x=490, y=175)
    company = Label(master, text=user['company'] if user['company'] else u'--')
    company.place(anchor=u'nw', x=490, y=200)
    industry = Label(master, text=INDUSTRY.get(user['industry']) or user['industry'])
    industry.place(anchor=u'nw', x=490, y=225)
    school = Label(master, text=user['school'] if user['school'] else u'--')
    school.place(anchor=u'nw', x=490, y=250)
    position = Label(master, text=POSITION.get(user['position']) or user['position'])
    position.place(anchor=u'nw', x=490, y=275)
    satisfy = IntVar()
    satisfy.set(int(user.get(u'satisfy', -1)))
    # 满意 = satisfied, 不满意 = not satisfied
    satisfied = Radiobutton(master, text=u"满意", variable=satisfy,
                            value=1, command=set_satisfy)
    not_satisfied = Radiobutton(master, text=u"不满意", variable=satisfy,
                                value=0, command=set_satisfy)
    satisfied.place(anchor=u'nw', x=450, y=10)
    not_satisfied.place(anchor=u'nw', x=510, y=10)
    appearance = IntVar()
    appearance.set(int(user.get(u'appearance', -1)))
    # Ten radio buttons for an appearance score from 1 to 10
    for i in range(1, 11):
        score_i = Radiobutton(master, text=str(i), variable=appearance,
                              value=i, command=set_appearance)
        score_i.place(anchor=u'nw', x=i * 40 - 30, y=460)


def handle_previous():
    """Go to the previous user."""
    global offset
    if offset <= 0:
        messagebox.showwarning(u'error', u'已经是第一个')  # already at the first user
        return
    offset -= 1
    get_user(offset)
    update()


def handle_next():
    """Go to the next user."""
    global offset
    offset += 1
    get_user(offset)
    if not user:
        messagebox.showwarning(u'error', u'已经是最后一个')  # already at the last user
        offset -= 1
        get_user(offset)  # restore the last valid user
        return
    update()


def add_assembly():
    """Add the widgets."""
    init()
    # Static field names: house, car, age, height, salary, education,
    # company, industry, school, position. One row every 25 px.
    static_texts = [u'购房: ', u'购车: ', u'年龄: ', u'身高: ', u'工资: ',
                    u'学历: ', u'公司: ', u'行业: ', u'学校: ', u'职位: ']
    for row, text in enumerate(static_texts):
        static_label = Label(master, font=label_font, text=text)
        static_label.place(anchor=u'nw', x=440, y=50 + 25 * row)
    previous_button = Button(master, text=u'上一个', command=handle_previous)  # previous
    previous_button.place(anchor=u'nw', x=10, y=490)
    next_button = Button(master, text=u'下一个', command=handle_next)  # next
    next_button.place(anchor=u'nw', x=520, y=490)


if __name__ == '__main__':
    init_master()
    add_assembly()
    master.mainloop()
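mark.py repeats two small display rules for most fields: look up a human-readable label for a stored code and fall back to the raw value, or show '--' when a text field is empty. A minimal sketch of both rules in isolation; the helper names and the example mapping are hypothetical, and the real tables (BUY_HOUSE, SALARY, and so on) live in the project's extension.py:

```python
def display_value(mapping, raw):
    """Return the label for a raw code, or the raw value if no label is known."""
    return mapping.get(raw) or raw


def text_or_dash(value):
    """Return the value, or '--' when it is empty or missing."""
    return value if value else '--'


# Hypothetical mapping, only for illustration
EXAMPLE_HOUSE = {1: 'owns a home', 2: 'no home yet'}

print(display_value(EXAMPLE_HOUSE, 1))   # owns a home
print(display_value(EXAMPLE_HOUSE, 9))   # 9
print(text_or_dash(''))                  # --
```

Because of the `or`, a code whose label is an empty string also falls back to the raw value.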
I have not yet modified or debugged train.py, so the decision-tree training part has not been tried.