Python in Practice (2): Scraping Data with Python (Part 1)

Source: Internet | Published by: 服装加工厂办公软件 | Editor: 程序博客网 | Date: 2024/06/06 13:25

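Goal of this series:

The plan is to use the GitHub project huatian-funny to scrape registered-user data from 花田 (Huatian, love.163.com) with Python, run a small experiment on it, and then upload my modified version of the huatian-funny project.

The project's description on the huatian-funny page reads:

[Screenshot: huatian-funny project description]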

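1. Preparation

Required:

requests >= 2.7.0, pymongo >= 3.2.2, matplotlib >= 1.4.3, Pillow >= 3.2.0

(1) Install requests (>= 2.7.0)

requests is an HTTP client library for Python.
Install it with pip (or easy_install):

>pip install requests

[Screenshot: pip output]

The output shows that version 2.10.0 was installed, which satisfies the >= 2.7.0 requirement.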
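A quick way to confirm that an installed version meets a minimum is to compare the versions as tuples of integers rather than as strings. The sketch below is only an illustration: the helper names parse_version and meets_minimum are made up for this post, and it assumes purely numeric versions such as 2.10.0 (no "rc1" or "dev" suffixes).

```python
def parse_version(text):
    """Turn '2.10.0' into (2, 10, 0) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def meets_minimum(installed, minimum):
    """Return True when `installed` is at least `minimum`."""
    return parse_version(installed) >= parse_version(minimum)


# Minimum versions taken from the requirements list above.
MINIMUMS = {
    "requests": "2.7.0",
    "pymongo": "3.2.2",
    "matplotlib": "1.4.3",
    "Pillow": "3.2.0",
}

if __name__ == "__main__":
    # 2.10.0 is the requests version that pip installed in this walkthrough.
    print(meets_minimum("2.10.0", MINIMUMS["requests"]))
```

Comparing the raw strings would give the wrong answer here, because "2.10.0" sorts before "2.7.0" character by character.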

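(2) Install matplotlib

See part 4, "Installing matplotlib", of the previous post in this series, python实践之准备（一） (Python in Practice: Preparation (1)). It is not repeated here.

(3) Install Pillow

>pip install pillow

[Screenshot: pip output]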

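(4) Install MongoDB

Download it from the MongoDB download page.
After the download finishes, run mongodb-win32-x86_64-2008plus-ssl-3.2.6-signed.msi and accept the defaults through to the end.
By default MongoDB is installed under C:\Program Files\MongoDB.
On Windows, MongoDB's default data directory is C:\data\db; create this directory before starting the server.

· Start the mongod service

Double-click mongod.exe, or start it from a command prompt with extra options:

mongod.exe --journal --rest

To use a data directory other than the default C:\data\db, pass the --dbpath option when starting the server, for example:

mongod.exe --dbpath yourpath

The startup options are: --dbpath, the database directory; --logpath, the path of the log file; --journal, which enables journaling; --rest, which lets clients access the MongoDB server through its REST API (available in the 3.2 release used here; later MongoDB releases removed it).

After startup, the command window looks like this:

[Screenshot: mongod console output]

The last line shows that the server is waiting for connections.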
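The "waiting for connections" state can also be checked from Python without any MongoDB driver, by trying to open a TCP connection to the server's port. This is a minimal sketch, assuming mongod runs on the local machine on its default port 27017; the function name is_port_open is made up for this example.

```python
import socket


def is_port_open(host="127.0.0.1", port=27017, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, or unreachable: nothing usable is listening.
        return False


if __name__ == "__main__":
    if is_port_open():
        print("mongod is accepting connections on 27017")
    else:
        print("nothing is listening on 27017; is mongod.exe running?")
```

This only shows that something accepts connections on the port. It does not prove that the listener is MongoDB.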

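· Connect

Double-click mongo.exe, or open a second command prompt and run mongo.exe to connect to the database:

[Screenshot: mongo shell connected to the server]

Either of the following commands lists all databases; other operations are easy to look up.

show dbs
show databases

Back in the mongod.exe window, the connection count has changed to 1:

[Screenshot: mongod console showing one open connection]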

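(5) Install pymongo

The data the crawler collects is stored in MongoDB; pymongo is the Python driver used to write it there.
Install pymongo:

>pip install pymongo

Upgrade pymongo:

>pip install --upgrade pymongo

[Screenshot: pip output]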
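Once pymongo is installed, storing a scraped record comes down to an "upsert": insert the document if its id is new, otherwise replace the stored copy. The sketch below shows that pattern with pymongo's MongoClient and Collection.replace_one. The database name huatian, the collection name users, and both helper names are assumptions for illustration; the project defines its own collection in extension.py, which is not shown in this post.

```python
def get_collection(uri="mongodb://localhost:27017/",
                   db_name="huatian", collection_name="users"):
    """Open a collection on the local mongod (the names are hypothetical)."""
    # Imported here so save_user() below can be tried without pymongo installed.
    from pymongo import MongoClient
    return MongoClient(uri)[db_name][collection_name]


def save_user(collection, user):
    """Insert the user document, or replace the stored one with the same id."""
    return collection.replace_one({"id": user["id"]}, user, upsert=True)


if __name__ == "__main__":
    users = get_collection()
    save_user(users, {"id": 1, "age": 24})
    save_user(users, {"id": 1, "age": 25})  # same id: replaces, does not duplicate
    print(users.find_one({"id": 1}))
```

Running the main block requires the mongod service from step (4) to be up.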

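(6) Install a MongoDB GUI tool: Robomongo

Robomongo is a GUI management tool for MongoDB.
Download it from the Robomongo site; I used robomongo-0.9.0-rc8-windows-x86_64-c113244.exe. Double-click to run it and choose an install directory (mine is D:\softwares_diy\Robomongo 0.9.0-RC8\). The installer has only a few steps; on the last one, choose to launch Robomongo immediately. In the dialog that appears, click Create to add a new connection:

[Screenshot: Robomongo connection dialog]

Make sure the mongod service is running (mongod.exe has been started) and that the last line of its console shows it waiting for connections on port 27017, then go back to Robomongo and click Test:

[Screenshot: connection test result]

The connection succeeds. If you are connecting to a local MongoDB, just click "Close" and then "Save".
In the Robomongo main window, choose File -> Connect; the connection just created is listed:

[Screenshot: connection list]

Select the connection and click "Connect" to manage it:

[Screenshot: Robomongo management view]

If the MongoDB server is not local, connect over SSH instead and enter the IP address, user name, and password:

[Screenshot: SSH connection settings]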

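2. Scraping the data

All the required components are now installed, and the MongoDB connection is open.

Download the huatian-funny project from GitHub, unzip it, and put it in a directory; mine is D:\pythonExperiments\huatian-funny-master.

My changes:

spider.py and mark.py
My environment is Python 3.4, while the project's author used Python 2.x. Some syntax and library names differ between the two, so I made small changes to spider.py, mark.py, and the other .py files so that they run correctly.
The author's spider.py scrapes once, finishes quickly, and exits. After my changes, spider.py runs automatically every 5 minutes, so data is collected continuously. The scheduling uses BlockingScheduler from the APScheduler package, which is not in the requirements list above and must be installed separately (pip install apscheduler).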
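The idea of "run the scraper, wait, run it again" can be sketched with the standard library alone. This is not how the modified spider.py does it (that uses APScheduler); it is a simplified stand-in built on time.sleep, and run_periodically is a name made up for this example. Unlike a real scheduler, it runs the job once right away and measures each wait from the end of the previous run.

```python
import time


def run_periodically(job, interval_seconds, max_runs=None, sleep=time.sleep):
    """Call job(), wait interval_seconds, and repeat.

    max_runs=None repeats until interrupted; `sleep` can be swapped out in tests.
    Returns the number of times the job ran.
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        job()
        runs += 1
        # Skip the final wait when a run limit has been reached.
        if max_runs is None or runs < max_runs:
            sleep(interval_seconds)
    return runs


if __name__ == "__main__":
    try:
        run_periodically(lambda: print("scrape once"), interval_seconds=5 * 60)
    except KeyboardInterrupt:
        print("stopped")
```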

The modified spider.py (the data-scraping program):

    # -*- coding: utf-8 -*-
    import os
    import urllib.parse

    from apscheduler.schedulers.blocking import BlockingScheduler
    from requests import Session

    from extension import mongo_collection

    session = Session()

    LOGIN_HEADERS = {
        'Host': 'reg.163.com',
        'Connection': 'keep-alive',
        'Pragma': 'no-cache',
        'Cache-Control': 'no-cache',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,'
                  'image/webp,*/*;q=0.8',
        'Origin': 'http://love.163.com',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/49.0.2623.110 Safari/537.36',
        'Content-Type': 'application/x-www-form-urlencoded',
        'Referer': 'http://love.163.com/',
        'Accept-Encoding': 'gzip, deflate',
        'Accept-Language': 'zh-CN,zh;q=0.8,en;q=0.6,zh-TW;q=0.4',
        'Cookie': '_ntes_nnid=d53195032b58604628528cd6a374d63f,1460206631682; '
                  '_ntes_nuid=d53195032b58604628528cd6a374d63f',
    }

    SEARCH_HEADERS = {
        'Accept': '*/*',
        'Accept-Encoding': 'gzip, deflate',
        'Accept-Language': 'zh-CN,zh;q=0.8,en;q=0.6,zh-TW;q=0.4',
        'Cache-Control': 'no-cache',
        'Connection': 'keep-alive',
        'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
        'Host': 'love.163.com',
        'Origin': 'http://love.163.com',
        'Pragma': 'no-cache',
        'Referer': 'http://love.163.com/search/user',
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/49.0.2623.110 Safari/537.36',
        'X-Requested-With': 'XMLHttpRequest',
    }


    def login():
        """Log in to Huatian."""
        data = {
            # Replace with your own account; do not publish real credentials.
            'username': 'your_account@163.com',
            'password': 'your_password',
            'url': 'http://love.163.com/?checkUser=1&vendor=love.pLogin',
            'product': 'ht',
            'type': '1',
            'append': '1',
            'savelogin': '1',
        }
        response = session.post('https://reg.163.com/logins.jsp',
                                headers=LOGIN_HEADERS,
                                data=urllib.parse.urlencode(data))
        assert response.ok


    def search():
        """Search by Shanghai district and age range."""
        for city in range(1, 20):
            for age in range(22, 27, 2):
                data = {
                    'province': '2',
                    'city': str(city),
                    'age': '{}-{}'.format(age, age + 1),
                    'condition': '1',
                }
                response = session.post('http://love.163.com/search/user/list',
                                        headers=SEARCH_HEADERS,
                                        data=urllib.parse.urlencode(data))
                if not response.ok:
                    # In Python 3, format the string before passing it to print().
                    print('city:{} age:{} failed'.format(city, age))
                    continue
                users = response.json()['list']
                for user in users:
                    # replace_one supersedes the deprecated Collection.update().
                    mongo_collection.replace_one({'id': user['id']}, user,
                                                 upsert=True)


    def loginAndSearch():
        login()
        search()


    if __name__ == '__main__':
        # Run every 5 minutes; adjust the interval as needed.
        scheduler = BlockingScheduler()
        scheduler.add_job(loginAndSearch, 'interval', minutes=5)
        print('Press Ctrl+{0} to exit'.format(
            'Pause/Break' if os.name == 'nt' else 'C'))
        try:
            scheduler.start()
        except (KeyboardInterrupt, SystemExit):
            scheduler.shutdown()
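To see what search() actually sends, the form body can be built separately from the HTTP call. The sketch below reuses the same province, city, and age values as spider.py but makes no network requests; build_search_payloads is a helper name introduced here, not part of the project. It also shows one of the Python 2 to 3 changes mentioned above: urlencode now lives in urllib.parse (in Python 2 it was urllib.urlencode).

```python
import urllib.parse


def build_search_payloads(cities=range(1, 20), ages=range(22, 27, 2)):
    """Yield the url-encoded form body for every city/age combination."""
    for city in cities:
        for age in ages:
            data = {
                'province': '2',
                'city': str(city),
                # Each step covers two years: 22-23, 24-25, 26-27.
                'age': '{}-{}'.format(age, age + 1),
                'condition': '1',
            }
            yield urllib.parse.urlencode(data)


if __name__ == '__main__':
    payloads = list(build_search_payloads())
    print(len(payloads))  # 19 city codes x 3 age ranges
    print(payloads[0])
```

With the default ranges this produces 57 request bodies per scraping pass.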

The modified mark.py (the subjective scoring program):

    # -*- coding: utf-8 -*-
    """Scoring program."""
    import io
    from urllib import request
    from tkinter import messagebox, Tk, Label, Button, Radiobutton, IntVar

    from PIL import Image, ImageTk

    from extension import mongo_collection, BUY_HOUSE, BUY_CAR, \
        EDUCATION, INDUSTRY, SALARY, POSITION

    master = None
    tk_image = None
    offset = 0
    user, photo, url, buy_house, buy_car, age, height, salary, education, company, \
        industry, school, position, satisfy, appearance = [None for i in range(15)]


    def get_user(offset=0):
        """Read one user record from MongoDB."""
        global user
        user = mongo_collection.find_one({}, skip=offset, limit=1,
                                         sort=[('url', -1)])


    def init_master():
        """Create the main window."""
        global master
        master = Tk()
        master.title(u'花田')
        master.geometry(u'630x530')
        master.resizable(width=False, height=False)


    def place_image(image_url):
        """Download the user's avatar and convert it for Tk."""
        global tk_image
        image_bytes = request.urlopen(image_url).read()
        data_stream = io.BytesIO(image_bytes)
        pil_image = Image.open(data_stream)
        tk_image = ImageTk.PhotoImage(pil_image)


    def set_appearance():
        """Store the appearance score."""
        # update_one supersedes the deprecated Collection.update().
        mongo_collection.update_one({'url': user['url']},
                                    {'$set': {'appearance': appearance.get()}})


    def set_satisfy():
        """Store whether the profile is satisfactory."""
        mongo_collection.update_one({'url': user['url']},
                                    {'$set': {'satisfy': satisfy.get()}})


    def update():
        """Refresh the page with the current user."""
        global user, offset, photo, url, buy_house, buy_car, age, height, salary, \
            education, company, industry, school, position, satisfy, appearance
        image_url = u'{}&quality=85&thumbnail=410y410'.format(user['avatar'])
        place_image(image_url)
        print(offset)
        photo['image'] = tk_image
        url['text'] = user['url']
        buy_house['text'] = BUY_HOUSE.get(user['house']) or user['house']
        buy_car['text'] = BUY_CAR.get(user['car']) or user['car']
        age['text'] = user['age']
        height['text'] = user['height']
        salary['text'] = SALARY.get(user['salary']) or user['salary']
        education['text'] = EDUCATION.get(user['education']) or user['education']
        company['text'] = user['company'] if user['company'] else u'--'
        industry['text'] = INDUSTRY.get(user['industry']) or user['industry']
        school['text'] = user['school'] if user['school'] else u'--'
        # Set the label text; assigning to `position` would replace the Label.
        position['text'] = POSITION.get(user['position']) or user['position']
        satisfy.set(int(user.get(u'satisfy', -1)))
        appearance.set(int(user.get(u'appearance', -1)))


    def init():
        """Build the page for the first user."""
        global user, offset, photo, url, buy_house, buy_car, age, height, salary, \
            education, company, industry, school, position, satisfy, appearance
        get_user(offset)
        image_url = u'{}&quality=85&thumbnail=410y410'.format(user['avatar'])
        place_image(image_url)
        photo = Label(master, image=tk_image)
        photo.place(anchor=u'nw', x=10, y=40)
        url = Label(master, font=("20"), text=user['url'])
        url.place(anchor=u'nw', x=10, y=5)
        buy_house = Label(master, text=BUY_HOUSE.get(user['house']) or user['house'])
        buy_house.place(anchor=u'nw', x=490, y=50)
        buy_car = Label(master, text=BUY_CAR.get(user['car']) or user['car'])
        buy_car.place(anchor=u'nw', x=490, y=75)
        age = Label(master, text=user['age'])
        age.place(anchor=u'nw', x=490, y=100)
        height = Label(master, text=user['height'])
        height.place(anchor=u'nw', x=490, y=125)
        salary = Label(master, text=SALARY.get(user['salary']) or user['salary'])
        salary.place(anchor=u'nw', x=490, y=150)
        education = Label(master, text=EDUCATION.get(user['education']) or user['education'])
        education.place(anchor=u'nw', x=490, y=175)
        company = Label(master, text=user['company'] if user['company'] else u'--')
        company.place(anchor=u'nw', x=490, y=200)
        industry = Label(master, text=INDUSTRY.get(user['industry']) or user['industry'])
        industry.place(anchor=u'nw', x=490, y=225)
        school = Label(master, text=user['school'] if user['school'] else u'--')
        school.place(anchor=u'nw', x=490, y=250)
        position = Label(master, text=POSITION.get(user['position']) or user['position'])
        position.place(anchor=u'nw', x=490, y=275)
        satisfy = IntVar()
        satisfy.set(int(user.get(u'satisfy', -1)))
        satisfied = Radiobutton(master, text=u"满意", variable=satisfy,
                                value=1, command=set_satisfy)
        not_satisfied = Radiobutton(master, text=u"不满意", variable=satisfy,
                                    value=0, command=set_satisfy)
        satisfied.place(anchor=u'nw', x=450, y=10)
        not_satisfied.place(anchor=u'nw', x=510, y=10)
        appearance = IntVar()
        appearance.set(int(user.get(u'appearance', -1)))
        for i in range(1, 11):
            score_i = Radiobutton(master, text=str(i), variable=appearance,
                                  value=i, command=set_appearance)
            score_i.place(anchor=u'nw', x=i * 40 - 30, y=460)


    def handle_previous():
        """Show the previous user."""
        global offset
        if offset <= 0:
            messagebox.showwarning(u'error', u'已经是第一个')
            return  # already at the first record; do not go below 0
        offset -= 1
        get_user(offset)
        update()


    def handle_next():
        """Show the next user."""
        global offset
        offset += 1
        get_user(offset)
        if not user:
            # Past the last record: step back so `user` stays valid.
            offset -= 1
            get_user(offset)
            messagebox.showwarning(u'error', u'已经是最后一个')
            return
        update()


    def add_assembly():
        """Add the static labels and the navigation buttons."""
        init()
        buy_house_static = Label(master, font=("15"), text=u'购房: ')
        buy_house_static.place(anchor=u'nw', x=440, y=50)
        buy_car_static = Label(master, font=("15"), text=u'购车: ')
        buy_car_static.place(anchor=u'nw', x=440, y=75)
        age_static = Label(master, font=("15"), text=u'年龄: ')
        age_static.place(anchor=u'nw', x=440, y=100)
        height_static = Label(master, font=("15"), text=u'身高: ')
        height_static.place(anchor=u'nw', x=440, y=125)
        salary_static = Label(master, font=("15"), text=u'工资: ')
        salary_static.place(anchor=u'nw', x=440, y=150)
        education_static = Label(master, font=("15"), text=u'学历: ')
        education_static.place(anchor=u'nw', x=440, y=175)
        company_static = Label(master, font=("15"), text=u'公司: ')
        company_static.place(anchor=u'nw', x=440, y=200)
        industry_static = Label(master, font=("15"), text=u'行业: ')
        industry_static.place(anchor=u'nw', x=440, y=225)
        school_static = Label(master, font=("15"), text=u'学校: ')
        school_static.place(anchor=u'nw', x=440, y=250)
        position_static = Label(master, font=("15"), text=u'职位: ')
        position_static.place(anchor=u'nw', x=440, y=275)
        previous = Button(master, text=u'上一个', command=handle_previous)
        previous.place(anchor=u'nw', x=10, y=490)
        next_button = Button(master, text=u'下一个', command=handle_next)
        next_button.place(anchor=u'nw', x=520, y=490)


    if __name__ == '__main__':
        init_master()
        add_assembly()
        master.mainloop()
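mark.py repeats one lookup pattern many times: translate a stored code into a readable label, fall back to the raw code when the table has no entry, and show '--' when the field is empty. A small helper makes that pattern explicit. The sketch below is an illustration only: label_for is a made-up name, and the table contents are invented, since the real BUY_HOUSE and related tables live in extension.py, which is not shown in this post.

```python
# Hypothetical code-to-label table; the real ones are defined in extension.py.
BUY_HOUSE = {'1': 'owns a home', '2': 'plans to buy'}


def label_for(mapping, code, missing='--'):
    """Return the label for code, the raw code if unmapped, or `missing` if empty."""
    return mapping.get(code) or code or missing


if __name__ == '__main__':
    print(label_for(BUY_HOUSE, '1'))   # a mapped code
    print(label_for(BUY_HOUSE, '9'))   # an unmapped code falls back to itself
    print(label_for(BUY_HOUSE, ''))    # an empty field shows the placeholder
```

Because the fallback relies on `or`, any falsy code (such as the integer 0) is treated as empty, so this form only suits tables whose codes are non-empty strings or non-zero numbers.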

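I have not yet modified or debugged train.py, so the decision-tree training part has not been tried out.

References:
1. MongoDB与PyMongo的安装（Linux/Windows XP） (Installing MongoDB and PyMongo on Linux/Windows XP)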

0 0