Python 3 Large-Scale Web Crawler in Practice 001 --- Setting Up the Development Environment


(Reposted from) http://www.aobosir.com/blog/2016/11/26/python3-large-web-crawler-001-Build-development-environment/

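Preface

There are many ways to develop a Python crawler. In terms of program complexity, they fall into two kinds: crawler projects and crawler files. Some readers have probably used Python's urllib module. It is convenient for writing single-file crawlers, but on large projects both development efficiency and program stability tend to fall short. Scrapy is a Python crawling framework; using it improves development efficiency, and it is well suited to medium and large crawler projects. In short, urllib is better suited to crawler files, and Scrapy is better suited to crawler projects.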
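To make the "crawler file" idea concrete, here is a minimal single-file sketch built only on the standard library. It is not from the original post: the class and function names, the User-Agent string, and the 10-second timeout are all illustrative assumptions.

```python
"""A single-file crawler sketch: fetch one page with urllib and read its <title>."""
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_title(html):
    """Return the stripped <title> text of an HTML string ('' if absent)."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()


def fetch_title(url, timeout=10):
    """Download url and return its title. The User-Agent value is a placeholder."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0 (example)"})
    with urlopen(request, timeout=timeout) as response:
        html = response.read().decode("utf-8", errors="replace")
    return extract_title(html)
```

Calling fetch_title("http://example.com/") (a placeholder URL) would download the page and return its title. Everything beyond this, such as scheduling, retries, and item pipelines, is what a framework like Scrapy provides for larger projects.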

This series explains how to build crawler projects. This post is the first in the series: setting up the development environment.

1 . Install Python 3

Download it from the official website. Python 3.5 is sufficient, and the installer's default options are fine.

By default, Python 3 is installed at: C:\Users\[Username]\AppData\Local\Programs\Python\Python35
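A quick way to confirm which interpreter is actually running, and whether it is a 32-bit or 64-bit build, is a few lines of standard-library Python. This is an illustrative sketch; the function name interpreter_summary is made up for this example.

```python
import struct
import sys


def interpreter_summary():
    """Return (version string, pointer size in bits, interpreter path)."""
    version = "{}.{}.{}".format(*sys.version_info[:3])
    bits = struct.calcsize("P") * 8  # 64 on a 64-bit build, 32 on a 32-bit build
    return version, bits, sys.executable


version, bits, path = interpreter_summary()
print("Python {} ({}-bit) at {}".format(version, bits, path))
```

The bit width matters later, when choosing which prebuilt .whl files to download.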

2 . Install a Python IDE: PyCharm IDE 2016.1.4

Download: https://www.jetbrains.com/pycharm/download/#section=windows

Notes:

Professional is the full edition, but it requires a license key.

Registration method: http://blog.csdn.net/tianzhaixing2013/article/details/44997881

This time I installed PyCharm 2016.

Community is the free edition, but I found its built-in Terminal unusable.

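3 . Install Visual Studio 2015

First, why is Visual Studio needed? (See this website.)

Without it, running pip install third-package-name will sometimes fail with the following error: error: Unable to find vcvarsall.bat. This happens when pip has to compile a C extension from source and cannot find the Microsoft Visual C++ compiler.

Visual Studio 2015 is installed in order to install the Python Tools 2.2.5 for Visual Studio 2015 component that comes with it.

How to download and install Visual Studio 2015 is described here.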

4 . Upgrade pip

Run the following command in a DOS window to upgrade pip.

    python -m pip install --upgrade pip

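5 . Install some third-party libraries

lxml, Twisted, pywin32, scrapy

lxml is a library for processing XML quickly and flexibly. Twisted is an event-driven networking engine implemented in Python. pywin32 provides access to the Win32 API. Scrapy is an application framework written for crawling websites and extracting structured data.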


We installed Python 3.5 and my machine is 64-bit, so download:

lxml-3.6.4-cp35-cp35m-win_amd64.whl

Twisted-16.5.0-cp35-cp35m-win_amd64.whl

pywin32-220.1-cp35-cp35m-win_amd64.whl

scrapy (install it directly with the command: pip.exe install scrapy)
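The wheel filenames above encode the Python version and platform they were built for (name, version, Python tag, ABI tag, platform tag). The following sketch splits a filename into those fields and computes the tags a Windows CPython interpreter would need. The helper names are hypothetical, the optional build-tag field is deliberately not handled, and expected_tags assumes CPython on Windows.

```python
import struct
import sys
from collections import namedtuple

WheelTags = namedtuple("WheelTags", "name version python abi platform")


def parse_wheel_filename(filename):
    """Split 'name-version-python-abi-platform.whl' into its five fields.

    Wheels that carry the optional build tag (a sixth field) are rejected.
    """
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel file: {}".format(filename))
    parts = filename[:-len(".whl")].split("-")
    if len(parts) != 5:
        raise ValueError("unexpected wheel name: {}".format(filename))
    return WheelTags(*parts)


def expected_tags():
    """Return the (python tag, platform tag) for this interpreter.

    Assumes CPython on Windows; other platforms use different platform tags.
    """
    python_tag = "cp{}{}".format(*sys.version_info[:2])
    platform_tag = "win_amd64" if struct.calcsize("P") == 8 else "win32"
    return python_tag, platform_tag
```

For example, parse_wheel_filename("lxml-3.6.4-cp35-cp35m-win_amd64.whl") yields python="cp35" and platform="win_amd64", which matches a 64-bit Python 3.5 on Windows.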


How to install third-party Python libraries: http://blog.csdn.net/github_35160620/article/details/52203682

Note: if Python 2 was previously installed on your machine, Python 2 has its own pip and Python 3 has its own pip as well. So when you run pip install some-package-name on the DOS command line, which pip does the system use: Python 2's or Python 3's?

This post answers that question: http://www.aobosir.com/blog/2016/11/23/pip-install-python2-python3/
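One general way to sidestep the ambiguity, independent of the linked post, is to invoke pip as a module of a specific interpreter (python -m pip), the same form used in step 4. The sketch below builds such a command; pip_command and run_pip are illustrative names, not part of pip.

```python
import subprocess
import sys


def pip_command(*args, python=sys.executable):
    """Build a pip command bound to one interpreter via 'python -m pip'."""
    return [python, "-m", "pip"] + list(args)


def run_pip(*args, python=sys.executable):
    """Run pip under the given interpreter and return its exit code."""
    return subprocess.call(pip_command(*args, python=python))
```

Calling run_pip("--version") prints the pip version together with the Python installation it belongs to, and run_pip("install", "--upgrade", "pip") is equivalent to the upgrade command shown earlier, run under the current interpreter.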


After downloading, this is how I installed them on my machine:

Install lxml:

    C:\Users\AOBO>cd C:\Users\AOBO\AppData\Local\Programs\Python\Python35\Scripts
    C:\Users\AOBO\AppData\Local\Programs\Python\Python35\Scripts>pip.exe install D:\software_install_package_win\python\some-Python-third-packages\lxml-3.6.4-cp35-cp35m-win_amd64.whl
    Processing d:\software_install_package_win\python\some-python-third-packages\lxml-3.6.4-cp35-cp35m-win_amd64.whl
    Installing collected packages: lxml
    Successfully installed lxml-3.6.4

Install Twisted: (when it reached the line Collecting constantly>=15.1 (from Twisted==16.5.0), it hung, and it continued only after I pressed Ctrl+C. The following three dependencies were then downloaded automatically: constantly, incremental, zope.interface.)

    C:\Users\AOBO\AppData\Local\Programs\Python\Python35\Scripts>pip.exe install D:\software_install_package_win\python\some-Python-third-packages\Twisted-16.5.0-cp35-cp35m-win_amd64.whl
    Processing d:\software_install_package_win\python\some-python-third-packages\twisted-16.5.0-cp35-cp35m-win_amd64.whl
    Collecting constantly>=15.1 (from Twisted==16.5.0)  # (it hung here until I pressed Ctrl+C)
      Downloading constantly-15.1.0-py2.py3-none-any.whl
    Collecting incremental>=16.10.1 (from Twisted==16.5.0)
      Downloading incremental-16.10.1-py2.py3-none-any.whl
    Collecting zope.interface>=4.0.2 (from Twisted==16.5.0)
      Downloading zope.interface-4.3.2-cp35-cp35m-win_amd64.whl (136kB)
        100% |████████████████████████████████| 143kB 7.1kB/s
    Requirement already satisfied: setuptools in c:\users\aobo\appdata\local\programs\python\python35\lib\site-packages (from zope.interface>=4.0.2->Twisted==16.5.0)
    Installing collected packages: constantly, incremental, zope.interface, Twisted
    Successfully installed Twisted-16.5.0 constantly-15.1.0 incremental-16.10.1 zope.interface-4.3.2

Install pywin32:

    C:\Users\AOBO\AppData\Local\Programs\Python\Python35\Scripts>pip.exe install D:\software_install_package_win\python\some-Python-third-packages\pywin32-220.1-cp35-cp35m-win_amd64.whl
    Processing d:\software_install_package_win\python\some-python-third-packages\pywin32-220.1-cp35-cp35m-win_amd64.whl
    Installing collected packages: pywin32
    Successfully installed pywin32-220.1

Install Scrapy:

    C:\Users\AOBO\AppData\Local\Programs\Python\Python35\Scripts>pip.exe install scrapy
    Collecting scrapy
      Downloading Scrapy-1.2.1-py2.py3-none-any.whl (294kB)
        100% |████████████████████████████████| 296kB 338kB/s
    Collecting service-identity (from scrapy)
      Downloading service_identity-16.0.0-py2.py3-none-any.whl
    Collecting six>=1.5.2 (from scrapy)
      Downloading six-1.10.0-py2.py3-none-any.whl
    Collecting w3lib>=1.15.0 (from scrapy)
      Downloading w3lib-1.16.0-py2.py3-none-any.whl
    Collecting PyDispatcher>=2.0.5 (from scrapy)
      Downloading PyDispatcher-2.0.5.tar.gz
    Requirement already satisfied: Twisted>=10.0.0 in c:\users\aobo\appdata\local\programs\python\python35\lib\site-packages (from scrapy)
    Requirement already satisfied: lxml in c:\users\aobo\appdata\local\programs\python\python35\lib\site-packages (from scrapy)
    Collecting cssselect>=0.9 (from scrapy)
      Downloading cssselect-1.0.0-py2.py3-none-any.whl
    Collecting parsel>=0.9.3 (from scrapy)
      Downloading parsel-1.1.0-py2.py3-none-any.whl
    Collecting queuelib (from scrapy)
      Downloading queuelib-1.4.2-py2.py3-none-any.whl
    Collecting pyOpenSSL (from scrapy)
      Downloading pyOpenSSL-16.2.0-py2.py3-none-any.whl (43kB)
        100% |████████████████████████████████| 51kB 4.7MB/s
    Collecting pyasn1 (from service-identity->scrapy)
      Downloading pyasn1-0.1.9-py2.py3-none-any.whl
    Collecting pyasn1-modules (from service-identity->scrapy)
      Downloading pyasn1_modules-0.0.8-py2.py3-none-any.whl
    Collecting attrs (from service-identity->scrapy)
      Downloading attrs-16.2.0-py2.py3-none-any.whl
    Requirement already satisfied: constantly>=15.1 in c:\users\aobo\appdata\local\programs\python\python35\lib\site-packages (from Twisted>=10.0.0->scrapy)
    Requirement already satisfied: zope.interface>=4.0.2 in c:\users\aobo\appdata\local\programs\python\python35\lib\site-packages (from Twisted>=10.0.0->scrapy)
    Requirement already satisfied: incremental>=16.10.1 in c:\users\aobo\appdata\local\programs\python\python35\lib\site-packages (from Twisted>=10.0.0->scrapy)
    Collecting cryptography>=1.3.4 (from pyOpenSSL->scrapy)
      Downloading cryptography-1.6-cp35-cp35m-win_amd64.whl (1.3MB)
        100% |████████████████████████████████| 1.3MB 257kB/s
    Requirement already satisfied: setuptools in c:\users\aobo\appdata\local\programs\python\python35\lib\site-packages (from zope.interface>=4.0.2->Twisted>=10.0.0->scrapy)
    Collecting cffi>=1.4.1 (from cryptography>=1.3.4->pyOpenSSL->scrapy)
      Downloading cffi-1.9.1-cp35-cp35m-win_amd64.whl (158kB)
        100% |████████████████████████████████| 163kB 322kB/s
    Collecting idna>=2.0 (from cryptography>=1.3.4->pyOpenSSL->scrapy)
      Downloading idna-2.1-py2.py3-none-any.whl (54kB)
        100% |████████████████████████████████| 61kB 4.4MB/s
    Collecting pycparser (from cffi>=1.4.1->cryptography>=1.3.4->pyOpenSSL->scrapy)
      Downloading pycparser-2.17.tar.gz (231kB)
        100% |████████████████████████████████| 235kB 311kB/s
    Installing collected packages: six, pycparser, cffi, pyasn1, idna, cryptography, pyOpenSSL, pyasn1-modules, attrs, service-identity, w3lib, PyDispatcher, cssselect, parsel, queuelib, scrapy
      Running setup.py install for pycparser ... done
      Running setup.py install for PyDispatcher ... done
    Successfully installed PyDispatcher-2.0.5 attrs-16.2.0 cffi-1.9.1 cryptography-1.6 cssselect-1.0.0 idna-2.1 parsel-1.1.0 pyOpenSSL-16.2.0 pyasn1-0.1.9 pyasn1-modules-0.0.8 pycparser-2.17 queuelib-1.4.2 scrapy-1.2.1 service-identity-16.0.0 six-1.10.0 w3lib-1.16.0

Check whether Scrapy installed successfully: (run scrapy -h; if it prints information, the installation succeeded)

    C:\Users\AOBO>scrapy -h
    Scrapy 1.2.1 - no active project

    Usage:
      scrapy <command> [options] [args]

    Available commands:
      bench         Run quick benchmark test
      commands
      fetch         Fetch a URL using the Scrapy downloader
      genspider     Generate new spider using pre-defined templates
      runspider     Run a self-contained spider (without creating a project)
      settings      Get settings values
      shell         Interactive scraping console
      startproject  Create new project
      version       Print Scrapy version
      view          Open URL in browser, as seen by Scrapy

      [ more ]      More commands available when run from project directory

    Use "scrapy <command> -h" to see more info about a command

    C:\Users\AOBO>

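Check that all the libraries just installed are present:

Launch PyCharm and create a new project.

The libraries just installed should appear in the package list of the project's Python interpreter.

Installation succeeded.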
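The same check can be done without an IDE by trying to import each library. The sketch below is an illustration rather than the author's method; the helper names are made up, and note that the import names differ from the package names (pywin32 is imported as win32api, Twisted as twisted).

```python
import importlib

# Import names, not pip package names: pywin32 -> win32api, Twisted -> twisted.
MODULES = ["lxml", "twisted", "win32api", "scrapy"]


def check_module(name):
    """Try to import a module; return (ok, detail)."""
    try:
        module = importlib.import_module(name)
    except ImportError as exc:  # also covers "DLL load failed" errors
        return False, str(exc)
    return True, getattr(module, "__version__", "version unknown")


def report(names=MODULES):
    """Print one status line per module and return the names that failed."""
    failed = []
    for name in names:
        ok, detail = check_module(name)
        print("{:<10} {} ({})".format(name, "OK" if ok else "MISSING", detail))
        if not ok:
            failed.append(name)
    return failed
```

Running report() with the Python 3.5 interpreter prints one line per library and returns an empty list when all four imports succeed.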


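6 . A handy command-line window tool: PowerCmd

PowerCmd is an enhancement tool for Windows CMD.

Download and installation: http://www.aobosir.com/blog/2016/11/23/powercmd-install/

That said, the tool has real shortcomings: when I run a command such as scrapy -h in it, no output is printed, whereas the same command prints its output in a DOS window.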



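Testing the environment

1 . Run scrapy -h. If it prints information, Scrapy is installed successfully.

2 . Run scrapy bench. If it fails, the pywin32 library still has steps to complete. (For the error import win32api ImportError: DLL load failed, see here for the fix.)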
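The two checks above can also be scripted. The following sketch runs each command, captures its output, and reports the exit code. The run_check helper, the CHECKS list, and the 300-second timeout are assumptions for illustration; the third entry simply tries import win32api under the current interpreter.

```python
import shutil
import subprocess
import sys

# Commands from the steps above, plus a direct import test for pywin32.
CHECKS = [
    ["scrapy", "-h"],
    ["scrapy", "bench"],
    [sys.executable, "-c", "import win32api"],
]


def run_check(command, timeout=300):
    """Run a command and return (exit code, combined stdout/stderr text).

    Returns (None, message) when the executable cannot be found on PATH.
    """
    if shutil.which(command[0]) is None:
        return None, "{} not found on PATH".format(command[0])
    result = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
        timeout=timeout,
    )
    return result.returncode, result.stdout
```

Looping over CHECKS and calling run_check on each entry gives an exit code of 0 for every command that succeeds; a non-zero code from the import line points at the pywin32 problem mentioned above.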


Next, we will learn the Scrapy commands. After that, we will cover creating a Scrapy crawler project and its spiders, with examples that crawl Baidu page titles and CSDN blogs.


