(Reposted from) http://www.aobosir.com/blog/2016/11/26/python3-large-web-crawler-001-Build-development-environment/
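Preface
There are many ways to develop a Python crawler. In terms of program complexity, they fall into two kinds: crawler projects and crawler files (single scripts). Some readers have probably used Python's urllib module. It is very convenient for writing single-file crawlers, but on large projects both development efficiency and program stability tend to fall short. Scrapy is a Python crawling framework; it speeds up development and is well suited to medium and large crawler projects. In short, urllib is better suited to crawler files, and Scrapy is better suited to crawler projects.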
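To make the idea of a "crawler file" concrete, here is a minimal sketch of a single-file crawler built on urllib. The function names `extract_title` and `fetch_title` are illustrative only and do not come from the original post; the regular expression is a rough approach that is adequate for simple pages but is not a full HTML parser.

```python
import re
import urllib.request

# Rough pattern for the first <title> element (an assumption for illustration,
# not a robust HTML parser).
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def extract_title(html):
    """Return the text of the first <title> element, or None if there is none."""
    match = TITLE_RE.search(html)
    return match.group(1).strip() if match else None


def fetch_title(url, timeout=10):
    """Download a page with urllib and return its title (needs network access)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        html = response.read().decode(charset, errors="replace")
    return extract_title(html)


if __name__ == "__main__":
    # Offline demonstration on a hard-coded page.
    sample = "<html><head><title> Hello crawler </title></head></html>"
    print(extract_title(sample))
```

A script like this is quick to write, but retries, scheduling, deduplication, and pipelines all have to be added by hand, which is the kind of work a framework such as Scrapy takes over in a larger project.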
This series explains how to build crawler projects. This post is the first in the series: setting up the development environment.
1. Install Python 3
Download it from the official website; a Python 3.5 release is fine. The installer needs no special configuration.
By default, Python 3 is installed in: C:\Users\[Username]\AppData\Local\Programs\Python\Python35
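As a quick check that the interpreter you are running is the one just installed, the following sketch prints its path, version, and whether it is a 64-bit build. The helper name `describe_interpreter` is hypothetical; it uses only the standard `sys` module.

```python
import sys


def describe_interpreter():
    """Collect basic facts about the Python interpreter running this script."""
    return {
        "executable": sys.executable,  # full path to python.exe
        "version": "{}.{}.{}".format(*sys.version_info[:3]),
        "is_64bit": sys.maxsize > 2**32,  # True on 64-bit builds
    }


if __name__ == "__main__":
    for key, value in sorted(describe_interpreter().items()):
        print("{}: {}".format(key, value))
```

The 64-bit flag matters later, when choosing between the win32 and win_amd64 builds of the third-party packages.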
2. Install a Python IDE: PyCharm 2016.1.4
Download: https://www.jetbrains.com/pycharm/download/#section=windows
Notes:
Professional is the full edition, but it requires a license key.
Registration method: http://blog.csdn.net/tianzhaixing2013/article/details/44997881
This time I installed PyCharm 2016.
Community is the free edition, but in my setup its built-in Terminal could not be used.
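3. Install Visual Studio 2015
First, why Visual Studio is needed at all. (See this site.)
Without it, running pip install third-package-name sometimes fails with this error: error: Unable to find vcvarsall.bat
The error appears when pip has to compile a package's C extensions and cannot find the Visual C++ build tools, which is where vcvarsall.bat comes from.
Visual Studio 2015 is installed here in order to get Python Tools 2.2.5 for Visual Studio 2015, which comes with it.
How to download and install Visual Studio 2015 is described here.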
4. Upgrade pip
Run the following command in a DOS (Command Prompt) window to upgrade pip.
python -m pip install --upgrade pip
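5. Install some third-party libraries
lxml, Twisted, pywin32, scrapy
lxml is a library for processing XML quickly and flexibly. Twisted is an event-driven networking engine written in Python. pywin32 provides access to the Win32 API. Scrapy is an application framework for crawling websites and extracting structured data.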
We installed Python 3.5 and my machine is 64-bit, so download:
lxml-3.6.4-cp35-cp35m-win_amd64.whl
Twisted-16.5.0-cp35-cp35m-win_amd64.whl
pywin32-220.1-cp35-cp35m-win_amd64.whl
scrapy (install it directly with the command pip.exe install scrapy)
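The wheel file names above encode which interpreter and platform each file targets: `cp35` means CPython 3.5 and `win_amd64` means 64-bit Windows. The sketch below splits a wheel file name into its fields following the standard naming convention (name, version, optional build tag, Python tag, ABI tag, platform tag). The helpers `parse_wheel_name` and `current_cpython_tag` are hypothetical names introduced for illustration, and the tag comparison assumes a CPython interpreter.

```python
import sys


def parse_wheel_name(filename):
    """Split a wheel file name into its dash-separated fields."""
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel file: {}".format(filename))
    parts = filename[:-len(".whl")].split("-")
    # Five fields normally, six when an optional build tag is present.
    if len(parts) not in (5, 6):
        raise ValueError("unexpected wheel file name: {}".format(filename))
    return {
        "name": parts[0],
        "version": parts[1],
        "python_tag": parts[-3],
        "abi_tag": parts[-2],
        "platform_tag": parts[-1],
    }


def current_cpython_tag():
    """Python tag of the running interpreter, assuming it is CPython."""
    return "cp{}{}".format(sys.version_info[0], sys.version_info[1])


if __name__ == "__main__":
    info = parse_wheel_name("lxml-3.6.4-cp35-cp35m-win_amd64.whl")
    print(info)
    print("matches this interpreter:", info["python_tag"] == current_cpython_tag())
```

If the Python tag or the platform tag does not match your interpreter, pip will refuse to install the wheel, so it is worth checking before downloading.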
How to install third-party Python libraries: http://blog.csdn.net/github_35160620/article/details/52203682
Note: if Python 2 was already installed on your machine, Python 2 has its own pip and Python 3 has its own pip as well. So when you run pip install some-package-name on the DOS command line, which pip does the system use: the Python 2 one or the Python 3 one?
This post answers that question: http://www.aobosir.com/blog/2016/11/23/pip-install-python2-python3/
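One common way to sidestep the "which pip?" ambiguity is to invoke pip through a specific interpreter with `python -m pip`, the same form used for the upgrade command earlier. The sketch below only builds such a command line for whichever interpreter runs the script; `pip_command` is a hypothetical helper, not something from the linked post.

```python
import sys


def pip_command(*args):
    """Build a pip command line bound to the interpreter running this script."""
    # Using "-m pip" runs the pip that belongs to sys.executable, regardless of
    # which pip.exe happens to come first on PATH.
    return [sys.executable, "-m", "pip"] + list(args)


if __name__ == "__main__":
    print(" ".join(pip_command("install", "scrapy")))
```

The resulting list can be passed to `subprocess.call` if you want to run the installation from a script rather than typing it by hand.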
After downloading, this is how I installed them on my machine:
Install lxml:
C:\Users\AOBO>cd C:\Users\AOBO\AppData\Local\Programs\Python\Python35\Scripts
C:\Users\AOBO\AppData\Local\Programs\Python\Python35\Scripts>pip.exe install D:\software_install_package_win\python\some-Python-third-packages\lxml-3.6.4-cp35-cp35m-win_amd64.whl
Processing d:\software_install_package_win\python\some-python-third-packages\lxml-3.6.4-cp35-cp35m-win_amd64.whl
Installing collected packages: lxml
Successfully installed lxml-3.6.4
Install Twisted: (When the output reached Collecting constantly>=15.1 (from Twisted==16.5.0), it hung; it only continued after I pressed Ctrl+C. It then automatically downloaded three dependencies: constantly, incremental, and zope.interface.)
C:\Users\AOBO\AppData\Local\Programs\Python\Python35\Scripts>pip.exe install D:\software_install_package_win\python\some-Python-third-packages\Twisted-16.5.0-cp35-cp35m-win_amd64.whl
Processing d:\software_install_package_win\python\some-python-third-packages\twisted-16.5.0-cp35-cp35m-win_amd64.whl
Collecting constantly>=15.1 (from Twisted==16.5.0)   # (it hung here until I pressed Ctrl+C, then continued and downloaded the three dependencies below)
  Downloading constantly-15.1.0-py2.py3-none-any.whl
Collecting incremental>=16.10.1 (from Twisted==16.5.0)
  Downloading incremental-16.10.1-py2.py3-none-any.whl
Collecting zope.interface>=4.0.2 (from Twisted==16.5.0)
  Downloading zope.interface-4.3.2-cp35-cp35m-win_amd64.whl (136kB)
    100% |████████████████████████████████| 143kB 7.1kB/s
Requirement already satisfied: setuptools in c:\users\aobo\appdata\local\programs\python\python35\lib\site-packages (from zope.interface>=4.0.2->Twisted==16.5.0)
Installing collected packages: constantly, incremental, zope.interface, Twisted
Successfully installed Twisted-16.5.0 constantly-15.1.0 incremental-16.10.1 zope.interface-4.3.2
Install pywin32:
C:\Users\AOBO\AppData\Local\Programs\Python\Python35\Scripts>pip.exe install D:\software_install_package_win\python\some-Python-third-packages\pywin32-220.1-cp35-cp35m-win_amd64.whl
Processing d:\software_install_package_win\python\some-python-third-packages\pywin32-220.1-cp35-cp35m-win_amd64.whl
Installing collected packages: pywin32
Successfully installed pywin32-220.1
Install Scrapy:
C:\Users\AOBO\AppData\Local\Programs\Python\Python35\Scripts>pip.exe install scrapy
Collecting scrapy
  Downloading Scrapy-1.2.1-py2.py3-none-any.whl (294kB)
    100% |████████████████████████████████| 296kB 338kB/s
Collecting service-identity (from scrapy)
  Downloading service_identity-16.0.0-py2.py3-none-any.whl
Collecting six>=1.5.2 (from scrapy)
  Downloading six-1.10.0-py2.py3-none-any.whl
Collecting w3lib>=1.15.0 (from scrapy)
  Downloading w3lib-1.16.0-py2.py3-none-any.whl
Collecting PyDispatcher>=2.0.5 (from scrapy)
  Downloading PyDispatcher-2.0.5.tar.gz
Requirement already satisfied: Twisted>=10.0.0 in c:\users\aobo\appdata\local\programs\python\python35\lib\site-packages (from scrapy)
Requirement already satisfied: lxml in c:\users\aobo\appdata\local\programs\python\python35\lib\site-packages (from scrapy)
Collecting cssselect>=0.9 (from scrapy)
  Downloading cssselect-1.0.0-py2.py3-none-any.whl
Collecting parsel>=0.9.3 (from scrapy)
  Downloading parsel-1.1.0-py2.py3-none-any.whl
Collecting queuelib (from scrapy)
  Downloading queuelib-1.4.2-py2.py3-none-any.whl
Collecting pyOpenSSL (from scrapy)
  Downloading pyOpenSSL-16.2.0-py2.py3-none-any.whl (43kB)
    100% |████████████████████████████████| 51kB 4.7MB/s
Collecting pyasn1 (from service-identity->scrapy)
  Downloading pyasn1-0.1.9-py2.py3-none-any.whl
Collecting pyasn1-modules (from service-identity->scrapy)
  Downloading pyasn1_modules-0.0.8-py2.py3-none-any.whl
Collecting attrs (from service-identity->scrapy)
  Downloading attrs-16.2.0-py2.py3-none-any.whl
Requirement already satisfied: constantly>=15.1 in c:\users\aobo\appdata\local\programs\python\python35\lib\site-packages (from Twisted>=10.0.0->scrapy)
Requirement already satisfied: zope.interface>=4.0.2 in c:\users\aobo\appdata\local\programs\python\python35\lib\site-packages (from Twisted>=10.0.0->scrapy)
Requirement already satisfied: incremental>=16.10.1 in c:\users\aobo\appdata\local\programs\python\python35\lib\site-packages (from Twisted>=10.0.0->scrapy)
Collecting cryptography>=1.3.4 (from pyOpenSSL->scrapy)
  Downloading cryptography-1.6-cp35-cp35m-win_amd64.whl (1.3MB)
    100% |████████████████████████████████| 1.3MB 257kB/s
Requirement already satisfied: setuptools in c:\users\aobo\appdata\local\programs\python\python35\lib\site-packages (from zope.interface>=4.0.2->Twisted>=10.0.0->scrapy)
Collecting cffi>=1.4.1 (from cryptography>=1.3.4->pyOpenSSL->scrapy)
  Downloading cffi-1.9.1-cp35-cp35m-win_amd64.whl (158kB)
    100% |████████████████████████████████| 163kB 322kB/s
Collecting idna>=2.0 (from cryptography>=1.3.4->pyOpenSSL->scrapy)
  Downloading idna-2.1-py2.py3-none-any.whl (54kB)
    100% |████████████████████████████████| 61kB 4.4MB/s
Collecting pycparser (from cffi>=1.4.1->cryptography>=1.3.4->pyOpenSSL->scrapy)
  Downloading pycparser-2.17.tar.gz (231kB)
    100% |████████████████████████████████| 235kB 311kB/s
Installing collected packages: six, pycparser, cffi, pyasn1, idna, cryptography, pyOpenSSL, pyasn1-modules, attrs, service-identity, w3lib, PyDispatcher, cssselect, parsel, queuelib, scrapy
  Running setup.py install for pycparser ... done
  Running setup.py install for PyDispatcher ... done
Successfully installed PyDispatcher-2.0.5 attrs-16.2.0 cffi-1.9.1 cryptography-1.6 cssselect-1.0.0 idna-2.1 parsel-1.1.0 pyOpenSSL-16.2.0 pyasn1-0.1.9 pyasn1-modules-0.0.8 pycparser-2.17 queuelib-1.4.2 scrapy-1.2.1 service-identity-16.0.0 six-1.10.0 w3lib-1.16.0
Check whether Scrapy was installed successfully: (run the command scrapy -h; if it prints its help text, the installation succeeded)
C:\Users\AOBO>scrapy -h
Scrapy 1.2.1 - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  commands
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

  [ more ]      More commands available when run from project directory

Use "scrapy <command> -h" to see more info about a command

C:\Users\AOBO>
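Check that all the libraries just installed are present:
Launch PyCharm and create a new project:
The newly installed libraries are listed here:
The installation succeeded.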
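The same check can be done without the IDE. The sketch below asks the interpreter whether each module can be located, without importing it. The helper `find_missing` is a hypothetical name; the module list assumes the usual import names for the four packages (`lxml`, `twisted`, `win32api` for pywin32, and `scrapy`).

```python
import importlib.util


def find_missing(module_names):
    """Return the names from module_names that the interpreter cannot locate."""
    return [
        name for name in module_names
        if importlib.util.find_spec(name) is None
    ]


if __name__ == "__main__":
    # Import names assumed for the packages installed in this post.
    wanted = ["lxml", "twisted", "win32api", "scrapy"]
    missing = find_missing(wanted)
    if missing:
        print("Not found: {}".format(", ".join(missing)))
    else:
        print("All modules found.")
```

Note that this only confirms the modules are on the search path; it does not prove that they import cleanly.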
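6. A command-line window tool: PowerCmd
PowerCmd is an enhanced replacement for the Windows CMD window.
Download and installation: http://www.aobosir.com/blog/2016/11/23/powercmd-install/
In practice this tool turned out to be rather poor: when I run a command such as scrapy -h in it, nothing is printed, whereas the same command prints its output in a DOS window.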
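Test the environment
1. Run scrapy -h. If it prints its help text, Scrapy was installed successfully.
2. Run scrapy bench. If it fails, the pywin32 library still needs some additional setup. (For the error import win32api ImportError: DLL load failed, see the fix here.)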
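To see the pywin32 problem directly rather than through scrapy bench, you can attempt the import yourself and capture the error message. This is a minimal sketch; `try_import` is a hypothetical helper, and on a machine without pywin32 it will simply report that the module does not exist.

```python
import importlib


def try_import(name):
    """Try to import a module; return (succeeded, error_message)."""
    try:
        importlib.import_module(name)
    except ImportError as exc:
        # A "DLL load failed" message here points at the pywin32 setup issue.
        return False, str(exc)
    return True, ""


if __name__ == "__main__":
    ok, message = try_import("win32api")
    print("win32api import ok" if ok else "win32api import failed: " + message)
```

Unlike a search-path lookup, an actual import also loads the extension DLLs, so it surfaces the failure mentioned in step 2.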
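Next, we will learn the Scrapy commands. After that, we will cover creating a Scrapy crawler project and its spiders, with an example that crawls Baidu page titles and CSDN blogs.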