Learning Scrapy with Python 2.7
Prerequisite: pip and Scrapy installed (see a pip/Scrapy installation tutorial)
Environment: Windows 10 + Python 2.7 + WAMP (provides the MySQL server) + Navicat for MySQL (used to manage MySQL)
A first Scrapy example (crawling Qiushibaike jokes)
1. Install PyMySQL: pip install PyMySQL
2. Using Navicat for MySQL (or a similar tool), create a database "news" and a table "qiubai":
```sql
CREATE TABLE `qiubai` (
  `name` varchar(255) DEFAULT NULL,
  `news_id` varchar(20) DEFAULT '',
  `url` varchar(255) DEFAULT NULL,
  `text_content` varchar(255) DEFAULT NULL,
  `has_image` varchar(1) DEFAULT NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
```
3. Create the project (run in a cmd window on the Desktop): scrapy startproject tutorial
4. A tutorial folder appears on the Desktop:
```
tutorial/
    scrapy.cfg
    tutorial/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...
```
These files are:
scrapy.cfg: the project's configuration file
tutorial/: the project's Python module.
tutorial/items.py: the project's item definitions.
tutorial/pipelines.py: the project's pipelines.
tutorial/settings.py: the project's settings.
tutorial/spiders/: the directory holding the spider code.
5. Define the items to crawl:
Edit the items.py file in the tutorial directory:
```python
# items.py
import scrapy

class TutorialItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    news_id = scrapy.Field()
    url = scrapy.Field()
    #title = scrapy.Field()
    text_content = scrapy.Field()
    #key_words = scrapy.Field()
    has_image = scrapy.Field()
```
6. Add a rotating User-Agent
Edit the middlewares.py file in the tutorial directory (append the following code):
```python
import random
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware

class RotateUserAgentMiddleware(UserAgentMiddleware):
    def __init__(self, user_agent=''):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        ua = random.choice(self.user_agent_list)
        if ua:
            print ua, '-----------------'
            request.headers.setdefault('User-Agent', ua)

    # the default user_agent_list covers Chrome, IE, Firefox, Mozilla, Opera, Netscape
    # for more user agent strings, see http://www.useragentstring.com/pages/useragentstring.php
    user_agent_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
    ]
```
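The rotation above boils down to random.choice over a list of strings. A minimal standalone sketch of that selection step (pick_user_agent is a hypothetical helper, not part of Scrapy):

```python
import random

# Hypothetical standalone version of the selection step in
# RotateUserAgentMiddleware: choose one UA string at random per request.
user_agent_list = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
]

def pick_user_agent(agents):
    # Mirrors the "ua = random.choice(...); if ua:" guard in the middleware.
    return random.choice(agents) if agents else None

print(pick_user_agent(user_agent_list) in user_agent_list)  # True
print(pick_user_agent([]))  # None
```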
7. Insert the items into the database
Edit the pipelines.py file in the tutorial directory (append the following code):
```python
import pymysql

def dbHandle():
    conn = pymysql.connect(
        host='localhost',
        port=3306,
        db='news',
        user='root',
        passwd='',
        charset='utf8',
    )
    return conn

class TutorialPipeline(object):
    def __init__(self):
        self.dbObject = dbHandle()
        self.cursor = self.dbObject.cursor()

    def process_item(self, item, spider):
        sql = 'insert into qiubai(name,news_id,url,text_content,has_image) values (%s,%s,%s,%s,%s)'
        try:
            self.cursor.execute(sql, (
                item['name'].encode("utf-8"),
                (item['news_id'].encode("utf-8"))[7:15],
                ("http://www.qiushibaike.com" + item['url'].encode("utf-8")),
                item['text_content'].encode("utf-8"),
                item['has_image'].encode("utf-8")))
            self.dbObject.commit()
        except Exception, e:
            print e
            self.dbObject.rollback()
        return item

    def __del__(self):
        self.cursor.close()
        self.dbObject.close()
```
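The same parameterized-insert pattern can be tried without a MySQL server using the standard library's sqlite3 against an in-memory database. This is an illustration only (the item dict is made-up sample data; note that sqlite3 uses ? placeholders where pymysql uses %s):

```python
import sqlite3

# In-memory stand-in for the MySQL "qiubai" table.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""CREATE TABLE qiubai (
    name TEXT, news_id TEXT, url TEXT, text_content TEXT, has_image TEXT)""")

# Made-up item mimicking what the spider yields.
item = {"name": "someone", "news_id": "/users/12345678", "url": "/article/1",
        "text_content": "hello", "has_image": "0"}

# Same transformations as the pipeline: slice the user id out of the
# href and prefix the site domain onto the relative article URL.
cur.execute("INSERT INTO qiubai VALUES (?,?,?,?,?)",
            (item["name"], item["news_id"][7:15],
             "http://www.qiushibaike.com" + item["url"],
             item["text_content"], item["has_image"]))
conn.commit()
print(cur.execute("SELECT count(*) FROM qiubai").fetchone()[0])  # 1
```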
8. Register the pipeline and middleware
Edit the settings.py file in the tutorial directory (append the following code):
```python
ITEM_PIPELINES = {
    'tutorial.pipelines.TutorialPipeline': 300,
}
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'tutorial.middlewares.RotateUserAgentMiddleware': 400,
}
FEED_EXPORT_ENCODING = 'utf-8'
```
9. Create the spider
Create a new file myspider.py in the tutorial/spiders directory and add the following code:
```python
# encoding=utf8
import scrapy
from tutorial.items import TutorialItem

class DmozSpider(scrapy.Spider):
    name = "qiubai"
    start_urls = [
        "http://www.qiushibaike.com/"
    ]

    def parse(self, response):
        qiubai = TutorialItem()
        for item in response.xpath('//div[@id="content-left"]/div[@class="article block untagged mb15"]'):
            name = item.xpath('./div[@class="author clearfix"]/a[2]/h2/text()').extract()
            if name:
                qiubai['name'] = name[0]
            news_id = item.xpath('./div[@class="author clearfix"]/a[1]/@href').extract()
            if news_id:
                qiubai['news_id'] = news_id[0]
            url = item.xpath('./a[@class="contentHerf"]/@href').extract()
            if url:
                qiubai['url'] = url[0]
            text_content = item.xpath('./a[@class="contentHerf"]/div/span/text()').extract()
            if text_content:
                qiubai['text_content'] = text_content[0]
            has_image = item.xpath('./div[@class="thumb"]/a/img[@src]').extract()
            if has_image:
                qiubai['has_image'] = '1'
            else:
                qiubai['has_image'] = '0'
            yield qiubai
```
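Every field in parse() is guarded with an if because .extract() returns a (possibly empty) list of matches. That pattern, and the '1'/'0' has_image flag, can be sketched in plain Python (first_or_none and image_flag are hypothetical names used only for illustration):

```python
# .extract() yields a list of matches; take the first one or skip the field.
def first_or_none(results):
    return results[0] if results else None

# has_image is stored as '1' or '0' depending on whether any <img> matched.
def image_flag(results):
    return '1' if results else '0'

print(first_or_none(["some joke text"]))   # some joke text
print(first_or_none([]))                   # None
print(image_flag(["<img src='x.jpg'>"]))   # 1
print(image_flag([]))                      # 0
```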
That completes the spider. There are two ways to run it:
Option 1: in a cmd window inside tutorial, run scrapy crawl qiubai
or scrapy crawl qiubai -o items.json
(the former just runs the spider; the latter also saves the scraped data to items.json)
(qiubai is the name defined in step 9; it uniquely identifies the spider, so don't duplicate it)
Option 2: PyCharm users can create a file runspider.py under tutorial and fill it with the code below:
```python
# encoding=utf8
from scrapy import cmdline

cmdline.execute("scrapy crawl qiubai".split())
```
Then the spider can be run directly from PyCharm.
See also: the Scrapy getting-started manual.
A few notes:
Rotating the User-Agent guards against basic anti-crawling checks; ideally also set request headers.
Each name = scrapy.Field() line in items.py corresponds to one field you want to crawl.
settings.py is where you register the classes you have written.
In myspider.py you can either use start_urls or override def start_requests(self) (choose one).
A few quick examples for response.xpath():
/html/head/title: selects the <title> element under the document's <head>.
/html/head/title/text(): selects the text inside that <title> element.
//td: selects all <td> elements.
//div[@class="mine"]: selects all <div> elements that have class="mine".
More details can be found in the Scrapy selectors documentation.
Recommendation: the Firebug extension for Firefox makes it easy to inspect page markup.
PS: parts of the code above come from the web; thanks to those who shared it.
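Scrapy evaluates full XPath via lxml, but the basic patterns above can be tried with the limited XPath subset in the standard library's xml.etree.ElementTree (the sample markup here is made up for the demo):

```python
import xml.etree.ElementTree as ET

# Tiny made-up document to exercise the selectors listed above.
doc = ET.fromstring(
    '<html><head><title>Hi</title></head>'
    '<body><div class="mine">a</div><div class="other">b</div></body></html>'
)

print(doc.find('head/title').text)                  # Hi
print(len(doc.findall(".//div[@class='mine']")))    # 1
print(len(doc.findall('.//div')))                   # 2
```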
Addendum:
The code above only crawls the first page; the following adds crawling a specified number of pages:
1. Replace myspider.py with the code below:
```python
# encoding=utf8
import scrapy
from scrapy.http.request import Request
from tutorial.items import TutorialItem

class DmozSpider(scrapy.Spider):
    allowed_domains = ["qiushibaike.com"]
    name = "qiubai"
    start_urls = [
        "http://www.qiushibaike.com/"
    ]

    def start_requests(self):
        for url in self.start_urls:
            yield Request(url, callback=self.parse)

    def parse(self, response):
        qiubai = TutorialItem()
        for item in response.xpath('//div[@id="content-left"]/div[@class="article block untagged mb15"]'):
            name = item.xpath('./div[@class="author clearfix"]/a[2]/h2/text()').extract()
            if name:
                qiubai['name'] = name[0]
            news_id = item.xpath('./div[@class="author clearfix"]/a[1]/@href').extract()
            if news_id:
                qiubai['news_id'] = news_id[0]
            url = item.xpath('./a[@class="contentHerf"]/@href').extract()
            if url:
                qiubai['url'] = url[0]
            text_content = item.xpath('./a[@class="contentHerf"]/div/span/text()').extract()
            if text_content:
                qiubai['text_content'] = text_content[0]
            has_image = item.xpath('./div[@class="thumb"]/a/img[@src]').extract()
            if has_image:
                qiubai['has_image'] = '1'
            else:
                qiubai['has_image'] = '0'
            yield qiubai
        # queue up pages 2 through 9
        for x in range(2, 10):
            page = "http://www.qiushibaike.com/8hr/page/" + str(x)
            yield Request(page, callback=self.sub_parse)

    def sub_parse(self, response):
        qiubai = TutorialItem()
        for item in response.xpath('//div[@id="content-left"]/div[@class="article block untagged mb15"]'):
            name = item.xpath('./div[@class="author clearfix"]/a[2]/h2/text()').extract()
            if name:
                qiubai['name'] = name[0]
            news_id = item.xpath('./div[@class="author clearfix"]/a[1]/@href').extract()
            if news_id:
                qiubai['news_id'] = news_id[0]
            url = item.xpath('./a[@class="contentHerf"]/@href').extract()
            if url:
                qiubai['url'] = url[0]
            text_content = item.xpath('./a[@class="contentHerf"]/div/span/text()').extract()
            if text_content:
                qiubai['text_content'] = text_content[0]
            has_image = item.xpath('./div[@class="thumb"]/a/img[@src]').extract()
            if has_image:
                qiubai['has_image'] = '1'
            else:
                qiubai['has_image'] = '0'
            yield qiubai
```
2. Add the following to settings.py:

```python
DOWNLOAD_DELAY = 1  # delay between page requests, in seconds
```