Saving Data to MySQL with the Scrapy Crawler Framework
A simple Scrapy crawler demo
GitHub repository: https://github.com/lawlite19/PythonCrawler-Scrapy-Mysql-File-Template
It uses the Scrapy crawler framework to save the scraped data both to a MySQL database and to a file.
settings.py
- Set the MySQL connection details

```python
# MySQL database configuration
MYSQL_HOST = '127.0.0.1'
MYSQL_DBNAME = 'testdb'    # database name; change as needed
MYSQL_USER = 'root'        # database user; change as needed
MYSQL_PASSWD = '123456'    # database password; change as needed
MYSQL_PORT = 3306          # database port; used in dbhelper
```
- Register the pipelines

```python
ITEM_PIPELINES = {
    'webCrawler_scrapy.pipelines.WebcrawlerScrapyPipeline': 300,  # save to MySQL
    'webCrawler_scrapy.pipelines.JsonWithEncodingPipeline': 301,  # save to a file
}
```
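Scrapy runs the registered item pipelines in ascending order of the integer assigned to each one (0 to 1000; lower numbers run first). A small sketch of that ordering rule, using placeholder pipeline paths rather than the project's real ones:

```python
# Placeholder pipeline paths; only the integer priorities matter here.
ITEM_PIPELINES = {
    'myproject.pipelines.MysqlPipeline': 300,
    'myproject.pipelines.JsonPipeline': 800,
}

def pipeline_order(pipelines):
    """Return the pipeline paths sorted the way Scrapy will run them:
    ascending by priority value."""
    return [path for path, _ in sorted(pipelines.items(), key=lambda kv: kv[1])]

print(pipeline_order(ITEM_PIPELINES))
```

So with the priorities above, `MysqlPipeline` sees every item before `JsonPipeline` does.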
items.py
- Declare the fields to be extracted

```python
class WebcrawlerScrapyItem(scrapy.Item):
    '''Defines the fields to extract (i.e. the fields saved to the database)'''
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()  # change these to the fields you need
    url = scrapy.Field()
```
pipelines.py
1. WebcrawlerScrapyPipeline, the class that saves to the database (declared in settings)

- Define a class method `from_settings` that reads the MySQL configuration from settings and builds the connection pool `dbpool`

```python
@classmethod
def from_settings(cls, settings):
    '''1. @classmethod declares a class method, as opposed to the usual instance method.
    2. Its first argument is cls (the class itself), whereas an instance method's
       first argument is self, an instance of the class.
    3. It can be called on the class itself, e.g. C.f(), like a static method in Java.'''
    dbparams = dict(
        host=settings['MYSQL_HOST'],  # read the configuration from settings
        db=settings['MYSQL_DBNAME'],
        user=settings['MYSQL_USER'],
        passwd=settings['MYSQL_PASSWD'],
        charset='utf8',  # set the charset, or Chinese text may come out garbled
        cursorclass=MySQLdb.cursors.DictCursor,
        use_unicode=False,
    )
    dbpool = adbapi.ConnectionPool('MySQLdb', **dbparams)  # ** expands the dict into keyword arguments (host=..., db=..., ...)
    return cls(dbpool)  # passes dbpool to the class, so it is reachable through self
```
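The `from_settings` pattern above is an ordinary classmethod factory: read what you need from settings, build the dependency, then call `cls(...)` so `__init__` receives it. A stripped-down sketch of just that pattern, with a plain dict standing in for Scrapy's settings object and a string standing in for the real connection pool:

```python
class SketchPipeline(object):
    """Stripped-down illustration of the classmethod-factory pattern."""
    def __init__(self, dbpool):
        self.dbpool = dbpool  # whatever from_settings built is stored here

    @classmethod
    def from_settings(cls, settings):
        # read the configuration, build the dependency, then call cls(...)
        dbpool = "pool(%s:%s)" % (settings['MYSQL_HOST'], settings['MYSQL_PORT'])
        return cls(dbpool)

p = SketchPipeline.from_settings({'MYSQL_HOST': '127.0.0.1', 'MYSQL_PORT': 3306})
print(p.dbpool)  # pool(127.0.0.1:3306)
```

Scrapy calls `from_settings` (or `from_crawler`) itself when instantiating the pipeline, which is why the real class never calls `__init__` directly.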
- `__init__` receives the connection pool `dbpool`

```python
def __init__(self, dbpool):
    self.dbpool = dbpool
```
- `process_item` is the method the pipeline framework calls for each item; it performs the database operation

```python
# called by the pipeline for every item
def process_item(self, item, spider):
    query = self.dbpool.runInteraction(self._conditional_insert, item)  # schedule the insert
    query.addErrback(self._handle_error, item, spider)  # attach the error handler
    return item
```
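`runInteraction` hands the callable a transaction object that behaves like a cursor, commits when the callable returns normally, and rolls back if it raises; the real adbapi version also runs in a thread pool and returns a Deferred. A synchronous sketch of that contract using the stdlib `sqlite3` module instead of MySQL:

```python
import sqlite3

def run_interaction(conn, interaction, *args):
    """Synchronous sketch of adbapi's runInteraction contract: pass a
    cursor to `interaction`, commit on success, roll back on error."""
    cur = conn.cursor()
    try:
        result = interaction(cur, *args)
        conn.commit()
        return result
    except Exception:
        conn.rollback()
        raise
    finally:
        cur.close()

conn = sqlite3.connect(':memory:')
conn.execute('create table testpictures (name text, url text)')
run_interaction(
    conn,
    lambda tx, item: tx.execute('insert into testpictures(name,url) values(?,?)',
                                (item['name'], item['url'])),
    {'name': 'pic1', 'url': 'http://example.com/1.jpg'})
print(conn.execute('select name from testpictures').fetchall())  # [('pic1',)]
```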
- The insert method `_conditional_insert`

```python
# write the item to the database
def _conditional_insert(self, tx, item):
    # print item['name']
    sql = "insert into testpictures(name,url) values(%s,%s)"
    params = (item["name"], item["url"])
    tx.execute(sql, params)
```
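Passing `params` separately to `tx.execute`, instead of formatting the values into the SQL string yourself, lets the driver escape them and so avoids SQL injection. A small demonstration with `sqlite3`, which uses `?` placeholders where MySQLdb uses `%s`:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('create table testpictures (name text, url text)')

# The driver escapes the bound values, so quotes in the data are harmless.
name = "o'reilly wallpaper"          # would break naive string formatting
url = 'http://example.com/1.jpg'
conn.execute('insert into testpictures(name,url) values(?,?)', (name, url))
conn.commit()

rows = conn.execute('select name, url from testpictures').fetchall()
print(rows)
```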
- The error handler `_handle_error`

```python
# error handler
def _handle_error(self, failure, item, spider):
    print failure
```
2. JsonWithEncodingPipeline, the class that saves to a file (declared in settings)

- Saves the items as a JSON file; this one is straightforward:

```python
class JsonWithEncodingPipeline(object):
    '''Pipeline that saves items to a file.
    1. Register it in settings.py.
    2. yield item in your spider class and it runs automatically.'''
    def __init__(self):
        self.file = codecs.open('info.json', 'w', encoding='utf-8')  # the output JSON file

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"  # serialize the item to JSON
        self.file.write(line)  # write it to the file
        return item

    def spider_closed(self, spider):  # close the file when the spider finishes
        self.file.close()
```
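One detail worth knowing here: by default `json.dumps` escapes non-ASCII characters (`ensure_ascii=True`), so Chinese text in an item would be written as `\uXXXX` sequences; passing `ensure_ascii=False` keeps it readable. A sketch of the write/read-back cycle for the JSON-lines file (Python 3 here, while the project itself is Python 2; an in-memory buffer stands in for `info.json`):

```python
import io
import json

items = [{'name': '风景壁纸', 'url': 'http://example.com/1.jpg'},
         {'name': 'pic2', 'url': 'http://example.com/2.jpg'}]

# write one JSON object per line, keeping non-ASCII text readable
buf = io.StringIO()
for item in items:
    buf.write(json.dumps(item, ensure_ascii=False) + '\n')

# reading the file back is one json.loads per line
lines = buf.getvalue().splitlines()
restored = [json.loads(line) for line in lines]
print(restored == items)  # True
```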
dbhelper.py
- A hand-written helper class for MySQL operations
- `__init__` reads the configuration from settings

```python
def __init__(self):
    self.settings = get_project_settings()  # read the needed values from settings
    self.host = self.settings['MYSQL_HOST']
    self.port = self.settings['MYSQL_PORT']
    self.user = self.settings['MYSQL_USER']
    self.passwd = self.settings['MYSQL_PASSWD']
    self.db = self.settings['MYSQL_DBNAME']
```
- Connect to the MySQL server

```python
# connect to the MySQL server, not to a specific database
def connectMysql(self):
    conn = MySQLdb.connect(host=self.host,
                           port=self.port,
                           user=self.user,
                           passwd=self.passwd,
                           # db=self.db,  # no database name here
                           charset='utf8')  # set the charset, or Chinese text may come out garbled
    return conn
```
- Connect to the database named in settings (MYSQL_DBNAME)

```python
# connect to a specific database (MYSQL_DBNAME from settings)
def connectDatabase(self):
    conn = MySQLdb.connect(host=self.host,
                           port=self.port,
                           user=self.user,
                           passwd=self.passwd,
                           db=self.db,
                           charset='utf8')  # set the charset, or Chinese text may come out garbled
    return conn
```
- Create the database (the name configured in settings)

```python
# create the database
def createDatabase(self):
    '''The database name comes from MYSQL_DBNAME in settings,
    so there is no need to pass in a SQL statement.'''
    conn = self.connectMysql()  # connect to the server
    sql = "create database if not exists " + self.db
    cur = conn.cursor()
    cur.execute(sql)  # run the statement
    cur.close()
    conn.close()
```
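Note that `createDatabase` builds its statement by string concatenation: database names are identifiers, and identifiers cannot be bound as `%s` parameters. That is fine while the name only ever comes from settings, but if it could come from anywhere else it is worth validating first. A possible sketch (the `safe_identifier` helper is hypothetical, not part of the project):

```python
import re

def safe_identifier(name):
    """Allow only letters, digits and underscores, starting with a letter
    or underscore, so the name can be concatenated into DDL safely."""
    if not re.match(r'^[A-Za-z_][A-Za-z0-9_]*$', name):
        raise ValueError('unsafe database name: %r' % name)
    return name

sql = 'create database if not exists ' + safe_identifier('testdb')
print(sql)  # create database if not exists testdb
```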
- A few more methods run arbitrary database operations; just pass in the SQL statement and its parameters (see the code for details).
Implementing the spider (pictureSpider_demo.py in the template)

- Inherit from `scrapy.spiders.Spider` and declare three attributes

```python
name = "webCrawler_scrapy"  # the spider's name, used with the `scrapy crawl` command
allowed_domains = ["desk.zol.com.cn"]  # the domains the spider may crawl; requests outside them are dropped
start_urls = ["http://desk.zol.com.cn/fengjing/1920x1080/1.html"]  # the URL crawling starts from
```
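`allowed_domains` is enforced by Scrapy's OffsiteMiddleware, which keeps a request when its host equals a listed domain or is a subdomain of one, and drops it otherwise. Roughly the following check (a simplified sketch, not Scrapy's actual implementation):

```python
from urllib.parse import urlparse

def is_offsite(url, allowed_domains):
    """Simplified version of the allowed_domains check: a request is kept
    when its host equals an allowed domain or is a subdomain of one."""
    host = urlparse(url).hostname or ''
    return not any(host == d or host.endswith('.' + d) for d in allowed_domains)

allowed = ['desk.zol.com.cn']
print(is_offsite('http://desk.zol.com.cn/fengjing/1920x1080/2.html', allowed))  # False
print(is_offsite('http://www.example.com/page.html', allowed))  # True
```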
- Implement the `parse` method; the name must stay `parse`, because Scrapy uses it as the default callback

```python
def parse(self, response):
```
- Return the item

```python
item = WebcrawlerScrapyItem()  # instantiate the item class and fill the fields declared earlier
item['name'] = file_name
item['url'] = realUrl
print item["name"], item["url"]
yield item  # yield the item; the pipelines then process it automatically
```
Testing

- Test DBHelper: it creates the testdb database and the testtable table.
- Test the spider: `scrapy crawl webCrawler_scrapy`

Running the spider saves the scraped images locally and writes each name and url to the database.