Saving Scraped Data to a MySQL Database with the Scrapy Framework

Source: Internet · Editor: 程序博客网 (Programmer Blog Network) · Date: 2024/05/21 21:42

A Simple Scrapy Crawler Demo

GitHub repository: https://github.com/lawlite19/PythonCrawler-Scrapy-Mysql-File-Template
This template uses the Scrapy crawler framework to save scraped data both to a MySQL database and to a file.

settings.py

  • Edit the MySQL configuration
```python
# MySQL database configuration
MYSQL_HOST = '127.0.0.1'
MYSQL_DBNAME = 'testdb'    # database name, change as needed
MYSQL_USER = 'root'        # database user, change as needed
MYSQL_PASSWD = '123456'    # database password, change as needed
MYSQL_PORT = 3306          # database port, used in dbhelper
```
  • Register the pipelines
```python
ITEM_PIPELINES = {
    'webCrawler_scrapy.pipelines.WebcrawlerScrapyPipeline': 300,  # save to the MySQL database
    'webCrawler_scrapy.pipelines.JsonWithEncodingPipeline': 300,  # save to a file
}
```
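A note on the numbers: Scrapy runs item pipelines in ascending order of these integer values (conventionally 0-1000), so giving both pipelines the same value 300 leaves their relative order unspecified; distinct values such as 300 and 301 make the order explicit. A minimal sketch of how the values determine the run order:

```python
# Illustrative only: how Scrapy-style priority values order pipelines
# (using 301 for the JSON pipeline to make the ordering unambiguous).
ITEM_PIPELINES = {
    'webCrawler_scrapy.pipelines.WebcrawlerScrapyPipeline': 300,
    'webCrawler_scrapy.pipelines.JsonWithEncodingPipeline': 301,
}

# Lower value = runs earlier.
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
print(run_order[0])  # the MySQL pipeline runs first
```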

items.py

  • Declare the fields to be extracted
```python
class WebcrawlerScrapyItem(scrapy.Item):
    '''Define the content to extract (i.e. the fields to save to the database)'''
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()   # change to the fields you need
    url = scrapy.Field()
```

pipelines.py

1. WebcrawlerScrapyPipeline: saving to the database (registered in settings)

  • Define a class method from_settings that reads the MySQL configuration from settings and builds the connection pool dbpool
```python
    @classmethod
    def from_settings(cls, settings):
        '''1. @classmethod declares a class method, as opposed to the usual instance method.
           2. A class method's first argument is cls (short for "class", the class itself),
              while an instance method's first argument is self, an instance of the class.
           3. It can be called on the class directly, like C.f(), similar to a static
              method in Java.'''
        dbparams = dict(
            host=settings['MYSQL_HOST'],  # read the configuration from settings
            db=settings['MYSQL_DBNAME'],
            user=settings['MYSQL_USER'],
            passwd=settings['MYSQL_PASSWD'],
            charset='utf8',  # set the charset, or Chinese text may come out garbled
            cursorclass=MySQLdb.cursors.DictCursor,
            use_unicode=False,
        )
        dbpool = adbapi.ConnectionPool('MySQLdb', **dbparams)  # ** expands the dict into keyword arguments, i.e. host=..., db=...
        return cls(dbpool)  # pass dbpool to the constructor, so it is available via self
```
  • __init__ receives the connection pool dbpool
```python
    def __init__(self, dbpool):
        self.dbpool = dbpool
```
  • process_item is the method the pipeline framework calls for every item; it dispatches the database operation
```python
    # called by the pipeline framework for every item
    def process_item(self, item, spider):
        query = self.dbpool.runInteraction(self._conditional_insert, item)  # schedule the insert
        query.addErrback(self._handle_error, item, spider)  # attach the error handler
        return item
```
  • The insert method _conditional_insert
```python
    # write to the database
    def _conditional_insert(self, tx, item):
        # print(item['name'])
        sql = "insert into testpictures(name,url) values(%s,%s)"
        params = (item["name"], item["url"])
        tx.execute(sql, params)
```
  • The error handler _handle_error
```python
    # error handler
    def _handle_error(self, failure, item, spider):
        print(failure)
```

2. JsonWithEncodingPipeline: saving to a file (registered in settings)

  • Saving to a JSON-format file is straightforward:
```python
class JsonWithEncodingPipeline(object):
    '''Pipeline that saves items to a file.
       1. Register it in settings.py.
       2. yield item in your spider class and it runs automatically.'''
    def __init__(self):
        self.file = codecs.open('info.json', 'w', encoding='utf-8')  # save as a JSON file

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"  # serialize to JSON
        self.file.write(line)  # write to the file
        return item

    def close_spider(self, spider):  # close the file when the spider finishes
        self.file.close()
```
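One nuance worth noting (not covered in the post): json.dumps escapes non-ASCII characters by default, so Chinese text is written as \uXXXX sequences even though the file is opened as UTF-8. Passing ensure_ascii=False keeps the characters readable. A standalone illustration:

```python
import json

# Sample item dict with Chinese text (values are illustrative).
item = {"name": "风景壁纸", "url": "http://desk.zol.com.cn/fengjing/1.jpg"}

escaped = json.dumps(item)                       # default: Chinese escaped as \uXXXX
readable = json.dumps(item, ensure_ascii=False)  # keeps the characters as-is

print(escaped)   # {"name": "\u98ce\u666f\u58c1\u7eb8", ...}
print(readable)  # {"name": "风景壁纸", ...}
```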

dbhelper.py

  • A hand-rolled class for working with the MySQL database
  • The __init__ method reads the configuration from the settings file
```python
    def __init__(self):
        self.settings = get_project_settings()  # read the project settings
        self.host = self.settings['MYSQL_HOST']
        self.port = self.settings['MYSQL_PORT']
        self.user = self.settings['MYSQL_USER']
        self.passwd = self.settings['MYSQL_PASSWD']
        self.db = self.settings['MYSQL_DBNAME']
```
  • Connect to MySQL
```python
    # connect to the MySQL server, not to a specific database
    def connectMysql(self):
        conn = MySQLdb.connect(host=self.host,
                               port=self.port,
                               user=self.user,
                               passwd=self.passwd,
                               # db=self.db,  (no database name here)
                               charset='utf8')  # set the charset, or Chinese text may be garbled
        return conn
```
  • Connect to the database named in settings (MYSQL_DBNAME)
```python
    # connect to the specific database (MYSQL_DBNAME from settings)
    def connectDatabase(self):
        conn = MySQLdb.connect(host=self.host,
                               port=self.port,
                               user=self.user,
                               passwd=self.passwd,
                               db=self.db,
                               charset='utf8')  # set the charset, or Chinese text may be garbled
        return conn
```
  • Create the database (the name configured in the settings file)
```python
    # create the database
    def createDatabase(self):
        '''To create a different database, just change MYSQL_DBNAME in settings;
           no SQL statement needs to be passed in.'''
        conn = self.connectMysql()  # connect to the MySQL server
        sql = "create database if not exists " + self.db
        cur = conn.cursor()
        cur.execute(sql)  # execute the statement
        cur.close()
        conn.close()
```
  • The remaining database helper methods just take an SQL statement and parameters (see the code for details)
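Those generic helpers are not reproduced in the post. As a hypothetical sketch (the function name build_insert and its shape are my own, not from the template), the kind of parameterized statement they work with can be assembled like this, keeping values out of the SQL string so the driver escapes them:

```python
def build_insert(table, data):
    """Build a parameterized INSERT statement plus its parameter tuple.

    Illustrative helper, not part of the template: this is the shape of
    SQL that methods like _conditional_insert pass to execute().
    """
    columns = ", ".join(data)                       # column names from the dict keys
    placeholders = ", ".join(["%s"] * len(data))    # one %s placeholder per value
    sql = "insert into %s(%s) values(%s)" % (table, columns, placeholders)
    return sql, tuple(data.values())

sql, params = build_insert("testpictures",
                           {"name": "wallpaper", "url": "http://example.com/1.jpg"})
print(sql)     # insert into testpictures(name, url) values(%s, %s)
print(params)  # ('wallpaper', 'http://example.com/1.jpg')
```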

Implementing the spider itself (pictureSpider_demo.py in the template)

  • Inherit from scrapy.spiders.Spider
  • Declare three attributes
```python
    name = "webCrawler_scrapy"    # the spider's name, used with the scrapy crawl command
    allowed_domains = ["desk.zol.com.cn"]  # the domains the spider may crawl; pages outside them are skipped
    start_urls = ["http://desk.zol.com.cn/fengjing/1920x1080/1.html"]   # the URLs to start crawling from
```
  • Implement the parse method. The name cannot be changed: in the Scrapy source, parse is the default callback for the start URLs
```python
    def parse(self, response):
```
  • Yield the item
```python
    item = WebcrawlerScrapyItem()  # instantiate the item class and fill the fields declared earlier
    item['name'] = file_name
    item['url'] = realUrl
    print(item["name"], item["url"])
    yield item  # yield the item; the configured pipelines then process it automatically
```

Testing

  • Testing DBHelper
    Create the testdb database and the testtable table

    ![Create the testdb database and the testtable table][1]
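The post does not show the table definition itself. Judging from the INSERT in _conditional_insert, which targets a testpictures table with name and url columns, the schema presumably looks something like the following; the id column, types, and sizes are assumptions:

```sql
CREATE DATABASE IF NOT EXISTS testdb DEFAULT CHARACTER SET utf8;

USE testdb;

CREATE TABLE IF NOT EXISTS testpictures (
    id   INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(255),
    url  VARCHAR(255)
);
```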

  • Testing the spider
    Run scrapy crawl webCrawler_scrapy; the crawled images are saved locally, and each name and url pair is saved to the database.

