Innovation Practicum 5.14: Learning Scrapy, Information Extraction
Source: Internet · Editor: 程序博客网 · Time: 2024/05/18 00:25
I spent the weekend learning how to install and use Scrapy, and ran into repeated trouble setting up the environment. Because my machine has both Python 2 and Python 3 installed, the pip command broke. Following a tutorial, I deleted pip.exe from the Python 3 installation, but running pip3 still failed with "Fatal error in launcher: Unable to create process using '"'". After a lot of searching, I finally worked around it by invoking pip through the interpreter directly: python3 -m pip install xxx.
For environment setup I followed the tutorial at http://www.cnblogs.com/wuxl360/p/5567065.html. A few things to note on Windows: OpenSSL must be installed; pywin32 is not required during installation, but Scrapy will not run without it; and the VC++ 14 build tools are also needed.
Once the environment worked, I studied the Scrapy and XPath tutorials:
http://scrapy-chs.readthedocs.io/zh_CN/1.0/intro/tutorial.html
http://www.w3school.com.cn/xpath/index.asp
After some debugging, I finished extracting HDU problem information and storing it in the database.
Notes:
- In settings.py, set ROBOTSTXT_OBEY = False, otherwise nothing gets crawled
- In settings.py, set ITEM_PIPELINES to register the pipeline you wrote
- Python's string-handling functions do most of the cleanup work
- Single quotes in the extracted text must be escaped
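The two settings.py notes above can be sketched as the following fragment. The pipeline path crawl.pipelines.SolPipeline matches this project's log output; the priority value 300 is my arbitrary choice (any integer from 0 to 1000, lower runs first):

```python
# settings.py -- minimal fragment for the notes above.
# Pipeline path taken from this project's log; priority 300 is an assumption.

ROBOTSTXT_OBEY = False  # if True, Scrapy honors robots.txt and skips HDU pages

ITEM_PIPELINES = {
    'crawl.pipelines.SolPipeline': 300,
}
```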
items.py (defines the problem Item to extract)
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class ProblemItem(scrapy.Item):
    originOj = scrapy.Field()
    problemId = scrapy.Field()
    problemUrl = scrapy.Field()
    title = scrapy.Field()
    timeLimit = scrapy.Field()
    memoryLimit = scrapy.Field()
    desc = scrapy.Field()
    input = scrapy.Field()
    output = scrapy.Field()
    sampleInput = scrapy.Field()
    sampleOutput = scrapy.Field()
    updateTime = scrapy.Field()
problem_spider.py (crawls the page and populates the Item)
from scrapy.spiders import Spider
from scrapy.selector import Selector
from datetime import datetime

from crawl.items import ProblemItem


class HduProblemSpider(Spider):
    name = 'hdu_problem'
    # allowed_domains = ['acm.hdu.edu.cn']
    problem_id = '1000'

    def __init__(self, problem_id='1005', *args, **kwargs):
        self.problem_id = problem_id
        super(HduProblemSpider, self).__init__(*args, **kwargs)
        self.start_urls = [
            'http://acm.hdu.edu.cn/showproblem.php?pid=%s' % problem_id
        ]

    def parse(self, response):
        sel = Selector(response)
        item = ProblemItem()
        item['originOj'] = 'hdu'
        item['problemId'] = self.problem_id
        item['problemUrl'] = response.url
        item['title'] = sel.xpath('//h1/text()').extract()[0]
        item['desc'] = sel.css('.panel_content').extract()[0]
        item['input'] = sel.css('.panel_content').extract()[1]
        item['output'] = sel.css('.panel_content').extract()[2]
        item['timeLimit'] = \
            sel.xpath('//b/span/text()').re('T[\S*\s]*S')[0][12:]
        item['memoryLimit'] = \
            sel.xpath('//b/span/text()').re('Me[\S*\s]*K')[0][14:]
        item['sampleInput'] = sel.xpath('//pre/div/text()').extract()[0]
        item['sampleOutput'] = sel.xpath('//pre/div/text()').extract()[1]
        item['updateTime'] = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        print("-------------------------------------------")
        print("desc : %s" % item['desc'])
        print("-------------------------------------------")
        return item
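The timeLimit/memoryLimit extraction above is a regex plus a fixed-width slice. The same trick can be checked offline with the standard re module; the two sample strings below are hand-written stand-ins for what the page's //b/span text nodes contain (the live HDU page may differ slightly):

```python
import re

# Hand-written stand-ins for the //b/span text nodes (assumed format).
spans = ["Time Limit: 2000/1000 MS (Java/Others)",
         "Memory Limit: 32768/32768 K (Java/Others)"]

# Same idea as the spider: greedily match from the capital T (or 'Me')
# up to the last capital S (or K), then slice off the fixed-width label
# ('Time Limit: ' is 12 chars, 'Memory Limit: ' is 14).
time_limit = [m for s in spans for m in re.findall(r'T[\S*\s]*S', s)][0][12:]
memory_limit = [m for s in spans for m in re.findall(r'Me[\S*\s]*K', s)][0][14:]

print(time_limit)    # 2000/1000 MS
print(memory_limit)  # 32768/32768 K
```

Note that the fixed offsets 12 and 14 silently break if HDU ever changes the labels; a capturing group such as r'Time Limit:\s*(\S+ \S+)' would be more robust.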
pipelines.py (processes the extracted items and writes them to the database)
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

from crawl.items import ProblemItem
# from items import *
import pymysql


class SolPipeline(object):
    def __init__(self):
        print("<<<<<<<<<<<<<<<pipeline init>>>>>>>>>>>>>>>>>>>>")

    def open_spider(self, spider):
        self.db = pymysql.connect("localhost", "root", "root", "vjtest")

    def close_spider(self, spider):
        self.db.close()

    def process_item(self, item, spider):
        print(">>>>>>>>>>>>>>>>>>>>>pipeline process")
        self.testdb()
        if isinstance(item, ProblemItem):
            print(">>>>>>>>>ProblemItem")
            self.processProblemItem(item)
        return item

    def processProblemItem(self, item):
        print("processProblemItem")
        cursor = self.db.cursor()
        # These fields still carry their outer <div ...>...</div> wrapper:
        # cut everything up to the first '>' and from the last '<' onward.
        need = ['desc', 'input', 'output']
        for k in need:
            s = item[k]
            L = 0
            R = len(s)
            while L < R:
                if s[L] == '>':
                    break
                L += 1
            while L < R:
                if s[R - 1] == '<':
                    break
                R -= 1
            item[k] = s[L + 1:R - 1]
        # Escape single quotes so they survive inside the SQL string literals.
        for k in item.keys():
            item[k] = item[k].replace("'", "\\'")
        sql = "select * from problem " \
              " where originOj = '%s' and problemId = '%s'" \
              % (item['originOj'], item['problemId'])
        print(sql)
        try:
            cursor.execute(sql)
            results = cursor.fetchall()
            hasProb = len(results) > 0  # any row: the problem already exists
            if hasProb:
                print("---beautiful split one---")
                sql = "update problem set title = '%s' " \
                      ",timeLimit = '%s' " \
                      ",memoryLimit = '%s' " \
                      ",description = '%s' " \
                      ",input = '%s' " \
                      ",output = '%s' " \
                      ",sampleInput = '%s' " \
                      ",sampleOutput = '%s' " \
                      ",updateTime = '%s' " \
                      "where originOj = '%s' and problemId = '%s'" \
                      % (item['title'], item['timeLimit'], item['memoryLimit'],
                         item['desc'], item['input'], item['output'],
                         item['sampleInput'], item['sampleOutput'],
                         item['updateTime'], item['originOj'], item['problemId'])
                print(sql)
                cursor.execute(sql)
            else:
                print("---beautiful split two---")
                sql = "insert into problem " \
                      "values('%s','%s','%s','%s','%s','%s','%s','%s','%s','%s','%s','%s')" \
                      % (item['originOj'], item['problemId'], item['problemUrl'],
                         item['title'], item['timeLimit'], item['memoryLimit'],
                         item['desc'], item['input'], item['output'],
                         item['sampleInput'], item['sampleOutput'],
                         item['updateTime'])
                print("sql get!!!!!!!! : %s" % sql)
                cursor.execute(sql)
            self.db.commit()
        except Exception:
            self.db.rollback()
            print("Error : sql execute failed")

    def testdb(self):
        # Create a cursor object with cursor()
        cursor = self.db.cursor()
        # Run a SQL query with execute()
        cursor.execute("SELECT VERSION()")
        # Fetch a single row with fetchone()
        data = cursor.fetchone()
        print("Database version : %s " % data)
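The two cleanup steps in processProblemItem (dropping the outer HTML tag, escaping single quotes) can be tested in isolation. This is a behavior-equivalent sketch using str.find/rfind/replace; the function names are mine, not from the project:

```python
def strip_outer_tag(s):
    # Equivalent to the two while-loops in the pipeline: cut everything up
    # to and including the first '>' and from the last '<' onward.
    # (Degenerate inputs with no tag are returned unchanged here.)
    left = s.find('>')
    right = s.rfind('<')
    return s[left + 1:right] if -1 < left < right else s


def escape_single_quotes(s):
    # Equivalent to the character-by-character loop: prefix every ' with \
    return s.replace("'", "\\'")


print(strip_outer_tag('<div class="panel_content">42 is the answer</div>'))
print(escape_single_quotes("it's"))
```

A safer design would skip manual escaping entirely and pass the values as parameters, e.g. cursor.execute("insert into problem values(%s, ...)", args), letting pymysql do the quoting; the manual approach above mirrors what the project currently does.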
Log of crawling one problem:
2017-05-14 21:36:53 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: crawl)
2017-05-14 21:36:53 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'crawl', 'NEWSPIDER_MODULE': 'crawl.spiders', 'SPIDER_MODULES': ['crawl.spiders']}
2017-05-14 21:36:53 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2017-05-14 21:36:54 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-05-14 21:36:54 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
<<<<<<<<<<<<<<<pipeline init>>>>>>>>>>>>>>>>>>>>
2017-05-14 21:36:54 [scrapy.middleware] INFO: Enabled item pipelines:
['crawl.pipelines.SolPipeline']
2017-05-14 21:36:54 [scrapy.core.engine] INFO: Spider opened
2017-05-14 21:36:54 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-05-14 21:36:54 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-05-14 21:36:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://acm.hdu.edu.cn/showproblem.php?pid=3456> (referer: None)
-------------------------------------------
desc : <div class="panel_content">In computer science, an oracle is something that gives you the answer to a particular question. For this problem, you need to write an oracle that gives the answer to everything. But it's not as bad as it sounds; you know that 42 is the answer to life, the universe, and everything.</div>
-------------------------------------------
>>>>>>>>>>>>>>>>>>>>>pipeline process
Database version : 5.7.14-log
>>>>>>>>>ProblemItem
processProblemItem
select * from problem  where originOj = 'hdu' and problemId = '3456'
---beautiful split two---
sql get!!!!!!!! : insert into problem values('hdu','3456','http://acm.hdu.edu.cn/showproblem.php?pid=3456','Universal Oracle','2000/1000 MS','32768/32768 K','In computer science, an oracle is something that gives you the answer to a particular question. For this problem, you need to write an oracle that gives the answer to everything. But it\'s not as bad as it sounds; you know that 42 is the answer to life, the universe, and everything.','The input consists of a single line of text with at most 1000 characters. This text will contain only well-formed English sentences. The only characters that will be found in the text are uppercase and lowercase letters, spaces, hyphens, apostrophes, commas, semicolons, periods, and question marks. Furthermore, each sentence begins with a single uppercase letter and ends with either a period or a question mark. Besides these locations, no other uppercase letters, periods, or question marks will appear in the sentence. Finally, every question (that is, a sentence that ends with a question mark) will begin with the phrase "What is..."','For each question, print the answer, which replaces the "What" at the beginning with "Forty-two" and the question mark at the end with a period. Each answer should reside on its own line. ','Let me ask you two questions. What is the answer to life? What is the answer to the universe?','Forty-two is the answer to life.
Forty-two is the answer to the universe.','2017-05-14 21:36:55')
2017-05-14 21:36:55 [scrapy.core.scraper] DEBUG: Scraped from <200 http://acm.hdu.edu.cn/showproblem.php?pid=3456>
{'desc': "In computer science, an oracle is something that gives you the answer to a particular question. For this problem, you need to write an oracle that gives the answer to everything. But it\\'s not as bad as it sounds; you know that 42 is the answer to life, the universe, and everything.",
 'input': 'The input consists of a single line of text with at most 1000 characters. This text will contain only well-formed English sentences. The only characters that will be found in the text are uppercase and lowercase letters, spaces, hyphens, apostrophes, commas, semicolons, periods, and question marks. Furthermore, each sentence begins with a single uppercase letter and ends with either a period or a question mark. Besides these locations, no other uppercase letters, periods, or question marks will appear in the sentence. Finally, every question (that is, a sentence that ends with a question mark) will begin with the phrase "What is..."',
 'memoryLimit': '32768/32768 K',
 'originOj': 'hdu',
 'output': 'For each question, print the answer, which replaces the "What" at the beginning with "Forty-two" and the question mark at the end with a period. Each answer should reside on its own line. ',
 'problemId': '3456',
 'problemUrl': 'http://acm.hdu.edu.cn/showproblem.php?pid=3456',
 'sampleInput': 'Let me ask you two questions. What is the answer to life? What is the answer to the universe?',
 'sampleOutput': 'Forty-two is the answer to life.\r\nForty-two is the answer to the universe.',
 'timeLimit': '2000/1000 MS',
 'title': 'Universal Oracle',
 'updateTime': '2017-05-14 21:36:55'}
2017-05-14 21:36:55 [scrapy.core.engine] INFO: Closing spider (finished)
2017-05-14 21:36:55 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 236,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 3637,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 5, 14, 13, 36, 55, 418228),
 'item_scraped_count': 1,
 'log_count/DEBUG': 3,
 'log_count/INFO': 7,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2017, 5, 14, 13, 36, 54, 855729)}
2017-05-14 21:36:55 [scrapy.core.engine] INFO: Spider closed (finished)
Below is how the original HDU 5722 problem is displayed on our page; the escaping of mathematical symbols is still an open issue.
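My guess (not yet confirmed against the actual HDU 5722 markup) is that the math-symbol problem comes from HTML entities such as &lt;, &amp;, or &le; surviving in the extracted text. If so, the standard html module can decode them before the text is stored:

```python
import html

# Hypothetical entity-laden fragment of the kind an HTML extract may
# contain; not taken from the real HDU 5722 page.
raw = "0 &le; a&lt;b &amp; b &ne; c"

decoded = html.unescape(raw)
print(decoded)  # 0 ≤ a<b & b ≠ c
```

Decoding would have to happen before the single-quote escaping step, since html.unescape can itself introduce characters (e.g. from &apos;) that need SQL escaping.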