Innovation Practicum 5.14: Learning Scrapy, Information Extraction
Source: Internet · Editor: 程序博客网 · Time: 2024/05/18 00:25
I spent the weekend learning how to install and use Scrapy, and ran into repeated trouble setting up the environment. Because my machine has both Python 2 and Python 3 installed, the pip command broke. Following a tutorial, I deleted pip.exe from the Python 3 installation, but running pip3 still failed with "Fatal error in launcher: Unable to create process using '"'". After a lot of searching, I finally worked around it by invoking pip through the interpreter directly: python3 -m pip install xxx.
For environment setup I followed the tutorial at http://www.cnblogs.com/wuxl360/p/5567065.html. A few things to note on Windows: OpenSSL must be installed; pywin32 is not required during installation, but Scrapy will not run without it; and the VC++ 14 build tools are also needed.
Once the environment worked, I studied the Scrapy and XPath tutorials:
http://scrapy-chs.readthedocs.io/zh_CN/1.0/intro/tutorial.html
http://www.w3school.com.cn/xpath/index.asp
After some debugging, I finished extracting HDU problem information and storing it in the database.
Notes:
- In settings.py, set ROBOTSTXT_OBEY = False, otherwise nothing gets crawled
- In settings.py, set ITEM_PIPELINES to register the pipeline you wrote
- Python's string-handling functions do most of the cleanup work
- Single quotes in the extracted text must be escaped
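The two settings.py notes above can be sketched as the following fragment. The pipeline path crawl.pipelines.SolPipeline matches this project's log output; the priority value 300 is my arbitrary choice (any integer from 0 to 1000, lower runs first):

```python
# settings.py -- minimal fragment for the notes above.
# Pipeline path taken from this project's log; priority 300 is an assumption.

ROBOTSTXT_OBEY = False  # if True, Scrapy honors robots.txt and skips HDU pages

ITEM_PIPELINES = {
    'crawl.pipelines.SolPipeline': 300,
}
```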
items.py (defines the problem Item to extract)
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class ProblemItem(scrapy.Item):
    originOj = scrapy.Field()
    problemId = scrapy.Field()
    problemUrl = scrapy.Field()
    title = scrapy.Field()
    timeLimit = scrapy.Field()
    memoryLimit = scrapy.Field()
    desc = scrapy.Field()
    input = scrapy.Field()
    output = scrapy.Field()
    sampleInput = scrapy.Field()
    sampleOutput = scrapy.Field()
    updateTime = scrapy.Field()
problem_spider.py (crawls the page and populates the Item)
from scrapy.spiders import Spider
from scrapy.selector import Selector
from datetime import datetime

from crawl.items import ProblemItem


class HduProblemSpider(Spider):
    name = 'hdu_problem'
    # allowed_domains = ['acm.hdu.edu.cn']
    problem_id = '1000'

    def __init__(self, problem_id='1005', *args, **kwargs):
        self.problem_id = problem_id
        super(HduProblemSpider, self).__init__(*args, **kwargs)
        self.start_urls = [
            'http://acm.hdu.edu.cn/showproblem.php?pid=%s' % problem_id
        ]

    def parse(self, response):
        sel = Selector(response)
        item = ProblemItem()
        item['originOj'] = 'hdu'
        item['problemId'] = self.problem_id
        item['problemUrl'] = response.url
        item['title'] = sel.xpath('//h1/text()').extract()[0]
        item['desc'] = sel.css('.panel_content').extract()[0]
        item['input'] = sel.css('.panel_content').extract()[1]
        item['output'] = sel.css('.panel_content').extract()[2]
        item['timeLimit'] = \
            sel.xpath('//b/span/text()').re('T[\S*\s]*S')[0][12:]
        item['memoryLimit'] = \
            sel.xpath('//b/span/text()').re('Me[\S*\s]*K')[0][14:]
        item['sampleInput'] = sel.xpath('//pre/div/text()').extract()[0]
        item['sampleOutput'] = sel.xpath('//pre/div/text()').extract()[1]
        item['updateTime'] = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        print("-------------------------------------------")
        print("desc : %s" % item['desc'])
        print("-------------------------------------------")
        return item
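The timeLimit/memoryLimit extraction above is a regex plus a fixed-width slice. The same trick can be checked offline with the standard re module; the two sample strings below are hand-written stand-ins for what the page's //b/span text nodes contain (the live HDU page may differ slightly):

```python
import re

# Hand-written stand-ins for the //b/span text nodes (assumed format).
spans = ["Time Limit: 2000/1000 MS (Java/Others)",
         "Memory Limit: 32768/32768 K (Java/Others)"]

# Same idea as the spider: greedily match from the capital T (or 'Me')
# up to the last capital S (or K), then slice off the fixed-width label
# ('Time Limit: ' is 12 chars, 'Memory Limit: ' is 14).
time_limit = [m for s in spans for m in re.findall(r'T[\S*\s]*S', s)][0][12:]
memory_limit = [m for s in spans for m in re.findall(r'Me[\S*\s]*K', s)][0][14:]

print(time_limit)    # 2000/1000 MS
print(memory_limit)  # 32768/32768 K
```

Note that the fixed offsets 12 and 14 silently break if HDU ever changes the labels; a capturing group such as r'Time Limit:\s*(\S+ \S+)' would be more robust.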
pipelines.py (processes the extracted items and writes them to the database)
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

from crawl.items import ProblemItem
# from items import *
import pymysql


class SolPipeline(object):
    def __init__(self):
        print("<<<<<<<<<<<<<<<pipeline init>>>>>>>>>>>>>>>>>>>>")

    def open_spider(self, spider):
        self.db = pymysql.connect("localhost", "root", "root", "vjtest")

    def close_spider(self, spider):
        self.db.close()

    def process_item(self, item, spider):
        print(">>>>>>>>>>>>>>>>>>>>>pipeline process")
        self.testdb()
        if isinstance(item, ProblemItem):
            print(">>>>>>>>>ProblemItem")
            self.processProblemItem(item)
        return item

    def processProblemItem(self, item):
        print("processProblemItem")
        cursor = self.db.cursor()
        # These fields still carry their outer <div ...>...</div> wrapper:
        # cut everything up to the first '>' and from the last '<' onward.
        need = ['desc', 'input', 'output']
        for k in need:
            s = item[k]
            L = 0
            R = len(s)
            while L < R:
                if s[L] == '>':
                    break
                L += 1
            while L < R:
                if s[R - 1] == '<':
                    break
                R -= 1
            item[k] = s[L + 1:R - 1]
        # Escape single quotes so they survive inside the SQL string literals.
        for k in item.keys():
            item[k] = item[k].replace("'", "\\'")
        sql = "select * from problem " \
              " where originOj = '%s' and problemId = '%s'" \
              % (item['originOj'], item['problemId'])
        print(sql)
        try:
            cursor.execute(sql)
            results = cursor.fetchall()
            hasProb = len(results) > 0  # any row: the problem already exists
            if hasProb:
                print("---beautiful split one---")
                sql = "update problem set title = '%s' " \
                      ",timeLimit = '%s' " \
                      ",memoryLimit = '%s' " \
                      ",description = '%s' " \
                      ",input = '%s' " \
                      ",output = '%s' " \
                      ",sampleInput = '%s' " \
                      ",sampleOutput = '%s' " \
                      ",updateTime = '%s' " \
                      "where originOj = '%s' and problemId = '%s'" \
                      % (item['title'], item['timeLimit'], item['memoryLimit'],
                         item['desc'], item['input'], item['output'],
                         item['sampleInput'], item['sampleOutput'],
                         item['updateTime'], item['originOj'], item['problemId'])
                print(sql)
                cursor.execute(sql)
            else:
                print("---beautiful split two---")
                sql = "insert into problem " \
                      "values('%s','%s','%s','%s','%s','%s','%s','%s','%s','%s','%s','%s')" \
                      % (item['originOj'], item['problemId'], item['problemUrl'],
                         item['title'], item['timeLimit'], item['memoryLimit'],
                         item['desc'], item['input'], item['output'],
                         item['sampleInput'], item['sampleOutput'],
                         item['updateTime'])
                print("sql get!!!!!!!! : %s" % sql)
                cursor.execute(sql)
            self.db.commit()
        except Exception:
            self.db.rollback()
            print("Error : sql execute failed")

    def testdb(self):
        # Create a cursor object with cursor()
        cursor = self.db.cursor()
        # Run a SQL query with execute()
        cursor.execute("SELECT VERSION()")
        # Fetch a single row with fetchone()
        data = cursor.fetchone()
        print("Database version : %s " % data)
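The two cleanup steps in processProblemItem (dropping the outer HTML tag, escaping single quotes) can be tested in isolation. This is a behavior-equivalent sketch using str.find/rfind/replace; the function names are mine, not from the project:

```python
def strip_outer_tag(s):
    # Equivalent to the two while-loops in the pipeline: cut everything up
    # to and including the first '>' and from the last '<' onward.
    # (Degenerate inputs with no tag are returned unchanged here.)
    left = s.find('>')
    right = s.rfind('<')
    return s[left + 1:right] if -1 < left < right else s


def escape_single_quotes(s):
    # Equivalent to the character-by-character loop: prefix every ' with \
    return s.replace("'", "\\'")


print(strip_outer_tag('<div class="panel_content">42 is the answer</div>'))
print(escape_single_quotes("it's"))
```

A safer design would skip manual escaping entirely and pass the values as parameters, e.g. cursor.execute("insert into problem values(%s, ...)", args), letting pymysql do the quoting; the manual approach above mirrors what the project currently does.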
Log of crawling one problem:
2017-05-14 21:36:53 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: crawl)
2017-05-14 21:36:53 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'crawl', 'NEWSPIDER_MODULE': 'crawl.spiders', 'SPIDER_MODULES': ['crawl.spiders']}
2017-05-14 21:36:53 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2017-05-14 21:36:54 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-05-14 21:36:54 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
<<<<<<<<<<<<<<<pipeline init>>>>>>>>>>>>>>>>>>>>
2017-05-14 21:36:54 [scrapy.middleware] INFO: Enabled item pipelines:
['crawl.pipelines.SolPipeline']
2017-05-14 21:36:54 [scrapy.core.engine] INFO: Spider opened
2017-05-14 21:36:54 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-05-14 21:36:54 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-05-14 21:36:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://acm.hdu.edu.cn/showproblem.php?pid=3456> (referer: None)
-------------------------------------------
desc : <div class="panel_content">In computer science, an oracle is something that gives you the answer to a particular question. For this problem, you need to write an oracle that gives the answer to everything. But it's not as bad as it sounds; you know that 42 is the answer to life, the universe, and everything.</div>
-------------------------------------------
>>>>>>>>>>>>>>>>>>>>>pipeline process
Database version : 5.7.14-log
>>>>>>>>>ProblemItem
processProblemItem
select * from problem  where originOj = 'hdu' and problemId = '3456'
---beautiful split two---
sql get!!!!!!!! : insert into problem values('hdu','3456','http://acm.hdu.edu.cn/showproblem.php?pid=3456','Universal Oracle','2000/1000 MS','32768/32768 K','In computer science, an oracle is something that gives you the answer to a particular question. For this problem, you need to write an oracle that gives the answer to everything. But it\'s not as bad as it sounds; you know that 42 is the answer to life, the universe, and everything.','The input consists of a single line of text with at most 1000 characters. This text will contain only well-formed English sentences. The only characters that will be found in the text are uppercase and lowercase letters, spaces, hyphens, apostrophes, commas, semicolons, periods, and question marks. Furthermore, each sentence begins with a single uppercase letter and ends with either a period or a question mark. Besides these locations, no other uppercase letters, periods, or question marks will appear in the sentence. Finally, every question (that is, a sentence that ends with a question mark) will begin with the phrase "What is..."','For each question, print the answer, which replaces the "What" at the beginning with "Forty-two" and the question mark at the end with a period. Each answer should reside on its own line. ','Let me ask you two questions. What is the answer to life? What is the answer to the universe?','Forty-two is the answer to life.
Forty-two is the answer to the universe.','2017-05-14 21:36:55')
2017-05-14 21:36:55 [scrapy.core.scraper] DEBUG: Scraped from <200 http://acm.hdu.edu.cn/showproblem.php?pid=3456>
{'desc': "In computer science, an oracle is something that gives you the answer to a particular question. For this problem, you need to write an oracle that gives the answer to everything. But it\\'s not as bad as it sounds; you know that 42 is the answer to life, the universe, and everything.",
 'input': 'The input consists of a single line of text with at most 1000 characters. This text will contain only well-formed English sentences. The only characters that will be found in the text are uppercase and lowercase letters, spaces, hyphens, apostrophes, commas, semicolons, periods, and question marks. Furthermore, each sentence begins with a single uppercase letter and ends with either a period or a question mark. Besides these locations, no other uppercase letters, periods, or question marks will appear in the sentence. Finally, every question (that is, a sentence that ends with a question mark) will begin with the phrase "What is..."',
 'memoryLimit': '32768/32768 K',
 'originOj': 'hdu',
 'output': 'For each question, print the answer, which replaces the "What" at the beginning with "Forty-two" and the question mark at the end with a period. Each answer should reside on its own line. ',
 'problemId': '3456',
 'problemUrl': 'http://acm.hdu.edu.cn/showproblem.php?pid=3456',
 'sampleInput': 'Let me ask you two questions. What is the answer to life? What is the answer to the universe?',
 'sampleOutput': 'Forty-two is the answer to life.\r\nForty-two is the answer to the universe.',
 'timeLimit': '2000/1000 MS',
 'title': 'Universal Oracle',
 'updateTime': '2017-05-14 21:36:55'}
2017-05-14 21:36:55 [scrapy.core.engine] INFO: Closing spider (finished)
2017-05-14 21:36:55 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 236,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 3637,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 5, 14, 13, 36, 55, 418228),
 'item_scraped_count': 1,
 'log_count/DEBUG': 3,
 'log_count/INFO': 7,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2017, 5, 14, 13, 36, 54, 855729)}
2017-05-14 21:36:55 [scrapy.core.engine] INFO: Spider closed (finished)
Below is how the original HDU 5722 problem is displayed on our page; the escaping of mathematical symbols is still an open issue.
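My guess (not yet confirmed against the actual HDU 5722 markup) is that the math-symbol problem comes from HTML entities such as &lt;, &amp;, or &le; surviving in the extracted text. If so, the standard html module can decode them before the text is stored:

```python
import html

# Hypothetical entity-laden fragment of the kind an HTML extract may
# contain; not taken from the real HDU 5722 page.
raw = "0 &le; a&lt;b &amp; b &ne; c"

decoded = html.unescape(raw)
print(decoded)  # 0 ≤ a<b & b ≠ c
```

Decoding would have to happen before the single-quote escaping step, since html.unescape can itself introduce characters (e.g. from &apos;) that need SQL escaping.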