Crawling Academic Affairs Office Notices with Scrapy
Source: Internet · Editor: 程序博客网 · Posted: 2024/05/17 09:06
1. Prerequisites
Python 2.7.11 (win32), Scrapy 1.1.0rc1
Scrapy getting-started tutorial (Chinese): http://scrapy-chs.readthedocs.org/zh_CN/latest/intro/tutorial.html
XPath basic syntax: http://www.cnblogs.com/zhaozhan/archive/2009/09/09/1563617.html
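To get a feel for the XPath patterns the spider below relies on, here is a minimal, runnable sketch against a toy table (the markup and values are invented for illustration). It uses the stdlib ElementTree, which supports only a small XPath subset; Scrapy's selectors support full XPath 1.0:

```python
import xml.etree.ElementTree as ET

# A simplified stand-in for one row of the notice list page.
html = """
<table>
  <tr>
    <td><a href="/s/24/t/923/p/21/i/1/notice1.htm"><font>Notice title</font></a></td>
    <td class="postTime">2016-04-01</td>
  </tr>
</table>
"""

root = ET.fromstring(html)
for tr in root.findall('.//tr'):
    title = tr.find('./td/a/font').text            # like sel.xpath('a/font/text()')
    href = tr.find('./td/a').get('href')           # like sel.xpath('a/@href')
    date = tr.find("./td[@class='postTime']").text # like the postTime lookup below
    print(title, href, date)
```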
2. Create the project
scrapy startproject jwc
3. Edit items.py
# -*- coding: utf-8 -*-
# __author__ = 'Maximus'
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class JwcItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    url = scrapy.Field()
    content = scrapy.Field()
    date = scrapy.Field()
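The four fields map one-to-one onto a scraped notice. For intuition, here is a hypothetical sample of what one item looks like once serialized by the pipeline in step 5 (the title, URL, and date below are invented for illustration):

```python
import json

# Invented sample of one scraped notice as a plain dict.
item = {
    'title': ['Notice about the 2016 exam schedule'],
    'url': 'http://jwc.njupt.edu.cn/s/24/t/923/p/21/i/1/example.htm',
    'date': ['2016-04-25'],
}
line = json.dumps(item, ensure_ascii=False) + '\n'
print(line)
```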
4. Create jwc_spider.py under the spiders directory
# -*- coding: utf-8 -*-
# __author__ = 'Maximus'
from scrapy.http import Request
import scrapy

from jwc.items import JwcItem


class JwcSpider(scrapy.Spider):
    name = "jwc"
    start_urls = [
        "http://jwc.njupt.edu.cn/s/24/t/923/p/21/i/1/list.htm"
    ]

    def parse(self, response):
        for sel in response.xpath('//tr/td'):
            item = JwcItem()
            item['title'] = [n.encode('utf-8') for n in sel.xpath('a/font/text()').extract()]
            item['url'] = "http://jwc.njupt.edu.cn" + "".join(sel.xpath('a/@href').extract())
            item['date'] = sel.xpath("../td[@class='postTime']/text()").extract()
            if item['title']:
                yield Request(item['url'], callback=self.parse_content, meta={'item': item})
        # follow the next-page link (the third <a title> link on the first page)
        url = "http://jwc.njupt.edu.cn" + response.xpath("//table/tr/td/a[@title]/@href").extract()[2]
        yield Request(url, callback=self.parse_from_second)

    def parse_from_second(self, response):
        for sel in response.xpath('//tr/td'):
            item = JwcItem()
            item['title'] = [n.encode('utf-8') for n in sel.xpath('a/font/text()').extract()]
            item['url'] = "http://jwc.njupt.edu.cn" + "".join(sel.xpath('a/@href').extract())
            item['date'] = sel.xpath("../td[@class='postTime']/text()").extract()
            if item['title']:
                yield Request(item['url'], callback=self.parse_content, meta={'item': item})
        # on subsequent pages the next-page link is the fifth <a title> link
        url = "http://jwc.njupt.edu.cn" + response.xpath("//table/tr/td/a[@title]/@href").extract()[4]
        yield Request(url, callback=self.parse_from_second)

    def parse_content(self, response):
        item = response.meta['item']
        item['content'] = [n.encode('utf-8') for n in response.xpath('//div[@id="container_content"]').extract()]
        return item
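The spider builds absolute URLs by concatenating the host onto the extracted href. A more robust alternative is the stdlib urljoin, which handles relative paths and trailing slashes correctly; a small sketch (in Scrapy 1.0+ you can also call response.urljoin(href) directly):

```python
# Building absolute URLs with urljoin instead of string concatenation.
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2

base = 'http://jwc.njupt.edu.cn/s/24/t/923/p/21/i/1/list.htm'
next_url = urljoin(base, '/s/24/t/923/p/21/i/2/list.htm')
print(next_url)
# http://jwc.njupt.edu.cn/s/24/t/923/p/21/i/2/list.htm
```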
5. Edit pipelines.py
# -*- coding: utf-8 -*-
# __author__ = 'Maximus'
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

import json
import codecs


class JwcPipeline(object):
    def __init__(self):
        self.file = codecs.open('items.json', 'wb', encoding='utf-8')

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + '\n'
        self.file.write(line.decode("unicode_escape"))
        return item
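The line.decode("unicode_escape") call is a Python 2 trick: json.dumps escapes Chinese text as \uXXXX sequences by default, and decoding with unicode_escape turns them back into readable characters before writing. On Python 3 (where str has no .decode), pass ensure_ascii=False to json.dumps instead, which achieves the same result directly:

```python
import json

notice = {'title': u'教务处通知'}  # sample value for illustration
escaped = json.dumps(notice)                       # \uXXXX escapes by default
readable = json.dumps(notice, ensure_ascii=False)  # keeps the Chinese text as-is
print(escaped)
print(readable)
```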
6. Register the pipeline in settings.py
ITEM_PIPELINES = {
    'jwc.pipelines.JwcPipeline': 300,
}
7. Run the crawl and save the results
scrapy crawl jwc -o items.json
(Note: the pipeline in step 5 already writes items.json itself, so the -o feed export is redundant here; either drop -o or point it at a different filename so the two writers don't clobber each other.)