Scraping News from the Cankaoxiaoxi (参考消息) Website with Python
Source: Internet  Editor: 程序博客网  Time: 2024/06/05 02:55
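While learning Python, I wrote a simple example that scrapes news from Cankaoxiaoxi (参考消息).
By following the request that the site's JavaScript makes, the crawler can also fetch the next page of results.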
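The list URL used in this post carries a `jsoncallback` parameter, which suggests the response is JSONP. As a minimal sketch (Python 3, standard library only, unlike the Python 2 listing in this post), the following strips a JSONP wrapper and collects every `url` field with the `json` module instead of a regular expression. The sample payload, its `data` key, and the helper names `strip_jsonp` and `collect_urls` are assumptions made for illustration; the real response layout was not verified.

```python
import json
import re


def strip_jsonp(text):
    """Return the JSON payload inside a JSONP wrapper such as cb({...})."""
    match = re.search(r'^[^(\[{]*\((.*)\)\s*;?\s*$', text.strip(), re.DOTALL)
    # If no wrapper is found, assume the text is already plain JSON.
    return match.group(1) if match else text


def collect_urls(node):
    """Recursively gather every string stored under a "url" key."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "url" and isinstance(value, str):
                found.append(value)
            else:
                found.extend(collect_urls(value))
    elif isinstance(node, list):
        for child in node:
            found.extend(collect_urls(child))
    return found


# Hypothetical sample payload; the real response layout is an assumption.
sample = (r'?({"data":[{"title":"a","url":"http:\/\/example.com\/1.shtml"},'
          r'{"title":"b","url":"http:\/\/example.com\/2.shtml"}]})')

urls = collect_urls(json.loads(strip_jsonp(sample)))
print(urls)
```

Because `json.loads` decodes the escaped `\/` sequences itself, no manual backslash removal is needed in this variant.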
# -*- coding:utf-8 -*-
'''
Created on 2015-12-8
@author: AndyCoder

Python 2 script (uses urllib2 and the print statement).
'''
import re
import urllib2
import json


class spider(object):
    '''Fetches a news list URL and parses the article pages it links to.'''

    def __init__(self, url="", header=""):
        '''url: list endpoint to fetch; header: User-Agent string to send.'''
        self.url = url
        self.header = header

    def parseUrl(self, urlPatter='"url":"(.*?)",'):
        '''Return (article URLs found in the response, decoded response text).'''
        urlList = []
        pattern = re.compile(urlPatter, re.DOTALL)
        request = urllib2.Request(self.url)
        request.add_header('User-Agent', self.header)
        response = urllib2.urlopen(request)
        html = response.read()
        contentHtml = html.decode('raw_unicode_escape')
        items = re.findall(pattern, html)
        for item in items:
            # Strip backslashes (for example JSON-escaped "\/") from the URL.
            urls = item.replace('\\', '')
            urlList.append(urls)
        return urlList, contentHtml

    def parseContent(self, url, contentPattern='<div class="content">(.*?)</a>(.*?)</strong>(.*?)</div>'):
        '''Return (list of news dicts parsed from the article page, raw page bytes).'''
        newsList = []
        contentPattern = re.compile(contentPattern, re.DOTALL)
        titlePattern = re.compile('<title>(.*?)-(.*?)</title>', re.DOTALL)
        req = urllib2.Request(url)
        resp = urllib2.urlopen(req)
        content = resp.read()
        utfContent = content.decode('utf8')
        title = ''
        for item in re.findall(titlePattern, utfContent):
            # Keep the part of <title> before the "-" separator.
            title = item[0]
        for item in re.findall(contentPattern, content):
            # Build a fresh dict per match so list entries do not share one object.
            newsDict = {}
            newsDict['title'] = title
            newsDict['url'] = url
            newsDict['time'] = item[1]
            newsDict['content'] = item[2]
            newsList.append(newsDict)
        return newsList, content


s = spider('http://app.cankaoxiaoxi.com/?app=system&controller=channel&action=wap_index&catid=1&order=publish&num=2weight=60&jsoncallback=?',
           'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36')
urls, html = s.parseUrl()
for url in urls:
    newsList, content = s.parseContent(url)
    # Round-trip through JSON so every string comes back as unicode.
    news_string = json.dumps(newsList)
    decoded = json.loads(news_string)
    if len(decoded) > 0:
        print decoded[0]
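To see what the content pattern captures without touching the network, the following minimal sketch (Python 3, standard library only) applies the same two regular expressions to a hand-written HTML fragment. The fragment, the `example.com` URL, and the function name `extract_news` are hypothetical; the fragment only mimics the structure the pattern expects, and the real page markup may differ. Unlike the listing, it takes the first `<title>` match and trims whitespace from the captured fields.

```python
import re

# Same patterns as the spider class in this post.
CONTENT_PATTERN = re.compile(
    r'<div class="content">(.*?)</a>(.*?)</strong>(.*?)</div>', re.DOTALL)
TITLE_PATTERN = re.compile(r'<title>(.*?)-(.*?)</title>', re.DOTALL)


def extract_news(page_html, url):
    """Return one dict per content match found in page_html."""
    title_match = TITLE_PATTERN.search(page_html)
    title = title_match.group(1) if title_match else ''
    news_list = []
    for _source, time_text, body in CONTENT_PATTERN.findall(page_html):
        # A new dict is created for every match, so entries stay independent.
        news_list.append({
            'title': title,
            'url': url,
            'time': time_text.strip(),
            'content': body.strip(),
        })
    return news_list


# Hypothetical page fragment shaped to fit the pattern; not real site markup.
sample_html = (
    '<html><head><title>Sample headline-Site name</title></head><body>'
    '<div class="content"><strong><a href="#">Source</a> 2015-12-08 10:00</strong>'
    '<p>Body text.</p></div></body></html>'
)

news = extract_news(sample_html, 'http://example.com/1.shtml')
print(news)
```

The second capture group holds whatever sits between `</a>` and `</strong>`, which the listing stores as `time`; the third group is everything up to the first closing `</div>`, so nested `<div>` elements inside the article body would cut the content short.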