A First Try at a Python Web Crawler
Source: Internet  Editor: 程序博客网  Time: 2024/05/01 16:15
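Note: the HTTP request is built by hand because simply calling urllib.urlopen(url)
to fetch the HTML source works only for a while. After enough requests the site starts answering with 403 Forbidden. Sending a request with browser-like headers (User-Agent, Accept, Referer) and cookie support avoids the 403 error.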
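For readers on Python 3, where urllib2 and cookielib have become urllib.request and http.cookiejar, the idea in the note above can be sketched roughly as follows. This is an assumed translation rather than the author's script: the function names build_browser_like_opener and fetch_html and the 10-second timeout are hypothetical, and the header values are simply the ones used in the script.

```python
import http.cookiejar
import urllib.request

# Browser-like headers; the values are taken from the script in this post.
HEADERS = [
    ("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
                   "(KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36"),
    ("Accept", "*/*"),
    ("Referer", "http://www.douban.com"),
]


def build_browser_like_opener():
    """Return an opener that keeps cookies and sends browser-like headers."""
    cookie_jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(cookie_jar))
    # Replaces the default "Python-urllib/x.y" User-Agent header.
    opener.addheaders = list(HEADERS)
    return opener


def fetch_html(url, opener=None):
    """Fetch url and decode the body as UTF-8 (assumes a UTF-8 page)."""
    opener = opener or build_browser_like_opener()
    # timeout=10 is an assumption, added so a stalled request cannot hang.
    with opener.open(url, timeout=10) as response:
        return response.read().decode("utf-8")
```

Calling fetch_html(url) with one shared opener across pages keeps any cookies the site sets between requests, which is what the cookie processor is there for.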
The script is as follows:
#-*- coding:utf-8 -*-
# Python 2 script: urllib2 and cookielib exist only in Python 2.
#import urllib
import urllib2
import re
import sys
import cookielib

# Make UTF-8 the default encoding so str and unicode can be mixed when writing.
reload(sys)
sys.setdefaultencoding("utf-8")

def getHtml(url):
    # Plain requests like these eventually get 403 Forbidden:
    #page = urllib.urlopen(url)
    #page = urllib2.urlopen(url)
    #html = page.read().decode("utf-8")
    cookie_support = urllib2.HTTPCookieProcessor(cookielib.CookieJar())
    #opener = urllib2.build_opener(cookie_support, urllib2.HTTPHandler)
    opener = urllib2.build_opener(cookie_support)
    # The header value must be a single string, not a list.
    user_agent = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36'
    opener.addheaders = [("User-agent", user_agent), ("Accept", "*/*"), ('Referer', 'http://www.douban.com')]
    response = opener.open(url)
    html = response.read().decode("utf-8")
    return html

def getTitle(html):
    # Each match is a tuple: (link, title, like count).
    reg = u'class="note-item">.*?<a.*?href=(.*?)class="title".*?target="_blank">(.*?)</a>.*?<span>(.*?)喜欢</span>'
    titleRe = re.compile(reg, re.S)
    titlelist = re.findall(titleRe, html)
    return titlelist

page_num = 0
filePath = r'C:\Users\Administrator\tmp\DoubanTop250.txt'
while page_num < 10:
    # The start parameter advances by 15 for each page.
    html_url = 'https://www.douban.com/tag/%E8%A3%85%E4%BF%AE/article?start=' + str(page_num*15)
    page_num = page_num + 1
    html = getHtml(html_url)
    #print html
    Contents = getTitle(html)
    # Overwrite the file on the first page, append on later pages.
    if page_num == 1:
        fileTop250 = open(filePath, 'w')
    else:
        fileTop250 = open(filePath, 'a')
    for Content in Contents:
        # Keep only articles with at least 1500 likes.
        if int(Content[2]) >= 1500:
        #if 1:
            # '人喜欢' means "people like this".
            fileTop250.write(Content[2] + '人喜欢' + '\n')
            fileTop250.write('Title:' + Content[1] + '\n')
            fileTop250.write('Link:' + Content[0] + '\n')
            fileTop250.write('from the ' + str(page_num) + ' page' + '\n')
    print 'Read the ' + str(page_num) + ' page successful...'
    fileTop250.close()
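The extraction step in getTitle can be tried without touching the network. The sketch below (Python 3) runs the same regular expression against a small hand-written HTML fragment. The fragment, the example.com links, and the helper name extract_popular are invented for illustration and only assume the markup shape that the pattern implies; the real Douban page may differ.

```python
import re

# Same pattern as getTitle in the script: link, title, like count.
NOTE_RE = re.compile(
    r'class="note-item">.*?<a.*?href=(.*?)class="title".*?'
    r'target="_blank">(.*?)</a>.*?<span>(.*?)喜欢</span>',
    re.S,
)

# Hypothetical markup, written only to match the shape the pattern expects.
SAMPLE_HTML = '''
<div class="note-item">
  <a href="https://example.com/note/1" class="title" target="_blank">First note</a>
  <span>1800喜欢</span>
</div>
<div class="note-item">
  <a href="https://example.com/note/2" class="title" target="_blank">Second note</a>
  <span>12喜欢</span>
</div>
'''


def extract_popular(html, min_likes=1500):
    """Return (likes, title, link) for notes with at least min_likes likes."""
    results = []
    for link, title, likes in NOTE_RE.findall(html):
        count = int(likes)  # int() tolerates surrounding whitespace
        if count >= min_likes:
            # The href group still carries its quotes and a trailing space.
            results.append((count, title.strip(), link.strip().strip('"')))
    return results


print(extract_popular(SAMPLE_HTML))
# → [(1800, 'First note', 'https://example.com/note/1')]
```

Because the first group is captured between href= and class="title", it includes the surrounding quotes, which is why the sketch strips them before storing the link.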