A Web Crawler in Python
Source: Internet  Editor: 程序博客网  Time: 2024/04/29 06:09
Download the images from a single web page:
# -*- coding: utf-8 -*-
import urllib
import re

def getHtml(url):
    page = urllib.urlopen(url)
    html = page.read()
    return html

def getImg(html):
    # Pattern that matches the image URLs
    reg = r'src="(.+?\.jpg)" pic_ext'
    imgre = re.compile(reg)
    imgList = re.findall(imgre, html)
    x = 0
    for imgurl in imgList:
        # The statement that actually downloads the image
        img = urllib.urlretrieve(imgurl, r"D://picture/%s.jpg" % x)
        x = x + 1
        print img

# Address of the page to download from
html = getHtml("http://tieba.baidu.com/p/2460150866")
getImg(html)
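The matching step of the script above can be tried without touching the network. The following is a minimal Python 3 sketch, not the author's code: `extract_image_urls`, `local_name`, and `sample_html` are hypothetical names and data invented for illustration. It also assumes a slightly tightened pattern, `[^"]+?` instead of `.+?`, so that one match cannot run across several tags when a non-jpg `src` appears first.

```python
import re

# Tightened variant of the script's pattern (an assumption of this sketch):
# [^"]+? keeps the capture inside one quoted attribute value.
IMG_PATTERN = re.compile(r'src="([^"]+?\.jpg)" pic_ext')


def extract_image_urls(html):
    """Return every .jpg URL matched by IMG_PATTERN, in page order."""
    return IMG_PATTERN.findall(html)


def local_name(index, folder="picture"):
    """Build the numbered file name used to save the index-th image."""
    return "%s/%d.jpg" % (folder, index)


# Hypothetical page fragment, made up for this example.
sample_html = (
    '<img class="BDE_Image" src="http://example.com/a/1.jpg" pic_ext="jpeg">'
    '<img src="http://example.com/b/logo.png">'
    '<img class="BDE_Image" src="http://example.com/a/2.jpg" pic_ext="jpeg">'
)

urls = extract_image_urls(sample_html)
for i, url in enumerate(urls):
    print(url, "->", local_name(i))
```

Each URL in `urls` could then be handed to a download call such as `urllib.request.urlretrieve(url, local_name(i))`, the Python 3 counterpart of the `urllib.urlretrieve` call used in the script.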

# -*- coding: utf-8 -*-
import urllib2
import urllib
import re
import HTMLParser
import time, os

host = "http://desk.zol.com.cn"
startImageUrl = ''
localSavePath = 'D:\\picture\\'
ISOTIMEFORMAT = '%Y%m%d%H%M%S'

def downloadImage(url):
    imgRe = r'[0-9]*\.jpg'
    match = re.search(imgRe, url)
    if match:
        print "Downloading image begin", url
        # The file name is the current timestamp
        filename = localSavePath + str(time.strftime(ISOTIMEFORMAT)) + r'.jpg'
        img = urllib.urlretrieve(url, filename)
    else:
        print "NO match"

def getImageUrlByHtmlUrl(htmlUrl):
    parser = MyHtmlParser(False)
    request = urllib2.Request(htmlUrl)
    try:
        response = urllib2.urlopen(request)
        content = response.read()
        parser.feed(content)
    except urllib2.URLError, e:
        print e.reason

class MyHtmlParser(HTMLParser.HTMLParser):
    def __init__(self, isIndex):
        self.isIndex = isIndex
        HTMLParser.HTMLParser.__init__(self)

    def handle_starttag(self, tag, attrs):
        # Declared before any assignment so that both branches use the module-level variable
        global startImageUrl
        if(self.isIndex):
            if(tag == 'a'):
                if(len(attrs) == 4):
                    if(attrs[0] == ('class','pic')):
                        newUrl = host + attrs[1][1]
                        print "Find a image site: ", newUrl
                        # Question: without this statement only one page's images are downloaded (usage of the global declaration)
                        startImageUrl = newUrl
                        getImageUrlByHtmlUrl(newUrl)
        else:
            if(tag == 'img'):
                if(attrs[0] == ('id','bigImg')):
                    imgUrl = attrs[1][1]
                    print " one image : ", imgUrl
                    downloadImage(imgUrl)
            if(tag == 'a'):
                if(len(attrs) == 4):
                    if(attrs[1] == ('class','next')):
                        nextUrl = host + attrs[2][1]
                        print "Find a next image Link", nextUrl
                        if( nextUrl != startImageUrl ):
                            getImageUrlByHtmlUrl(nextUrl)

if __name__ == "__main__":
    indexUrl = "http://desk.zol.com.cn/meinv/"
    page = urllib2.urlopen(indexUrl).read()
    parseIndex = MyHtmlParser(True)
    parseIndex.feed(page)
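The detail-page branch of the parser above can be exercised offline as well. This is a minimal Python 3 sketch under stated assumptions: `DetailPageParser` and `sample_page` are hypothetical, the HTML fragment and its paths are invented, and attributes are looked up by name through `dict(attrs)` rather than by position as the script does.

```python
from html.parser import HTMLParser

HOST = "http://desk.zol.com.cn"  # same host value as in the script above


class DetailPageParser(HTMLParser):
    """Collect the big image URL and the 'next' link from one detail page."""

    def __init__(self):
        super().__init__()
        self.image_url = None
        self.next_url = None

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)  # look attributes up by name, not by position
        if tag == "img" and attr.get("id") == "bigImg":
            self.image_url = attr.get("src")
        elif tag == "a" and attr.get("class") == "next" and attr.get("href"):
            self.next_url = HOST + attr["href"]


# Hypothetical detail page, made up for this example.
sample_page = """
<a id="pageNext" class="next" href="/bizhi/123_456_2.html" title="next">next</a>
<img id="bigImg" src="http://example.com/wallpaper/789.jpg" width="960">
"""

parser = DetailPageParser()
parser.feed(sample_page)
print(parser.image_url)
print(parser.next_url)
```

Looking attributes up by name keeps the parser working when a page reorders its attributes or adds one, a case the positional checks (`attrs[0]`, `attrs[1]`, `len(attrs) == 4`) would miss.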
API: http://blog.csdn.net/tianxicool/article/details/5942523