Downloading the Images at a Given URL with BeautifulSoup (Part 2)

Source: Internet  Published by: 蒙文翻译软件  Editor: 程序博客网  Time: 2024/05/16 10:48

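     Part 1, "Downloading the Images at a Given URL with BeautifulSoup (Part 1)", showed how to use Beautiful Soup to download the images on a single given page. In practice we often want many more images, including those on the pages that the given page links to (one level down). This is easy to do: it only takes a small addition to the code from the previous post. The added code is as follows: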

def getUrlLinks(content):
    links = content.find_all('a')
    for link in links:
        url = link.get('href')
        print("visiting url is:",url)
        #print("url[0:4]=",url[0:4])
        if url !='' and url !=None and url[0:4]=='http' and len(url)>10:
            cont = getContentFromUrl(url)
            getInfoFromContent(cont)
            #time.sleep(0.02)
        else:
            pass

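     This function receives content, the parsed content of the given URL, extracts every <a> tag from it, and reads each tag's "href" attribute. A few checks are added to skip empty href values and non-standard URLs. For every link that passes the checks, it fetches the linked page and extracts the images from it.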
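
     The checks described above can be pulled into a small helper, which makes the filtering rule easier to read and to try out on its own. This is only an illustrative sketch: is_followable is a hypothetical name that does not appear in the original code, and it simply mirrors the condition used in getUrlLinks without adding any new rule.

```python
def is_followable(url):
    """Mirror the href/src check from the post's if-statement.

    Hypothetical helper (not part of the original script).
    """
    # link.get('href') returns None when the attribute is missing,
    # and an empty string when it is present but empty.
    if not url:
        return False
    # Keep only absolute http/https URLs that are longer than 10 characters.
    return url.startswith('http') and len(url) > 10


if __name__ == "__main__":
    candidates = [None, '', '/channel/star', 'javascript:void(0)',
                  'http://a', 'http://www.sohu.com']
    for candidate in candidates:
        print(repr(candidate), is_followable(candidate))
```

     Note that relative links such as /channel/star are skipped by this rule, because they do not start with http.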

     The complete code is shown below:

# -*- coding:UTF-8 -*-
#coding = gbk
from __future__ import print_function  # print(a, b) prints the values, not a tuple, on Python 2
import time
import urllib
import urllib2
from bs4 import BeautifulSoup

imgID = 0  # running counter used to name the downloaded files


def getContentFromUrl(url):
    # Fetch the page and parse it, decoding the bytes as GBK
    req = urllib2.Request(url)
    content = urllib2.urlopen(req).read()
    content = BeautifulSoup(content, from_encoding='gbk')
    return content


def getUrlLinks(content):
    # Visit every absolute link on the page and download the images found there
    links = content.find_all('a')
    for link in links:
        url = link.get('href')
        print("visiting url is:",url)
        #print("url[0:4]=",url[0:4])
        if url !='' and url !=None and url[0:4]=='http' and len(url)>10:
            cont = getContentFromUrl(url)
            getInfoFromContent(cont)
            #time.sleep(0.02)
        else:
            pass


def getInfoFromContent(content):
    # Download every <img> on the page whose src is an absolute URL
    global imgID
    imgs = content.find_all('img')
    for link in imgs:
        print(link)
        if link.name =='img':
            url = link.get('src')
            if url !='' and url !=None and url[0:4]=='http' and len(url)>10:
                try:
                    # the "girl" directory must already exist
                    urllib.urlretrieve(url, "girl/%02d.jpg"%imgID)
                except IOError as e:
                    # urllib.urlretrieve reports failures as IOError
                    # (urllib2.URLError is a subclass of IOError)
                    print(e)
                #data = urllib2.urlopen(url)
                #with open("girl/%02d.jpg"%imgID,'wb')as code:
                #code.write(data.read())
                print(link.get('src'))
                imgID =imgID+1
                print("get images:", imgID)


if __name__ == "__main__":
    #print(getContentFromUrl("http://www.sohu.com"))
    #content = getContentFromUrl("http://car.autohome.com.cn/jingxuan/index.html")
    #content = getContentFromUrl("http://price.pcauto.com.cn/cars/pic.html")
    content = getContentFromUrl("http://image.baidu.com/channel/star")
    getUrlLinks(content)

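
     The script relies on urllib2 and urllib.urlretrieve, which exist only in Python 2. As a rough sketch, assuming Python 3, the download step might look like the code below. image_path and download_image are hypothetical helper names that are not part of the original script, and creating the folder automatically is an added assumption (the original expects the girl directory to exist already).

```python
import os
import urllib.request


def image_path(folder, index):
    """Build the zero-padded file name the script uses, e.g. girl/03.jpg."""
    return os.path.join(folder, "%02d.jpg" % index)


def download_image(url, index, folder="girl"):
    """Save url under folder as a numbered .jpg; return True on success."""
    # Assumption: create the folder if it is missing (the original does not).
    os.makedirs(folder, exist_ok=True)
    try:
        urllib.request.urlretrieve(url, image_path(folder, index))
    except OSError as e:
        # urllib.error.URLError and HTTPError are subclasses of OSError.
        print("download failed:", e)
        return False
    return True
```

     Returning True or False lets the caller increase its image counter only when a file was actually saved. The URL is expected to have passed the same http check as in the script before it reaches download_image.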
     Running the complete script gives the following result (the screenshot is not preserved here):



