Downloading the Images at a Given URL with BeautifulSoup (Part 2)

Source: Internet. Editor: 程序博客网. Date: 2024/05/16 10:48

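Part 1 of this series showed how to download the images on a given page with Beautiful Soup. In practice we often want many more images, including the ones on the pages that the given page links to. This is easy to do: it only takes a small addition to the code from the previous post. The added code is as follows: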

def getUrlLinks(content):
    links = content.find_all('a')
    for link in links:
        url = link.get('href')
        print("visiting url is:",url)
        #print("url[0:4]=",url[0:4])
        if url !='' and url !=None and url[0:4]=='http' and len(url)>10:
            cont = getContentFromUrl(url)
            getInfoFromContent(cont)
            #time.sleep(0.02)
        else:
            pass

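This function takes content, the parsed contents of the given URL, extracts every <a> tag from it, and reads each tag's href attribute. A few checks guard against empty hrefs and non-standard URLs. Each link that passes the checks is then fetched, and the images on that page are extracted in turn.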
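The href check described above can be pulled out into a small helper so it is easier to read and to test on its own. This is only a sketch: the name is_followable and the sample hrefs are hypothetical, and it is written for Python 3 rather than the Python 2 used in the post.

```python
def is_followable(url):
    """Return True if an href looks like an absolute http(s) URL worth visiting."""
    # link.get('href') yields None for tags without an href; empty strings are skipped too
    if not url:
        return False
    # Same test as in the post: must start with 'http' and be longer than 10 characters
    return url.startswith("http") and len(url) > 10


# Hypothetical sample values, standing in for hrefs pulled from <a> tags
hrefs = [None, "", "#top", "/news/1.html", "http://a.b",
         "http://image.baidu.com/channel/star"]
print([h for h in hrefs if is_followable(h)])
```

Only the last sample passes the check. Relative links such as /news/1.html are dropped, which matches the behavior of the original condition; resolving them against the page URL would be a separate extension.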

The complete code is as follows:

# -*- coding: UTF-8 -*-
# Python 2 script (uses urllib2)
import time
import urllib
import urllib2
from bs4 import BeautifulSoup

imgID = 0  # running count of images, also used to build file names


def getContentFromUrl(url):
    # Fetch the page and parse it, decoding the bytes as GBK
    req = urllib2.Request(url)
    content = urllib2.urlopen(req).read()
    content = BeautifulSoup(content, from_encoding='gbk')
    return content


def getUrlLinks(content):
    # Visit every absolute link on the page and download its images
    links = content.find_all('a')
    for link in links:
        url = link.get('href')
        print("visiting url is:",url)
        #print("url[0:4]=",url[0:4])
        if url !='' and url !=None and url[0:4]=='http' and len(url)>10:
            cont = getContentFromUrl(url)
            getInfoFromContent(cont)
            #time.sleep(0.02)
        else:
            pass


def getInfoFromContent(content):
    # Download every <img> with an absolute src into the girl/ directory,
    # which must already exist
    global imgID
    imgs = content.find_all('img')
    for link in imgs:
        print(link)
        if link.name =='img':
            url = link.get('src')
            if url !='' and url !=None and url[0:4]=='http' and len(url)>10:
                try:
                    urllib.urlretrieve(url, "girl/%02d.jpg"%imgID)
                except IOError as e:
                    # urllib.urlretrieve raises IOError; urllib2.URLError is a subclass of it
                    print(e)
                #data = urllib2.urlopen(url)
                #with open("girl/%02d.jpg"%imgID,'wb')as code:
                #code.write(data.read())
                print(link.get('src'))
                imgID =imgID+1
                print("get images:", imgID)


if __name__ == "__main__":
    #print(getContentFromUrl("http://www.sohu.com"))
    #content = getContentFromUrl("http://car.autohome.com.cn/jingxuan/index.html")
    #content = getContentFromUrl("http://price.pcauto.com.cn/cars/pic.html")
    content = getContentFromUrl("http://image.baidu.com/channel/star")
    getUrlLinks(content)
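The download step in the script assumes that the girl directory already exists. A more defensive version of that one step might look like the sketch below. It is not the author's code: it targets Python 3, where urlretrieve lives in urllib.request, and the name save_image and its fetch parameter are hypothetical, added only so the step can be exercised without network access.

```python
import os
from urllib.request import urlretrieve  # Python 3 home of urllib.urlretrieve


def save_image(url, index, directory="girl", fetch=urlretrieve):
    """Download url to directory/NN.jpg; return the path, or None on failure."""
    # Create the target directory if it is missing
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, "%02d.jpg" % index)
    try:
        fetch(url, path)
    except OSError as e:
        # In Python 3, URLError and HTTPError derive from OSError,
        # so this also covers plain file I/O failures
        print("failed to download %s: %s" % (url, e))
        return None
    return path
```

A caller could then increment its image counter only when save_image returns a path, so that failed downloads do not leave gaps in the file numbering.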

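When run, the script prints each URL it visits, each <img> tag it finds, and a running count of the images it has downloaded.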