Scraping emoji packs from 站长素材 with a Python web crawler


Because I rarely check group messages, I have only a small collection of emoji packs, so I always end up on the losing side whenever a meme battle breaks out in a group chat. Having recently taken 嵩天's course "Python网络爬虫与信息提取" (Python Web Crawling and Information Extraction) on 中国大学MOOC, I decided to write a crawler that collects emoji packs from the web. A quick search showed that 站长素材 has a very rich selection: 446 listing pages with 10 packs per page, which works out to more than 4,000 packs and close to ten thousand individual images. Let's see who dares to challenge me to a meme battle after this.

Technical approach

requests + BeautifulSoup

Page analysis

The first listing page of emoji packs on 站长素材 looks like this:


You can see that the URL of the first page is: http://sc.chinaz.com/biaoqing/index.html

Clicking the pagination buttons at the bottom shows that the URL of the second page is: http://sc.chinaz.com/biaoqing/index_2.html

So we can infer that the URL of page 446 is: http://sc.chinaz.com/biaoqing/index_446.html
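Under that pattern, the listing-page URLs can be generated in a couple of lines. A minimal sketch (listing_urls is just an illustrative helper; the 446-page count reflects the site at the time of writing):

# Build the listing-page URLs: page 1 has no suffix, later pages use index_N.html
def listing_urls(last_page=446):
    root = "http://sc.chinaz.com/biaoqing/"
    urls = [root + "index.html"]
    for n in range(2, last_page + 1):
        urls.append(root + "index_" + str(n) + ".html")
    return urls

print(listing_urls(3))
# ['http://sc.chinaz.com/biaoqing/index.html',
#  'http://sc.chinaz.com/biaoqing/index_2.html',
#  'http://sc.chinaz.com/biaoqing/index_3.html']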

Next, look at the source of the pack list on each listing page:
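Judging from that markup (the same selectors the full program below relies on), each pack on a listing page sits in a div with class up, and the link carrying the pack's title lives inside a nested div with class num_1. A minimal parsing sketch under that assumption (parse_listing is an illustrative helper; getTypeUrlList in the full program does the same thing):

from bs4 import BeautifulSoup

# Return (title, url) pairs for every pack on a listing page,
# assuming the div.up / div.num_1 / a markup described above
def parse_listing(html):
    soup = BeautifulSoup(html, "html.parser")
    packs = []
    for div in soup.find_all("div", attrs={"class": "up"}):
        a = div.find("div", attrs={"class": "num_1"}).find("a")
        packs.append((a.attrs["title"], a.attrs["href"]))
    return packs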


Then look at the page that holds all of the images in a single pack:
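On a pack's detail page, the image grid is the div that follows the div with class img_text, which is why the full program hops over it with two next_sibling calls (the first sibling is a whitespace text node). A sketch of pulling out the image URLs under that assumption (parse_pack_page is illustrative; getImgUrlList in the full program mirrors it):

from bs4 import BeautifulSoup

# Return the src of every image on a pack's detail page,
# assuming the image grid is the div after div.img_text
def parse_pack_page(html):
    soup = BeautifulSoup(html, "html.parser")
    marker = soup.find("div", attrs={"class": "img_text"})
    img_div = marker.next_sibling.next_sibling  # skip the whitespace text node between the divs
    return [img.attrs["src"] for img in img_div.find_all("img")]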



Steps

1. Get the link and title of every pack shown on each listing page.

2. Get the links to all of the images in each pack.

3. Use those image links to download the images, putting each pack's images into its own folder named after the pack's title attribute (sketched below).
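Step 3 on its own looks roughly like this sketch; it assumes each image URL ends in a usable file name, and the full program's getImage adds progress output and error handling on top of it:

import os
import requests

# Illustrative helper for step 3: download one pack into file_path/<title>/,
# skipping files that already exist
def download_pack(title, img_urls, file_path="e://biaoqing/"):
    folder = os.path.join(file_path, title)
    if not os.path.exists(folder):
        os.makedirs(folder)
    head = {"user-agent": "Mozilla/5.0"}
    for img_url in img_urls:
        path = os.path.join(folder, img_url.split("/")[-1])
        if os.path.exists(path):
            continue
        r = requests.get(img_url, headers=head, timeout=30)
        r.raise_for_status()
        with open(path, "wb") as f:
            f.write(r.content)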

Code

# -*- coding: utf-8 -*-
'''
Created on 2017-03-18

@author: lavi
'''
from bs4 import BeautifulSoup
import requests
import os
import traceback


def getHtmlText(url):
    # Fetch a page and return its decoded text, or "" on failure
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return ""


def getImgContent(url):
    # Fetch binary content (e.g. an image), or None on failure
    head = {"user-agent": "Mozilla/5.0"}
    try:
        r = requests.get(url, headers=head, timeout=30)
        print("status_code:" + str(r.status_code))
        r.raise_for_status()
        return r.content
    except:
        return None


def getTypeUrlList(html, typeUrlList):
    # Collect (title, url) pairs for every pack on a listing page
    soup = BeautifulSoup(html, 'html.parser')
    divs = soup.find_all("div", attrs={"class": "up"})
    for div in divs:
        a = div.find("div", attrs={"class": "num_1"}).find("a")
        title = a.attrs["title"]
        typeUrl = a.attrs["href"]
        typeUrlList.append((title, typeUrl))


def getImgUrlList(typeUrlList, imgUrlDict):
    # For each pack, collect the src of every image on its detail page
    for title, url in typeUrlList:
        title_imgUrlList = []
        html = getHtmlText(url)
        soup = BeautifulSoup(html, "html.parser")
        div = soup.find("div", attrs={"class": "img_text"})
        imgDiv = div.next_sibling.next_sibling
        imgs = imgDiv.find_all("img")
        for img in imgs:
            src = img.attrs["src"]
            title_imgUrlList.append(src)
        imgUrlDict[title] = title_imgUrlList


def getImage(imgUrlDict, file_path):
    # Download every image of every pack into file_path/<title>/
    head = {"user-agent": "Mozilla/5.0"}
    countdir = 0
    for title, imgUrlList in imgUrlDict.items():
        try:
            folder = file_path + title
            if not os.path.exists(folder):
                os.mkdir(folder)
            countfile = 0
            for imgUrl in imgUrlList:
                path = folder + "/" + imgUrl.split("/")[-1]
                if not os.path.exists(path):
                    r = requests.get(imgUrl, headers=head, timeout=30)
                    r.raise_for_status()
                    with open(path, "wb") as f:
                        f.write(r.content)
                    countfile = countfile + 1
                    print("current pack progress {:.2f}%".format(countfile * 100 / len(imgUrlList)))
            countdir = countdir + 1
            print("overall progress {:.2f}%".format(countdir * 100 / len(imgUrlDict)))
        except:
            traceback.print_exc()


def main():
    # To avoid filling the disk, don't fetch everything: only crawl 30 pages,
    # roughly the images of about 300 packs
    pages = 30
    root = "http://sc.chinaz.com/biaoqing/"
    file_path = "e://biaoqing/"

    # Page 1 uses index.html
    url = root + "index.html"
    imgUrlDict = {}
    typeUrlList = []
    html = getHtmlText(url)
    getTypeUrlList(html, typeUrlList)
    getImgUrlList(typeUrlList, imgUrlDict)
    getImage(imgUrlDict, file_path)

    # Pages 2..pages use index_N.html
    for page in range(2, pages + 1):
        url = root + "index_" + str(page) + ".html"
        imgUrlDict = {}
        typeUrlList = []
        html = getHtmlText(url)
        getTypeUrlList(html, typeUrlList)
        getImgUrlList(typeUrlList, imgUrlDict)
        getImage(imgUrlDict, file_path)


main()
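One thing to keep in mind when running it: the target directory (e://biaoqing/ above) has to exist beforehand, because os.mkdir only creates the per-pack subfolders; change file_path to suit your own machine.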
Results



If you've been losing meme battles in your group chats, just run the program above... no need to thank me, March is Learn-from-Lei-Feng Month. Haha, come and have a meme battle with me.
