Data Collection (1): Scraping Images with requests (3 Ways)

Source: Internet | Published by: java垃圾回收机制英文 | Editor: 程序博客网 | Time: 2024/05/18 12:42

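As an example, we scrape a single image from a Baidu Tieba page. The relevant HTML source is included below, so it does not matter if the URL has since gone dead; the point is to study the analysis.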

<div id="post_content_87286618651" class="d_post_content j_d_post_content  clearfix">            <img class="BDE_Image" src="http://imgsrc.baidu.com/forum/w%3D580/sign=b2310eb7be389b5038ffe05ab534e5f1/680c676d55fbb2fbc7f64cbb484a20a44423dc98.jpg" size="21406" changedsize="true" width="560" height="747" style="cursor: url(&quot;http://tb2.bdstatic.com/tb/static-pb/img/cur_zin.cur&quot;), pointer;"></div>

First…

Open the page

# -*- coding: utf-8 -*-
import requests

url = 'http://tieba.baidu.com/p/4468445702'
html = requests.get(url)
# Set the encoding explicitly so that html.text is decoded as UTF-8
html.encoding = 'utf-8'
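A request can hang or come back with an error page, so it may help to wrap the fetch in a small function. The following is only a sketch: the name `fetch_html` and the 10-second default timeout are assumptions made for illustration and do not appear in the original post.

```python
import requests


def fetch_html(url, timeout=10):
    """Fetch a page and return its decoded text (hypothetical helper)."""
    # The timeout keeps the call from blocking forever on a dead server
    resp = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError on a 4xx/5xx status instead of parsing an error page
    resp.raise_for_status()
    # Decode the body as UTF-8, as in the snippet above
    resp.encoding = 'utf-8'
    return resp.text
```

Calling `fetch_html(url)` with the Tieba URL would return the page source as a string, provided the page is still reachable.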

Then…

Get the image URL (3 ways)

Using the BeautifulSoup library

from bs4 import BeautifulSoup

bs = BeautifulSoup(html.content, 'html.parser')
# Locate the post's <div> by its id, then collect the <img> tags inside it
img_list = bs.find('div', {'id': 'post_content_87286618651'}).find_all('img')
img_src = img_list[0].attrs['src']
print(img_src)
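Because the live page may no longer exist, the BeautifulSoup approach can also be tried offline against the HTML shown earlier. The sketch below assumes that post images carry the class `BDE_Image`, as in that sample; `SAMPLE_HTML` (a trimmed copy of the sample) and `extract_image_urls` are names introduced here for illustration.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample markup from the article (the style attribute is omitted)
SAMPLE_HTML = '''<div id="post_content_87286618651" class="d_post_content j_d_post_content  clearfix">
<img class="BDE_Image" src="http://imgsrc.baidu.com/forum/w%3D580/sign=b2310eb7be389b5038ffe05ab534e5f1/680c676d55fbb2fbc7f64cbb484a20a44423dc98.jpg" size="21406" width="560" height="747">
</div>'''


def extract_image_urls(html_text):
    """Return the src of every <img class="BDE_Image"> in html_text."""
    soup = BeautifulSoup(html_text, 'html.parser')
    # class_ filters on the CSS class; tags without a src attribute are skipped
    return [img['src'] for img in soup.find_all('img', class_='BDE_Image')
            if img.get('src')]


urls = extract_image_urls(SAMPLE_HTML)
print(urls)
```

Selecting by class instead of by the post id means the same function could be reused for other posts, as long as the class assumption holds.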

Using XPath

from lxml import etree

selector = etree.HTML(html.content)
# xpath() returns a list of matching <img> elements
images = selector.xpath('//*[@id="post_content_87286618651"]/img')
img_src = images[0].attrib.get('src')
print(img_src)
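An XPath expression can also select the attribute value directly by ending in `/@src`, which saves the extra `attrib.get` step. This is a minimal sketch run against a trimmed copy of the sample HTML; the names `SAMPLE_HTML`, `tree`, and `srcs` are illustrative.

```python
from lxml import etree

# Trimmed copy of the sample markup from the article (the style attribute is omitted)
SAMPLE_HTML = '''<div id="post_content_87286618651" class="d_post_content j_d_post_content  clearfix">
<img class="BDE_Image" src="http://imgsrc.baidu.com/forum/w%3D580/sign=b2310eb7be389b5038ffe05ab534e5f1/680c676d55fbb2fbc7f64cbb484a20a44423dc98.jpg" size="21406" width="560" height="747">
</div>'''

tree = etree.HTML(SAMPLE_HTML)
# /@src makes xpath() return the attribute values as strings, not elements
srcs = tree.xpath('//div[@id="post_content_87286618651"]/img/@src')
print(srcs)
```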

Using a regular expression

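import re

# Use html.text (str), not html.content (bytes), so it matches the str pattern
text = html.text
# [^>]* and [^"]* keep the match inside a single <img> tag
pattern = re.compile(r'<img [^>]*src="([^"]*)" size="21406"')
match = pattern.search(text)
img_src = match.group(1)
print(img_src)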
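Matching on `size="21406"` ties the pattern to one specific image. A slightly more general sketch is shown below; it collects the `src` of every `<img>` tag with `findall`. The names `IMG_SRC_RE` and `find_image_urls` are made up for this example, and the pattern assumes double-quoted attributes with no `>` inside attribute values, so it is not a substitute for a real HTML parser.

```python
import re

# Trimmed copy of the sample markup from the article (the style attribute is omitted)
SAMPLE_HTML = '''<div id="post_content_87286618651" class="d_post_content j_d_post_content  clearfix">
<img class="BDE_Image" src="http://imgsrc.baidu.com/forum/w%3D580/sign=b2310eb7be389b5038ffe05ab534e5f1/680c676d55fbb2fbc7f64cbb484a20a44423dc98.jpg" size="21406" width="560" height="747">
</div>'''

# [^>]*? stays inside one tag; the group captures everything up to the closing quote
IMG_SRC_RE = re.compile(r'<img\b[^>]*?\bsrc="([^"]*)"')


def find_image_urls(html_text):
    """Return the src value of every <img> tag found in html_text."""
    return IMG_SRC_RE.findall(html_text)


print(find_image_urls(SAMPLE_HTML))
```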

Finally…

Write the image to a file

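img = requests.get(img_src)
# Open in 'wb' mode: 'ab' would append to an existing file and corrupt the image on a re-run
with open('baidu_tieba.jpg', 'wb') as f:
    f.write(img.content)  # the with block closes the file automatically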
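When saving more than one image, a fixed file name no longer works. One option, sketched below with the standard library only, is to derive the name from the last segment of the image URL. The helpers `filename_from_url` and `save_bytes` are hypothetical, and the bytes written here are a stand-in for `img.content`, so that the example runs without network access.

```python
import os
import posixpath
import tempfile
from urllib.parse import urlparse


def filename_from_url(url, default='image.jpg'):
    """Return the last path segment of url, or default if the path has none."""
    name = posixpath.basename(urlparse(url).path)
    return name or default


def save_bytes(data, path):
    """Write data to path in binary mode, replacing any existing file."""
    with open(path, 'wb') as f:
        f.write(data)


IMG_URL = ('http://imgsrc.baidu.com/forum/w%3D580/'
           'sign=b2310eb7be389b5038ffe05ab534e5f1/'
           '680c676d55fbb2fbc7f64cbb484a20a44423dc98.jpg')

# Placeholder bytes standing in for img.content (not a real JPEG)
fake_image = b'\xff\xd8\xff\xe0 placeholder'

out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, filename_from_url(IMG_URL))
save_bytes(fake_image, out_path)
save_bytes(fake_image, out_path)  # a second write overwrites rather than appends
print(os.path.basename(out_path), os.path.getsize(out_path))
```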
