Trying Out Python Web Scraping: Using BeautifulSoup

Source: Internet | Posted by: 工业控制网络视频 | Editor: 程序博客网 | Time: 2024/06/15 04:06

1. Environment setup

Runtime: Python 2.7
IDE: PyCharm
HTML/XML extraction tool: BeautifulSoup (installation is left to the reader, for example pip install beautifulsoup4)

2. Site analysis

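Site: Douyu (http://www.douyu.com)
Scraping target: the images on the home page
Step 1: Inspect an image: right-click it in the browser and choose Inspect.
Step 2: The inspection shows that every image link is stored in a src attribute.
Step 3: Write the code
Import the libraries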

import urllib
from bs4 import BeautifulSoup

Fetch the page

import urllib
from bs4 import BeautifulSoup

# Download the Douyu home page and parse it with the built-in HTML parser
f = urllib.urlopen("http://www.douyu.com")
html = f.read()
soup = BeautifulSoup(html, 'html.parser')
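To see what the parsing step produces without touching the network, here is a minimal sketch. It is written for Python 3, whereas the article itself targets Python 2.7. The sample markup and the helper name image_sources are made up for illustration; BeautifulSoup, find_all, and Tag.get are the real bs4 calls.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for a downloaded page.
SAMPLE_HTML = """
<html><body>
  <img src="https://example.com/a.png">
  <img alt="no source here">
  <img src="https://example.com/b.jpg">
</body></html>
"""


def image_sources(html):
    """Return the src value of every <img> tag that has one."""
    soup = BeautifulSoup(html, "html.parser")
    # Tag.get returns None instead of raising when the attribute is missing.
    return [img.get("src") for img in soup.find_all("img") if img.get("src")]


if __name__ == "__main__":
    for src in image_sources(SAMPLE_HTML):
        print(src)
```

Using get rather than indexing attrs['src'] directly means a tag without a src attribute is skipped instead of stopping the run with a KeyError.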

Find and download the images

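import os

# Collect every <img> tag on the page
imgs = soup.find_all('img')
print imgs
print len(imgs)

# urlretrieve does not create directories, so make sure ./img2 exists
if not os.path.isdir('./img2'):
    os.makedirs('./img2')

for i, img in enumerate(imgs):
    url = img.get('src')  # None when the tag has no src attribute
    if not url:
        continue
    print url
    dot = url.rfind('.')  # position of the last dot in the URL
    print dot
    if dot > 0:
        ext = url[dot + 1:dot + 4]  # up to three characters after the last dot
        if ext == 'png':
            print "this is png"
            urllib.urlretrieve(url, './img2/png%d.png' % i)
        elif ext == 'jpg':
            print "this is jpg"
            urllib.urlretrieve(url, './img2/img%d.jpg' % i)
        elif ext == 'gif':
            print "this is gif"
            urllib.urlretrieve(url, './img2/gif%d.gif' % i)
        else:
            print "Error"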
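A src value is not always a full URL. Pages often use relative paths or protocol-relative links that start with //, and those cannot be passed to urlretrieve as they are. The sketch below, in Python 3, shows one way to resolve them against the page address with urllib.parse.urljoin. The base URL is the one from the article; the function name absolute_image_urls and the sample values are hypothetical.

```python
from urllib.parse import urljoin

BASE_URL = "http://www.douyu.com"  # the page the article scrapes


def absolute_image_urls(base_url, sources):
    """Resolve each src value against the page URL, skipping empty values and inline data: URIs."""
    resolved = []
    for src in sources:
        if not src or src.startswith("data:"):
            continue
        resolved.append(urljoin(base_url, src))
    return resolved


if __name__ == "__main__":
    # Hypothetical src values of the kinds a real page may contain.
    samples = ["//cdn.example.com/a.png", "/static/b.jpg", "https://example.com/c.gif"]
    for url in absolute_image_urls(BASE_URL, samples):
        print(url)
```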

3. Summary

While fetching the images I found that they come in jpg, png, and gif formats, so I used string operations on the URL to tell the formats apart.
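Slicing three characters after the last dot works for plain URLs, but it misreads a URL with a query string (the last dot may sit inside the query) and turns jpeg into jpe. A slightly more robust variant, sketched here in Python 3 with the standard library only, parses the URL first and then splits the extension off the path. The function name image_extension and the example URLs are assumptions for illustration.

```python
import os
from urllib.parse import urlparse

KNOWN_EXTENSIONS = {"png", "jpg", "gif"}  # the three formats the article handles


def image_extension(url):
    """Return the lowercase extension of the URL path, or None if it is not a known image type."""
    path = urlparse(url).path        # drops any ?query or #fragment
    ext = os.path.splitext(path)[1]  # e.g. ".PNG", or "" when there is no dot
    ext = ext.lstrip(".").lower()
    return ext if ext in KNOWN_EXTENSIONS else None


if __name__ == "__main__":
    for url in ("https://example.com/pic/a.PNG?v=1.2", "https://example.com/b.jpg"):
        print(url, image_extension(url))
```

The returned extension can then be used to build the output file name in place of the three separate if branches.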
