So cute are you Python 13
Source: Internet | Published by: overture mac | Editor: 程序博客网 | Time: 2024/06/05 14:12
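The previous installment briefly introduced the basic usage of BeautifulSoup.
In this installment we raise the difficulty a little and scrape the article links from a blog.
1. First, analyze the structure of the page to be crawled, then extract specific elements based on that structure:
1.1 The BeautifulSoup calls used here
soup = BeautifulSoup(page)    # parse the fetched page into a searchable tree
soup.findAll(name="span")     # find every <span> tag in the page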
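As a small offline illustration of the two calls above, the sketch below parses a hard-coded HTML snippet instead of a downloaded page. The snippet, its link texts, and the helper name `link_texts` are made up for illustration; the original does not show the blog's real markup. It is written for Python 3, names the parser explicitly, and uses `find_all`, which is the bs4 spelling of `findAll`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; the real page structure may differ.
SAMPLE_HTML = """
<div>
  <span><a href="/7743046/1315279">First post</a></span>
  <span>no link here</span>
  <span><a href="/7743046/1314686">Second post</a></span>
</div>
"""

def link_texts(html):
    """Return the text of every <a> tag nested inside a <span> tag."""
    soup = BeautifulSoup(html, "html.parser")  # parse the markup into a tree
    texts = []
    for span in soup.find_all(name="span"):    # every <span> tag
        for a in span.find_all("a"):           # every <a> inside that span
            texts.append(a.string)
    return texts

titles = link_texts(SAMPLE_HTML)
print(titles)
```

Spans without a link simply contribute nothing, which is why the nested loop needs no extra check.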
Scraping the article titles (code):
#!/usr/bin/env python
#coding:utf-8
#FileName:re_learn01.py
#Function:show first time to use beautifulSoup
#History:25-10-2013
# Python 2 code (print statement, urllib.urlopen)
import bs4
import urllib
from bs4 import BeautifulSoup

def bea_Demo():
    url='http://yxh1157686920.blog.51cto.com/all/7743046'
    ss=urllib.urlopen(url)                  # download the article list page
    page=ss.read()
    soup = BeautifulSoup(page)              # parse the HTML
    print "type(soup)=",type(soup)
    h1userSoup=[]
    h1userSoup = soup.findAll(name="span")  # every <span> tag
    #print "soup=",soup
    for h in h1userSoup:
        res=h.findAll('a')                  # every <a> inside this span
        for r in res:
            if r!=None:
                #print ''
                print "***:",r.string,":--:\n"   # the link text is the title

if __name__=='__main__':
    bea_Demo()
Result:
$ python bea_learn02.py
type(soup)= <class 'bs4.BeautifulSoup'>
***: 原创 :--:

***: 翻译 :--:

***: 转载 :--:

***: Drupal7 Note-1:smtp模块+Gmail搭建邮件发送功能 :--:

***: 初识 XSS 3 :--:

***: Struts 2 漏洞解决办法 :--:

***: 初识 XSS 2 :--:

***: Drupal 函数实现分页机制 :--:

***: drupal 为内容添加分类 :--:

***: drupal 6 DB :--:

2. Extracting the links:
Code:
#!/usr/bin/env python
#coding:utf-8
#FileName:re_learn01.py
#Function:show first time to use beautifulSoup
#History:25-10-2013
# Python 2 code (print statement, urllib.urlopen)
import bs4
import urllib
from bs4 import BeautifulSoup

def bea_Demo():
    url='http://yxh1157686920.blog.51cto.com/all/7743046'
    url1='http://yxh1157686920.blog.51cto.com'   # prefix for relative links
    ss=urllib.urlopen(url)
    page=ss.read()
    soup = BeautifulSoup(page)
    print "type(soup)=",type(soup)
    h1userSoup=[]
    h1userSoup = soup.findAll(name="span")
    #print "soup=",soup
    for h in h1userSoup:
        res=h.findAll('a')
        for r in res:
            if r!=None:
                print ''
                print "文章标题:",r.string,":--:"   # label means "article title"
                s1=str(r).find('href')
                s2=str(r).find('">')         # first '">' in the tag text
                s3= str(r)[s1+6:s2]          # skip 'href="' (6 characters)
                print s3
                # If a title attribute follows the href, cut it off.
                if s3.find('title')>0:
                    ss1=s3.find('title')
                    ss2=s3[:ss1-2]           # also drops the '" ' before title
                    s3=ss2
                s4=url1+s3                   # always prefixed, even if s3 is absolute
                print 'URL:_',s4

if __name__=='__main__':
    bea_Demo()
Result:
$ python bea_learn02.py
type(soup)= <class 'bs4.BeautifulSoup'>

文章标题: 原创 :--:
http://yxh1157686920.blog.51cto.com/7743046/o" title="察看yxh1157686920所有原创文章
URL:_ http://yxh1157686920.blog.51cto.comhttp://yxh1157686920.blog.51cto.com/7743046/o

文章标题: 翻译 :--:
http://yxh1157686920.blog.51cto.com/7743046/t" title="察看yxh1157686920所有翻译文章
URL:_ http://yxh1157686920.blog.51cto.comhttp://yxh1157686920.blog.51cto.com/7743046/t

文章标题: 转载 :--:
http://yxh1157686920.blog.51cto.com/7743046/c" title="察看yxh1157686920所有转载文章
URL:_ http://yxh1157686920.blog.51cto.comhttp://yxh1157686920.blog.51cto.com/7743046/c

文章标题: Drupal7 Note-1:smtp模块+Gmail搭建邮件发送功能 :--:
/7743046/1315279
URL:_ http://yxh1157686920.blog.51cto.com/7743046/1315279

文章标题: 初识 XSS 3 :--:
/7743046/1314686
URL:_ http://yxh1157686920.blog.51cto.com/7743046/1314686

文章标题: Struts 2 漏洞解决办法 :--:
/7743046/1314095
URL:_ http://yxh1157686920.blog.51cto.com/7743046/1314095

文章标题: 初识 XSS 2 :--:
/7743046/1314092
URL:_ http://yxh1157686920.blog.51cto.com/7743046/1314092

文章标题: Drupal 函数实现分页机制 :--:
/7743046/1313463
URL:_ http://yxh1157686920.blog.51cto.com/7743046/1313463

Note that the first three links (原创, 翻译, 转载) are already absolute, so prefixing url1 gives them a doubled host in the URL:_ line. The relative article links come out correctly.
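The doubled host in the first three URLs of the output comes from prefixing `url1` to an href that is already absolute. One possible way around this, sketched here for Python 3 with only the standard library, is `urllib.parse.urljoin`, which returns absolute links unchanged and resolves relative ones against the base. The helper name `absolute_url` and the constant `BASE_URL` are hypothetical and not part of the original script.

```python
from urllib.parse import urljoin

# Same value as url1 in the article's script.
BASE_URL = "http://yxh1157686920.blog.51cto.com"

def absolute_url(href, base=BASE_URL):
    """Resolve href against base; an absolute href is returned unchanged."""
    return urljoin(base, href)

# A relative article link, as printed for the blog posts.
relative = absolute_url("/7743046/1315279")
# An absolute category link, as printed for the first three entries.
already_absolute = absolute_url("http://yxh1157686920.blog.51cto.com/7743046/o")

print(relative)
print(already_absolute)
```

With bs4, the raw href could likely be read with `r.get('href')` rather than by slicing `str(r)`, which would also avoid the extra step of trimming the `title` attribute.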