Python Web Scraping for Beginners: First Steps with BeautifulSoup

Source: Internet  Editor: 程序博客网  Time: 2024/04/29 10:41
From chapters 1 and 2 of 《Python网络数据采集》 (Web Scraping with Python).
The book targets Python 3, but my machine runs Python 2.7, so I made a few small changes.
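Most of the Python 2.7 changes concern imports: `urlopen` and `HTTPError` live in `urllib2` on Python 2, while Python 3 moved them to `urllib.request` and `urllib.error`. Below is a minimal sketch of an import that should work on either version. The fallback pattern is an illustration, not code from the book.

```python
# Try the Python 3 locations first, then fall back to Python 2's urllib2.
try:
    from urllib.request import urlopen
    from urllib.error import HTTPError
except ImportError:
    from urllib2 import urlopen, HTTPError

# From here on, urlopen(url) and "except HTTPError" read the same
# on both interpreters.
```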
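import urllib2
import bs4

def getTitle(url):
    # Return the page's <h1> tag, or None if the page cannot be fetched or parsed
    try:
        html = urllib2.urlopen(url)
    except urllib2.HTTPError as e:
        # The server answered with an error status (404, 500, ...)
        return None
    try:
        bsObj = bs4.BeautifulSoup(html.read(), "lxml")
        title = bsObj.body.h1
    except AttributeError as e:
        # AttributeError is a built-in exception (not part of urllib2);
        # it is raised when bsObj.body is None
        return None
    return title

title = getTitle("http://www.pythonscraping.com/pages/page1.html")
if title is None:
    print("Title could not be found")
else:
    print(title)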
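The two failure modes of tag lookup are easier to see without the network. The sketch below parses inline HTML strings with the standard `html.parser` backend. The function name `get_title_from_html` and the sample strings are made up for illustration. A tag that is missing comes back as `None`, while chained access such as `soup.body.h1` raises `AttributeError` when the intermediate tag is missing.

```python
from bs4 import BeautifulSoup

def get_title_from_html(html):
    """Return the first <h1> inside <body>, or None if either is missing."""
    soup = BeautifulSoup(html, "html.parser")
    try:
        # soup.body is None when there is no <body>, so .h1 raises AttributeError
        return soup.body.h1
    except AttributeError:
        return None

# Hypothetical sample documents
with_title = "<html><body><h1>An Interesting Title</h1></body></html>"
no_h1 = "<html><body><p>No heading here</p></body></html>"
no_body = "<html></html>"

title_tag = get_title_from_html(with_title)
print(title_tag.get_text())
print(get_title_from_html(no_h1))    # missing tag: attribute access returns None
print(get_title_from_html(no_body))  # missing parent: AttributeError was caught
```

Either way the caller only has to check for `None`, which is the same check the script above performs before printing.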
Working with tags
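import urllib2
import bs4

html = urllib2.urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj = bs4.BeautifulSoup(html.read(), "lxml")

# Table contents: .children iterates over the direct children of the tag
for child in bsObj.find("table", {"id": "giftList"}).children:
    print(child)

# Table contents without the header row: .next_siblings iterates over
# the siblings that follow the first <tr>
for sibling in bsObj.find("table", {"id": "giftList"}).tr.next_siblings:
    print(sibling)

# Parent tag: from the <img>, go up with .parent, step back to the
# previous sibling of that parent, and read its text
parent = bsObj.find("img", {"src": "../img/gifts/img1.jpg"}).parent.previous_sibling.get_text()
print(parent)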
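The same navigation can be tried offline on a small table. This is a minimal sketch: the HTML below is a made-up stand-in loosely modeled on the `giftList` table, not the real page, and it is written without whitespace between tags so that no whitespace-only text nodes appear among the children and siblings.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the giftList table
html = (
    '<table id="giftList">'
    '<tr><th>Item</th><th>Cost</th></tr>'
    '<tr id="gift1"><td>Vegetable Basket</td><td>$15.00</td></tr>'
    '<tr id="gift2"><td>Russian Nesting Dolls</td><td>$10,000.52</td></tr>'
    '</table>'
)
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"id": "giftList"})

# .children: direct children only, here the three <tr> rows
rows = list(table.children)

# .next_siblings: everything after the first <tr>, so the header row is skipped
data_rows = list(table.tr.next_siblings)

# .parent and .next_sibling: move up and sideways from a single cell
first_cell = soup.find("tr", {"id": "gift1"}).td
row_id = first_cell.parent["id"]
cost = first_cell.next_sibling.get_text()

print(len(rows))
print([row["id"] for row in data_rows])
print(row_id, cost)
```

On a real page the source usually contains line breaks between tags, so `.children` and `.next_siblings` also yield whitespace strings in addition to tags.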
