Getting Started with Python Web Scraping: First Steps with BeautifulSoup

Source: Internet | Published by: 童瑶的知乎回答 | Editor: 程序博客网 | Date: 2024/04/29 10:41

Based on chapters 1 and 2 of 《Python网络数据采集》 (Web Scraping with Python).

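The book targets Python 3, but my machine runs Python 2.7, so I made a few small changes (mainly using urllib2 in place of urllib.request).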
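One way to keep the same script usable on both interpreters is an import fallback. The snippet below is only a sketch of that idea, not something taken from the book: the helper name `fetch` is hypothetical, and it assumes the only difference that matters here is where `urlopen` and `HTTPError` live.

```python
# Import fallback: try the Python 2.7 module first, then the Python 3 ones.
try:
    from urllib2 import urlopen, HTTPError  # Python 2.7
except ImportError:
    from urllib.request import urlopen      # Python 3
    from urllib.error import HTTPError      # Python 3


def fetch(url):
    """Hypothetical helper: return the raw page bytes, or None on an HTTP error."""
    try:
        return urlopen(url).read()
    except HTTPError:
        return None
```

With this in place, the rest of a script can call `fetch(url)` without caring which Python version is running.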

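import urllib2
import bs4


def getTitle(url):
    # Return None if the server answers with an HTTP error (404, 500, ...)
    try:
        html = urllib2.urlopen(url)
    except urllib2.HTTPError as e:
        return None
    # AttributeError is a built-in exception (not part of urllib2); it is
    # raised here when bsObj.body is None and we try to access .h1 on it
    try:
        bsObj = bs4.BeautifulSoup(html.read(), "lxml")
        title = bsObj.body.h1
    except AttributeError as e:
        return None
    return title


title = getTitle("http://www.pythonscraping.com/pages/page1.html")
if title is None:
    print("Title could not be found")
else:
    print(title)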
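To see why the AttributeError check matters, the same lookup can be tried on inline HTML with no network access. This is a minimal sketch under a few assumptions: the function name `safe_title` and the sample strings are made up for illustration, and it uses the built-in "html.parser" instead of "lxml" so that a missing body tag is not silently added by the parser.

```python
from bs4 import BeautifulSoup


def safe_title(html_text):
    """Hypothetical helper: return the <h1> tag inside <body>, or None."""
    soup = BeautifulSoup(html_text, "html.parser")
    try:
        # soup.body is None when there is no <body>; None.h1 raises AttributeError
        return soup.body.h1
    except AttributeError:
        return None


found = safe_title("<html><body><h1>An Interesting Title</h1></body></html>")
no_h1 = safe_title("<html><body><p>No heading here</p></body></html>")
no_body = safe_title("<p>No body tag at all</p>")

print(found.get_text())  # An Interesting Title
print(no_h1)             # None (body exists, h1 lookup simply returns None)
print(no_body)           # None (AttributeError was caught)
```

A missing tag one level down returns None without raising, while a missing tag in the middle of a chain raises AttributeError, which is why the caller still has to check the result for None.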

Working with tags

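import urllib2
import bs4

html = urllib2.urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj = bs4.BeautifulSoup(html.read(), "lxml")

# Get the table contents: .children iterates over the direct children
for child in bsObj.find("table", {"id": "giftList"}).children:
    print(child)

# Get the table contents without the header row:
# .next_siblings iterates over the siblings that follow the first <tr>
for sibling in bsObj.find("table", {"id": "giftList"}).tr.next_siblings:
    print(sibling)

# Go up to the parent tag with .parent, then step to the previous sibling
# cell with .previous_sibling and read its text
price = bsObj.find("img", {"src": "../img/gifts/img1.jpg"}).parent.previous_sibling.get_text()
print(price)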
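The three navigation steps above can also be tried offline. The following is a rough sketch on a hand-written table, not the real page3.html: the HTML string, the row ids and the prices are assumptions chosen to mimic the shape of the giftList table, and "html.parser" is used so that no extra parser needs to be installed. The HTML is written without whitespace between tags, so no text nodes show up among the children or siblings.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical stand-in for the giftList table (no whitespace between tags)
HTML = (
    '<table id="giftList">'
    '<tr><th>Item</th><th>Cost</th><th>Image</th></tr>'
    '<tr id="gift1"><td>Vegetable Basket</td><td>$15.00</td>'
    '<td><img src="../img/gifts/img1.jpg"></td></tr>'
    '<tr id="gift2"><td>Russian Nesting Dolls</td><td>$10,000.52</td>'
    '<td><img src="../img/gifts/img2.jpg"></td></tr>'
    '</table>'
)

soup = BeautifulSoup(HTML, "html.parser")
table = soup.find("table", {"id": "giftList"})

# .children: direct children only (the three <tr> rows)
child_rows = [child for child in table.children if isinstance(child, Tag)]

# .next_siblings: everything after the first <tr>, i.e. the data rows
data_row_ids = [row.get("id") for row in table.tr.next_siblings
                if isinstance(row, Tag)]

# .parent / .previous_sibling: from the <img> up to its <td>, then one cell left
img = soup.find("img", {"src": "../img/gifts/img1.jpg"})
price = img.parent.previous_sibling.get_text()

print(len(child_rows))  # 3
print(data_row_ids)     # ['gift1', 'gift2']
print(price)            # $15.00
```

On a real page the rows are usually separated by newlines, so .children and .next_siblings also yield whitespace strings; the isinstance(..., Tag) filter is one way to skip them.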

0 0