Web Scraping: BeautifulSoup

Source: Internet | Published by: php工厂模式数据库代码 | Editor: 程序博客网 | Date: 2024/05/14 15:05

The examples below fetch a web page's title.

Basic template

    # -*- coding: utf-8 -*-
    import urllib.request
    import urllib.error
    from bs4 import BeautifulSoup


    def get_title(url):
        # Download the page and decode the response body as UTF-8
        req = urllib.request.Request(url)
        response = urllib.request.urlopen(req)
        html = response.read().decode("utf-8")
        # Parse the HTML and print the <title> tag
        soup = BeautifulSoup(html, "html.parser")
        print(soup.title)


    if __name__ == "__main__":
        url = "http://www.pythonscraping.com/exercises/exercise1.html"
        get_title(url)

Adding exception handling

    # -*- coding: utf-8 -*-
    import urllib.request
    import urllib.error
    from bs4 import BeautifulSoup


    def get_title(url):
        try:
            req = urllib.request.Request(url)
            response = urllib.request.urlopen(req)
        except (urllib.error.HTTPError, urllib.error.URLError) as e:
            # The server returned an error status or could not be reached
            print(e)
            return None
        try:
            html = response.read().decode("utf-8")
            soup = BeautifulSoup(html, "html.parser")
            title = soup.title
        except AttributeError as e:
            return None
        return title


    if __name__ == "__main__":
        url = "http://www.pythonscraping.com/exercises/exercise1.html"
        title = get_title(url)
        if title is None:
            print("Title could not be found")
        else:
            print(title)
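One detail about the AttributeError branch is worth illustrating. With the "html.parser" parser, asking for a tag that is not present (for example soup.title on a page without a title) returns None rather than raising. The AttributeError only shows up when a further access is chained onto that None. Below is a minimal sketch of this behaviour; the HTML strings are made up and first_heading is a hypothetical helper name, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup


def first_heading(html):
    """Return the text of <body><h1>, or None if either tag is missing.

    Hypothetical helper for illustration; not part of BeautifulSoup.
    """
    soup = BeautifulSoup(html, "html.parser")
    try:
        # soup.body is None when there is no <body>, so .h1 raises AttributeError
        heading = soup.body.h1
    except AttributeError:
        return None
    if heading is None:  # <body> exists but contains no <h1>
        return None
    return heading.get_text()


print(first_heading("<html><body><h1>Hello</h1></body></html>"))  # Hello
print(first_heading("<html><body><p>no heading</p></body></html>"))  # None
print(first_heading("<p>no body at all</p>"))  # None
```

This is why the caller still has to check for None after the try block: a missing tag and a broken chain of tags both end up as None here.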

find() and findAll()

With a BeautifulSoup object, you can use the findAll function to pick out only the content inside <span class="green"></span> tags. The result is a Python list of the matching tags.

    # Assumes soup is a BeautifulSoup object built as in the examples above
    nameList = soup.findAll("span", {"class": "green"})
    for name in nameList:
        print(name.get_text())
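The snippet above can be tried without a network connection. The following is a minimal, self-contained sketch; the sample markup is made up for illustration and is not taken from the original page.

```python
from bs4 import BeautifulSoup

# Made-up sample markup; the real page is not reproduced here
html = """
<p>
  <span class="red">Narration.</span>
  <span class="green">Anna</span> met
  <span class="green">Prince Vasili</span>.
</p>
"""

soup = BeautifulSoup(html, "html.parser")
name_list = soup.findAll("span", {"class": "green"})

# Each item is a Tag object; get_text() extracts its text
names = [name.get_text() for name in name_list]
print(names)  # ['Anna', 'Prince Vasili']
```

The span with class "red" is skipped because the attribute dictionary restricts the match to class "green".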

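Note: when should you use .get_text(), and when should you keep the tags?
.get_text() strips every tag from the HTML you are working with and returns a string containing only the text. If you are processing a large block of source with many hyperlinks, paragraphs, and other tags, .get_text() removes all of them and leaves only the tag-free text.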

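Finding the information you want through a BeautifulSoup object is much easier than searching the raw HTML text. .get_text() should usually be the last step, applied just before you print, store, or manipulate the final data. In general, keep the document's tag structure for as long as possible.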
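To make the advice concrete, here is a small sketch of what is lost once .get_text() has been called. The HTML string and the /docs/intro.html path are invented for illustration.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration
html = '<p>Read the <a href="/docs/intro.html">intro</a> first.</p>'
soup = BeautifulSoup(html, "html.parser")

paragraph = soup.find("p")

# While the tag structure is intact, the link target is still reachable
link = paragraph.find("a")
href = link["href"]

# get_text() flattens everything to a plain string; the href is gone
text = paragraph.get_text()

print(href)  # /docs/intro.html
print(text)  # Read the intro first.
```

Once text has been produced there is no way to recover the link from it, which is the reason for calling .get_text() only at the end.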

    from bs4 import BeautifulSoup
    import re

    # Create a BeautifulSoup object from an HTML string
    html_doc = '<a href="/view/123.htm" class="article_link">Python</a>'
    # from_encoding applies only to byte input, so it is omitted for this str
    soup = BeautifulSoup(html_doc, "html.parser")

    # Find all nodes whose tag is a
    print(soup.findAll("a"))

    # Find all a nodes whose link matches /view/123.htm
    print(soup.findAll("a", href="/view/123.htm"))
    print(soup.findAll("a", href=re.compile(r'/view/\d+\.htm')))

    # Find all a nodes with class article_link and text Python
    print(soup.findAll("a", class_="article_link", text="Python"))
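The examples so far use only findAll(). As a rough sketch of how find() differs: find() returns the first matching tag (or None), while findAll() always returns a list, which may be empty. The second link in the sample markup (/view/456.htm) is made up for illustration.

```python
from bs4 import BeautifulSoup

# The second link is invented so that there is more than one match
html = (
    '<a href="/view/123.htm" class="article_link">Python</a>'
    '<a href="/view/456.htm" class="article_link">Java</a>'
)
soup = BeautifulSoup(html, "html.parser")

first = soup.find("a", class_="article_link")     # first match only
every = soup.findAll("a", class_="article_link")  # list of all matches
limited = soup.findAll("a", limit=1)              # list capped at one item
missing = soup.find("div")                        # no match: None
none_found = soup.findAll("div")                  # no match: empty list

print(first["href"])  # /view/123.htm
print(len(every))  # 2
print(missing, list(none_found))  # None []
```

Because find() can return None, check its result before reading attributes from it; a loop over the result of findAll() needs no such check.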

0 0