Python 3 Study Notes 12: urllib and BeautifulSoup

Source: Internet | Published by: 贴吧群发软件 | Editor: 程序博客网 | Time: 2024/06/07 16:26

1 Introduction

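urllib is part of the Python standard library. It provides functions for requesting data over the network, handling cookies, and even changing metadata such as request headers and the user agent.
In Python 3, urllib is split into submodules, including urllib.request, urllib.parse, and urllib.error (there is also urllib.robotparser for robots.txt files).
The rest of these notes uses the urlopen() function from urllib.request, which opens and reads a remote object fetched over the network.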
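
As a minimal sketch of how these submodules fit together, the snippet below builds a request with a custom User-Agent header, opens it with urlopen(), and catches failures through urllib.error. The helper name fetch, the default header value, and the data: URL used to keep the demonstration offline are assumptions made for this illustration, not part of the original notes.

```python
from urllib.error import URLError
from urllib.parse import quote, urlparse
from urllib.request import Request, urlopen


def fetch(url, user_agent="study-notes-example/0.1"):
    """Return the response body as bytes, or None if the request fails.

    `fetch` and the default User-Agent string are hypothetical names
    chosen for this sketch.
    """
    # Request lets us attach headers such as User-Agent before sending.
    req = Request(url, headers={"User-Agent": user_agent})
    try:
        with urlopen(req) as response:
            return response.read()
    except URLError as err:  # HTTPError is a subclass of URLError
        print("request failed:", err.reason)
        return None


# urllib.parse splits a URL into its components.
parts = urlparse("http://www.pythonscraping.com/pages/page2.html")
print(parts.netloc, parts.path)

# A data: URL keeps the demonstration offline; an http:// URL is passed the same way.
demo_url = "data:text/html," + quote("<h1>An Interesting Title</h1>")
print(fetch(demo_url))
```

Returning None on failure is only one possible design; a real scraper might prefer to log the error and retry, or let the exception propagate.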

BeautifulSoup is a third-party library.
It formats and organizes messy web content by locating HTML tags, and presents the XML structure as simple, easy-to-use Python objects.
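
To make "Python objects" concrete, here is a small sketch that parses a literal HTML string instead of a live page. The markup and variable names are made up for illustration; "html.parser" is the parser that ships with the standard library.

```python
from bs4 import BeautifulSoup

# A tiny hand-written document (an assumption for this sketch, not a real page).
markup = '<html><body><h1>Title</h1><div id="main"><p>Hello</p></div></body></html>'
soup = BeautifulSoup(markup, "html.parser")

div = soup.div                 # a Tag object for the first <div>
print(div.name)                # the tag's name
print(div["id"])               # attributes are read like dictionary keys
print(div.p.get_text())        # nested tags are reachable as attributes
```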

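The most convenient way to install a third-party library for Python is to run pip install … on the command line.
For Python 3 on Windows, enter pip3 install bs4 at the command prompt.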
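
One way to confirm that the installation worked is to ask Python whether the module can be found. This is a sketch only; the helper name is_installed is hypothetical.

```python
import importlib.util


def is_installed(module_name):
    """Return True if `module_name` can be imported in this interpreter."""
    return importlib.util.find_spec(module_name) is not None


# Prints True once `pip3 install bs4` has succeeded for this interpreter.
print("bs4 available:", is_installed("bs4"))
```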

Combining urllib with BeautifulSoup makes a lot of things possible.

2 bsobj.tagName

bsobj.tagName returns the first tag with the given name on the page.

The page http://www.pythonscraping.com/pages/page2.html contains the following code.

<html><head><title>A Useful Page</title></head><body><h1>An Interesting Title</h1><div class="body" id="fakeLatin">Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</div></body></html>

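Run the following code in Python 3's IDLE to fetch the h1 tag from http://www.pythonscraping.com/pages/page2.html.

from urllib.request import urlopen
from bs4 import BeautifulSoup

# Fetch the page and parse it; naming the parser avoids bs4's "no parser specified" warning
html = urlopen("http://www.pythonscraping.com/pages/page2.html")
bsobj = BeautifulSoup(html.read(), "html.parser")
# Attribute access returns the first <h1> tag in the document
print(bsobj.h1)

Output: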

<h1>An Interesting Title</h1>
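
The "first tag only" behavior, and what happens when a tag is missing, can be sketched offline. The markup below is hand-written for illustration and is not the page above.

```python
from bs4 import BeautifulSoup

# Hand-written markup with two <h1> tags (an assumption for illustration).
markup = "<html><body><h1>First</h1><h1>Second</h1></body></html>"
bsobj = BeautifulSoup(markup, "html.parser")

print(bsobj.h1)             # only the first <h1> is returned
print(bsobj.body.h1)        # the same tag, reached through its parent
print(bsobj.table)          # a tag that does not exist gives None

# Guard against None before calling methods on the result.
heading = bsobj.h1
if heading is not None:
    print(heading.get_text())
```

Because a missing tag yields None rather than an exception, chained access such as bsobj.table.tr would fail with an AttributeError, so checking for None first is a reasonable habit.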

3 The findAll() and get_text() methods

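bsobj.findAll(tagName, tagAttributes) returns every matching tag on the page.
The attributes argument is a Python dictionary holding one or more attributes of a tag and their values. The code below uses the statement namelist = bsobj.findAll("span", {"class": "green"}).
.get_text() strips out all markup (hyperlinks, paragraphs, and other tags) and leaves only a string of untagged text.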

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/warandpeace.html")
bsobj = BeautifulSoup(html, "html.parser")
# Collect every <span class="green"> tag and print its text without markup
namelist = bsobj.findAll("span", {"class": "green"})
for name in namelist:
    print(name.get_text())

The program prints the text inside every span tag whose class attribute is green.
On www.pythonscraping.com/pages/warandpeace.html, these spans hold the names of people.
Output:

Anna
Pavlovna Scherer
Empress Marya
Fedorovna
Prince Vasili Kuragin
Anna Pavlovna
St. Petersburg
the prince
Anna Pavlovna
Anna Pavlovna
the prince
the prince
the prince
Prince Vasili
Anna Pavlovna
Anna Pavlovna
the prince
Wintzingerode
King of Prussia
le Vicomte de Mortemart
Montmorencys
Rohans
Abbe Morio
the Emperor
the prince
Prince Vasili
Dowager Empress Marya Fedorovna
the baron
Anna Pavlovna
the Empress
the Empress
Anna Pavlovna's
Her Majesty
Baron
Funke
The prince
Anna
Pavlovna
the Empress
The prince
Anatole
the prince
The prince
Anna
Pavlovna
Anna Pavlovna
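
The difference between a matched tag and its get_text() result can also be checked without a network connection. The markup below is invented to resemble colored spans and is not copied from the real page.

```python
from bs4 import BeautifulSoup

# Invented markup for this sketch (not the real warandpeace.html).
markup = (
    '<p><span class="red">Well, Prince,</span> said '
    '<span class="green">Anna <b>Pavlovna</b></span> to '
    '<span class="green">Prince Vasili</span>.</p>'
)
bsobj = BeautifulSoup(markup, "html.parser")

green = bsobj.findAll("span", {"class": "green"})
print(len(green))                          # number of matching spans
print(green[0])                            # still contains the inner <b> tag
print([tag.get_text() for tag in green])   # tags stripped, text kept
```

Since get_text() discards the tag structure, it is usually best called last, after all searching on the tag objects is done.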

Definitions of the findAll() and find() methods:
findAll(tag, attributes, recursive, text, limit, keywords)
find(tag, attributes, recursive, text, keywords)

In most cases you only need to pass the first two arguments: the tag name and the tag attributes.
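
For the less common arguments, the following sketch shows limit, recursive, and a keyword argument in action. The markup is invented for illustration.

```python
from bs4 import BeautifulSoup

# Invented markup for illustration only.
markup = (
    '<div id="top"><span class="green">a</span>'
    '<p><span class="green">b</span></p>'
    '<span class="green" id="last">c</span></div>'
)
bsobj = BeautifulSoup(markup, "html.parser")
top = bsobj.div

# limit stops the search after the given number of matches.
first_two = [s.get_text() for s in bsobj.findAll("span", {"class": "green"}, limit=2)]
print(first_two)

# recursive=False looks only at direct children, so the span inside <p> is skipped.
direct = [s.get_text() for s in top.findAll("span", recursive=False)]
print(direct)

# Keyword arguments filter on an attribute; find() returns a single tag (or None).
last = bsobj.find("span", id="last")
print(last.get_text())
```

find() behaves like findAll() with limit set to 1, except that it returns the tag itself instead of a one-element list.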

Reference book: Python网络数据采集 (the Chinese edition of Web Scraping with Python)
