Exercises 008-009


Problem 0008: Given an HTML file, find its body text.
Problem 0009: Given an HTML file, find the links in it.

I solved both with BeautifulSoup. Each task only takes a single method call, which is quite convenient.
The program is as follows:

#!/usr/bin/env python
# coding: utf-8
from bs4 import BeautifulSoup

html = '''<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''

# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(html, 'html.parser')

# Problem 0008: print all the text in the document
print(soup.get_text())

# Problem 0009: print the href of every <a> tag
for a in soup.find_all('a'):
    print(a.get('href'))
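
For comparison, here is a rough sketch of the same two tasks using only the standard library's html.parser, for cases where BeautifulSoup is not installed. This is not the approach used in the post: the class name TextAndLinkParser, the function extract_text_and_links, and the sample string are all made up for illustration, and the whitespace handling (stripping each text chunk and joining with single spaces) is a simplification that will not match get_text() exactly.

```python
from html.parser import HTMLParser


class TextAndLinkParser(HTMLParser):
    """Hypothetical helper: collects text chunks and <a href> values."""

    SKIPPED = {"script", "style"}  # contents of these tags are not body text

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.links = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "a":
            # attrs is a list of (name, value) pairs
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-blank text that is not inside <script> or <style>
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def extract_text_and_links(html):
    """Return (text, links) for an HTML string."""
    parser = TextAndLinkParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.text_parts), parser.links


# Made-up sample input, shorter than the one in the post
sample = (
    '<p>Hello <a href="http://example.com/elsie">Elsie</a> and '
    '<a href="/lacie">Lacie</a></p><script>var x = 1;</script>'
)
text, links = extract_text_and_links(sample)
print(text)
print(links)
```

BeautifulSoup remains the more convenient choice, since it copes with malformed markup and gives you a tree to search. The sketch above only reacts to tags as they stream past.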

If you are interested, take a look at this document:
BeautifulSoup 4.2.0 documentation

(Written on May 6, 2016, http://blog.csdn.net/bzd_111)

0 0