[Web Scraping] Python Scraping with urllib and BeautifulSoup

Source: Internet | Editor: 程序博客网 | Time: 2024/05/21 06:43

Documentation and Source Code

Scraper source code: https://github.com/REMitchell/python-scraping
urllib documentation: https://docs.python.org/3/library/urllib.html
BeautifulSoup documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Required files: http://pan.baidu.com/s/1i55olGL (password: 1985)

Installing BeautifulSoup

BeautifulSoup parses the documents you fetch, whether they are HTML or XML.
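As a rough illustration of what "parsing" means here, the sketch below feeds a small HTML string to BeautifulSoup and reads elements back out of it. It assumes bs4 is already installed; the HTML snippet and the `intro` class name are made up for this example and do not come from the original post.

```python
from bs4 import BeautifulSoup

# A made-up HTML document, used only for illustration.
html = (
    "<html><body>"
    "<h1>An Example Page</h1>"
    "<p class='intro'>Hello</p>"
    "</body></html>"
)

# "html.parser" is the parser bundled with the Python standard library.
soup = BeautifulSoup(html, "html.parser")

# Tags can be reached as attributes of the parsed tree.
h1_text = soup.body.h1.get_text()

# find() looks a tag up by name and attributes.
intro_text = soup.find("p", class_="intro").get_text()

# A tag that does not exist comes back as None rather than raising.
missing = soup.body.h2

print(h1_text)
print(intro_text)
print(missing)
```

Note that a missing tag yields `None`, so chaining another attribute onto it (for example `soup.body.h2.get_text()`) raises `AttributeError`. The scraper later in this post relies on that behaviour.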

Download a release:
https://www.crummy.com/software/BeautifulSoup/bs4/download/
Extract it into Python's lib directory.
In cmd, change into the beautifulsoup folder and run:
```
python setup.py build
python setup.py install
```
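If you get the error "You are trying to run the Python 2 version of Beautiful Soup under Python 3. This will not work", convert the package with 2to3:
1. Extract the bs4 folder into python/lib
2. Copy python/Tools/scripts/2to3.py into the lib directory as well
3. In cmd, change to the python/lib folder and run 2to3.py bs4 -w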

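Notes:

2to3.py param1 (-w)

param1 is the .py file or folder to convert (every .py file inside a folder is converted).
-w is optional. Without it, the proposed changes are only printed to the screen; with it, the converted code is written back to the original files.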
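After installing (or converting with 2to3), it can help to confirm that the interpreter you are using can actually import bs4. The following is a minimal sketch; `check_bs4` is a hypothetical helper name chosen for this example, and the `SyntaxError` branch assumes that unconverted Python 2 source fails to compile under Python 3.

```python
def check_bs4():
    """Return a short status string describing whether bs4 can be imported."""
    try:
        import bs4
    except ImportError:
        # The package is not on this interpreter's search path.
        return "not installed"
    except SyntaxError:
        # Assumption: Python 2 source that was never converted fails to compile.
        return "installed, but the source does not compile under Python 3"
    return "ok, version " + bs4.__version__


status = check_bs4()
print(status)
```

If the status is anything other than "ok", re-check which Python installation the package was extracted into before running the scraper.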

A Simple, Safe Scraper

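    from urllib.request import urlopen
    from urllib.error import HTTPError, URLError
    from bs4 import BeautifulSoup

    def getTitle(url):
        try:
            html = urlopen(url)
        except (HTTPError, URLError) as e:
            # The server returned an error or could not be reached
            return None
        try:
            # Name the parser explicitly to avoid bs4's "no parser specified" warning
            bsObj = BeautifulSoup(html, "html.parser")
            title = bsObj.body.h1
        except AttributeError as e:
            # The page has no <body>, so bsObj.body is None
            return None
        return title

    title = getTitle("url")  # replace "url" with the address of the page to scrape
    if title is None:
        print("Title could not be found")
    else:
        print(title)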
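The scraper above treats `HTTPError` and `URLError` as failures. One case it does not cover is a string that is not a usable URL at all: `urlopen` raises `ValueError` for it, which is what happens if the `"url"` placeholder is left unchanged. The sketch below shows one way to handle the three cases separately. `fetch_html` is a hypothetical helper name, the 10-second timeout is an arbitrary choice, and the `data:` and `file:` URLs are used only so the example runs without network access.

```python
from urllib.error import HTTPError, URLError
from urllib.parse import quote
from urllib.request import urlopen


def fetch_html(url, timeout=10):
    """Return the response body as bytes, or None if the fetch fails."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.read()
    except HTTPError:
        # The server answered with an error status such as 404 or 500.
        # HTTPError is a subclass of URLError, so it must be caught first.
        return None
    except URLError:
        # The server could not be reached, or the resource does not exist.
        return None
    except ValueError:
        # The string is not a usable URL, for example it has no scheme.
        return None


# Offline checks: a data: URL that succeeds, a malformed URL, a missing file.
page = "<html><body><h1>Hello</h1></body></html>"
ok = fetch_html("data:text/html," + quote(page))
bad_url = fetch_html("url")
missing = fetch_html("file:///no/such/dir/no_such_file.html")

print(ok)
print(bad_url)
print(missing)
```

Returning `None` for every failure keeps the calling code as simple as in the original scraper. A real crawler would probably log which of the three cases occurred.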
