[Crawler] Python Web Scraping with urllib and BeautifulSoup

Source: Internet  Editor: 程序博客网  Date: 2024/05/21 06:43

Documentation and Source Code

Open-source scraper code: https://github.com/REMitchell/python-scraping
urllib documentation: https://docs.python.org/3/library/urllib.html
BeautifulSoup documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Required files: http://pan.baidu.com/s/1i55olGL (password: 1985)

Installing BeautifulSoup

BeautifulSoup parses the documents you fetch, whether they are HTML or XML.
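To make "parsing" concrete, here is a minimal sketch that assumes bs4 is already installed. The HTML string is a made-up stand-in for a downloaded page, and "html.parser" is Python's built-in parser, chosen here so that nothing else needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet standing in for a downloaded page.
html = (
    "<html><head><title>Demo</title></head>"
    "<body><h1>Hello</h1><a href='/next'>next</a></body></html>"
)

# Build a parse tree with the parser that ships with Python.
soup = BeautifulSoup(html, "html.parser")

title_text = soup.title.get_text()   # text inside <title>
heading_text = soup.h1.get_text()    # text inside the first <h1>
link_target = soup.a["href"]         # attribute lookup on the first <a>

print(title_text, heading_text, link_target)
```

Tags are reached as attributes of the parsed object, which is the same access style the scraper later in this post relies on.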

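  1. Download a release from
    https://www.crummy.com/software/BeautifulSoup/bs4/download/
  2. Extract the archive into Python's lib directory.
  3. In cmd, change into the beautifulsoup folder and run:

    setup.py build
    setup.py install

    If you get the error "You are trying to run the Python 2 version of Beautiful Soup under Python 3. This will not work", convert the package to Python 3:

    1. Extract the bs4 folder into python/lib.
    2. Copy python/Tools/scripts/2to3.py into the same lib directory.
    3. In cmd, change into python/lib and run 2to3.py bs4 -w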

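Notes:

2to3.py param1 (-w)

param1 is the .py file or folder to convert (every .py file inside a folder is converted).
-w is optional. Without it, 2to3 only prints the proposed changes to the screen; with it, the converted code is written back to the original files.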
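Once the install (and, if needed, the 2to3 conversion) has finished, a quick import check can confirm that Python actually sees the package. This is a minimal sketch; the helper name `is_installed` is hypothetical and not part of bs4.

```python
from importlib.util import find_spec


def is_installed(module_name):
    """Return True if Python can locate a top-level module with this name."""
    return find_spec(module_name) is not None


if is_installed("bs4"):
    try:
        import bs4
        print("bs4 imported, version", bs4.__version__)
    except Exception as exc:
        # For example, an unconverted Python 2 copy of the package.
        print("bs4 was found but failed to import:", exc)
else:
    print("bs4 not found; re-check the install steps")
```

If the second message appears, the package is on the path but is probably still the Python 2 source, which is the case the 2to3 steps above are meant to address.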

A Simple, Safe Scraper

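    from urllib.request import urlopen
    from urllib.error import HTTPError, URLError
    from bs4 import BeautifulSoup

    def getTitle(url):
        # Network failures: the server returned an error or could not be reached.
        try:
            html = urlopen(url)
        except (HTTPError, URLError):
            return None
        # Parsing failures: a missing <body> makes .h1 raise AttributeError.
        try:
            bsObj = BeautifulSoup(html, "html.parser")
            title = bsObj.body.h1
        except AttributeError:
            return None
        return title

    title = getTitle("url")  # replace "url" with the address of a real page
    if title is None:
        print("Title could not be found")
    else:
        print(title)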
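The parsing half of the scraper can be tried without any network access. The sketch below is an illustration, not part of the original code: the helper `first_h1` and the three HTML strings are hypothetical. It shows that a missing h1 simply yields None, while a missing body (with "html.parser") is the case that triggers AttributeError.

```python
from bs4 import BeautifulSoup


def first_h1(html):
    """Return the first <h1> inside <body>, or None if it cannot be reached."""
    soup = BeautifulSoup(html, "html.parser")
    try:
        # soup.body is None when there is no <body> tag, so .h1 raises.
        return soup.body.h1
    except AttributeError:
        return None


with_h1 = first_h1("<html><body><h1>Hello</h1></body></html>")
no_h1 = first_h1("<html><body><p>no heading here</p></body></html>")
no_body = first_h1("<p>fragment without a body tag</p>")

print(with_h1)  # <h1>Hello</h1>
print(no_h1, no_body)
```

Either way the caller receives None, which is why the scraper only needs a single "is None" check before printing.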

Result