Web Scraping with Python, Chapter 1: Reading Notes


Preface: officially starting my Python web-scraping journey

Chapter 1. Your First Web Scraper

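1. Installing the libraries

This chapter uses two libraries: urllib and the BeautifulSoup 4 library (commonly called BS4). The former is part of Python's standard library; BS4 has to be installed separately. On Windows 10, install it by running pip install beautifulsoup4. The session looks like this: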

D:\PythonProject\webScraping>pip install beautifulsoup4
Collecting beautifulsoup4
  Downloading beautifulsoup4-4.5.1-py3-none-any.whl (83kB)
    100% |████████████████████████████████| 92kB 67kB/s
Installing collected packages: beautifulsoup4
Successfully installed beautifulsoup4-4.5.1

D:\PythonProject\webScraping>
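A quick way to confirm that the install worked is to ask Python whether it can locate the package. The snippet below is only a minimal sketch: the helper name is_installed is made up for this note, and it relies on nothing beyond the standard library's importlib.

```python
import importlib.util


def is_installed(module_name):
    """Return True if `module_name` can be located on the import path.

    `is_installed` is a hypothetical helper for this note, not part of bs4.
    """
    return importlib.util.find_spec(module_name) is not None


# urllib ships with Python; bs4 is the import name of beautifulsoup4
for name in ("urllib", "bs4"):
    status = "available" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

If bs4 is reported as missing, the pip command above was probably run against a different Python installation than the one running the script.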

2. A web scraping example

from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup


def getTitle(url):
    try:
        html = urlopen(url)
    except HTTPError as e:
        # The server answered with an HTTP error (404, 500, ...)
        return None
    try:
        # Name the parser explicitly to avoid bs4's "no parser specified" warning
        bsObj = BeautifulSoup(html.read(), "html.parser")
        title = bsObj.body.h1
    except AttributeError as e:
        # bsObj.body was None, so .h1 could not be looked up on it
        return None
    return title


title = getTitle("http://www.pythonscraping.com/exercises/exercise1.html")
# bsObj = BeautifulSoup(html.read())
# print(bsObj.h1)
if title is None:
    print("Title not found")
else:
    print(title)

3. Program output

a. The source of the exercise1.html page:

<html>
<head>
<title>A Useful Page</title>
</head>
<body>
<h1>An Interesting Title</h1>
<div>
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
</div>
</body>
</html>

b. The output of the scraper:

<h1>An Interesting Title</h1>

Process finished with exit code 0
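To see what the tag lookup does without touching the network, the same navigation can be tried on a string. This is a rough sketch, assuming bs4 is installed: PAGE is a trimmed copy of the page above, and first_h1 is a name invented for this note. It shows the two ways the lookup can come back empty: a missing h1 simply yields None, while a missing body makes the chained .h1 access raise AttributeError.

```python
from bs4 import BeautifulSoup

# Trimmed copy of exercise1.html so the example runs offline
PAGE = (
    "<html><head><title>A Useful Page</title></head>"
    "<body><h1>An Interesting Title</h1><div>Lorem ipsum</div></body></html>"
)


def first_h1(markup):
    """Return the first <h1> inside <body>, or None if either tag is missing.

    `first_h1` is a hypothetical helper used only for this illustration.
    """
    soup = BeautifulSoup(markup, "html.parser")
    try:
        return soup.body.h1
    except AttributeError:
        # soup.body was None, so ".h1" was looked up on None
        return None


found = first_h1(PAGE)
no_h1 = first_h1("<html><body><p>no heading here</p></body></html>")
no_body = first_h1("<p>just a fragment</p>")

print(found)
print(no_h1)
print(no_body)
```

Both failure cases end up as None, which is why the calling code only needs a single check before printing.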

4. Notes on exception handling

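html = urlopen(url)
The call to urlopen() can fail in two ways:
1. The requested page is not found on the server (or the server hits an error while serving it).
2. The server itself cannot be reached.
The two cases surface as follows:
In the first case, the server returns an HTTP error such as "404 Page Not Found" or "500 Internal Server Error", and urlopen() raises an HTTPError.
In the second case, urlopen() raises a URLError, the parent class of HTTPError. The example above catches only HTTPError, so an unreachable or mistyped host would still crash it.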
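The two failure modes can be handled side by side roughly as follows. This is a sketch rather than the book's code: fetch and its opener parameter are names made up for this note (opener exists only so the function can be tried without a real network call), while urlopen, HTTPError and URLError are the standard-library names.

```python
from urllib.request import urlopen
from urllib.error import HTTPError, URLError


def fetch(url, opener=urlopen):
    """Return the response body as bytes, or None if the request failed.

    `opener` is a hypothetical hook for this sketch; by default it is
    the real urllib.request.urlopen.
    """
    try:
        with opener(url) as response:
            return response.read()
    except HTTPError as e:
        # The server was reached but answered with an error status.
        # This clause must come first because HTTPError subclasses URLError.
        print(f"HTTP error {e.code} for {url}")
    except URLError as e:
        # The server could not be reached at all (bad host, no connection)
        print(f"Could not reach the server for {url}: {e.reason}")
    return None
```

Calling fetch("http://www.pythonscraping.com/exercises/exercise1.html") should return the page bytes when the site is up; either kind of failure prints a short message and returns None, so the caller has only one case to check.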

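In addition, when writing a scraper you have to balance how thoroughly the code handles exceptions against how readable it stays.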

0 0