Reading Notes -- Web Scraping with Python 02 -- Data Scraping


Scraping: the crawler extracts data from downloaded web pages in order to put it to some use.
Three approaches to extracting data from a web page: regular expressions, Beautiful Soup, and lxml.

2.1 Analyzing a Web Page

  • Use the browser's built-in option to view the page source.
  • Use the Firebug Lite extension (http://getfirebug.com/firebuglite) to inspect the page. Firefox users can install the full version of Firebug.

2.2 Three Approaches to Scraping a Web Page

2.2.1 Regular Expressions
Python regular expressions (2.x): https://docs.python.org/2/howto/regex.html

Although data can be scraped by matching a single page element, such a regular expression tends to break as soon as the page changes.
A more robust approach is to also include a parent element with a unique ID in the matching pattern.

import urllib2
import re


def scrape(html):
    # anchor the match on the row with the unique id, then capture the following <td class="w2p_fw"> cell
    area = re.findall('<tr id="places_area__row">.*?<td\s*class=["\']w2p_fw["\']>(.*?)</td>', html)[0]
    return area


if __name__ == '__main__':
    html = urllib2.urlopen('http://example.webscraping.com/view/United-Kingdom-239').read()
    print scrape(html)
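For contrast, the brittle single-element matching mentioned above would grab every <td class="w2p_fw"> cell and rely purely on position; a sketch (the index chosen for the area value is an assumption about this particular page's layout):

import urllib2
import re

# Sketch of the brittle approach: match the property cells only by their class,
# then pick the area value by its position in the result list.
html = urllib2.urlopen('http://example.webscraping.com/view/United-Kingdom-239').read()
cells = re.findall('<td class="w2p_fw">(.*?)</td>', html)
area = cells[1]  # assumed position of the area cell; breaks if the page layout changes
print area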

Overall, the regular expression approach is not well suited to pages that change frequently, and the patterns themselves are hard to construct and hard to read.

2.2.2 Beautiful Soup
Beautiful Soup is a Python library for extracting data from HTML and XML files.
Compared with regular expressions, code written with Beautiful Soup is easier to construct and to understand.
https://www.crummy.com/software/BeautifulSoup/
https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/
Install the module: pip install beautifulsoup4 -i https://mirrors.ustc.edu.cn/pypi/web/simple/
To use Beautiful Soup, first parse the downloaded HTML into a soup document, which fixes up the actual markup; then use methods such as find() and find_all() to locate the elements you need.

# -*- coding: utf-8 -*-
import urllib2
from bs4 import BeautifulSoup


def scrape(html):
    soup = BeautifulSoup(html, "html.parser")
    tr = soup.find(attrs={'id': 'places_area__row'})  # locate the area row
    # 'class' is a special python attribute so instead 'class_' is used
    td = tr.find(attrs={'class': 'w2p_fw'})  # locate the area tag
    area = td.text  # extract the area contents from this tag
    return area


if __name__ == '__main__':
    html = urllib2.urlopen('http://example.webscraping.com/view/United-Kingdom-239').read()
    print scrape(html)

2.2.3 Lxml
lxml is a Python wrapper around libxml2, an XML parsing library.
http://lxml.de/
http://lxml.de/installation.html
A CSS selector is the pattern used to select elements. Compared with XPath selectors, CSS selectors are more concise.
Internally, however, lxml actually converts CSS selectors into equivalent XPath selectors.

# -*- coding: utf-8 -*-
import urllib2
import lxml.html


def scrape(html):
    tree = lxml.html.fromstring(html)
    # CSS selector: the <td class="w2p_fw"> directly inside the row with id places_area__row
    td = tree.cssselect('tr#places_area__row > td.w2p_fw')[0]
    area = td.text_content()
    return area


if __name__ == '__main__':
    html = urllib2.urlopen('http://example.webscraping.com/view/United-Kingdom-239').read()
    print scrape(html)
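As a side note on the CSS-to-XPath conversion mentioned above, lxml's CSSSelector class exposes the XPath expression it generates, so the translation can be inspected directly (a minimal sketch; recent lxml versions may also require the separate cssselect package):

# -*- coding: utf-8 -*-
from lxml.cssselect import CSSSelector

# Build a compiled selector and inspect the XPath that lxml will actually evaluate.
sel = CSSSelector('tr#places_area__row > td.w2p_fw')
print sel.css   # the original CSS selector
print sel.path  # the equivalent XPath expression generated by lxml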

2.2.4 Performance Comparison

# -*- coding: utf-8 -*-
import csv
import time
import urllib2
import re
from bs4 import BeautifulSoup
import lxml.html

FIELDS = ('area', 'population', 'iso', 'country', 'capital', 'continent', 'tld', 'currency_code', 'currency_name',
          'phone', 'postal_code_format', 'postal_code_regex', 'languages', 'neighbours')


def regex_scraper(html):
    results = {}
    for field in FIELDS:
        results[field] = re.search('<tr id="places_{}__row">.*?<td class="w2p_fw">(.*?)</td>'.format(field),
                                   html).groups()[0]
    return results


def beautiful_soup_scraper(html):
    soup = BeautifulSoup(html, 'html.parser')
    results = {}
    for field in FIELDS:
        results[field] = soup.find('table').find('tr', id='places_{}__row'.format(field)).find('td', class_='w2p_fw').text
    return results


def lxml_scraper(html):
    tree = lxml.html.fromstring(html)
    results = {}
    for field in FIELDS:
        results[field] = tree.cssselect('table > tr#places_{}__row > td.w2p_fw'.format(field))[0].text_content()
    return results


def main():
    times = {}
    html = urllib2.urlopen('http://example.webscraping.com/view/United-Kingdom-239').read()
    NUM_ITERATIONS = 1000  # number of times to test each scraper
    for name, scraper in ('Regular expressions', regex_scraper), ('Beautiful Soup', beautiful_soup_scraper), ('Lxml', lxml_scraper):
        times[name] = []
        # record start time of scrape
        start = time.time()
        for i in range(NUM_ITERATIONS):
            if scraper == regex_scraper:
                # the regular expression module caches search results by default,
                # so the cache needs to be purged for meaningful timings
                re.purge()
            result = scraper(html)
            # check scraped result is as expected
            assert(result['area'] == '244,820 square kilometres')
            times[name].append(time.time() - start)
        # record end time of scrape and output the total
        end = time.time()
        print '{}: {:.2f} seconds'.format(name, end - start)

    writer = csv.writer(open('times.csv', 'w'))
    header = sorted(times.keys())
    writer.writerow(header)
    for row in zip(*[times[scraper] for scraper in header]):
        writer.writerow(row)


if __name__ == '__main__':
    main()

2.2.5 Conclusion
The lxml approach is both fast and robust, and is usually the best choice for scraping data; regular expressions and Beautiful Soup are only useful in certain specific scenarios.

Scraping approach   | Performance | Difficulty of use | Difficulty of installation
Regular expressions | Fast        | Hard              | Easy (built-in module)
Beautiful Soup      | Slow        | Easy              | Easy (pure Python)
Lxml                | Fast        | Easy              | Relatively hard

2.2.6 Adding a Scrape Callback to the Link Crawler
The special method __call__ is invoked when an object is called like a function, as sketched below.
Python 2 special method names: https://docs.python.org/2/reference/datamodel.html#special-method-names
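A minimal sketch of how __call__ lets a class instance be passed to the link crawler as its scrape callback and then be invoked like a function. The ScrapeCallback name, the (url, html) signature, and the link_crawler keyword argument are assumptions illustrating the pattern, not necessarily the book's exact code:

# -*- coding: utf-8 -*-
import csv
import re
import lxml.html


class ScrapeCallback:
    """Callable object that extracts country data and appends it to a CSV file."""

    def __init__(self):
        self.writer = csv.writer(open('countries.csv', 'w'))
        self.fields = ('area', 'population', 'iso', 'country', 'capital')
        self.writer.writerow(self.fields)

    def __call__(self, url, html):
        # only country "view" pages contain the data table, so skip index pages
        if re.search('/view/', url):
            tree = lxml.html.fromstring(html)
            row = [tree.cssselect('table > tr#places_{}__row > td.w2p_fw'.format(field))[0].text_content()
                   for field in self.fields]
            self.writer.writerow(row)


# Because the instance defines __call__, it can be used wherever a function is expected:
# callback = ScrapeCallback()
# callback(url, html)                                           # same as callback.__call__(url, html)
# link_crawler(seed_url, link_regex, scrape_callback=callback)  # hypothetical crawler signature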
