Scraper_compare( 'NoneType' object has no attribute 'group')
来源:互联网 发布:python 字节流 编辑:程序博客网 时间:2024/05/16 07:31
三种解析网页的方法各有所用,各有特点。通过,对比三种方式更能明白在什么情况之下采用什么方法。其中,运行代码时,可能会遇到一个bug(
results[field] = re.search('<tr id="places_%s__row">.*?<td class="w2p_fw">(.*?)</td>' % field, html).group()AttributeError: 'NoneType' object has no attribute 'group'
),这其实是有一个参数没有对,如果删除之后,运行顺利!
# 比较这三种解析的相对性能import urllib2import refrom bs4 import BeautifulSoupimport lxml.htmlimport timedef download(url, user_agent="wswp", num_retries=2): print "Download :", url headers = {"User_agent": user_agent} request = urllib2.Request(url, headers=headers) try: html = urllib2.urlopen(request).read() except urllib2.URLError as e: print "Download Error :", e.reason html = None if num_retries > 0: if hasattr(e, "code") and 500 <= e.code < 600: return download(url, user_agent, num_retries-1) return html# 如果运行出现“results[field] = re.search('<tr id="places_%s__row">.*?<td class="w2p_fw">(.*?)</td>' % field, html).group()AttributeError: 'NoneType' object has no attribute 'group'”则只需删除FIELDS中的“iso”元素就行。FIELDS = ("area", "population", "iso", "country", "capital", "continent", "tld", "currency_code", "currency_name", "phone", "postal_code_format", "postal_code_regex", "languages", "neighbours")# 正则def re_scraper(html): results = {} for field in FIELDS: results[field] = re.search('<tr id="places_%s__row">.*?<td class="w2p_fw">(.*?)</td>' % field, html).groups()[0] return results# BeautifulSoupdef bs_scraper(html): soup = BeautifulSoup(html, "html.parser", from_encoding="utf8") results = {} for field in FIELDS: # tr = soup.find(attrs={'id': 'places_%s__row' % field}) # td = tr.find(attrs={'class':'w2p_fw'}) # results[field] = td.text results[field] = soup.find('table').find('tr', id='places_%s__row' % field).find('td', class_='w2p_fw').text return results# lxml模块def lxml_scraper(html): tree = lxml.html.fromstring(html) results = {} for field in FIELDS: results[field] = tree.cssselect('tr#places_%s__row > td.w2p_fw' % field)[0].text_content() return results# 测试相对性能NUM_ITERATIONS = 1000html = download("http://example.webscraping.com/view/United-Kingdom-239")for name, scraper in [("Regular expressions", re_scraper), ("BeautifulSoup", bs_scraper), ("Lxml", lxml_scraper)]: start = time.time() for i in range(NUM_ITERATIONS): if scraper == re_scraper: # 清除正则表达式产生的搜索缓存,使结果更加公平 re.purge() result = scraper(html) assert (result['area'] == '244,820 square kilometres') end = time.time() print "%s: %.2f seconds" % (name, end - start)
输出结果是:
Regular expressions: 3.82 seconds
BeautifulSoup: 25.92 seconds
Lxml: 4.33 seconds
可以看出BeautifulSoup的性能最慢,但是,同时它也是最简单的。不过,在通常情况下,LXML是抓取数据的最好选择,这是因为该方法既快速又健壮,而正则表达式和BeautifulSoup只在某些特定的场景下有用。
阅读全文
0 0
- Scraper_compare( 'NoneType' object has no attribute 'group')
- AttributeError: 'NoneType' object has no attribute 'sc' 解决方法(二)
- AttributeError: ‘NoneType’ object has no attribute ‘clone’
- AttributeError: 'NoneType' object has no attribute 'clone'
- python : 'NoneType' object has no attribute 'recv'
- AttributeError: 'NoneType' object has no attribute
- AttributeError: 'NoneType' object has no attribute 'startswith'
- 使用Buildozer部署时,出现AttributeError: 'NoneType' object has no attribute 'group'错误的解决
- Python: Attribute Error - 'NoneType' object has no attribute 'something'
- python __def__ Exception AttributeError: "'NoneType' object has no attribute
- python __def__ Exception AttributeError: "'NoneType' object has no attribute
- [Python][转]Exception AttributeError: "'NoneType' object has no attribute
- Exception AttributeError: "'NoneType' object has no attribute
- python PIL bug: 'NoneType' object has no attribute 'bands' (解决)
- scapy AttributeError: 'NoneType' object has no attribute 'route'解决办法
- python使用PIL库报错,'NoneType' object has no attribute 'bands'
- selenium 2.0 'NoneType' object has no attribute 'send_keys'
- AttributeError: 'NoneType' object has no attribute 'sc' 解决方法!
- grunt 新手一日入门
- 第三章 组合逻辑
- 自定义Flash背景的相关设置方法以及其与目录下的文件的对应关系
- C语言程序设计(24)
- PKUSC 2017 酱油记
- Scraper_compare( 'NoneType' object has no attribute 'group')
- 数据挖掘——亲和性分析
- python算法演练——One Rule 算法
- 用 c 语言实现自动发牌
- 用scikit-learn估计值分类——近邻算法(KNN)
- 手机桌面
- TCP/IP网络编程期末考试复习
- 美团多渠道打包Python 转groovy gradle 插件方式
- 那些年——3 纯技术只是你手中的剑--而别人遥控着