Web Crawling --- 2. Data Analysis

Source: Internet · Editor: 程序博客网 · Date: 2024/06/08 17:08

Install Firebug Lite

Three ways to scrape a web page

1. Regular expressions

Official regular expression documentation: https://docs.python.org/3/howto/regex.html

>>> import re
>>> import urllib.request
>>> url = 'http://example.webscraping.com/places/default/view/Afghanistan-1'
>>> p = re.compile('<td class="w2p_fw">(.*?)</td>')
>>> html = urllib.request.urlopen(url).read()
>>> p.findall(html)
Traceback (most recent call last):
  File "<pyshell#79>", line 1, in <module>
    p.findall(html)
TypeError: cannot use a string pattern on a bytes-like object
>>> p.findall(html.decode('utf-8'))
['<img src="/places/static/images/flags/af.png" />', '647,500 square kilometres', '29,121,286', 'AF', 'Afghanistan', 'Kabul', '<a href="/places/default/continent/AS">AS</a>', '.af', 'AFN', 'Afghani', '93', '', '', 'fa-AF,ps,uz-AF,tk', '<div><a href="/places/default/iso/TM">TM </a><a href="/places/default/iso/CN">CN </a><a href="/places/default/iso/IR">IR </a><a href="/places/default/iso/TJ">TJ </a><a href="/places/default/iso/PK">PK </a><a href="/places/default/iso/UZ">UZ </a></div>']

Note that urlopen() returns bytes, so the page has to be decoded before a str pattern can be applied to it.
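The session above can be wrapped in a small reusable function. This is only a sketch, not code from the original: the `scrape_field` helper and the inline HTML snippet are illustrative, and matching on the row id assumes the site keeps its `places_<field>__row` id convention; it is more robust than indexing into the findall() list by position.

```python
import re

def scrape_field(html, field):
    """Extract the <td class="w2p_fw"> value from the row whose id names the field."""
    # re.DOTALL lets .*? span line breaks inside the row markup.
    pattern = re.compile(
        r'<tr id="places_%s__row">.*?<td class="w2p_fw">(.*?)</td>' % field,
        re.DOTALL)
    match = pattern.search(html)
    return match.group(1) if match else None

# A small inline snippet standing in for the downloaded page.
sample = ('<tr id="places_area__row"><td class="w2p_fl">Area</td>'
          '<td class="w2p_fw">647,500 square kilometres</td></tr>')
print(scrape_field(sample, 'area'))  # -> 647,500 square kilometres
```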


2. BeautifulSoup

Installation:

pip install beautifulsoup4


Install lxml:

pip install lxml


BeautifulSoup correctly parses the missing attribute quotes and closes the unclosed tags:

>>> from bs4 import BeautifulSoup
>>> broken_html = '<ul class=country><li>Area<li>Population</ul>'
>>> soup = BeautifulSoup(broken_html, 'lxml')
>>> fixed_html = soup.prettify()
>>> print(fixed_html)
<html>
 <body>
  <ul class="country">
   <li>
    Area
   </li>
   <li>
    Population
   </li>
  </ul>
 </body>
</html>

Finding data:

>>> ul = soup.find('ul', attrs={'class': 'country'})
>>> ul.find('li')
<li>Area</li>
>>> ul.find_all('li')
[<li>Area</li>, <li>Population</li>]


Chinese documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/
English documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#

Getting a country's area:

>>> from urllib import request
>>> html = request.urlopen('http://example.webscraping.com/places/default/view/Afghanistan-1').read()
>>> soup = BeautifulSoup(html, 'lxml')
>>> tr = soup.find(attrs={'id': 'places_area__row'})
>>> td = tr.find(attrs={'class': 'w2p_fw'})
>>> area = td.text
>>> print(area)
647,500 square kilometres

If no parser is named, i.e. BeautifulSoup(html), bs4 emits a UserWarning and falls back to the best parser installed on the system, which can differ across environments and change behaviour — always specify one explicitly.
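The same lookup can be packaged as a function. `scrape_row` is a hypothetical helper generalising the session above, shown here against an inline snippet rather than the live page; the stdlib 'html.parser' backend is used so only beautifulsoup4 itself is required.

```python
from bs4 import BeautifulSoup

def scrape_row(html, row_id):
    """Return the w2p_fw cell text for a given table-row id, or None if absent."""
    soup = BeautifulSoup(html, 'html.parser')  # 'lxml' also works if installed
    tr = soup.find(attrs={'id': row_id})
    if tr is None:
        return None
    td = tr.find(attrs={'class': 'w2p_fw'})
    return td.text if td is not None else None

# Inline snippet standing in for the downloaded page.
sample = ('<table><tr id="places_area__row">'
          '<td class="w2p_fw">647,500 square kilometres</td></tr></table>')
print(scrape_row(sample, 'places_area__row'))  # -> 647,500 square kilometres
```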


3. lxml

lxml is a Python wrapper around the libxml2 XML parsing library. Because it is written in C, it parses documents considerably faster than BeautifulSoup.

Documentation: http://lxml.de/installation.html#source-builds-on-ms-windows

>>> import lxml.html
>>> broken_html = '<ul class=country><li>Area<li>Population</ul>'
>>> tree = lxml.html.fromstring(broken_html)
>>> fixed_html = lxml.html.tostring(tree, pretty_print=True)
>>> print(fixed_html)
b'<ul class="country">\n<li>Area</li>\n<li>Population</li>\n</ul>\n'
lxml also correctly parses the missing attribute quotes and closes the unclosed tags, although unlike BeautifulSoup it does not add the surrounding <html> and <body> tags.

XPath selectors work much like BeautifulSoup's find().
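As a minimal sketch of that similarity (the inline snippet is illustrative, standing in for the downloaded page), the area cell can be addressed with an XPath expression instead of find():

```python
import lxml.html

snippet = ('<table><tr id="places_area__row">'
           '<td class="w2p_fw">647,500 square kilometres</td></tr></table>')
tree = lxml.html.fromstring(snippet)

# Like find(): locate the row by its id, then take its w2p_fw cell.
td = tree.xpath('//tr[@id="places_area__row"]/td[@class="w2p_fw"]')[0]
print(td.text_content())  # -> 647,500 square kilometres
```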

CSS selectors:

Installation:
pip install cssselect

>>> import lxml.html
>>> from urllib import request
>>> html1 = request.urlopen('http://example.webscraping.com/places/default/view/Afghanistan-1').read()
>>> tree = lxml.html.fromstring(html1)
>>> td = tree.cssselect('tr#places_area__row > td.w2p_fw')[0]
>>> area = td.text_content()
>>> print(area)
647,500 square kilometres

This code first finds the table row with ID places_area__row, then selects its child table-data cell whose class is w2p_fw.

A CSS selector is a pattern used to select elements. Some common selector examples:

Select all tags:                                *
Select <a> tags:                                a
Select all elements with class="link":          .link
Select <a> tags with class="link":              a.link
Select the <a> tag with id="home":              a#home
Select all <span> children of an <a> tag:       a > span
Select all <span> tags inside an <a> tag:       a span
Select all <a> tags whose title is "Home":      a[title=Home]

The extracted data can be saved to a CSV file.
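For example, with the standard library's csv module (the field/value rows below are illustrative; a real script would write to a file path instead of the in-memory buffer used here for demonstration):

```python
import csv
import io

# Scraped (field, value) pairs, e.g. collected by the helpers above.
rows = [('area', '647,500 square kilometres'),
        ('population', '29,121,286'),
        ('country', 'Afghanistan')]

buf = io.StringIO()            # stands in for open('countries.csv', 'w', newline='')
writer = csv.writer(buf)
writer.writerow(['field', 'value'])   # header row
writer.writerows(rows)
print(buf.getvalue())
```

csv.writer takes care of quoting: the comma inside '647,500 square kilometres' is escaped automatically, which a naive ','.join() would get wrong.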
