Web Crawling --- 2. Data Analysis
Source: Internet · Editor: 程序博客网 · Date: 2024/06/08 17:08
Install FireBug Lite (for inspecting page elements).
Three methods for scraping web pages
1. Regular expressions
Official regular expression documentation: https://docs.python.org/3/howto/regex.html
>>> import re
>>> import urllib.request
>>> url = 'http://example.webscraping.com/places/default/view/Afghanistan-1'
>>> p = re.compile('<td class="w2p_fw">(.*?)</td>')
>>> html = urllib.request.urlopen(url).read()
>>> p.findall(html)
Traceback (most recent call last):
  File "<pyshell#79>", line 1, in <module>
    p.findall(html)
TypeError: cannot use a string pattern on a bytes-like object
>>> p.findall(html.decode('utf-8'))
['<img src="/places/static/images/flags/af.png" />', '647,500 square kilometres', '29,121,286', 'AF', 'Afghanistan', 'Kabul', '<a href="/places/default/continent/AS">AS</a>', '.af', 'AFN', 'Afghani', '93', '', '', 'fa-AF,ps,uz-AF,tk', '<div><a href="/places/default/iso/TM">TM </a><a href="/places/default/iso/CN">CN </a><a href="/places/default/iso/IR">IR </a><a href="/places/default/iso/TJ">TJ </a><a href="/places/default/iso/PK">PK </a><a href="/places/default/iso/UZ">UZ </a></div>']
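The pattern above returns every value cell as an anonymous list, so each figure loses its label. A small sketch of how a second capture group can pair each value with the id of its table row; the inline HTML snippet is a hypothetical sample in the same layout as the example pages, so it runs without network access:

```python
import re

# Hypothetical inline sample mirroring the country page's table rows.
html = (
    '<tr id="places_area__row">'
    '<td class="w2p_fl"><label>Area: </label></td>'
    '<td class="w2p_fw">647,500 square kilometres</td></tr>'
    '<tr id="places_population__row">'
    '<td class="w2p_fl"><label>Population: </label></td>'
    '<td class="w2p_fw">29,121,286</td></tr>'
)

# Capture the row id together with the value cell; re.S lets .*?
# cross line breaks in real pages.
pattern = re.compile(
    r'<tr id="places_(\w+)__row">.*?<td class="w2p_fw">(.*?)</td>', re.S)
data = dict(pattern.findall(html))
print(data['area'])        # -> 647,500 square kilometres
print(data['population'])  # -> 29,121,286
```

The non-greedy `.*?` is what keeps each match inside a single row instead of spanning the whole document.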
2. BeautifulSoup
Install:
pip install beautifulsoup4
Install lxml:
pip install lxml
BeautifulSoup correctly parses attributes with missing quotes and closes unclosed tags:
>>> from bs4 import BeautifulSoup
>>> broken_html = '<ul class=country><li>Area<li>Population</ul>'
>>> soup = BeautifulSoup(broken_html, 'lxml')
>>> fixed_html = soup.prettify()
>>> print(fixed_html)
<html>
 <body>
  <ul class="country">
   <li>
    Area
   </li>
   <li>
    Population
   </li>
  </ul>
 </body>
</html>
Finding data:
>>> ul = soup.find('ul', attrs={'class': 'country'})
>>> ul.find('li')
<li>Area</li>
>>> ul.find_all('li')
[<li>Area</li>, <li>Population</li>]
Chinese documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/
English documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#
Getting a country's area:
>>> from urllib import request
>>> html = request.urlopen('http://example.webscraping.com/places/default/view/Afghanistan-1').read()
>>> soup = BeautifulSoup(html)
UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently. To get rid of this warning, change code that looks like this: BeautifulSoup(YOUR_MARKUP) to this: BeautifulSoup(YOUR_MARKUP, "lxml")
>>> soup = BeautifulSoup(html, 'lxml')
>>> tr = soup.find(attrs={'id': 'places_area__row'})
>>> td = tr.find(attrs={'class': 'w2p_fw'})
>>> area = td.text
>>> print(area)
647,500 square kilometres
3. lxml
lxml is a Python wrapper around the libxml2 XML parsing library. Because it is written in C, it parses considerably faster than Beautiful Soup.
Related documentation: http://lxml.de/installation.html#source-builds-on-ms-windows
>>> import lxml.html
>>> broken_html = '<ul class=country><li>Area<li>Population</ul>'
>>> tree = lxml.html.fromstring(broken_html)
>>> fixed_html = lxml.html.tostring(tree, pretty_print=True)
>>> print(fixed_html)
b'<ul class="country">\n<li>Area</li>\n<li>Population</li>\n</ul>\n'

lxml likewise repairs attributes with missing quotes and closes unclosed tags, but unlike BeautifulSoup it does not add the extra <html> and <body> tags.
XPath selectors work much like BeautifulSoup's find() method.
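To make the comparison concrete, a minimal sketch of the earlier area lookup expressed as a single XPath query; the inline snippet is a hypothetical sample of the country page's table row, so no network access is needed:

```python
import lxml.html

# Hypothetical inline sample mirroring the country page's table row.
html = (
    '<table><tr id="places_area__row">'
    '<td class="w2p_fw">647,500 square kilometres</td>'
    '</tr></table>'
)

tree = lxml.html.fromstring(html)
# One XPath expression replaces the two chained find() calls:
# find the row by id, then its td child by class.
td = tree.xpath('//tr[@id="places_area__row"]/td[@class="w2p_fw"]')[0]
print(td.text_content())  # -> 647,500 square kilometres
```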
CSS selectors:
Install:
pip install cssselect
>>> import lxml.html
>>> from urllib import request
>>> html1 = request.urlopen('http://example.webscraping.com/places/default/view/Afghanistan-1').read()
>>> tree = lxml.html.fromstring(html1)
>>> td = tree.cssselect('tr#places_area__row > td.w2p_fw')[0]
>>> area = td.text_content()
>>> print(area)
647,500 square kilometres
This code first finds the table row element whose ID is places_area__row, then selects its table-data child element whose class is w2p_fw.
A CSS selector is a pattern for selecting elements. Common selector examples:
Select all tags: *
Select <a> tags: a
Select all elements with class="link": .link
Select <a> tags with class="link": a.link
Select the <a> tag with id="home": a#home
Select all <span> tags whose parent is an <a> tag: a > span
Select all <span> tags inside <a> tags: a span
Select all <a> tags whose title attribute is "Home": a[title=Home]

The extracted data can then be saved to a CSV file.
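Saving the results to CSV needs only the standard library. A minimal sketch, assuming the scraped fields sit in a dict (the field names here are illustrative, and an io.StringIO stands in for a real file so the example is self-contained):

```python
import csv
import io

# Hypothetical scraped record; field names are illustrative only.
row = {'country': 'Afghanistan',
       'area': '647,500 square kilometres',
       'population': '29,121,286'}

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=['country', 'area', 'population'])
writer.writeheader()   # header row: country,area,population
writer.writerow(row)   # values containing commas are quoted automatically
print(buf.getvalue())
```

To write to disk instead, replace the StringIO with `open('countries.csv', 'w', newline='')`; the `newline=''` keeps the csv module from doubling line endings on Windows.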