Beautiful Soup Study Notes
Source: Internet  Editor: 程序博客网  Date: 2024/06/05 10:45
```python
from bs4 import BeautifulSoup
```

# Tag selection summary

Accessing a tag as an attribute (for example `soup.p`) always returns only the first matching tag. Attributes that resolve to a single node return that node directly; attributes that resolve to several nodes return an iterator, which can be wrapped in `enumerate` to get an index for each node. A line break between two tags appears in the tree as a separate `"\n"` text node.

# Tag selectors

### Selecting an element (returns only the first matching tag)

```python
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html, "lxml")
print(soup.title)        # first <title> tag
print(type(soup.title))  # a bs4.element.Tag
print(soup.p)            # first <p> only, although the document has three
print(soup.a)            # first <a> only
```

    <title>The Dormouse's story</title>
    <class 'bs4.element.Tag'>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>

## Getting the name

```python
print(soup.title.name)
```

    title

## Getting attributes

```python
print(soup.p["name"])        # dictionary-style access on the tag
print(soup.p.attrs["name"])  # the same value through the attrs dict
```

    dromouse
    dromouse

## Getting the content

```python
print(soup.p.string)
print(soup.p.get_text())
```

    The Dormouse's story
    The Dormouse's story

# Nested selection

```python
print(soup.head.title.string)
```

    The Dormouse's story

## Child nodes (returned as a list) and descendant nodes

```python
html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""
soup = BeautifulSoup(html, "lxml")
print(soup.p.contents)       # direct children as a list (tags and text nodes)
print(len(soup.p.contents))
```

    ['\n Once upon a time there were three little sisters; and their names were\n ',
     <a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie</span> </a>,
     '\n',
     <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
     ' \n and\n ',
     <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>,
     '\n and they lived at the bottom of a well.\n ']
    7

## `children` returns an iterator over the direct child nodes

The iterator yields the nodes themselves; wrap it in `enumerate` to pair each node with its index.

```python
print(soup.p.children)
for i, child in enumerate(soup.p.children):
    print(i, child)
```

    <list_iterator object at 0x00000137FD009908>
    0  Once upon a time there were three little sisters; and their names were
    1 <a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie</span> </a>
    2
    3 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    4  and
    5 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
    6  and they lived at the bottom of a well.

## `descendants` returns an iterator over all descendant nodes

It walks children, grandchildren, and so on; again, use `enumerate` to get an index for each node.

```python
print(soup.p.descendants)
for i, child in enumerate(soup.p.descendants):
    print(i, child)
```

    <generator object descendants at 0x00000137FD0261A8>
    0  Once upon a time there were three little sisters; and their names were
    1 <a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie</span> </a>
    2
    3 <span>Elsie</span>
    4 Elsie
    5
    6
    7 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    8 Lacie
    9  and
    10 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
    11 Tillie
    12  and they lived at the bottom of a well.
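To see the difference between direct children and all descendants in numbers, here is a minimal sketch. It makes two assumptions that are not part of the notes above: the HTML fragment is hypothetical and written on one line so that no whitespace-only text nodes appear, and it uses the built-in `html.parser` instead of `lxml` so that it runs without extra dependencies.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical one-line fragment (an assumption, not the document used above).
snippet = '<p>Hello <a href="#"><span>Elsie</span></a> and <b>bye</b></p>'
soup = BeautifulSoup(snippet, "html.parser")

children = list(soup.p.children)        # direct children only
descendants = list(soup.p.descendants)  # children, grandchildren, and deeper

# Keep only the tags, dropping the text nodes, to compare the two walks.
child_tags = [node.name for node in children if isinstance(node, Tag)]
descendant_tags = [node.name for node in descendants if isinstance(node, Tag)]

print(len(children), child_tags)
print(len(descendants), descendant_tags)
```

The `<span>` shows up only in the descendants walk, because it is nested inside `<a>` rather than being a direct child of `<p>`.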
## Parent and ancestor nodes

```python
print(soup.a.parent)
```

    <p class="story">
        Once upon a time there were three little sisters; and their names were
        <a class="sister" href="http://example.com/elsie" id="link1">
            <span>Elsie</span>
        </a>
        <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
        and
        <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
        and they lived at the bottom of a well.
    </p>

```python
print(soup.a.parents)                   # a generator over all ancestors
print(list(enumerate(soup.a.parents)))  # <p>, <body>, <html>, then the whole document
```

    <generator object parents at 0x00000137FD026308>
    [(0, <p class="story">
        Once upon a time there were three little sisters; and their names were
        <a class="sister" href="http://example.com/elsie" id="link1">
            <span>Elsie</span>
        </a>
        <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
        and
        <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
        and they lived at the bottom of a well.
    </p>), (1, <body>
    <p class="story">
        Once upon a time there were three little sisters; and their names were
        <a class="sister" href="http://example.com/elsie" id="link1">
            <span>Elsie</span>
        </a>
        <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
        and
        <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
        and they lived at the bottom of a well.
    </p>
    <p class="story">...</p>
    </body>), (2, <html>
    <head>
    <title>The Dormouse's story</title>
    </head>
    <body>
    <p class="story">
        Once upon a time there were three little sisters; and their names were
        <a class="sister" href="http://example.com/elsie" id="link1">
            <span>Elsie</span>
        </a>
        <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
        and
        <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
        and they lived at the bottom of a well.
    </p>
    <p class="story">...</p>
    </body></html>), (3, <html>
    <head>
    <title>The Dormouse's story</title>
    </head>
    <body>
    <p class="story">
        Once upon a time there were three little sisters; and their names were
        <a class="sister" href="http://example.com/elsie" id="link1">
            <span>Elsie</span>
        </a>
        <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
        and
        <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
        and they lived at the bottom of a well.
    </p>
    <p class="story">...</p>
    </body></html>)]

## Sibling nodes

```python
print(list(enumerate(soup.a.previous_siblings)))
print(list(enumerate(soup.a.next_siblings)))
```

    [(0, '\n Once upon a time there were three little sisters; and their names were\n ')]
    [(0, '\n'),
     (1, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>),
     (2, ' \n and\n '),
     (3, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>),
     (4, '\n and they lived at the bottom of a well.\n ')]

# Standard selectors

# find_all(name, attrs, recursive, text, **kwargs)

### name (search by tag name)

```python
html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
soup = BeautifulSoup(html, "lxml")
print(soup.find_all("ul"))     # every <ul>, as a list
print(soup.find_all("ul")[0])  # index into the list to get a single tag
```

    [<ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
    </ul>, <ul class="list list-small" id="list-2">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    </ul>]
    <ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
    </ul>

### attrs (search by attribute)

```python
print(soup.find_all(attrs={"class": "element"}))
print(soup.find_all(attrs={"class": "list"}))
```

    [<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
    [<ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
    </ul>, <ul class="list list-small" id="list-2">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    </ul>]

#### Shortcuts for class and id

`class` is a reserved word in Python, so the keyword argument is spelled `class_`.

```python
print(soup.find_all(class_="list"))
print(soup.find_all(id="list-2"))
```

    [<ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
    </ul>, <ul class="list list-small" id="list-2">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    </ul>]
    [<ul class="list list-small" id="list-2">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    </ul>]

### text (search by text content; returns only the matching strings, not the enclosing tags)

```python
print(soup.find_all(text="Foo"))
```

    ['Foo', 'Foo']

# find(name, attrs, recursive, text, **kwargs): returns only the first match

## find_parents(), find_parent(): search the ancestors and the parent

## find_next_siblings(), find_next_sibling(), find_previous_siblings(), find_previous_sibling()

These return, respectively, all following siblings, the first following sibling, all preceding siblings, and the first preceding sibling. They are not used the same way as the `.next_siblings` and `.previous_siblings` attributes from direct tag selection: the attributes are generators that yield every sibling, including whitespace text nodes, whereas the `find_*` methods accept filters and return only the matching nodes. See the code below.

```python
html2 = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element1">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup

soup2 = BeautifulSoup(html2, 'lxml')
link = soup2.find(class_="element1")     # the middle <li> of the first list
print(link)
print(link.find_previous_siblings("li"))  # only <li> siblings before it
print(link.find_next_siblings("li"))      # only <li> siblings after it
```

    <li class="element1">Bar</li>
    [<li class="element">Foo</li>]
    [<li class="element">Jay</li>]

## find_all_next(), find_next(), find_all_previous(), find_previous()

These return, respectively, all matching nodes after the current node, the first matching node after it, all matching nodes before it, and the first matching node before it.

# CSS selectors

Pass a CSS selector string to `select()`: a class name is prefixed with `.`, an id with `#`, and nested selectors are separated by spaces. All matches are returned as a list.

```python
print(soup.select(".panel .panel-heading"))  # class inside class
print(soup.select("#list-1 .element"))       # class inside id
```

    [<div class="panel-heading">
    <h4>Hello</h4>
    </div>]
    [<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]

```python
import json
import re

import requests
from bs4 import BeautifulSoup

url = "https://www.toutiao.com/a6467787316680196622/"
html = requests.get(url).text
# print(html)

def parse_page_detail(html, url):
    soup = BeautifulSoup(html, 'lxml')
    # select() returns a list, so guard against an empty result
    result = soup.select('title')
    title = result[0].get_text() if result else ''
    # The image data is embedded in a script as "var gallery = {...};"
    images_pattern = re.compile('var gallery = (.*?);', re.S)
    result = re.search(images_pattern, html)
    if result:
        data = json.loads(result.group(1))
        if data and 'sub_images' in data.keys():
            sub_images = data.get('sub_images')
            images = [item.get('url') for item in sub_images]
            # for image in images: download_image(image)
            return {
                'title': title,
                'url': url,
                'images': images
            }
    # Falls through and returns None when no gallery data is found

print(parse_page_detail(html, url))
```

    None
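Because the live page may change or may not contain the expected script, the same extraction logic can be tried offline. The sketch below is only an illustration under stated assumptions: `SAMPLE_PAGE`, `parse_sample`, and the example.com URLs are hypothetical, the real page is not guaranteed to look like this, and `html.parser` is used in place of `lxml` so that nothing beyond `bs4` is required. Unlike the function above, it returns an empty image list instead of `None` when no gallery is found.

```python
import json
import re

from bs4 import BeautifulSoup

# Hypothetical page source; the real article page may be structured differently.
SAMPLE_PAGE = """
<html><head><title>Sample gallery</title></head>
<body><script>
var gallery = {"sub_images": [{"url": "http://example.com/1.jpg"}, {"url": "http://example.com/2.jpg"}]};
</script></body></html>
"""

# Non-greedy match up to the first semicolon after "var gallery = ".
GALLERY_PATTERN = re.compile(r"var gallery = (.*?);", re.S)


def parse_sample(html, url):
    """Return title, url and image URLs; images is empty if no gallery is found."""
    soup = BeautifulSoup(html, "html.parser")
    title_tag = soup.select_one("title")  # first match or None
    title = title_tag.get_text() if title_tag is not None else ""

    images = []
    match = GALLERY_PATTERN.search(html)
    if match:
        data = json.loads(match.group(1))
        images = [item.get("url") for item in data.get("sub_images", [])]
    return {"title": title, "url": url, "images": images}


result = parse_sample(SAMPLE_PAGE, "http://example.com/article")
empty = parse_sample(
    "<html><head><title>No gallery</title></head></html>",
    "http://example.com/none",
)
print(result)
print(empty)
```

Note that the non-greedy pattern stops at the first `;`, so this approach would break if the embedded JSON itself contained a semicolon; the sample data deliberately avoids that.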