BeautifulSoup Study Notes

Source: Internet | Editor: 程序博客网 | Date: 2024/06/05 10:45
```python
from bs4 import BeautifulSoup
```

# Tag selection summary

Accessing a tag as an attribute (for example `soup.p`) always returns the first matching tag. A property that resolves to a single node returns that node directly. When there are several nodes, `.contents` returns a list, while `.children`, `.descendants`, `.parents` and the sibling properties return iterators that you can walk with `enumerate`. If two tags are separated by a line break, that whitespace shows up as a text node of its own, such as `"\n    "`.

# Tag selectors

### Selecting an element (only the first matching tag is returned)

```python
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html, "lxml")
print(soup.title)        # first <title> tag
print(type(soup.title))  # a bs4 Tag object
print(soup.p)            # first <p> tag only
print(soup.a)            # first <a> tag only
```

    <title>The Dormouse's story</title>
    <class 'bs4.element.Tag'>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>

## Getting the tag name

```python
print(soup.title.name)
```

    title

## Getting attributes

```python
print(soup.p["name"])        # index the tag like a dict
print(soup.p.attrs["name"])  # or go through the attrs dict
```

    dromouse
    dromouse

## Getting the content

```python
print(soup.p.string)
print(soup.p.get_text())
```

    The Dormouse's story
    The Dormouse's story

# Nested selection

```python
print(soup.head.title.string)
```

    The Dormouse's story

## Child nodes (returned as a list) and descendant nodes

```python
html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""
soup = BeautifulSoup(html, "lxml")
print(soup.p.contents)       # direct children as a list, text nodes included
print(len(soup.p.contents))
```

    ['\n            Once upon a time there were three little sisters; and their names were\n            ', <a class="sister" href="http://example.com/elsie" id="link1">
    <span>Elsie</span>
    </a>, '\n', <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, ' \n            and\n            ', <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, '\n            and they lived at the bottom of a well.\n        ']
    7

## `.children` returns an iterator over the direct child nodes; use `enumerate` to pair each node with an index

```python
print(soup.p.children)
for i, child in enumerate(soup.p.children):
    print(i, child)
```

    <list_iterator object at 0x00000137FD009908>
    0 
                Once upon a time there were three little sisters; and their names were
                
    1 <a class="sister" href="http://example.com/elsie" id="link1">
    <span>Elsie</span>
    </a>
    2 

    3 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    4  
                and
                
    5 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
    6 
                and they lived at the bottom of a well.
            
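The output above shows that whitespace between tags counts as a child node, which is often unwanted. Below is a minimal sketch of one way to separate element children from text children. The shortened `snippet`, the variable names, and the use of `"html.parser"` (the parser bundled with Python, swapped in here so the sketch does not need `lxml`) are assumptions made for illustration and are not part of the original notes.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened, made-up variant of the sample document used in these notes.
snippet = """
<p class="story">
    Once upon a time
    <a href="http://example.com/elsie" class="sister" id="link1"><span>Elsie</span></a>
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
    and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""

# "html.parser" ships with Python; the notes themselves use "lxml".
demo_soup = BeautifulSoup(snippet, "html.parser")
paragraph = demo_soup.p

# Keep only element children and skip the text nodes between them.
tag_children = [child for child in paragraph.children if isinstance(child, Tag)]
tag_ids = [child["id"] for child in tag_children]

# Keep text children too, but drop the ones that are only whitespace.
text_children = [
    str(child).strip()
    for child in paragraph.children
    if not isinstance(child, Tag) and str(child).strip()
]

print(tag_ids)        # ['link1', 'link2', 'link3']
print(text_children)  # ['Once upon a time', 'and']
```

Filtering with `isinstance(child, Tag)` keeps the traversal explicit. If only elements matter, `paragraph.find_all(recursive=False)` should give the same element children without the manual check.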
## `.descendants` returns a generator over all descendant nodes; use `enumerate` to pair each node with an index

```python
print(soup.p.descendants)
for i, child in enumerate(soup.p.descendants):
    print(i, child)
```

    <generator object descendants at 0x00000137FD0261A8>
    0 
                Once upon a time there were three little sisters; and their names were
                
    1 <a class="sister" href="http://example.com/elsie" id="link1">
    <span>Elsie</span>
    </a>
    2 

    3 <span>Elsie</span>
    4 Elsie
    5 

    6 

    7 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    8 Lacie
    9  
                and
                
    10 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
    11 Tillie
    12 
                and they lived at the bottom of a well.
            

## Parent and ancestor nodes

```python
print(soup.a.parent)  # the direct parent of the first <a>
```

    <p class="story">
                Once upon a time there were three little sisters; and their names were
                <a class="sister" href="http://example.com/elsie" id="link1">
    <span>Elsie</span>
    </a>
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> 
                and
                <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
                and they lived at the bottom of a well.
            </p>

```python
print(soup.a.parents)
# Every ancestor in turn; the last entry is the BeautifulSoup object itself,
# which prints the same as <html>.
print(list(enumerate(soup.a.parents)))
```

    <generator object parents at 0x00000137FD026308>
    [(0, <p class="story">
                Once upon a time there were three little sisters; and their names were
                <a class="sister" href="http://example.com/elsie" id="link1">
    <span>Elsie</span>
    </a>
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> 
                and
                <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
                and they lived at the bottom of a well.
            </p>), (1, <body>
    <p class="story">
                Once upon a time there were three little sisters; and their names were
                <a class="sister" href="http://example.com/elsie" id="link1">
    <span>Elsie</span>
    </a>
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> 
                and
                <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
                and they lived at the bottom of a well.
            </p>
    <p class="story">...</p>
    </body>), (2, <html>
    <head>
    <title>The Dormouse's story</title>
    </head>
    <body>
    <p class="story">
                Once upon a time there were three little sisters; and their names were
                <a class="sister" href="http://example.com/elsie" id="link1">
    <span>Elsie</span>
    </a>
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> 
                and
                <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
                and they lived at the bottom of a well.
            </p>
    <p class="story">...</p>
    </body></html>), (3, <html>
    <head>
    <title>The Dormouse's story</title>
    </head>
    <body>
    <p class="story">
                Once upon a time there were three little sisters; and their names were
                <a class="sister" href="http://example.com/elsie" id="link1">
    <span>Elsie</span>
    </a>
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> 
                and
                <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
                and they lived at the bottom of a well.
            </p>
    <p class="story">...</p>
    </body></html>)]

## Sibling nodes

```python
print(list(enumerate(soup.a.previous_siblings)))
print(list(enumerate(soup.a.next_siblings)))
```

    [(0, '\n            Once upon a time there were three little sisters; and their names were\n            ')]
    [(0, '\n'), (1, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>), (2, ' \n            and\n            '), (3, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>), (4, '\n            and they lived at the bottom of a well.\n        ')]

# Standard selectors

# `find_all(name, attrs, recursive, text, **kwargs)`

### name (search by tag name)

```python
html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
soup = BeautifulSoup(html, "lxml")
print(soup.find_all("ul"))     # every <ul>, returned as a list
print(soup.find_all("ul")[0])  # index into the list for a single tag
```

    [<ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
    </ul>, <ul class="list list-small" id="list-2">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    </ul>]
    <ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
    </ul>

### attrs (search by attribute)

```python
print(soup.find_all(attrs={"class": "element"}))
print(soup.find_all(attrs={"class": "list"}))
```

    [<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
    [<ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
    </ul>,
    <ul class="list list-small" id="list-2">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    </ul>]

#### Shortcuts for class and id

```python
# class is a Python keyword, so the keyword argument is spelled class_
print(soup.find_all(class_="list"))
print(soup.find_all(id="list-2"))
```

    [<ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
    </ul>, <ul class="list list-small" id="list-2">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    </ul>]
    [<ul class="list list-small" id="list-2">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    </ul>]

### text (search by content; returns only the matching strings, not the whole tags)

```python
# Newer Beautiful Soup releases call this argument string; text is the older name.
print(soup.find_all(text="Foo"))
```

    ['Foo', 'Foo']

# `find(name, attrs, recursive, text, **kwargs)`: returns only the first match

## find_parents(), find_parent(): search the ancestors and the parent

## find_next_siblings(), find_next_sibling(), find_previous_siblings(), find_previous_sibling()

These return, in order, all following siblings, the first following sibling, all preceding siblings, and the first preceding sibling. They are used quite differently from the `.next_siblings` family of properties in direct tag selection: the `find_*` methods accept filters such as a tag name and return only matching tags, whereas the properties yield every sibling, text nodes included. See the code below.

```python
html2 = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element1">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup

soup2 = BeautifulSoup(html2, 'lxml')
link = soup2.find(class_="element1")
print(link)
print(link.find_previous_siblings("li"))  # matching <li> siblings before link
print(link.find_next_siblings("li"))      # matching <li> siblings after link
```

    <li class="element1">Bar</li>
    [<li class="element">Foo</li>]
    [<li class="element">Jay</li>]

## find_all_next(), find_next(), find_all_previous(), find_previous()

These return, in order, all matching nodes after the current one, the first matching node after it, all matching nodes before it, and the first matching node before it.

# CSS selectors

Pass a CSS selector to `select()`: a class starts with `.`, an id starts with `#`, and a space separates an ancestor from its descendant. All matches are returned as a list.

```python
print(soup.select(".panel .panel-heading"))
print(soup.select("#list-1 .element"))
```

    [<div class="panel-heading">
    <h4>Hello</h4>
    </div>]
    [<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]

```python
import json
import re

import requests
from bs4 import BeautifulSoup

url = "https://www.toutiao.com/a6467787316680196622/"
html = requests.get(url).text
# print(html)


def parse_page_detail(html, url):
    """Return the title, URL and gallery image URLs, or None if no gallery data is found."""
    soup = BeautifulSoup(html, 'lxml')
    result = soup.select('title')
    title = result[0].get_text() if result else ''
    # The gallery data is expected as a JavaScript assignment in the page source.
    images_pattern = re.compile(r'var gallery = (.*?);', re.S)
    result = re.search(images_pattern, html)
    if result:
        data = json.loads(result.group(1))
        if data and 'sub_images' in data:
            sub_images = data.get('sub_images')
            images = [item.get('url') for item in sub_images]
            # for image in images: download_image(image)
            return {
                'title': title,
                'url': url,
                'images': images
            }


print(parse_page_detail(html, url))
```

    None
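The scraper above prints `None`, which from the code alone means either the `var gallery = ...;` pattern did not match the downloaded page or the parsed JSON had no `sub_images` key. Below is a minimal offline sketch of the same parsing steps that reports which of the two happened. The `sample_html` string, the `parse_gallery` name, the example.com URLs, and the use of `"html.parser"` instead of `lxml` are all assumptions made up for illustration; none of it is real Toutiao markup.

```python
import json
import re

from bs4 import BeautifulSoup

# Hypothetical page source invented for this sketch; not real Toutiao markup.
sample_html = """
<html><head><title>Sample gallery</title></head>
<body>
<script>
var gallery = {"sub_images": [{"url": "http://example.com/1.jpg"}, {"url": "http://example.com/2.jpg"}]};
</script>
</body></html>
"""

GALLERY_PATTERN = re.compile(r"var gallery = (.*?);", re.S)


def parse_gallery(html, url):
    """Return title, url and image URLs, or None after printing the reason."""
    # "html.parser" ships with Python; the notes themselves use "lxml".
    soup = BeautifulSoup(html, "html.parser")
    title_tag = soup.select_one("title")
    title = title_tag.get_text() if title_tag else ""

    match = GALLERY_PATTERN.search(html)
    if match is None:
        print("no 'var gallery = ...;' assignment found")
        return None

    data = json.loads(match.group(1))
    sub_images = data.get("sub_images") if isinstance(data, dict) else None
    if not sub_images:
        print("gallery JSON has no 'sub_images' entry")
        return None

    return {
        "title": title,
        "url": url,
        "images": [item.get("url") for item in sub_images],
    }


result = parse_gallery(sample_html, "http://example.com/sample")
missing = parse_gallery("<html><title>No gallery</title></html>", "http://example.com/none")
print(result)
print(missing)
```

Note that the non-greedy `(.*?);` stops at the first semicolon, so this approach only works when the embedded JSON contains no `;` of its own. That holds for the sample here but may not hold for a real page.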