Web Scraping Basics -- pyquery

Source: Internet   Editor: 程序博客网   Date: 2024/06/05 22:39

Pyquery

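pyquery is a powerful and flexible HTML parsing library. Regular expressions are tedious to write by comparison, and if you find the BeautifulSoup syntax hard to remember but already know jQuery, pyquery is an excellent choice, because its API is modeled on jQuery's.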


Initialization

Initializing from a string

html = '''
<div>
    <ul>
        <li class="item-0">first item</li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
        <li class="item-1 active"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
</div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
print(doc('li'))

Output:

<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>

Initializing from a URL

pyquery requests the URL automatically and initializes the document from the response.
from pyquery import PyQuery as pq
doc = pq(url='http://www.mi.com')
print(doc('head'))

Output:
<head>
<meta http-equiv="X-UA-Compatible" content="IE=Edge"/>
<meta charset="UTF-8"/>
<title>小米商城 - 小米MIX 2、红米Note 5A、小米Note 3、小米笔记本官方网站</title>
...

Initializing from a file

You can also initialize a document from a local file:

from pyquery import PyQuery as pq
doc = pq(filename='demo.html')
print(doc('li'))


Basic CSS selectors

html = '''
<div id="container">
    <ul class="list">
        <li class="item-0">first item</li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
        <li class="item-1 active"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
</div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
print(doc('#container .list li'))

Output:
Every li element under the .list element inside #container is printed.

Finding child elements:

find() searches all descendants of the current selection for elements that match the selector.

html = '''
<div id="container">
    <ul class="list">
        <li class="item-0">first item</li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
        <li class="item-1 active"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
</div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
items = doc('.list')
print(type(items))
print(items)
lis = items.find('li')
print(type(lis))
print(lis)

Output:
<class 'pyquery.pyquery.PyQuery'>
<ul class="list">
    <li class="item-0">first item</li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
    <li class="item-1 active"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
<class 'pyquery.pyquery.PyQuery'>
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>

Use children() to select only the direct children. It also accepts a selector to filter them:
lis = items.children('.active')
print(type(lis))
print(lis)

Output:
<class 'pyquery.pyquery.PyQuery'>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>


Parent elements

parent() returns the direct parent node of the selection.
html = '''
<div id="container">
    <ul class="list">
        <li class="item-0">first item</li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
        <li class="item-1 active"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
</div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
items = doc('.list')
container = items.parent()
print(type(container))
print(container)

Sibling elements

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
</div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
li = doc('.list .item-0.active')
print(li.siblings())

Output:
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0">first item</li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>

siblings() also accepts a selector to filter the result:

li = doc('.list .item-0.active')
print(li.siblings('.active'))

Output:

<li class="item-1 active"><a href="link4.html">fourth item</a></li>


Traversal

A single element

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
</div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
li = doc('.item-0.active')
print(li)

Output:

<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>

When the selection contains several elements, call items() to get a generator and loop over it:

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
</div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
lis = doc('li').items()
print(type(lis))
for li in lis:
    print(li)


Getting text

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
</div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
a = doc('.item-0.active a')
print(a)
print(a.text())

Output:

<a href="link3.html"><span class="bold">third item</span></a>
third item

Getting HTML

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
</div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
li = doc('.item-0.active')
print(li)
print(li.html())

Output:

<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<a href="link3.html"><span class="bold">third item</span></a>

DOM manipulation: addClass, removeClass

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
</div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
li = doc('.item-0.active')
print(li)
li.removeClass('active')
print(li)
li.addClass('active')
print(li)

Output:

<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-0"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>

Setting attributes and styles: attr, css

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
</div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
li = doc('.item-0.active')
print(li)
li.attr('name', 'link')
print(li)
li.css('font-size', '14px')
print(li)

Output:

<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-0 active" name="link"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-0 active" name="link" style="font-size: 14px"><a href="link3.html"><span class="bold">third item</span></a></li>

remove

html = '''
<div class="wrap">
    Hello, World
    <p>This is a paragraph.</p>
</div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
wrap = doc('.wrap')
print(wrap.text())
wrap.find('p').remove()
print(wrap.text())


Other DOM operations

http://pyquery.readthedocs.io/en/latest/api.html


Official pyquery documentation:   http://pyquery.readthedocs.io/
