The find_all() method in the BeautifulSoup library

Source: Internet  Editor: 程序博客网  Time: 2024/06/05 05:46

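Today I read up on the find_all() method of the BeautifulSoup library, so here is a summary. BeautifulSoup is a library for parsing, traversing, and maintaining a tag tree; after fetching a web page, we can use it to parse the page's content.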

find_all(name, attrs, recursive, string, **kwargs)

1. name: the tag name to search for

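import re  # used by the regular-expression examples further down
import requests
from bs4 import BeautifulSoup

url = 'http://python123.io/ws/demo.html'
demo = ''  # fall back to an empty document if the request fails
try:
    r = requests.get(url, timeout=30)
    r.raise_for_status()  # raises an exception if the HTTP status indicates an error
    r.encoding = r.apparent_encoding  # encoding is guessed from the HTTP header, apparent_encoding from the content
    demo = r.text
    print(demo)
except requests.RequestException:
    print('there is a mistake')

soup = BeautifulSoup(demo, 'html.parser')
soup.find_all('a')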


Output:

soup.find_all('a')
Out[1]: [<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
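Besides a single tag name, the name argument also accepts a list of names, a compiled regular expression, or True (match every tag). The sketch below illustrates this on an inline HTML string; that string is an assumption, a shortened stand-in for the demo page, used so the example runs without network access.

```python
import re
from bs4 import BeautifulSoup

# Assumed stand-in for the demo page, inlined so no request is needed.
html = (
    '<html><head><title>This is a python demo page</title></head>'
    '<body>'
    '<p class="title"><b>The demo python introduces several python courses.</b></p>'
    '<p class="course">Python is a wonderful general-purpose programming language. '
    '<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and '
    '<a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>'
    '</body></html>'
)
soup = BeautifulSoup(html, 'html.parser')

# A list matches any of the given tag names, in document order.
names_from_list = [tag.name for tag in soup.find_all(['a', 'b'])]

# A compiled regex is applied to each tag name; '^b' matches names starting with "b".
names_from_regex = [tag.name for tag in soup.find_all(re.compile('^b'))]

# True matches every tag in the document.
all_names = [tag.name for tag in soup.find_all(True)]

print(names_from_list)
print(names_from_regex)
print(all_names)
```

Results come back in document order, which is why the b tag is listed before the two a tags.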


2. attrs: the tag attributes to search for

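soup.find_all(id=re.compile('link'))
Out[1]: [<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
This retrieves every tag whose id attribute contains "link" (the re module must be imported first). Here the filter is passed as a keyword argument; the same condition can be written with the attrs parameter as soup.find_all(attrs={'id': re.compile('link')}).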
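The different ways of filtering on attributes can be compared side by side. This is a minimal sketch on an assumed inline HTML string (a shortened stand-in for the demo page); note that a plain string value must match the attribute exactly, which is why the regular expression is needed to match both link1 and link2.

```python
import re
from bs4 import BeautifulSoup

# Assumed stand-in for the demo page, inlined so no request is needed.
html = (
    '<html><body><p class="course">'
    '<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and '
    '<a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.'
    '</p></body></html>'
)
soup = BeautifulSoup(html, 'html.parser')

# attrs takes a dict of attribute name -> filter.
by_dict = soup.find_all(attrs={'id': 'link1'})

# class is a Python keyword, so the keyword-argument form is spelled class_.
by_class = soup.find_all('a', class_='py2')

# A plain string is an exact match: no tag has id exactly equal to "link".
exact = soup.find_all(id='link')

# A compiled regex matches any id that contains "link".
by_regex = soup.find_all(id=re.compile('link'))

print([tag['id'] for tag in by_dict])
print([tag['id'] for tag in by_class])
print(exact)
print([tag['id'] for tag in by_regex])
```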


3. recursive: whether to search all descendants (True, the default) or only direct children (False)

soup.find_all('a')
Out[2]: [<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
soup.find_all('a',recursive=False)
Out[3]: []
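The empty result follows from where the a tags sit in the tree: with recursive=False only the direct children of the object being searched are examined, and the a tags are not direct children of soup. A small sketch of this, again on an assumed inline stand-in for the demo page:

```python
from bs4 import BeautifulSoup

# Assumed stand-in for the demo page, inlined so no request is needed.
html = (
    '<html><body><p class="course">'
    '<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and '
    '<a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.'
    '</p></body></html>'
)
soup = BeautifulSoup(html, 'html.parser')

# The only direct child of the soup object here is the <html> tag.
top_level = [child.name for child in soup.children]

# So a non-recursive search from the top finds no <a> tags.
from_top = soup.find_all('a', recursive=False)

# Searching from the <p> that directly contains the links does find them.
course = soup.find('p', class_='course')
from_parent = course.find_all('a', recursive=False)

print(top_level)
print(from_top)
print([tag['id'] for tag in from_parent])
```

In other words, recursive=False is mainly useful when calling find_all() on a specific tag rather than on the whole document.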


4. string: searches the text inside <>...</> for matching strings

soup.find_all(string = re.compile('python'))
Out[4]: ['This is a python demo page', 'The demo python introduces several python courses.']
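Note that the result above omits "Basic Python" and "Advanced Python": the regular expression is case-sensitive, so 'python' does not match "Python". The sketch below, on an assumed inline stand-in for the demo page, contrasts an exact string, a case-sensitive regex, and a case-insensitive regex, and shows that combining string with a tag name returns tags instead of strings.

```python
import re
from bs4 import BeautifulSoup

# Assumed stand-in for the demo page, inlined so no request is needed.
html = (
    '<html><head><title>This is a python demo page</title></head>'
    '<body>'
    '<p class="title"><b>The demo python introduces several python courses.</b></p>'
    '<p class="course">Python is a wonderful general-purpose programming language. '
    '<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and '
    '<a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>'
    '</body></html>'
)
soup = BeautifulSoup(html, 'html.parser')

# A plain string must equal the whole text node.
exact = [str(s) for s in soup.find_all(string='Basic Python')]

# Case-sensitive regex: only text containing lowercase "python".
lower = [str(s) for s in soup.find_all(string=re.compile('python'))]

# Case-insensitive regex: also picks up "Python".
any_case = [str(s) for s in soup.find_all(string=re.compile('python', re.I))]

# With a tag name, the tags whose text matches are returned instead of the strings.
links = soup.find_all('a', string=re.compile('Basic'))

print(exact)
print(lower)
print(len(any_case))
print([tag['id'] for tag in links])
```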




