Notes on Using BeautifulSoup


1. Basic usage: fetch the h1 tag from a page at a given URL

from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup

def gettitle(url):
    try:
        html = urlopen(url)
    except HTTPError:
        # the page could not be fetched at all
        return None
    try:
        # pass the parser explicitly so results are consistent across machines
        bsobj = BeautifulSoup(html.read(), 'html.parser')
        title = bsobj.body.h1
    except AttributeError:
        # raised if the page has no <body> or no <h1>
        return None
    return title

title = gettitle("http://www.pythonscraping.com/pages/page1.html")
if title is None:
    print('not found')
else:
    print(title)

2. Parsing complex HTML

  • Finding tags by their attributes, working with lists of tags, and navigating the parse tree (a navigation sketch follows the example below)
from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup

def get_names(url):
    try:
        html = urlopen(url)
    except HTTPError:
        return None
    try:
        bsobj = BeautifulSoup(html.read(), 'html.parser')
        # findAll returns a (possibly empty) list of matching tags, never None
        namelist = bsobj.findAll('span', {'class': 'green'})
        for name in namelist:
            print(name.get_text())  # get_text() strips the tags and keeps only the text content
    except AttributeError:
        return None

get_names("http://www.pythonscraping.com/pages/warandpeace.html")
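The example above only covers searching by attribute; for navigating the parse tree itself, BeautifulSoup provides .children, .next_siblings, and .parent. Below is a minimal sketch, assuming the gift-table demo page http://www.pythonscraping.com/pages/page3.html from the same site and a table with the id "giftList" (both are assumptions; any page with a table works the same way):

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://www.pythonscraping.com/pages/page3.html')
bsobj = BeautifulSoup(html.read(), 'html.parser')
table = bsobj.find('table', {'id': 'giftList'})  # assumed id, see note above

# .children iterates only over direct children (tags and text nodes), not all descendants
for child in table.children:
    print(child)

# .next_siblings iterates over everything that follows a tag at the same level,
# e.g. every row after the first one
for sibling in table.tr.next_siblings:
    print(sibling)

# .parent walks one level up the tree
print(table.tr.parent.name)  # the name of the tag that directly contains the first row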

find(tag, attributes, recursive, text, keywords)
findAll(tag, attributes, recursive, text, limit, keywords)
tag: the tag name to match (div, h1, h2, and so on), or a list of tag names
attributes: a dict of attributes to match, e.g. {"class": "green"}
recursive: if set to False, only direct children are searched; the default is True, which searches all descendants that match the filter
text: match on a tag's text content instead of its attributes; wrapping the call in len() gives the number of tags containing exactly that text
limit: the maximum number of results to return (findAll only)
keywords: match attributes passed as keyword arguments; because class is a reserved word in Python, write obj.findAll(class_="green") (a usage sketch of these parameters follows below)
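A minimal sketch of these parameters in use, reusing the warandpeace.html page from the example above (the string passed to text is only an illustration and may match nothing):

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://www.pythonscraping.com/pages/warandpeace.html')
bsobj = BeautifulSoup(html.read(), 'html.parser')

# attributes dict and keyword argument: these two calls are equivalent
greens = bsobj.findAll('span', {'class': 'green'})
greens_kw = bsobj.findAll('span', class_='green')

# limit: return at most two matching tags
first_two = bsobj.findAll('span', limit=2)

# recursive=False: search only the direct children of <body>
top_level = bsobj.body.findAll('div', recursive=False)

# text: match on text content; len() counts how many tags contain exactly this string
count = len(bsobj.findAll(text='the prince'))

print(len(greens), len(greens_kw), len(first_two), len(top_level), count)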

3. Crawling a single site: the site has many pages and many of its links repeat, so gather every link already discovered in one place, stored in a structure that is fast to query (a set here), and only crawl links that have not been seen yet (this will not work on sites with anti-scraping measures, such as Zhihu or Baidu)

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

pages = set()  # every link discovered so far

def get_links(pageurl):
    global pages
    html = urlopen('http://en.wikipedia.org' + pageurl)
    bsObj = BeautifulSoup(html, 'html.parser')
    # internal article links on Wikipedia all start with /wiki/
    for link in bsObj.findAll('a', href=re.compile('^(/wiki/)')):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:
                # a link we have not seen before: record it, then crawl it
                newPage = link.attrs['href']
                print(newPage)
                pages.add(newPage)
                get_links(newPage)

if __name__ == '__main__':
    get_links('')
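One caveat with the recursive version above: each new link adds a stack frame, so on a site the size of Wikipedia it will hit Python's default recursion limit (roughly 1000 frames) long before finishing. A minimal iterative sketch using an explicit queue, under the same assumption about Wikipedia's /wiki/ URL prefix (the name crawl_site is just for illustration):

from collections import deque
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

def crawl_site(start=''):
    pages = set()              # every link discovered so far
    queue = deque([start])     # links waiting to be visited
    while queue:
        pageurl = queue.popleft()
        html = urlopen('http://en.wikipedia.org' + pageurl)
        bsObj = BeautifulSoup(html, 'html.parser')
        for link in bsObj.findAll('a', href=re.compile('^(/wiki/)')):
            href = link.attrs.get('href')
            if href and href not in pages:
                print(href)
                pages.add(href)    # remember it so it is crawled only once
                queue.append(href)
    return pages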


4. Crawling across the internet: external links

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
import datetime
import random

pages = set()
random.seed(datetime.datetime.now())

# collect all internal links found on a page
def getInternalLinks(bsObj, includeUrl):
    internalLinks = []
    # internal links start with / or contain the site's own domain
    for link in bsObj.findAll('a', href=re.compile('^(/|.*' + includeUrl + ')')):
        if link.attrs['href'] is not None:
            if link.attrs['href'] not in internalLinks:
                internalLinks.append(link.attrs['href'])
    return internalLinks

# collect all external links found on a page
def getExternalLinks(bsObj, excludeUrl):
    externalLinks = []
    # external links start with http/www/https and do not contain the current domain
    for link in bsObj.findAll('a', href=re.compile('^(http|www|https)((?!' + excludeUrl + ').)*$')):
        if link.attrs['href'] is not None:
            if link.attrs['href'] not in externalLinks:
                externalLinks.append(link.attrs['href'])
    return externalLinks

def splitAddress(address):
    # 'http://example.com/a/b' -> ['example.com', 'a', 'b']
    addressParts = address.replace('http://', '').split('/')
    return addressParts

def getRandomExternalLink(startingPage):
    html = urlopen(startingPage)
    bsObj = BeautifulSoup(html, 'html.parser')
    externalLinks = getExternalLinks(bsObj, splitAddress(startingPage)[0])
    if len(externalLinks) == 0:
        # no external link on this page
        return None
    else:
        return externalLinks[random.randint(0, len(externalLinks) - 1)]

def followExternalOnly(startingSite):
    externalLink = getRandomExternalLink(startingSite)
    if externalLink is None:
        print('No external link found on ' + startingSite)
        return
    print('Random external link: ' + externalLink)
    followExternalOnly(externalLink)

followExternalOnly('http://jianshu.com')


