How to Build a Search Engine from Scratch: An Introduction to HTML and Basic BeautifulSoup Usage


Date: September 16, 2016

Title: An Introduction to HTML and Basic BeautifulSoup Usage

No.: 1

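I. HTML

1. What HTML is: HTML (HyperText Markup Language) is a markup language that describes web pages with markup tags.

2. An HTML tag is a keyword enclosed in angle brackets. Tags usually come in pairs of the form <tag> </tag>: the first is the start tag and the second is the end tag.

3. An HTML element is everything from a start tag to its matching end tag. Most HTML elements can be nested, that is, they can contain other HTML elements.

4. An HTML element can carry attributes, written as name/value pairs such as name=value.

e.g. <a href="http://bbs.sjtu.edu.cn/">bbs link</a>

5. Common tags

· Headings: <hN>level-N heading</hN>, where N = 1, 2, 3, 4, 5, 6 (that is, <h1> through <h6>)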

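· Styles (body text):

- <em>...</em>: emphasized text
- <strong>...</strong>: strongly emphasized text
- <b>...</b>: bold text
- <i>...</i>: italic text
- <big>...</big>: large text
- <p>...</p>: an ordinary paragraph (also serves to separate paragraphs)
- <sub>...</sub>: subscript
- <sup>...</sup>: superscript
- PS: subscripts and superscripts (sub, sup) can be placed inside other tags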

· Text:

- <hr /> horizontal rule
- <br /> line break

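· Special characters (the entity, followed by the character it displays):

- &lt; <
- &gt; >
- &amp; &
- &quot; "
- &nbsp; non-breaking space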

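· Links:

<a href="http://www.baidu.com/">Baidu website</a> is displayed as a link with the text "Baidu website"

· Images

<img src="logo.jpg" width="130" height="60"/> where src is the absolute or relative address of the image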

· Lists

<li>: a list item
<ul>: an unordered list
<ol>: an ordered list

· Tables:
- <table>: the table itself
- <tr>: a table row
- <td>: a data cell
- <th>: a header cell
- border attribute: the table border
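The vocabulary from this section (start tags, end tags, attributes, entities) can be seen in action with Python's standard-library html.parser module. What follows is only a minimal sketch: the TagCollector class and the sample snippet are made up for illustration and are not part of the original tutorial.

```python
from html import unescape
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record start tags with their attributes, end tags, and text (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.events.append(("start", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Skip whitespace-only text between tags
        if data.strip():
            self.events.append(("text", data.strip()))


# A made-up snippet: an unordered list holding a link and an escaped "<"
sample = '<ul><li><a href="http://bbs.sjtu.edu.cn/">bbs link</a></li><li>a &lt; b</li></ul>'

collector = TagCollector()
collector.feed(sample)
collector.close()

for event in collector.events:
    print(event)

# Entities can also be decoded on their own
decoded = unescape("&lt;p&gt; &amp; &quot;quoted&quot;")
print(decoded)
```

Each nested element shows up as a start event, its content, and a matching end event, which is the tree structure that BeautifulSoup builds for you in the next part.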

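II. BeautifulSoup

0. A note up front: Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8.

1. What BeautifulSoup is: BeautifulSoup is a Python library for pulling data out of HTML and XML files. It works with your preferred parser to provide idiomatic ways of navigating, searching, and modifying the document. Beautiful Soup can save you hours or even days of work.

2. Installing BeautifulSoup: it is normally installed with pip (pip install beautifulsoup4); look up the details for your platform.

3. How to use it:

1. From a web page to Python code: pass an HTML document to the BeautifulSoup constructor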

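from bs4 import BeautifulSoup

soup = BeautifulSoup(open("index.html")) # the input can be an open file
soup = BeautifulSoup("<html>data</html>") # or a string passed in directly

# To load a page straight from the web, use the urllib2 library
# (Python 2 only; in Python 3 the equivalent function is urllib.request.urlopen)
import urllib2

content = urllib2.urlopen('http://www.baidu.com').read()
soup = BeautifulSoup(content) # hand the page's HTML to BeautifulSoup

PS: at this point the document is automatically converted to Unicode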
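For readers on Python 3, the same three ways of loading a document might look roughly like the sketch below. The soup_from_url helper and the temporary file standing in for index.html are assumptions made for illustration; the parser is named explicitly ("html.parser") so the result does not depend on which parsers happen to be installed.

```python
import os
import tempfile
from urllib.request import urlopen

from bs4 import BeautifulSoup


def soup_from_url(url):
    """Download a page and parse it (illustrative helper; needs network access)."""
    with urlopen(url) as response:
        return BeautifulSoup(response.read(), "html.parser")


# 1. Parse a string directly
soup_from_string = BeautifulSoup("<html><body><p>data</p></body></html>", "html.parser")

# 2. Parse an open file; a temporary file stands in for index.html here
with tempfile.NamedTemporaryFile("w", suffix=".html", delete=False, encoding="utf-8") as handle:
    handle.write("<html><head><title>Local page</title></head></html>")
    path = handle.name
try:
    with open(path, encoding="utf-8") as page:
        soup_from_file = BeautifulSoup(page, "html.parser")
finally:
    os.remove(path)

print(soup_from_string.p.string)
print(soup_from_file.title.string)

# 3. Parse a live page (not run here because it needs network access):
# soup = soup_from_url("http://www.baidu.com")
```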

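2. Kinds of objects: BS4 turns an HTML document into a complex tree in which every node is a Python object. All of these objects fall into four kinds: Tag, NavigableString, BeautifulSoup, and Comment.

1. Tag: a Tag object corresponds to a tag in the original HTML or XML document, for example:

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>')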

tag = soup.b

type(tag)

>>> <class 'bs4.element.Tag'>

· Some important properties of a Tag:

- - - Name: every tag has a name, available as .name

tag.name

>>> u'b'

# A tag's name can be changed

tag.name = "blockquote"

tag

>>> <blockquote class="boldest">Extremely bold</blockquote>

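- - - Attributes: a tag can have any number of attributes, and they are accessed the same way as a dictionary

# e.g.:

tag['class']

>>> [u'boldest'] # u'' marks a Unicode string; class is a multi-valued attribute, so a list comes back

# The attributes can also be read directly as a dictionary through .attrs

tag.attrs

>>> {u'class': [u'boldest']}

# A tag's attributes can be added and modified, again just like a dictionary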

tag['class'] = 'verybold'

tag['id'] = 1

tag

>>> <blockquote class="verybold" id="1">Extremely bold</blockquote>

del tag['class']

del tag['id']

tag

>>> <blockquote>Extremely bold</blockquote>

tag['class']

>>> KeyError: 'class'

print(tag.get('class'))

>>> None

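- - - Multi-valued attributes:
      HTML 4 defines a few attributes that can have multiple values. HTML5 removes a couple of them but defines a few more. The most common multi-valued attribute is class (a tag can have more than one CSS class); others include rel, rev, accept-charset, headers, and accesskey. Beautiful Soup returns the value of a multi-valued attribute as a list:

css_soup = BeautifulSoup('<p class="body strikeout"></p>')
css_soup.p['class']
>>> ["body", "strikeout"]

css_soup = BeautifulSoup('<p class="body"></p>')
css_soup.p['class']
>>> ["body"]

# If an attribute looks like it has more than one value but is not defined as multi-valued by any version of the HTML standard, Beautiful Soup returns it as a plain string

id_soup = BeautifulSoup('<p id="my id"></p>')
id_soup.p['id']
>>> 'my id'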
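The attribute behaviour described above can be checked in one short Python 3 session. This is a sketch on a made-up one-tag document; the data-note attribute is purely illustrative.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="body strikeout" id="my id">text</p>', "html.parser")
p = soup.p

classes = p["class"]       # multi-valued attribute, returned as a list
identifier = p["id"]       # not multi-valued, returned as one string
missing = p.get("title")   # .get() returns None instead of raising KeyError

p["data-note"] = "added"   # add an attribute, dictionary style
del p["id"]                # and remove one

print(classes, identifier, missing)
print(p.attrs)
```

Using .get() rather than square brackets is the safer choice when an attribute may be absent, for the same reason as with an ordinary dictionary.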

2. NavigableString: a string is usually the text inside a tag, and Beautiful Soup wraps that text in the NavigableString class

tag.string # u'Extremely bold'

type(tag.string)

>>> <class 'bs4.element.NavigableString'>

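A NavigableString behaves like a Python Unicode string, and it also supports some of the features described in Navigating the tree and Searching the tree. You can convert a NavigableString to a plain Unicode string with unicode() (str() in Python 3):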

unicode_string = unicode(tag.string)

unicode_string

# u'Extremely bold'

type(unicode_string)

# <type 'unicode'>

The string inside a tag cannot be edited in place, but it can be replaced with another string using replace_with():

tag.string.replace_with("No longer bold")

tag

# <blockquote>No longer bold</blockquote>

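PS: strings do not support the .contents or .string attributes, or the find() method

3. BeautifulSoup: the BeautifulSoup object represents the document as a whole. For most purposes you can treat it as a Tag object; it supports most of the methods described in Navigating the tree and Searching the tree. Because the BeautifulSoup object does not correspond to an actual HTML or XML tag, it has no name and no attributes. It is sometimes useful to look at its .name anyway, so it has been given the special .name "[document]"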

soup.name

# u'[document]'

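4. Comment (comments and other special strings): Tag, NavigableString, and BeautifulSoup cover almost everything you will see in an HTML or XML file, but a few special objects are left over. The one you are most likely to need to worry about is the comment: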

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"

soup = BeautifulSoup(markup)

comment = soup.b.string

type(comment)

# <class 'bs4.element.Comment'>

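# A Comment object is just a special type of NavigableString:

comment

# u'Hey, buddy. Want to buy a used parser?'

# But when it appears as part of an HTML document, a Comment is output with special formatting:

print(soup.b.prettify())

# <b>
#  <!--Hey, buddy. Want to buy a used parser?-->
# </b>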
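Because a Comment is a subclass of NavigableString, code that collects the text of a page can pick up comments by accident. One way to tell the two apart is an isinstance check, sketched here on a made-up snippet:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

markup = "<p>visible text<!-- hidden note --></p>"
soup = BeautifulSoup(markup, "html.parser")

nodes = list(soup.p.children)

# Comments are NavigableStrings too, so they must be filtered out explicitly
comments = [node for node in nodes if isinstance(node, Comment)]
visible = [
    node for node in nodes
    if isinstance(node, NavigableString) and not isinstance(node, Comment)
]

print(comments)
print(visible)
```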

3. Navigating the tree:

# e.g.

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')

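· Child nodes: a tag may contain strings and other tags; these are the tag's children. Beautiful Soup provides many attributes for navigating and iterating over a tag's children.

Note: Beautiful Soup strings do not support these attributes, because a string cannot have children

- - - Using tag names:
 - - The simplest way to navigate the tree is to say the name of the tag you want. To get the <head> tag, just write soup.head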

soup.head

>>> <head><title>The Dormouse's story</title></head>

soup.title

>>> <title>The Dormouse's story</title>

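# You can use this trick again and again to zoom in on a part of the tree. The code below gets the first <b> tag inside the <body> tag:

soup.body.b

>>> <b>The Dormouse's story</b>

- - - - Using a tag name as an attribute only gives you the first tag with that name, even if the node has many children of that type

soup.a

# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

- - - - To get all the <a> tags, or anything more complicated than the first tag with a certain name, use one of the methods described in Searching the tree, such as find_all()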

soup.find_all('a')

# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,

# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,

# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

- - - .contents and .children:
      A tag's .contents attribute gives the tag's children as a list:

head_tag = soup.head
head_tag
# <head><title>The Dormouse's story</title></head>

head_tag.contents
# [<title>The Dormouse's story</title>]

title_tag = head_tag.contents[0]
title_tag
# <title>The Dormouse's story</title>
title_tag.contents
# [u"The Dormouse's story"]

# The BeautifulSoup object itself has children too; in this case the <html> tag is a child of the BeautifulSoup object:

len(soup.contents)

# 1

soup.contents[0].name

# u'html'

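- - - - The .children generator lets you loop over a tag's children:

for child in title_tag.children:
    print(child)

# The Dormouse's story

- - - .descendants:
 The .contents and .children attributes only cover a tag's direct children. For example, the <head> tag has a single direct child, the <title> tag.
 But the <title> tag has a child of its own: the string "The Dormouse's story". In that sense the string is also a descendant of the <head> tag. The .descendants attribute iterates recursively over all of a tag's descendants:

for child in head_tag.descendants:
    print(child)
# <title>The Dormouse's story</title>

# The Dormouse's story

- - - In the example above, the <head> tag has only one child but two descendants: the <title> tag and the <title> tag's child. The BeautifulSoup object has only one direct child (the <html> tag) but many descendants: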

len(list(soup.children))

# 1

len(list(soup.descendants))

# 25
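The difference between direct children and descendants is easiest to see on a document with no whitespace between tags, where every count is predictable. The snippet below is a made-up example, not the tutorial's html_doc:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup("<div><p>one <b>two</b></p><p>three</p></div>", "html.parser")
div = soup.div

direct_children = list(div.children)      # only the two <p> tags
all_descendants = list(div.descendants)   # tags and strings at every depth

# Keep just the tags among the descendants, in document order
descendant_tags = [node.name for node in all_descendants if isinstance(node, Tag)]

print(len(div.contents), len(direct_children), len(all_descendants))
print(descendant_tags)
```

In real pages the newlines between tags count as string nodes as well, so the numbers are usually larger than a quick look at the markup suggests.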

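- Parent nodes: every tag and every string has a parent, the tag that contains it
- - .parent
 - - An element's parent is available through the .parent attribute. In the "Alice" example document, the <head> tag is the parent of the <title> tag: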

title_tag = soup.title

title_tag

# <title>The Dormouse's story</title>

title_tag.parent

# <head><title>The Dormouse's story</title></head>

# The .parent of the BeautifulSoup object is None

print(soup.parent)

# None

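- - .parents: an element's .parents attribute iterates over all of its ancestors. The example below uses .parents to walk from an <a> tag up to the root of the document.

link = soup.a
link
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

# The loop ends at the BeautifulSoup object; None is never yielded
for parent in link.parents:
    print(parent.name)
# p
# body
# html
# [document]

- Sibling nodes

p = soup.p
p.next_sibling      # the node right after p (often a whitespace string between tags)
p.previous_sibling  # the node right before p
# Older code spells these nextSibling and previousSibling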
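Moving upward and sideways can be tried on a compact, made-up document. The markup below deliberately contains no whitespace between tags, so each sibling is a tag rather than a newline string:

```python
from bs4 import BeautifulSoup

markup = "<html><body><p><a id='x'>A</a><b>B</b><i>C</i></p></body></html>"
soup = BeautifulSoup(markup, "html.parser")
link = soup.a

# Walk upward: every ancestor, ending at the BeautifulSoup object itself
ancestor_names = [parent.name for parent in link.parents]

# Walk sideways among the children of the same parent
next_name = link.next_sibling.name
previous = link.previous_sibling          # nothing comes before the first child
following = [sibling.name for sibling in link.next_siblings]

print(ancestor_names)
print(next_name, previous, following)
```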

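Searching the tree:
- First, the kinds of filters that can be passed to the search methods
- - A string

soup.find_all('b')

# [<b>The Dormouse's story</b>]

- - A regular expression (the example finds every tag whose name starts with the letter b)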

import re

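for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
# body
# b

- - A list: if you pass in a list, Beautiful Soup returns everything that matches any item in the list. The code below finds all the <a> tags and all the <b> tags:

soup.find_all(["a", "b"])

# [<b>The Dormouse's story</b>,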

# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,

# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,

# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

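- - True (everything):
    True matches everything it can. The code below finds all the tags in the document, but none of the text strings:

for tag in soup.find_all(True):
    print(tag.name)
# html
# head
# title
# body
# p
# b
# p
# a
# a
# a
# p

- - A function: if none of the other filters fits, you can define a function that takes an element as its only argument. The function should return True if the element matches and False otherwise.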
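A function filter might look like the sketch below. It uses a shortened two-paragraph version of the example document, and the filter keeps tags that have a class attribute but no id attribute:

```python
from bs4 import BeautifulSoup

html_doc = (
    '<p class="title"><b>The Dormouse\'s story</b></p>'
    '<p class="story">'
    '<a class="sister" id="link1" href="http://example.com/elsie">Elsie</a>'
    '</p>'
)
soup = BeautifulSoup(html_doc, "html.parser")


def has_class_but_no_id(tag):
    """Match tags that define class but do not define id."""
    return tag.has_attr("class") and not tag.has_attr("id")


# Pass the function itself, not the result of calling it
matches = soup.find_all(has_class_but_no_id)
summary = [(tag.name, tag["class"]) for tag in matches]
print(summary)
```

The <b> tag is skipped because it has no class, and the <a> tag is skipped because it has an id, which leaves only the two <p> tags.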
- The find_all() function:
  find_all( name , attrs , recursive , string , **kwargs )
  The find_all() method looks through a tag's descendants and retrieves all those that match the filters. Here are a few examples:

soup.find_all("title")

# [<title>The Dormouse's story</title>]

soup.find_all("p", "title")

# [<p class="title"><b>The Dormouse's story</b></p>]

soup.find_all("a")

# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,

# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,

# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find_all(id="link2")

# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

import re

soup.find(string=re.compile("sisters"))

# u'Once upon a time there were three little sisters; and their names were\n'

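- - The name argument:
    Pass a value for name and Beautiful Soup only considers tags with that name; text strings are ignored.
  - Keyword arguments:
    Any argument whose name is not one of the built-in search arguments is treated as a filter on the tag attribute of that name. If you pass a value for an argument called id, Beautiful Soup filters on each tag's "id" attribute.
    The value used to filter an attribute can be a string, a regular expression, a list, or True.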

soup.find_all(id='link2')

# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

soup.find_all(href=re.compile("elsie"))

# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

soup.find_all(id=True)

# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,

# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,

# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find_all(href=re.compile("elsie"), id='link1')

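# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

# Some attributes, such as HTML5 data-* attributes, have names that cannot be used as Python keyword arguments. Search for them by passing a dictionary as the attrs argument of find_all():

data_soup = BeautifulSoup('<div data-foo="value">foo!</div>')
data_soup.find_all(attrs={"data-foo": "value"})

# [<div data-foo="value">foo!</div>]

- - Searching by CSS class
    Searching for tags by CSS class is very useful, but class is a reserved word in Python, so using class as a keyword argument is a syntax error. As of Beautiful Soup 4.1.2, you can search by CSS class with the class_ argument instead: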

soup.find_all("a", class_="sister")

# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,

# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,

# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

- - The string argument:
    With string you can search for strings in the document instead of tags. Like name, string accepts a string, a regular expression, a list, True, or a function. Some examples:

soup.find_all(string="Elsie")
# [u'Elsie']

soup.find_all(string=["Tillie", "Elsie", "Lacie"])
# [u'Elsie', u'Lacie', u'Tillie']

soup.find_all(string=re.compile("Dormouse"))
# [u"The Dormouse's story", u"The Dormouse's story"]

def is_the_only_string_within_a_tag(s):
    """Return True if this string is the only child of its parent tag."""
    return (s == s.parent.string)

soup.find_all(string=is_the_only_string_within_a_tag)

# [u"The Dormouse's story", u"The Dormouse's story", u'Elsie', u'Lacie', u'Tillie', u'...']

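- - Other arguments: limit caps the number of results returned, and recursive=False restricts the search to a tag's direct children; look up the details yourself
- The find() function: like find_all(), but it returns only the first result (or None if nothing matches)
- PS: find_all() and find() only search the descendants of the current node (children, grandchildren, and so on)
- find_parents() and find_parent(): search among the ancestors
- find_next_siblings(), find_next_sibling(), find_previous_siblings(), and find_previous_sibling(): search among the siblings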
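The remaining search options can be compared side by side. The sketch below uses a trimmed, whitespace-free version of the example document so that each result is easy to predict:

```python
from bs4 import BeautifulSoup

html_doc = (
    "<html><head><title>The Dormouse's story</title></head><body>"
    '<p class="story">'
    '<a class="sister" id="link1">Elsie</a>'
    '<a class="sister" id="link2">Lacie</a>'
    '<a class="sister" id="link3">Tillie</a>'
    "</p></body></html>"
)
soup = BeautifulSoup(html_doc, "html.parser")

# limit stops the search after the given number of matches
first_two = [a["id"] for a in soup.find_all("a", limit=2)]

# recursive=False looks only at direct children; <title> sits one level deeper
direct_only = soup.html.find_all("title", recursive=False)

# find() returns a single result, or None when nothing matches
first_link = soup.find("a")
no_table = soup.find("table")

# Searching upward and sideways from a starting tag
parent_classes = first_link.find_parent("p")["class"]
later_ids = [a["id"] for a in first_link.find_next_siblings("a")]

print(first_two, direct_only, no_table)
print(parent_classes, later_ids)
```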
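Other methods:
- get_text(): if you only want the text inside a tag, use get_text(). It collects all the text in the tag, including the text of its descendant tags, and returns it as a single Unicode string
- get('id', ''): returns the value of the id attribute; if the attribute is missing, it returns the default passed as the second argument ('' here), while get('id') with no default returns None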
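Both methods in a short sketch, on a made-up document with one paragraph that has an id and one that does not:

```python
from bs4 import BeautifulSoup

markup = '<p id="intro">Hello, <b>bold</b> world</p><p>no id</p>'
soup = BeautifulSoup(markup, "html.parser")
first, second = soup.find_all("p")

# get_text() joins the text of the tag and of everything nested inside it
text = first.get_text()

# get() behaves like dict.get(): an optional default replaces None
present = first.get("id")
absent = second.get("id")
absent_with_default = second.get("id", "")

print(text)
print(present, absent, repr(absent_with_default))
```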
