黑马程序员: Jsoup Notes

Source: Internet | Published by: 全文翻译软件 | Editor: 程序博客网 | Time: 2024/05/21 14:54


Jsoup

I. Parsing HTML Documents

Jsoup handles HTML by converting the HTML input supplied by the user into a Document object.

1. Parsing an HTML string

Document doc = Jsoup.parse(String html)
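A minimal sketch of parsing a string, assuming the jsoup library is on the classpath. The class name ParseStringExample, the helper titleOf, and the sample HTML are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class ParseStringExample {
    // Parses an HTML string and returns the text of its <title> element.
    static String titleOf(String html) {
        Document doc = Jsoup.parse(html);
        return doc.title();
    }

    public static void main(String[] args) {
        // Sample markup invented for this sketch.
        String html = "<html><head><title>First parse</title></head>"
                + "<body><p>Parsed HTML into a doc.</p></body></html>";
        System.out.println(titleOf(html)); // prints "First parse"
    }
}
```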

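2. Loading a Document from a URL

Document doc = Jsoup.connect(String url).get()

connect(String url) returns a Connection. The Connection interface contains nested types such as Connection.Request, Connection.Response, and Connection.Method, which represent the HTTP request, the HTTP response, and the request method (GET, POST) respectively.

Connection member methods:

request(): gets the Connection.Request object

request(Connection.Request request): sets the request object

response(): gets the Connection.Response object

response(Connection.Response response): sets the response object

Configuring the request:

data(): sets the request data. With GET the data is sent in the query string (for example ?name=jsoup&language=Java); with POST it is sent in the request body.

Firebug screenshot: wd=java is the request data.

cookie(): sets a cookie on the request

userAgent(String userAgent): sets the User-Agent header

timeout(): sets the request timeout, in milliseconds

get(): fetches the URL with the GET method

post(): fetches the URL with the POST method

followRedirects(): sets whether URL redirects are followed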
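The request settings listed above can be chained on the Connection before anything is sent. The following is a sketch only: the URL, the cookie name, and the User-Agent string are placeholders chosen for illustration, and the final get() call is left commented out so that the example does not touch the network.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;

class ConnectionConfigExample {
    // Builds a configured Connection without sending a request.
    static Connection buildSearchConnection(String url) {
        return Jsoup.connect(url)
                .data("wd", "java")          // request data; sent as ?wd=java for GET
                .userAgent("Mozilla/5.0")    // placeholder User-Agent value
                .cookie("session", "demo")   // hypothetical cookie
                .timeout(3000)               // milliseconds
                .followRedirects(true);
    }

    public static void main(String[] args) {
        Connection conn = buildSearchConnection("http://example.com/search");
        Connection.Request req = conn.request();
        System.out.println(req.timeout());          // the configured timeout
        System.out.println(req.cookie("session"));  // the configured cookie value
        // Document doc = conn.get();  // would execute the request with GET
    }
}
```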

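3. Loading a Document from a file

Document doc = Jsoup.parse(File in, String charsetName, String baseUri)

baseUri: an HTML document usually contains many links, images, and references to external scripts and CSS files. When the document refers to these resources with relative paths, jsoup resolves them against baseUri to produce absolute URLs.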
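To see the effect of the base URL, the sketch below writes a small HTML file to a temporary location, parses it, and resolves a relative link with absUrl(). The file contents and the base address http://example.com/ are assumptions made for this example.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ParseFileExample {
    // Writes a temp file, parses it, and returns the first link as an absolute URL.
    static String demo() {
        try {
            File in = File.createTempFile("jsoup-demo", ".html");
            in.deleteOnExit();
            String html = "<html><body><a href=\"/docs/index.html\">Docs</a></body></html>";
            Files.write(in.toPath(), html.getBytes(StandardCharsets.UTF_8));

            // The third argument is the base used to resolve relative URLs.
            Document doc = Jsoup.parse(in, "UTF-8", "http://example.com/");
            Element link = doc.select("a").first();
            return link.absUrl("href");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints "http://example.com/docs/index.html"
    }
}
```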

4. Parsing a body fragment

Document doc = Jsoup.parseBodyFragment(String html)
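A short sketch of fragment parsing: the fragment is placed inside the body of a newly created document, so it can be reached through doc.body(). The unclosed <p> in the sample input is deliberate and invented for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class BodyFragmentExample {
    // Parses an HTML fragment and returns the text found in the resulting body.
    static String bodyTextOf(String fragment) {
        Document doc = Jsoup.parseBodyFragment(fragment);
        return doc.body().text();
    }

    // Returns the tag name of the first element the parser put in the body.
    static String firstBodyTag(String fragment) {
        Document doc = Jsoup.parseBodyFragment(fragment);
        return doc.body().child(0).tagName();
    }

    public static void main(String[] args) {
        System.out.println(bodyTextOf("<p>Hello there"));   // prints "Hello there"
        System.out.println(firstBodyTag("<p>Hello there")); // prints "p"
    }
}
```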

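II. Extracting Data

1. Traversing the document with DOM methods

HTML document:

HTML element:

An HTML element is everything from the start tag to the end tag

For example, <a href="http://www.w3schools.com">This is a link</a> as a whole is one element. An element consists of a tag, attributes, and nested elements.

The HTML document above contains three HTML elements:

the <html> element, the <body> element, and the <p> element. The <html> element contains the <body> element, and the <body> element contains the <p> element. The <p> element is the whole of <p>This is my first paragraph.</p>, and its element content is This is my first paragraph.

Element content is whatever lies between the start tag and the end tag

HTML attribute:

An element can have several attributes, which provide additional information about it. Attributes are usually defined in the start tag as key/value pairs. The value is enclosed in double or single quotes; when the value itself contains double quotes, use single quotes, for example name='John "ShotGun" Nelson'. Keys and values are usually lowercase, for example:

<a href="http://www.w3schools.com">This is a link</a> is an element whose element content is This is a link, whose attribute key is href, and whose attribute value is http://www.w3schools.com

Below are some attributes that can be used on any HTML element: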

About Document:

Document class hierarchy (an HTML Document):

org.jsoup.nodes.Node
|--TextNode
|--DataNode
|--Comment
|--Element
   |--Document

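Document member methods:

body(): gets the document's body element

head(): gets the document's head element

title(): gets the text of the document's title element

title(String title): sets the text of the document's title element

text(): gets the document's text content

text(String text): sets the document's text content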

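html(): gets the HTML inside the document node, which includes the <html> element itself

outerHtml(): for a Document this returns the same markup as html(), because the document node has no wrapper tag of its own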

(plus all member methods of Element)
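The Document accessors can be tried on a small document. This is a sketch with invented sample markup; the class name DocumentMethodsExample and its helpers are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class DocumentMethodsExample {
    // Sample document invented for this sketch.
    static Document sample() {
        return Jsoup.parse("<html><head><title>Demo</title></head>"
                + "<body><p>This is my first paragraph.</p></body></html>");
    }

    // Sets a new title and reads it back.
    static String renamedTitle() {
        Document doc = sample();
        doc.title("Renamed");
        return doc.title();
    }

    public static void main(String[] args) {
        Document doc = sample();
        System.out.println(doc.title());          // prints "Demo"
        System.out.println(doc.head().tagName()); // prints "head"
        System.out.println(doc.body().text());    // prints "This is my first paragraph."
        System.out.println(renamedTitle());       // prints "Renamed"
    }
}
```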

About Element:

An HTML element is everything from the start tag to the end tag

An Element consists of a tag, attributes, nested elements, and child nodes

Element member methods:

Finding elements (these return an Element or an Elements collection):

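getElementById(String id): finds an element by id, searching this element and then its descendants, and returns the first element with a matching id. For example:

<div id="content">

….

</div>

getElementById("content") returns the Element:

<div id="content">

….

</div>

getElementsByTag(String tag): finds elements by tag name, searching this element and then its descendants, and returns every element with a matching tag. For example:

getElementsByTag("a") returns the Elements:

<a href="http://www.baidu.com/#" name="ime_py">拼音</a>

<a href="http://www.baidu.com/#" name="ime_cl">关闭</a>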
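The two lookups can be combined as in the sketch below. The markup is a simplified stand-in for the page fragment shown above (the link text is replaced with ASCII placeholders), and the class and helper names are invented.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class FindByIdAndTagExample {
    // Simplified sample markup; the real page is assumed to be larger.
    static Document sample() {
        return Jsoup.parse("<div id=\"content\">"
                + "<a href=\"http://www.baidu.com/#\" name=\"ime_py\">pinyin</a>"
                + "<a href=\"http://www.baidu.com/#\" name=\"ime_cl\">close</a>"
                + "</div>");
    }

    // Finds the container by id, then all links inside it by tag name.
    static Elements linksInContent() {
        Element content = sample().getElementById("content");
        return content.getElementsByTag("a");
    }

    public static void main(String[] args) {
        Elements links = linksInContent();
        System.out.println(links.size());               // prints 2
        System.out.println(links.first().attr("name")); // prints "ime_py"
    }
}
```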

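getElementsByClass(String className): as above, but matches by class name

getElementsByAttribute(String key): returns all elements under this element that have the attribute key

getElementsByAttributeStarting(String keyPrefix): returns elements that have an attribute whose name starts with keyPrefix

getElementsByAttributeValue(String key, String value): returns elements whose attribute key has exactly the given value

getElementsByAttributeValueNot(String key, String value): returns elements that either lack the attribute or have a different value for it

getElementsByAttributeValueStarting(String key, String valuePrefix): returns elements whose attribute value starts with valuePrefix

getElementsByAttributeValueEnding(String key, String valueSuffix): returns elements whose attribute value ends with valueSuffix

getElementsByAttributeValueContaining(String key, String match): returns elements whose attribute value contains match

getElementsByAttributeValueMatching(String key, String regex): returns elements whose attribute value matches the regular expression

getElementsByIndexLessThan(int index):

getElementsByIndexGreaterThan(int index):

getElementsByIndexEquals(int index):

These three methods select elements by their index among their sibling elements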
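The attribute and index finders can be compared side by side on one list. The sketch below uses invented markup and only a subset of the methods; the expected counts in the comments apply to this sample only.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class AttributeFinderExample {
    // Returns the <ul> of a small sample document invented for this sketch.
    static Element sampleList() {
        String html = "<ul>"
                + "<li><a href=\"http://example.com/a.png\" title=\"first\">A</a></li>"
                + "<li><a href=\"/b.html\">B</a></li>"
                + "<li><span data-id=\"3\">C</span></li>"
                + "</ul>";
        Document doc = Jsoup.parse(html);
        return doc.getElementsByTag("ul").first();
    }

    public static void main(String[] args) {
        Element ul = sampleList();
        System.out.println(ul.getElementsByAttribute("title").size());                         // 1
        System.out.println(ul.getElementsByAttributeStarting("data-").size());                 // 1
        System.out.println(ul.getElementsByAttributeValue("href", "/b.html").size());          // 1
        System.out.println(ul.getElementsByAttributeValueStarting("href", "http").size());     // 1
        System.out.println(ul.getElementsByAttributeValueEnding("href", ".png").size());       // 1
        System.out.println(ul.getElementsByAttributeValueContaining("href", "example").size());// 1
        System.out.println(ul.getElementsByAttributeValueMatching("href", "\\.(png|jpe?g)$").size()); // 1
        // Only the second and third <li> have a sibling index greater than 0.
        System.out.println(ul.getElementsByIndexGreaterThan(0).size());                        // 2
    }
}
```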

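getElementsContainingText(String searchText): finds elements by text; the text may be in the element itself or in any of its descendants

getElementsContainingOwnText(String searchText): finds elements by text; the text must be in the element itself

getElementsMatchingText(String regex): finds elements whose text matches the supplied regular expression

getElementsMatchingOwnText(String regex): finds elements whose own text matches the supplied regular expression

getAllElements(): returns this element and all of its descendant elements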
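The difference between the "text" and "own text" finders is easiest to see on nested markup. In this sketch the HTML and the class name are invented; the searches start from the <div> so that the counts are not inflated by the surrounding html and body elements.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

class TextFinderExample {
    // Returns the <div> of a small sample document invented for this sketch.
    static Element sampleDiv() {
        return Jsoup.parse("<div><p>Hello <b>there</b> now!</p><p>Bye</p></div>")
                .select("div").first();
    }

    public static void main(String[] args) {
        Element div = sampleDiv();
        // "there" is in the combined text of the div, the first p, and the b.
        System.out.println(div.getElementsContainingText("there").size());    // 3
        // Only the <b> holds "there" directly.
        System.out.println(div.getElementsContainingOwnText("there").size()); // 1
        // Only the second <p> has text that is exactly "Bye".
        System.out.println(div.getElementsMatchingText("^Bye$").size());      // 1
        // Only the first <p> has "now" in its own text.
        System.out.println(div.getElementsMatchingOwnText("now").size());     // 1
    }
}
```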

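Sibling Elements:

siblingElements(): gets the sibling elements of this element (the element itself is not included)

firstElementSibling(): gets the first sibling element

lastElementSibling(): gets the last sibling element

nextElementSibling(): gets the next sibling element, or null if there is none

previousElementSibling(): gets the previous sibling element, or null if there is none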
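A small sketch of sibling navigation, starting from the middle item of a three-item list. The list markup and the class name are invented for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

class SiblingExample {
    // Returns the middle <li> of a three-item list invented for this sketch.
    static Element middleItem() {
        return Jsoup.parse("<ul><li>one</li><li>two</li><li>three</li></ul>")
                .select("li").get(1);
    }

    public static void main(String[] args) {
        Element two = middleItem();
        System.out.println(two.siblingElements().size());         // prints 2
        System.out.println(two.firstElementSibling().text());     // prints "one"
        System.out.println(two.lastElementSibling().text());      // prints "three"
        System.out.println(two.previousElementSibling().text());  // prints "one"
        System.out.println(two.nextElementSibling().text());      // prints "three"
    }
}
```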

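Graph:

parent(): gets the parent element

parents(): gets the ancestor elements, closest first, up to the document root

child(int index): gets the child element at the given 0-based index

children(): gets the child elements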
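Moving up and down the tree might look like the sketch below. The markup and names are invented; note that parents() stops below the document node, so for this sample it yields p, div, body, and html.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class TreeNavigationExample {
    // Sample document invented for this sketch.
    static Document sample() {
        return Jsoup.parse("<div id=\"content\"><p>Hello <b>there</b> now!</p></div>");
    }

    public static void main(String[] args) {
        Document doc = sample();
        Element b = doc.select("b").first();
        Element div = doc.getElementById("content");

        System.out.println(b.parent().tagName());     // prints "p"
        System.out.println(b.parents().size());       // prints 4
        System.out.println(div.child(0).tagName());   // prints "p"
        System.out.println(div.children().size());    // prints 1
    }
}
```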

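Element data:

attr(String key): gets an attribute value

attr(String key, String value): sets an attribute value

attributes(): gets all attributes

id(): gets the element's id attribute

className(): gets the value of the element's class attribute, which may contain several class names

addClass(String className): adds a class name to the element's class attribute

text(): gets the combined text of the element and its descendants. For example:

<p>Hello <b>there</b> now!</p>: p.text() returns "Hello there now!"

ownText(): gets the text owned by the element itself. For example:

<p>Hello <b>there</b> now!</p>: p.ownText() returns "Hello now!"

text(String value): sets the text content

html(): gets the element's inner HTML

outerHtml(): gets the element's outer HTML

For example, on a <div> with one empty <p>, html() would return <p></p>, whereas outerHtml() would return <div><p></p></div>

html(String value): sets the element's inner HTML

data(): gets the data content, for example the contents of script and style tags

tag(): gets the Tag of the element, for example <div>

tagName(): gets the tag name, for example div

tagName(String tagName): changes the element's tag name

val(): gets the value of a form element (input, textarea, and so on)

val(String value): sets the value of a form element

wrap(String html): wraps the supplied HTML around this element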
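Several of the element data accessors can be exercised together. The sketch below reuses the paragraph from the text()/ownText() examples inside an invented <div>; the id, class names, and input value are assumptions made for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ElementDataExample {
    // Sample document invented for this sketch.
    static Document sample() {
        return Jsoup.parse("<div id=\"box\" class=\"note big\">"
                + "<p>Hello <b>there</b> now!</p>"
                + "<input type=\"text\" value=\"abc\">"
                + "</div>");
    }

    public static void main(String[] args) {
        Document doc = sample();
        Element div = doc.getElementById("box");
        Element p = doc.select("p").first();
        Element input = doc.select("input").first();

        System.out.println(div.id());         // prints "box"
        System.out.println(div.tagName());    // prints "div"
        System.out.println(div.className());  // prints "note big"
        System.out.println(p.text());         // prints "Hello there now!"
        System.out.println(p.ownText());      // prints "Hello now!"
        System.out.println(input.val());      // prints "abc"

        div.addClass("extra");                // class attribute gains a third name
        input.val("xyz");                     // overwrites the value attribute
        System.out.println(div.className());  // prints "note big extra"
        System.out.println(input.val());      // prints "xyz"
    }
}
```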

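Manipulating HTML and text:

append(String html): parses the HTML and adds it as the last children of this element

prepend(String html): parses the HTML and adds it as the first children of this element

appendText(String text): adds a text node as the last child

prependText(String text): adds a text node as the first child

appendElement(String tagName): creates an element with the given tag name and adds it as the last child

prependElement(String tagName): creates an element with the given tag name and adds it as the first child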
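The order in which children end up after a few append and prepend calls can be checked with a sketch like this one. The starting markup, tag names, and text are invented for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

class AppendPrependExample {
    // Starts from <div><p>middle</p></div> and grows it on both ends.
    static Element build() {
        Element div = Jsoup.parse("<div id=\"box\"><p>middle</p></div>")
                .getElementById("box");
        div.prepend("<p>first</p>");             // children: first, middle
        div.append("<p>last</p>");               // children: first, middle, last
        div.prependElement("h1").text("head");   // h1 becomes the first child
        div.appendElement("span").text("tail");  // span becomes the last child
        return div;
    }

    public static void main(String[] args) {
        Element div = build();
        System.out.println(div.children().size());     // prints 5
        System.out.println(div.child(0).tagName());    // prints "h1"
        System.out.println(div.child(1).text());       // prints "first"
        System.out.println(div.child(4).tagName());    // prints "span"
    }
}
```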

Selector method (important):

select(String cssQuery): finds the elements that match the CSS query

Elements:

Implements Cloneable, Iterable<Element>, Collection<Element>, and List<Element>
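Because Elements is a List<Element>, the result of select() can be used in a for-each loop or indexed like any other list. A minimal sketch, with invented markup and helper names:

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class SelectIterationExample {
    // Collects the href of every link inside div.comments.
    static List<String> commentLinks(String html) {
        Document doc = Jsoup.parse(html);
        Elements links = doc.select("div.comments a");
        List<String> hrefs = new ArrayList<>();
        for (Element link : links) {   // Elements is Iterable<Element>
            hrefs.add(link.attr("href"));
        }
        return hrefs;
    }

    public static void main(String[] args) {
        String html = "<div class=\"comments\"><a href=\"/1\">one</a><a href=\"/2\">two</a></div>"
                + "<a href=\"/3\">three</a>";
        System.out.println(commentLinks(html)); // prints [/1, /2]
    }
}
```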

About Attribute:

2. Finding elements with selectors

Selector syntax:

Basic usage:

Combined usage:

Expressions:
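As a rough illustration of the three groups, the sketch below runs a few commonly used jsoup selectors against invented markup: basic selectors (tag, #id, .class, [attr]), combined selectors (descendant and direct child), and expression-style selectors (attribute prefix/suffix, :contains, :lt). It is a sample of the syntax, not a complete list.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class SelectorSyntaxExample {
    // Sample document invented for this sketch.
    static final Document DOC = Jsoup.parse(
            "<div id=\"main\" class=\"post\">"
            + "<p class=\"intro\">Intro text</p>"
            + "<p>Body <a href=\"http://example.com/pic.png\">pic</a></p>"
            + "<a href=\"/about\">about</a>"
            + "</div>");

    // Returns how many elements in the sample match the query.
    static int count(String cssQuery) {
        return DOC.select(cssQuery).size();
    }

    public static void main(String[] args) {
        // Basic usage
        System.out.println(count("p"));               // 2: by tag
        System.out.println(count("#main"));           // 1: by id
        System.out.println(count(".intro"));          // 1: by class
        System.out.println(count("[href]"));          // 2: by attribute
        // Combined usage
        System.out.println(count("div p a"));         // 1: descendant
        System.out.println(count("div.post > a"));    // 1: direct child
        // Expressions
        System.out.println(count("a[href^=http]"));   // 1: value prefix
        System.out.println(count("a[href$=.png]"));   // 1: value suffix
        System.out.println(count("p:contains(Intro)")); // 1: by text
        System.out.println(count("p:lt(1)"));         // 1: sibling index below 1
    }
}
```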

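III. Modifying Data

While parsing a document you may also need to modify some of its elements: for example, making every image a clickable link, changing link addresses, or changing text. The idea is simple: use jsoup's selectors to find the elements, then change them with the methods described above. Element attributes and text can all be modified, and the tag name can be changed with tagName(String).

Example: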

doc.select("div.comments a").attr("rel", "nofollow"); // add rel=nofollow to every link in the comments

doc.select("div.comments a").addClass("mylinkclass"); // add class=mylinkclass to every link in the comments

doc.select("img").removeAttr("onclick"); // remove the onclick attribute from every image

doc.select("input[type=text]").val(""); // clear the text of every text input
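To check what these calls do, they can be run against a small document and the result inspected. The markup below is invented for this sketch, as are the class and helper names.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class ModifyExample {
    // Applies the four modifications to a sample document and returns it.
    static Document cleaned() {
        Document doc = Jsoup.parse("<div class=\"comments\"><a href=\"/u/1\">user</a></div>"
                + "<img src=\"a.png\" onclick=\"zoom()\">"
                + "<input type=\"text\" value=\"draft\">");
        doc.select("div.comments a").attr("rel", "nofollow");
        doc.select("div.comments a").addClass("mylinkclass");
        doc.select("img").removeAttr("onclick");
        doc.select("input[type=text]").val("");
        return doc;
    }

    public static void main(String[] args) {
        Document doc = cleaned();
        System.out.println(doc.select("div.comments a").first().attr("rel"));            // prints "nofollow"
        System.out.println(doc.select("div.comments a").first().hasClass("mylinkclass")); // prints true
        System.out.println(doc.select("img").first().hasAttr("onclick"));                // prints false
        System.out.println(doc.select("input").first().val().isEmpty());                 // prints true
    }
}
```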

0 0