jsoup: an HTML parser for Java
Source: Internet | Editor: 程序博客网 | Date: 2024/05/16 18:52
==================================Official site====================================
URL: http://jsoup.org/
The site hosts the documentation and the download links.
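===================================Introduction====================================
jsoup is an HTML parser for Java that can parse a URL or a string of HTML directly. It provides a very convenient API for extracting and manipulating data using DOM methods, CSS selectors, and jQuery-like operations.
Main features:
1、Parse HTML from a URL, a file, or a string
2、Find and extract data using DOM traversal or CSS selectors
3、Manipulate HTML elements, attributes, and text
jsoup is released under the MIT license, so it can be used in commercial projects without concern.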
================================Maven dependency====================================
    <dependency>
      <!-- jsoup HTML parser library @ http://jsoup.org/ -->
      <groupId>org.jsoup</groupId>
      <artifactId>jsoup</artifactId>
      <version>1.8.3</version>
    </dependency>
==================================Input====================================
1、Parse a document from a String
    String html = "<html><head><title>First parse</title></head>"
            + "<body><p>Parsed HTML into a doc.</p></body></html>";
    Document doc = Jsoup.parse(html);
2、Parsing a body fragment
    String html = "<div><p>Lorem ipsum.</p>";
    Document doc = Jsoup.parseBodyFragment(html);
    Element body = doc.body();
3、Load a Document from a URL
GET:
    Document doc = Jsoup.connect("http://example.com/").get();
    String title = doc.title();
POST:
    Document doc = Jsoup.connect("http://example.com")
            .data("query", "Java")
            .userAgent("Mozilla")
            .cookie("auth", "token")
            .timeout(3000)
            .post();
4、Load a Document from a File
    File input = new File("/tmp/input.html");
    Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
==================================Extracting data================================
1、Use DOM methods to navigate a document
    File input = new File("/tmp/input.html");
    Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");

    Element content = doc.getElementById("content");
    Elements links = content.getElementsByTag("a");
    for (Element link : links) {
        String linkHref = link.attr("href");
        String linkText = link.text();
    }
2、Use selector-syntax to find elements
    File input = new File("/tmp/input.html");
    Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");

    Elements links = doc.select("a[href]");                  // a with href
    Elements pngs = doc.select("img[src$=.png]");            // img with src ending .png
    Element masthead = doc.select("div.masthead").first();   // div with class=masthead
    Elements resultLinks = doc.select("h3.r > a");           // direct a after h3
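The selector example above reads from /tmp/input.html, which may not exist on your machine. As a minimal sketch, the same selectors can be tried against an inline string instead. The sample markup and the class name SelectorSketch are made up for illustration and are not part of the jsoup cookbook.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class SelectorSketch {
    // Hypothetical sample markup, chosen so each selector has a known result.
    static final String HTML =
            "<div class='masthead'><img src='logo.png'><img src='photo.jpg'></div>"
            + "<h3 class='r'><a href='/first'>First</a></h3>"
            + "<h3 class='r'><span><a href='/nested'>Nested</a></span></h3>"
            + "<a name='anchor'>No href</a>";

    static Document parse() {
        return Jsoup.parse(HTML);
    }

    public static void main(String[] args) {
        Document doc = parse();

        Elements links = doc.select("a[href]");                // the two anchors that carry an href
        Elements pngs = doc.select("img[src$=.png]");          // only logo.png ends in .png
        Element masthead = doc.select("div.masthead").first(); // the div with class=masthead
        Elements resultLinks = doc.select("h3.r > a");         // only the anchor that is a direct child of h3.r

        System.out.println("links: " + links.size());
        System.out.println("pngs: " + pngs.size());
        System.out.println("masthead images: " + masthead.select("img").size());
        System.out.println("direct result links: " + resultLinks.size());
    }
}
```

The nested anchor sits inside a span, so the child combinator in "h3.r > a" skips it, while the plain "a[href]" selector still finds it.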
3、Extract attributes, text, and HTML from elements
    String html = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
    Document doc = Jsoup.parse(html);
    Element link = doc.select("a").first();

    String text = doc.body().text();       // "An example link."
    String linkHref = link.attr("href");   // "http://example.com/"
    String linkText = link.text();         // "example"

    String linkOuterH = link.outerHtml();  // "<a href="http://example.com/"><b>example</b></a>"
    String linkInnerH = link.html();       // "<b>example</b>"
4、Working with URLs
    Document doc = Jsoup.connect("http://jsoup.org").get();

    Element link = doc.select("a").first();
    String relHref = link.attr("href");      // == "/"
    String absHref = link.attr("abs:href");  // "http://jsoup.org/"
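The URL example above needs network access. A rough offline sketch of the same idea follows: pass a base URI to Jsoup.parse(String, String) and compare the raw attribute, the "abs:" prefix, and the absUrl() method. The markup, the base URI, and the class name AbsUrlSketch are assumptions chosen for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class AbsUrlSketch {
    // Hypothetical markup containing a single root-relative link.
    static final String HTML = "<a href='/docs/intro.html'>Intro</a>";

    // Returns { raw href, attr("abs:href"), absUrl("href") } for the first link.
    static String[] resolve(String baseUri) {
        Document doc = Jsoup.parse(HTML, baseUri);
        Element link = doc.select("a").first();
        return new String[] {
            link.attr("href"),
            link.attr("abs:href"),
            link.absUrl("href")
        };
    }

    public static void main(String[] args) {
        // With a base URI, the relative href resolves against it.
        for (String s : resolve("http://example.com/base/")) {
            System.out.println(s);
        }
        // With an empty base URI there is nothing to resolve against,
        // so the absolute forms come back as empty strings.
        for (String s : resolve("")) {
            System.out.println("[" + s + "]");
        }
    }
}
```

When parsing from a string or a file, supplying the base URI is what makes "abs:href" usable. When fetching with Jsoup.connect(...), the document location serves as the base.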
5、Example program: list links
package org.jsoup.examples;

import org.jsoup.Jsoup;
import org.jsoup.helper.Validate;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;

/**
 * Example program to list links from a URL.
 */
public class ListLinks {
    public static void main(String[] args) throws IOException {
        Validate.isTrue(args.length == 1, "usage: supply url to fetch");
        String url = args[0];
        print("Fetching %s...", url);

        Document doc = Jsoup.connect(url).get();
        Elements links = doc.select("a[href]");
        Elements media = doc.select("[src]");
        Elements imports = doc.select("link[href]");

        print("\nMedia: (%d)", media.size());
        for (Element src : media) {
            if (src.tagName().equals("img"))
                print(" * %s: <%s> %sx%s (%s)",
                        src.tagName(), src.attr("abs:src"), src.attr("width"),
                        src.attr("height"), trim(src.attr("alt"), 20));
            else
                print(" * %s: <%s>", src.tagName(), src.attr("abs:src"));
        }

        print("\nImports: (%d)", imports.size());
        for (Element link : imports) {
            print(" * %s <%s> (%s)", link.tagName(), link.attr("abs:href"), link.attr("rel"));
        }

        print("\nLinks: (%d)", links.size());
        for (Element link : links) {
            print(" * a: <%s> (%s)", link.attr("abs:href"), trim(link.text(), 35));
        }
    }

    private static void print(String msg, Object... args) {
        System.out.println(String.format(msg, args));
    }

    // Shorten s to at most width characters, ending with "." when truncated.
    private static String trim(String s, int width) {
        if (s.length() > width)
            return s.substring(0, width - 1) + ".";
        else
            return s;
    }
}
================================Modifying data====================================
1、Set attribute values
doc.select("div.comments a").attr("rel", "nofollow");
2、Set the HTML of an element
    Element div = doc.select("div").first();  // <div></div>
    div.html("<p>lorem ipsum</p>");           // <div><p>lorem ipsum</p></div>
    div.prepend("<p>First</p>");
    div.append("<p>Last</p>");
    // now: <div><p>First</p><p>lorem ipsum</p><p>Last</p></div>

    Element span = doc.select("span").first(); // <span>One</span>
    span.wrap("<li><a href='http://example.com/'></a></li>");
    // now: <li><a href="http://example.com/"><span>One</span></a></li>
3、Setting the text content of elements
    Element div = doc.select("div").first();  // <div></div>
    div.text("five > four");                  // <div>five &gt; four</div>
    div.prepend("First ");
    div.append(" Last");
    // now: <div>First five &gt; four Last</div>
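The difference between html(...) and text(...) is easy to miss: html(...) parses its argument as markup, while text(...) stores it as literal text. The sketch below illustrates this by passing the same string to both. The class name TextVsHtmlSketch and the sample string are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class TextVsHtmlSketch {
    // Hypothetical input that looks like markup.
    static final String INPUT = "<b>bold</b>";

    // Builds an empty div and fills it either via html(...) or via text(...).
    static Element fill(boolean asHtml) {
        Document doc = Jsoup.parseBodyFragment("<div></div>");
        Element div = doc.select("div").first();
        if (asHtml) {
            div.html(INPUT);  // parsed: the div gains a <b> child element
        } else {
            div.text(INPUT);  // literal: the div holds a single text node
        }
        return div;
    }

    public static void main(String[] args) {
        Element parsed = fill(true);
        Element literal = fill(false);

        System.out.println(parsed.children().size());   // child elements after html(...)
        System.out.println(parsed.text());
        System.out.println(literal.children().size());  // child elements after text(...)
        System.out.println(literal.text());
    }
}
```

Because text(...) never creates elements, it is the safer choice when the content comes from an untrusted source and should be shown as-is.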