A First Look at jsoup

Source: Internet  Editor: 程序博客网  Time: 2024/04/30 16:21
 

jsoup is a Java library for working with HTML that I came across by chance. Its official site is http://jsoup.org/

1. Write a small trial program (the project only needs to reference jsoup-1.3.3.jar):

import java.io.File;
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class Test {

    public static void main(String[] args) {
        Test t = new Test();
        // Swap in parseString() or parseUrl() to try the other examples.
        t.parseFile();
    }

    // Parse an HTML string held in memory.
    public void parseString() {
        String html = "<html><head><title>blog</title></head><body onload='test()'><p>Parsed HTML into a doc.</p></body></html>";
        Document doc = Jsoup.parse(html);
        System.out.println(doc);
        Elements es = doc.body().getAllElements();
        System.out.println(es.attr("onload"));
        System.out.println(es.select("p"));
    }

    // Fetch a page over HTTP and select its links.
    public void parseUrl() {
        try {
            Document doc = Jsoup.connect("http://www.baidu.com/").get();
            Elements hrefs = doc.select("a[href]");
            System.out.println(hrefs);
            System.out.println("------------------");
            System.out.println(hrefs.select("[href^=http]"));
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    // Parse a local HTML file.
    public void parseFile() {
        try {
            File input = new File("input.html");
            Document doc = Jsoup.parse(input, "UTF-8");
            // Extract all of the codes
            Elements codes = doc.body().select("td[title^=IA] > a[href^=javascript:view]");
            System.out.println(codes);
            System.out.println("------------------");
            System.out.println(codes.html());
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}


 

2. Output of parseString:

<html>
 <head>
  <title>blog</title>
 </head>
 <body onload="test()">
  <p>Parsed HTML into a doc.</p>
 </body>
</html>
test()
<p>Parsed HTML into a doc.</p>


 

3. Output of parseUrl:

<a href="/gaoji/preferences.html">设置</a>
<a href="http://passport.baidu.com/?login&tpl=mn">登录</a>
<a href="http://news.baidu.com">新 闻</a>
<a href="http://tieba.baidu.com">贴 吧</a>
<a href="http://zhidao.baidu.com">知 道</a>
<a href="http://mp3.baidu.com">MP3</a>
<a href="http://image.baidu.com">图 片</a>
<a href="http://video.baidu.com">视 频</a>
<a href="http://map.baidu.com">地 图</a>
<a href="#" name="ime_hw">手写</a>
<a href="#" name="ime_py">拼音</a>
<a href="#" name="ime_cl">关闭</a>
<a href="http://hi.baidu.com">空间</a>
<a href="http://baike.baidu.com">百科</a>
<a href="http://www.hao123.com">hao123</a>
<a href="/more/">更多>></a>
<a id="st" onclick="this.style.behavior='url(#default#homepage)';this.setHomePage('http://www.baidu.com')" href="http://utility.baidu.com/traf/click.php?id=215&url=http://www.baidu.com">把百度设为主页</a>
<a href="http://e.baidu.com/?refer=888">加入百度推广</a>
<a href="http://top.baidu.com">搜索风云榜</a>
<a href="http://home.baidu.com">关于百度</a>
<a href="http://ir.baidu.com">About Baidu</a>
<a href="/duty/">使用百度前必读</a>
<a href="http://www.miibeian.gov.cn" target="_blank">京ICP证030173号</a>
------------------
<a href="http://passport.baidu.com/?login&tpl=mn">登录</a>
<a href="http://news.baidu.com">新 闻</a>
<a href="http://tieba.baidu.com">贴 吧</a>
<a href="http://zhidao.baidu.com">知 道</a>
<a href="http://mp3.baidu.com">MP3</a>
<a href="http://image.baidu.com">图 片</a>
<a href="http://video.baidu.com">视 频</a>
<a href="http://map.baidu.com">地 图</a>
<a href="http://hi.baidu.com">空间</a>
<a href="http://baike.baidu.com">百科</a>
<a href="http://www.hao123.com">hao123</a>
<a id="st" onclick="this.style.behavior='url(#default#homepage)';this.setHomePage('http://www.baidu.com')" href="http://utility.baidu.com/traf/click.php?id=215&url=http://www.baidu.com">把百度设为主页</a>
<a href="http://e.baidu.com/?refer=888">加入百度推广</a>
<a href="http://top.baidu.com">搜索风云榜</a>
<a href="http://home.baidu.com">关于百度</a>
<a href="http://ir.baidu.com">About Baidu</a>
<a href="http://www.miibeian.gov.cn" target="_blank">京ICP证030173号</a>
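
The second list is shorter than the first because the selector [href^=http] keeps only the links whose href value begins with "http", so relative links such as /more/ and the "#" anchors drop out. As a rough illustration in plain Java (this is not how jsoup implements it; HrefPrefixFilter and keepWithPrefix are hypothetical names made up for this sketch, and details such as case handling are ignored), the filter behaves much like a startsWith check on each attribute value:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class HrefPrefixFilter {

    // Hypothetical helper: keep only the values that begin with the given prefix,
    // preserving their original order.
    static List<String> keepWithPrefix(List<String> hrefs, String prefix) {
        List<String> kept = new ArrayList<>();
        for (String href : hrefs) {
            if (href.startsWith(prefix)) {
                kept.add(href);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // A few href values taken from the parseUrl output.
        List<String> hrefs = Arrays.asList(
                "/gaoji/preferences.html",
                "http://news.baidu.com",
                "#",
                "/more/",
                "http://www.hao123.com");
        System.out.println(keepWithPrefix(hrefs, "http"));
    }
}
```

The same prefix operator appears in the parseFile selector, where td[title^=IA] and a[href^=javascript:view] each match on the start of an attribute value.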


 

4. Output of parseFile:

<a href="javascript:view('67530','67530','0');">IA100908-002</a>
<a href="javascript:view('67529','67529','0');">IA100908-001</a>
<a href="javascript:view('67544','67544','0');">IA100908-016</a>
<a href="javascript:view('67364','67364','0');">IA100903-008</a>
<a href="javascript:view('67363','67363','0');">IA100903-007</a>
<a href="javascript:view('66104','66104','0');">IA100710-013</a>
<a href="javascript:view('57916','57916','0');">IA100515-013</a>
<a href="javascript:view('56962','56962','0');">IA100430-022</a>
<a href="javascript:view('66958','66958','0');">IA100830-001</a>
<a href="javascript:view('66319','66319','0');">IA100713-003</a>
<a href="javascript:view('66317','66317','0');">IA100713-001</a>
<a href="javascript:view('66321','66321','0');">IA100713-005</a>
<a href="javascript:view('66967','66967','0');">IA100830-010</a>
<a href="javascript:view('66999','66999','0');">IA100831-001</a>
<a href="javascript:view('67377','67377','0');">IA100904-004</a>
<a href="javascript:view('67378','67378','0');">IA100904-005</a>
<a href="javascript:view('3271','3271','0');">IA080115-031</a>
------------------
IA100908-002
IA100908-001
IA100908-016
IA100903-008
IA100903-007
IA100710-013
IA100515-013
IA100430-022
IA100830-001
IA100713-003
IA100713-001
IA100713-005
IA100830-010
IA100831-001
IA100904-004
IA100904-005
IA080115-031


One addition: the basic structure of input.html is shown in the figure:
