Parsing HTML page tags with jsoup to extract data (web page parsing in Java)

Source: Internet  Editor: 程序博客网  Time: 2024/06/05 02:01

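Today I needed to pull some data from another website, which made me think of a scraping framework.

There are many HTML parsing frameworks. After comparing the descriptions of several, jsoup seemed the easiest to use, and in practice it has also been quite stable.

Get the jsoup jar from the official site:

http://jsoup.org/

It is a single file. Now for the steps.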

1. Get a connection to the site

You can set request parameters, headers, cookies, a timeout, and so on.

Jsoup.connect(url).userAgent("Mozilla/5.0 (Windows NT 6.1; rv:5.0)").timeout(3*1000).get();

The get() method returns a jsoup Document object.
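The options mentioned in this step can be chained on the connection before get() is called. Below is a minimal sketch, assuming a placeholder URL plus a made-up header value and cookie (none of these come from the original post). It only builds the request and does not touch the network.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;

class ConnectionOptionsSketch {
    // Builds a configured connection; nothing is sent until get() or execute() is called.
    static Connection configure(String url) {
        return Jsoup.connect(url)
                .userAgent("Mozilla/5.0 (Windows NT 6.1; rv:5.0)")
                .header("Accept-Language", "zh-CN") // hypothetical header value
                .cookie("session", "abc123")         // hypothetical cookie
                .timeout(3 * 1000);                  // milliseconds
    }

    public static void main(String[] args) {
        Connection conn = configure("http://example.com/"); // placeholder URL
        System.out.println("timeout (ms): " + conn.request().timeout());
        System.out.println("cookie: " + conn.request().cookie("session"));
        // conn.get() would fetch the page and return a Document.
    }
}
```

Keeping the configuration in one helper makes it easy to reuse the same user agent and timeout for every page.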

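2. Use selectors to pick out the useful tags (content)

This is where jsoup shines. You can use jQuery-like selectors. For example, to get the div tags with class="content" (the result is a collection, of course), you can write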

Elements els= doc.select("div.content");
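To try the selector without hitting a real site, the same call can be run against an inline HTML string. This is a small sketch: the HTML snippet and the class name SelectorSketch are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

class SelectorSketch {
    // Counts <div> elements whose class list contains "content".
    static int countContentDivs(String html) {
        Document doc = Jsoup.parse(html);
        Elements els = doc.select("div.content");
        return els.size();
    }

    public static void main(String[] args) {
        // Made-up markup: two matching divs, one non-matching div, one non-div.
        String html = "<div class=\"content\">first</div>"
                + "<div class=\"sidebar\">skip</div>"
                + "<div class=\"content big\">second</div>"
                + "<p class=\"content\">not a div</p>";
        System.out.println(countContentDivs(html)); // prints 2
    }
}
```

The selector matches on tag name and class together, so the p element and the sidebar div are both left out.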

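3. Get the content or the HTML of an element

The difference between text and html is whether the HTML tags are included.

When extracting longer text, tags such as <br/> are useful: keep them by reading the HTML, then replace them with line breaks.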

    for (Element el : els) {
        if (el.select("a").size() > 1) {
            // Filter out elements you don't want.
            continue;
        }
        el.text(); // text content, with HTML tags such as <br/> stripped
        el.html(); // content including HTML tags
    }
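The idea of keeping the br tags and turning them into line breaks can be sketched with plain string handling. The input string here stands in for whatever el.html() returns. The helper name and the regex-based tag stripping are assumptions for illustration: the approach is crude and does not decode entities.

```java
class BrToNewlineSketch {
    // Replaces <br>, <br/> and <br /> (any case) with "\n", then drops any remaining tags.
    static String htmlToPlainLines(String html) {
        String withBreaks = html.replaceAll("(?i)<br\\s*/?>", "\n");
        return withBreaks.replaceAll("<[^>]+>", "").trim();
    }

    public static void main(String[] args) {
        String html = "line one<br/>line two<BR>line <b>three</b>"; // made-up fragment
        System.out.println(htmlToPlainLines(html));
    }
}
```

Use "\r\n" instead of "\n" in the replacement if the output file should use Windows line endings, as the program in this post does.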

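Simple, right? Let's go.

Let's write a small starter program.

<<Fetch the latest posts from 35 pages of 糗事百科 (qiushibaike.com).>>

Save them to the qiushibaike folder on drive D.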

    package com.test.jsoup;

    import java.io.BufferedOutputStream;
    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    public class Geturlcontent {

        // Output directory.
        static String txtpathstr = "d:\\qiushibaike\\";

        public static void main(String[] args) throws Exception {
            StringBuilder contents = new StringBuilder();
            String urlbase = "http://www.qiushibaike.com/8hr/page/"; // e.g. .../page/1?s=4513032
            // Pages 1 through 35.
            for (int i = 1; i <= 35; i++) {
                String url = urlbase + i + "?s=4513032";
                try {
                    contents.append(gettxtlist(url)).append("\r\n");
                } catch (Exception e) {
                    e.printStackTrace();
                    System.out.println("Page " + i + " failed, moving on to the next one.");
                }
            }
            // Write everything to the file.
            writefile(contents.toString());
        }

        // Fetches one page and returns the text of its posts, one per line.
        public static String gettxtlist(String txturl) throws Exception {
            System.out.println("url:" + txturl);
            StringBuilder content = new StringBuilder();
            Document doc = jsoupconnect(txturl, 360000);
            if (doc == null) {
                throw new IOException("could not fetch " + txturl);
            }
            Elements els = doc.select("div.content");
            System.out.println("Posts on this page > " + els.size());
            for (Element el : els) {
                // Skip blocks that contain more than one link.
                if (el.select("a").size() > 1) {
                    continue;
                }
                content.append(el.text()).append("\r\n");
                System.out.println();
                System.out.println(el.text());
            }
            return content.toString();
        }

        // Fetches a Document, retrying up to 5 times; returns null if every attempt fails.
        public static Document jsoupconnect(String url, int timeout) {
            Document doc = null;
            int retry = 5;
            while (null == doc && retry > 0) {
                retry--;
                try {
                    doc = Jsoup.connect(url)
                            .userAgent("Mozilla/5.0 (Windows NT 6.1; rv:5.0)")
                            .timeout(timeout)
                            .get();
                } catch (Exception e) {
                    e.printStackTrace();
                    System.out.println("connect failed, " + retry + " retries left");
                }
            }
            return doc;
        }

        // Writes the text to test.txt in the output directory, creating the directory if needed.
        public static void writefile(String txtstr) throws Exception {
            File txtpath = new File(txtpathstr);
            if (!txtpath.exists()) {
                txtpath.mkdirs();
            }
            File htxt = new File(txtpathstr + "test.txt");
            try (BufferedOutputStream outBuff = new BufferedOutputStream(new FileOutputStream(htxt))) {
                outBuff.write(txtstr.getBytes(StandardCharsets.UTF_8));
            }
        }
    }
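The file-writing step can also be written with java.nio.file and an explicit charset, which avoids depending on the platform's default encoding when the text is Chinese. This is only a sketch: the class and method names are made up, and a temporary directory stands in for d:\qiushibaike\ so it runs on any machine.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class WriteTextSketch {
    // Creates the directory if needed, writes the text as UTF-8, and returns the file path.
    static Path writeUtf8(Path dir, String fileName, String text) throws IOException {
        Files.createDirectories(dir);
        Path target = dir.resolve(fileName);
        Files.write(target, text.getBytes(StandardCharsets.UTF_8));
        return target;
    }

    public static void main(String[] args) throws IOException {
        // Temporary directory used here instead of the D: drive folder from the post.
        Path dir = Files.createTempDirectory("qiushibaike");
        Path file = writeUtf8(dir, "test.txt", "第一条\r\n第二条\r\n");
        System.out.println("wrote " + Files.size(file) + " bytes to " + file);
    }
}
```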

That's the hello world program. You should have the hang of it now. Bye!

Run result:

And of course it is saved on drive D as well >>>
