Jsoup: a Java toolkit for parsing web pages
Source: Internet  Editor: 程序博客网  Date: 2024/05/16 21:18
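Jsoup is a Java library for parsing web pages. It offers DOM-style traversal and CSS-selector-like queries for finding and extracting content from a document.
Related resources:
Download: http://jsoup.org/download
Chinese documentation: http://www.open-open.com/jsoup/
A fairly good API reference: http://www.ostools.net/apidocs/apidoc?api=jsoup-1.6.3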
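Today I worked on a project that parses a website with Jsoup. Connecting to the site with Jsoup.connect(url).get() occasionally failed with
java.net.SocketTimeoutException: Read timed out.
The cause is that Jsoup applies a fairly short default timeout (3 seconds in the 1.6.x releases linked above; later releases raised the default to 30 seconds), while some sites respond slowly,
so the read can time out before the response has fully arrived.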
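To see where "Read timed out" comes from independently of Jsoup, here is a minimal sketch using only the JDK. It assumes a stand-in "slow site": a local ServerSocket that accepts the TCP connection but never sends a byte. The class name ReadTimeoutDemo and the method readTimesOut are made up for this illustration and are not part of Jsoup.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class ReadTimeoutDemo {

    /**
     * Connects to a local server that never writes anything, then attempts a
     * read with the given timeout. Returns true if the read failed with
     * SocketTimeoutException, false if it returned normally.
     */
    static boolean readTimesOut(int readTimeoutMillis) throws IOException {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        // Port 0 lets the OS pick a free port; the server never calls accept(),
        // but the OS still completes the TCP handshake via the listen backlog.
        try (ServerSocket silentServer = new ServerSocket(0, 1, loopback);
             Socket client = new Socket()) {
            client.connect(new InetSocketAddress(loopback, silentServer.getLocalPort()), 1000);
            client.setSoTimeout(readTimeoutMillis); // read timeout in milliseconds
            InputStream in = client.getInputStream();
            try {
                in.read(); // blocks, because nobody ever sends data
                return false;
            } catch (SocketTimeoutException e) {
                return true; // same exception type that Jsoup surfaced above
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("timed out: " + readTimesOut(300));
    }
}
```

The connection itself succeeds here; only the read gives up. That matches the situation in this post, where the site is reachable but slow to send its response.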
Solution:
Set an explicit timeout when connecting.
doc = Jsoup.connect(url).timeout(5000).get();
The argument is in milliseconds, so 5000 sets the timeout to 5 seconds.
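A fixed 5-second timeout may still be too short on a bad day. One possible extension, which is not in the original test code, is to retry with a longer timeout after each SocketTimeoutException. The sketch below uses only the JDK. The names TimeoutRetry, TimedFetch and fetchWithRetry are hypothetical helpers invented for this example, and the "slow site" in main is simulated with a lambda.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;

public class TimeoutRetry {

    /** One fetch attempt that is given a timeout in milliseconds (hypothetical interface). */
    @FunctionalInterface
    interface TimedFetch<T> {
        T fetch(int timeoutMillis) throws IOException;
    }

    /**
     * Runs the fetch up to maxAttempts times, doubling the timeout after each
     * SocketTimeoutException. Any other IOException propagates immediately.
     * If every attempt times out, the last SocketTimeoutException is rethrown.
     */
    static <T> T fetchWithRetry(TimedFetch<T> fetch, int initialTimeoutMillis, int maxAttempts)
            throws IOException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        int timeout = initialTimeoutMillis;
        SocketTimeoutException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return fetch.fetch(timeout);
            } catch (SocketTimeoutException e) {
                last = e;
                timeout *= 2; // give the next attempt twice as long
            }
        }
        throw last;
    }

    public static void main(String[] args) throws IOException {
        // Simulated slow site: it only "responds" when allowed at least 8000 ms.
        String result = fetchWithRetry(t -> {
            if (t < 8000) {
                throw new SocketTimeoutException("Read timed out");
            }
            return "ok after timeout=" + t;
        }, 5000, 3);
        System.out.println(result); // prints "ok after timeout=10000"
    }
}
```

With Jsoup on the classpath, the call would presumably look like fetchWithRetry(t -> Jsoup.connect(url).timeout(t).get(), 5000, 3), since Connection.timeout takes milliseconds and get() throws IOException.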
The test code is as follows.
1. Without a timeout:
package jsoupTest;

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class JsoupTest {
    public static void main(String[] args) throws IOException {
        String url = "http://www.weather.com.cn/weather/101010400.shtml";
        long start = System.currentTimeMillis();
        Document doc = null;
        try {
            // No timeout() call, so Jsoup's default timeout applies.
            doc = Jsoup.connect(url).get();
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            System.out.println("Time is:" + (System.currentTimeMillis() - start) + "ms");
        }
        // If the request failed, doc is still null and this line throws NullPointerException.
        Elements elem = doc.getElementsByTag("Title");
        System.out.println("Title is:" + elem.text());
    }
}
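Occasionally the request times out. The exception is caught and printed, doc stays null, and the later doc.getElementsByTag call then fails with a NullPointerException: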
java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(Unknown Source)
at java.net.SocketInputStream.read(Unknown Source)
at java.io.BufferedInputStream.fill(Unknown Source)
at java.io.BufferedInputStream.read1(Unknown Source)
at java.io.BufferedInputStream.read(Unknown Source)
at sun.net.www.http.ChunkedInputStream.fastRead(Unknown Source)
at sun.net.www.http.ChunkedInputStream.read(Unknown Source)
at java.io.FilterInputStream.read(Unknown Source)
at sun.net.www.protocol.http.HttpURLConnection$HttpInputStream.read(Unknown Source)
at java.util.zip.InflaterInputStream.fill(Unknown Source)
at java.util.zip.InflaterInputStream.read(Unknown Source)
at java.util.zip.GZIPInputStream.read(Unknown Source)
at java.io.BufferedInputStream.read1(Unknown Source)
at java.io.BufferedInputStream.read(Unknown Source)
at java.io.FilterInputStream.read(Unknown Source)
at org.jsoup.helper.DataUtil.readToByteBuffer(DataUtil.java:113)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:447)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:393)
at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:159)
at org.jsoup.helper.HttpConnection.get(HttpConnection.java:148)
at jsoupTest.JsoupTest.main(JsoupTest.java:17)
Time is:3885ms
Exception in thread "main" java.lang.NullPointerException
at jsoupTest.JsoupTest.main(JsoupTest.java:25)
2. With a timeout set, the request generally no longer times out:
package jsoupTest;

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class JsoupTest {
    public static void main(String[] args) throws IOException {
        String url = "http://www.weather.com.cn/weather/101010400.shtml";
        long start = System.currentTimeMillis();
        Document doc = null;
        try {
            // Allow up to 5000 ms before giving up.
            doc = Jsoup.connect(url).timeout(5000).get();
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            System.out.println("Time is:" + (System.currentTimeMillis() - start) + "ms");
        }
        if (doc == null) {
            return; // the request failed; the stack trace has already been printed
        }
        Elements elem = doc.getElementsByTag("Title");
        System.out.println("Title is:" + elem.text());
    }
}
Output:
Time is:4158ms
Title is:顺义天气预报-今日_明日_一周天气预报:16日星期五 多云转晴 11/-4℃