jsoup Crawler: Simple Usage Notes

Source: Internet | Editor: 程序博客网 | Time: 2024/06/05 11:03

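It has been a while since my last blog post. I recently ran into a requirement at work: given a string such as "这是我分享的一个链接http://www.cnblogs.com/TTyb/p/5996847.html" ("Here is a link I shared" followed by a URL), extract the link, then pull the image, the title, and the first 30 characters of the body text from the linked page. With the requirement clear, the next step was choosing the tools.
This time I used jsoup, which I had not worked with before, to parse the HTML. The rest is straightforward: a single regular expression extracts the link from the string, a GET request on that link returns the HTML, and jsoup parses the result to cover everything the requirement asks for. Here is a brief record of the code: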
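
The link-extraction step can also be sketched on its own. The class name LinkExtractor, the method extractLinks, and the simplified pattern below are assumptions made for illustration, not the author's code. The pattern only accepts ASCII URL characters, so any Chinese text around the link ends the match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkExtractor {

    // Hypothetical, simplified pattern: scheme, "://", then a run of ASCII URL characters.
    // \w is ASCII-only in Java by default, so Chinese characters terminate the match.
    private static final Pattern URL_PATTERN =
            Pattern.compile("(?:https?|ftp)://[\\w\\-.~:/?#\\[\\]@!$&'()*+,;=%]+");

    /** Returns every link found in the text, in order of appearance. */
    public static List<String> extractLinks(String text) {
        List<String> links = new ArrayList<>();
        Matcher matcher = URL_PATTERN.matcher(text);
        while (matcher.find()) {
            // group() with no argument is the whole match
            links.add(matcher.group());
        }
        return links;
    }

    public static void main(String[] args) {
        String text = "Here is a link I shared: http://www.cnblogs.com/TTyb/p/5996847.html";
        System.out.println(extractLinks(text));
    }
}
```

Because the character class includes ASCII punctuation, a period or comma typed directly after a link would be captured as part of it; trimming such trailing characters is left out of this sketch.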

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.commons.httpclient.HttpException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import cn.creditease.fso.cupid.utils.HttpclientProxy;

public class Test4ZHCMain {

    /**
     * jsoup crawler example. Maven dependency:
     *
     * <dependency>
     *     <groupId>org.jsoup</groupId>
     *     <artifactId>jsoup</artifactId>
     *     <version>1.8.3</version>
     * </dependency>
     */
    public static void main(String[] args) throws HttpException, IOException {
        // Text that contains a link
        String str = "这是我分享的一个链接http://blog.csdn.net/growing_tree/article/details/50474165";
        // The link extracted from the text
        String result = "";
        String regEx = "((http[s]{0,1}|ftp)://[a-zA-Z0-9\\.\\-]+\\.([a-zA-Z]{2,4})(:\\d+)?(/[a-zA-Z0-9\\.\\-~!@#$%^&*+?:_/=<>]*)?)|(www\\.[a-zA-Z0-9\\.\\-]+\\.([a-zA-Z]{2,4})(:\\d+)?(/[a-zA-Z0-9\\.\\-~!@#$%^&*+?:_/=<>]*)?)";
        Pattern pattern = Pattern.compile(regEx);
        Matcher matcher = pattern.matcher(str);
        while (matcher.find()) {
            // group() is the whole match; group(1) would be null for links starting with "www."
            result = matcher.group();
        }
        System.out.println("The extracted link is: " + result);

        // Request the link with the author's own HttpClient wrapper (not shown)
        String execGETMethod = HttpclientProxy.execGETMethod(result);
        // Raw HTTP response body
        System.out.println(execGETMethod);

        Document parse = Jsoup.parse(execGETMethod);
        // All <title> nodes
        Elements title = parse.getElementsByTag("title");
        // All <p> nodes
        Elements ps = parse.getElementsByTag("p");
        // First <p> node, or null when the page has no <p> tags
        Element p = ps.isEmpty() ? null : ps.get(0);
        // If the first paragraph is a copyright ("版权") notice, use the next one instead
        if (p != null && ps.size() > 1 && p.text().contains("版权")) {
            p = ps.get(1);
        }
        // All <img> nodes
        Elements img = parse.getElementsByTag("img");

        System.out.println(title.text());
        System.out.println(img.size() > 0 ? img.get(0).attr("src") : "undefined");
        System.out.println(p != null ? p.text() : "undefined");
    }
}
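
The requirement asks for the first 30 characters of the body text, while the listing prints the whole paragraph. A minimal sketch of that truncation step follows; SnippetUtil and firstChars are hypothetical names, not part of jsoup or of the author's code. It counts Unicode code points rather than char units so that a character outside the Basic Multilingual Plane is not cut in half.

```java
public class SnippetUtil {

    /**
     * Returns at most maxChars characters (Unicode code points) from the
     * start of the trimmed text. A null input yields an empty string.
     */
    public static String firstChars(String text, int maxChars) {
        if (text == null || maxChars <= 0) {
            return "";
        }
        String trimmed = text.trim();
        if (trimmed.codePointCount(0, trimmed.length()) <= maxChars) {
            return trimmed;
        }
        // Index of the char that follows the first maxChars code points
        int end = trimmed.offsetByCodePoints(0, maxChars);
        return trimmed.substring(0, end);
    }

    public static void main(String[] args) {
        System.out.println(firstChars("jsoup makes parsing HTML in Java fairly painless.", 30));
    }
}
```

In a program like the one above, the call would look like SnippetUtil.firstChars(p.text(), 30).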

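Note: the code uses my own wrapper class around HttpClient, so to run it as-is you will need to write a simple GET request yourself. It is simple enough that I have left the utility class out.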
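
One possible stand-in for the omitted utility is sketched below. It assumes Java 11 or later and uses the JDK's java.net.http.HttpClient rather than the Apache Commons HttpClient the author's wrapper is built on; SimpleHttpGet, its get method, the User-Agent value, and the timeouts are illustrative assumptions, not the author's HttpclientProxy.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class SimpleHttpGet {

    // One shared client; follows ordinary redirects (e.g. http to https)
    private static final HttpClient CLIENT = HttpClient.newBuilder()
            .followRedirects(HttpClient.Redirect.NORMAL)
            .connectTimeout(Duration.ofSeconds(10))
            .build();

    /** Sends a GET request and returns the response body as a string. */
    public static String get(String url) throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(15))
                // Some sites reject requests without a User-Agent; this value is arbitrary
                .header("User-Agent", "Mozilla/5.0 (jsoup-notes-example)")
                .GET()
                .build();
        HttpResponse<String> response =
                CLIENT.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException(
                    "Unexpected HTTP status " + response.statusCode() + " for " + url);
        }
        return response.body();
    }
}
```

With this in place, the call to HttpclientProxy.execGETMethod(result) could be swapped for SimpleHttpGet.get(result), with InterruptedException added to the throws clause. jsoup can also fetch a page itself through Jsoup.connect(url).get(), which returns a parsed Document directly.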

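The Elements type in the code is an ArrayList (it extends ArrayList<Element>), which makes it easy to work with: you can fetch whichever element you want by index, and you can loop over it. The Document object offers many ways to look up nodes, for example by tag name or by id; here I looked them up by tag name. Once you have a node you can do whatever you like with it. It is a lot of fun!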
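
The lookups described above can be tried on an inline HTML string without any network access. This is a small sketch that assumes the jsoup dependency is on the classpath; JsoupLookupDemo, its HTML constant, and firstImageSrc are made up for the demonstration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupLookupDemo {

    // Hypothetical sample page used only for this demonstration
    public static final String HTML =
            "<html><head><title>Demo page</title></head><body>"
            + "<div id=\"content\"><p>First paragraph</p>"
            + "<p class=\"note\">Second paragraph</p></div>"
            + "<img src=\"/images/cover.png\"></body></html>";

    /** Returns the src of the first image, or "undefined" when there is none. */
    public static String firstImageSrc(Document doc) {
        // select() takes a CSS selector; first() returns null on an empty result
        Element img = doc.select("img[src]").first();
        return img == null ? "undefined" : img.attr("src");
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);

        // Lookup by tag name; the result is a list, so it supports index access and loops
        Elements paragraphs = doc.getElementsByTag("p");
        for (Element p : paragraphs) {
            System.out.println(p.text());
        }
        System.out.println(paragraphs.get(1).text());

        // Lookup by id returns a single Element, or null when no node has that id
        Element content = doc.getElementById("content");
        System.out.println(content != null ? content.children().size() : 0);

        // Shortcut for the text of the <title> element
        System.out.println(doc.title());
        System.out.println(firstImageSrc(doc));
    }
}
```

The select method is an alternative to the getElementsBy... family and is often shorter when the lookup combines a tag with an attribute or class.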
