Java Network Programming (1) - A Java Web Crawler - Scraping Your Own CSDN Blog Titles and View Counts (Source Code Included)


Copyright notice: the original address of this article is http://blog.csdn.net/caib1109/article/details/51518790
Non-commercial reproduction is welcome; the author reserves all rights.


  • What is a crawler
  • What technologies does a Java crawler need
  • Advantages of a Spring-based Java crawler
    • 1 Scheduled execution via the spring task component
    • 2 Spring dependency injection (DI) decouples the site-specific crawlers
    • 3 Spring's @Value makes reading configuration (URLs or database settings) convenient
  • Detailed design of the Spring-based Java crawler
    • 1 Project class diagram
    • 2 Sending POST/GET requests with apache.httpclient
    • 3 HTML parsing - using the jericho package
    • An example - scraping all CSDN posts and their view counts
    • 4 Parameterized crawler configuration
    • 5 Multithreaded crawler coordination
    • 6 Scheduled startup

0 What is a crawler

The web holds far more information than anyone can sift by hand: a search for the keyword "crawler", for example, returns on the order of 1,000,000 results, and no human can check which of them matter.
The purpose of a crawler is therefore to fetch web pages automatically and save the useful information.

1 What technologies does a Java crawler need

  1. Sending POST and GET requests to the target site
  2. Parsing the HTML pages the site returns to extract the useful information
  3. Writing the extracted information to a file
  4. Scheduled startup, e.g. crawling once a day at 23:00 to check for updates

    Summary: a Java crawler touches HTML parsing of front-end pages, the HTTP protocol, and basic Java file I/O, which makes it an excellent introductory project for Java network programming. Everyone doing Java network programming should build a Java crawler at least once.
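Step 3 above (writing the extracted information to a file) is never shown in full later in this article, so here is a minimal sketch using only JDK I/O. The class name, record layout, and output file name are illustrative assumptions, not part of the original project:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

public class CrawlResultWriter {

    /** Renders one "title<TAB>viewCount" line per scraped record. */
    static String render(List<String[]> records) {
        StringBuilder sb = new StringBuilder();
        for (String[] r : records)
            sb.append(r[0]).append('\t').append(r[1]).append('\n');
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        List<String[]> records = Arrays.asList(
                new String[] { "My first post", "123" },
                new String[] { "My second post", "45" });
        Path out = Paths.get("crawl-result.txt"); // hypothetical output file
        Files.write(out, render(records).getBytes(StandardCharsets.UTF_8));
        System.out.println(Files.readAllLines(out, StandardCharsets.UTF_8).size()); // prints 2
    }
}
```

A tab-separated text file keeps the output trivially diffable between daily runs, which matters for the day-over-day comparison discussed in section 3.6.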

2 Advantages of a Spring-based Java crawler

2.1 Scheduled execution via the spring task component

2.2 Spring dependency injection (DI) decouples the site-specific crawlers
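As a sketch of why DI reduces coupling between crawlers for different sites: each target site implements a common interface, and the coordinator only depends on that interface. The names SiteCrawler, CsdnCrawler, and CrawlerCoordinator are illustrative assumptions (with Spring the list would typically be @Autowired; plain construction shows the same decoupling):

```java
import java.util.Arrays;
import java.util.List;

// Common abstraction: the coordinator only depends on this interface,
// not on any concrete site.
interface SiteCrawler {
    String crawl();
}

class CsdnCrawler implements SiteCrawler {
    public String crawl() { return "csdn result"; }
}

class OtherSiteCrawler implements SiteCrawler {
    public String crawl() { return "other result"; }
}

public class CrawlerCoordinator {
    private final List<SiteCrawler> crawlers;

    // The list is injected; adding a new site never changes this class.
    public CrawlerCoordinator(List<SiteCrawler> crawlers) {
        this.crawlers = crawlers;
    }

    /** Runs every crawler and counts the ones that returned a result. */
    public int runAll() {
        int ok = 0;
        for (SiteCrawler c : crawlers)
            if (c.crawl() != null)
                ok++;
        return ok;
    }

    public static void main(String[] args) {
        CrawlerCoordinator co = new CrawlerCoordinator(
                Arrays.asList(new CsdnCrawler(), new OtherSiteCrawler()));
        System.out.println(co.runAll()); // prints 2
    }
}
```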

2.3 Spring's @Value makes reading configuration (URLs or database settings) convenient

Concretely:

```java
@Repository
public class RewardsTestDatabase {

    @Value("#{jdbcProperties.databaseName}")
    public void setDatabaseName(String dbName) { … }

    @Value("#{jdbcProperties.databaseKeyGenerator}")
    public void setKeyGenerator(KeyGenerator kg) { … }
}
```

Here, "jdbcProperties" is configured in applicationContext.xml:

```xml
<!-- jdbcProperties.properties under the src directory -->
<bean id="config"
    class="org.springframework.beans.factory.config.PropertyPlaceholderConfigurer">
    <property name="fileEncoding" value="UTF-8"></property>
    <property name="locations">
        <list>
            <value>classpath:jdbcProperties.properties</value>
        </list>
    </property>
</bean>
```

Terminology:
Spring Expression Language - "#{strategyBean.databaseKeyGenerator}". Spring EL is a new feature of Spring 3.
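For completeness, a jdbcProperties.properties file matching the expressions above might look like the following; the property values are placeholders, not taken from the original project:

```properties
databaseName=crawlerdb
databaseKeyGenerator=uuid
```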

3 Detailed design of the Spring-based Java crawler

3.1 Project class diagram

3.2 Sending POST/GET requests with apache.httpclient

Dependencies:
apache.httpclient 4.5.2 - HttpGet, HttpPost
apache.httpcore 4.4 - BasicNameValuePair implements NameValuePair
commons-logging.jar - logging package; it must be on the classpath, otherwise httpclient 4.5.2 fails at runtime with NoClassDefFoundError: org/apache/commons/logging/LogFactory
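If the project is built with Maven, the three dependencies above roughly correspond to these coordinates (a sketch; verify the exact versions against your repository):

```xml
<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.5.2</version>
</dependency>
<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpcore</artifactId>
    <version>4.4</version>
</dependency>
<dependency>
    <groupId>commons-logging</groupId>
    <artifactId>commons-logging</artifactId>
    <version>1.2</version>
</dependency>
```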

For detailed usage, see the excellent write-up by wangpeng047@CSDN, which is both comprehensive and accurate.

Below is my implementation of the GET/POST requests:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.net.URISyntaxException;
import java.util.List;
import java.util.regex.Pattern;

import org.apache.http.Header;
import org.apache.http.HttpEntity;
import org.apache.http.HttpHost;
import org.apache.http.NameValuePair;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.client.utils.URIBuilder;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.DefaultProxyRoutePlanner;
import org.apache.http.util.EntityUtils;

public class HttpRequestTool {

    private static HttpHost proxy;

    /**
     * Set a proxy for httpclient.
     *
     * @param proxyHost e.g. 127.0.0.1
     * @param port      e.g. 8080
     * @return true if the proxy was accepted
     */
    public static boolean setProxy(String proxyHost, String port) {
        if (proxyHost == null || port == null)
            return false;
        proxyHost = proxyHost.trim();
        port = port.trim();
        /*
         * 0-9     matched by \\d
         * 10-99   matched by [1-9]\\d
         * 100-199 matched by 1\\d\\d
         * 200-249 matched by 2[0-4]\\d
         * 250-255 matched by 25[0-5]
         * (xxx|xxx|xxx) is alternation; ^xxx$ anchors the whole string;
         * the separating dot must be escaped as \\. so it only matches a literal '.'
         * ^((\\d|[1-9]\\d|1\\d\\d|2[0-4]\\d|25[0-5])\\.){3}(\\d|[1-9]\\d|1\\d\\d|2[0-4]\\d|25[0-5])$
         */
        if (!Pattern.compile("^((\\d|[1-9]\\d|1\\d\\d|2[0-4]\\d|25[0-5])\\.){3}(\\d|[1-9]\\d|1\\d\\d|2[0-4]\\d|25[0-5])$")
                .matcher(proxyHost).find())
            return false;
        if (Pattern.compile("[^\\d]").matcher(port).find())
            return false;
        int iPort = Integer.parseInt(port);
        if (iPort > 65535)
            return false;
        proxy = new HttpHost(proxyHost, iPort);
        return true;
    }

    /**
     * Simple getMethod without headers and parameters.
     */
    public static String getMethod(String host, String resourcePath) throws URISyntaxException, IOException {
        return getMethod("http", host, null, resourcePath, null, null);
    }

    /**
     * getMethod with headers and parameters.
     */
    public static String getMethod(String protocol, String host, String port, String resourcePath,
            Header[] headKeyValueArray, List<NameValuePair> paraKeyValueList)
            throws URISyntaxException, IOException {
        URIBuilder builder = new URIBuilder().setScheme(protocol).setHost(host);
        if (port != null)
            builder.setPort(Integer.parseInt(port));
        if (resourcePath != null)
            builder.setPath("/" + resourcePath);
        // GET request parameters; Chinese characters are URL-encoded as UTF-8 automatically.
        // Do not use the deprecated httpGet.setParams(HttpParams params) method.
        if (paraKeyValueList != null)
            builder.addParameters(paraKeyValueList);
        URI uri = builder.build();
        HttpGet httpGet = new HttpGet(uri);
        if (headKeyValueArray != null)
            httpGet.setHeaders(headKeyValueArray);
        CloseableHttpClient httpclient = (proxy == null)
                ? HttpClients.createDefault()
                : HttpClients.custom().setRoutePlanner(new DefaultProxyRoutePlanner(proxy)).build();
        BufferedReader br = null;
        InputStreamReader isr = null;
        CloseableHttpResponse httpResponse = null;
        try {
            httpResponse = httpclient.execute(httpGet);
            System.out.println(httpResponse.getStatusLine());
            HttpEntity bodyEntity = httpResponse.getEntity();
            isr = new InputStreamReader(bodyEntity.getContent(), "UTF-8");
            br = new BufferedReader(isr);
            StringBuffer httpBody = new StringBuffer();
            String resTemp;
            while ((resTemp = br.readLine()) != null) {
                resTemp = resTemp.trim();
                if (!"".equals(resTemp))
                    httpBody.append(resTemp).append("\n");
            }
            EntityUtils.consume(bodyEntity);
            return httpBody.toString();
        } finally {
            try {
                if (httpResponse != null)
                    httpResponse.close();
            } catch (IOException e1) {
                e1.printStackTrace();
            }
            if (isr != null) {
                try {
                    isr.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
            if (br != null) {
                try {
                    br.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
    }

    /**
     * postMethod with headers and parameters.
     */
    public static String postMethod(String protocol, String host, String port, String resourcePath,
            Header[] headKeyValueArray, List<NameValuePair> paraKeyValueList)
            throws IOException, URISyntaxException {
        CloseableHttpClient httpclient = (proxy == null)
                ? HttpClients.createDefault()
                : HttpClients.custom().setRoutePlanner(new DefaultProxyRoutePlanner(proxy)).build();
        CloseableHttpResponse httpResponse = null;
        try {
            URIBuilder builder = new URIBuilder().setScheme(protocol).setHost(host);
            if (port != null)
                builder.setPort(Integer.parseInt(port));
            if (resourcePath != null)
                builder.setPath("/" + resourcePath);
            URI uri = builder.build();
            HttpPost httpPost = new HttpPost(uri);
            if (headKeyValueArray != null)
                httpPost.setHeaders(headKeyValueArray);
            // POST form parameters, UTF-8 encoded. This replaces the obsolete
            // HttpClient 3.x style (postMethod.setRequestBody / executeMethod).
            if (paraKeyValueList != null)
                httpPost.setEntity(new UrlEncodedFormEntity(paraKeyValueList, "UTF-8"));
            httpResponse = httpclient.execute(httpPost);
            System.out.println(httpResponse.getStatusLine());
            HttpEntity bodyEntity = httpResponse.getEntity();
            String body = EntityUtils.toString(bodyEntity, "UTF-8");
            EntityUtils.consume(bodyEntity);
            return body;
        } finally {
            if (httpResponse != null)
                httpResponse.close();
        }
    }
}
```
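The dotted-quad validation idea behind setProxy can be exercised on its own with the JDK's Pattern class. This is a standalone sketch, not the project's code; note the dot is escaped so it only matches a literal '.':

```java
import java.util.regex.Pattern;

public class IpCheck {

    // Each octet: 0-9, 10-99, 100-199, 200-249, or 250-255.
    private static final Pattern IPV4 = Pattern.compile(
            "^((\\d|[1-9]\\d|1\\d\\d|2[0-4]\\d|25[0-5])\\.){3}(\\d|[1-9]\\d|1\\d\\d|2[0-4]\\d|25[0-5])$");

    public static boolean isValidIpv4(String s) {
        return s != null && IPV4.matcher(s.trim()).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidIpv4("127.0.0.1"));  // true
        System.out.println(isValidIpv4("256.1.1.1"));  // false: 256 exceeds an octet
        System.out.println(isValidIpv4("10.37.84"));   // false: only three octets
    }
}
```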

3.3 HTML parsing - using the jericho package

jericho-html-3.4.jar requires JDK 7 or above
and depends on log4j-api-2.4.1.jar and log4j-core-2.4.1.jar.

```java
import net.htmlparser.jericho.Element;
import net.htmlparser.jericho.Source;
import org.apache.http.Header;
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

public class CsdnGet {

    protected Logger logger = LogManager.getLogger(this.getClass());

    public void dealHtml(Header[] headerList) throws Exception {
        String str;
        str = HttpRequestTool.getMethod("http", "write.blog.csdn.net", "80", "postlist", headerList, null);
        // Build jericho's tree structure (Source) from the html page source.
        Source source = new Source(str);
        // Common ways to get html elements:
        // Element ele = source.getElementById("elementid");
        // Element ele = source.getFirstElementByClass("elementclass");
        // List<Element> eles = source.getAllElementsByClass("elementclass");
        // List<Element> eleList = ele.getChildElements(); // all child elements; very useful for analyzing <table>
        // Get an element's text content as a String:
        // element.getTextExtractor().toString();
    }
}
```

An example - scraping all CSDN posts and their view counts

```java
import java.io.IOException;
import java.net.URISyntaxException;
import java.util.LinkedList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import net.htmlparser.jericho.Element;
import net.htmlparser.jericho.Source;

import org.apache.http.Header;
import org.apache.http.message.BasicHeader;
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

import dto.Title_Num;

public class CsdnGet {

    protected Logger logger = LogManager.getLogger(this.getClass());

    private static final String articleListBox = "lstBox",
                                pageBox        = "page_nav";

    public void getHtml() {
        String str = null;
        try {
            HttpRequestTool.setProxy("10.37.84.117", "8080");
            Header[] headerList = {
                    new BasicHeader("Host", "write.blog.csdn.net"),
                    new BasicHeader("User-Agent",
                            "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0"),
                    new BasicHeader("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"),
                    new BasicHeader("Accept-Language", "zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3"),
                    new BasicHeader("Accept-Encoding", "gzip, deflate"),
                    new BasicHeader("Cookie",
                            "/* obtain the cookie of your CSDN blog homepage with a packet-capture tool */"),
                    new BasicHeader("Connection", "keep-alive") };
            // list that collects every title and view count
            List<Title_Num> itemlist = new LinkedList<Title_Num>();
            str = HttpRequestTool.getMethod("http", "write.blog.csdn.net", "80", "postlist", headerList, null);
            Source source = new Source(str);
            getArticlesOnePage(source, itemlist);
            // the html element that holds the total number of pages
            String pageInfo = source.getFirstElementByClass(pageBox).getFirstElement("span")
                    .getTextExtractor().toString();
            // extract the total page count with a regular expression
            Matcher matcher = Pattern.compile("[^\\d](\\d{1,})[^\\d]").matcher(pageInfo);
            String sTotalPage = null;
            if (matcher.find())
                sTotalPage = matcher.group(1);
            int iTotalPage = Integer.parseInt(sTotalPage);
            if (iTotalPage > 1) {
                for (int i = 2; i <= iTotalPage; i++) {
                    String pageSuffix = String.format("postlist/0/0/enabled/%d", i);
                    str = HttpRequestTool.getMethod("http", "write.blog.csdn.net", "80", pageSuffix, headerList, null);
                    source = new Source(str);
                    getArticlesOnePage(source, itemlist);
                }
            }
            // print the results
            for (Title_Num title_Num : itemlist) {
                System.out.println(title_Num.getTitle() + title_Num.getNumber());
            }
        } catch (URISyntaxException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    private void getArticlesOnePage(Source source, List<Title_Num> itemlist) {
        // parse one page of the article list; the first child is the table header row
        List<Element> articles = source.getElementById(articleListBox).getChildElements();
        articles.remove(0);
        for (Element article : articles) {
            int col = 0;
            Title_Num title_Num = new Title_Num();
            for (Element column : article.getChildElements()) {
                if (col == 0)
                    title_Num.setTitle(column.getTextExtractor().toString());
                if (col == 2)
                    title_Num.setNumber(Integer.parseInt(column.getTextExtractor().toString()));
                col++;
            }
            itemlist.add(title_Num);
        }
    }

    public static void main(String[] args) {
        new CsdnGet().getHtml();
    }
}
```
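The page-count extraction above relies on the regular expression [^\d](\d{1,})[^\d], which grabs the first digit run surrounded by non-digit characters. Its behavior can be checked in isolation; the sample string here is a made-up stand-in for CSDN's pager text:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PageCountDemo {

    /** Returns the first digit run surrounded by non-digits, as the crawler does. */
    static String extractTotalPage(String pageInfo) {
        Matcher matcher = Pattern.compile("[^\\d](\\d{1,})[^\\d]").matcher(pageInfo);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        // "共3页" is sample pager text meaning "3 pages in total"
        System.out.println(extractTotalPage("共3页")); // prints 3
    }
}
```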

3.4 Parameterized crawler configuration

3.5 Multithreaded crawler coordination
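This section is only outlined in the original article. As a sketch of what multithreaded coordination could look like, an ExecutorService can fetch several pages in parallel and collect the results in submission order; the fetch itself is simulated here, and all names are illustrative (a real crawler would call HttpRequestTool.getMethod inside fetchPage):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelFetch {

    /** Simulated page fetch; a real crawler would issue an HTTP GET here. */
    static String fetchPage(int pageNo) {
        return "page-" + pageNo;
    }

    /** Fetches pages 1..totalPages on a small thread pool. */
    static List<String> fetchAll(int totalPages) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<String>> futures = new ArrayList<Future<String>>();
        for (int i = 1; i <= totalPages; i++) {
            final int pageNo = i;
            futures.add(pool.submit(new Callable<String>() {
                public String call() {
                    return fetchPage(pageNo);
                }
            }));
        }
        List<String> pages = new ArrayList<String>();
        for (Future<String> f : futures)
            pages.add(f.get()); // blocks; results keep submission order
        pool.shutdown();
        return pages;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(fetchAll(8).size()); // prints 8
    }
}
```

Collecting Futures in a list keeps the results ordered by page number even though the fetches complete in arbitrary order.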

3.6 Scheduled startup

Suppose we want to crawl our own CSDN blog titles and view counts every day, compare them with yesterday's, and work out how much each article's view count has grown.
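The day-over-day comparison described here amounts to a simple map difference. A minimal sketch with made-up counts (the class and method names are illustrative):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ReadCountDelta {

    /** Returns title -> (today - yesterday) for every title seen today. */
    static Map<String, Integer> delta(Map<String, Integer> yesterday, Map<String, Integer> today) {
        Map<String, Integer> diff = new LinkedHashMap<String, Integer>();
        for (Map.Entry<String, Integer> e : today.entrySet()) {
            // a title absent yesterday counts as 0 (newly published post)
            int old = yesterday.containsKey(e.getKey()) ? yesterday.get(e.getKey()) : 0;
            diff.put(e.getKey(), e.getValue() - old);
        }
        return diff;
    }

    public static void main(String[] args) {
        Map<String, Integer> yesterday = new LinkedHashMap<String, Integer>();
        yesterday.put("Post A", 100);
        Map<String, Integer> today = new LinkedHashMap<String, Integer>();
        today.put("Post A", 130);
        today.put("Post B", 5); // new post since yesterday
        System.out.println(delta(yesterday, today)); // prints {Post A=30, Post B=5}
    }
}
```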

Do we then have to start the crawler process by hand every day?
No. Spring's Task component can handle the scheduled startup for us.
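A minimal spring task configuration for a daily 23:00 run could look like the following sketch; the bean id and method name are assumptions based on the CsdnGet class above, and spring-context must be on the classpath:

```xml
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:task="http://www.springframework.org/schema/task"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.springframework.org/schema/beans
           http://www.springframework.org/schema/beans/spring-beans.xsd
           http://www.springframework.org/schema/task
           http://www.springframework.org/schema/task/spring-task.xsd">

    <bean id="csdnGet" class="CsdnGet"/>

    <!-- cron fields: second minute hour day month weekday; fires every day at 23:00 -->
    <task:scheduled-tasks>
        <task:scheduled ref="csdnGet" method="getHtml" cron="0 0 23 * * ?"/>
    </task:scheduled-tasks>
</beans>
```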


