Java Network Programming (1) - A Java Web Crawler - Scraping Your Own CSDN Blog Titles and Read Counts (with Source Code)
Copyright notice: this article lives at http://blog.csdn.net/caib1109/article/details/51518790
Reposting for non-commercial purposes is welcome; the author reserves all rights.
- What is a crawler
- What technologies does a Java crawler need
- What advantages does a Spring-based Java crawler have
- 1 Scheduled execution provided by the spring task component
- 2 Spring's dependency injection (DI) decouples the crawler from specific target sites
- 3 Spring's @Value makes it easy to read URLs or database settings from a configuration file
- Detailed design of the Spring-based Java crawler
- 1 Project class diagram
- 2 Sending POST/GET requests with apache.httpclient
- 3 HTML parsing - using the jericho package
- An example - scraping all CSDN blog posts and their read counts
- 4 Parameterized crawler configuration
- 5 Coordinating the crawler with multiple threads
- 6 Scheduled start
0 What is a crawler
The web holds a huge amount of information: a search on the keyword "crawler", for example, returns 1,000,000 results, and no human can check by hand which of them are actually needed.
The purpose of a crawler, then, is to fetch web pages automatically and save the useful information.
1 What technologies does a Java crawler need
- Sending POST and GET requests to the target site
- Parsing the html pages the target site returns, to extract the useful information
- Writing the scraped information to a file
- Scheduled start, e.g. crawling once every day at 23:00 to check for updates
Summary: a Java crawler touches html parsing of front-end pages, the http protocol, and basic Java file I/O, which makes it a rare all-in-one introductory project for Java network programming. Everyone doing Java network programming should build one.
2 What advantages does a Spring-based Java crawler have
2.1 Scheduled execution provided by the spring task component
2.2 Spring's dependency injection (DI) decouples the crawler from specific target sites
2.3 Spring's @Value makes it easy to read configuration (URLs or database settings) from a file
Concretely:
```java
@Repository
public class RewardsTestDatabase {

    @Value("#{jdbcProperties.databaseName}")
    public void setDatabaseName(String dbName) { … }

    @Value("#{jdbcProperties.databaseKeyGenerator}")
    public void setKeyGenerator(KeyGenerator kg) { … }
}
```
Here "jdbcProperties" is configured in applicationContext.xml:
```xml
<!-- jdbcProperties.properties under the src directory -->
<bean id="config" class="org.springframework.beans.factory.config.PropertyPlaceholderConfigurer">
    <property name="fileEncoding" value="UTF-8"></property>
    <property name="locations">
        <list>
            <value>classpath:jdbcProperties.properties</value>
        </list>
    </property>
</bean>
```
Terminology:
Spring Expression Language - "#{strategyBean.databaseKeyGenerator}". Spring EL is a new feature of Spring 3.
3 Detailed design of the Spring-based Java crawler
3.1 Project class diagram
3.2 Sending POST/GET requests with apache.httpclient
Dependencies:
apache.httpclient 4.5.2 - HttpGet, HttpPost
apache.httpcore 4.4 - BasicNameValuePair implements NameValuePair
commons-logging.jar - the logging package; it must be on the classpath, otherwise httpclient 4.5.2 fails at runtime with NoClassDefFoundError: org/apache/commons/logging/LogFactory
For detailed usage see the excellent write-up by wangpeng047@CSDN, which is both complete and accurate.
Below are the GET/POST requests I wrote:
```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.net.URISyntaxException;
import java.util.List;
import java.util.regex.Pattern;
import org.apache.http.Header;
import org.apache.http.HttpEntity;
import org.apache.http.HttpHost;
import org.apache.http.NameValuePair;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.client.utils.URIBuilder;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.DefaultProxyRoutePlanner;
import org.apache.http.util.EntityUtils;

public class HttpRequestTool {

    private static HttpHost proxy;

    /**
     * Set a proxy for the http client.
     *
     * @param proxyHost e.g. 127.0.0.1
     * @param port      e.g. 8080
     * @return true if the proxy was accepted
     */
    public static boolean setProxy(String proxyHost, String port) {
        if (proxyHost == null || port == null)
            return false;
        proxyHost = proxyHost.trim();
        port = port.trim();
        /*
         * 0-9     matched by \\d
         * 10-99   matched by [1-9]\\d
         * 100-199 matched by 1\\d\\d
         * 200-249 matched by 2[0-4]\\d
         * 250-255 matched by 25[0-5]
         * (xxx|xxx|xxx) is alternation; ^xxx$ anchors the whole string
         */
        if (!Pattern.compile("^((\\d|[1-9]\\d|1\\d\\d|2[0-4]\\d|25[0-5])\\.){3}(\\d|[1-9]\\d|1\\d\\d|2[0-4]\\d|25[0-5])$")
                .matcher(proxyHost).find())
            return false;
        if (Pattern.compile("[^\\d]").matcher(port).find())
            return false;
        int iPort = Integer.parseInt(port);
        if (iPort > 65535)
            return false;
        proxy = new HttpHost(proxyHost, iPort);
        return true;
    }

    /**
     * Simple GET without headers or parameters.
     */
    public static String getMethod(String host, String resourcePath)
            throws URISyntaxException, IOException {
        return getMethod("http", host, null, resourcePath, null, null);
    }

    /**
     * GET with headers and parameters.
     */
    public static String getMethod(String protocol, String host, String port,
            String resourcePath, Header[] headKeyValueArray,
            List<NameValuePair> paraKeyValueList)
            throws URISyntaxException, IOException {
        URIBuilder builder = new URIBuilder().setScheme(protocol).setHost(host);
        if (port != null)
            builder.setPort(Integer.parseInt(port));
        if (resourcePath != null)
            builder.setPath("/" + resourcePath);
        // GET request parameters; non-ASCII values are UTF-8 encoded automatically.
        // Do not use the deprecated httpGet.setParams(HttpParams params) method.
        if (paraKeyValueList != null)
            builder.addParameters(paraKeyValueList);
        URI uri = builder.build();
        HttpGet httpGet = new HttpGet(uri);
        if (headKeyValueArray != null)
            httpGet.setHeaders(headKeyValueArray);
        CloseableHttpClient httpclient = (proxy == null)
                ? HttpClients.createDefault()
                : HttpClients.custom().setRoutePlanner(new DefaultProxyRoutePlanner(proxy)).build();
        BufferedReader br = null;
        InputStreamReader isr = null;
        CloseableHttpResponse httpResponse = null;
        try {
            httpResponse = httpclient.execute(httpGet);
            System.out.println(httpResponse.getStatusLine());
            HttpEntity bodyEntity = httpResponse.getEntity();
            isr = new InputStreamReader(bodyEntity.getContent());
            br = new BufferedReader(isr);
            StringBuffer httpBody = new StringBuffer();
            String resTemp;
            while ((resTemp = br.readLine()) != null) {
                resTemp = resTemp.trim();
                if (!"".equals(resTemp))
                    httpBody.append(resTemp).append("\n");
            }
            EntityUtils.consume(bodyEntity);
            return httpBody.toString();
        } finally {
            try {
                if (httpResponse != null)
                    httpResponse.close();
            } catch (IOException e1) {
                e1.printStackTrace();
            }
            if (isr != null) {
                try {
                    isr.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
            if (br != null) {
                try {
                    br.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
    }

    /**
     * POST with headers and parameters.
     */
    public static String postMethod(String protocol, String host, String port,
            String resourcePath, Header[] headKeyValueArray,
            List<NameValuePair> paraKeyValueList)
            throws IOException, URISyntaxException {
        CloseableHttpClient httpclient = (proxy == null)
                ? HttpClients.createDefault()
                : HttpClients.custom().setRoutePlanner(new DefaultProxyRoutePlanner(proxy)).build();
        CloseableHttpResponse httpResponse = null;
        try {
            URIBuilder builder = new URIBuilder().setScheme(protocol).setHost(host);
            if (port != null) {
                builder.setPort(Integer.parseInt(port));
            }
            if (resourcePath != null) {
                builder.setPath("/" + resourcePath);
            }
            URI uri = builder.build();
            HttpPost httpPost = new HttpPost(uri);
            if (headKeyValueArray != null) {
                httpPost.setHeaders(headKeyValueArray);
            }
            // POST form parameters go into the request body, UTF-8 encoded
            if (paraKeyValueList != null) {
                httpPost.setEntity(new UrlEncodedFormEntity(paraKeyValueList, "UTF-8"));
            }
            httpResponse = httpclient.execute(httpPost);
            return EntityUtils.toString(httpResponse.getEntity(), "UTF-8");
        } finally {
            if (httpResponse != null) {
                httpResponse.close();
            }
        }
    }
}
```
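The input validation in setProxy is plain java.util.regex work and can be exercised in isolation. A minimal sketch, with the validation logic copied out into a standalone class (HttpRequestTool and the httpclient jars are not needed here):

```java
import java.util.regex.Pattern;

public class ProxyCheck {
    // Same octet alternation as in setProxy: 0-9, 10-99, 100-199, 200-249, 250-255,
    // with the dot escaped so it only matches a literal '.'
    private static final Pattern IPV4 = Pattern.compile(
            "^((\\d|[1-9]\\d|1\\d\\d|2[0-4]\\d|25[0-5])\\.){3}(\\d|[1-9]\\d|1\\d\\d|2[0-4]\\d|25[0-5])$");

    public static boolean isValidProxy(String host, String port) {
        if (host == null || port == null)
            return false;
        host = host.trim();
        port = port.trim();
        if (!IPV4.matcher(host).matches())
            return false;                       // not a dotted-quad IPv4 address
        if (!port.matches("\\d{1,5}"))
            return false;                       // non-numeric or absurdly long port
        return Integer.parseInt(port) <= 65535; // ports are 16-bit
    }

    public static void main(String[] args) {
        System.out.println(isValidProxy("127.0.0.1", "8080"));  // valid host and port
        System.out.println(isValidProxy("256.1.1.1", "8080"));  // 256 is not a valid octet
        System.out.println(isValidProxy("127.0.0.1", "70000")); // port out of range
    }
}
```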
3.3 HTML parsing - using the jericho package
jericho-html-3.4.jar requires jdk7 or above.
It depends on log4j-api-2.4.1.jar and log4j-core-2.4.1.jar.
```java
import net.htmlparser.jericho.Element;
import net.htmlparser.jericho.Source;
import org.apache.http.Header;
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

public class CsdnGet {
    protected Logger logger = LogManager.getLogger(this.getClass());

    public void dealHtml(Header[] headerList) throws Exception {
        // headerList: request headers, built as in the full example below
        String str = HttpRequestTool.getMethod("http", "write.blog.csdn.net", "80",
                "postlist", headerList, null);
        // build jericho's tree structure (Source) from the html source
        Source source = new Source(str);
        // common ways to get hold of html elements:
        // Element ele = source.getElementById("elementid");
        // Element ele = source.getFirstElementByClass("elementclass");
        // List<Element> eles = source.getAllElementsByClass("elementclass");
        // List<Element> children = ele.getChildElements(); // all child tags; very handy for <table>
        // text content of an element:
        // String text = ele.getTextExtractor().toString();
    }
}
```
An example - scraping all CSDN blog posts and their read counts
```java
import java.io.IOException;
import java.net.URISyntaxException;
import java.util.LinkedList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import net.htmlparser.jericho.Element;
import net.htmlparser.jericho.Source;
import org.apache.http.Header;
import org.apache.http.message.BasicHeader;
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
import dto.Title_Num;

public class CsdnGet {
    protected Logger logger = LogManager.getLogger(this.getClass());
    private static final String articleListBox = "lstBox", pageBox = "page_nav";

    public void getHtml() {
        String str = null;
        try {
            HttpRequestTool.setProxy("10.37.84.117", "8080");
            Header[] headerList = {
                    new BasicHeader("Host", "write.blog.csdn.net"),
                    new BasicHeader("User-Agent",
                            "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0"),
                    new BasicHeader("Accept",
                            "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"),
                    new BasicHeader("Accept-Language", "zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3"),
                    new BasicHeader("Accept-Encoding", "gzip, deflate"),
                    new BasicHeader("Cookie",
                            "/* capture the cookie of your CSDN blog homepage with a packet-capture tool */"),
                    new BasicHeader("Connection", "keep-alive") };
            // list holding every (title, read count) pair
            List<Title_Num> itemlist = new LinkedList<Title_Num>();
            str = HttpRequestTool.getMethod("http", "write.blog.csdn.net", "80",
                    "postlist", headerList, null);
            Source source = new Source(str);
            getArticlesOnePage(source, itemlist);
            // the html element that holds the total page count
            String pageInfo = source.getFirstElementByClass(pageBox)
                    .getFirstElement("span").getTextExtractor().toString();
            // extract the total page count with a regular expression
            Matcher matcher = Pattern.compile("[^\\d](\\d{1,})[^\\d]").matcher(pageInfo);
            String sTotalPage = null;
            if (matcher.find())
                sTotalPage = matcher.group(1);
            int iTotalPage = Integer.parseInt(sTotalPage);
            if (iTotalPage > 1) {
                for (int i = 2; i <= iTotalPage; i++) {
                    String pageSuffix = String.format("postlist/0/0/enabled/%d", i);
                    str = HttpRequestTool.getMethod("http", "write.blog.csdn.net",
                            "80", pageSuffix, headerList, null);
                    source = new Source(str);
                    getArticlesOnePage(source, itemlist);
                }
            }
            // print the results
            for (Title_Num title_Num : itemlist) {
                System.out.println(title_Num.getTitle() + title_Num.getNumber());
            }
        } catch (URISyntaxException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    private void getArticlesOnePage(Source source, List<Title_Num> itemlist) {
        // parse one page of the article list
        List<Element> articles = source.getElementById(articleListBox).getChildElements();
        articles.remove(0); // drop the header row
        for (Element article : articles) {
            int col = 0;
            Title_Num title_Num = new Title_Num();
            for (Element column : article.getChildElements()) {
                if (col == 0)
                    title_Num.setTitle(column.getTextExtractor().toString());
                if (col == 2)
                    title_Num.setNumber(Integer.parseInt(column.getTextExtractor().toString()));
                col++;
            }
            itemlist.add(title_Num);
        }
    }

    public static void main(String[] args) {
        new CsdnGet().getHtml();
    }
}
```
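The page-count extraction in getHtml relies only on java.util.regex, so it can be checked without the network. A sketch using the same pattern on a stand-in pager string (the real text comes from CSDN's page_nav element, e.g. the Chinese "共3页", "3 pages in total"):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PageCount {
    /** Pulls out the first digit run surrounded by non-digits, as getHtml() does. */
    public static int totalPages(String pageInfo) {
        Matcher matcher = Pattern.compile("[^\\d](\\d{1,})[^\\d]").matcher(pageInfo);
        return matcher.find() ? Integer.parseInt(matcher.group(1)) : 1;
    }

    public static void main(String[] args) {
        // stand-in for the pager text scraped from the page_nav element
        System.out.println(totalPages(" 3 ")); // 3
    }
}
```

Note that the pattern requires a non-digit character on both sides of the number, so a bare "3" with nothing around it would not match; the pager text on the real page always has surrounding characters.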
3.4 Parameterized crawler configuration
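The crawler above hard-codes the host, paths, and element ids; moving them into a properties file (read with Spring's @Value as in section 2.3, or with plain java.util.Properties) keeps one crawler class reusable across sites. A minimal dependency-free sketch - the file name crawler.properties and every key in it are illustrative, not part of the original project:

```java
import java.io.StringReader;
import java.util.Properties;

public class CrawlerConfig {
    public final String host;
    public final String listPath;
    public final String articleListBox;

    public CrawlerConfig(Properties p) {
        // defaults mirror the hard-coded values used in the CSDN example
        this.host = p.getProperty("crawler.host", "write.blog.csdn.net");
        this.listPath = p.getProperty("crawler.listPath", "postlist");
        this.articleListBox = p.getProperty("crawler.articleListBox", "lstBox");
    }

    /** Parses properties text; a real project would load crawler.properties from the classpath. */
    public static CrawlerConfig fromText(String text) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(text));
        } catch (java.io.IOException e) {
            throw new RuntimeException(e); // StringReader never actually throws here
        }
        return new CrawlerConfig(p);
    }

    public static void main(String[] args) {
        CrawlerConfig cfg = fromText("crawler.host=write.blog.csdn.net\ncrawler.listPath=postlist\n");
        System.out.println(cfg.host + "/" + cfg.listPath + " " + cfg.articleListBox);
    }
}
```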
3.5 Coordinating the crawler with multiple threads
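The example in 3.3 fetches pages 2..N one after another; they can also be fetched concurrently with a small thread pool. A dependency-free sketch, where the IntFunction stands in for a per-page call such as HttpRequestTool.getMethod:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.IntFunction;

public class ParallelFetch {
    /** Runs fetch(page) for pages 1..totalPages on a small pool; results come back in page order. */
    public static List<String> fetchAll(int totalPages, IntFunction<String> fetch) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (int page = 1; page <= totalPages; page++) {
                final int p = page;
                futures.add(pool.submit(() -> fetch.apply(p)));
            }
            List<String> bodies = new ArrayList<>();
            for (Future<String> f : futures)
                bodies.add(f.get()); // Future.get preserves page order even though execution is concurrent
            return bodies;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // stand-in for a real page fetch
        System.out.println(fetchAll(3, p -> "page-" + p));
    }
}
```

Be polite when parallelizing a crawler: a handful of threads is plenty, and hammering a site with a large pool is a quick way to get your IP blocked.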
3.6 Scheduled start
Suppose we want to scrape our own CSDN blog titles and read counts every day and compare them with yesterday's figures, to work out how much each article's read count has grown.
Does that mean starting the crawler process by hand every day?
No - Spring's Task component can start it on a schedule.
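With Spring Task this is a single annotation on the crawl method, e.g. @Scheduled(cron = "0 0 23 * * ?") together with task scheduling enabled in the application context (the cron string is the standard "every day at 23:00" pattern). To see the mechanics without Spring on the classpath, the JDK's ScheduledExecutorService can do the same job; the delay computation below is the testable core of that approach:

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.LocalTime;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class DailyScheduler {
    /** Seconds from 'now' until the next occurrence of 'target', rolling to tomorrow if it has passed. */
    public static long secondsUntil(LocalDateTime now, LocalTime target) {
        LocalDateTime next = now.toLocalDate().atTime(target);
        if (!next.isAfter(now))
            next = next.plusDays(1);
        return Duration.between(now, next).getSeconds();
    }

    /** Runs the crawl at the next 23:00, then every 24 hours. */
    public static void start(Runnable crawl) {
        ScheduledExecutorService ses = Executors.newSingleThreadScheduledExecutor();
        long initialDelay = secondsUntil(LocalDateTime.now(), LocalTime.of(23, 0));
        ses.scheduleAtFixedRate(crawl, initialDelay, 24 * 3600, TimeUnit.SECONDS);
    }

    public static void main(String[] args) {
        // at 22:00 the next 23:00 run is one hour away
        System.out.println(secondsUntil(
                LocalDateTime.of(2016, 5, 27, 22, 0), LocalTime.of(23, 0))); // 3600
    }
}
```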