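Reposted from: http://blog.csdn.net/u013292160/article/details/68490758
1. Project Overview
This project uses Oracle, Tomcat, mod_jk, SpringMVC, Hibernate3, and ActiveMQ to build a (very) simple distributed novel-crawler system.
2. Architecture Diagram
3. Servers and Core Code
Spider server (spider): uses jsoup to crawl web pages, extracts each chapter's title and content, and saves them to the database (the raw-data server). It then sends a message to ActiveMQ to announce that the raw-data server holds new data waiting to be cleaned.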
    /**
     * <p>Description: crawl a novel chapter with jsoup</p>
     *
     * @author SongJia
     * @date 2017-3-30 5:29:25 PM
     * @param url URL of the chapter page to crawl
     * @return URL of the next chapter, or "" if the page could not be fetched
     */
    @SuppressWarnings("deprecation")
    @RequestMapping(value = "getnovels")
    @ResponseBody
    public String getNovels(String url) {
        try {
            NovelInitial novelBean = new NovelInitial();
            Document doc = Jsoup.connect(url).timeout(10000).get();
            String id = UUID.randomUUID().toString().replace("-", "");
            String content = doc.getElementById("content").text();
            // text() instead of toString(): toString() would return the <h1> markup itself
            String title = doc.getElementsByTag("H1").text();
            // The last chapter may have no "next" link, so guard against null
            Element nextElement = doc.getElementById("pager_next");
            String nextUrl = (nextElement == null) ? "" : nextElement.attr("href");

            novelBean.setId(id);
            novelBean.setTitle(title);
            novelBean.setCurrentUrl(url);
            novelBean.setNextUrl(nextUrl);
            // Note: getBytes() uses the platform default charset
            Blob createBlob = Hibernate.createBlob(content.getBytes());
            novelBean.setContent(createBlob);
            novelBean.setFlag("1");
            System.out.println(content);
            novelService.save(novelBean);

            // The JSON is assembled by hand, so a title containing quotes would break it
            SendMessage.send("{\"id\":\"" + id + "\",\"title\":\"" + title + "\"}");
            return nextUrl;
        } catch (IOException e) {
            e.printStackTrace();
        }
        return "";
    }
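The notification sent to ActiveMQ above is built by string concatenation, which produces invalid JSON as soon as a title contains a double quote or a backslash. The following is a minimal sketch of one way to guard against that using only the JDK. The class name `NovelMessage` and its methods are hypothetical and do not appear in the original project.

```java
/** Hypothetical helper (not part of the original project): builds the id/title message. */
final class NovelMessage {

    private NovelMessage() {}

    /** Escapes the characters that must be escaped inside a JSON string literal. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder(s.length() + 8);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters use the \\uXXXX form
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    /** Produces the same {"id":...,"title":...} shape that the spider sends. */
    static String toJson(String id, String title) {
        return "{\"id\":\"" + escape(id) + "\",\"title\":\"" + escape(title) + "\"}";
    }

    public static void main(String[] args) {
        System.out.println(toJson("abc123", "Chapter 1: \"The Ring\""));
    }
}
```

Since the other servers in this project already use Gson, serializing a small bean with `gson.toJson(...)` would likely do the same job with less code; the hand-written version is shown only to make the escaping rules visible.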
ActiveMQ server (active): notifies the data-cleaning server that the raw-data server holds data that needs cleaning.
    /**
     * <p>Description: consumer that reads messages from the MQ</p>
     *
     * @author SongJia
     * @date 2017-3-30 11:39:57 AM
     * @param args
     */
    public static void main(String[] args) {
        ConnectionFactory connectionFactory;
        Connection connection = null;
        Session session;
        Destination destination;
        MessageConsumer consumer;

        // Broker host and port come from the project's properties file
        Properties prop = MyProperties.init();
        String activemqHost = prop.getProperty("activemq_host");
        String activemqPort = prop.getProperty("activemq_port");
        connectionFactory = new ActiveMQConnectionFactory(
                ActiveMQConnection.DEFAULT_USER,
                ActiveMQConnection.DEFAULT_PASSWORD,
                "tcp://" + activemqHost + ":" + activemqPort);
        try {
            connection = connectionFactory.createConnection();
            connection.start();
            // Non-transacted session with automatic acknowledgement
            session = connection.createSession(Boolean.FALSE, Session.AUTO_ACKNOWLEDGE);
            destination = session.createQueue("Novel");
            consumer = session.createConsumer(destination);
            while (true) {
                // Wait up to 10 seconds; receive() returns null on timeout
                TextMessage message = (TextMessage) consumer.receive(10000);
                if (null != message) {
                    System.out.println("Received message: " + message.getText());
                    ProcessNovelFormat format = new ProcessNovelFormat();
                    format.formatNovel(message.getText());
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            try {
                if (null != connection)
                    connection.close();
            } catch (Throwable ignore) {
            }
        }
    }
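    /**
     * <p>Description: the MQ consumer calls the novel-processing service to process a novel</p>
     *
     * @author SongJia
     * @date 2017-3-30 11:42:35 AM
     * @param msg JSON message received from the queue
     */
    public void formatNovel(String msg) {
        Gson gson = new Gson();
        // Parse the JSON message into a bean
        MessageBean bean = gson.fromJson(msg, new TypeToken<MessageBean>(){}.getType());
        String url = "http://localhost:8080/process/process/processnovel?";
        // Forward the bean's fields to the data-cleaning server via HTTP GET
        String result = HttpRequestUtil.sendGet(url, bean, "UTF-8");
        System.out.println("Result returned by the processing service: " + result);
    }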
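The source of `HttpRequestUtil.sendGet` is not shown in this post. As a rough idea of what such a utility has to do before opening the connection, here is a sketch that appends URL-encoded parameters to a base URL. `QuerySketch` and `buildUrl` are hypothetical names, and the sketch takes a `Map` instead of the `MessageBean` used above; how the real utility reads the bean's fields is not known from the post.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: builds a GET URL with URL-encoded query parameters. */
final class QuerySketch {

    private QuerySketch() {}

    /** Appends params to baseUrl, adding '?' or '&' only where needed. */
    static String buildUrl(String baseUrl, Map<String, String> params, String charset) {
        StringBuilder sb = new StringBuilder(baseUrl);
        for (Map.Entry<String, String> e : params.entrySet()) {
            char last = sb.length() == 0 ? 0 : sb.charAt(sb.length() - 1);
            if (sb.indexOf("?") < 0) {
                sb.append('?');                 // no query part yet
            } else if (last != '?' && last != '&') {
                sb.append('&');                 // separate from the previous parameter
            }
            sb.append(encode(e.getKey(), charset))
              .append('=')
              .append(encode(e.getValue(), charset));
        }
        return sb.toString();
    }

    private static String encode(String value, String charset) {
        try {
            return URLEncoder.encode(value, charset);
        } catch (UnsupportedEncodingException ex) {
            throw new IllegalArgumentException("Unsupported charset: " + charset, ex);
        }
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<String, String>();
        params.put("id", "42");
        params.put("title", "a b&c");
        System.out.println(buildUrl(
                "http://localhost:8080/process/process/processnovel?", params, "UTF-8"));
    }
}
```

The resulting string could then be passed to `java.net.HttpURLConnection` (or any HTTP client) to issue the actual request; that part is left out to keep the sketch short.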
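Data-cleaning server (process): cleans the data (removes garbled and illegal characters from the chapter text) and, once cleaning is complete, saves the result to the database (the data server).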
    @RequestMapping(value = "processnovel")
    @ResponseBody
    public String processNovel(HttpServletRequest request, HttpServletResponse response, String id) {
        MessageBean<NovelProcess> bean = new MessageBean<NovelProcess>();
        Gson gson = new Gson();
        try {
            NovelInitial novelInitial = (NovelInitial) novelService.findById(NovelInitial.class, id);
            // Process the article: strip special characters.
            // Blob positions are 1-based, so reading starts at position 1.
            String content = new String(novelInitial.getContent().getBytes((long) 1, (int) novelInitial.getContent().length()));
            System.out.println("Before processing: " + content);
            String processContent = content.replace("?", "");
            System.out.println("After processing: " + processContent);

            // Save into the new table
            NovelProcess process = new NovelProcess();
            process.setId(novelInitial.getId());
            process.setCurrentUrl(novelInitial.getCurrentUrl());
            @SuppressWarnings("deprecation")
            Blob createBlob = Hibernate.createBlob(processContent.getBytes());
            process.setContent(createBlob);
            process.setName("斗破苍穹");
            process.setFlag(novelInitial.getFlag());
            process.setNextUrl(novelInitial.getNextUrl());
            process.setTitle(novelInitial.getTitle());
            novelService.save(process);

            bean.setCode("0");
            bean.setMessage("ProcessSuccess");
            // If MessageBean generates private static final long serialVersionUID = 1L
            String json = gson.toJson(bean, new TypeToken<MessageBean<NovelProcess>>(){}.getType());
            return json;
        } catch (SQLException e) {
            bean.setCode("1");
            // The original set "ProcessSuccess" here too, which contradicts error code "1"
            bean.setMessage("ProcessFailed");
            // If MessageBean generates private static final long serialVersionUID = 1L
            String json = gson.toJson(bean, new TypeToken<MessageBean<NovelProcess>>(){}.getType());
            return json;
        }
    }
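The cleaning step above removes every `?`, which also deletes legitimate question marks from the text. As an alternative sketch (a different technique from the one in the post, not the author's method), the code below keeps `?` and instead drops the Unicode replacement character U+FFFD, which Java inserts when bytes cannot be decoded, along with control characters other than newline and tab. `TextCleaner` is a hypothetical name.

```java
/** Hypothetical sketch: drops U+FFFD and control characters, keeps everything else. */
final class TextCleaner {

    private TextCleaner() {}

    static String clean(String raw) {
        StringBuilder sb = new StringBuilder(raw.length());
        for (int i = 0; i < raw.length(); ) {
            int cp = raw.codePointAt(i);
            i += Character.charCount(cp);   // advance by 1 or 2 chars (surrogate pairs)
            if (cp == '\n' || cp == '\t') {
                sb.appendCodePoint(cp);     // keep line breaks and tabs
            } else if (cp == 0xFFFD || Character.isISOControl(cp)) {
                continue;                   // decoding debris or a control character
            } else {
                sb.appendCodePoint(cp);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String dirty = "Is this clean?" + (char) 0xFFFD + (char) 7 + " Yes.";
        System.out.println(clean(dirty));
    }
}
```

If the text was already damaged when it was encoded (unmappable characters are written as literal `?` bytes by `getBytes()`), no cleaning pass can recover it; in that case using an explicit charset such as UTF-8 on both the write and read side is the more likely fix.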
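Application server: reads from the data server and exposes the data interfaces the client needs for display. The application servers use mod_jk and Tomcat for simple load balancing.
Everyone should be able to write this part on their own!
4. Summary
The code for all four servers is very simple. The slightly harder parts are writing the HTTP request utility class, then integrating SpringMVC with Hibernate, and finally reading and writing Blob data in Oracle through Hibernate. This is only a very simple example.
With a little evolution this program could become quite capable, for example by rewriting the spider service in Python or by replacing the data-cleaning server with an NLP service.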
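On the Blob reading and writing mentioned in the summary: the two details that tend to cause trouble are the 1-based position argument of `Blob.getBytes` and the charset used to convert between text and bytes. The sketch below shows the round trip with the JDK's `SerialBlob` standing in for the Hibernate/Oracle Blob, so it runs without a database. `BlobRoundTrip` is a hypothetical name, and choosing UTF-8 is an assumption; the original code relies on the platform default charset.

```java
import java.nio.charset.StandardCharsets;
import java.sql.Blob;
import java.sql.SQLException;
import javax.sql.rowset.serial.SerialBlob;

/** Hypothetical sketch: text to Blob and back, with an explicit charset. */
final class BlobRoundTrip {

    private BlobRoundTrip() {}

    /** Encodes text as UTF-8 and wraps it in an in-memory Blob. */
    static Blob toBlob(String text) {
        try {
            return new SerialBlob(text.getBytes(StandardCharsets.UTF_8));
        } catch (SQLException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Reads the whole Blob and decodes it as UTF-8. */
    static String fromBlob(Blob blob) {
        try {
            long length = blob.length();
            if (length == 0) {
                return "";              // position 1 does not exist in an empty Blob
            }
            // Blob positions start at 1, not 0
            byte[] bytes = blob.getBytes(1L, (int) length);
            return new String(bytes, StandardCharsets.UTF_8);
        } catch (SQLException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Blob blob = toBlob("Chapter 1");
        System.out.println(fromBlob(blob));
    }
}
```

Using the same explicit charset on both sides means the stored bytes do not depend on the default encoding of whichever server happens to write or read them.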
5. Source Code
http://download.csdn.net/download/u013292160/9799295