Writing a Stable Crawler with Spring Boot
Source: Internet · Editor: 程序博客网 · Published: 2024/06/05 21:54
1. Introduction
This article builds a stable crawler with Spring Boot. It covers three kinds of page data: static HTML (no JavaScript executed), responses from HTTP/HTTPS API endpoints, and JavaScript-rendered pages (which require the Chrome browser). MySQL is used for storage. The program fetches page data on a schedule, parses it, and writes it to a MySQL database, using Baidu Gushitong (百度股市通) as the example target.
2. Creating the project
The project is developed in IDEA. First create a Spring Boot project with Group set to com.crawler and Artifact set to example, as shown in Figure 1.
Check the Web module.
Set the project name to example.
3. Data to crawl and table structure
1. Crawl the summary data from Baidu Gushitong at https://gupiao.baidu.com/concept/
2. Fetch today's price data for a single stock:
https://gupiao.baidu.com/api/stocks/stocktimeline?from=pc&os_ver=1&cuid=xxx&vv=100&format=json&stock_code=sh600358&timestamp=
The summary data includes hot concepts, driving events, and the associated stock data. It maps to the following MySQL table:
CREATE TABLE `baidu_hot` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `title_line1` varchar(255) DEFAULT NULL,
  `title_line2` int(11) DEFAULT NULL,
  `title_line3` varchar(255) DEFAULT NULL,
  `title_line4` varchar(255) DEFAULT NULL,
  `dirver_thing` text,
  `hot_stock_name_1` varchar(255) DEFAULT NULL,
  `hot_stock_code_1` varchar(11) DEFAULT NULL,
  `hot_stock_price_1` double DEFAULT NULL,
  `hot_stock_increment_1` varchar(20) DEFAULT NULL,
  `hot_stock_name_2` varchar(255) DEFAULT NULL,
  `hot_stock_code_2` varchar(11) DEFAULT NULL,
  `hot_stock_price_2` double DEFAULT NULL,
  `hot_stock_increment_2` varchar(20) DEFAULT NULL,
  `hot_stock_name_3` varchar(255) DEFAULT NULL,
  `hot_stock_code_3` varchar(11) NOT NULL,
  `hot_stock_price_3` double DEFAULT NULL,
  `hot_stock_increment_3` varchar(20) DEFAULT NULL,
  `insert_time` timestamp NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=0 DEFAULT CHARSET=utf8;
The Baidu Gushitong API responses are stored directly as JSON; the table looks like this:
CREATE TABLE `stock` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `stock_id` varchar(30) DEFAULT NULL,
  `data` json DEFAULT NULL,
  `insert_time` timestamp NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=3 DEFAULT CHARSET=utf8;
The database is named:
baidugushi
4. Spring Boot project configuration
4.1. Logging configuration
Logging uses Logback, producing one info-level and one error-level log file per day. The configuration file is named logback-spring.xml; placing it in the resources directory is enough for it to take effect:
<?xml version="1.0" encoding="utf-8"?>
<configuration>
    <appender name="consoleLog" class="ch.qos.logback.core.ConsoleAppender">
        <layout class="ch.qos.logback.classic.PatternLayout">
            <pattern>%d - %msg%n</pattern>
        </layout>
    </appender>
    <appender name="fileInfoLog" class="ch.qos.logback.core.rolling.RollingFileAppender">
        <filter class="ch.qos.logback.classic.filter.LevelFilter">
            <level>ERROR</level>
            <onMatch>DENY</onMatch>
            <onMismatch>ACCEPT</onMismatch>
        </filter>
        <encoder>
            <pattern>%msg%n</pattern>
        </encoder>
        <!-- Rolling policy: a new file per day -->
        <rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
            <!-- Log path (Windows) -->
            <fileNamePattern>D:\logs\example.info.%d.log</fileNamePattern>
            <!--<fileNamePattern>/data/log/crawler.info.%d.log</fileNamePattern>-->
        </rollingPolicy>
    </appender>
    <appender name="fileErrorLog" class="ch.qos.logback.core.rolling.RollingFileAppender">
        <filter class="ch.qos.logback.classic.filter.ThresholdFilter">
            <level>ERROR</level>
        </filter>
        <encoder>
            <pattern>%msg%n</pattern>
        </encoder>
        <!-- Rolling policy: a new file per day -->
        <rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
            <!-- Log path (Windows) -->
            <fileNamePattern>D:\logs\example.error.%d.log</fileNamePattern>
            <!--<fileNamePattern>/data/log/crawler.error.%d.log</fileNamePattern>-->
        </rollingPolicy>
    </appender>
    <root level="info">
        <appender-ref ref="consoleLog"/>
        <appender-ref ref="fileInfoLog"/>
        <appender-ref ref="fileErrorLog"/>
    </root>
</configuration>
4.2. MySQL configuration
First create the dirver, entity, map, and web packages under the project's example package, as shown below.
MyBatis is used to connect to the MySQL database.
Before configuring it, add the following Maven dependencies:
<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>6.0.6</version>
</dependency>
<dependency>
    <groupId>org.mybatis.spring.boot</groupId>
    <artifactId>mybatis-spring-boot-starter</artifactId>
    <version>1.1.1</version>
</dependency>
The application.properties file contains:
mybatis.type-aliases-package=com.crawler.example.entity
spring.datasource.driverClassName=com.mysql.cj.jdbc.Driver
# local debugging
spring.datasource.url=jdbc:mysql://127.0.0.1:3306/baidugushi?useUnicode=true&characterEncoding=UTF-8&useSSL=false&autoReconnect=true&serverTimezone=UTC
spring.datasource.username=root
spring.datasource.password=pwd
mybatis.type-aliases-package points MyBatis at the package containing the entity classes that map to database tables; change the username and password to match your setup.
4.3. Tomcat setup
The project is released as a war package, since a jar is less convenient to deploy and manage.
Edit the project's pom.xml and change the packaging to war:
<packaging>war</packaging>
Remove the embedded Tomcat and add the necessary dependencies:
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-web</artifactId>
    <exclusions>
        <exclusion>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-tomcat</artifactId>
        </exclusion>
    </exclusions>
</dependency>
<dependency>
    <groupId>javax.servlet</groupId>
    <artifactId>javax.servlet-api</artifactId>
    <version>3.1.0</version>
    <scope>provided</scope>
</dependency>
After editing pom.xml, a servlet initializer class must also be added; ExampleApplication is the startup class that Spring Boot generated:
package com.crawler.example;

import org.springframework.boot.builder.SpringApplicationBuilder;
import org.springframework.boot.web.support.SpringBootServletInitializer;

public class SpringBootStartApplication extends SpringBootServletInitializer {

    @Override
    protected SpringApplicationBuilder configure(SpringApplicationBuilder builder) {
        // Point this at the Application class that previously ran via main()
        return builder.sources(ExampleApplication.class);
    }
}
4.4. Debugging the project in IDEA
Tomcat must first be installed on the machine. Then open Run > Edit Configurations in IDEA, delete the default Spring Boot run configuration, and create a new Tomcat configuration as shown in the two screenshots below. Uncheck the "After launch" option, otherwise a browser window opens every time the project starts.
In the screenshot above, the main setting is the location of the Tomcat installation.
The screenshot above configures the war package deployed at startup.
5. Crawling static page data
5.1. Writing the scheduled task
The crawler fetches page data on a schedule, and HTML parsing is done with jsoup. jsoup is a Java HTML parser that can fetch a URL directly or parse HTML text, and it offers a very convenient API for extracting and manipulating data through the DOM, CSS selectors, and jQuery-like methods.
The pom dependency to add is:
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.10.3</version>
</dependency>
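As a quick illustration of the jQuery-style selector API described above, here is a standalone sketch (not part of the project; the HTML string and class names are made up for the example):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupSelectorDemo {

    // Extract the own text of the first element matching a CSS selector.
    public static String firstText(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        return doc.select(cssQuery).first().ownText();
    }

    public static void main(String[] args) {
        String html = "<div class='hot'><span class='name'>XYZ</span><span class='price'>10.5</span></div>";
        System.out.println(firstText(html, "span.name"));  // prints XYZ
        System.out.println(firstText(html, "span.price")); // prints 10.5
    }
}
```

The same `select` calls work whether the document came from `Jsoup.connect(url).get()` or from `Jsoup.parse(htmlString)`, which is why the extraction code later in this article can be reused for both the static and the Chrome-rendered pages.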
The page download code:
package com.crawler.example.web;

import com.crawler.example.dirver.BaiDuHotProcess;
import com.crawler.example.entity.BaiduHot;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import com.crawler.example.map.BaiduHotMap;
import org.slf4j.Logger;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.web.bind.annotation.RestController;

import java.io.IOException;
import java.util.ArrayList;

// Downloads the Baidu Gushitong hot-concepts page
@RestController
public class BaiduHotDown {

    public static Logger logger;

    @Autowired
    BaiduHotMap baiduHotMap;

    @Scheduled(cron = "0/20 * * * * ? ")
    public void downBaiduHot() {
        String url = "https://gupiao.baidu.com/concept/";
        try {
            // Download and parse the page
            Document doc = Jsoup.connect(url).get();
            // Extract the hot-concept entries
            ArrayList<BaiduHot> abh = new BaiDuHotProcess().processBaiduHot(doc);
            for (BaiduHot b : abh) {
                baiduHotMap.InsertBaiduHot(b);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
5.2. The data extraction code:
package com.crawler.example.dirver;

import com.crawler.example.entity.BaiduHot;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;

public class BaiDuHotProcess {

    public ArrayList<BaiduHot> processBaiduHot(Document doc) {
        ArrayList<BaiduHot> abh = new ArrayList<BaiduHot>();
        // Extract the data
        Elements divsBig = doc.getElementsByClass("hot-concept clearfix");
        for (int i = 0; i < divsBig.size(); i++) {
            BaiduHot baiduHot = new BaiduHot();
            // Concept header block
            Elements cloumn1 = divsBig.get(i).getElementsByClass("concept-header column1");
            // Concept name
            baiduHot.title_line1 = cloumn1.get(0).getElementsByClass("text-ellipsis").get(0).ownText();
            // Hot-search index
            baiduHot.title_line2 = Integer.parseInt(cloumn1.get(0).getElementsByTag("h3").get(0).getElementsByTag("span").get(0).ownText());
            // Publish time
            baiduHot.title_line3 = cloumn1.get(0).getElementsByTag("p").get(0).ownText();
            // Short summary
            baiduHot.title_line4 = cloumn1.get(0).getElementsByTag("p").get(1).ownText();
            // Driving-event description
            baiduHot.dirver_thing = divsBig.get(i).getElementsByClass("concept-event column3").get(0).ownText();
            // Recommended stocks
            Elements stockUl = divsBig.get(i).getElementsByClass("no-click");
            // Stock 1: name, code, price, change
            baiduHot.hot_stock_name_1 = stockUl.get(0).getElementsByTag("a").get(0).getElementsByTag("div").get(0).ownText();
            baiduHot.hot_stock_code_1 = stockUl.get(0).getElementsByTag("a").get(0).getElementsByTag("div").get(1).ownText();
            baiduHot.hot_stock_price_1 = Double.parseDouble(stockUl.get(0).getElementsByClass("column2").get(1).ownText());
            baiduHot.hot_stock_increment_1 = stockUl.get(0).child(2).ownText();
            // Stock 2: name, code, price, change
            baiduHot.hot_stock_name_2 = stockUl.get(1).getElementsByTag("a").get(0).getElementsByTag("div").get(0).ownText();
            baiduHot.hot_stock_code_2 = stockUl.get(1).getElementsByTag("a").get(0).getElementsByTag("div").get(1).ownText();
            baiduHot.hot_stock_price_2 = Double.parseDouble(stockUl.get(1).getElementsByClass("column2").get(1).ownText());
            baiduHot.hot_stock_increment_2 = stockUl.get(1).child(2).ownText();
            // Stock 3: name, code, price, change
            baiduHot.hot_stock_name_3 = stockUl.get(2).getElementsByTag("a").get(0).getElementsByTag("div").get(0).ownText();
            baiduHot.hot_stock_code_3 = stockUl.get(2).getElementsByTag("a").get(0).getElementsByTag("div").get(1).ownText();
            baiduHot.hot_stock_price_3 = Double.parseDouble(stockUl.get(2).getElementsByClass("column2").get(1).ownText());
            baiduHot.hot_stock_increment_3 = stockUl.get(2).child(2).ownText();
            abh.add(baiduHot);
        }
        return abh;
    }
}
The schedule period is expressed as a cron expression; "0/20 * * * * ?" fires every 20 seconds (the fields are seconds, minutes, hours, day-of-month, month, and day-of-week). An online builder such as http://cron.qqe2.com/ can generate the expression you need with a few clicks.
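To make the field layout concrete, here is a small stdlib sketch (a hypothetical helper, not part of the project) that splits a six-field Quartz cron expression into named parts:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CronFields {

    // Split a six-field Quartz cron expression into named fields.
    public static Map<String, String> parse(String cron) {
        String[] names = {"seconds", "minutes", "hours", "day-of-month", "month", "day-of-week"};
        String[] parts = cron.trim().split("\\s+");
        Map<String, String> fields = new LinkedHashMap<>();
        for (int i = 0; i < names.length && i < parts.length; i++) {
            fields.put(names[i], parts[i]);
        }
        return fields;
    }

    public static void main(String[] args) {
        // "0/20 * * * * ?": starting at second 0, fire every 20 seconds
        System.out.println(parse("0/20 * * * * ?"));
    }
}
```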
The entity and database mapper classes are shown below.
package com.crawler.example.map;

import com.crawler.example.entity.BaiduHot;
import org.apache.ibatis.annotations.Insert;
import org.apache.ibatis.annotations.Mapper;

// Inserts Baidu hot-concept records
@Mapper
public interface BaiduHotMap {

    @Insert("insert into baidu_hot(title_line1,title_line2,title_line3,title_line4,dirver_thing,hot_stock_name_1," +
            "hot_stock_code_1,hot_stock_price_1,hot_stock_increment_1,hot_stock_name_2,hot_stock_code_2,hot_stock_price_2," +
            "hot_stock_increment_2,hot_stock_name_3,hot_stock_code_3,hot_stock_price_3,hot_stock_increment_3) values(" +
            "#{title_line1},#{title_line2},#{title_line3},#{title_line4},#{dirver_thing},#{hot_stock_name_1},#{hot_stock_code_1}," +
            "#{hot_stock_price_1},#{hot_stock_increment_1},#{hot_stock_name_2},#{hot_stock_code_2},#{hot_stock_price_2},#{hot_stock_increment_2}," +
            "#{hot_stock_name_3},#{hot_stock_code_3},#{hot_stock_price_3},#{hot_stock_increment_3})")
    public void InsertBaiduHot(BaiduHot baiduHot);
}
The entity class:
package com.crawler.example.entity;

public class BaiduHot implements Cloneable {
    public int id;
    public String title_line1;
    public int title_line2;
    public String title_line3;
    public String title_line4;
    public String dirver_thing;
    public String hot_stock_name_1;
    public String hot_stock_code_1;
    public double hot_stock_price_1;
    public String hot_stock_increment_1;
    public String hot_stock_name_2;
    public String hot_stock_code_2;
    public double hot_stock_price_2;
    public String hot_stock_increment_2;
    public String hot_stock_name_3;
    public String hot_stock_code_3;
    public double hot_stock_price_3;
    public String hot_stock_increment_3;
}
Add the @EnableScheduling annotation to the startup class.
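Assuming the generated startup class is ExampleApplication (as used in the servlet initializer earlier), with scheduling enabled it might look like this sketch:

```java
package com.crawler.example;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.scheduling.annotation.EnableScheduling;

@SpringBootApplication
@EnableScheduling // without this, @Scheduled methods are never triggered
public class ExampleApplication {

    public static void main(String[] args) {
        SpringApplication.run(ExampleApplication.class, args);
    }
}
```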
The data finally obtained looks like this:
6. Crawling data from an HTTP API
6.1. Fetching a stock's data for the current day
Add the org.json dependency used to parse the responses:
<dependency>
    <groupId>org.json</groupId>
    <artifactId>json</artifactId>
    <version>20160810</version>
</dependency>
The source code is below; the scheduled task downloads today's price data for stock sh600358:
package com.crawler.example.web;

import com.crawler.example.dirver.GetJson;
import com.crawler.example.entity.StockPrice;
import com.crawler.example.map.StockPriceMap;
import org.json.JSONObject;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.web.bind.annotation.RestController;

// Queries a stock's price
@RestController
public class BaiduStockPrice {

    @Autowired
    StockPriceMap stockPriceMap;

    // Download the intraday price curve
    //@Scheduled(cron = "0/20 * * * * ? ")
    public void downStockPrice() {
        // Build the URL
        String url = "https://gupiao.baidu.com/api/stocks/stocktimeline?from=pc&os_ver=1&cuid=xxx&vv=100&format=json&stock_code=sh600358&timestamp="
                + System.currentTimeMillis();
        // Fetch the JSON payload
        JSONObject stock = new GetJson().getHttpJson(url, 1);
        StockPrice stockPrice = new StockPrice();
        stockPrice.stock_id = "sh600358";
        stockPrice.data = stock.toString();
        // Store the JSON in the database
        stockPriceMap.insertIntoStock(stockPrice);
    }
}
The HTTP download class fetches a URL and returns the JSON:
package com.crawler.example.dirver;

import org.json.JSONObject;

import javax.net.ssl.HttpsURLConnection;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;

public class GetJson {

    public JSONObject getHttpJson(String url, int comefrom) {
        try {
            URL realUrl = new URL(url);
            HttpURLConnection connection = (HttpURLConnection) realUrl.openConnection();
            connection.setRequestProperty("accept", "*/*");
            connection.setRequestProperty("connection", "Keep-Alive");
            connection.setRequestProperty("user-agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1;SV1)");
            // Open the actual connection
            connection.connect();
            // Request succeeded
            if (connection.getResponseCode() == 200) {
                InputStream is = connection.getInputStream();
                ByteArrayOutputStream baos = new ByteArrayOutputStream();
                // 10 MB buffer
                byte[] buffer = new byte[10485760];
                int len = 0;
                while ((len = is.read(buffer)) != -1) {
                    baos.write(buffer, 0, len);
                }
                String jsonString = baos.toString();
                baos.close();
                is.close();
                // Convert the body into a JSON object
                JSONObject jsonArray = getJsonString(jsonString, comefrom);
                return jsonArray;
            }
        } catch (MalformedURLException e) {
            e.printStackTrace();
        } catch (IOException ex) {
            ex.printStackTrace();
        }
        return null;
    }

    public JSONObject getHttpsJson(String url) {
        try {
            URL realUrl = new URL(url);
            HttpsURLConnection httpsConn = (HttpsURLConnection) realUrl.openConnection();
            httpsConn.setRequestProperty("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8");
            httpsConn.setRequestProperty("connection", "Keep-Alive");
            httpsConn.setRequestProperty("user-agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.62 Safari/537.36");
            httpsConn.setRequestProperty("Accept-Charset", "utf-8");
            httpsConn.setRequestProperty("contentType", "utf-8");
            httpsConn.connect();
            if (httpsConn.getResponseCode() == 200) {
                InputStream is = httpsConn.getInputStream();
                ByteArrayOutputStream baos = new ByteArrayOutputStream();
                // 10 MB buffer
                byte[] buffer = new byte[10485760];
                int len = 0;
                while ((len = is.read(buffer)) != -1) {
                    baos.write(buffer, 0, len);
                }
                String jsonString = baos.toString("utf-8");
                baos.close();
                is.close();
                return new JSONObject(jsonString);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        return null;
    }

    public JSONObject getJsonString(String str, int comefrom) {
        JSONObject jo = null;
        if (comefrom == 1) {
            return new JSONObject(str);
        } else if (comefrom == 2) {
            int indexStart = 0;
            // Find the opening parenthesis
            for (int i = 0; i < str.length(); i++) {
                if (str.charAt(i) == '(') {
                    indexStart = i;
                    break;
                }
            }
            String strNew = "";
            // Take everything between the parentheses
            for (int i = indexStart + 1; i < str.length() - 1; i++) {
                strNew += str.charAt(i);
            }
            return new JSONObject(strNew);
        }
        return jo;
    }
}
The second argument of getHttpJson indicates the response format: 1 means the body is plain JSON, and 2 means the JSON is wrapped in parentheses, JSONP-style, and must be unwrapped first.
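The parenthesis-unwrapping case can be sketched as a standalone helper (hypothetical, stdlib-only; unlike the character loop above, it uses lastIndexOf rather than assuming the wrapper ends at the last character):

```java
public class JsonpStrip {

    // Strip a JSONP-style wrapper such as callback({"a":1}) down to the JSON body.
    public static String stripWrapper(String s) {
        int start = s.indexOf('(');
        int end = s.lastIndexOf(')');
        if (start < 0 || end <= start) {
            return s; // no wrapper present, return the string unchanged
        }
        return s.substring(start + 1, end);
    }

    public static void main(String[] args) {
        System.out.println(stripWrapper("cb({\"price\":10.5})")); // prints {"price":10.5}
        System.out.println(stripWrapper("{\"raw\":true}"));       // prints {"raw":true}
    }
}
```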
6.2. The entity class and database mapper
The entity class:
package com.crawler.example.entity;

public class StockPrice {
    public String stock_id;
    public String data;

    public String getStock_id() {
        return stock_id;
    }

    public void setStock_id(String stock_id) {
        this.stock_id = stock_id;
    }

    public String getData() {
        return data;
    }

    public void setData(String data) {
        this.data = data;
    }
}
The database mapper:
package com.crawler.example.map;

import com.crawler.example.entity.StockPrice;
import org.apache.ibatis.annotations.Insert;
import org.apache.ibatis.annotations.Mapper;

@Mapper
public interface StockPriceMap {

    @Insert("insert into stock(stock_id,data) values(#{stock_id},#{data})")
    public void insertIntoStock(StockPrice stockPrice);
}
The final run result:
7. Crawling dynamic page data
7.1. Crawling pages that require rendering
Sometimes a page's data only appears after JavaScript rendering, or the site has anti-crawler measures; in those cases a real browser can be used to download the page. Such a crawler is easier to deploy on Windows Server, since installing Chrome on CentOS is fairly troublesome. With this approach you can scrape almost anything.
A Java library for driving Chrome is:
cdp4j - Chrome DevTools Protocol for Java: https://github.com/webfolderio/cdp4j
We again crawl the Baidu Gushitong hot-concepts page.
The crawling code:
package com.crawler.example.web;

import com.crawler.example.dirver.BaiDuHotProcess;
import com.crawler.example.entity.BaiduHot;
import com.crawler.example.map.BaiduHotMap;
import io.webfolder.cdp.Launcher;
import io.webfolder.cdp.session.Session;
import io.webfolder.cdp.session.SessionFactory;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.web.bind.annotation.RestController;

import java.util.ArrayList;

@RestController
public class BaiduHotDownChorme {

    @Autowired
    BaiduHotMap baiduHotMap;

    @Scheduled(cron = "0/20 * * * * ? ")
    public void downBaiDuHot() {
        ArrayList<String> command = new ArrayList<String>();
        // Run Chrome headless (no visible browser window)
        command.add("--headless");
        Launcher launcher = new Launcher();
        try (SessionFactory factory = launcher.launch(command);
             Session session = factory.create()) {
            session.navigate("https://gupiao.baidu.com/concept/");
            session.waitDocumentReady();
            // Grab the rendered HTML and hand it to the same Jsoup extraction code
            String content = (String) session.getContent();
            Document doc = Jsoup.parse(content);
            ArrayList<BaiduHot> abh = new BaiDuHotProcess().processBaiduHot(doc);
            for (BaiduHot b : abh) {
                baiduHotMap.InsertBaiduHot(b);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
These are the three ways of crawling page data in Java. The project source is on GitHub:
https://github.com/xiaoyangmoa/java-crawler