Writing a Stable Crawler with Spring Boot


1. Introduction

This article walks through building a stable crawler with Spring Boot. It covers three kinds of page data: static HTML (no JavaScript executed), HTTP/HTTPS API responses, and JavaScript-rendered pages (which require the Chrome browser). MySQL is used for storage. The program fetches page data on a schedule, parses it, and writes it into MySQL, using Baidu Gupiao (Baidu's stock market site) as the example.

2. Creating the Project

The project is developed in IDEA. First create a Spring Boot project with Group set to com.crawler and Artifact set to example, as shown in Figure 1.

Check the Web module.

Set the project name to example.

3. The Data to Crawl and Its Table Structure

1. Crawl the summary data from Baidu Gupiao at https://gupiao.baidu.com/concept/
2. Fetch today's price data for a single stock:

https://gupiao.baidu.com/api/stocks/stocktimeline?from=pc&os_ver=1&cuid=xxx&vv=100&format=json&stock_code=sh600358&timestamp=

The summary data includes the hot concepts, their driving events, and the associated stock figures. It maps to the following MySQL table:

CREATE TABLE `baidu_hot` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `title_line1` varchar(255) DEFAULT NULL,
  `title_line2` int(11) DEFAULT NULL,
  `title_line3` varchar(255) DEFAULT NULL,
  `title_line4` varchar(255) DEFAULT NULL,
  `dirver_thing` text,
  `hot_stock_name_1` varchar(255) DEFAULT NULL,
  `hot_stock_code_1` varchar(11) DEFAULT NULL,
  `hot_stock_price_1` double DEFAULT NULL,
  `hot_stock_increment_1` varchar(20) DEFAULT NULL,
  `hot_stock_name_2` varchar(255) DEFAULT NULL,
  `hot_stock_code_2` varchar(11) DEFAULT NULL,
  `hot_stock_price_2` double DEFAULT NULL,
  `hot_stock_increment_2` varchar(20) DEFAULT NULL,
  `hot_stock_name_3` varchar(255) DEFAULT NULL,
  `hot_stock_code_3` varchar(11) NOT NULL,
  `hot_stock_price_3` double DEFAULT NULL,
  `hot_stock_increment_3` varchar(20) DEFAULT NULL,
  `insert_time` timestamp NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=0 DEFAULT CHARSET=utf8;

The responses from the Baidu Gupiao API are stored directly as JSON; the table looks like this:

CREATE TABLE `stock` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `stock_id` varchar(30) DEFAULT NULL,
  `data` json DEFAULT NULL,
  `insert_time` timestamp NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=3 DEFAULT CHARSET=utf8;

The database is named baidugushi.

4. Spring Boot Project Configuration

4.1. Logging configuration
Logging uses logback, producing one info-level and one error-level log file per day. The configuration file is named logback-spring.xml; placing it under the resources directory is enough for it to take effect:

<?xml version="1.0" encoding="utf-8"?>
<configuration>
    <appender name="consoleLog" class="ch.qos.logback.core.ConsoleAppender">
        <layout class="ch.qos.logback.classic.PatternLayout">
            <pattern>%d - %msg%n</pattern>
        </layout>
    </appender>
    <appender name="fileInfoLog" class="ch.qos.logback.core.rolling.RollingFileAppender">
        <filter class="ch.qos.logback.classic.filter.LevelFilter">
            <level>ERROR</level>
            <onMatch>DENY</onMatch>
            <onMismatch>ACCEPT</onMismatch>
        </filter>
        <encoder>
            <pattern>%msg%n</pattern>
        </encoder>
        <!-- rolling policy -->
        <rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
            <!-- log file path (Windows) -->
            <fileNamePattern>D:\logs\example.info.%d.log</fileNamePattern>
            <!-- Linux: <fileNamePattern>/data/log/crawler.info.%d.log</fileNamePattern> -->
        </rollingPolicy>
    </appender>
    <appender name="fileErrorLog" class="ch.qos.logback.core.rolling.RollingFileAppender">
        <filter class="ch.qos.logback.classic.filter.ThresholdFilter">
            <level>ERROR</level>
        </filter>
        <encoder>
            <pattern>%msg%n</pattern>
        </encoder>
        <!-- rolling policy -->
        <rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
            <!-- log file path (Windows) -->
            <fileNamePattern>D:\logs\example.error.%d.log</fileNamePattern>
            <!-- Linux: <fileNamePattern>/data/log/crawler.error.%d.log</fileNamePattern> -->
        </rollingPolicy>
    </appender>
    <root level="info">
        <appender-ref ref="consoleLog"/>
        <appender-ref ref="fileInfoLog"/>
        <appender-ref ref="fileErrorLog"/>
    </root>
</configuration>

4.2. MySQL configuration
First create the dirver, entity, map, and web packages under the project's example package (the spelling dirver matches the package name used throughout the source), as shown in the figure.
MyBatis is used to connect to MySQL.
Before configuring it, add the following Maven dependencies:

        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>6.0.6</version>
        </dependency>
        <dependency>
            <groupId>org.mybatis.spring.boot</groupId>
            <artifactId>mybatis-spring-boot-starter</artifactId>
            <version>1.1.1</version>
        </dependency>

The contents of application.properties:

mybatis.type-aliases-package=com.crawler.example.entity
spring.datasource.driverClassName = com.mysql.cj.jdbc.Driver
# local machine debugging
spring.datasource.url = jdbc:mysql://127.0.0.1:3306/baidugushi?useUnicode=true&characterEncoding=UTF-8&useSSL=false&autoReconnect=true&serverTimezone=UTC
spring.datasource.username = root
spring.datasource.password = pwd

mybatis.type-aliases-package tells MyBatis which package holds the classes that correspond to the database tables; change the username and password to your own.
4.3. Tomcat setup
The project is released as a war package; a jar is less convenient to deploy and manage.
Edit the project's pom.xml and change the packaging to war:

    <packaging>war</packaging>

Exclude the embedded Tomcat and add the required dependencies:

        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
            <exclusions>
                <exclusion>
                    <groupId>org.springframework.boot</groupId>
                    <artifactId>spring-boot-starter-tomcat</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
        <dependency>
            <groupId>javax.servlet</groupId>
            <artifactId>javax.servlet-api</artifactId>
            <version>3.1.0</version>
            <scope>provided</scope>
        </dependency>

Besides editing pom.xml, a servlet initializer class must be added; ExampleApplication here is the startup class Spring Boot generated:

package com.crawler.example;

import org.springframework.boot.builder.SpringApplicationBuilder;
import org.springframework.boot.web.support.SpringBootServletInitializer;

public class SpringBootStartApplication extends SpringBootServletInitializer {

    @Override
    protected SpringApplicationBuilder configure(SpringApplicationBuilder builder) {
        // Point to the Application class that the main() method normally starts
        return builder.sources(ExampleApplication.class);
    }
}

4.4. Debugging the project in IDEA
Tomcat must be installed on the machine. In IDEA open Run → Edit Configurations…, delete the default Spring Boot run configuration, and create a new Tomcat configuration as shown in the two screenshots below. Untick the "After launch" checkbox, otherwise a browser window opens every time the project starts.
In the screenshot above, the main thing to configure is the Tomcat location.
The screenshot above shows the war artifact deployed at startup.

5. Crawling Static Page Data

5.1. Writing the scheduled task
The crawler fetches page data on a schedule. HTML parsing uses Jsoup, a Java HTML parser that can work directly from a URL or from raw HTML text. It offers a very convenient API for extracting and manipulating data through DOM traversal, CSS selectors, and jQuery-like methods.
The pom dependency to add:

        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>1.10.3</version>
        </dependency>

The page download component:

package com.crawler.example.web;

import com.crawler.example.dirver.BaiDuHotProcess;
import com.crawler.example.entity.BaiduHot;
import com.crawler.example.map.BaiduHotMap;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.web.bind.annotation.RestController;

import java.io.IOException;
import java.util.ArrayList;

// Downloads the Baidu stock market hot-concept page
@RestController
public class BaiduHotDown {

    private static final Logger logger = LoggerFactory.getLogger(BaiduHotDown.class);

    @Autowired
    BaiduHotMap baiduHotMap;

    @Scheduled(cron = "0/20 * * * * ? ")
    public void downBaiduHot() {
        String url = "https://gupiao.baidu.com/concept/";
        try {
            Document doc = Jsoup.connect(url).get();
            ArrayList<BaiduHot> abh = new BaiDuHotProcess().processBaiduHot(doc);
            for (BaiduHot b : abh) {
                baiduHotMap.InsertBaiduHot(b);
            }
        } catch (IOException e) {
            logger.error("failed to download " + url, e);
        }
    }
}

5.2. The data extraction and parsing code:

package com.crawler.example.dirver;

import com.crawler.example.entity.BaiduHot;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

import java.util.ArrayList;

public class BaiDuHotProcess {

    public ArrayList<BaiduHot> processBaiduHot(Document doc) {
        ArrayList<BaiduHot> abh = new ArrayList<BaiduHot>();
        // extract each hot-concept block
        Elements divsBig = doc.getElementsByClass("hot-concept clearfix");
        for (int i = 0; i < divsBig.size(); i++) {
            BaiduHot baiduHot = new BaiduHot();
            // header column of the concept
            Elements cloumn1 = divsBig.get(i).getElementsByClass("concept-header column1");
            // concept title
            baiduHot.title_line1 = cloumn1.get(0).getElementsByClass("text-ellipsis").get(0).ownText();
            // hot-search index
            baiduHot.title_line2 = Integer.parseInt(cloumn1.get(0).getElementsByTag("h3").get(0).getElementsByTag("span").get(0).ownText());
            // publication time
            baiduHot.title_line3 = cloumn1.get(0).getElementsByTag("p").get(0).ownText();
            // short summary
            baiduHot.title_line4 = cloumn1.get(0).getElementsByTag("p").get(1).ownText();
            // driving-event description
            baiduHot.dirver_thing = divsBig.get(i).getElementsByClass("concept-event column3").get(0).ownText();
            // recommended stocks
            Elements stockUl = divsBig.get(i).getElementsByClass("no-click");
            // stock 1: name, code, price, change
            baiduHot.hot_stock_name_1 = stockUl.get(0).getElementsByTag("a").get(0).getElementsByTag("div").get(0).ownText();
            baiduHot.hot_stock_code_1 = stockUl.get(0).getElementsByTag("a").get(0).getElementsByTag("div").get(1).ownText();
            baiduHot.hot_stock_price_1 = Double.parseDouble(stockUl.get(0).getElementsByClass("column2").get(1).ownText());
            baiduHot.hot_stock_increment_1 = stockUl.get(0).child(2).ownText();
            // stock 2: name, code, price, change
            baiduHot.hot_stock_name_2 = stockUl.get(1).getElementsByTag("a").get(0).getElementsByTag("div").get(0).ownText();
            baiduHot.hot_stock_code_2 = stockUl.get(1).getElementsByTag("a").get(0).getElementsByTag("div").get(1).ownText();
            baiduHot.hot_stock_price_2 = Double.parseDouble(stockUl.get(1).getElementsByClass("column2").get(1).ownText());
            baiduHot.hot_stock_increment_2 = stockUl.get(1).child(2).ownText();
            // stock 3: name, code, price, change
            baiduHot.hot_stock_name_3 = stockUl.get(2).getElementsByTag("a").get(0).getElementsByTag("div").get(0).ownText();
            baiduHot.hot_stock_code_3 = stockUl.get(2).getElementsByTag("a").get(0).getElementsByTag("div").get(1).ownText();
            baiduHot.hot_stock_price_3 = Double.parseDouble(stockUl.get(2).getElementsByClass("column2").get(1).ownText());
            baiduHot.hot_stock_increment_3 = stockUl.get(2).child(2).ownText();
            abh.add(baiduHot);
        }
        return abh;
    }
}

The schedule is expressed as a cron expression; an online builder such as http://cron.qqe2.com/ lets you generate the expression you want with a few clicks.
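The first field of the expression used above, 0/20, is a start/step pattern over the seconds of each minute. A small stdlib-only sketch illustrates which seconds such a field selects (the class and method names are hypothetical, purely for illustration; Spring does the real parsing):

```java
import java.util.ArrayList;
import java.util.List;

public class CronSeconds {
    // Expand a cron seconds field such as "0/20" (start/step), "*" or a
    // literal like "5" into the seconds of each minute it selects.
    static List<Integer> firingSeconds(String field) {
        List<Integer> out = new ArrayList<>();
        if (field.contains("/")) {
            String[] parts = field.split("/");
            int start = Integer.parseInt(parts[0]);
            int step = Integer.parseInt(parts[1]);
            for (int s = start; s < 60; s += step) {
                out.add(s);
            }
        } else if (field.equals("*")) {
            for (int s = 0; s < 60; s++) {
                out.add(s);
            }
        } else {
            out.add(Integer.parseInt(field));
        }
        return out;
    }

    public static void main(String[] args) {
        // "0/20 * * * * ?" fires at seconds 0, 20 and 40 of every minute
        System.out.println(firingSeconds("0/20")); // [0, 20, 40]
    }
}
```

So "0/20 * * * * ?" runs the task every 20 seconds.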
The entity and database mapper classes are as follows.

package com.crawler.example.map;

import com.crawler.example.entity.BaiduHot;
import org.apache.ibatis.annotations.Insert;
import org.apache.ibatis.annotations.Mapper;

// Mapper for the Baidu hot-concept data
@Mapper
public interface BaiduHotMap {

    @Insert("insert into baidu_hot(title_line1,title_line2,title_line3,title_line4,dirver_thing,hot_stock_name_1," +
            "hot_stock_code_1,hot_stock_price_1,hot_stock_increment_1,hot_stock_name_2,hot_stock_code_2,hot_stock_price_2," +
            "hot_stock_increment_2,hot_stock_name_3,hot_stock_code_3,hot_stock_price_3,hot_stock_increment_3) values(" +
            "#{title_line1},#{title_line2},#{title_line3},#{title_line4},#{dirver_thing},#{hot_stock_name_1},#{hot_stock_code_1}," +
            "#{hot_stock_price_1},#{hot_stock_increment_1},#{hot_stock_name_2},#{hot_stock_code_2},#{hot_stock_price_2},#{hot_stock_increment_2}," +
            "#{hot_stock_name_3},#{hot_stock_code_3},#{hot_stock_price_3},#{hot_stock_increment_3})")
    public void InsertBaiduHot(BaiduHot baiduHot);
}

The entity class:

package com.crawler.example.entity;

public class BaiduHot implements Cloneable {
    public int id;
    public String title_line1;
    public int title_line2;
    public String title_line3;
    public String title_line4;
    public String dirver_thing;
    public String hot_stock_name_1;
    public String hot_stock_code_1;
    public double hot_stock_price_1;
    public String hot_stock_increment_1;
    public String hot_stock_name_2;
    public String hot_stock_code_2;
    public double hot_stock_price_2;
    public String hot_stock_increment_2;
    public String hot_stock_name_3;
    public String hot_stock_code_3;
    public double hot_stock_price_3;
    public String hot_stock_increment_3;
}

Add the @EnableScheduling annotation to the startup class.
The resulting data looks like this:

6. Crawling HTTP API Data

6.1. Fetching one stock's data for the current day
First add the dependency used for JSON handling:

        <dependency>
            <groupId>org.json</groupId>
            <artifactId>json</artifactId>
            <version>20160810</version>
        </dependency>

The source code below downloads the day's data for stock sh600358 (uncomment @Scheduled to run it on a timer):

package com.crawler.example.web;

import com.crawler.example.dirver.GetJson;
import com.crawler.example.entity.StockPrice;
import com.crawler.example.map.StockPriceMap;
import org.json.JSONObject;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.web.bind.annotation.RestController;

// Fetches a stock's price data
@RestController
public class BaiduStockPrice {

    @Autowired
    StockPriceMap stockPriceMap;

    // Downloads the intraday price curve
    //@Scheduled(cron = "0/20 * * * * ? ")
    public void downStockPrice() {
        // build the URL
        String url = "https://gupiao.baidu.com/api/stocks/stocktimeline?from=pc&os_ver=1&cuid=xxx&vv=100&format=json&stock_code=sh600358&timestamp=" + System.currentTimeMillis();
        // fetch the JSON payload
        JSONObject stock = new GetJson().getHttpJson(url, 1);
        StockPrice stockPrice = new StockPrice();
        stockPrice.stock_id = "sh600358";
        stockPrice.data = stock.toString();
        // store the JSON in the database
        stockPriceMap.insertIntoStock(stockPrice);
    }
}

The HTTP download code that returns JSON:

package com.crawler.example.dirver;

import org.json.JSONObject;

import javax.net.ssl.HttpsURLConnection;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;

public class GetJson {

    public JSONObject getHttpJson(String url, int comefrom) {
        try {
            URL realUrl = new URL(url);
            HttpURLConnection connection = (HttpURLConnection) realUrl.openConnection();
            connection.setRequestProperty("accept", "*/*");
            connection.setRequestProperty("connection", "Keep-Alive");
            connection.setRequestProperty("user-agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1;SV1)");
            // open the actual connection
            connection.connect();
            // request succeeded
            if (connection.getResponseCode() == 200) {
                InputStream is = connection.getInputStream();
                ByteArrayOutputStream baos = new ByteArrayOutputStream();
                // 10 MB buffer
                byte[] buffer = new byte[10485760];
                int len = 0;
                while ((len = is.read(buffer)) != -1) {
                    baos.write(buffer, 0, len);
                }
                String jsonString = baos.toString();
                baos.close();
                is.close();
                // convert to a JSON object
                return getJsonString(jsonString, comefrom);
            }
        } catch (MalformedURLException e) {
            e.printStackTrace();
        } catch (IOException ex) {
            ex.printStackTrace();
        }
        return null;
    }

    public JSONObject getHttpsJson(String url) {
        try {
            URL realUrl = new URL(url);
            HttpsURLConnection httpsConn = (HttpsURLConnection) realUrl.openConnection();
            httpsConn.setRequestProperty("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8");
            httpsConn.setRequestProperty("connection", "Keep-Alive");
            httpsConn.setRequestProperty("user-agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.62 Safari/537.36");
            httpsConn.setRequestProperty("Accept-Charset", "utf-8");
            httpsConn.setRequestProperty("contentType", "utf-8");
            httpsConn.connect();
            if (httpsConn.getResponseCode() == 200) {
                InputStream is = httpsConn.getInputStream();
                ByteArrayOutputStream baos = new ByteArrayOutputStream();
                // 10 MB buffer
                byte[] buffer = new byte[10485760];
                int len = 0;
                while ((len = is.read(buffer)) != -1) {
                    baos.write(buffer, 0, len);
                }
                String jsonString = baos.toString("utf-8");
                baos.close();
                is.close();
                return new JSONObject(jsonString);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        return null;
    }

    public JSONObject getJsonString(String str, int comefrom) {
        if (comefrom == 1) {
            // plain JSON
            return new JSONObject(str);
        } else if (comefrom == 2) {
            // JSONP: keep only the body inside the ( ) callback wrapper
            int indexStart = 0;
            for (int i = 0; i < str.length(); i++) {
                if (str.charAt(i) == '(') {
                    indexStart = i;
                    break;
                }
            }
            StringBuilder strNew = new StringBuilder();
            for (int i = indexStart + 1; i < str.length() - 1; i++) {
                strNew.append(str.charAt(i));
            }
            return new JSONObject(strNew.toString());
        }
        return null;
    }
}
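Both download methods above allocate a 10 MB buffer per request; a small reusable buffer gives the same result with far less memory, because ByteArrayOutputStream grows as needed. A minimal stdlib-only sketch (the class name StreamUtil is hypothetical):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class StreamUtil {
    // Read an InputStream to a String in 8 KB chunks; no single large
    // allocation is required regardless of the response size.
    static String readAll(InputStream is) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int len;
        while ((len = is.read(buffer)) != -1) {
            baos.write(buffer, 0, len);
        }
        return baos.toString(StandardCharsets.UTF_8.name());
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream("{\"ok\":true}".getBytes(StandardCharsets.UTF_8));
        System.out.println(readAll(in)); // {"ok":true}
    }
}
```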

The trailing parameter of getHttpJson indicates the response format: 1 means the body is plain JSON, 2 means the API wraps the JSON data in a ( ) callback.
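The comefrom == 2 case corresponds to JSONP responses of the form callback({...}). A slightly more defensive stdlib-only sketch of the same unwrapping step (the class name JsonpUnwrap is hypothetical):

```java
public class JsonpUnwrap {
    // Strip a JSONP wrapper such as cb({...}) down to the JSON body,
    // mirroring the comefrom == 2 branch of getJsonString. Input that
    // has no wrapper is returned unchanged.
    static String unwrap(String s) {
        int open = s.indexOf('(');
        int close = s.lastIndexOf(')');
        if (open < 0 || close <= open) {
            return s; // not wrapped
        }
        return s.substring(open + 1, close);
    }

    public static void main(String[] args) {
        System.out.println(unwrap("cb({\"price\":12.3})")); // {"price":12.3}
        System.out.println(unwrap("{\"price\":12.3}"));     // {"price":12.3}
    }
}
```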
6.2. The database entity and mapper classes
The database entity class:

package com.crawler.example.entity;

public class StockPrice {
    public String stock_id;
    public String data;

    public String getStock_id() {
        return stock_id;
    }

    public void setStock_id(String stock_id) {
        this.stock_id = stock_id;
    }

    public String getData() {
        return data;
    }

    public void setData(String data) {
        this.data = data;
    }
}

The database mapper:

package com.crawler.example.map;

import com.crawler.example.entity.StockPrice;
import org.apache.ibatis.annotations.Insert;
import org.apache.ibatis.annotations.Mapper;

@Mapper
public interface StockPriceMap {

    @Insert("insert into stock(stock_id,data) values(#{stock_id},#{data})")
    public void insertIntoStock(StockPrice stockPrice);
}

The final result:

7. Crawling Dynamic Page Data

7.1. Crawling pages that must be rendered

Sometimes a page's data only appears after JavaScript rendering, or the site has anti-crawler measures; in those cases a real browser can be used to download the page. Such a crawler is easier to deploy on Windows Server, since installing Chrome on CentOS is fairly difficult. With this approach almost any page can be captured.
The Java library for driving Chrome is:

cdp4j - Chrome DevTools Protocol for Java, available at https://github.com/webfolderio/cdp4j

Once again we crawl the Baidu Gupiao hot-concept page.
The crawling code:

package com.crawler.example.web;

import com.crawler.example.dirver.BaiDuHotProcess;
import com.crawler.example.entity.BaiduHot;
import com.crawler.example.map.BaiduHotMap;
import io.webfolder.cdp.Launcher;
import io.webfolder.cdp.session.Session;
import io.webfolder.cdp.session.SessionFactory;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.web.bind.annotation.RestController;

import java.util.ArrayList;

@RestController
public class BaiduHotDownChorme {

    @Autowired
    BaiduHotMap baiduHotMap;

    @Scheduled(cron = "0/20 * * * * ? ")
    public void downBaiDuHot() {
        ArrayList<String> command = new ArrayList<String>();
        // run Chrome without a visible window
        command.add("--headless");
        Launcher launcher = new Launcher();
        try (SessionFactory factory = launcher.launch(command);
             Session session = factory.create()) {
            session.navigate("https://gupiao.baidu.com/concept/");
            session.waitDocumentReady();
            String content = (String) session.getContent();
            Document doc = Jsoup.parse(content);
            ArrayList<BaiduHot> abh = new BaiDuHotProcess().processBaiduHot(doc);
            for (BaiduHot b : abh) {
                baiduHotMap.InsertBaiduHot(b);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Those are the three ways of crawling web data in Java. The project source is on GitHub:
https://github.com/xiaoyangmoa/java-crawler
