How to make a crawler run every day at a fixed time [using stock data as an example]

Source: Internet | Editor: 程序博客网 | Time: 2024/04/28 13:40

  • Analyzing the data to scrape
  • Packet capture
  • Project structure
  • model
  • main
  • util
  • parse
  • db
  • The problems
  • Solution
    • job
    • jobmain

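Recently, someone copied posts from my blog and uploaded them directly to platforms such as Baidu Wenku.
This post is original and is intended for technical study only. Copying it and uploading it to platforms such as Baidu Wenku without permission is prohibited. If you repost it, please cite the address (link) of this post.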

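Analyzing the data to scrape

This post uses data from Eastmoney (东方财富网) as its example. It is for technical study only; please do not abuse it. The data to be scraped is Eastmoney's automobile-sector and oil-sector data, at the following addresses:
http://quote.eastmoney.com/center/list.html#28002481_0_2
http://quote.eastmoney.com/center/list.html#28002464_0_2
The screenshot below shows the data format.

[Image: screenshot of the data format]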

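Packet capture

The first step in writing a crawler is to capture the network traffic, that is, to find the real address the data is requested from, as covered in my earlier posts. For the reasoning behind this design, see my column on crawler principles and fundamentals: http://blog.csdn.net/column/details/14269.html.
[Image: packet capture of the data request]

The capture above shows the real request address and the request method. The response is a JavaScript variable assignment that wraps an array of records (close to JSON, but not strict JSON), as shown below:
[Image: response body of the request]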

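Project structure

The project structure used in this post is shown below:
[Image: project package layout]

db: holds the database files. MyDataSource registers the database driver and sets the user name and password for the connection; MYSQLControl connects to the database and performs inserts, updates, table creation, and so on.

model: wraps the data as objects. Put plainly, it holds the attribute names of the data I want to work with. If this is unclear, see my earlier post on a simple web crawler (http://blog.csdn.net/qy20115549/article/details/52203722).

parse: parses the content fetched by util. HTML is usually parsed with Jsoup; JSON data can be parsed with regular expressions or with fastjson. I recommend fastjson because it is simple and fast to use.

main: the entry point and the core of the program. It fetches the data, executes the database statements, and stores the data.

job: the job to be executed.

jobmain: the controller, which decides when the job runs; in this post, once per trading day. The stock market closes at 3 p.m., so the job is scheduled for some time after 3 p.m. to crawl the day's stock data.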

model

model wraps the data I want to crawl, such as the current date, the stock ID, the stock name, and the stock price. The class is as follows:

package model;

/**
 * @author: Qian Yang (钱洋), School of Management, Hefei University of Technology
 * @email: 1563178220@qq.com
 */
public class ExtMarketOilStockModel {
    private String date;
    private String stock_id;
    private String stock_name;
    private float stock_price;
    private float stock_change;
    private float stock_range;
    private float stock_amplitude;
    private int stock_trading_number;
    private int stock_trading_value;
    private float stock_yesterdayfinish_price;
    private float stock_todaystart_price;
    private float stock_max_price;
    private float stock_min_price;
    private float stock_fiveminuate_change;
    private String craw_time;

    public String getDate() { return date; }
    public void setDate(String date) { this.date = date; }
    public String getStock_id() { return stock_id; }
    public void setStock_id(String stock_id) { this.stock_id = stock_id; }
    public String getStock_name() { return stock_name; }
    public void setStock_name(String stock_name) { this.stock_name = stock_name; }
    public float getStock_price() { return stock_price; }
    public void setStock_price(float stock_price) { this.stock_price = stock_price; }
    public float getStock_change() { return stock_change; }
    public void setStock_change(float stock_change) { this.stock_change = stock_change; }
    public float getStock_range() { return stock_range; }
    public void setStock_range(float stock_range) { this.stock_range = stock_range; }
    public float getStock_amplitude() { return stock_amplitude; }
    public void setStock_amplitude(float stock_amplitude) { this.stock_amplitude = stock_amplitude; }
    public int getStock_trading_number() { return stock_trading_number; }
    public void setStock_trading_number(int stock_trading_number) { this.stock_trading_number = stock_trading_number; }
    public int getStock_trading_value() { return stock_trading_value; }
    public void setStock_trading_value(int stock_trading_value) { this.stock_trading_value = stock_trading_value; }
    public float getStock_yesterdayfinish_price() { return stock_yesterdayfinish_price; }
    public void setStock_yesterdayfinish_price(float stock_yesterdayfinish_price) { this.stock_yesterdayfinish_price = stock_yesterdayfinish_price; }
    public float getStock_todaystart_price() { return stock_todaystart_price; }
    public void setStock_todaystart_price(float stock_todaystart_price) { this.stock_todaystart_price = stock_todaystart_price; }
    public float getStock_max_price() { return stock_max_price; }
    public void setStock_max_price(float stock_max_price) { this.stock_max_price = stock_max_price; }
    public float getStock_min_price() { return stock_min_price; }
    public void setStock_min_price(float stock_min_price) { this.stock_min_price = stock_min_price; }
    public float getStock_fiveminuate_change() { return stock_fiveminuate_change; }
    public void setStock_fiveminuate_change(float stock_fiveminuate_change) { this.stock_fiveminuate_change = stock_fiveminuate_change; }
    public String getCraw_time() { return craw_time; }
    public void setCraw_time(String craw_time) { this.craw_time = craw_time; }
}

main

The main method should be kept as simple as possible, so I wrote it as follows. The comments in the code explain each step.

package navi.main;

/**
 * @author: Qian Yang (钱洋), School of Management, Hefei University of Technology
 * @email: 1563178220@qq.com
 */
import java.util.ArrayList;
import java.util.List;
import db.MYSQLControl;
import model.ExtMarketOilStockModel;
import parse.ExtMarketOilStockParse;

public class ExtMarketOilStockMain {
    public static void main(String[] args) throws Exception {
        List<String> urloillist = new ArrayList<String>();
        List<String> urlcarlist = new ArrayList<String>();
        List<ExtMarketOilStockModel> oilstocks = new ArrayList<ExtMarketOilStockModel>();
        List<ExtMarketOilStockModel> carstocks = new ArrayList<ExtMarketOilStockModel>();
        // The oil-related stocks span only two pages, hence two addresses
        String url1 = "http://nufm.dfcfw.com/EM_Finance2014NumericApplication/JS.aspx?type=CT&cmd=C.BK04641&sty=FCOIATA&sortType=C&sortRule=-1&page=1&pageSize=20&js=var%20quote_123%3d{rank:[(x)],pages:(pc)}&token=7bc05d0d4c3c22ef9fca8c2a912d779c&jsName=quote_123&_g=0.13204790262127375";
        String url2 = "http://nufm.dfcfw.com/EM_Finance2014NumericApplication/JS.aspx?type=CT&cmd=C.BK04641&sty=FCOIATA&sortType=C&sortRule=-1&page=2&pageSize=20&js=var%20quote_123%3d{rank:[(x)],pages:(pc)}&token=7bc05d0d4c3c22ef9fca8c2a912d779c&jsName=quote_123&_g=0.6972178580603532";
        urloillist.add(url1);
        urloillist.add(url2);
        for (int i = 0; i < urloillist.size(); i++) {
            // Parse the URL
            oilstocks = ExtMarketOilStockParse.parseurl(urloillist.get(i));
            // Store the data of each page
            MYSQLControl.insertoilStocks(oilstocks);
        }
        // The automobile-related stocks span 6 pages, hence 6 addresses.
        // The bound is i <= 6; the original i < 6 skipped the sixth page.
        for (int i = 1; i <= 6; i++) {
            String urli = "http://nufm.dfcfw.com/EM_Finance2014NumericApplication/JS.aspx?type=CT&cmd=C.BK04811&sty=FCOIATA&sortType=C&sortRule=-1&page=" + i + "&pageSize=20&js=var%20quote_123%3d{rank:[(x)],pages:(pc)}&token=7bc05d0d4c3c22ef9fca8c2a912d779c&jsName=quote_123&_g=0.23492960370783944";
            urlcarlist.add(urli);
        }
        for (int i = 0; i < urlcarlist.size(); i++) {
            // Parse the URL
            carstocks = ExtMarketOilStockParse.parseurl(urlcarlist.get(i));
            // Store the data
            MYSQLControl.insertcarStocks(carstocks);
        }
    }
}
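The page counts in the main method (2 for oil, 6 for automobiles) are hard-coded, but the sample response shown in the parse section ends with `pages:6`. A minimal sketch of reading the count from the first page instead, assuming the response keeps that `pages:<n>` field; the class and method names here (`PageCountSketch`, `extractPages`) are illustrative and not part of the original project:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PageCountSketch {
    // Matches the "pages:<number>" field of the response body.
    private static final Pattern PAGES = Pattern.compile("pages:(\\d+)");

    /** Returns the page count reported in the response, or 1 if the field is absent. */
    static int extractPages(String response) {
        Matcher m = PAGES.matcher(response);
        return m.find() ? Integer.parseInt(m.group(1)) : 1;
    }

    public static void main(String[] args) {
        // Shortened sample in the same shape as the real response.
        String sample = "var quote_123={rank:[\"2,002662,X,15.62\"],pages:6}";
        System.out.println(extractPages(sample));
    }
}
```

With this, the crawler could fetch page 1, read the count, and then loop over the remaining pages, so a sector that grows or shrinks would not require a code change.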

util

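There are three files here: HTTPUtils, TimeUtils (a class I use often, mainly for date conversions such as String to Date and getting the current time), and UumericalUtil (a class that rounds a float to a given number of decimal places).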

package util;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

/**
 * @author: Qian Yang (钱洋), School of Management, Hefei University of Technology
 * @email: 1563178220@qq.com
 */
public abstract class HTTPUtils {
    // Requests data from the server and returns the HTML, JSON, etc. as a String
    public static String getRawHtml(String personalUrl) throws InterruptedException, IOException {
        URL url = new URL(personalUrl);
        URLConnection conn = url.openConnection();
        InputStream in = null;
        try {
            conn.setConnectTimeout(3000);
            in = conn.getInputStream();
        } catch (Exception e) {
            // Report the failure instead of swallowing it silently; an empty String is returned below
            e.printStackTrace();
        }
        // Convert the fetched data to a String
        return convertStreamToString(in);
    }

    // Converts an InputStream to a String (line breaks are dropped)
    public static String convertStreamToString(InputStream is) throws IOException {
        if (is == null)
            return "";
        StringBuilder sb = new StringBuilder();
        // try-with-resources closes the reader and the underlying stream
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(is, "utf-8"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        return sb.toString();
    }
}

The following class handles conversions between various time formats. Feel free to reuse it.

package util;

import java.text.DateFormat;
import java.text.DecimalFormat;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Calendar;
import java.util.Date;
import java.util.List;

/**
 * @author: Qian Yang (钱洋), School of Management, Hefei University of Technology
 * @email: 1563178220@qq.com
 */
public class TimeUtils {
    public static void main(String[] args) throws ParseException {
        String time = getMonth("2002-1-08 14:50:38");
        System.out.println(time);
        System.out.println(getDay("2002-1-08 14:50:38"));
        System.out.println(TimeUtils.parseTime("2016-05-19 19:17", "yyyy-MM-dd HH:mm"));
    }

    // Returns the current time formatted with the given pattern
    public static String GetNowDate(String formate) {
        return new SimpleDateFormat(formate).format(new Date());
    }

    // Returns the "yyyy-MM" part of a time string, or null if it cannot be parsed
    public static String getMonth(String time) {
        SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM");
        try {
            return sdf.format(sdf.parse(time));
        } catch (ParseException e) {
            e.printStackTrace();
            return null;
        }
    }

    // Returns the "yyyy-MM-dd" part of a time string, or null if it cannot be parsed
    public static String getDay(String time) {
        SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd");
        try {
            return sdf.format(sdf.parse(time));
        } catch (ParseException e) {
            e.printStackTrace();
            return null;
        }
    }

    public static Date parseTime(String inputTime) throws ParseException {
        return new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").parse(inputTime);
    }

    public static String dateToString(Date date, String type) {
        DateFormat df = new SimpleDateFormat(type);
        return df.format(date);
    }

    public static Date parseTime(String inputTime, String timeFormat) throws ParseException {
        return new SimpleDateFormat(timeFormat).parse(inputTime);
    }

    public static Calendar parseTimeToCal(String inputTime, String timeFormat) throws ParseException {
        Calendar calendar = Calendar.getInstance();
        calendar.setTime(new SimpleDateFormat(timeFormat).parse(inputTime));
        return calendar;
    }

    public static int getDaysBetweenCals(Calendar cal1, Calendar cal2) throws ParseException {
        return (int) ((cal2.getTimeInMillis() - cal1.getTimeInMillis()) / (1000 * 24 * 3600));
    }

    public static Date parseTime(long inputTime) {
        return new Date(inputTime);
    }

    public static String parseTimeString(long inputTime) {
        return new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(new Date(inputTime));
    }

    // Converts "yyyyMMddHHmmss" to "yyyy-MM-dd HH:mm:ss"; returns null if parsing fails
    public static String parseStringTime(String inputTime) {
        String date = null;
        try {
            Date date1 = new SimpleDateFormat("yyyyMMddHHmmss").parse(inputTime);
            date = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(date1);
        } catch (ParseException e) {
            e.printStackTrace();
        }
        return date;
    }

    // Returns the 12 months of a year as "yyyyMM" strings
    public static List<String> YearMonth(int year) {
        List<String> yearmouthlist = new ArrayList<String>();
        DecimalFormat dfInt = new DecimalFormat("00");
        for (int i = 1; i < 13; i++) {
            yearmouthlist.add(year + dfInt.format(i));
        }
        return yearmouthlist;
    }

    // Returns every month from startyear to finistyear (inclusive) as "yyyy-MM" strings
    public static List<String> YearMonth(int startyear, int finistyear) {
        List<String> yearmouthlist = new ArrayList<String>();
        DecimalFormat dfInt = new DecimalFormat("00");
        for (int i = startyear; i < finistyear + 1; i++) {
            for (int j = 1; j < 13; j++) {
                yearmouthlist.add(i + "-" + dfInt.format(j));
            }
        }
        return yearmouthlist;
    }

    // Returns every day of the given year as "yyyy-MM-dd" strings.
    // The original loop advanced the calendar before adding, so the 1st of each month was skipped.
    public static List<String> TOAllDay(int year) {
        List<String> daylist = new ArrayList<String>();
        SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd");
        for (int month = 1; month < 13; month++) {
            Calendar cal = Calendar.getInstance();
            cal.clear();
            cal.set(Calendar.YEAR, year);
            cal.set(Calendar.MONTH, month - 1); // Calendar months start at 0
            cal.set(Calendar.DAY_OF_MONTH, 1);
            int count = cal.getActualMaximum(Calendar.DAY_OF_MONTH);
            for (int day = 1; day <= count; day++) {
                cal.set(Calendar.DAY_OF_MONTH, day);
                daylist.add(sdf.format(cal.getTime()));
            }
        }
        return daylist;
    }

    // Returns yesterday's date (the original pattern had a stray trailing space)
    public static String getyesterday() {
        Calendar cal = Calendar.getInstance();
        cal.add(Calendar.DATE, -1);
        return new SimpleDateFormat("yyyy-MM-dd").format(cal.getTime());
    }
}
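TimeUtils is built on `SimpleDateFormat` and `Calendar`. On Java 8 or later, the same conversions can be written with `java.time`, whose formatters are immutable and thread-safe. The following is only a sketch of a few equivalents, not the author's code; the class and method names (`JavaTimeSketch`, `formatDay`, `yesterdayOf`, and so on) are made up for illustration:

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

class JavaTimeSketch {
    private static final DateTimeFormatter DAY = DateTimeFormatter.ofPattern("yyyy-MM-dd");
    private static final DateTimeFormatter SECOND = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    /** Formats a date as "yyyy-MM-dd" (comparable to GetNowDate("yyyy-MM-dd")). */
    static String formatDay(LocalDate day) {
        return day.format(DAY);
    }

    /** Returns the day before the given date as "yyyy-MM-dd" (comparable to getyesterday). */
    static String yesterdayOf(LocalDate day) {
        return day.minusDays(1).format(DAY);
    }

    /** Formats a timestamp as "yyyy-MM-dd HH:mm:ss". */
    static String formatTimestamp(LocalDateTime time) {
        return time.format(SECOND);
    }

    /** Parses a "yyyy-MM-dd HH:mm:ss" string (comparable to parseTime(String)). */
    static LocalDateTime parseTimestamp(String text) {
        return LocalDateTime.parse(text, SECOND);
    }

    public static void main(String[] args) {
        System.out.println(formatDay(LocalDate.now()));
        System.out.println(yesterdayOf(LocalDate.now()));
    }
}
```

Passing the date in as a parameter, rather than reading the clock inside the method, also makes the helpers easy to test with fixed dates.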

This class rounds a value to a given number of decimal places, for example keeping two decimal places for a stock price.

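package util;

/**
 * @author: Qian Yang (钱洋), School of Management, Hefei University of Technology
 * @email: 1563178220@qq.com
 */
import java.math.BigDecimal;
import java.math.RoundingMode;
import java.text.DecimalFormat;

public class UumericalUtil {
    // Rounds f half-up to the given number of decimal places.
    // Float.toString is used so rounding works on the decimal text of f,
    // not on its exact binary expansion (which new BigDecimal(f) would give).
    public static float FloatTO(float f, int number) {
        BigDecimal b = new BigDecimal(Float.toString(f));
        return b.setScale(number, RoundingMode.HALF_UP).floatValue();
    }

    // Formats an integer with at least two digits, e.g. 3 -> "03"
    public static String NumberTO(int number) {
        DecimalFormat dfInt = new DecimalFormat("00");
        String sInt = dfInt.format(number);
        System.out.println(sInt);
        return sInt;
    }
}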
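The parser later converts percentage fields such as `2.49%` to fractions by parsing a float, multiplying by 0.01, and rounding. Going through `float` introduces binary rounding error before the rounding step. A minimal sketch of doing the conversion directly in decimal with `BigDecimal`, as one possible alternative; `PercentSketch` and `percentToFraction` are hypothetical names, not part of the original project:

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

class PercentSketch {
    /** Converts text such as "2.49%" to the fraction 0.0249, rounded half-up to the given scale. */
    static BigDecimal percentToFraction(String text, int scale) {
        // Parse the decimal digits exactly; no float is involved.
        BigDecimal percent = new BigDecimal(text.replace("%", "").trim());
        // Dividing by 100 is an exact decimal-point shift.
        return percent.movePointLeft(2).setScale(scale, RoundingMode.HALF_UP);
    }

    public static void main(String[] args) {
        System.out.println(percentToFraction("2.49%", 4));
        System.out.println(percentToFraction("-0.20%", 4));
    }
}
```

If the database column is a DECIMAL type, the `BigDecimal` can be passed to JDBC as is; calling `floatValue()` at the end would bring back the float representation used by the model class.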

parse

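parse mainly uses Jsoup or other tools to parse HTML. The parsed data is wrapped in a List and passed back up to the main method. Here only the simplest approach, string splitting, is used. Below is the data for one page; the parser targets data of this form: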

var quote_123={rank:["2,002662,京威股份,15.62,0.38,2.49%,2.95,10294,15948185,15.24,15.28,15.65,15.20,-,-,-,-,-,-,-,-,0.00%,0.62,0.17,33.47","2,002536,西泵股份,13.15,0.32,2.49%,3.74,26558,34710121,12.83,12.88,13.27,12.79,-,-,-,-,-,-,-,-,0.00%,0.99,0.87,41.09","1,600741,华域汽车,16.22,0.39,2.46%,2.59,215140,346480560,15.83,15.85,16.26,15.85,-,-,-,-,-,-,-,-,0.12%,1.23,0.75,8.59","1,601689,拓普集团,29.74,0.68,2.34%,3.20,36329,107964394,29.06,29.06,29.94,29.01,-,-,-,-,-,-,-,-,-0.20%,1.34,2.13,34.32","1,603306,华懋科技,33.87,0.74,2.23%,4.50,9251,31242113,33.13,33.14,34.20,32.71,-,-,-,-,-,-,-,-,-0.03%,0.72,1.25,29.60","1,601799,星宇股份,37.40,0.80,2.19%,3.80,5522,20477010,36.60,36.40,37.50,36.11,-,-,-,-,-,-,-,-,0.03%,0.86,0.23,28.43","1,603166,福达股份,14.02,0.29,2.11%,2.91,47265,66170428,13.73,13.80,14.14,13.74,-,-,-,-,-,-,-,-,0.21%,0.96,3.15,95.59","2,002190,成飞集成,32.44,0.66,2.08%,2.99,25213,81219488,31.78,31.63,32.58,31.63,-,-,-,-,-,-,-,-,0.03%,0.86,0.73,93.58","1,600213,亚星客车,14.77,0.30,2.07%,3.46,18878,27820060,14.47,14.52,14.88,14.38,-,-,-,-,-,-,-,-,-0.07%,0.64,0.86,55.39","2,300432,富临精工,21.28,0.43,2.06%,4.70,28707,60945368,20.85,20.60,21.58,20.60,-,-,-,-,-,-,-,-,-0.14%,1.29,2.07,50.58","2,300375,鹏翎股份,21.25,0.42,2.02%,3.94,11367,24164157,20.83,20.83,21.45,20.63,-,-,-,-,-,-,-,-,-0.14%,0.83,1.44,30.27","2,002363,隆基机械,11.47,0.22,1.96%,2.49,33946,38796837,11.25,11.27,11.55,11.27,-,-,-,-,-,-,-,-,0.00%,0.80,0.88,61.45","1,600469,风神股份,11.55,0.22,1.94%,3.09,38444,44305565,11.33,11.33,11.63,11.28,-,-,-,-,-,-,-,-,0.09%,0.67,0.68,27.07","2,002454,松芝股份,12.98,0.24,1.88%,2.83,27839,36056020,12.74,12.70,13.06,12.70,-,-,-,-,-,-,-,-,0.00%,1.17,0.87,25.84","2,002488,金固股份,14.79,0.27,1.86%,2.48,29002,42872475,14.52,14.52,14.88,14.52,-,-,-,-,-,-,-,-,0.00%,0.72,0.75,-","2,002284,亚太股份,13.18,0.24,1.85%,3.32,61756,81198133,12.94,12.87,13.30,12.87,-,-,-,-,-,-,-,-,0.30%,1.10,0.90,58.15","1,603788,宁波高发,35.97,0.64,1.81%,3.40,6719,24160418,35.33,35.21,36.33,35.13,-,-,-,-,-,-,-,-,0.03%,0.59,1.37,34.10","2,000957,中通客车,14.36,0.25,1.77%,2.69,59696,85581415,14.11,14.07,14.45,14.07,-,-,-,-,-,-,-,-,0.00%,0.79,1.25,13.99","2,300304,云意电气,52.12,0.90,1.76%,5.70,179330,922614032,51.22,50.38,52.83,49.91,-,-,-,-,-,-,-,-,-0.04%,1.12,9.35,108.58","2,002607,亚夏汽车,10.03,0.17,1.72%,4.16,27760,27878904,9.86,9.89,10.19,9.78,-,-,-,-,-,-,-,-,-0.30%,0.97,1.03,57.87"],pages:6}

package parse;

/**
 * @author: Qian Yang (钱洋), School of Management, Hefei University of Technology
 * @email: 1563178220@qq.com
 */
import java.util.ArrayList;
import java.util.List;
import model.ExtMarketOilStockModel;
import util.HTTPUtils;
import util.TimeUtils;
import util.UumericalUtil;

public class ExtMarketOilStockParse {
    public static List<ExtMarketOilStockModel> parseurl(String url) throws Exception {
        List<ExtMarketOilStockModel> list = new ArrayList<ExtMarketOilStockModel>();
        String html = HTTPUtils.getRawHtml(url);
        // An empty or unexpected response has no "rank:" array, so there is nothing to parse
        if (!html.contains("rank:")) {
            return list;
        }
        String jsonarra = html.split("rank:")[1].split(",pages")[0];
        String stocks[] = jsonarra.split("\",");
        for (int i = 0; i < stocks.length; i++) {
            String record = stocks[i].replace("[\"", "").replace("\"", "").replace("]", "");
            System.out.println(record);
            // Split each record once; the fields are comma-separated
            String[] f = record.split(",");
            if (f.length < 22) {
                continue; // skip empty or truncated records
            }
            String date = TimeUtils.GetNowDate("yyyy-MM-dd");
            String stock_id = f[1];
            String stock_name = f[2];
            float stock_price = 0;
            float stock_change = 0;
            float stock_range = 0;
            float stock_amplitude = 0;
            int stock_trading_number = 0;
            int stock_trading_value = 0;
            float stock_yesterdayfinish_price = 0;
            float stock_todaystart_price = 0;
            float stock_max_price = 0;
            float stock_min_price = 0;
            float stock_fiveminuate_change = 0;
            // "-" in the price field means the stock has no quote; keep the zero defaults
            if (!f[3].equals("-")) {
                // Price
                stock_price = Float.parseFloat(f[3]);
                // Price change (absolute)
                stock_change = Float.parseFloat(f[4]);
                System.out.println(stock_change);
                // Price change (percentage), stored as a fraction
                stock_range = UumericalUtil.FloatTO((float) (Float.parseFloat(f[5].replace("%", "")) * 0.01), 4);
                // Amplitude, stored as a fraction
                stock_amplitude = UumericalUtil.FloatTO((float) (Float.parseFloat(f[6].replace("%", "")) * 0.01), 4);
                // Trading volume and trading value
                stock_trading_number = Integer.parseInt(f[7].replace("%", ""));
                stock_trading_value = Integer.parseInt(f[8].replace("%", ""));
                // Previous close, open, high, low
                stock_yesterdayfinish_price = Float.parseFloat(f[9]);
                stock_todaystart_price = Float.parseFloat(f[10]);
                stock_max_price = Float.parseFloat(f[11]);
                stock_min_price = Float.parseFloat(f[12]);
                // Five-minute change, stored as a fraction
                stock_fiveminuate_change = UumericalUtil.FloatTO((float) (Float.parseFloat(f[21].replace("%", "")) * 0.01), 4);
                System.out.println(stock_fiveminuate_change);
            }
            String craw_time = TimeUtils.GetNowDate("yyyy-MM-dd HH:mm:ss");
            ExtMarketOilStockModel model = new ExtMarketOilStockModel();
            model.setDate(date);
            model.setStock_id(stock_id);
            model.setStock_name(stock_name);
            model.setStock_price(stock_price);
            model.setStock_change(stock_change);
            model.setStock_range(stock_range);
            model.setStock_amplitude(stock_amplitude);
            model.setStock_trading_number(stock_trading_number);
            model.setStock_trading_value(stock_trading_value);
            model.setStock_yesterdayfinish_price(stock_yesterdayfinish_price);
            model.setStock_todaystart_price(stock_todaystart_price);
            model.setStock_max_price(stock_max_price);
            model.setStock_min_price(stock_min_price);
            model.setStock_fiveminuate_change(stock_fiveminuate_change);
            model.setCraw_time(craw_time);
            list.add(model);
        }
        return list;
    }
}
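Two details of the string-splitting approach are worth isolating. First, the record boundaries can be found by splitting on the `","` sequence between records. Second, the trading value is parsed with `Integer.parseInt`, which throws `NumberFormatException` for any value above 2,147,483,647; the sample already contains 922614032, so a `long` leaves more headroom. The sketch below illustrates both under the assumption that the response keeps the shape shown above; `RankParseSketch` and its methods are illustrative names only:

```java
import java.util.ArrayList;
import java.util.List;

class RankParseSketch {
    /** Splits a body like var quote_123={rank:["a,b","c,d"],pages:6} into field arrays. */
    static List<String[]> splitRecords(String body) {
        List<String[]> records = new ArrayList<>();
        int start = body.indexOf('[');
        int end = body.lastIndexOf(']');
        if (start < 0 || end <= start) {
            return records; // no rank array present
        }
        String inner = body.substring(start + 1, end).trim();
        if (inner.isEmpty()) {
            return records; // empty rank array
        }
        // Records are quoted strings separated by commas: "r1","r2",...
        for (String quoted : inner.split("\",\"")) {
            records.add(quoted.replace("\"", "").split(","));
        }
        return records;
    }

    /** Parses a numeric field, treating the placeholder "-" as zero and ignoring a trailing '%'. */
    static float parseFloatOrZero(String field) {
        return "-".equals(field) ? 0f : Float.parseFloat(field.replace("%", ""));
    }

    /** Parses a whole-number field as long, treating the placeholder "-" as zero. */
    static long parseLongOrZero(String field) {
        return "-".equals(field) ? 0L : Long.parseLong(field);
    }
}
```

Switching the model's `stock_trading_value` to `long` would also require a matching column type in the database, which this sketch does not cover.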

db

db contains two Java files, MyDataSource and MYSQLControl. Their roles were described above.

package db;

/**
 * @author: Qian Yang (钱洋), School of Management, Hefei University of Technology
 * @email: 1563178220@qq.com
 */
import javax.sql.DataSource;
import org.apache.commons.dbcp2.BasicDataSource;

public class MyDataSource {
    public static DataSource getDataSource(String connectURI) {
        BasicDataSource ds = new BasicDataSource();
        // MySQL JDBC driver
        ds.setDriverClassName("com.mysql.jdbc.Driver");
        // MySQL user name
        ds.setUsername("root");
        // MySQL login password
        ds.setPassword("112233");
        // JDBC URL, which includes the name of the database to connect to
        ds.setUrl(connectURI);
        return ds;
    }
}

package db;

import java.sql.SQLException;
import java.util.List;
import javax.sql.DataSource;
import org.apache.commons.dbutils.QueryRunner;
import org.apache.commons.dbutils.ResultSetHandler;
import org.apache.commons.dbutils.handlers.BeanListHandler;
import org.apache.commons.dbutils.handlers.ColumnListHandler;
import org.apache.commons.dbutils.handlers.ScalarHandler;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import model.ExtMarketOilStockModel;

/**
 * @author: Qian Yang (钱洋), School of Management, Hefei University of Technology
 * @email: 1563178220@qq.com
 */
public class MYSQLControl {
    static final Log logger = LogFactory.getLog(MYSQLControl.class);
    static DataSource ds = MyDataSource.getDataSource("jdbc:mysql://127.0.0.1:3306/datacollection");
    static QueryRunner qr = new QueryRunner(ds);

    // Executes an update statement (insert, update, create table, etc.)
    public static void executeUpdate(String sql) {
        try {
            qr.update(sql);
        } catch (SQLException e) {
            logger.error(e);
        }
    }

    // Queries a single value by SQL
    public static Object getScalaBySQL(String sql) {
        ResultSetHandler<Object> h = new ScalarHandler<Object>(1);
        Object obj = null;
        try {
            obj = qr.query(sql, h);
        } catch (SQLException e) {
            e.printStackTrace();
        }
        return obj;
    }

    // Queries multiple rows by SQL and maps each row to a bean of the given type
    public static <T> List<T> getListInfoBySQL(String sql, Class<T> type) {
        List<T> list = null;
        try {
            list = qr.query(sql, new BeanListHandler<T>(type));
        } catch (SQLException e) {
            e.printStackTrace();
        }
        return list;
    }

    // Queries a single column
    public static List<Object> getListOneBySQL(String sql, String id) {
        List<Object> list = null;
        try {
            list = qr.query(sql, new ColumnListHandler<Object>(id));
        } catch (SQLException e) {
            e.printStackTrace();
        }
        return list;
    }

    // Inserts one batch of oil-sector stocks; returns the number of rows submitted
    public static int insertoilStocks(List<ExtMarketOilStockModel> oilstocks) {
        return insertStocks("ext_market_oil_stock", oilstocks, "Oil");
    }

    // Inserts one batch of automobile-sector stocks; returns the number of rows submitted
    public static int insertcarStocks(List<ExtMarketOilStockModel> carstocks) {
        return insertStocks("ext_market_car_stock", carstocks, "Automobile");
    }

    // Shared batch insert; both tables have the same 17-column layout.
    // This way of writing to the database could still be optimized.
    private static int insertStocks(String table, List<ExtMarketOilStockModel> stocks, String label) {
        Object[][] params = new Object[stocks.size()][17];
        for (int i = 0; i < stocks.size(); i++) {
            ExtMarketOilStockModel s = stocks.get(i);
            params[i][0] = s.getDate();
            params[i][1] = s.getStock_id();
            params[i][2] = s.getStock_name();
            params[i][3] = s.getStock_price();
            params[i][4] = s.getStock_change();
            params[i][5] = s.getStock_range();
            params[i][6] = s.getStock_amplitude();
            params[i][7] = s.getStock_trading_number();
            params[i][8] = s.getStock_trading_value();
            params[i][9] = s.getStock_yesterdayfinish_price();
            params[i][10] = s.getStock_todaystart_price();
            params[i][11] = s.getStock_max_price();
            params[i][12] = s.getStock_min_price();
            params[i][13] = s.getStock_fiveminuate_change();
            params[i][14] = s.getCraw_time();
            params[i][15] = null;
            params[i][16] = null;
        }
        int c = 0; // number of rows submitted successfully
        try {
            // Target table and data to insert
            int[] sum = qr.batch("INSERT INTO `datacollection`.`" + table + "` VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)", params);
            c = sum.length;
            System.out.println(label + " data has been stored");
        } catch (SQLException e) {
            System.out.println(e);
        }
        return c;
    }
}

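In principle the crawler is now complete, and running the main method is all that is needed. The figure below shows part of the data obtained by the main method.

[Image: partial output of the main method]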

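The problems

Problem 1: Stock data of this kind is published every Monday to Friday. How can the program fetch it automatically at a fixed time each day, instead of being run by hand every day?

Problem 2: The market does not open on public holidays, yet the page still displays data, and the displayed data carries no date label. How should this case be handled?

First, let me show you my database design.

[Image: database table design]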

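Solution

The first problem, running the program on a schedule, is solved with Quartz (http://blog.csdn.net/qy20115549/article/details/52723907).
The second problem is how to tell that the market was closed on a given day. The approach is to randomly draw three stocks from the database, taken from the most recent stored date. For example, if today is Saturday, January 21, three stocks from January 20 are drawn at random. Their prices are compared with today's prices for the same stock IDs. If all three prices are identical, the day is judged to be a non-trading day: prices have not moved, and nothing needs to be inserted into the database.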
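The comparison described above can be sketched in isolation, without the database or the network. This is only an illustration of the idea, assuming the three sampled prices and today's crawled prices are available as maps from stock ID to price; `HolidayCheckSketch` and `looksLikeNonTradingDay` are hypothetical names:

```java
import java.util.HashMap;
import java.util.Map;

class HolidayCheckSketch {
    /**
     * Returns true when every sampled stock appears in today's data with an unchanged price.
     * An empty sample (for example on the very first run) never counts as a non-trading day.
     */
    static boolean looksLikeNonTradingDay(Map<String, Float> sampled, Map<String, Float> today) {
        if (sampled.isEmpty()) {
            return false;
        }
        for (Map.Entry<String, Float> entry : sampled.entrySet()) {
            Float current = today.get(entry.getKey());
            if (current == null || Float.compare(current, entry.getValue()) != 0) {
                return false; // a missing stock or a changed price means the market traded
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, Float> sampled = new HashMap<>();
        sampled.put("600741", 16.22f);
        sampled.put("601689", 29.74f);
        sampled.put("603306", 33.87f);
        Map<String, Float> today = new HashMap<>(sampled);
        System.out.println(looksLikeNonTradingDay(sampled, today));
    }
}
```

The heuristic can misfire in the rare case where all three sampled stocks close at exactly the previous day's price on a real trading day (for instance, if all three are suspended); sampling more stocks lowers that risk.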

job

package job;

import java.util.ArrayList;
import java.util.List;
import org.quartz.Job;
import org.quartz.JobExecutionContext;
import org.quartz.JobExecutionException;
import db.MYSQLControl;
import model.ExtMarketOilStockModel;
import parse.ExtMarketOilStockParse;

/**
 * @author: Qian Yang (钱洋), School of Management, Hefei University of Technology
 * @email: 1563178220@qq.com
 */
public class ExtMarketOilStockJob implements Job {
    // Price differences below this tolerance count as "unchanged" (avoids == on floats)
    private static final float PRICE_TOLERANCE = 0.0001f;

    @Override
    public void execute(JobExecutionContext arg0) throws JobExecutionException {
        // Draw three random stocks from the most recent stored date; used to detect non-trading days
        List<ExtMarketOilStockModel> randomlist = MYSQLControl.getListInfoBySQL("select stock_id,stock_price,stock_change from ext_market_oil_stock where date = (select date from ext_market_oil_stock order by date desc limit 1) order by rand() limit 3", ExtMarketOilStockModel.class);
        List<String> urloillist = new ArrayList<String>();
        List<ExtMarketOilStockModel> oilstocks = new ArrayList<ExtMarketOilStockModel>();
        String url1 = "http://nufm.dfcfw.com/EM_Finance2014NumericApplication/JS.aspx?type=CT&cmd=C.BK04641&sty=FCOIATA&sortType=C&sortRule=-1&page=1&pageSize=20&js=var%20quote_123%3d{rank:[(x)],pages:(pc)}&token=7bc05d0d4c3c22ef9fca8c2a912d779c&jsName=quote_123&_g=0.13204790262127375";
        String url2 = "http://nufm.dfcfw.com/EM_Finance2014NumericApplication/JS.aspx?type=CT&cmd=C.BK04641&sty=FCOIATA&sortType=C&sortRule=-1&page=2&pageSize=20&js=var%20quote_123%3d{rank:[(x)],pages:(pc)}&token=7bc05d0d4c3c22ef9fca8c2a912d779c&jsName=quote_123&_g=0.6972178580603532";
        urloillist.add(url1);
        urloillist.add(url2);
        // Fetch every oil page first, so the comparison sees all stocks before anything is inserted.
        // (Inserting page by page could store page 1 on a holiday when a sampled stock sits on page 2.)
        for (int i = 0; i < urloillist.size(); i++) {
            try {
                oilstocks.addAll(ExtMarketOilStockParse.parseurl(urloillist.get(i)));
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
        // Count how many sampled stocks have the same price today
        int judge = 0;
        if (randomlist != null) {
            for (ExtMarketOilStockModel sample : randomlist) {
                for (ExtMarketOilStockModel stock : oilstocks) {
                    if (stock.getStock_id().equals(sample.getStock_id())
                            && Math.abs(stock.getStock_price() - sample.getStock_price()) < PRICE_TOLERANCE) {
                        judge++;
                    }
                }
            }
        }
        // All three sampled prices unchanged: treat today as a non-trading day and store nothing
        boolean holiday = randomlist != null && randomlist.size() == 3 && judge == 3;
        if (holiday) {
            return;
        }
        if (!oilstocks.isEmpty()) {
            MYSQLControl.insertoilStocks(oilstocks);
        }
        // The automobile sector has 6 pages
        for (int i = 1; i <= 6; i++) {
            String urli = "http://nufm.dfcfw.com/EM_Finance2014NumericApplication/JS.aspx?type=CT&cmd=C.BK04811&sty=FCOIATA&sortType=C&sortRule=-1&page=" + i + "&pageSize=20&js=var%20quote_123%3d{rank:[(x)],pages:(pc)}&token=7bc05d0d4c3c22ef9fca8c2a912d779c&jsName=quote_123&_g=0.23492960370783944";
            try {
                MYSQLControl.insertcarStocks(ExtMarketOilStockParse.parseurl(urli));
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
}

jobmain

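The schedule below runs the job every Monday to Friday at 20:39 (8:39 p.m.), so the data is fetched on every weekday after the market has closed.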

package jobmain;

import static org.quartz.CronScheduleBuilder.cronSchedule;
import static org.quartz.JobBuilder.newJob;
import static org.quartz.TriggerBuilder.newTrigger;
import java.text.SimpleDateFormat;
import java.util.Date;
import org.quartz.CronTrigger;
import org.quartz.JobDetail;
import org.quartz.Scheduler;
import org.quartz.SchedulerFactory;
import org.quartz.impl.StdSchedulerFactory;
import job.ExtMarketOilStockJob;

/**
 * @author: Qian Yang (钱洋), School of Management, Hefei University of Technology
 * @email: 1563178220@qq.com
 */
public class ExtMarketOilStockJobMain {
    public void go() throws Exception {
        // First, obtain a reference to a Scheduler
        SchedulerFactory sf = new StdSchedulerFactory();
        Scheduler sched = sf.getScheduler();
        // Jobs can be scheduled before sched.start() is called
        JobDetail job = newJob(ExtMarketOilStockJob.class).withIdentity("stockjob", "stockgroup").build();
        // Run the job every Monday to Friday at 20:39:00
        // Cron fields: second minute hour day-of-month month day-of-week
        CronTrigger trigger = newTrigger().withIdentity("stocktrigger", "stockgroup").withSchedule(cronSchedule("0 39 20 ? * MON-FRI")).build();
        Date ft = sched.scheduleJob(job, trigger);
        SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss SSS");
        System.out.println(job.getKey() + " is scheduled to run at: " + sdf.format(ft) + ", and repeats according to: " + trigger.getCronExpression());
        sched.start();
    }

    public static void main(String[] args) throws Exception {
        ExtMarketOilStockJobMain maingo = new ExtMarketOilStockJobMain();
        maingo.go();
    }
}

Run the class in jobmain, and the data will be crawled at the fixed time on every scheduled day.
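To make the meaning of the cron expression `0 39 20 ? * MON-FRI` concrete, the sketch below computes the next firing time by hand with `java.time`. It does not use Quartz and is not how Quartz is implemented; it only mirrors the rule "weekdays at 20:39". The names `NextRunSketch` and `nextRun` are made up for illustration:

```java
import java.time.DayOfWeek;
import java.time.LocalDateTime;
import java.time.LocalTime;

class NextRunSketch {
    // Matches the hour and minute fields of "0 39 20 ? * MON-FRI".
    static final LocalTime RUN_AT = LocalTime.of(20, 39);

    /** Returns the first Monday-to-Friday 20:39 that lies strictly after the given moment. */
    static LocalDateTime nextRun(LocalDateTime now) {
        LocalDateTime candidate = now.toLocalDate().atTime(RUN_AT);
        if (!candidate.isAfter(now)) {
            candidate = candidate.plusDays(1); // today's slot has already passed
        }
        while (candidate.getDayOfWeek() == DayOfWeek.SATURDAY
                || candidate.getDayOfWeek() == DayOfWeek.SUNDAY) {
            candidate = candidate.plusDays(1); // skip the weekend
        }
        return candidate;
    }

    public static void main(String[] args) {
        System.out.println(nextRun(LocalDateTime.now()));
    }
}
```

For example, a call made on the morning of Saturday, January 21, 2017 yields Monday, January 23, 2017 at 20:39. Public holidays that fall on weekdays are not excluded by the schedule, which is why the job itself performs the price comparison described in the solution section.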
