Scraping a University Information Portal with a Java Crawler (Java Crawler 04)


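I started learning about web crawlers last December, a full seven months ago. For most of that time I never really understood cookies or the HTTP protocol. After all this while I finally get them, and I finally managed to crawl my way in!
I only started learning HttpClient yesterday, so today I'll use it to practice on my university's information portal: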
http://myportal.sxu.edu.cn/login.portal


1. Capturing the traffic

The following information was captured with the Chrome browser's developer tools (shortcut: F12):

1. http://myportal.sxu.edu.cn/
    Request:
        Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
        Accept-Encoding: gzip, deflate, sdch
        Accept-Language: zh-CN,zh;q=0.8
        Connection: keep-alive
        Cookie: JSESSIONID=0000MS7su8CHOtDnUq6dxd7YGdB:1b4e17ihg
        Host: myportal.sxu.edu.cn
        Upgrade-Insecure-Requests: 1
        User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36
    Response:
        Cache-Control: no-cache="set-cookie, set-cookie2"
        Content-Language: zh-CN
        Content-Length: 8252
        Content-Type: text/html;charset=utf-8
        Date: Sun, 09 Jul 2017 09:04:57 GMT
        Expires: Thu, 01 Dec 1994 16:00:00 GMT
        Server: IBM_HTTP_Server
        Set-Cookie: iPlanetDirectoryPro=""; Expires=Thu, 01 Dec 1994 16:00:00 GMT; Path=/; Domain=.sxu.edu.cn
        Set-Cookie: JSESSIONID=0000pTnkBdxH-RxSCBTh6PQ_iqs:1b4e17ihg; Path=/

2. http://myportal.sxu.edu.cn/captchaGenerate.portal?s=0.5123204417293254
    Request:
        Accept: image/webp,image/*,*/*;q=0.8
        Accept-Encoding: gzip, deflate, sdch
        Accept-Language: zh-CN,zh;q=0.8
        Connection: keep-alive
        Cookie: JSESSIONID=0000pTnkBdxH-RxSCBTh6PQ_iqs:1b4e17ihg
        Host: myportal.sxu.edu.cn
        Referer: http://myportal.sxu.edu.cn/
        User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36

3. http://myportal.sxu.edu.cn/captchaValidate.portal?captcha=mb75&what=captcha&value=mb75&_=
    Request:
        Accept: text/javascript, text/html, application/xml, text/xml, */*
        Accept-Encoding: gzip, deflate, sdch
        Accept-Language: zh-CN,zh;q=0.8
        Connection: keep-alive
        Cookie: JSESSIONID=0000pTnkBdxH-RxSCBTh6PQ_iqs:1b4e17ihg
        Host: myportal.sxu.edu.cn
        Referer: http://myportal.sxu.edu.cn/
        User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36
        X-Prototype-Version: 1.5.0
        X-Requested-With: XMLHttpRequest

4. http://myportal.sxu.edu.cn/userPasswordValidate.portal
    POST request:
        Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
        Accept-Encoding: gzip, deflate
        Accept-Language: zh-CN,zh;q=0.8
        Cache-Control: max-age=0
        Connection: keep-alive
        Content-Length: 173
        Content-Type: application/x-www-form-urlencoded
        Cookie: JSESSIONID=0000pTnkBdxH-RxSCBTh6PQ_iqs:1b4e17ihg
        Host: myportal.sxu.edu.cn
        Origin: http://myportal.sxu.edu.cn
        Referer: http://myportal.sxu.edu.cn/
        Upgrade-Insecure-Requests: 1
        User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36
    Parameters:
        Login.Token1: 2014241032
        // password
        Login.Token2: **********
        goto: http://myportal.sxu.edu.cn/loginSuccess.portal
        gotoOnFail: http://myportal.sxu.edu.cn/loginFailure.portal
    Response:
        Cache-Control: no-cache
        Content-Language: zh-CN
        Content-Length: 83
        Content-Type: text/html; charset=UTF-8
        Date: Sun, 09 Jul 2017 09:12:08 GMT
        Expires: Thu, 01 Dec 1994 16:00:00 GMT
        Server: IBM_HTTP_Server
        Set-Cookie: iPlanetDirectoryPro=AQIC5wM2LY4Sfcy4g6cjY8hOCWulRTizy9EbNPUbXBB1hc0%3D%40AAJTSQACMDE%3D%23; Path=/; Domain=.sxu.edu.cn

5. http://myportal.sxu.edu.cn/index.portal
    Request:
        Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
        Accept-Encoding: gzip, deflate, sdch
        Accept-Language: zh-CN,zh;q=0.8
        Connection: keep-alive
        Cookie: JSESSIONID=0000pTnkBdxH-RxSCBTh6PQ_iqs:1b4e17ihg; iPlanetDirectoryPro=AQIC5wM2LY4Sfcy4g6cjY8hOCWulRTizy9EbNPUbXBB1hc0%3D%40AAJTSQACMDE%3D%23
        Host: myportal.sxu.edu.cn
        Referer: http://myportal.sxu.edu.cn/
        Upgrade-Insecure-Requests: 1
        User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36
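To make the Set-Cookie lines in the capture more concrete, here is a minimal sketch that parses two of the captured header values with the JDK's java.net.HttpCookie. The class name SetCookieDemo and the helper parseFirst are made up for illustration; the crawler itself leaves this parsing to HttpClient.

```java
import java.net.HttpCookie;
import java.util.List;

public class SetCookieDemo {

    /** Parses one Set-Cookie header value and returns the first cookie in it. */
    public static HttpCookie parseFirst(String headerValue) {
        List<HttpCookie> cookies = HttpCookie.parse(headerValue);
        return cookies.get(0);
    }

    public static void main(String[] args) {
        // Header values copied from responses 1 and 4 of the capture.
        HttpCookie session = parseFirst(
                "JSESSIONID=0000pTnkBdxH-RxSCBTh6PQ_iqs:1b4e17ihg; Path=/");
        HttpCookie token = parseFirst(
                "iPlanetDirectoryPro=AQIC5wM2LY4Sfcy4g6cjY8hOCWulRTizy9EbNPUbXBB1hc0%3D%40AAJTSQACMDE%3D%23;"
                + " Path=/; Domain=.sxu.edu.cn");

        System.out.println(session.getName() + " path=" + session.getPath());
        System.out.println(token.getName() + " domain=" + token.getDomain());

        // The Domain attribute is why a cookie scoped to .sxu.edu.cn is sent
        // back to myportal.sxu.edu.cn in request 5.
        System.out.println(HttpCookie.domainMatches(token.getDomain(), "myportal.sxu.edu.cn"));
    }
}
```

Note that the first response also clears iPlanetDirectoryPro by setting it to an empty value with an expiry date in the past; the real value only arrives after the login POST.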

2. Analysis

Judging from the capture above, the key to scraping the information portal is obtaining the following two cookies:

JSESSIONID
iPlanetDirectoryPro

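JSESSIONID is issued the first time the login page is requested, while iPlanetDirectoryPro is issued in the response to userPasswordValidate.portal.
The request to userPasswordValidate.portal must carry a JSESSIONID cookie, and it also needs four parameters, two of which are the credentials: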

// account
Login.Token1: 2014241032
// password
Login.Token2: **********

The other two parameters (goto and gotoOnFail) are copied verbatim from the capture.
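As a rough illustration of how these four fields end up in an application/x-www-form-urlencoded request body, the sketch below encodes them with the JDK's URLEncoder. The class FormBodyDemo and the method encodeForm are hypothetical names, and the password stays masked exactly as in the capture.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

public class FormBodyDemo {

    /** Joins the fields as name=value pairs separated by '&', percent-encoding both sides. */
    public static String encodeForm(Map<String, String> fields) {
        StringBuilder body = new StringBuilder();
        try {
            for (Map.Entry<String, String> field : fields.entrySet()) {
                if (body.length() > 0) {
                    body.append('&');
                }
                body.append(URLEncoder.encode(field.getKey(), "UTF-8"))
                    .append('=')
                    .append(URLEncoder.encode(field.getValue(), "UTF-8"));
            }
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a conforming JVM.
            throw new IllegalStateException(e);
        }
        return body.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the fields in the order seen in the capture.
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("Login.Token1", "2014241032");
        fields.put("Login.Token2", "**********"); // masked password, as in the capture
        fields.put("goto", "http://myportal.sxu.edu.cn/loginSuccess.portal");
        fields.put("gotoOnFail", "http://myportal.sxu.edu.cn/loginFailure.portal");
        System.out.println(encodeForm(fields));
    }
}
```

The two URLs are percent-encoded (":" becomes %3A and "/" becomes %2F), which is why the body is longer than the raw parameter values suggest.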

From this analysis, the crawler needs to make the following requests:
1. Request login.portal to obtain JSESSIONID
2. Request userPasswordValidate.portal to obtain iPlanetDirectoryPro
3. Scrape the data
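The three steps above boil down to a cookie jar that fills up as responses arrive. The sketch below simulates that offline with the JDK's java.net.CookieManager, with no network access involved. The class CookieFlowDemo and its methods are illustrative names, and the short cookie values are placeholders rather than real session data.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.URI;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CookieFlowDemo {

    // In-memory cookie jar that accepts every cookie it is shown.
    private final CookieManager jar = new CookieManager(null, CookiePolicy.ACCEPT_ALL);

    /** Feeds one Set-Cookie header value, received from the given URL, into the jar. */
    public void receive(String url, String setCookieValue) {
        Map<String, List<String>> headers =
                Collections.singletonMap("Set-Cookie", Collections.singletonList(setCookieValue));
        try {
            jar.put(URI.create(url), headers);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns the cookies the jar would attach to a request for the given URL. */
    public String cookieHeaderFor(String url) {
        try {
            Map<String, List<String>> out = jar.get(URI.create(url), new HashMap<>());
            List<String> cookies = out.get("Cookie");
            return cookies == null ? "" : String.join("; ", cookies);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        CookieFlowDemo flow = new CookieFlowDemo();
        // Step 1: the login page hands out the session cookie.
        flow.receive("http://myportal.sxu.edu.cn/login.portal", "JSESSIONID=abc; Path=/");
        // Step 2: the credential check hands out the login token.
        flow.receive("http://myportal.sxu.edu.cn/userPasswordValidate.portal",
                "iPlanetDirectoryPro=xyz; Path=/; Domain=.sxu.edu.cn");
        // Step 3: both cookies accompany the request for the data page.
        System.out.println(flow.cookieHeaderFor("http://myportal.sxu.edu.cn/index.portal"));
    }
}
```

HttpClient's BasicCookieStore plays the same role in the real crawler: once it is attached to the client, no cookie has to be copied by hand between requests.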

3. Writing the code

package info_system;

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

import org.apache.http.Header;
import org.apache.http.HeaderElement;
import org.apache.http.HeaderElementIterator;
import org.apache.http.HttpResponse;
import org.apache.http.NameValuePair;
import org.apache.http.client.CookieStore;
import org.apache.http.client.ResponseHandler;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.client.utils.URIBuilder;
import org.apache.http.conn.ConnectionKeepAliveStrategy;
import org.apache.http.impl.client.BasicCookieStore;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.message.BasicHeaderElementIterator;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.protocol.HTTP;
import org.apache.http.protocol.HttpContext;
import org.apache.http.util.EntityUtils;

import utils.ImageUtils; // only used by the captcha steps, which are commented out below

public class Test {

    public static final String host = "myportal.sxu.edu.cn";
    public static final String url1 = "/login.portal";
    public static final String url2 = "/captchaGenerate.portal";
    public static final String url3 = "/captchaValidate.portal";
    public static final String url4 = "/userPasswordValidate.portal";
    public static final String url5 = "/index.portal";

    public static void main(String[] args) throws URISyntaxException, IOException {
        // Honor the server's Keep-Alive "timeout" value; fall back to 10 seconds.
        ConnectionKeepAliveStrategy myStrategy = (HttpResponse response, HttpContext context) -> {
            HeaderElementIterator it =
                    new BasicHeaderElementIterator(response.headerIterator(HTTP.CONN_KEEP_ALIVE));
            while (it.hasNext()) {
                HeaderElement he = it.nextElement();
                String param = he.getName();
                String value = he.getValue();
                if (value != null && param.equalsIgnoreCase("timeout")) {
                    try {
                        return Long.parseLong(value) * 1000;
                    } catch (NumberFormatException ignore) {
                        // malformed timeout: fall through to the default
                    }
                }
            }
            return 10 * 1000;
        };

        // The cookie store collects JSESSIONID and iPlanetDirectoryPro automatically,
        // so no cookie has to be created by hand.
        CookieStore cookieStore = new BasicCookieStore();

        try (CloseableHttpClient httpclient = HttpClients.custom()
                .setDefaultCookieStore(cookieStore)
                .setKeepAliveStrategy(myStrategy)
                .build()) {

            // 1. Request the login page to obtain its cookie (JSESSIONID)
            URI uri1 = new URIBuilder()
                    .setScheme("http")
                    .setHost(host)
                    .setPath(url1)
                    .build();
            HttpGet httpGet = new HttpGet(uri1);
            ResponseHandler<Void> responseHandler = response -> {
                for (Header h : response.getAllHeaders()) {
                    System.out.println(h.getName() + " --> " + h.getValue());
                }
                return null;
            };
            httpclient.execute(httpGet, responseHandler);
            cookieStore.getCookies().forEach(e -> System.out.println(e));

            boolean b = false;
/*
            // 2. Request the captcha image
            URI uri2 = new URIBuilder()
                    .setScheme("http")
                    .setHost(host)
                    .setPath(url2)
                    .setParameter("s", "0.5123204417293254")
                    .build();
            HttpGet httpGet2 = new HttpGet(uri2);
            do {
                ResponseHandler<Boolean> responseHandler2 = response -> {
                    try {
                        ImageUtils.writeImg("test.jpg", response.getEntity().getContent());
                        return true;
                    } catch (Exception e) {
                        return false;
                    }
                };
                b = httpclient.execute(httpGet2, responseHandler2);
            } while (!b);

            // Type in the captcha by hand:
            @SuppressWarnings("resource")
            String captcha = new java.util.Scanner(System.in).nextLine();

            // 3. Ask the server to validate the captcha
            URI uri3 = new URIBuilder()
                    .setScheme("http")
                    .setHost(host)
                    .setPath(url3)
                    .setParameter("captcha", captcha)
                    .setParameter("what", "captcha")
                    .setParameter("value", captcha)
                    .setParameter("_", "")
                    .build();
            HttpGet httpGet3 = new HttpGet(uri3);
            final String error = "验证码非法"; // server's "invalid captcha" message
            ResponseHandler<Boolean> responseHandler3 = response -> {
                String s = EntityUtils.toString(response.getEntity(), StandardCharsets.UTF_8);
                System.out.println(s);
                return !s.equals(error);
            };
            b = httpclient.execute(httpGet3, responseHandler3);
            if (b)
                System.out.println("Captcha accepted");
*/
            // Pause briefly before sending the next request
            try {
                Thread.sleep(1000);
            } catch (InterruptedException e1) {
                e1.printStackTrace();
            }

            // 4. Validate the account and password.
            // The fields go in a form body, matching the captured
            // application/x-www-form-urlencoded POST.
            URI uri4 = new URIBuilder()
                    .setScheme("http")
                    .setHost(host)
                    .setPath(url4)
                    .build();
            List<NameValuePair> form = new ArrayList<>();
            form.add(new BasicNameValuePair("Login.Token1", "2014241032"));
            // this parameter is the password
            form.add(new BasicNameValuePair("Login.Token2", "**********"));
            form.add(new BasicNameValuePair("goto", "http://myportal.sxu.edu.cn/loginSuccess.portal"));
            form.add(new BasicNameValuePair("gotoOnFail", "http://myportal.sxu.edu.cn/loginFailure.portal"));
            HttpPost httpPost4 = new HttpPost(uri4);
            httpPost4.setEntity(new UrlEncodedFormEntity(form, StandardCharsets.UTF_8));
            ResponseHandler<Boolean> responseHandler4 = response -> {
                String s = EntityUtils.toString(response.getEntity(), StandardCharsets.UTF_8);
                System.out.println(s);
                // "用户不存在或密码错误" = "user does not exist or wrong password"
                return !s.contains("用户不存在或密码错误");
            };
            b = httpclient.execute(httpPost4, responseHandler4);
            if (b) {
                System.out.println("Credentials accepted");
            }

            // 5. Request the home page
            URI uri5 = new URIBuilder()
                    .setScheme("http")
                    .setHost(host)
                    .setPath(url5)
                    .build();
            HttpGet httpGet5 = new HttpGet(uri5);
            ResponseHandler<Boolean> responseHandler5 = response -> {
                String s = EntityUtils.toString(response.getEntity(), StandardCharsets.UTF_8);
                //System.out.println(s);
                // If the page still shows the captcha field, we were sent back to the login form.
                return !s.contains("<td class=\"STYLE1\">验证码:</td>");
            };
            b = httpclient.execute(httpGet5, responseHandler5);
            if (b) {
                System.out.println("Home page fetched successfully");
            } else {
                System.out.println("Failed to fetch the home page");
            }
        }
    }
}
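The keep-alive strategy in the listing reads the timeout element of the response's Keep-Alive header and falls back to 10 seconds. A dependency-free sketch of the same rule is given here, assuming a header value of the common form "timeout=5, max=100". That sample value is illustrative and does not appear in the capture, and KeepAliveDemo and keepAliveMillis are hypothetical names.

```java
public class KeepAliveDemo {

    /** Fallback used when the header is absent or has no usable timeout. */
    public static final long DEFAULT_MILLIS = 10_000L;

    /** Returns the keep-alive duration in milliseconds for a Keep-Alive header value. */
    public static long keepAliveMillis(String headerValue) {
        if (headerValue == null) {
            return DEFAULT_MILLIS;
        }
        // Elements are comma-separated, each of the form name=value.
        for (String element : headerValue.split(",")) {
            String[] pair = element.trim().split("=", 2);
            if (pair.length == 2 && pair[0].trim().equalsIgnoreCase("timeout")) {
                try {
                    // The header carries seconds; callers want milliseconds.
                    return Long.parseLong(pair[1].trim()) * 1000L;
                } catch (NumberFormatException ignore) {
                    // Malformed number: keep scanning, then use the default.
                }
            }
        }
        return DEFAULT_MILLIS;
    }

    public static void main(String[] args) {
        System.out.println(keepAliveMillis("timeout=5, max=100")); // prints 5000
        System.out.println(keepAliveMillis(null));                 // prints 10000
    }
}
```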

// Helper for saving the captcha image to a local file

package utils;

import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class ImageUtils {

    /**
     * Reads an image stream into a byte[] and closes the stream.
     * @param inStream the image stream to read
     * @return the bytes that were read
     * @throws IOException if reading fails
     */
    public static byte[] readImg(InputStream inStream) throws IOException {
        try (InputStream in = inStream;
             ByteArrayOutputStream outStream = new ByteArrayOutputStream()) {
            // Temporary buffer for each chunk read from the stream
            byte[] buffer = new byte[1024];
            // Number of bytes read in the current chunk; -1 means end of stream
            int len;
            while ((len = in.read(buffer)) != -1) {
                // Append the first len bytes of the buffer to the in-memory output
                outStream.write(buffer, 0, len);
            }
            return outStream.toByteArray();
        }
    }

    /**
     * Writes the image stream imgIs to the local file imgPath.
     * @param imgPath destination path; relative paths resolve against the working directory
     * @param imgIs the image stream to save
     * @throws IOException if reading or writing fails
     */
    public static void writeImg(String imgPath, InputStream imgIs) throws IOException {
        // Read the image as raw bytes, which works for any image format
        byte[] data = readImg(imgIs);
        File imageFile = new File(imgPath);
        // try-with-resources closes the file even if the write fails
        try (FileOutputStream outStream = new FileOutputStream(imageFile)) {
            outStream.write(data);
        }
    }
}
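As an aside, the same save-to-disk step can be done without a manual buffer loop by using java.nio.file.Files.copy. This is an alternative sketch, not what the code above does. SaveStreamDemo, save and roundTrip are hypothetical names, and roundTrip exists only to show the helper working against a temporary file.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class SaveStreamDemo {

    /** Copies the stream to the given path, replacing any existing file, and closes the stream. */
    public static long save(String path, InputStream in) {
        try (InputStream src = in) {
            return Files.copy(src, Paths.get(path), StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes the bytes to a temporary file through save() and reads them back. */
    public static byte[] roundTrip(byte[] data) {
        try {
            Path tmp = Files.createTempFile("captcha", ".jpg");
            try {
                save(tmp.toString(), new ByteArrayInputStream(data));
                return Files.readAllBytes(tmp);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] back = roundTrip(new byte[] {1, 2, 3});
        System.out.println(back.length + " bytes written and read back");
    }
}
```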