Crawling the university information portal in Java (Java crawler 04)
Source: Internet  Editor: 程序博客网 (Programmer Blog Network)  Date: 2024/05/01 04:13
I first touched web crawlers last December, a full seven months ago now, and in all that time I never quite understood cookies or the HTTP protocol. After all this while, I have finally figured them out, and finally crawled my way in!!!
Yesterday I started learning HttpClient; today let's use it for practice and crawl the school's information portal:
http://myportal.sxu.edu.cn/login.portal
1. Packet capture

The following information was captured with the Chrome browser's developer tools (shortcut F12):
1. http://myportal.sxu.edu.cn/
   Request:
   Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
   Accept-Encoding: gzip, deflate, sdch
   Accept-Language: zh-CN,zh;q=0.8
   Connection: keep-alive
   Cookie: JSESSIONID=0000MS7su8CHOtDnUq6dxd7YGdB:1b4e17ihg
   Host: myportal.sxu.edu.cn
   Upgrade-Insecure-Requests: 1
   User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36
   Response:
   Cache-Control: no-cache="set-cookie, set-cookie2"
   Content-Language: zh-CN
   Content-Length: 8252
   Content-Type: text/html;charset=utf-8
   Date: Sun, 09 Jul 2017 09:04:57 GMT
   Expires: Thu, 01 Dec 1994 16:00:00 GMT
   Server: IBM_HTTP_Server
   Set-Cookie: iPlanetDirectoryPro=""; Expires=Thu, 01 Dec 1994 16:00:00 GMT; Path=/; Domain=.sxu.edu.cn
   Set-Cookie: JSESSIONID=0000pTnkBdxH-RxSCBTh6PQ_iqs:1b4e17ihg; Path=/

2. http://myportal.sxu.edu.cn/captchaGenerate.portal?s=0.5123204417293254
   Request:
   Accept: image/webp,image/*,*/*;q=0.8
   Accept-Encoding: gzip, deflate, sdch
   Accept-Language: zh-CN,zh;q=0.8
   Connection: keep-alive
   Cookie: JSESSIONID=0000pTnkBdxH-RxSCBTh6PQ_iqs:1b4e17ihg
   Host: myportal.sxu.edu.cn
   Referer: http://myportal.sxu.edu.cn/
   User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36

3. http://myportal.sxu.edu.cn/captchaValidate.portal?captcha=mb75&what=captcha&value=mb75&_=
   Request:
   Accept: text/javascript, text/html, application/xml, text/xml, */*
   Accept-Encoding: gzip, deflate, sdch
   Accept-Language: zh-CN,zh;q=0.8
   Connection: keep-alive
   Cookie: JSESSIONID=0000pTnkBdxH-RxSCBTh6PQ_iqs:1b4e17ihg
   Host: myportal.sxu.edu.cn
   Referer: http://myportal.sxu.edu.cn/
   User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36
   X-Prototype-Version: 1.5.0
   X-Requested-With: XMLHttpRequest

4. http://myportal.sxu.edu.cn/userPasswordValidate.portal (POST)
   Request:
   Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
   Accept-Encoding: gzip, deflate
   Accept-Language: zh-CN,zh;q=0.8
   Cache-Control: max-age=0
   Connection: keep-alive
   Content-Length: 173
   Content-Type: application/x-www-form-urlencoded
   Cookie: JSESSIONID=0000pTnkBdxH-RxSCBTh6PQ_iqs:1b4e17ihg
   Host: myportal.sxu.edu.cn
   Origin: http://myportal.sxu.edu.cn
   Referer: http://myportal.sxu.edu.cn/
   Upgrade-Insecure-Requests: 1
   User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36
   Form parameters:
   Login.Token1: 2014241032    // account
   Login.Token2: **********    // password
   goto: http://myportal.sxu.edu.cn/loginSuccess.portal
   gotoOnFail: http://myportal.sxu.edu.cn/loginFailure.portal
   Response:
   Cache-Control: no-cache
   Content-Language: zh-CN
   Content-Length: 83
   Content-Type: text/html; charset=UTF-8
   Date: Sun, 09 Jul 2017 09:12:08 GMT
   Expires: Thu, 01 Dec 1994 16:00:00 GMT
   Server: IBM_HTTP_Server
   Set-Cookie: iPlanetDirectoryPro=AQIC5wM2LY4Sfcy4g6cjY8hOCWulRTizy9EbNPUbXBB1hc0%3D%40AAJTSQACMDE%3D%23; Path=/; Domain=.sxu.edu.cn

5. http://myportal.sxu.edu.cn/index.portal
   Request:
   Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
   Accept-Encoding: gzip, deflate, sdch
   Accept-Language: zh-CN,zh;q=0.8
   Connection: keep-alive
   Cookie: JSESSIONID=0000pTnkBdxH-RxSCBTh6PQ_iqs:1b4e17ihg; iPlanetDirectoryPro=AQIC5wM2LY4Sfcy4g6cjY8hOCWulRTizy9EbNPUbXBB1hc0%3D%40AAJTSQACMDE%3D%23
   Host: myportal.sxu.edu.cn
   Referer: http://myportal.sxu.edu.cn/
   Upgrade-Insecure-Requests: 1
   User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36
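Each Set-Cookie line in the captures above is a plain `name=value; attribute; ...` string. HttpClient's cookie store (used in the crawler below) parses these automatically, but as a quick illustrative sketch, the value can also be pulled out by hand with ordinary string handling (the class name `CookieParse` is made up for this example):

```java
public class CookieParse {

    /** Extract a named cookie's value from a raw Set-Cookie header line. */
    public static String cookieValue(String setCookie, String name) {
        for (String part : setCookie.split(";")) {
            String p = part.trim();
            if (p.startsWith(name + "=")) {
                return p.substring(name.length() + 1);
            }
        }
        return null; // cookie not present in this header
    }

    public static void main(String[] args) {
        String h = "JSESSIONID=0000pTnkBdxH-RxSCBTh6PQ_iqs:1b4e17ihg; Path=/";
        System.out.println(cookieValue(h, "JSESSIONID"));
    }
}
```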
2. Analysis

From the captures above, the key to crawling the information portal is obtaining the following two cookies:

JSESSIONID
iPlanetDirectoryPro

JSESSIONID is obtained on the first request for the login page, while iPlanetDirectoryPro is obtained after requesting userPasswordValidate.portal. The request to userPasswordValidate.portal needs a JSESSIONID plus four parameters, two of which are:

Login.Token1: 2014241032    // account
Login.Token2: **********    // password

The other two parameters (goto and gotoOnFail) are copied verbatim from the capture.
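For context, these four fields are exactly what make up the Content-Length: 173 urlencoded body seen in the capture of userPasswordValidate.portal. A minimal sketch of how such an application/x-www-form-urlencoded body is assembled, using only the JDK's URLEncoder (the class name `FormBody` is hypothetical; in the crawler below, URIBuilder/HttpClient handles the encoding instead):

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

public class FormBody {

    /** Join key/value pairs into an application/x-www-form-urlencoded body. */
    public static String encode(Map<String, String> params) throws UnsupportedEncodingException {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (sb.length() > 0) sb.append('&');
            sb.append(URLEncoder.encode(e.getKey(), "UTF-8"))
              .append('=')
              .append(URLEncoder.encode(e.getValue(), "UTF-8"));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> p = new LinkedHashMap<>();
        p.put("Login.Token1", "2014241032");  // account
        p.put("Login.Token2", "**********");  // password (masked)
        p.put("goto", "http://myportal.sxu.edu.cn/loginSuccess.portal");
        p.put("gotoOnFail", "http://myportal.sxu.edu.cn/loginFailure.portal");
        System.out.println(encode(p));
    }
}
```

Note that URLEncoder percent-encodes the `://` in the goto URLs, which is why the body in the capture is longer than the raw field values.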
Based on this analysis, our crawler needs to request the following pages:

1. Request login.portal to obtain JSESSIONID
2. Request userPasswordValidate.portal to obtain iPlanetDirectoryPro
3. Crawl the data
3. Writing the code
```java
package info_system;

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;

import org.apache.http.Header;
import org.apache.http.HeaderElement;
import org.apache.http.HeaderElementIterator;
import org.apache.http.HeaderIterator;
import org.apache.http.HttpResponse;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.CookieStore;
import org.apache.http.client.ResponseHandler;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.client.utils.URIBuilder;
import org.apache.http.conn.ConnectionKeepAliveStrategy;
import org.apache.http.impl.client.BasicCookieStore;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.message.BasicHeaderElementIterator;
import org.apache.http.protocol.HTTP;
import org.apache.http.protocol.HttpContext;
import org.apache.http.util.EntityUtils;

import utils.ImageUtils;

public class Test {

    public static final String host = "myportal.sxu.edu.cn";
    public static final String url1 = "/login.portal";
    public static final String url2 = "/captchaGenerate.portal";
    public static final String url3 = "/captchaValidate.portal";
    public static final String url4 = "/userPasswordValidate.portal";
    public static final String url5 = "/index.portal";

    public static void main(String[] args)
            throws URISyntaxException, ClientProtocolException, IOException {

        // Honor the server's Keep-Alive timeout; fall back to 10 seconds
        ConnectionKeepAliveStrategy myStrategy = new ConnectionKeepAliveStrategy() {
            @Override
            public long getKeepAliveDuration(HttpResponse response, HttpContext context) {
                HeaderElementIterator it = new BasicHeaderElementIterator(
                        response.headerIterator(HTTP.CONN_KEEP_ALIVE));
                while (it.hasNext()) {
                    HeaderElement he = it.nextElement();
                    String param = he.getName();
                    String value = he.getValue();
                    if (value != null && param.equalsIgnoreCase("timeout")) {
                        try {
                            return Long.parseLong(value) * 1000;
                        } catch (NumberFormatException ignore) {
                        }
                    }
                }
                return 10 * 1000;
            }
        };

        // The cookie store collects JSESSIONID and iPlanetDirectoryPro automatically
        CookieStore cookieStore = new BasicCookieStore();
        CloseableHttpClient httpclient = HttpClients.custom()
                .setDefaultCookieStore(cookieStore)
                .setKeepAliveStrategy(myStrategy)
                .build();

        // 1. Request the login page to obtain its cookie (JSESSIONID)
        URI uri1 = new URIBuilder()
                .setScheme("http")
                .setHost(host)
                .setPath(url1)
                .build();
        HttpGet httpGet = new HttpGet(uri1);
        ResponseHandler<Void> responseHandler = new ResponseHandler<Void>() {
            @Override
            public Void handleResponse(HttpResponse response)
                    throws ClientProtocolException, IOException {
                // Dump all response headers for inspection
                HeaderIterator hi = response.headerIterator();
                while (hi.hasNext()) {
                    Header h = hi.nextHeader();
                    System.out.println(h.getName() + " --> " + h.getValue());
                }
                return null;
            }
        };
        httpclient.execute(httpGet, responseHandler);
        cookieStore.getCookies().forEach(e -> System.out.println(e));

        boolean b = false;
/*
        // 2. Request the captcha image
        URI uri2 = new URIBuilder()
                .setScheme("http")
                .setHost(host)
                .setPath(url2)
                .setParameter("s", "0.5123204417293254")
                .build();
        HttpGet httpGet2 = new HttpGet(uri2);
        do {
            ResponseHandler<Boolean> responseHandler2 = new ResponseHandler<Boolean>() {
                @Override
                public Boolean handleResponse(HttpResponse response)
                        throws ClientProtocolException, IOException {
                    try {
                        ImageUtils.writeImg("test.jpg", response.getEntity().getContent());
                        return true;
                    } catch (Exception e) {
                        return false;
                    }
                }
            };
            b = httpclient.execute(httpGet2, responseHandler2);
        } while (!b);

        // Type in the captcha by hand:
        @SuppressWarnings("resource")
        String captcha = new java.util.Scanner(System.in).nextLine();

        // 3. Submit the captcha for validation
        URI uri3 = new URIBuilder()
                .setScheme("http")
                .setHost(host)
                .setPath(url3)
                .setParameter("captcha", captcha)
                .setParameter("what", "captcha")
                .setParameter("value", captcha)
                .setParameter("_", "")
                .build();
        HttpGet httpGet3 = new HttpGet(uri3);
        // "验证码非法" is the server's "invalid captcha" response
        final String error = "验证码非法";
        ResponseHandler<Boolean> responseHandler3 = new ResponseHandler<Boolean>() {
            @Override
            public Boolean handleResponse(HttpResponse response)
                    throws ClientProtocolException, IOException {
                try {
                    String s = EntityUtils.toString(response.getEntity());
                    System.out.println(s);
                    return !s.equals(error);
                } catch (Exception e) {
                    return false;
                }
            }
        };
        b = httpclient.execute(httpGet3, responseHandler3);
        if (b) System.out.println("captcha accepted");
*/
        // Pause for a moment to let the server catch up
        try {
            Thread.sleep(1000);
        } catch (InterruptedException e1) {
            e1.printStackTrace();
        }

        // 4. Submit the account and password for validation
        URI uri4 = new URIBuilder()
                .setScheme("http")
                .setHost(host)
                .setPath(url4)
                .setParameter("Login.Token1", "2014241032")   // account
                .setParameter("Login.Token2", "**********")   // password
                .setParameter("goto", "http://myportal.sxu.edu.cn/loginSuccess.portal")
                .setParameter("gotoOnFail", "http://myportal.sxu.edu.cn/loginFailure.portal")
                .build();
        HttpPost httpPost4 = new HttpPost(uri4);
        ResponseHandler<Boolean> responseHandler4 = new ResponseHandler<Boolean>() {
            @Override
            public Boolean handleResponse(HttpResponse response)
                    throws ClientProtocolException, IOException {
                try {
                    String s = EntityUtils.toString(response.getEntity());
                    System.out.println(s);
                    // "用户不存在或密码错误" = "user does not exist or wrong password"
                    return !s.contains("用户不存在或密码错误");
                } catch (Exception e) {
                    return false;
                }
            }
        };
        b = httpclient.execute(httpPost4, responseHandler4);
        if (b) {
            System.out.println("login validated");
        }

        // 5. Request the portal home page
        URI uri5 = new URIBuilder()
                .setScheme("http")
                .setHost(host)
                .setPath(url5)
                .build();
        HttpGet httpGet5 = new HttpGet(uri5);
        ResponseHandler<Boolean> responseHandler5 = new ResponseHandler<Boolean>() {
            @Override
            public Boolean handleResponse(HttpResponse response)
                    throws ClientProtocolException, IOException {
                try {
                    String s = EntityUtils.toString(response.getEntity());
                    //System.out.println(s);
                    // If the captcha field is still present, we were bounced back to the login page
                    return !s.contains("<td class=\"STYLE1\">验证码:</td>");
                } catch (Exception e) {
                    return false;
                }
            }
        };
        b = httpclient.execute(httpGet5, responseHandler5);
        if (b) {
            System.out.println("home page fetched");
        } else {
            System.out.println("failed to fetch home page");
        }
    }
}
```
The following helper saves the captcha image to local disk:

```java
package utils;

import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.InputStream;

public class ImageUtils {

    /**
     * Read an image stream into a byte[].
     */
    public static byte[] readImg(InputStream inStream) throws Exception {
        ByteArrayOutputStream outStream = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        // Number of bytes read per pass; -1 means the stream is exhausted
        int len = 0;
        while ((len = inStream.read(buffer)) != -1) {
            // Copy len bytes from buffer, starting at offset 0
            outStream.write(buffer, 0, len);
        }
        inStream.close();
        return outStream.toByteArray();
    }

    /**
     * Write the image stream imgIs to the local file imgPath.
     */
    public static void writeImg(String imgPath, InputStream imgIs) throws Exception {
        // Buffer the image as raw bytes; works for any binary content
        byte[] data = readImg(imgIs);
        // A relative path is resolved against the project root by default
        File imageFile = new File(imgPath);
        FileOutputStream outStream = new FileOutputStream(imageFile);
        outStream.write(data);
        outStream.close();
    }
}
```
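The hand-rolled buffer loop in ImageUtils is the classic pre-Java-7 way to save a stream; since Java 7 the same round trip can be done in one call with java.nio.file.Files.copy. A small self-contained sketch (the four bytes are just stand-in data, not a real image, and `SaveStreamDemo` is a hypothetical name):

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class SaveStreamDemo {

    /** Save an input stream to a file, replacing any existing file; returns bytes written. */
    public static long save(InputStream in, Path target) throws Exception {
        return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws Exception {
        // Stand-in bytes playing the role of a downloaded captcha image
        byte[] data = {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF, (byte) 0xE0};
        Path tmp = Files.createTempFile("captcha", ".jpg");
        long written = save(new ByteArrayInputStream(data), tmp);
        System.out.println(written + " bytes written to " + tmp);
        Files.deleteIfExists(tmp);
    }
}
```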