Reading Notes: 自己动手写网络爬虫 (Write Your Own Web Crawler), Chapter 1 (1)
Source: Internet  Editor: 程序博客网  Time: 2024/05/29 08:23
I am only a beginner, so these notes just list the problems I ran into and how I solved them.
NO.1
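The problem is in the downloadFile method of the DownLoadFile class in Chapter 1 of the book.
Code from the book:
// 4. Process the HTTP response content
byte[] responseBoby = getMethod.getResponseBody();
filePath = "temp\\" + getFileNameByUrl(url, getMethod.getResponseHeader("Content-Type").getValue());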
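The complete code compiles, but running it throws java.io.FileNotFoundException, presumably because the relative temp\ directory does not exist and the file therefore cannot be created.
I changed it as follows: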
String rootPath = getClass().getResource("/").getFile().toString() + "SgwSpider\\";
if (!new File(rootPath).exists()) // check whether the path exists
{
    new File(rootPath).mkdir(); // create it if it does not exist
}
filePath = rootPath + getFileNameByUrl(url, getMethod.getResponseHeader("Content-Type").getValue());
getClass().getResource("/").getFile().toString(); // returns the root of the classpath the current class was loaded from (the project's output directory), as a string
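The same "create the directory if it is missing" step can also be written with java.nio.file, which creates any missing parent directories in one call. This is only a minimal sketch, assuming Java 7 or later; the class name DownloadDirs and the method resolveInDir are hypothetical and do not come from the book.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not from the book: makes sure the download
// directory exists before a file is written into it.
final class DownloadDirs {
    private DownloadDirs() {}

    // Creates rootDir (and any missing parents) if needed, then returns
    // the full path of fileName inside it.
    static Path resolveInDir(String rootDir, String fileName) throws IOException {
        Path dir = Paths.get(rootDir);
        Files.createDirectories(dir); // does nothing if the directory already exists
        return dir.resolve(fileName);
    }
}
```

Unlike File.mkdir(), which only reports failure through its boolean return value, Files.createDirectories throws an IOException when the directory cannot be created, so the cause is visible right away.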
Later, while fetching other pages, the following warning appeared:
org.apache.commons.httpclient.HttpMethodBase getResponseBody
WARNING: Going to buffer response body of large or unknown size. Using getResponseBodyAsStream instead is recommended.
The code that triggers the warning is the same line as above:
// 4. Process the HTTP response content
byte[] responseBoby = getMethod.getResponseBody();
Stepping into the source, in the abstract class HttpMethodBase:
<pre name="code" class="java">/**
 * Returns the response body of the HTTP method, if any, as an array of bytes.
 * If response body is not available or cannot be read, returns <tt>null</tt>.
 * Buffers the response and this method can be called several times yielding
 * the same result each time.
 *
 * Note: This will cause the entire response body to be buffered in memory. A
 * malicious server may easily exhaust all the VM memory. It is strongly
 * recommended, to use getResponseAsStream if the content length of the response
 * is unknown or resonably large.
 *
 * @return The response body.
 *
 * @throws IOException If an I/O (transport) problem occurs while obtaining the
 * response body.
 */
public byte[] getResponseBody() throws IOException {
    if (this.responseBody == null) {
        InputStream instream = getResponseBodyAsStream();
        if (instream != null) {
            long contentLength = getResponseContentLength();
            if (contentLength > Integer.MAX_VALUE) { //guard below cast from overflow
                throw new IOException("Content too large to be buffered: "+ contentLength +" bytes");
            }
            int limit = getParams().getIntParameter(HttpMethodParams.BUFFER_WARN_TRIGGER_LIMIT, 1024*1024);
            if ((contentLength == -1) || (contentLength > limit)) {
                // the warning is emitted here
                LOG.warn("Going to buffer response body of large or unknown size. "
                        +"Using getResponseBodyAsStream instead is recommended.");
            }
            LOG.debug("Buffering response body");
            ByteArrayOutputStream outstream = new ByteArrayOutputStream(
                    contentLength > 0 ? (int) contentLength : DEFAULT_INITIAL_BUFFER_SIZE);
            byte[] buffer = new byte[4096];
            int len;
            while ((len = instream.read(buffer)) > 0) {
                outstream.write(buffer, 0, len);
            }
            outstream.close();
            setResponseStream(null);
            this.responseBody = outstream.toByteArray();
        }
    }
    return this.responseBody;
}</pre>
In the lines
if (this.responseBody == null) {
    InputStream instream = getResponseBodyAsStream();
responseBody is the field defined as follows:
<pre name="code" class="java">/** Buffer for the response */
private byte[] responseBody = null;</pre>
At this point responseBody has not been assigned yet, so it is still null.
Stepping in further:
<pre name="code" class="java">/**
 * Returns the response body of the HTTP method, if any, as an {@link InputStream}.
 * If response body is not available, returns <tt>null</tt>. If the response has been
 * buffered this method returns a new stream object on every call. If the response
 * has not been buffered the returned stream can only be read once.
 *
 * @return The response body or <code>null</code>.
 *
 * @throws IOException If an I/O (transport) problem occurs while obtaining the
 * response body.
 */
public InputStream getResponseBodyAsStream() throws IOException {
    if (responseStream != null) {
        return responseStream;
    }
    if (responseBody != null) {
        InputStream byteResponseStream = new ByteArrayInputStream(responseBody);
        LOG.debug("re-creating response stream from byte array");
        return byteResponseStream;
    }
    return null;
}</pre>
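The "can only be read once" behaviour mentioned in the Javadoc can be seen with any InputStream. The following is a small sketch that uses ByteArrayInputStream as a stand-in for the network stream (an assumption made only so the example runs without a server); ReadOnceDemo, drain and demo are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical demo, independent of HttpClient.
final class ReadOnceDemo {
    private ReadOnceDemo() {}

    // Reads the stream to its end and returns the number of bytes seen.
    static int drain(InputStream in) throws IOException {
        int count = 0;
        while (in.read() != -1) {
            count++;
        }
        return count;
    }

    // Returns {first read, second read of the same stream, read of a fresh stream}.
    static int[] demo(byte[] body) throws IOException {
        InputStream once = new ByteArrayInputStream(body);
        int first = drain(once);
        int second = drain(once); // the stream is already at its end
        int fresh = drain(new ByteArrayInputStream(body)); // rebuilt from the buffered bytes
        return new int[] { first, second, fresh };
    }
}
```

For a five-byte body, demo returns {5, 0, 5}: the second pass over the same stream yields nothing, while a stream rebuilt from the buffered byte array can be read again, which is what the responseBody != null branch does.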
responseStream is defined as follows:
<pre name="code" class="java">/** The response body of the HTTP method, assuming it has not be
 * intercepted by a sub-class. */
private InputStream responseStream = null;</pre>
responseStream is assigned when the following code runs:
int statusCode = httpClient.executeMethod(getMethod);
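After that, the call
long contentLength = getResponseContentLength();
returns -1 (the length is unknown, for example when the response carries no Content-Length header), so the warning is logged.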
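The check that produces the warning can be isolated as below. This is only a sketch: BufferWarning and shouldWarn are hypothetical names, and the logic simply restates the condition in the getResponseBody source quoted earlier, with 1024*1024 as the default limit shown there.

```java
// Hypothetical stand-alone restatement of the condition in getResponseBody.
final class BufferWarning {
    private BufferWarning() {}

    // Default threshold used in the quoted source when
    // BUFFER_WARN_TRIGGER_LIMIT is not set: 1 MB.
    static final int DEFAULT_LIMIT = 1024 * 1024;

    // contentLength is -1 when the length of the response is unknown.
    static boolean shouldWarn(long contentLength, int limit) {
        return contentLength == -1 || contentLength > limit;
    }
}
```

Because the first half of the condition does not involve the limit at all, raising BUFFER_WARN_TRIGGER_LIMIT would not silence the warning in this case; a length of -1 always triggers it.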
The fix is as follows. Change
// 4. Process the HTTP response content
byte[] responseBoby = getMethod.getResponseBody();
to
BufferedReader responseBoby = new BufferedReader(new InputStreamReader(getMethod.getResponseBodyAsStream()));
The original saveToLocal is changed as follows:
private void saveToLocal(BufferedReader data, String filePath) {
    try (DataOutputStream out = new DataOutputStream(new FileOutputStream(new File(filePath)))) {
        String str;
        while ((str = data.readLine()) != null) {
            out.writeBytes(str);
            // readLine() strips the line terminator, so write one back
            out.writeBytes(System.lineSeparator());
        }
        out.flush();
    } catch (IOException e) {
        e.printStackTrace();
    }
}
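Because readLine() and writeBytes() work on text, and writeBytes() keeps only the low byte of each character, the version above is only safe for pages in a single-byte encoding. A byte-for-byte copy avoids that problem and also works for images and other binary files. This is a minimal sketch, assuming the InputStream returned by getMethod.getResponseBodyAsStream() is passed in directly; StreamSaver and this saveToLocal variant are hypothetical names, not from the book.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical variant of saveToLocal that copies raw bytes.
final class StreamSaver {
    private StreamSaver() {}

    // Copies everything from data into filePath and returns the byte count.
    static long saveToLocal(InputStream data, String filePath) throws IOException {
        long total = 0;
        // try-with-resources closes both streams, even if the copy fails
        try (InputStream in = data;
             OutputStream out = new FileOutputStream(filePath)) {
            byte[] buffer = new byte[4096];
            int len;
            while ((len = in.read(buffer)) != -1) {
                out.write(buffer, 0, len);
                total += len;
            }
        }
        return total;
    }
}
```

With HttpClient 3.x this could be called as StreamSaver.saveToLocal(getMethod.getResponseBodyAsStream(), filePath). Note that getResponseBodyAsStream() may return null according to its Javadoc, so a null check before the call would be needed.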