Reading Notes: 自己动手写网络爬虫 (Write Your Own Web Crawler), Chapter 1 (1)

Source: Internet | Posted by: centos7下nginx配置 | Editor: 程序博客网 | Time: 2024/05/29 08:23

I am only a beginner, so these notes just list the problems I ran into and how I solved them.

NO.1

Chapter 1 of the book, the downloadFile method of the DownLoadFile class.

Code from the book:

// 4. Handle the HTTP response content
byte[] responseBoby = getMethod.getResponseBody();
filePath = "temp\\" + getFileNameByUrl(url, getMethod.getResponseHeader("Content-Type").getValue());

The complete code compiles, but running it throws java.io.FileNotFoundException.

I changed it as follows:

String rootPath = getClass().getResource("/").getFile().toString() + "SgwSpider\\";
if (!new File(rootPath).exists()) { // check whether the path exists
    new File(rootPath).mkdir(); // create it if it does not exist
}
filePath = rootPath + getFileNameByUrl(url, getMethod.getResponseHeader("Content-Type").getValue());
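The check-then-create step above can be tried in isolation. The following is only a sketch under assumptions: the class name `OutputDirDemo`, its helper methods, and the directory under the system temp folder are hypothetical and do not come from the book. It shows that `FileOutputStream` fails with `FileNotFoundException` while the parent directory is missing, and that creating the directory first removes the failure. It uses `File.mkdirs()`, which also creates missing parent directories, whereas `mkdir()` creates only the last path element.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class OutputDirDemo {

    /** Tries to open the file for writing; returns true if that fails with FileNotFoundException. */
    static boolean failsToOpen(File target) {
        try (FileOutputStream out = new FileOutputStream(target)) {
            return false; // opened (and created) successfully
        } catch (FileNotFoundException e) {
            return true; // for example, the parent directory does not exist
        } catch (IOException e) {
            throw new UncheckedIOException(e); // unexpected failure while closing
        }
    }

    /** Makes sure the directory exists, creating missing parent directories as well. */
    static boolean ensureDirectory(File dir) {
        return dir.isDirectory() || dir.mkdirs();
    }

    public static void main(String[] args) {
        // Hypothetical location under the system temp directory, used only for this demo.
        File base = new File(System.getProperty("java.io.tmpdir"), "sgw-demo-" + System.nanoTime());
        File target = new File(new File(base, "SgwSpider"), "page.html");

        System.out.println("before creating the directory, open fails: " + failsToOpen(target));
        System.out.println("directory ready: " + ensureDirectory(target.getParentFile()));
        System.out.println("after creating the directory, open fails: " + failsToOpen(target));
    }
}
```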

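getClass().getResource("/").getFile().toString(); // returns the root of the classpath the current class was loaded from (typically the project's output directory), as a string

Later, when fetching other pages, the following warning appeared: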
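One caveat with this call is that `URL.getFile()` returns the path exactly as it is written in the URL, so a space in a directory name stays encoded as `%20`. The sketch below illustrates the difference using a made-up `file:` URL instead of a real classpath lookup. The class `ResourcePathDemo`, its helper methods, and the sample path are assumptions for illustration only.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;
import java.nio.file.Paths;

class ResourcePathDemo {

    /** Builds a URL from a string, turning the checked exception into an unchecked one. */
    static URL toUrl(String spec) {
        try {
            return URI.create(spec).toURL();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    /** Converts a file: URL into a file-system path with percent-escapes decoded. */
    static String decodedPath(URL url) {
        try {
            return Paths.get(url.toURI()).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical classpath root whose directory name contains a space.
        URL url = toUrl("file:/tmp/my%20project/bin/");

        System.out.println("getFile():   " + url.getFile());     // still contains %20
        System.out.println("decoded path: " + decodedPath(url)); // contains a real space
    }
}
```

If the project directory contains no spaces or other escaped characters, both forms point at the same location, which is presumably why the original fix works as written.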

org.apache.commons.httpclient.HttpMethodBase getResponseBody
WARNING: Going to buffer response body of large or unknown size. Using getResponseBodyAsStream instead is recommended.

The warning comes from the same code as above:

// 4. Handle the HTTP response content
byte[] responseBoby = getMethod.getResponseBody();

Following it into the source, in the abstract class HttpMethodBase:

/**
 * Returns the response body of the HTTP method, if any, as an array of bytes.
 * If response body is not available or cannot be read, returns <tt>null</tt>.
 * Buffers the response and this method can be called several times yielding
 * the same result each time.
 *
 * Note: This will cause the entire response body to be buffered in memory. A
 * malicious server may easily exhaust all the VM memory. It is strongly
 * recommended, to use getResponseAsStream if the content length of the response
 * is unknown or resonably large.
 *
 * @return The response body.
 *
 * @throws IOException If an I/O (transport) problem occurs while obtaining the
 * response body.
 */
public byte[] getResponseBody() throws IOException {
    if (this.responseBody == null) {
        InputStream instream = getResponseBodyAsStream();
        if (instream != null) {
            long contentLength = getResponseContentLength();
            if (contentLength > Integer.MAX_VALUE) { //guard below cast from overflow
                throw new IOException("Content too large to be buffered: " + contentLength + " bytes");
            }
            int limit = getParams().getIntParameter(HttpMethodParams.BUFFER_WARN_TRIGGER_LIMIT, 1024*1024);
            if ((contentLength == -1) || (contentLength > limit)) { // the warning is raised here
                LOG.warn("Going to buffer response body of large or unknown size. "
                        + "Using getResponseBodyAsStream instead is recommended.");
            }
            LOG.debug("Buffering response body");
            ByteArrayOutputStream outstream = new ByteArrayOutputStream(
                    contentLength > 0 ? (int) contentLength : DEFAULT_INITIAL_BUFFER_SIZE);
            byte[] buffer = new byte[4096];
            int len;
            while ((len = instream.read(buffer)) > 0) {
                outstream.write(buffer, 0, len);
            }
            outstream.close();
            setResponseStream(null);
            this.responseBody = outstream.toByteArray();
        }
    }
    return this.responseBody;
}

if (this.responseBody == null) {
    InputStream instream = getResponseBodyAsStream();

responseBody is defined as follows:

/** Buffer for the response */
private byte[] responseBody = null;

Up to this point responseBody has not been assigned, so the branch is taken.

Following the call further:

/**
 * Returns the response body of the HTTP method, if any, as an {@link InputStream}.
 * If response body is not available, returns <tt>null</tt>. If the response has been
 * buffered this method returns a new stream object on every call. If the response
 * has not been buffered the returned stream can only be read once.
 *
 * @return The response body or <code>null</code>.
 *
 * @throws IOException If an I/O (transport) problem occurs while obtaining the
 * response body.
 */
public InputStream getResponseBodyAsStream() throws IOException {
    if (responseStream != null) {
        return responseStream;
    }
    if (responseBody != null) {
        InputStream byteResponseStream = new ByteArrayInputStream(responseBody);
        LOG.debug("re-creating response stream from byte array");
        return byteResponseStream;
    }
    return null;
}

responseStream is defined as follows:

/** The response body of the HTTP method, assuming it has not be
 * intercepted by a sub-class. */
private InputStream responseStream = null;
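To see how the two fields interact, here is a toy model of the behaviour described in the Javadoc. It is a simplified sketch and not the HttpClient class itself: `ResponseHolder` and its constructor are hypothetical, and passing the stream to the constructor only stands in for whatever sets `responseStream` in the real library.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Toy stand-in for the responseStream/responseBody pair; not the library class. */
class ResponseHolder {
    private InputStream responseStream; // the live, read-once stream
    private byte[] responseBody;        // filled once the body has been buffered

    ResponseHolder(InputStream responseStream) {
        this.responseStream = responseStream;
    }

    InputStream getResponseBodyAsStream() {
        if (responseStream != null) {
            return responseStream; // not buffered yet: always the same read-once stream
        }
        if (responseBody != null) {
            return new ByteArrayInputStream(responseBody); // buffered: a fresh stream per call
        }
        return null;
    }

    byte[] getResponseBody() {
        if (responseBody == null) {
            InputStream in = getResponseBodyAsStream();
            if (in != null) {
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buffer = new byte[4096];
                try {
                    int len;
                    while ((len = in.read(buffer)) > 0) {
                        out.write(buffer, 0, len);
                    }
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
                responseStream = null; // the live stream is used up
                responseBody = out.toByteArray();
            }
        }
        return responseBody;
    }

    public static void main(String[] args) {
        byte[] raw = "hello".getBytes(StandardCharsets.UTF_8);
        ResponseHolder holder = new ResponseHolder(new ByteArrayInputStream(raw));

        // Before buffering, both calls hand back the same stream object.
        System.out.println(holder.getResponseBodyAsStream() == holder.getResponseBodyAsStream()); // true
        System.out.println(new String(holder.getResponseBody(), StandardCharsets.UTF_8));         // hello
        // After buffering, every call creates a new stream over the byte array.
        System.out.println(holder.getResponseBodyAsStream() == holder.getResponseBodyAsStream()); // false
    }
}
```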

responseStream is assigned when the following code runs:

int statusCode = httpClient.executeMethod(getMethod);

After that,

long contentLength = getResponseContentLength();

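returns -1 for contentLength, which satisfies the contentLength == -1 condition, so the warning is logged.

The fix is as follows. Change

// 4. Handle the HTTP response content
byte[] responseBoby = getMethod.getResponseBody();

to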
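The condition that triggers the warning can be looked at on its own. The sketch below is a simplified model and not the library's code: `BufferWarningDemo` and both helper methods are hypothetical, and the header parsing only assumes that a missing or unparseable Content-Length is reported as -1. A common reason for a missing Content-Length is a response sent with chunked transfer encoding. The sketch also shows that raising the limit would not silence the warning when the length is unknown.

```java
class BufferWarningDemo {
    /** Same default as in the source quoted above: 1 MB. */
    static final int DEFAULT_LIMIT = 1024 * 1024;

    /** Simplified model: -1 when the Content-Length header is absent or not a number. */
    static long contentLength(String headerValue) {
        if (headerValue == null) {
            return -1;
        }
        try {
            return Long.parseLong(headerValue.trim());
        } catch (NumberFormatException e) {
            return -1;
        }
    }

    /** Mirrors the condition guarding LOG.warn in getResponseBody(). */
    static boolean wouldWarn(long contentLength, int limit) {
        return contentLength == -1 || contentLength > limit;
    }

    public static void main(String[] args) {
        System.out.println(wouldWarn(contentLength("2048"), DEFAULT_LIMIT));     // false: small, known size
        System.out.println(wouldWarn(contentLength("5000000"), DEFAULT_LIMIT));  // true: larger than 1 MB
        System.out.println(wouldWarn(contentLength(null), DEFAULT_LIMIT));       // true: unknown size
        System.out.println(wouldWarn(contentLength(null), Integer.MAX_VALUE));   // true: the limit does not matter
    }
}
```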

BufferedReader responseBoby = new BufferedReader(new InputStreamReader(getMethod.getResponseBodyAsStream()));

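The original saveToLocal is changed to:

private void saveToLocal(BufferedReader data, String filePath) {
    try {
        String str;
        DataOutputStream out = new DataOutputStream(new FileOutputStream(new File(filePath)));
        // Caveat: readLine() drops line terminators and writeBytes() keeps only the
        // low byte of each char, so line breaks and non-ASCII text are not preserved.
        while ((str = data.readLine()) != null) {
            out.writeBytes(str);
        }
        out.flush();
        out.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
}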
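Because `readLine()` strips line terminators and `DataOutputStream.writeBytes()` writes only the low byte of each character, the version above can alter the page it saves. An alternative is to copy the raw bytes of the stream. The following is a sketch under assumptions, not the book's method: `StreamSaveDemo`, the `Path`-based signature, and the `roundTrip` helper are hypothetical, and a `ByteArrayInputStream` stands in for the stream returned by `getMethod.getResponseBodyAsStream()`.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

class StreamSaveDemo {

    /** Copies the stream to the file in 4 KB chunks and returns the number of bytes written. */
    static long saveToLocal(InputStream data, Path filePath) throws IOException {
        long total = 0;
        byte[] buffer = new byte[4096];
        // try-with-resources closes both streams even if the copy fails halfway
        try (InputStream in = data; OutputStream out = Files.newOutputStream(filePath)) {
            int len;
            while ((len = in.read(buffer)) != -1) {
                out.write(buffer, 0, len);
                total += len;
            }
        }
        return total;
    }

    /** Demo helper: saves the bytes to a temporary file and reads them back. */
    static byte[] roundTrip(byte[] body) {
        try {
            Path file = Files.createTempFile("sgw-demo-", ".html");
            try {
                saveToLocal(new ByteArrayInputStream(body), file);
                return Files.readAllBytes(file);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Sample body with non-ASCII text (written as Unicode escapes) and mixed line endings.
        byte[] body = "<p>\u7b2c\u4e00\u884c</p>\r\n<p>second line</p>\n".getBytes(StandardCharsets.UTF_8);
        System.out.println("saved file identical to body: " + Arrays.equals(body, roundTrip(body)));
    }
}
```

With this shape the caller would pass the response stream straight through, and memory use stays bounded by the 4 KB buffer regardless of the page size.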

0 0