Java Web Crawler Development Notes (5)

Source: Internet | Published by: 景观大数据解压密码 | Editor: 程序博客网 | Time: 2024/05/22 16:03

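What does programmer humor look like? Hmm, probably something like this, hahaha.

(It's Chinese New Year, so this went up a bit late. Please bear with me.)

0x05 The Melancholy of Suzumiya parse()

In the previous posts we changed just about every method in GenericCrawler, except one: parse().
The original parse() method looked like this: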

    private Document parse(String url) throws IOException {
        CloseableHttpClient client = HttpClients.createDefault();
        HttpGet get = new HttpGet(url);
        HttpResponse response = client.execute(get);
        return Jsoup.parse(response.getEntity().getContent(), "UTF-8", url);
    }

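However, this code turns out to have plenty of problems in practice, for example:

We Still Don't Know the Name of the UA We Saw That Day

While crawling phodal, I found that the site filters by UA. Without any configuration, HttpClient's UA is: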

User-Agent: Apache-HttpClient/4.5 (Java/1.8.0_05)

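(No explanation needed for the version numbers; that's common knowledge.)
However, the server filters this UA out, and what comes back is: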

<html><head><title>403 Forbidden</title></head><body bgcolor="white"><center><h1>403 Forbidden</h1></center><hr><center>nginx/1.11.5</center></body></html>

Yep, 403 Forbidden.
The fix is simple: set the UA to a suitable value. To that end, let's take a look at what a browser's UA is:

console.log(navigator.userAgent);

Output:

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36

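(Likewise, no explanation needed.)
To keep the server from recognizing me as a crawler (this is actually a very big topic; there is plenty worth discussing about server-side anti-crawler and anti-DDoS measures and about how crawlers and DDoS tools disguise themselves, but since phodal's site really only has UA filtering, I'll leave that aside for now), it is enough to set the request's User-Agent header to a browser's UA: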

    private Document parse(String url) throws IOException {
        CloseableHttpClient client = HttpClients.createDefault();
        HttpGet get = new HttpGet(url);
        get.setHeader("User-Agent",
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36");
        HttpResponse response = client.execute(get);
        return Jsoup.parse(response.getEntity().getContent(), "UTF-8", url);
    }

Run it again and it works:

<!doctype html><html lang="zh-cmn-Hans"><head><title> Phodal - 狼和凤凰 | Growth Engineer</title><!-- a pile of meta and link tags omitted here --></head><body itemscope itemtype="http://schema.org/WebPage"><!-- too long, all omitted; all that matters is that the program ran fine and the page came back --></body></html>
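To make the UA filtering concrete without touching a real site, here is a minimal, self-contained sketch. It uses only the JDK: a throwaway local HttpServer stands in for the filtering server, and HttpURLConnection stands in for HttpClient, so none of this is the crawler's actual code. The class name UaFilterDemo, the helper names, and the rule "anything not starting with Mozilla/ gets 403" are all assumptions made for illustration; the real rule on phodal's server isn't known.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URL;

public class UaFilterDemo {
    // Hypothetical stand-in for a server that filters by User-Agent:
    // anything that does not look like a browser gets 403 Forbidden.
    static HttpServer startServer() throws IOException {
        // Port 0 asks the OS for any free port.
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/", exchange -> {
            String ua = exchange.getRequestHeaders().getFirst("User-Agent");
            boolean looksLikeBrowser = ua != null && ua.startsWith("Mozilla/");
            // -1 means "no response body"; only the status code matters here.
            exchange.sendResponseHeaders(looksLikeBrowser ? 200 : 403, -1);
            exchange.close();
        });
        server.start();
        return server;
    }

    // Sends a GET with the given User-Agent and returns the HTTP status code.
    static int statusFor(int port, String userAgent) throws IOException {
        URL url = new URL("http://127.0.0.1:" + port + "/");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("User-Agent", userAgent);
        try {
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        HttpServer server = startServer();
        try {
            int port = server.getAddress().getPort();
            System.out.println(statusFor(port, "Apache-HttpClient/4.5 (Java/1.8.0_05)"));
            System.out.println(statusFor(port, "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"));
        } finally {
            server.stop(0);
        }
    }
}
```

Running main should print 403 for the default HttpClient-style UA and 200 for the browser-style one, which is the same before/after behavior seen above.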

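My HTTP Charset Story Is Wrong, As I Expected

Dropping Jsoup's networking module also brought a serious problem: character sets.
HttpClient is purely a networking component, responsible only for creating HTTP connections and fetching content, and Jsoup without its networking module is purely an HTML parsing component. The catch is that an HTML document's character set can come from two different places:

1. The Content-Type header of the HTTP response, for example Content-Type: text/html; charset=gb2312, where charset is the character set.
2. A <meta> element in the document's <head>, for example <meta http-equiv="Content-Type" content="text/html; charset=gb2312">.

Note that 1 takes precedence over 2. The problem is this: when Jsoup's networking and HTML parsing modules are used together, Jsoup works out internally where the effective charset is defined. Now that we have to pull the two apart, we also have to determine the charset by hand.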
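As a rough illustration of the two sources and their precedence, here is a small JDK-only sketch. The class and method names (CharsetSources, charsetFromHeader, effectiveCharset) are made up for this example, and the header parsing is deliberately simplified (it does not handle spaces around the equals sign, for instance); it is not how HttpClient or Jsoup implement it.

```java
import java.util.Locale;

public class CharsetSources {
    // Extracts the charset parameter from a Content-Type header value,
    // e.g. "text/html; charset=gb2312" -> "gb2312". Returns null if absent.
    static String charsetFromHeader(String contentType) {
        if (contentType == null) return null;
        String[] parts = contentType.split(";");
        for (int i = 1; i < parts.length; i++) { // parts[0] is the media type
            String param = parts[i].trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = param.substring("charset=".length()).trim();
                // The value may be quoted: charset="utf-8"
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }

    // Source 1 (the HTTP header) wins over source 2 (the <meta> element).
    static String effectiveCharset(String headerCharset, String metaCharset) {
        return headerCharset != null ? headerCharset : metaCharset;
    }

    public static void main(String[] args) {
        String fromHeader = charsetFromHeader("text/html; charset=gb2312");
        // Header declares a charset, so the <meta> value is ignored: prints gb2312
        System.out.println(effectiveCharset(fromHeader, "utf-8"));
        // Header has no charset parameter, so <meta> decides: prints utf-8
        System.out.println(effectiveCharset(charsetFromHeader("text/html"), "utf-8"));
    }
}
```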

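Fortunately, both libraries already implement their half of the charset detection, so all we need to do is find those implementations and call them (or copy and paste them).
Let's look at the Jsoup side first (parse(InputStream, String, String) in Jsoup.java):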

    /**
     Read an input stream, and parse it to a Document.
     @param in input stream to read. Make sure to close it after parsing.
     @param charsetName (optional) character set of file contents. Set to {@code null} to determine from {@code http-equiv} meta tag, if present, or fall back to {@code UTF-8} (which is often safe to do).
     @param baseUri The URL where the HTML was retrieved from, to resolve relative links against.
     @return sane HTML
     @throws IOException if the file could not be found, or read, or if the charsetName is invalid.
     */
    public static Document parse(InputStream in, String charsetName, String baseUri) throws IOException {
        return DataUtil.load(in, charsetName, baseUri);
    }

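In other words: if the HTTP response has a Content-Type header that declares a charset, extract that charset and pass it in as charsetName; otherwise pass null, and Jsoup will determine the charset from the <meta> tag on its own.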
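To get a feel for what "pass null and let the parser look at <meta>" means, here is a rough JDK-only sketch of that contract. It is only an illustration of the idea described in the Javadoc, not Jsoup's actual implementation; the class name MetaCharsetSniffer, the decode helper, and the regular expression are all assumptions of this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MetaCharsetSniffer {
    // Matches charset=... inside a <meta> tag, covering both
    // <meta charset="x"> and <meta http-equiv=... content="...; charset=x">.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*charset\\s*=\\s*[\"']?([A-Za-z0-9][\\w.:-]*)", Pattern.CASE_INSENSITIVE);

    // Decodes raw HTML bytes. charsetName plays the same role as the Jsoup parameter:
    // non-null means "trust the caller", null means "look at <meta>, else UTF-8".
    static String decode(byte[] html, String charsetName) {
        if (charsetName == null) {
            // ISO-8859-1 maps every byte to one char, so the ASCII markup stays
            // readable whatever the real encoding is.
            String probe = new String(html, StandardCharsets.ISO_8859_1);
            Matcher m = META_CHARSET.matcher(probe);
            charsetName = (m.find() && Charset.isSupported(m.group(1))) ? m.group(1) : "UTF-8";
        }
        return new String(html, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        String page = "<html><head><meta http-equiv=\"Content-Type\" "
                + "content=\"text/html; charset=gb2312\">"
                + "<title>\u606d\u8d3a\u65b0\u79a7</title></head></html>";
        byte[] bytes = page.getBytes(Charset.forName("GB2312"));
        // null: the charset is taken from the <meta> tag, so the text round-trips
        System.out.println(decode(bytes, null).equals(page));
        // a wrong charset forced by the caller garbles the title
        System.out.println(decode(bytes, "UTF-8").equals(page));
    }
}
```

The second call shows why the hard-coded "UTF-8" in the original parse() is a problem for gb2312 pages: the caller's choice overrides whatever the page declares.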

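That leaves the other half of the job: checking whether the HTTP response has a Content-Type header and, if so, parsing the charset name out of it.
For this we can look at the part of toString(HttpEntity, Charset) in EntityUtils.java that determines the charset: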

public static String toString(final HttpEntity entity, final Charset defaultCharset) throws IOException, ParseException {
    // several lines omitted
    Charset charset = null;
    try {
        final ContentType contentType = ContentType.get(entity);
        if (contentType != null) {
            charset = contentType.getCharset();
        }
    } catch (final UnsupportedCharsetException ex) {
        if (defaultCharset == null) {
            throw new UnsupportedEncodingException(ex.getMessage());
        }
    }
    if (charset == null) {
        charset = defaultCharset;
    }
    if (charset == null) {
        charset = HTTP.DEF_CONTENT_CHARSET;
    }
    // several lines omitted
}

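In other words, it picks a charset from the definition in Content-Type (if there is one); if none is found (or the declared one isn't supported and a defaultCharset was supplied), it falls back to defaults in the order defaultCharset -> HTTP.DEF_CONTENT_CHARSET.
Of course, what we want here isn't a default value but a plain null when Content-Type doesn't give us a charset, so we only need to lift this try/catch block out of EntityUtils.toString():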

    HttpEntity entity = response.getEntity();
    Charset charset = null;
    try {
        ContentType contentType = ContentType.get(entity);
        if (contentType != null) charset = contentType.getCharset();
    } catch (final UnsupportedCharsetException ignored) {
        // leave charset as null
    }
    String charsetName = null;
    if (charset != null) charsetName = charset.name();
    Document doc = Jsoup.parse(entity.getContent(), charsetName, url);
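The difference between "fall back to a default" and "return null" can be sketched with the JDK alone. The class CharsetFallback and its two helpers are hypothetical names for this example. The fallback order mirrors the EntityUtils excerpt, with ISO-8859-1 standing in for HTTP.DEF_CONTENT_CHARSET; unlike the excerpt, this sketch never throws when the declared charset is unsupported.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;

public class CharsetFallback {
    // Looks up a charset by name; returns null when the name is missing,
    // malformed, or not supported by this JVM. This is the behavior we want
    // before handing charsetName to the HTML parser.
    static Charset lookupOrNull(String name) {
        if (name == null) return null;
        try {
            return Charset.forName(name);
        } catch (IllegalCharsetNameException | UnsupportedCharsetException ignored) {
            return null;
        }
    }

    // The default-seeking order from the excerpt:
    // declared charset -> caller's default -> ISO-8859-1.
    static Charset withDefaults(String declared, Charset defaultCharset) {
        Charset charset = lookupOrNull(declared);
        if (charset == null) charset = defaultCharset;
        if (charset == null) charset = StandardCharsets.ISO_8859_1;
        return charset;
    }

    public static void main(String[] args) {
        System.out.println(lookupOrNull("no-such-charset"));                     // null
        System.out.println(withDefaults(null, null));                            // ISO-8859-1
        System.out.println(withDefaults("no-such-charset", StandardCharsets.UTF_8)); // UTF-8
        System.out.println(withDefaults("utf-8", null));                         // UTF-8
    }
}
```

With the default-seeking version, a page that declares its charset only in <meta> would be decoded as ISO-8859-1; returning null instead is what lets the parser take over.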

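To test whether this approach works, we first split the possible sources of an HTML page's charset into three cases (defining it in neither place is a special case; no sane person writes pages that way, so we ignore it):

1. Content-Type defines it, and there is no <meta> tag.
2. Content-Type does not define it; the <meta> tag does.
3. Both Content-Type and the <meta> tag define it.

An example of the second case is http://www.sfls.cn (gb2312). The third is very common, for example http://www.zhangxinxu.com (utf-8). For the first, I wrote a small page on my local server:
(gb2312 was chosen because the default is utf-8) (the title is "恭贺新禧", a New Year greeting, which fits the season)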

<%@page pageEncoding="gb2312" %>
<%
String title = "\u606d\u8d3a\u65b0\u79a7";
%>
<html>
    <head>
        <title><%=title %></title>
    </head>
    <body>
    </body>
</html>

Test code (placed inside GenericCrawler):

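public static void main(String[] args) throws Throwable {
    String url = "the URL of each test case in turn (the ones above, not listed again)";
    GenericCrawler crawler = new GenericCrawler(null); // empty shell object, used only to call the instance method
    Document doc = crawler.parse(url); // parse() is an instance method, so it is called on the shell object
    System.out.println(doc.charset());
    System.out.println(doc.title());
}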

Test results:
test case 1:

GB2312
恭贺新禧

test case 2:

GB2312
上海外国语大学附属外国语学校 >> 首页

test case 3:

UTF-8
首页 » 张鑫旭-鑫空间-鑫生活

Problem solved, perfectly.

The complete parse() method

Putting all of the above together, parse() now looks like this:

    private Document parse(String url) throws IOException {
        // establish HTTP connection; try-with-resources closes the client even if an exception is thrown
        try (CloseableHttpClient client = HttpClients.createDefault()) {
            HttpGet get = new HttpGet(url);
            get.setHeader("User-Agent",
                    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36");
            HttpResponse response = client.execute(get);
            HttpEntity entity = response.getEntity();
            // retrieve charset
            Charset charset = null;
            try {
                ContentType contentType = ContentType.get(entity);
                if (contentType != null) charset = contentType.getCharset();
            } catch (ParseException | UnsupportedCharsetException ignored) {
                // leave charset as null so that Jsoup falls back to the <meta> tag
            }
            String charsetName = null;
            if (charset != null) charsetName = charset.name();
            // parse HTML page; the stream is fully read before the client is closed
            return Jsoup.parse(entity.getContent(), charsetName, url);
        }
    }

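Summary

I was busy with New Year visits over the holiday and didn't have much time for blogging, so this post is a bit short and a bit late. Sorry about that. I don't know whether you like this style: if you do, please follow; if you don't, comments and suggestions are welcome too.

My writing isn't great and whatever I say tends toward cliché, so, with the new year just beginning, I'll say just one thing: I wish you all a lucky Year of the Rooster!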

0 0