JavaWEB Study Notes: Scraping Web Page Data with HtmlUnit

Source: Internet  Editor: 程序博客网  Time: 2024/06/10 11:09

Java: Scraping Web Page Data with HtmlUnit

Tags (space-separated): java


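I have been using free SS (Shadowsocks) accounts, but they expire after a while and the password has to be changed by hand each time. Being a programmer, I decided to automate the whole thing.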


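HtmlUnit is an open-source Java tool for analyzing web pages: once a page has been loaded, HtmlUnit can be used to inspect its content. It simulates a running browser and is often described as an open-source browser implementation in Java. Its biggest advantage is that it executes JavaScript, so the results produced after AJAX calls complete can be read as well.

1. Preparing to scrape

Target: https://www.mianvpn.com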

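Analysis: clicking Surge opens a modal dialog that shows the link for the configuration. No request is sent during this step, so the link and password are generated directly by JavaScript. What the backend has to do, then, is simulate a click on Surge, wait for the JS to run, and read the content of the corresponding DOM element.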

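(After the link is clicked, a script first sets the modal content to "fetching" and then writes the generated result into the modal, so the code has to wait for the JS after clicking; otherwise it will not get the correct result.)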
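
The waiting described above can also be done by polling rather than with one fixed delay. The following is a minimal sketch using only the JDK; `Poller` and `pollUntilPresent` are hypothetical names introduced here for illustration and are not part of HtmlUnit or of the original project.

```java
import java.util.function.Supplier;

// Hypothetical helper: retries a lookup until it yields a value or time runs out.
final class Poller {
    private Poller() {}

    /**
     * Calls probe repeatedly until it returns a non-null value or the timeout
     * elapses. Returns null on timeout or if the thread is interrupted.
     */
    static <T> T pollUntilPresent(Supplier<T> probe, long timeoutMillis, long intervalMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            T value = probe.get();
            if (value != null) {
                return value;
            }
            if (System.nanoTime() >= deadline) {
                return null;
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // keep the interrupt flag for the caller
                return null;
            }
        }
    }
}
```

In the scraper, the probe could be something like `() -> page.getElementById("surge_url")`, so the code continues as soon as the element appears instead of always sleeping for the full delay.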

Corresponding DOM element: <div class="modal-body" id="watext">

Maven dependencies:

    <dependency>
        <groupId>net.sourceforge.htmlunit</groupId>
        <artifactId>htmlunit</artifactId>
        <version>2.23</version>
    </dependency>
    <dependency>
        <groupId>com.alibaba</groupId>
        <artifactId>fastjson</artifactId>
        <version>1.2.14</version>
    </dependency>

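2. Configuring the WebClient

WebClient is HtmlUnit's built-in browser; think of it as a browser without a graphical display. A few of its options need to be configured.
waitForBackgroundJavaScript() is very important: without it, the code may read the page before the JS has finished and so miss the correct result. Note that it is not a persistent setting: each call blocks until the background JS scheduled so far has finished or the given timeout (in milliseconds) elapses, so it has to be called after the action that triggers the script.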

    import com.gargoylesoftware.htmlunit.BrowserVersion;
    import com.gargoylesoftware.htmlunit.WebClient;

    /**
     * @author Niu Li
     * @date 2016/10/8
     */
    public enum WebClientUtil {
        INSTANCE;

        public WebClient webClient;

        WebClientUtil() {
            webClient = new WebClient(BrowserVersion.CHROME);
            webClient.getOptions().setUseInsecureSSL(true); // accept any HTTPS certificate
            webClient.getOptions().setJavaScriptEnabled(true); // enable the JS engine (default: true)
            webClient.getOptions().setCssEnabled(false); // disable CSS support
            webClient.getOptions().setThrowExceptionOnScriptError(false); // do not throw on JS errors
            webClient.getOptions().setTimeout(10000); // connection timeout: 10 s; 0 waits indefinitely
            webClient.getOptions().setDoNotTrackEnabled(false);
            webClient.setJavaScriptTimeout(8000); // maximum JS run time, in ms
            // Not a persistent setting: this only blocks for up to 500 ms while background JS
            // finishes. No page is loaded yet, so it must be called again after the action
            // that triggers the script.
            webClient.waitForBackgroundJavaScript(500);
        }
    }

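3. Scraping

The idea is to load the whole page, collect all the a tags (Surge is really just an a tag), iterate over them to find the one whose text is Surge, simulate a click on it, read the resulting page, parse the result, build the SS configuration file gui-config.json, and write it to the specified path.

Entity classes corresponding to gui-config.json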

    // SSModel.java
    import java.util.List;

    public class SSModel {
        /**
         * configs : [{""}]
         * index : 8
         * random : false
         * global : false
         * enabled : true
         * shareOverLan : false
         * isDefault : false
         * localPort : 1080
         * pacUrl : null
         * useOnlinePac : false
         * reconnectTimes : 0
         * randomAlgorithm : 0
         * TTL : 0
         * proxyEnable : false
         * proxyType : 0
         * proxyHost : null
         * proxyPort : 0
         * proxyAuthUser : null
         * proxyAuthPass : null
         * authUser : null
         * authPass : null
         * autoban : false
         */
        private int index = 0;
        private boolean random = false;
        private boolean global = false;
        private boolean enabled = true;
        private boolean shareOverLan = false;
        private boolean isDefault = false;
        private int localPort = 1080;
        private String pacUrl;
        private boolean useOnlinePac = false;
        private int reconnectTimes = 0;
        private int randomAlgorithm = 0;
        private int TTL = 0;
        private boolean proxyEnable = false;
        private int proxyType = 0;
        private String proxyHost;
        private int proxyPort = 0;
        private String proxyAuthUser = "";
        private String proxyAuthPass = "";
        private String authUser = "";
        private String authPass = "";
        private boolean autoban = false;
        private List<ConfigsBean> configs;
        // getters and setters omitted
    }

    // ConfigsBean.java
    public class ConfigsBean {
        private String remarks;
        private String server;
        private int server_port;
        private String password;
        private String method;
        private String obfs;
        private String obfsparam = "";
        private String remarks_base64 = "";
        private boolean tcp_over_udp = false;
        private boolean udp_over_tcp = false;
        private String protocol = "origin";
        private boolean obfs_udp = false;
        private boolean enable = true;
        private String id;
        // getters and setters omitted
    }

The fetch method itself:

    package cn.mrdear.core;

    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.DomElement;
    import com.gargoylesoftware.htmlunit.html.DomNodeList;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;

    import java.io.IOException;
    import java.util.List;
    import java.util.stream.Collectors;

    import cn.mrdear.model.ConfigsBean;
    import cn.mrdear.util.ModelUtil;

    /**
     * @author Niu Li
     * @date 2016/10/8
     */
    public class MianVpn {
        private static final String HOME_PAGE = "https://www.mianvpn.com";

        public List<ConfigsBean> fetch(WebClient webClient) throws IOException {
            // Load the whole page
            final HtmlPage page = webClient.getPage(HOME_PAGE);
            // Collect all a tags
            DomNodeList<DomElement> domNodeList = page.getElementsByTagName("a");
            List<ConfigsBean> results = domNodeList.stream()
                    // Keep only the a tags whose text is "Surge"
                    .filter(domElement -> {
                        if (domElement.getTextContent().equals("Surge")) {
                            System.out.println(domElement.getTextContent());
                            return true;
                        }
                        return false;
                    })
                    // Simulate the click and read the result
                    .map(domElement -> {
                        try {
                            HtmlPage tempPage = domElement.click();
                            // Wait after the click so the script can write its result into the modal
                            webClient.waitForBackgroundJavaScript(500);
                            // If the element is still missing, sleep for a moment and look it up again
                            DomElement surge_url = tempPage.getElementById("surge_url");
                            if (surge_url != null) {
                                String href = surge_url.getAttribute("href");
                                System.out.println(href);
                                // Convert to the desired result
                                return parseUrl(href);
                            }
                        } catch (IOException e) {
                            e.printStackTrace();
                        }
                        return null;
                    })
                    // Drop null results
                    .filter(configsBean -> configsBean != null)
                    // Collect into a list
                    .collect(Collectors.toList());
            return results;
        }

        /**
         * Parses a result such as
         * https://user.mianvpn.com/api/ss/surge/?host=47.88.188.62&port=10001&method=rc4-md5&pw=9575
         * The parameters are read by position, so the order host, port, method, pw is assumed.
         */
        private ConfigsBean parseUrl(String url) {
            String paramStr = url.substring(url.indexOf('?') + 1);
            String[] paramArr = paramStr.split("&");
            String host = paramArr[0].substring(paramArr[0].indexOf('=') + 1);
            int port = Integer.parseInt(paramArr[1].substring(paramArr[1].indexOf('=') + 1));
            String method = paramArr[2].substring(paramArr[2].indexOf('=') + 1);
            String pwd = paramArr[3].substring(paramArr[3].indexOf('=') + 1);
            ConfigsBean configsBean = new ConfigsBean();
            configsBean.setRemarks(host);
            configsBean.setServer(host);
            configsBean.setServer_port(port);
            configsBean.setMethod(method);
            configsBean.setPassword(pwd);
            configsBean.setObfs("http_simple");
            configsBean.setId(ModelUtil.generateId());
            return configsBean;
        }
    }
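
The parsing step above reads the query parameters by position, which breaks if the site ever reorders them. A possible alternative is to look them up by name. This is only a sketch: `SurgeUrlQuery` and `parse` are hypothetical names, and the values are not URL-decoded, on the assumption that they contain no percent-encoded characters.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: splits the query string of a URL into name/value pairs.
final class SurgeUrlQuery {
    private SurgeUrlQuery() {}

    static Map<String, String> parse(String url) {
        Map<String, String> params = new LinkedHashMap<>();
        int q = url.indexOf('?');
        if (q < 0 || q == url.length() - 1) {
            return params; // no query string at all
        }
        for (String pair : url.substring(q + 1).split("&")) {
            if (pair.isEmpty()) {
                continue;
            }
            int eq = pair.indexOf('=');
            // A parameter without '=' is stored with an empty value
            String key = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            params.put(key, value);
        }
        return params;
    }
}
```

With this, the fields would be read as `params.get("host")`, `params.get("port")`, `params.get("method")` and `params.get("pw")`, whatever order they arrive in.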

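The method above returns a list, so a separate main method calls it. This way several scraping methods can be written and their results combined at the end.

Main method:
Both writing and reading the file use fastjson.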

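    import com.alibaba.fastjson.JSON;
    import com.gargoylesoftware.htmlunit.WebClient;

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.util.List;

    // Imports of the project's own classes (SSModel, ConfigsBean, WebClientUtil) are omitted.
    public class Main {
        private static final String SS_PATH = "D:\\tools\\翻墙\\gui-config.json";

        public static void main(String[] args) {
            try (final WebClient webClient = WebClientUtil.INSTANCE.webClient) {
                MianVpn mianVpn = new MianVpn();
                List<ConfigsBean> mianVpns = mianVpn.fetch(webClient);
                for (ConfigsBean vpn : mianVpns) {
                    System.out.println(vpn);
                }
                // Read the existing config file first. The input stream is closed before the
                // output stream is opened, because FileOutputStream truncates the file on open.
                File file = new File(SS_PATH);
                SSModel model = null;
                if (file.exists()) {
                    try (InputStream inputStream = new FileInputStream(file)) {
                        model = JSON.parseObject(inputStream, null, SSModel.class);
                    }
                }
                if (model == null) {
                    model = new SSModel();
                }
                // Replace the configs section, whether or not an old config existed
                model.setConfigs(mianVpns);
                try (OutputStream outputStream = new FileOutputStream(file)) {
                    JSON.writeJSONString(outputStream, model);
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }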
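
One thing to watch when the same file is both read and written: opening a FileOutputStream on an existing file truncates it immediately, so the old content has to be read in full before the file is opened for writing. A possible way to make the rewrite safer is to write to a temporary file and then move it over the original. The sketch below uses only the JDK; `ConfigFileRewriter` and `rewrite` are hypothetical names, and the content is handled as a plain string rather than through fastjson.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.function.UnaryOperator;

// Hypothetical helper: read a file, transform its content, then replace the file.
final class ConfigFileRewriter {
    private ConfigFileRewriter() {}

    /** Returns the new content; a missing file is treated as empty. */
    static String rewrite(Path file, UnaryOperator<String> update) {
        try {
            // Read everything before anything is opened for writing
            String current = Files.exists(file)
                    ? new String(Files.readAllBytes(file), StandardCharsets.UTF_8)
                    : "";
            String updated = update.apply(current);
            // Write next to the target, then swap the finished file into place
            Path dir = file.toAbsolutePath().getParent();
            Path tmp = Files.createTempFile(dir, "gui-config", ".tmp");
            Files.write(tmp, updated.getBytes(StandardCharsets.UTF_8));
            Files.move(tmp, file, StandardCopyOption.REPLACE_EXISTING);
            return updated;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

If the update step fails halfway, the original gui-config.json is left untouched, because only the temporary file has been written at that point.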

Scrape result: (screenshot in the original post)

Accounts and passwords from other sites can be scraped as well, with all the scrapers called together from the main method.
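
Combining several scrapers could look roughly like the sketch below. `AccountSource` and `AccountSources` are hypothetical names that do not exist in the original project; the sketch assumes each scraper can be wrapped as a function that returns a list and may throw IOException, as the fetch method does.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical abstraction over one scraped site.
interface AccountSource<T> {
    List<T> fetch() throws IOException;
}

final class AccountSources {
    private AccountSources() {}

    /** Collects the results of every source; a source that fails is skipped. */
    static <T> List<T> fetchAll(List<AccountSource<T>> sources) {
        List<T> all = new ArrayList<>();
        for (AccountSource<T> source : sources) {
            try {
                all.addAll(source.fetch());
            } catch (IOException e) {
                // One unreachable site should not discard the accounts from the others
                System.err.println("Skipping a source that failed: " + e.getMessage());
            }
        }
        return all;
    }
}
```

The existing scraper would then be registered with something like `sources.add(() -> new MianVpn().fetch(webClient))`, and the combined list passed to `setConfigs`.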

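4. Using a bat script

The project is packaged as a jar, which has to be run every time the password expires. A script can take over this chore: a bat script that runs java -jar XX.jar is enough.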

    @echo off
    color 1f
    cls
    echo.
    echo 1 Fetch accounts
    echo.
    echo 2 Exit
    echo.
    SET t=
    SET /P t=Choose 1/2:
    IF /I '%t:~0,1%'=='1' GOTO start
    IF /I '%t:~0,1%'=='2' GOTO stop
    exit

    :start
    echo Fetching, please wait
    java -jar E://jar/mrdear-1.0.jar
    start D:\tools\翻墙\ShadowsocksR-dotnet4.0.exe
    goto end

    :stop
    echo Exiting, please wait
    goto end

    :end
    exit

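5. Other problems encountered

At first, the dependency jars were not bundled into the packaged jar, and the main entry point could not be found at run time. After some searching, it turned out that an extra plugin is needed to make the jar runnable.

The plugin writes the main class into MANIFEST.MF.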

    <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-shade-plugin</artifactId>
        <version>1.2.1</version>
        <executions>
            <execution>
                <phase>package</phase>
                <goals>
                    <goal>shade</goal>
                </goals>
                <configuration>
                    <transformers>
                        <!-- Configure the main class here -->
                        <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                            <mainClass>cn.mrdear.core.Main</mainClass>
                        </transformer>
                    </transformers>
                </configuration>
            </execution>
        </executions>
    </plugin>
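
To make concrete what ends up in MANIFEST.MF, the sketch below builds a manifest with a Main-Class entry using the JDK's own java.util.jar classes. It only illustrates the file format that java -jar reads and is not how the shade plugin works internally; `ManifestSketch` and `render` are hypothetical names.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.jar.Attributes;
import java.util.jar.Manifest;

// Hypothetical helper: renders the manifest text for a given main class.
final class ManifestSketch {
    private ManifestSketch() {}

    static String render(String mainClass) {
        Manifest manifest = new Manifest();
        Attributes main = manifest.getMainAttributes();
        // Manifest-Version must be set, otherwise the main attributes are not written
        main.put(Attributes.Name.MANIFEST_VERSION, "1.0");
        // This is the entry that java -jar uses to find the class to run
        main.put(Attributes.Name.MAIN_CLASS, mainClass);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            manifest.write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

Calling `ManifestSketch.render("cn.mrdear.core.Main")` yields text containing the line `Main-Class: cn.mrdear.core.Main`, which is the entry a runnable jar needs in its META-INF/MANIFEST.MF.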

6. Source code

github: https://github.com/nl101531/JavaWEB
