Web Scraping Methods (1): HttpClient


I. Introduction to HttpClient

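HttpClient can be used to fetch web pages. Its main advantage is that it is fast and efficient.

Its drawbacks are that it does not execute JavaScript and offers no methods for parsing the fetched document. It is generally suitable as a plain, general-purpose way of fetching pages.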
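To make the JavaScript limitation concrete, here is a minimal sketch that uses only the Java standard library. The HTML string is a made-up stand-in for a response body as HttpClient would return it, and the names `RawHtmlDemo` and `textOfDiv` are hypothetical. Because nothing runs the script, the div that a browser would fill in stays empty in the raw text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RawHtmlDemo {
    // Hypothetical sample page: in a browser, the script would fill the "dynamic" div.
    static final String RAW_HTML =
            "<div id=\"static\">Rendered on the server</div>"
            + "<div id=\"dynamic\"></div>"
            + "<script>document.getElementById('dynamic').textContent = 'Loaded by JS';</script>";

    // Returns the text inside <div id="...">...</div>, or null if there is no such div.
    static String textOfDiv(String html, String id) {
        Pattern p = Pattern.compile("<div id=\"" + Pattern.quote(id) + "\">(.*?)</div>");
        Matcher m = p.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // The static div has its text; the dynamic div is empty because no script ran.
        System.out.println("static  = [" + textOfDiv(RAW_HTML, "static") + "]");
        System.out.println("dynamic = [" + textOfDiv(RAW_HTML, "dynamic") + "]");
    }
}
```

If the data you need only appears after scripts run, a plain HTTP fetch like this will not see it.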

II. Example

1. Add the Maven dependency

<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.5.3</version>
</dependency>

2. Code example

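import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class HttpClientCrawlerMain {

    public static void main(String[] args) throws Exception {
        httpClientCrawler();
    }

    static void httpClientCrawler() throws Exception {
        String url = "http://www.ifeng.com/";
        // try-with-resources closes the response and the client, releasing the connection
        try (CloseableHttpClient httpclient = HttpClients.createDefault();
             CloseableHttpResponse response = httpclient.execute(new HttpGet(url))) {
            HttpEntity entity = response.getEntity();
            if (entity != null) {
                // Decode the response body as UTF-8 text
                String content = EntityUtils.toString(entity, "UTF-8");
                // Extract the headline with a regular expression
                Pattern headlinePat = Pattern.compile("<div id=\"headLineDefault\">[\\s\\S]*<h1><a href=\"http://news.ifeng.com/mainland.*?target=\"_blank\">(.*?)</a>");
                Matcher m = headlinePat.matcher(content);
                if (m.find()) {
                    String result = m.group(1);
                    System.out.println("ifeng headline is : " + result);
                }
            }
        }
    }
}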

Output:

ifeng headline is : 习近平出席上合成员国元首理事会会议
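The headline pattern above joins the container div to the `<h1>` link with `[\s\S]*`, which is a greedy quantifier. The sketch below runs that pattern offline against a made-up page fragment to show how the greedy form compares with the reluctant `[\s\S]*?`. The sample HTML and the names `HeadlineRegexDemo` and `extract` are assumptions for illustration, not taken from the real ifeng page.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HeadlineRegexDemo {
    // Hypothetical fragment with two <h1> links after the headline container.
    static final String SAMPLE =
            "<div id=\"headLineDefault\">\n"
            + "<h1><a href=\"http://news.ifeng.com/mainland/a.shtml\" target=\"_blank\">First headline</a></h1>\n"
            + "<h1><a href=\"http://news.ifeng.com/mainland/b.shtml\" target=\"_blank\">Second headline</a></h1>\n"
            + "</div>";

    // Builds the headline pattern with the given filler between the div and the <h1>,
    // then returns the captured link text, or null if nothing matches.
    static String extract(String html, String filler) {
        Pattern p = Pattern.compile("<div id=\"headLineDefault\">" + filler
                + "<h1><a href=\"http://news.ifeng.com/mainland.*?target=\"_blank\">(.*?)</a>");
        Matcher m = p.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Greedy filler backtracks from the end of the input, so it lands on the last <h1>.
        System.out.println("greedy    : " + extract(SAMPLE, "[\\s\\S]*"));
        // Reluctant filler expands from the div, so it stops at the first <h1>.
        System.out.println("reluctant : " + extract(SAMPLE, "[\\s\\S]*?"));
    }
}
```

On this sample the greedy form captures "Second headline" and the reluctant form captures "First headline". When a page contains only one matching `<h1>`, both forms give the same result. The reluctant form is usually the safer choice if the goal is the first link inside the container.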
