Crawling a web page with HttpClient

Source: Internet  Editor: 程序博客网  Time: 2024/05/16 23:56

1. Set the URL to crawl

String url = "http://www.gametoutiao.com/toutiao/index.html";

2. Create the client

HttpClient client = new HttpClient();
// Connection timeout and socket read timeout, both in milliseconds (90 s)
client.getHttpConnectionManager().getParams().setConnectionTimeout(90000);
client.getHttpConnectionManager().getParams().setSoTimeout(90000);

3. Build the request headers

HashMap<String, String> headerMap = new HashMap<>();
headerMap.put("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8");
headerMap.put("Accept-Language", "zh-CN,zh;q=0.8");
// The Host header takes the host name only, without the "http://" scheme
headerMap.put("Host", "www.gametoutiao.com");
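
The Host value can also be derived from the URL instead of being typed by hand. The following is a minimal sketch using only the JDK; the class name HostHeader and the method hostHeaderFor are made up for this illustration and are not part of HttpClient.

```java
import java.net.URI;

class HostHeader {
    // Hypothetical helper: builds a Host header value (host, plus ":port"
    // when the URL names a port explicitly) from an absolute URL.
    static String hostHeaderFor(String url) {
        URI uri = URI.create(url);
        int port = uri.getPort(); // -1 when the URL has no explicit port
        return port == -1 ? uri.getHost() : uri.getHost() + ":" + port;
    }

    public static void main(String[] args) {
        // prints "www.gametoutiao.com"
        System.out.println(hostHeaderFor("http://www.gametoutiao.com/toutiao/index.html"));
    }
}
```

The result could then be passed to headerMap.put("Host", ...), so the header stays consistent if the crawl URL changes.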

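4. Create the GET method

GetMethod getmethod = new GetMethod(url);
// Retry with the default handler when a recoverable I/O error occurs
getmethod.getParams().setParameter(HttpMethodParams.RETRY_HANDLER, new DefaultHttpMethodRetryHandler());
getmethod.getParams().setParameter("http.protocol.cookie-policy", CookiePolicy.BROWSER_COMPATIBILITY);
// No "Accept-Encoding: gzip,deflate" header is set here: Commons HttpClient 3.x
// does not decompress response bodies itself, so a compressed response would
// have to be unpacked by hand before it could be turned into a String.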

5. Add the request headers to the GET method

// Entry is java.util.Map.Entry
for (Entry<String, String> entry : headerMap.entrySet()) {
    getmethod.addRequestHeader(entry.getKey(), entry.getValue());
}

6. Execute the method

int statusCode = client.executeMethod(getmethod);
if (statusCode != HttpStatus.SC_OK) {
    System.out.println("fail, status code: " + statusCode);
}

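7. Read the response body

byte[] result_body = getmethod.getResponseBody();
// Decode with the charset from the Content-Type header instead of the platform default.
// If the server declares no charset, pass the page's real encoding explicitly.
String body = new String(result_body, getmethod.getResponseCharSet());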
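
Why the charset matters when the bytes become a String can be seen without any network call. This is a small JDK-only sketch; the class name DecodeBody, the method decode, and the fallback to UTF-8 are assumptions made for the illustration. The escapes \u5c3e\u9875 spell 尾页, the "last page" link text used in step 8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DecodeBody {
    // Hypothetical helper: decodes a response body with the named charset and
    // falls back to UTF-8 when the name is null, malformed, or unsupported.
    static String decode(byte[] body, String charsetName) {
        Charset charset;
        try {
            charset = Charset.forName(charsetName);
        } catch (IllegalArgumentException e) {
            charset = StandardCharsets.UTF_8;
        }
        return new String(body, charset);
    }

    public static void main(String[] args) {
        String original = "\u5c3e\u9875";
        byte[] bytes = original.getBytes(StandardCharsets.UTF_8);
        // The right charset restores the text; the wrong one does not.
        System.out.println(decode(bytes, "UTF-8").equals(original));      // true
        System.out.println(decode(bytes, "ISO-8859-1").equals(original)); // false
    }
}
```

With a wrong charset the regular expressions in the later steps would not find the Chinese link text, even though the request itself succeeded.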

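8. Process the result: find the highest page number

// The dot is escaped so that it matches a literal "." before "html";
// 尾页 is the text of the "last page" link
Pattern pattern = Pattern.compile("(\\d+)\\.html\">尾页");
Matcher matcher = pattern.matcher(body);
if (matcher.find()) {
    System.out.println(matcher.group(1));
} else {
    System.out.println("not found");
}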
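
The same pattern can be tried offline against a hand-written fragment. This is only a sketch: the sample markup, including the index_N.html file naming, is invented for the test and may not match the real page, and LastPage and findLastPage are hypothetical names. The escapes \u5c3e\u9875 spell 尾页.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LastPage {
    // Digits directly before ".html" in the link whose text is the "last page" label.
    private static final Pattern LAST_PAGE =
            Pattern.compile("(\\d+)\\.html\">\u5c3e\u9875");

    // Returns the highest page number, or -1 when no "last page" link is present.
    static int findLastPage(String html) {
        Matcher m = LAST_PAGE.matcher(html);
        return m.find() ? Integer.parseInt(m.group(1)) : -1;
    }

    public static void main(String[] args) {
        // Invented pager markup, for illustration only.
        String sample = "<a href=\"/toutiao/index_2.html\">2</a>"
                + "<a href=\"/toutiao/index_37.html\">\u5c3e\u9875</a>";
        System.out.println(findLastPage(sample)); // prints 37
    }
}
```

Returning an int rather than printing makes it easier to loop over the remaining pages afterwards.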

9. Extract each article's title and URL

// Each article sits in one list item: <li class="col-sm-4 col-md-4">
Pattern pattern_content = Pattern.compile("<li class=\"col-sm-4 col-md-4\">(.*?)</li>", Pattern.DOTALL);
Matcher matcher_content = pattern_content.matcher(body);
while (matcher_content.find()) {
    String lis = matcher_content.group(1);
    // Sample: <img src="/d/file/toutiao/1603c85157af7077cc29bfa9256f68e8.jpg" width="220" height="145" alt="对游戏行业而言 垄断或是市场竞争的一种最优状态"/>
    Matcher m_title_url = Pattern.compile("<img src=\"(.*?)\" width=\"220\" height=\"145\" alt=\"(.*?)\"/>").matcher(lis);
    if (m_title_url.find()) {
        System.out.println(m_title_url.group(1)); // URL from the src attribute
        System.out.println(m_title_url.group(2)); // title from the alt attribute
    }
}
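
To reuse the extraction, the two patterns can be compiled once and the matches collected into a list. The sketch below uses only the JDK and made-up sample markup; ArticleExtractor and extract are hypothetical names, and each result is a two-element array holding the src value and the alt text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ArticleExtractor {
    private static final Pattern ITEM =
            Pattern.compile("<li class=\"col-sm-4 col-md-4\">(.*?)</li>", Pattern.DOTALL);
    private static final Pattern IMG =
            Pattern.compile("<img src=\"(.*?)\" width=\"220\" height=\"145\" alt=\"(.*?)\"/>");

    // Returns one {src, alt} pair per list item that contains a matching <img> tag.
    static List<String[]> extract(String html) {
        List<String[]> articles = new ArrayList<>();
        Matcher item = ITEM.matcher(html);
        while (item.find()) {
            Matcher img = IMG.matcher(item.group(1));
            if (img.find()) {
                articles.add(new String[] { img.group(1), img.group(2) });
            }
        }
        return articles;
    }

    public static void main(String[] args) {
        // Invented markup, for illustration only.
        String sample = "<ul>\n"
                + "<li class=\"col-sm-4 col-md-4\">\n"
                + "<img src=\"/d/file/toutiao/a.jpg\" width=\"220\" height=\"145\" alt=\"First title\"/>\n"
                + "</li>\n"
                + "<li class=\"col-sm-4 col-md-4\">no image here</li>\n"
                + "</ul>";
        for (String[] article : extract(sample)) {
            System.out.println(article[0] + " -> " + article[1]);
        }
    }
}
```

Regular expressions like these depend on the exact attribute order and sizes in the markup, so they stop matching as soon as the page template changes.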

10. Release the connection

getmethod.releaseConnection();