WebCrawler-HttpClient

Source: Internet. Published by: 魔卡幻想淘宝. Editor: 程序博客网. Date: 2024/05/29 09:25

Learning HttpClient

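Although the java.net package in the JDK already provides basic support for the HTTP protocol, that support is not rich or flexible enough for most applications. HttpClient, originally a subproject of Apache Jakarta Commons (now part of Apache HttpComponents), is a client-side programming toolkit that provides efficient, up-to-date, feature-rich support for HTTP, including the latest versions and recommendations of the protocol.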
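For comparison, here is a minimal sketch of what the JDK-only baseline looks like with java.net.HttpURLConnection. The class name JdkHttpDemo, the helper names, and the throwaway local server (built with the JDK's com.sun.net.httpserver so the example needs no network access) are assumptions made for illustration; they are not from the original article. It assumes Java 9 or later for readAllBytes.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URI;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: a GET request using only the JDK.
final class JdkHttpDemo {

    // Sends a GET request with HttpURLConnection and returns the body as UTF-8 text.
    static String fetch(String address) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(address).toURL().openConnection();
            conn.setRequestMethod("GET");
            try (InputStream in = conn.getInputStream()) {
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            } finally {
                conn.disconnect();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Starts a local server on a free port, fetches "/" from it, then stops it.
    static String fetchFromLocalServer() {
        try {
            HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
            server.createContext("/", exchange -> {
                byte[] body = "hello".getBytes(StandardCharsets.UTF_8);
                exchange.sendResponseHeaders(200, body.length);
                try (OutputStream out = exchange.getResponseBody()) {
                    out.write(body);
                }
            });
            server.start();
            try {
                return fetch("http://127.0.0.1:" + server.getAddress().getPort() + "/");
            } finally {
                server.stop(0);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(fetchFromLocalServer());
    }
}
```

Even this small case needs manual casting, stream handling, and cleanup; cookies, connection pooling, and retries would all have to be written by hand, which is the gap HttpClient is meant to fill.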

URL and URI

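URI stands for Uniform Resource Identifier, and URL stands for Uniform Resource Locator. What is the difference between the two? It is a matter of scope: URLs are a subset of URIs. A URI is made up of the naming mechanism used to access the resource, the name of the host that stores it, and the path of the resource itself, while a URL is made up of the protocol, the IP address (or host name) of the host, and the specific location of the resource on that host. From this composition it is clear that a URL is what we normally type into the browser's address bar, such as "http://www.baidu.com". A URL is simply a concrete form of URI: every URL is a URI, but not every URI is a URL.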
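The distinction can be seen with the JDK's java.net.URI class. The sketch below is illustrative only: the class name UriDemo, the helper looksLikeUrl, and the sample URN are assumptions, and the "has a scheme and a host" check is a rough heuristic rather than a formal definition of a URL.

```java
import java.net.URI;

// Hypothetical demo class; not part of the original article.
final class UriDemo {

    // Rough heuristic: a URI that names both a scheme and a host also says
    // where the resource lives, so we treat it as a URL.
    static boolean looksLikeUrl(URI uri) {
        return uri.getScheme() != null && uri.getHost() != null;
    }

    public static void main(String[] args) {
        URI page = URI.create("http://www.baidu.com/index.html");
        URI book = URI.create("urn:isbn:0451450523"); // identifies a book, locates nothing

        System.out.println(page.getScheme()); // protocol part
        System.out.println(page.getHost());   // host part
        System.out.println(page.getPath());   // path of the resource on the host
        System.out.println(looksLikeUrl(page));
        System.out.println(looksLikeUrl(book));
    }
}
```

Both strings parse as valid URIs, but only the first one carries the protocol and host information needed to fetch the resource.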

HTTPClient

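The code below is based on HttpClient 4.0; to run it, the imported jar must be version 4.0 or later. Note that DefaultHttpClient, used below, has been deprecated since version 4.3 in favor of HttpClientBuilder.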

First, create a client with HttpClient, which handles all HTTP-related operations. Think of it as opening a browser:

HttpClient httpClient = new DefaultHttpClient();

Next, create an HttpGet object, which is equivalent to entering a URL in the browser. Its constructor takes a String argument, the URL we want to request:

HttpGet httpGet = new HttpGet("http://www.baidu.com");

Pass the HttpGet to the client's execute method, which is like pressing Enter after typing the address. The call returns an HttpResponse, which represents the response to the request:

HttpResponse response = httpClient.execute(httpGet);

From the response we can obtain an HttpEntity, which carries much of the information in the HTTP message, including the content we are after:

HttpEntity entity = response.getEntity();

Calling getContent on the entity returns the content of the page. It comes back as an InputStream, but once we have an InputStream the rest is easy:

InputStream instream = entity.getContent();
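The stream holds raw bytes, so they still have to be decoded into text. One pitfall is decoding fixed-size byte chunks one at a time: a multi-byte UTF-8 character split across two reads comes out garbled. A minimal sketch of one way around this, using only the JDK, is to wrap the stream in an InputStreamReader. The class name StreamText and the method readAll are hypothetical, and a ByteArrayInputStream stands in for the stream returned by entity.getContent().

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; not part of the original article.
final class StreamText {

    // Reads the stream to the end and decodes it with the given charset.
    // The reader keeps incomplete byte sequences between reads, so characters
    // that straddle a buffer boundary are decoded correctly.
    static String readAll(InputStream in, Charset charset) {
        StringBuilder out = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buffer = new char[2048];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                out.append(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Stand-in for entity.getContent(): a small UTF-8 encoded page fragment.
        byte[] body = "<title>\u767e\u5ea6</title>".getBytes(StandardCharsets.UTF_8);
        InputStream instream = new ByteArrayInputStream(body);
        System.out.println(readAll(instream, StandardCharsets.UTF_8));
    }
}
```

The try-with-resources block also closes the stream, which matters with HttpClient because the underlying connection is only released once the content stream is closed.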

The following program fetches the content of the Baidu home page:

import java.io.InputStream;

import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;

public class Crawler {

    // Fetches the Baidu home page and prints the response body.
    public void testGet() throws Exception {
        HttpClient httpClient = new DefaultHttpClient();
        HttpGet httpGet = new HttpGet("http://www.baidu.com");
        HttpResponse response = httpClient.execute(httpGet);
        HttpEntity entity = response.getEntity();
        if (entity != null) {
            InputStream instream = entity.getContent();
            try {
                int l;
                byte[] temp = new byte[2048];
                while ((l = instream.read(temp)) != -1) {
                    // Decoding chunk by chunk can garble a multi-byte character
                    // that is split across two reads.
                    System.out.print(new String(temp, 0, l, "utf-8"));
                }
            } finally {
                // Closing the stream releases the underlying connection.
                instream.close();
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Crawler crawler = new Crawler();
        crawler.testGet();
        // crawler.Login(); // Login() is not defined in this listing
    }
}
