Crawling a website with jsoup and HttpClient

Source: Internet  Editor: 程序博客网  Date: 2024/05/20 09:06

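Recommendation: define a thread pool up front to manage the crawler threads. A pool of 20 threads is suggested, and you will need to configure the pool, worker, task and queue parameters (threading itself is not discussed here).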
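As a rough illustration of the pool recommended above, here is a minimal sketch using only `java.util.concurrent`. The class name `CrawlerPool`, the queue capacity of 1000 and the rejection policy are assumptions made for the example; only the thread count of 20 comes from the text.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: none of these names come from the original post.
final class CrawlerPool {
    static final int WORKERS = 20;          // thread count suggested in the text
    static final int QUEUE_CAPACITY = 1000; // assumed bound on waiting tasks

    // Builds a fixed-size pool backed by a bounded task queue.
    static ThreadPoolExecutor create() {
        return new ThreadPoolExecutor(
                WORKERS, WORKERS,                 // core size == maximum size
                60L, TimeUnit.SECONDS,            // keep-alive, unused unless core threads may time out
                new ArrayBlockingQueue<>(QUEUE_CAPACITY),
                // when the queue is full, the submitting thread runs the task itself
                new ThreadPoolExecutor.CallerRunsPolicy());
    }
}
```

A bounded queue with `CallerRunsPolicy` slows the producer down instead of letting pending crawl tasks grow without limit; other policies may suit a given site better.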

1. Simulating the request

Create a default CloseableHttpClient
CloseableHttpClient httpClient = HttpClients.createDefault();
Build a GET request
HttpGet httpGet = new HttpGet(url);
Set the request parameters (connect timeout 10 s, connection-request timeout 3 s, socket timeout 10 s)
RequestConfig config = RequestConfig.custom().setConnectTimeout(10 * 1000).setConnectionRequestTimeout(3 * 1000).setSocketTimeout(10 * 1000).build();
httpGet.setConfig(config);
Send the request with the client
CloseableHttpResponse closeableHttpResponse = httpClient.execute(httpGet);
Check the returned status code (200 means success)
closeableHttpResponse.getStatusLine().getStatusCode() == 200
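Convert the response body to a string (the original note refers to ISO-8859-1 here; CHARSET is the default charset applied when the response does not declare one)
String html = EntityUtils.toString(closeableHttpResponse.getEntity(), CHARSET);
Parse the decoded string with jsoup
Document doc = Jsoup.parse(html);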
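If a page body was decoded as ISO-8859-1 but was really sent in another charset such as UTF-8, the resulting string looks garbled. The sketch below shows one way such text could be repaired with the standard library alone. The class and method names are hypothetical, and it assumes the string really was produced by ISO-8859-1 decoding, which maps every byte to exactly one character and is therefore reversible.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original post.
final class CharsetRepair {
    // Re-interprets text that was decoded as ISO-8859-1 using the page's actual charset.
    static String redecode(String misdecoded, Charset actual) {
        // ISO-8859-1 is a one-byte-per-char mapping, so this recovers the original bytes
        byte[] raw = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, actual);
    }
}
```

For example, `CharsetRepair.redecode(html, StandardCharsets.UTF_8)` would be applied to the decoded string before handing it to jsoup. When the real charset is known in advance, passing it as the default charset to the decoding step avoids the round trip altogether.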

2. Inspecting the document content

Collect every link on the page
Elements links = doc.select("a[href]"); // matches <a href="xxxx"/>
Apply your own rules to drop links such as "back" or list links, keeping only the usable ones
for (Element link : links) { /* ... */ }
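The filtering rules are left open in the text, so the following is only a sketch under assumed rules: keep absolute http(s) links and drop everything else (in-page anchors, `javascript:` links, blanks) as well as duplicates. `LinkFilter` and `usable` are hypothetical names. With jsoup, the input strings could be gathered with `link.attr("abs:href")`, which resolves relative links against the document's base URI.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper illustrating one possible set of filtering rules.
final class LinkFilter {
    // Keeps absolute http(s) links in their original order and removes duplicates.
    static List<String> usable(List<String> hrefs) {
        Set<String> kept = new LinkedHashSet<>(); // preserves insertion order, drops repeats
        for (String href : hrefs) {
            if (href == null) {
                continue;
            }
            String h = href.trim();
            if (h.startsWith("http://") || h.startsWith("https://")) {
                kept.add(h);
            }
        }
        return new ArrayList<>(kept);
    }
}
```

Site-specific rules (for instance excluding pagination or "back to list" URLs by pattern) would be added as further conditions inside the loop.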
Iterate over the usable links asynchronously on a thread, thread1
thread1.start(); // section 3 below runs here
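Release the threads once crawling is finished
threads.shutdown(); // assuming threads is an ExecutorService; Thread.destroy() was never implemented and is no longer part of the JDK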
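The hand-off described in this section (process each usable link asynchronously, then release the threads) can be sketched with an `ExecutorService`. This is an assumed arrangement rather than the author's code: `CrawlTasks`, `processAll` and the `handler` callback are hypothetical, and the handler stands in for the download step of section 3.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical helper, not part of the original post.
final class CrawlTasks {
    // Runs handler once per link on a pool, then shuts the pool down.
    // Returns true if every task finished within the timeout.
    static boolean processAll(List<String> links, Consumer<String> handler,
                              int threads, long timeoutSeconds) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (String link : links) {
            pool.execute(() -> handler.accept(link));
        }
        pool.shutdown(); // stop accepting new tasks; already submitted ones still run
        try {
            return pool.awaitTermination(timeoutSeconds, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            pool.shutdownNow();                 // cancel what is left
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            return false;
        }
    }
}
```

Calling `shutdown()` followed by `awaitTermination` lets the worker threads end on their own once the queue is drained, which is the usual replacement for trying to destroy threads explicitly.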

3. (Runs asynchronously, as noted above) Fetching and downloading the wanted content from the usable links, e.g. img

Open a new connection
URL url = new URL(downloadFileUrl);
URLConnection urlConnection = url.openConnection();
Fail if the connection is not established within 10 seconds
urlConnection.setConnectTimeout(10*1000);
Fail if a read takes longer than 7 seconds
urlConnection.setReadTimeout(7*1000);
Set the request header; a Mozilla-style User-Agent string is used here
urlConnection.setRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)");
Obtain the input stream
InputStream inputStream = urlConnection.getInputStream();
Read the stream into an in-memory buffer
byte[] bytes = new byte[1024 * 4];
int len=0;
ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
while ((len=inputStream.read(bytes))!=-1){
    byteArrayOutputStream.write(bytes,0,len);
}
Keep the bytes for writing and close the buffer stream to release resources
byte[] dataArrByte = byteArrayOutputStream.toByteArray();
byteArrayOutputStream.close();
Create the target directory if it does not exist
File downloadFile = new File(dirPath);
if (!downloadFile.exists()){
    downloadFile.mkdirs();
}
Skip the download if the file already exists
String fileName = CrawlerUrl.fileName(downloadFileUrl);
File file = new File(downloadFile, fileName);
if (!file.exists()){
    // create the local file and write the bytes; try-with-resources closes the stream even on failure
    try (FileOutputStream fileOutputStream = new FileOutputStream(file)) {
        fileOutputStream.write(dataArrByte);
    }
    System.out.println("File [" + fileName + "] saved to " + file.getPath());
}else {
    System.out.println("File [" + fileName + "] already exists");
}
Close the input stream
if (inputStream!=null){
    inputStream.close();
}
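The download steps above can also be written more compactly with `java.nio.file`, streaming straight to disk instead of buffering the whole file in memory. This is a sketch under assumptions, not the author's implementation: `Downloader`, `fileNameOf` and `saveIfAbsent` are hypothetical names, and `fileNameOf` is only a guess at what the post's `CrawlerUrl.fileName` helper does (take the text after the last `/`, ignoring any query string).

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of the original post.
final class Downloader {
    // Assumed stand-in for CrawlerUrl.fileName: last path segment without the query string.
    static String fileNameOf(String url) {
        int q = url.indexOf('?');
        String path = q >= 0 ? url.substring(0, q) : url;
        return path.substring(path.lastIndexOf('/') + 1);
    }

    // Streams in to dir/fileName unless that file already exists.
    // Returns true if a file was written, false if it was skipped.
    static boolean saveIfAbsent(InputStream in, Path dir, String fileName) {
        try (InputStream source = in) {   // closed even if the copy fails
            Files.createDirectories(dir); // does nothing if the directory already exists
            Path target = dir.resolve(fileName);
            if (Files.exists(target)) {
                return false;
            }
            Files.copy(source, target);
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With the variables from this section, a call might look like `Downloader.saveIfAbsent(urlConnection.getInputStream(), Paths.get(dirPath), Downloader.fileNameOf(downloadFileUrl))`. Checking for the file before opening the connection at all would save the request for files that are already present.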

2 1