Implementing a Minimal Web Crawler


A web crawler sounds complicated, but the basic principle is not hard: given a URL, you download the content of that page, pick out the other URL addresses on it, and save them in a queue; then you take one URL from the queue, download it, pick out URLs again, and repeat until some stopping condition is met.
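In code, the whole idea boils down to a loop like the following minimal sketch (the class name CrawlSketch and the placeholder fetchLinks() are made up for illustration; the real download-and-extract step is what the rest of this post builds):

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class CrawlSketch {
    // placeholder for the real work: download `url` and return the URLs found on that page
    static List<String> fetchLinks(String url) {
        return new ArrayList<String>();
    }

    public static void main(String[] args) {
        Deque<String> frontier = new ArrayDeque<String>(); // URLs waiting to be visited
        Set<String> visited = new HashSet<String>();       // URLs already visited
        frontier.add("http://example.com");                // the seed URL
        while (!frontier.isEmpty() && visited.size() < 100) {
            String url = frontier.poll();
            if (!visited.add(url)) {
                continue;                                  // already visited, skip it
            }
            for (String link : fetchLinks(url)) {
                if (!visited.contains(link)) {
                    frontier.add(link);
                }
            }
        }
    }
}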

Of course, real crawlers also involve various strategies, such as breadth-first and depth-first traversal, but since this is the simplest possible crawler we won't discuss them here. (As it happens, because we store URLs in a FIFO queue below, the crawl order works out to be breadth-first.)


OK, let's start from the simplest principle. First, we need to build a data structure that stores URLs.

import java.util.LinkedList;

public class queue {
    private LinkedList<Object> queue;

    // constructor
    public queue() {
        queue = new LinkedList<Object>();
    }

    // add an element to the tail of the queue
    public void enQueue(Object elem) {
        queue.addLast(elem);
    }

    // remove and return the element at the head of the queue
    public Object deQueue() {
        return queue.removeFirst();
    }

    // whether the queue is empty
    public boolean isEmpty() {
        return queue.isEmpty();
    }

    // whether the queue contains a given element
    public boolean contains(Object elem) {
        return queue.contains(elem);
    }
}
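A quick usage check of this queue (a hypothetical snippet you could drop into any throwaway main method):

queue q = new queue();
q.enQueue("http://example.com");
q.enQueue("http://example.org");
System.out.println(q.contains("http://example.com")); // true
System.out.println(q.deQueue());                      // http://example.com (FIFO order)
System.out.println(q.isEmpty());                      // false, one element left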
Next, we write another class on top of it to keep track of our URLs.

import java.util.HashSet;
import java.util.Set;

public class MyQueue {
    // set of URLs that have already been visited
    private Set<String> visitedQueue;
    // queue of URLs not yet visited
    private queue unVisitedQueue;

    // constructor
    public MyQueue() {
        visitedQueue = new HashSet<String>();
        unVisitedQueue = new queue();
    }

    // mark a URL as visited
    public void addURL(String url) {
        visitedQueue.add(url);
    }

    // return the set of visited URLs
    public Set<String> getVisited() {
        return this.visitedQueue;
    }

    // remove a URL from the visited set
    public void removeUrl(String url) {
        visitedQueue.remove(url);
    }

    // dequeue the next unvisited URL
    public String getUnVURL() {
        return (String) unVisitedQueue.deQueue();
    }

    // true if the URL is NOT already waiting in the unvisited queue
    // (note: despite the name, this is a "not yet queued" check)
    public boolean contains(String url) {
        return !unVisitedQueue.contains(url);
    }

    // enqueue a URL we have not visited yet; skips null, blank,
    // already-visited, already-queued, and non-http URLs
    public void addUnVURL(String url) {
        if (url != null && !url.trim().equals("")
                && !visitedQueue.contains(url)
                && !unVisitedQueue.contains(url)
                && url.startsWith("http")) {  // was url.contains("http"); startsWith is stricter
            unVisitedQueue.enQueue(url);
        }
    }

    // number of URLs visited so far
    public int getVisitedNum() {
        return visitedQueue.size();
    }

    // whether the unvisited queue is empty
    public boolean isEmpty() {
        return unVisitedQueue.isEmpty();
    }
}
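Here is a hypothetical snippet showing how the de-duplication in addUnVURL behaves:

MyQueue mq = new MyQueue();
mq.addUnVURL("http://example.com");
mq.addUnVURL("http://example.com");   // duplicate of a queued URL: ignored
mq.addUnVURL("ftp://example.com");    // not an http(s) URL: ignored
String next = mq.getUnVURL();         // "http://example.com"
mq.addURL(next);                      // mark it as visited
mq.addUnVURL(next);                   // already visited: ignored
System.out.println(mq.isEmpty());     // true, nothing left to visit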

Finally, we write the main program. When you enter a URL, we use Apache Commons HttpClient (you need to download the JAR yourself) to download its content, filter out the URL addresses in it with a regular expression, and print each crawled URL to the console. Once more than 1000 URLs have been found, the program terminates.
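Before the full program, here is a small standalone check of the href regular expression it uses (the sample HTML and the class name HrefRegexDemo are made up for illustration):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HrefRegexDemo {
    public static void main(String[] args) {
        String html = "<a href=\"http://example.com/a\">A</a>"
                + "<a href=\"/relative/path\">B</a>";
        // match everything between href=" and the next quote
        Matcher m = Pattern.compile("(?<=(href=\")).*?(?=\")").matcher(html);
        while (m.find()) {
            System.out.println(m.group());
        }
        // prints:
        //   http://example.com/a
        //   /relative/path
        // the crawler below keeps only matches starting with "http",
        // so relative links like the second one are dropped
    }
}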

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.commons.httpclient.DefaultHttpMethodRetryHandler;
import org.apache.commons.httpclient.Header;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpStatus;
import org.apache.commons.httpclient.methods.GetMethod;
import org.apache.commons.httpclient.params.HttpMethodParams;

public class HttpDownLoader {
    static int count = 0;

    public static void main(String[] args) {
        HttpClient httpClient = new HttpClient();
        // HTTP connection timeout: 5s
        httpClient.getHttpConnectionManager().getParams().setConnectionTimeout(5000);
        Scanner sc = new Scanner(System.in);
        MyQueue mq = new MyQueue();
        mq.addUnVURL(sc.next());   // read the seed URL from the console
        while (count < 1000 && !mq.isEmpty()) {
            String sh = mq.getUnVURL();
            GetMethod getMethod = new GetMethod(sh);
            // GET request (socket) timeout: 5s
            getMethod.getParams().setParameter(HttpMethodParams.SO_TIMEOUT, 5000);
            // retry failed requests with the default retry handler
            getMethod.getParams().setParameter(HttpMethodParams.RETRY_HANDLER,
                    new DefaultHttpMethodRetryHandler());
            mq.addURL(sh);   // mark this URL as visited
            try {
                StringBuilder sb = new StringBuilder();
                int status = httpClient.executeMethod(getMethod);
                BufferedReader br = null;
                if (status == HttpStatus.SC_OK) {
                    br = new BufferedReader(new InputStreamReader(
                            getMethod.getResponseBodyAsStream()));
                    String line = null;
                    while ((line = br.readLine()) != null) {
                        sb.append(line);
                    }
                } else if (status == HttpStatus.SC_MOVED_PERMANENTLY
                        || status == HttpStatus.SC_MOVED_TEMPORARILY
                        || status == HttpStatus.SC_SEE_OTHER
                        || status == HttpStatus.SC_TEMPORARY_REDIRECT) {
                    // follow one redirect manually via the Location header
                    Header head = getMethod.getResponseHeader("location");
                    if (head != null) {
                        String newURL = head.getValue();
                        if (newURL == null || newURL.equals("")) {
                            newURL = "/";
                        }
                        GetMethod getMethod1 = new GetMethod(newURL);
                        // (was httpClient.equals(getMethod1), a typo for executeMethod)
                        httpClient.executeMethod(getMethod1);
                        br = new BufferedReader(new InputStreamReader(
                                getMethod1.getResponseBodyAsStream()));
                        String line = null;
                        while ((line = br.readLine()) != null) {
                            sb.append(line);
                        }
                    }
                }
                if (br != null) {
                    br.close();
                }
                // System.out.println(sb.toString()); // uncomment to dump the page source
                // extract everything between href=" and the closing quote
                String mode = "(?<=(href=\")).*?(?=\")";
                Pattern p = Pattern.compile(mode);
                Matcher m = p.matcher(sb.toString());
                while (m.find()) {
                    String url = m.group();
                    if (url.startsWith("http") && mq.contains(url)) {
                        System.out.println(url);   // print each newly found URL
                        mq.addUnVURL(url);
                        count++;
                    }
                }
            } catch (IOException e) {
                // ignore pages that fail to download
            } finally {
                getMethod.releaseConnection();
            }
        }
        System.out.println("Visited " + mq.getVisitedNum() + " pages");
    }
}
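To run it, put the Commons HttpClient 3.x JAR on the classpath (along with its commons-codec and commons-logging dependencies, if memory serves), start the program, and type a seed URL at the prompt; the discovered URLs are printed as they are found.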
And with that, our simple web crawler is complete.
