A Simple Breadth-First Web Crawler


In real applications, a web crawler traverses the Internet and fetches all of the pages we are interested in. To make this easier to picture, think of the entire Internet as one enormous graph: every page is a node, and every hyperlink on a page is a directed edge. A crawler can traverse this graph in one of two ways: depth-first or breadth-first. Depth-first traversal is rarely used, because the deeper it goes, the less relevant the fetched pages tend to be to the original topic. In practice, authors like to group links on related topics within the same page, so pages fetched in breadth-first order stay far more relevant to the topic.

Our crawler is given one initial link as the seed. A breadth-first crawler starts from the seed node, extracts the hyperlinks in each page, and places them in a queue to be fetched in order. Throughout the crawl it maintains two tables: a visited table holding the hyperlinks already processed and an unvisited table holding those not yet processed. The crawler's main job is to parse a URL and extract new URLs, proceeding as follows: (1) compare each extracted URL against both the visited and unvisited tables, and if it appears in neither, add it to the unvisited table to mark it as not yet visited; (2) once every hyperlink in the current page has been handled, move the current URL into the visited table, take the URL at the head of the unvisited table, and go back to step (1); (3) repeat until the unvisited table is empty or a preset number of pages has been reached.
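The whole procedure can be condensed into a short sketch before diving into the classes. Everything here (the BfsSketch class name, the stubbed download and extractLinks methods, the LIMIT constant) is illustrative scaffolding for the three steps above, not part of the final program; the real versions are developed in the rest of the article:

import java.util.HashSet;
import java.util.LinkedList;
import java.util.Set;

// Minimal sketch of the breadth-first crawl loop; download() and
// extractLinks() are stubs standing in for the classes built below.
public class BfsSketch {
    static final int LIMIT = 2000;  // page-count cap

    static void crawl(String seed) {
        Set<String> visited = new HashSet<String>();             // visited table
        LinkedList<String> unvisited = new LinkedList<String>(); // unvisited table
        unvisited.add(seed);
        while (!unvisited.isEmpty() && visited.size() < LIMIT) {
            String url = unvisited.removeFirst();  // head of the unvisited table
            download(url);                         // fetch and save the page
            visited.add(url);                      // step (2): mark as visited
            for (String link : extractLinks(url)) {
                if (!visited.contains(link) && !unvisited.contains(link)) {
                    unvisited.add(link);           // step (1): enqueue unseen links
                }
            }
        }
    }

    // Stubs only; the real versions are DownLoadFile and HtmlParserTool below.
    static void download(String url) { }
    static Set<String> extractLinks(String url) { return new HashSet<String>(); }
}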

Here is a simple crawler implemented in Java, built on two open-source libraries: HttpClient and HtmlParser. The program consists of five parts, presented in order below: a URL queue (Queue), a bookkeeping class for visited and unvisited URLs (LinkQueue), a page downloader (DownLoadFile), a link extractor (HtmlParserTool), and the main crawler class (MyCrawler).

First, define the URL queue, implemented on top of a LinkedList:

import java.util.LinkedList;

public class Queue {

    // Underlying FIFO storage for the URLs waiting to be crawled.
    private LinkedList queue = new LinkedList();

    // Add a URL to the tail of the queue.
    public void enQueue(Object url) {
        queue.addLast(url);
    }

    // Remove and return the URL at the head of the queue.
    public Object deQueue() {
        return queue.removeFirst();
    }

    // Check whether the queue is empty.
    public boolean isQueueEmpty() {
        return queue.isEmpty();
    }

    // Check whether the queue already contains the given URL.
    public boolean contains(Object url) {
        return queue.contains(url);
    }
}
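A quick usage sketch (the URLs are just illustrative values, placed inside any main method) confirms the FIFO behavior that breadth-first traversal relies on:

Queue queue = new Queue();
queue.enQueue("http://www.baidu.com");
queue.enQueue("http://www.sina.com");
System.out.println(queue.deQueue());                       // http://www.baidu.com
System.out.println(queue.contains("http://www.sina.com")); // true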

Next, LinkQueue keeps visited URLs separate from unvisited ones. For fast membership tests, the visited URLs are stored in a HashSet:

import java.util.HashSet;
import java.util.Set;

public class LinkQueue {

    // URLs that have already been fetched; a HashSet gives O(1) membership tests.
    private static Set visitedUrl = new HashSet();
    // URLs waiting to be fetched, in FIFO (breadth-first) order.
    private static Queue unvisitedUrl = new Queue();

    public static Queue getUnVisitedUrl() {
        return unvisitedUrl;
    }

    // Record a URL as visited.
    public static void addVisitedUrl(String url) {
        visitedUrl.add(url);
    }

    // Remove a URL from the visited set.
    public static void removeVisitedUrl(String url) {
        visitedUrl.remove(url);
    }

    // Dequeue the next unvisited URL.
    public static Object unVisitedUrlDeQueue() {
        return unvisitedUrl.deQueue();
    }

    // Enqueue a URL only if it is non-empty and has never been seen,
    // which guarantees each URL is visited at most once.
    public static void addUnvisitedUrl(String url) {
        if (url != null && !url.trim().equals("")
                && !visitedUrl.contains(url) && !unvisitedUrl.contains(url)) {
            unvisitedUrl.enQueue(url);
        }
    }

    // Number of URLs visited so far.
    public static int getVisitedURLNum() {
        return visitedUrl.size();
    }

    // Check whether any unvisited URLs remain.
    public static boolean unVisitedEmpty() {
        return unvisitedUrl.isQueueEmpty();
    }
}
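The guarantee that each URL is fetched at most once lives entirely in addUnvisitedUrl. A small sketch (illustrative URLs, inside any main method) shows which calls are silently ignored:

LinkQueue.addUnvisitedUrl("http://www.baidu.com");
LinkQueue.addUnvisitedUrl("http://www.baidu.com"); // ignored: already queued
LinkQueue.addUnvisitedUrl("");                     // ignored: empty string
LinkQueue.addVisitedUrl("http://www.sina.com");
LinkQueue.addUnvisitedUrl("http://www.sina.com");  // ignored: already visited
System.out.println(LinkQueue.unVisitedEmpty());    // false: exactly one URL queued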

Next, we need a class that downloads a page and saves it to disk:

import java.io.DataOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

import org.apache.commons.httpclient.DefaultHttpMethodRetryHandler;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpException;
import org.apache.commons.httpclient.HttpStatus;
import org.apache.commons.httpclient.methods.GetMethod;
import org.apache.commons.httpclient.params.HttpMethodParams;

public class DownLoadFile {

    // Derive a local file name from the URL and the response Content-Type.
    public String getFileNameByUrl(String url, String contentType) {
        // Strip the leading "http://"
        url = url.substring(7);
        // text/html pages get a .html suffix
        if (contentType.indexOf("html") != -1) {
            return url.replaceAll("[\\?/:*|<>\"]", "_") + ".html";
        }
        // Other types, e.g. application/pdf, use the MIME subtype as the suffix
        else {
            return url.replaceAll("[\\?/:*|<>\"]", "_") + "."
                    + contentType.substring(contentType.lastIndexOf("/") + 1);
        }
    }

    /*
     * Save the page's byte array to a local file; filePath is the
     * relative path of the file to save.
     */
    private void saveToLocal(byte[] data, String filePath) {
        try {
            DataOutputStream out = new DataOutputStream(
                    new FileOutputStream(new File(filePath)));
            out.write(data);
            out.flush();
            out.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    // Download the page the URL points to; returns the saved file's path.
    public String downloadFile(String url) {
        String filePath = null;
        // Create the HttpClient instance and set its parameters
        HttpClient httpClient = new HttpClient();
        // 5-second connection timeout
        httpClient.getHttpConnectionManager().getParams()
                .setConnectionTimeout(5000);
        // Create the GET method and set its parameters
        GetMethod getMethod = new GetMethod(url);
        // 5-second socket read timeout
        getMethod.getParams().setParameter(HttpMethodParams.SO_TIMEOUT, 5000);
        // Retry failed requests with the default retry handler
        getMethod.getParams().setParameter(HttpMethodParams.RETRY_HANDLER,
                new DefaultHttpMethodRetryHandler());
        try {
            int statusCode = httpClient.executeMethod(getMethod);
            if (statusCode != HttpStatus.SC_OK) {
                System.out.println("Method failed: " + getMethod.getStatusLine());
                return null;
            }
            // Read the response body as a byte array
            byte[] responseBody = getMethod.getResponseBody();
            // Generate the file name to save under from the URL
            filePath = "temp\\" + getFileNameByUrl(url,
                    getMethod.getResponseHeader("Content-Type").getValue());
            saveToLocal(responseBody, filePath);
        } catch (HttpException e) {
            System.out.println("Please check the URL you provided!");
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            getMethod.releaseConnection();
        }
        return filePath;
    }
}
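Assuming the Commons HttpClient 3.x jars are on the classpath and a temp directory already exists under the working directory (the code writes under temp\\ with a Windows-style separator and does not create it), usage looks like this (illustrative URL, inside any main method). For a page such as http://www.baidu.com/index.html served as text/html, getFileNameByUrl produces www.baidu.com_index.html.html:

DownLoadFile downloader = new DownLoadFile();
String path = downloader.downloadFile("http://www.baidu.com");
if (path != null) {
    System.out.println("Page saved to " + path); // e.g. temp\www.baidu.com.html
}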

Next, we extract URLs from downloaded pages using HtmlParser, a well-known open-source Java library that can pull any content of interest out of an HTML page:

import java.util.HashSet;
import java.util.Set;

import org.htmlparser.Node;
import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.filters.OrFilter;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

public class HtmlParserTool {

    // Extract from the page at `url` every link accepted by `filter`.
    public static Set<String> extractLinks(String url, LinkFilter filter) {
        Set<String> links = new HashSet<String>();
        try {
            Parser parser = new Parser(url);
            parser.setEncoding("gb2312");
            // Filter matching <frame> tags, used to pull the link out of
            // their src attribute
            NodeFilter frameFilter = new NodeFilter() {
                public boolean accept(Node node) {
                    return node.getText().startsWith("frame src=");
                }
            };
            // OrFilter matches either <a> tags or <frame> tags
            OrFilter linkFilter = new OrFilter(
                    new NodeClassFilter(LinkTag.class), frameFilter);
            // Collect every tag that passes the filter
            NodeList list = parser.extractAllNodesThatMatch(linkFilter);
            for (int i = 0; i < list.size(); i++) {
                Node tag = list.elementAt(i);
                if (tag instanceof LinkTag) { // <a> tag
                    LinkTag link = (LinkTag) tag;
                    String linkUrl = link.getLink(); // the href URL
                    if (filter.accept(linkUrl))
                        links.add(linkUrl);
                } else { // <frame> tag
                    // Extract the link from the src attribute,
                    // e.g. <frame src="test.html"/>
                    String frame = tag.getText();
                    int start = frame.indexOf("src=");
                    frame = frame.substring(start);
                    int end = frame.indexOf(" ");
                    if (end == -1)
                        end = frame.indexOf(">");
                    String frameUrl = frame.substring(5, end - 1);
                    if (filter.accept(frameUrl))
                        links.add(frameUrl);
                }
            }
        } catch (ParserException e) {
            e.printStackTrace();
        }
        return links;
    }
}
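HtmlParserTool can be tried on its own by passing an anonymous LinkFilter (illustrative URL, inside any main method):

Set<String> links = HtmlParserTool.extractLinks("http://www.baidu.com",
        new LinkFilter() {
            public boolean accept(String url) {
                return url.startsWith("http://www.baidu.com");
            }
        });
for (String link : links) {
    System.out.println(link);
}

Note that the page encoding is hard-coded to gb2312, so pages in other encodings may come back garbled; a more robust crawler would take the charset from the response's Content-Type header instead.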

Finally, the main program that performs the breadth-first crawl:

import java.util.Scanner;
import java.util.Set;

public class MyCrawler {

    // Seed the unvisited queue with the initial URLs.
    private void initCrawlerWithSeeds(String[] seeds) {
        for (int i = 0; i < seeds.length; i++)
            LinkQueue.addUnvisitedUrl(seeds[i]);
    }

    public void crawling(String[] seeds) {
        // Filter that keeps only links starting with the prefixes below
        // (links starting with http://www.baidu.com, for example)
        LinkFilter filter = new LinkFilter() {
            public boolean accept(String url) {
                return url.startsWith("http://www.baidu.com")
                        || url.startsWith("http://www.sina.com")
                        || url.startsWith("http://www.google.com");
            }
        };
        // Initialize the URL queue with the seeds
        initCrawlerWithSeeds(seeds);
        // Loop while unvisited links remain and no more than 2000 pages
        // have been visited
        while (!LinkQueue.unVisitedEmpty()
                && LinkQueue.getVisitedURLNum() <= 2000) {
            String visitUrl = (String) LinkQueue.unVisitedUrlDeQueue();
            if (visitUrl == null) {
                continue;
            }
            DownLoadFile downloader = new DownLoadFile();
            // Download the page
            downloader.downloadFile(visitUrl);
            LinkQueue.addVisitedUrl(visitUrl);
            // Extract the page's links and enqueue any that are unseen
            Set<String> links = HtmlParserTool.extractLinks(visitUrl, filter);
            for (String link : links) {
                LinkQueue.addUnvisitedUrl(link);
                System.out.println(link);
            }
        }
    }

    public static void main(String[] args) {
        MyCrawler crawler = new MyCrawler();
        // Read a single seed URL from standard input
        System.out.println("Input the seed URL:");
        Scanner in = new Scanner(System.in);
        String[] seedUrl = new String[] { in.next() };
        System.out.println("Spider Program started!");
        crawler.crawling(seedUrl);
        System.out.println("Spider Program Ended!");
    }
}

Note: the main program uses a LinkFilter interface, implemented above as an anonymous inner class, to filter the extracted URLs so that only links related to the specified seed sites are kept. Its definition:

public interface LinkFilter {

    // Return true to keep the given URL, false to discard it.
    public boolean accept(String url);
}
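To compile and run the whole program, the HtmlParser jar and the Commons HttpClient 3.x jar (plus HttpClient's commons-logging and commons-codec dependencies) must be on the classpath; both libraries are old but still available from their project sites and Maven Central.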

 
