Design and Implementation of a Search Engine

Source: Internet. Published by: 淘宝商家资质中心. Editor: 程序博客网. Time: 2024/04/30 15:12

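I. Project Background

Faced with the vast resources of the Web, search engines give every user an entry point. It is no exaggeration to say that a user can start from a search and reach any place on the Web. As a result, search has become the most widely used online service after e-mail.

The development of search engine technology alongside the WWW has been remarkable. Search engines have gone through roughly three generations: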

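The first generation of search engines appeared in 1994. These engines generally indexed fewer than 1,000,000 web pages, and they rarely re-crawled pages or refreshed their indexes. Retrieval was also very slow, typically taking 10 seconds or longer. Technically, they largely reused mature techniques from IR (Information Retrieval), networking, and databases, so they amounted to a WWW application built from existing technology. In March and April 1994, the web crawler World Wide Web Worm (WWWW) handled an average of about 1,500 queries per day.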

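Second-generation search engines, which appeared around 1996, mostly adopted a distributed design (multiple microcomputers working together) to increase data scale, response speed, and the number of users served. They generally maintained an index database of about 50,000,000 pages and could answer 10,000,000 user queries per day. In November 1997, the most advanced search engines of the time claimed to index from 2,000,000 to 100,000,000 web pages. AltaVista claimed that it handled about 20,000,000 queries per day.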

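At the Search Engine 2000 conference in 2000, Larry Page, president of Google, said in his talk that Google was using 3,000 PCs running Linux to collect pages from the Web, and was adding machines to this cluster at a rate of 30 per day to keep pace with the growth of the Web. With each machine running multiple crawler programs, the peak collection rate was 100 pages per second and the average was 48.5 pages per second, which comes to more than 4,000,000 pages per day.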

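The term "search engine" is widely used in Internet circles both in China and abroad, but its meaning varies. In the United States, it usually refers to Internet-based search engines, which use web robot programs to collect tens of millions to hundreds of millions of pages and index every word in them; this is what we call full-text retrieval. Well-known Internet search engines include First Search, Google, and HotBot. In China, the term usually refers to search services based on website directories or search services for a specific site. This article studies Internet-based search technology.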

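II. Project Design

A search engine takes a user's query and, following some algorithm, looks up information in its index data and returns it to the user. To keep results precise and fresh, the search engine must build and maintain a large index database. A typical search engine consists of a web robot program, an indexing and search program, an index database, and related parts.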
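To make the idea of "index data" concrete, here is a minimal sketch of an in-memory inverted index, the kind of structure a full-text index is built around. It is only an illustration: the class name InvertedIndexSketch, its methods, and the sample documents are assumptions made for this example, and the project itself relies on Lucene for this job.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical illustration, not the project's code: a tiny inverted index.
class InvertedIndexSketch {
    // term -> sorted set of ids of the documents that contain the term
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();

    // Split the text on anything that is not a letter or digit and record each term.
    void addDocument(int docId, String text) {
        for (String term : text.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Return the ids of documents that contain every query term (AND semantics).
    List<Integer> search(String query) {
        TreeSet<Integer> result = null;
        for (String term : query.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
            if (term.isEmpty()) {
                continue;
            }
            TreeSet<Integer> docs = postings.get(term);
            if (docs == null) {
                return new ArrayList<>(); // one term is unknown, so nothing matches
            }
            if (result == null) {
                result = new TreeSet<>(docs);
            } else {
                result.retainAll(docs);
            }
        }
        List<Integer> out = new ArrayList<>();
        if (result != null) {
            out.addAll(result);
        }
        return out;
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.addDocument(0, "Lucene is a full-text search library");
        index.addDocument(1, "A spider downloads web pages");
        index.addDocument(2, "Search engines index web pages");
        System.out.println(index.search("web pages")); // prints [1, 2]
        System.out.println(index.search("search"));    // prints [0, 2]
    }
}
```

A query is answered by intersecting the posting sets of its terms, which is why lookups stay fast even when the document collection is large.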

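[Figure: System architecture. Components: WWW, documents, web robot program, building the Lucene index, searching the index database, Lucene index database, Tomcat server, JSP, WWW browser.]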

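A web robot, also called a "spider", is a powerful Web scanning program. While scanning a Web page, it extracts the hyperlinks in the page and adds them to a scan queue for later scanning. Because hyperlinks are used so widely on the Web, a spider program can in theory reach every page on the Web.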

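To ensure the breadth and depth of the robot's traversal, some important starting links must be chosen and a corresponding scanning strategy defined.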
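One common scanning strategy is a breadth-first traversal with a depth limit. The sketch below shows the idea on an in-memory link map instead of real pages. The class name DepthLimitedCrawlSketch, the page names, and the convention that the seed has depth 1 are assumptions for illustration (the depth convention mirrors the listing later in this article).

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Hypothetical illustration: breadth-first crawl order with a depth limit.
class DepthLimitedCrawlSketch {
    // links maps a page to the pages it links to; the seed page has depth 1.
    // Links are followed only from pages whose depth is below maxDepth.
    static List<String> crawl(Map<String, List<String>> links, String seed, int maxDepth) {
        Map<String, Integer> depth = new HashMap<>();
        Queue<String> waiting = new ArrayDeque<>();
        List<String> visited = new ArrayList<>();
        depth.put(seed, 1);
        waiting.add(seed);
        while (!waiting.isEmpty()) {
            String url = waiting.remove();
            visited.add(url);
            int d = depth.get(url);
            if (d >= maxDepth) {
                continue; // deep enough: keep the page but do not follow its links
            }
            List<String> outgoing = links.get(url);
            if (outgoing == null) {
                continue;
            }
            for (String next : outgoing) {
                if (!depth.containsKey(next)) { // skip URLs that were already discovered
                    depth.put(next, d + 1);
                    waiting.add(next);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = new HashMap<>();
        links.put("home", Arrays.asList("news", "about"));
        links.put("news", Arrays.asList("story"));
        links.put("about", Arrays.asList("home", "story"));
        links.put("story", Arrays.asList("archive"));
        System.out.println(crawl(links, "home", 2)); // prints [home, news, about]
    }
}
```

Raising the limit widens the crawl one level at a time, so the depth value is a simple knob for trading coverage against running time.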

At any given time a URL can be in only one queue; we call this the state of the URL.

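[Figure: URL queue state transitions. Labels: URL discovered, waiting queue, running queue, complete queue, error queue, URL finished.]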

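The figure above shows how the queues change. In this process, the Spider program starts running when a URL is added to the waiting queue. As long as there is a page in the waiting queue or the Spider is processing a page, the program keeps working. When the waiting queue is empty and no page is currently being processed, the Spider stops.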
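The queue model can be sketched as a small state tracker in which each URL has exactly one state at a time. This is a simplified illustration under my own assumptions: the class UrlStateSketch and its method names are hypothetical, and it is single-threaded for clarity.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Hypothetical illustration: every URL is in exactly one of four states.
class UrlStateSketch {
    enum State { WAITING, RUNNING, COMPLETE, ERROR }

    private final Map<String, State> states = new HashMap<>();
    private final Queue<String> waiting = new ArrayDeque<>();
    private int running = 0;

    // A newly discovered URL enters the waiting queue; known URLs are ignored.
    boolean discover(String url) {
        if (states.containsKey(url)) {
            return false;
        }
        states.put(url, State.WAITING);
        waiting.add(url);
        return true;
    }

    // Move the next waiting URL to the running state; returns null if none is waiting.
    String start() {
        String url = waiting.poll();
        if (url != null) {
            states.put(url, State.RUNNING);
            running++;
        }
        return url;
    }

    // Move a running URL to the complete state or, on failure, to the error state.
    void finish(String url, boolean ok) {
        if (states.get(url) != State.RUNNING) {
            throw new IllegalStateException(url + " is not running");
        }
        states.put(url, ok ? State.COMPLETE : State.ERROR);
        running--;
    }

    // The spider stops when nothing is waiting and nothing is being processed.
    boolean isDone() {
        return waiting.isEmpty() && running == 0;
    }

    State stateOf(String url) {
        return states.get(url);
    }
}
```

Checking both conditions in isDone() matters: an empty waiting queue alone is not enough, because a page that is still being processed may yet contribute new links.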

Before building the Spider program, I first look at how its parts work together and how the program can be extended. The flowchart is as follows:

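[Flowchart: Spider workflow. Boxes: add the URL to the waiting queue; download the page taken from the waiting queue and move it to the running queue; move the page to the complete queue and continue; examine the next hyperlink on the page; report other link type; report external link; report page link; add the link to the waiting queue; Spider finished. Decisions (yes/no): Is there a URL in the waiting queue? Does this page contain other hyperlinks? Does the link point to the Web? Is the link on a different host than the page while only local links are processed?]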
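The per-link decisions in the flowchart (Web link or not, external or local) can be sketched as a small classifier. This is an illustration under stated assumptions: the class LinkClassifierSketch, the Kind enum, and the localOnly flag are hypothetical names, and "Web link" is taken here to mean an http or https URL.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical illustration of the link decisions in the flowchart.
class LinkClassifierSketch {
    enum Kind { OTHER, EXTERNAL, PAGE }

    // pageUrl is the page being scanned, href is a link found on it.
    // With localOnly set, links to a different host are reported as EXTERNAL.
    static Kind classify(String pageUrl, String href, boolean localOnly) {
        try {
            URI page = new URI(pageUrl);
            URI link = page.resolve(href); // turns relative links into absolute ones
            String scheme = link.getScheme();
            boolean isWeb = scheme != null
                    && (scheme.equalsIgnoreCase("http") || scheme.equalsIgnoreCase("https"));
            if (!isWeb) {
                return Kind.OTHER; // for example mailto: or ftp: links
            }
            String pageHost = page.getHost();
            boolean sameHost = pageHost != null && pageHost.equalsIgnoreCase(link.getHost());
            if (localOnly && !sameHost) {
                return Kind.EXTERNAL;
            }
            return Kind.PAGE; // a page link: the caller would add it to the waiting queue
        } catch (URISyntaxException | IllegalArgumentException e) {
            return Kind.OTHER; // links that cannot be parsed are not followed
        }
    }

    public static void main(String[] args) {
        String page = "http://www.example.com/index.html";
        System.out.println(classify(page, "news/today.html", true));
        System.out.println(classify(page, "http://other.example.org/", true));
        System.out.println(classify(page, "mailto:someone@example.com", true));
    }
}
```

Only links classified as PAGE would be added to the waiting queue; the other two kinds are merely reported.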

Development tools, platform, and resources:

① Eclipse—J2EE 3.0

② Sun JDK 1.6.7

③ Jakarta Tomcat 6.2.0

④ Jakarta Lucene

⑤ Package Htmlparser

III. Project Implementation

Module implementation:

Figure 1: Spider. Figure 2: Server deployment.

IV. Results

V. Source Code

1. Spider class

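import java.io.BufferedReader;
import java.io.File;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.net.URL;
import java.net.URLConnection;
import java.util.ArrayList;
import java.util.Hashtable;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.htmlparser.beans.StringBean;

public class webspider {
    private int webDepth = 2; // crawl depth
    private int ThreadNum = 10; // number of worker threads
    private String startUrl = ""; // home page address
    private String fPath = "C://web"; // directory where page files are stored
    private ArrayList<String> nodUrls = new ArrayList<String>(); // URLs not yet processed
    private ArrayList<String> arrUrls = new ArrayList<String>(); // URLs already processed
    private Hashtable<String,Integer> allUrls = new Hashtable<String,Integer>(); // page number of every URL
    private Hashtable<String,Integer> deepUrls = new Hashtable<String,Integer>(); // depth of every URL
    private int intWebIndex = 0; // file index of the page, starting from 0

    public webspider(String startUrl, int webDepth)
    {
        this.startUrl = startUrl;
        this.webDepth = webDepth; // initialize the depth
    }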

    // Take the next unprocessed URL from the head of the list
    public synchronized String getNodUrl()
    {
        return nodUrls.remove(0);
    }

    // Take the next processed URL from the head of the list
    public synchronized String getArrUrl()
    {
        return arrUrls.remove(0);
    }

    // Entry point
    public static void main(String[] args)
    {
        webspider gw = new webspider("http://www.gougou.com", 2);
        gw.getWebByHomePage();
    }

    public void getWebByHomePage()
    {
        System.out.println("Homepage = " + startUrl); // print the home page address
        nodUrls.add(startUrl); // add the home page to the unprocessed URL list
        arrUrls.add(startUrl); // add the home page to the processed URL list
        allUrls.put(startUrl, 0); // record the home page and its page number
        deepUrls.put(startUrl, 1); // record the home page and its depth
        File fDir = new File(fPath);
        if (!fDir.exists())
        {
            fDir.mkdir(); // create the output directory
        }
        System.out.println("Start!");
        this.downloadPage(getNodUrl());
        for (int i = 0; i < ThreadNum; i++)
        {
            new Thread(new Processer(this)).start();
        }
        while (true)
        {
            // Finished when nothing is left to process and only the main thread is alive
            if (nodUrls.isEmpty() && Thread.activeCount() == 1)
            {
                String strIndex = "";
                String tmpUrl = "";
                while (!arrUrls.isEmpty())
                {
                    tmpUrl = getArrUrl();
                    strIndex += "Web depth:" + deepUrls.get(tmpUrl) + " Filepath: " + fPath + "/" + URLtoFileName(tmpUrl) + " url:" + tmpUrl + "\n\n";
                }
                try
                {
                    PrintWriter pwIndex = new PrintWriter(new FileOutputStream("fileindex.txt"));
                    pwIndex.println(strIndex); // write the index to the file
                    pwIndex.close();
                }
                catch (Exception e)
                {
                    System.out.println("Failed to generate the index file!");
                }
                break;
            }
        }
    }

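    // Derive a file name from a URL
    private String URLtoFileName(String url) {
        // replace() treats its arguments literally; with replaceAll() the "." in ".html" would match any character
        String filename = url.replace("http://", "");
        filename = filename.replace(".html", "");
        filename = filename.replace("/", "#");
        filename = filename + ".txt";
        return filename;
    }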

    public void downloadPage(String strUrl)
    {
        String fileName = URLtoFileName(strUrl);
        String filePath = fPath + "/" + fileName; // path and name of the file to save
        try
        {
            System.out.println("Getting web by url: " + strUrl);

            // Extract the page text once. Calling StringBean for every line read,
            // as the earlier version did, fetches the whole page again each time.
            StringBean sBean = new StringBean();
            sBean.setLinks(false); // whether to include the page's links in the text
            // Set the next two options to true for tidy output; set them to false
            // to keep the original layout, such as the indentation of code pages
            sBean.setCollapse(true); // collapse a run of whitespace into a single character
            sBean.setReplaceNonBreakingSpaces(true); // replace non-breaking spaces with regular spaces
            sBean.setURL(strUrl);

            PrintWriter pw = new PrintWriter(new OutputStreamWriter(new FileOutputStream(filePath)));
            pw.println(sBean.getStrings()); // write the extracted text to the file at filePath
            pw.close();

            // Follow links only while this URL's depth is below the configured webDepth
            if (deepUrls.get(strUrl) < webDepth)
            {
                InputStream is = new URL(strUrl).openStream(); // raw HTML, read line by line
                BufferedReader bReader = new BufferedReader(new InputStreamReader(is));
                String rLine = null;
                while ((rLine = bReader.readLine()) != null)
                {
                    if (rLine.length() > 0)
                        getUrlByString(rLine, strUrl);
                }
                bReader.close();
            }
            System.out.println("Get web successfully! " + strUrl);
        }
        catch (Exception e)
        {
            System.out.println("Get web failed! " + strUrl);
        }
    }

    // Extract links from one line of HTML and queue those not seen before.
    // synchronized because several worker threads share the URL collections.
    public synchronized void getUrlByString(String url, String strUrl)
    {
        // Match absolute http:// links inside href attributes
        String regUrl = "(?<=href=[\"']?)http://[^\\s\"'>]*";
        Pattern p = Pattern.compile(regUrl, Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(url);
        while (m.find())
        {
            String found = m.group(0);
            if (!allUrls.containsKey(found))
            {
                System.out.println("Find a new url,depth:" + (deepUrls.get(strUrl) + 1) + " " + found);
                nodUrls.add(found);
                arrUrls.add(found);
                allUrls.put(found, ++intWebIndex); // record the page in the page-number table
                deepUrls.put(found, deepUrls.get(strUrl) + 1); // record the page in the depth table
            }
        }
    }

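    // Worker that keeps downloading pages while unprocessed URLs remain
    class Processer implements Runnable
    {
        webspider spider;

        public Processer(webspider spider)
        {
            this.spider = spider;
        }

        public void run()
        {
            while (true)
            {
                String tmp;
                // Check and take under one lock so two threads cannot race for the last URL
                synchronized (spider)
                {
                    if (spider.nodUrls.isEmpty())
                        break;
                    tmp = spider.getNodUrl();
                }
                spider.downloadPage(tmp);
            }
        }
    }
}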
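In the listing above, a worker exits as soon as the unprocessed list is momentarily empty, and the main thread polls Thread.activeCount() to decide when the crawl is over. One possible alternative, sketched here, is to count pending tasks explicitly and finish when the count reaches zero. The class ConcurrentCrawlSketch and the in-memory link map that stands in for downloading are assumptions for illustration only.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration: crawl termination by counting pending tasks.
class ConcurrentCrawlSketch {
    // links stands in for "download the page and extract its links".
    static SortedSet<String> crawl(Map<String, List<String>> links, String seed, int threads) {
        Set<String> seen = ConcurrentHashMap.newKeySet();
        AtomicInteger pending = new AtomicInteger(0);
        CountDownLatch finished = new CountDownLatch(1);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        seen.add(seed);
        submit(pool, links, seed, seen, pending, finished);
        try {
            finished.await(); // released when the last task completes
            pool.shutdown();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            pool.shutdownNow();
        }
        return new TreeSet<>(seen);
    }

    private static void submit(ExecutorService pool, Map<String, List<String>> links, String url,
                               Set<String> seen, AtomicInteger pending, CountDownLatch finished) {
        pending.incrementAndGet(); // count the task before it is queued
        pool.execute(() -> {
            try {
                List<String> outgoing = links.get(url);
                if (outgoing != null) {
                    for (String next : outgoing) {
                        if (seen.add(next)) { // add() is atomic, so each URL is scheduled once
                            submit(pool, links, next, seen, pending, finished);
                        }
                    }
                }
            } finally {
                // Child tasks were counted above, so zero means the whole crawl is done
                if (pending.decrementAndGet() == 0) {
                    finished.countDown();
                }
            }
        });
    }
}
```

Because each task registers its children before it decrements the counter, the count cannot reach zero while any page is still being processed, which avoids workers quitting early.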

2. Index deletion class

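import java.io.File;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class Indexdelete
{
    public static void main(String args[]) throws Exception
    {
        File indexdir = new File("D://Index"); // where the index is stored
        File dataDir = new File("E://web"); // where the indexed data files are stored
        File[] dataFiles = dataDir.listFiles();
        IndexReader ir = IndexReader.open(indexdir);
        for (int i = 0; i < dataFiles.length; i++)
        {
            // The "path" field holds the file name (see the index creation class),
            // so the term must use getName() rather than the absolute path
            Term term = new Term("path", dataFiles[i].getName());
            ir.deleteDocuments(term);
        }
        // Alternatives: delete by document number, or by a single path term
        //ir.deleteDocument(22);
        //ir.delete(new term("path","c://file_to_index/lucene.txt"));
        ir.close();

        // Reopen the existing index and optimize it after the deletions
        Analyzer luceneanalyzer = new StandardAnalyzer();
        IndexWriter indexwriter = new IndexWriter(indexdir, luceneanalyzer, false);
        indexwriter.optimize();
        System.out.println("Index entries deleted!");
        indexwriter.close();
    }
}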

3. Index creation class

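import java.io.File;
import java.io.FileReader;
import java.io.Reader;
import java.util.Date;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class TxtFileIndex {
    public static void main(String[] args) throws Exception {
        // Index location
        File indexDir = new File("D://Index");
        // Data location
        File dataDir = new File("E://web");
        // Analyzer that extracts index terms from the text
        Analyzer luceneAnalyzer = new StandardAnalyzer();
        // All files in the data directory
        File[] dataFiles = dataDir.listFiles();
        // IndexWriter adds documents to the index. The first argument is the path where
        // the index files are stored; the second is the analyzer used while indexing.
        // The last argument is a boolean: true creates a new index (replacing any
        // existing one), false opens an existing index so it can be updated incrementally.
        IndexWriter indexWriter = new IndexWriter(indexDir, luceneAnalyzer, true);
        // Start time
        long startTime = new Date().getTime();
        // Loop over the files
        for (int i = 0; i < dataFiles.length; i++) {
            // Use the files actually present; the spider does not name them web0.txt, web1.txt, ...
            File file = dataFiles[i];
            // Only index documents with the .txt suffix
            if (file.isFile() && file.getName().endsWith(".txt")) {
                System.out.println("Indexing file " + i + " " + file.getCanonicalPath());
                // Create a new Document
                Document document = new Document();
                // Read the data
                Reader txtReader = new FileReader(file);
                /**
                 * The two fields are named "body" and "path". They hold the content and
                 * the path of the text file being indexed.
                 * The last line adds the prepared document to the index.
                 */
                // Add the path to the Document
                document.add(new Field("path", file.getName(), Field.Store.YES, Field.Index.UN_TOKENIZED));
                // Add the body text to the Document
                document.add(new Field("body", txtReader));
                // Add the document to the index
                indexWriter.addDocument(document);
            }
        }
        indexWriter.optimize();
        indexWriter.close(); // close the writer so the index is saved to disk
        long endTime = new Date().getTime();
        // Print the elapsed time
        System.out.println("It takes " + (endTime - startTime)
            + " milliseconds to create index for the files in directory "
            + dataDir.getPath());
    }
}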
