Building a campus search engine with MongoDB: content querying and ranking 1.0

Source: Internet. Published by: 锦衣卫知乎. Editor: 程序博客网. Time: 2024/05/16 14:52

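Goal:

Query the data that has already been stored. For example, if I search for "计科2015年研究生录取名单" (the 2015 graduate admission list of the computer science department), I want to get back a list of page links, and the first few pages must actually contain the content I am looking for.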

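Summary:

With the data already stored, the BM25 algorithm is used to compute the relevance between the query and each page. Applying BM25 in practice, the move from version 1.0 to version 2.0 greatly improved query speed, generally by about one order of magnitude.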

Implementation:

Version 1.0 and reflections:

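Following BM25, I first segment the query into words, then take the union of the links that those words map to, and finally compute the relevance of each link to the query.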
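For reference, the score that the add_score function in the listing further down computes can be written out as follows. This is read off the code, not a textbook statement of BM25: the IDF factor is the simplified log10(N / (n_i + 1)) that the code uses rather than the more common Robertson/Sparck Jones form, and the factor of 100 is just the code's scaling. Here N is the total number of pages, n_i is the number of pages containing word i ("count"), f_i is the frequency of word i in the page, dl is the page length and avdl is the average page length.

```latex
\mathrm{score}(Q, d) \;=\; 100 \sum_{i \in Q}
  \log_{10}\!\left(\frac{N}{n_i + 1}\right)
  \cdot \frac{(k_1 + 1)\, f_i}{K + f_i},
\qquad
K \;=\; k_1 \left( (1 - b) + b \cdot \frac{dl}{avdl} \right),
\qquad k_1 = 1.2,\; b = 0.45
```

A word that does not occur in a page has f_i = 0, so it contributes nothing to that page's score.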
The corresponding database structure is as follows:
[Figure: database document structure]

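"_id" stores the keyword. "url_list" stores, for every page that contains the keyword, the page link, the length of the page text, and the frequency of the word in that text. "count" stores the number of pages in which the word appears, that is, the number of elements in "url_list".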
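To make the layout concrete, one document in the collection might look like the Python dict below. The field names ("_id", "url_list", "url", "length", "frequency", "count") are the ones used in this post; the URLs and numbers are invented purely for illustration.

```python
# Hypothetical example of one document in the "key" collection.
# Field names follow the post; the URLs and numbers are made up.
sample_doc = {
    "_id": "研究生",  # the keyword itself
    "url_list": [
        # one entry per page that contains the keyword
        {"url": "http://example.edu/page1", "length": 210, "frequency": 4},
        {"url": "http://example.edu/page2", "length": 95, "frequency": 1},
    ],
    "count": 2,  # number of pages containing the keyword
}

# By construction, "count" equals the number of entries in "url_list".
assert sample_doc["count"] == len(sample_doc["url_list"])
```

Because the keyword is the "_id", looking up one word returns everything needed to score every page that contains it.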
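All of the data comes from my university's computer science department, up to the present. In this version I did not create any index on the collection key. The state of the collection is as follows: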
[Figure: statistics of the key collection]
The code is as follows:

import math
import sys
import time
from functools import wraps

import jieba
import pymongo

sentence = sys.argv[1]
connection = pymongo.MongoClient("mongodb://localhost")
db = connection.nju
key = db.key


def fn_timer(function):
    # Timing decorator, used to see which step is worth optimizing
    @wraps(function)
    def function_timer(*args, **kwargs):
        t0 = time.time()
        result = function(*args, **kwargs)
        t1 = time.time()
        print("Total time running %s: %s seconds" % (function.__name__, str(t1 - t0)))
        return result
    return function_timer


@fn_timer
def cut(sentence):
    # Segment the user's query into words
    return jieba.lcut(sentence)


def add_score(N, ni, fi, dl, avdl):
    # Relevance of one word to one page, using BM25
    k1 = 1.2
    b = 0.45
    K = k1 * ((1 - b) + 1.0 * b * dl / avdl)
    score_first = math.log(N / (ni + 1.0), 10)
    score_second = (k1 + 1) / (K + fi) * fi
    return score_first * score_second * 100


@fn_timer
def calculate(list_ask, list_result):
    # Query the database for the needed parameters and call add_score
    list_url = []
    for words in list_ask:
        ask_get = key.find_one({"_id": words})  # fetch the urls for this word
        print(words)
        if ask_get is None:
            # Segmentation is imperfect and users may enter new words,
            # so print the miss to make it observable
            print(words, "  :None")
        else:
            for j in ask_get["url_list"]:  # needs optimization
                # Add the url unless it is already in the list
                if j["url"] not in list_url:
                    list_url.append(j["url"])
    for k in list_url:  # score each url against the query
        key_score = 0
        for words in list_ask:
            ask_get = key.find_one({"_id": words})
            if ask_get is not None:
                ni = ask_get["count"]
                fi = 0
                dl = 0
                for j in ask_get["url_list"]:
                    if j["url"] == k:
                        fi = j["frequency"]
                        dl = j["length"]
                        break
                key_score = key_score + add_score(N, ni, fi, dl, avdl)
        list_result.append({"url": k, "score": key_score})  # keep a structured result
    return list_result


def exchange(items, a, b):
    # Sorting: swap two entries
    items[a], items[b] = items[b], items[a]


def partition(items, lo, high):
    # Sorting: quicksort partition around items[lo], ascending by score
    j = high
    v = items[lo]["score"]
    i = lo + 1
    while True:
        while v >= items[i]["score"]:
            if i == j:
                break
            i += 1
        while items[j]["score"] >= v:
            if j == i:
                break
            j -= 1
        if i >= j:
            break
        exchange(items, i, j)
    if items[j]["score"] > v:
        exchange(items, lo, j - 1)
        return j - 1
    exchange(items, lo, j)
    return j


def insert_sort(items, lo, hi):
    # Sorting: insertion sort on items[lo..hi]
    i = lo
    while i < hi:
        j = i + 1
        while j > lo:
            if items[j]["score"] < items[j - 1]["score"]:
                exchange(items, j, j - 1)
            j -= 1
        i += 1


def quick_sort(items, lo, hi):
    # Strategy: quicksort for ranges longer than 10, insertion sort below that
    if hi < lo + 10:
        insert_sort(items, lo, hi)
    else:
        j = partition(items, lo, hi)
        quick_sort(items, lo, j - 1)
        quick_sort(items, j + 1, hi)


N = 2465      # total number of pages
avdl = 184.6  # average page length
list_ask = cut(sentence)
list_result = []
list_result = calculate(list_ask, list_result)
lo = 0
hi = len(list_result)
print(hi)
quick_sort(list_result, lo, hi - 1)
for i in list_result:
    print(i)
print(len(list_result))

Reflections:

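When too many results are returned, the time spent in the scoring step, calculate(list_ask, list_result), grows linearly with the number of results. In the most extreme test, a query returned more than 2000 results and took over 60 minutes, which is completely unacceptable. It also made the quicksort recurse too deeply, raising: RuntimeError: maximum recursion depth exceeded
Importing the jieba segmentation module takes about 6 seconds, and segmenting the user's query takes about another 2 seconds, which is a lot.
The results are not ideal. The more words the user enters, the more accurate they become, but there is no semantic processing at all; it is a purely probabilistic model.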
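One possible way to address the first point is sketched below. It is not a description of what version 2.0 actually does. Instead of looking every word up again for every URL, it walks each word's "url_list" once and accumulates scores per URL in a dict, then ranks with Python's built-in sorted, which does not recurse. The names bm25_term and rank, the in-memory index standing in for the MongoDB collection, and the sample URLs and numbers are all assumptions made for illustration. The scoring formula and constants are the same as in add_score.

```python
import math

K1, B = 1.2, 0.45  # same constants as add_score


def bm25_term(n_docs, ni, fi, dl, avdl):
    """Contribution of one word to one page (same form as add_score)."""
    k = K1 * ((1 - B) + B * dl / avdl)
    return math.log10(n_docs / (ni + 1.0)) * (K1 + 1) * fi / (k + fi) * 100


def rank(terms, index, n_docs, avdl, top_k=10):
    """Score term-at-a-time: one lookup per word, scores summed per URL."""
    scores = {}
    for term in terms:
        entry = index.get(term)  # stands in for key.find_one({"_id": term})
        if entry is None:
            continue  # unknown word: contributes nothing
        for p in entry["url_list"]:
            s = bm25_term(n_docs, entry["count"], p["frequency"], p["length"], avdl)
            scores[p["url"]] = scores.get(p["url"], 0.0) + s
    # Built-in sort, highest score first; no recursion involved.
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_k]


# Hypothetical in-memory index with invented URLs and numbers.
index = {
    "计科": {"count": 2, "url_list": [
        {"url": "http://example.edu/a", "length": 100, "frequency": 3},
        {"url": "http://example.edu/b", "length": 300, "frequency": 1},
    ]},
    "研究生": {"count": 1, "url_list": [
        {"url": "http://example.edu/a", "length": 100, "frequency": 2},
    ]},
}

results = rank(["计科", "研究生", "不存在"], index, n_docs=2465, avdl=184.6)
for url, score in results:
    print(url, round(score, 1))
```

Pages that lack a word get no entry for it, which matches the original behaviour where a frequency of 0 yields a score of 0. The work is proportional to the total length of the lists for the query words rather than to the number of URLs times the number of words times the list length.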

Recommended: Building a campus search engine with MongoDB: content querying and ranking 2.0

1 0