Word Matching

Source: Internet  Published by: ubuntu server  Editor: 程序博客网  Time: 2024/05/17 22:40

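Today I came across an interview question on word matching, so I decided to try it myself.
I found a file of common words online.
Then I thought about possible implementations: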

Linear scan
Binary search
Building a word tree (a trie)
I will also compare all of these against Python's built-in set

First, write a function that measures elapsed time

    import functools
    import time

    def time_clock(func):
        @functools.wraps(func)
        def _(*args, **kwargs):
            before = time.time()
            result = func(*args, **kwargs)
            print('duration = %s' % (time.time() - before))
            return result
        return _
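As an aside, the standard library's timeit module can take the same kind of measurement over repeated runs, which smooths out noise when a single call is very short. A minimal sketch, assuming a hypothetical helper named time_call and a toy workload, neither of which is part of the original code:

```python
import timeit

def time_call(func, *args, **kwargs):
    """Return the best per-call time in seconds over several repeated runs.

    time_call is a hypothetical helper used only for illustration.
    """
    number = 1000  # calls per measurement
    # timeit.repeat calls the function `number` times per run, for 3 runs,
    # and returns the total time of each run.
    runs = timeit.repeat(lambda: func(*args, **kwargs), number=number, repeat=3)
    return min(runs) / number

sample = ['apple', 'banana', 'cherry']  # toy data, not the real word file
print('duration = %s' % time_call(lambda w: w in sample, 'cherry'))
```

Taking the minimum of several runs is a common convention, since slower runs are usually caused by other activity on the machine rather than by the code being timed.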

Next, read the file into a list

    # A list (not a lazy map object) is needed so that words.sort() works later
    with open('words.txt') as f:
        words = [line.strip('\n').lower() for line in f]

A single lookup is not representative, so we look up every element of the list and time the whole run

    import sys

    @time_clock
    def match_all(func):
        sys.stdout.write('func_name:%s  ' % func.__name__.ljust(18))
        for word in words:
            if not func(word):
                print(word)
                raise Exception("don't match")

Linear scan

    def match_word_for(word):
        for w in words:
            if w == word:
                return True
        return False

The simplest, most shameless approach, and its efficiency is appallingly low.

Binary search

    def match_word_middle(word):
        start = 0
        end = len(words) - 1
        while start <= end:
            # Floor division keeps the index an integer
            middle_index = (start + end) // 2
            middle = words[middle_index]
            if word == middle:
                return True
            elif word < middle:
                end = middle_index - 1
            else:
                start = middle_index + 1
        return False

Binary search requires the list to be sorted first

    words.sort()
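For comparison, the same lookup can be written with the standard library's bisect module. This is a minimal sketch rather than the code timed in this post; the name match_word_bisect is made up here, and the list is passed in as a parameter instead of being read from the global words:

```python
import bisect

def match_word_bisect(word, sorted_words):
    """Binary search via the standard library; sorted_words must be sorted."""
    # bisect_left returns the leftmost index at which `word` could be
    # inserted while keeping the list sorted.
    i = bisect.bisect_left(sorted_words, word)
    return i < len(sorted_words) and sorted_words[i] == word

sample = sorted(['pear', 'apple', 'fig'])  # toy data for illustration
print(match_word_bisect('fig', sample))    # True
print(match_word_bisect('kiwi', sample))   # False
```

The bounds check matters: for a word larger than every entry, bisect_left returns len(sorted_words), which is not a valid index.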

Word tree

First, build the word tree

    class Char_node(object):
        def __init__(self, char=0):
            self.char = char
            self.children = {}

        def find_char(self, char):
            return self.children.get(char, None)

        def add_char(self, char):
            node = self.children.get(char, None)
            if not node:
                node = Char_node(char)
                self.children[char] = node
            return node

        def __repr__(self):
            return "<Char_node : %s>" % self.char

    tree_root = Char_node(0)

    def add_word(word):
        node = tree_root
        for char in word:
            node = node.add_char(char)
        # '$' marks the end of a complete word
        node.add_char('$')

    def build_char_tree(words):
        for word in words:
            add_word(word)

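Then comes the lookup in the word tree. Here I used Python's dict to hold each node's children; strictly speaking, I should have implemented that data structure myself.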

    def match_word_tree(word):
        node = tree_root
        for i in range(len(word)):
            node = node.find_char(word[i])
            if not node:
                return False
            if i == len(word)-1 and node.children.get('$'):
                return True
        return False

Let's compare their speeds

    if __name__ == '__main__':
        build_char_tree(words)
        words.sort()
        match_all(match_word_for)
        match_all(match_word_middle)
        match_all(match_word_tree)

    func_name:match_word_for      duration = 0.192780017853
    func_name:match_word_middle   duration = 0.0136761665344
    func_name:match_word_tree     duration = 0.0193219184875
    [Finished in 0.3s]

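The linear scan really is unbearably slow!

What surprised me is that binary search turned out to be even faster than the word tree.
A word-tree lookup takes only as many steps as the word has characters, while binary search takes O(log2 n) steps. My word list has 2000 entries, so in theory the word tree should be faster.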
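To put numbers on that comparison, the step counts can be worked out directly. A small sketch, assuming two hypothetical helpers (binary_search_steps and tree_steps) and a sample word chosen only for illustration:

```python
def binary_search_steps(n):
    """Worst-case loop iterations of a binary search over n sorted items.

    This equals floor(log2(n)) + 1, which for a positive integer n is
    exactly n.bit_length().
    """
    return n.bit_length()

def tree_steps(word):
    """Nodes visited in the word tree: one per character of the word."""
    return len(word)

print(binary_search_steps(2000))  # 11
print(tree_steps('python'))       # 6
```

The two counts are of the same order for a list this small, and they ignore what each step costs. One step in the word tree here is a Python method call plus a dict lookup, while one step of binary search is a string comparison, so similar step counts do not guarantee similar timings.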

Next, let's look at lookups using Python's built-in data structures

set
list

set

    words_set = set(words)

    def match_word_set(word):
        return word in words_set

list

    def match_word_list(word):
        return word in words

Now the full comparison

    if __name__ == '__main__':
        build_char_tree(words)
        words.sort()
        match_all(match_word_set)
        match_all(match_word_list)
        match_all(match_word_for)
        match_all(match_word_middle)
        match_all(match_word_tree)

    func_name:match_word_set      duration = 0.000779867172241
    func_name:match_word_list     duration = 0.0541019439697
    func_name:match_word_for      duration = 0.183423042297
    func_name:match_word_middle   duration = 0.0140800476074
    func_name:match_word_tree     duration = 0.0185759067535
    [Finished in 0.4s]

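set is outrageously fast; underneath it stores its elements in a hash table.
list is much slower, though still far faster than the hand-written loop, since "in" on a list is the same linear scan carried out inside the interpreter.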
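To illustrate why a hash table makes membership tests cheap, here is a toy version. It is only a sketch: ToyHashSet is a made-up class using fixed buckets with chaining, and CPython's real set is laid out differently and is far more refined.

```python
class ToyHashSet(object):
    """A toy fixed-size hash table with chaining, for illustration only."""

    def __init__(self, items, bucket_count=64):
        self.buckets = [[] for _ in range(bucket_count)]
        for item in items:
            self.add(item)

    def _bucket(self, item):
        # hash() maps the item to an integer; the modulo picks one bucket.
        return self.buckets[hash(item) % len(self.buckets)]

    def add(self, item):
        bucket = self._bucket(item)
        if item not in bucket:
            bucket.append(item)

    def __contains__(self, item):
        # Only one bucket is scanned instead of the whole collection.
        return item in self._bucket(item)

toy = ToyHashSet(['apple', 'banana', 'cherry'])
print('banana' in toy)  # True
print('durian' in toy)  # False
```

A lookup hashes the word once and then inspects a single bucket, so its cost depends on the bucket size rather than on the total number of words.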

Conclusion: to test whether an element is in a collection, use set. Its speed is not even in the same order of magnitude as list.

That said, I still think the word tree is a good choice when the word list is very, very large~

0 0