A Spelling Checker in Only 21 Lines

Source: Internet. Editor: 程序博客网. Time: 2024/05/16 12:14

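I'm Buzhi (布知), who loves life, study, and work: a mediocre master's student in Australia with beginner-level English and entry-level machine learning. In this series I'll share fun, beginner-friendly Python mini-apps of under 100 lines each, along with the algorithms behind them, so you can feel the small joys that programming brings.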

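Introduction to Python
Recommended resources
A 21-line spelling checker
- The math
- Code walkthrough
  - Generating candidates
  - Language model
  - Error model
Performance evaluation
References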

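Introduction to Python

Here I'll bluntly borrow the introduction written by Liao Xuefeng (廖雪峰):

Python is a scripting language.
Python programs are simple and easy to understand, easy to get started with, and easy to go deeper with.
Python's philosophy is simplicity and elegance: write code that is easy to read, and write as little of it as possible.
Python ships with a very complete standard library and also has a large number of third-party libraries, which greatly speeds up development.
Many large websites are built with Python, for example YouTube, Instagram, and Douban (豆瓣) in China. Many large companies, including Google and Yahoo, and even NASA, use Python heavily.
What Python is well suited for:
- Web applications, including websites and back-end services;
- Many small everyday tools, including scripted tasks for system administrators;
- Wrapping programs written in other languages to make them easier to use.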

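Recommended resources

This series won't cover basic syntax in much detail, because plenty of excellent resources already exist. Here are two recommendations:

Liao Xuefeng's official website (廖雪峰的官方网站)
China University MOOC (中国大学MOOC)

There are also many excellent blogs online; readers can search for them and study on their own.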

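A 21-line spelling checker

All right, let me introduce today's star: the spelling corrector written by the great Norvig. I picked this example for the first installment because it fully shows off
- the simplicity and elegance of Python code: complete functionality in 21 lines
- the enduring beauty of mathematical algorithms: Bayes' theorem (do study your math, folks, it really matters)

It is also a small application of machine learning.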

    import re
    from collections import Counter

    def words(text): return re.findall(r'\w+', text.lower())

    WORDS = Counter(words(open('big.txt').read()))

    def P(word, N=sum(WORDS.values())):
        "Probability of `word`."
        return WORDS[word] / N

    def correction(word):
        "Most probable spelling correction for word."
        return max(candidates(word), key=P)

    def candidates(word):
        "Generate possible spelling corrections for word."
        return (known([word]) or known(edits1(word)) or known(edits2(word)) or [word])

    def known(words):
        "The subset of `words` that appear in the dictionary of WORDS."
        return set(w for w in words if w in WORDS)

    def edits1(word):
        "All edits that are one edit away from `word`."
        letters    = 'abcdefghijklmnopqrstuvwxyz'
        splits     = [(word[:i], word[i:])    for i in range(len(word) + 1)]
        deletes    = [L + R[1:]               for L, R in splits if R]
        transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R)>1]
        replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
        inserts    = [L + c + R               for L, R in splits for c in letters]
        return set(deletes + transposes + replaces + inserts)

    def edits2(word):
        "All edits that are two edits away from `word`."
        return (e2 for e1 in edits1(word) for e2 in edits1(e1))

That is the complete code. Now let's see how it runs:

    >>> correction("englihs")
    'english'
    >>> correction("englsh")
    'english'
    >>> correction("engliish")
    'english'

As you can see, the code catches transposed, missing, and duplicated letters in a word.

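The math

Does the formula below look familiar?

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

It is Bayes' theorem, which you met in your university math courses.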

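Now let me write it a different way:

$$P(c \mid w) = \frac{P(w \mid c)\,P(c)}{P(w)}$$

Next, here is what the formula means.

w stands for the original input word, w(ord), and c stands for the corrected word, c(andidate).
P(c|w) is the probability that c is the intended correction given that the original word w was typed. This is the quantity we want.

The c that maximizes P(c|w) is the final correction.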

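Since P(w) is the same for every candidate c, it does not affect which c wins, so the formula simplifies to

$$P(c \mid w) \propto P(w \mid c)\,P(c)$$

The final goal is therefore

$$\operatorname*{argmax}_{c \in \text{candidates}} P(c)\,P(w \mid c)$$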
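To make the argmax concrete, here is a minimal sketch with made-up numbers. The dictionaries `prior` and `likelihood` and all of their values are hypothetical and only serve the illustration; they are not computed from big.txt.

```python
# Toy illustration of choosing the c that maximizes P(c) * P(w|c).
# Every number here is invented for the example.

# Hypothetical language model P(c): how common each candidate word is.
prior = {"the": 0.07, "thaw": 0.0001, "then": 0.005}

# Hypothetical error model P(w|c): chance of typing "thew" when c was meant.
likelihood = {"the": 0.001, "thaw": 0.01, "then": 0.002}

def score(c):
    """Unnormalized posterior P(c) * P(w|c); P(w) is identical for every c."""
    return prior[c] * likelihood[c]

# max() with a key function is exactly the argmax over the candidates.
best = max(prior, key=score)
print(best)
```

Even though "thaw" has the highest P(w|c) in this toy setup, the much larger prior of "the" wins once the two factors are multiplied.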

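Code walkthrough

    import re
    from collections import Counter

These lines import the libraries we need.

re: the regular expression (regex) library
collections.Counter: a dict subclass that counts how many times each element occurs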

Generating candidates

    def edits1(word):
        "All edits that are one edit away from `word`."
        letters    = 'abcdefghijklmnopqrstuvwxyz'
        splits     = [(word[:i], word[i:])    for i in range(len(word) + 1)]
        deletes    = [L + R[1:]               for L, R in splits if R]
        transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R)>1]
        replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
        inserts    = [L + c + R               for L, R in splits for c in letters]
        return set(deletes + transposes + replaces + inserts)

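The edits1 function enumerates the common kinds of spelling error: deletion (deletes), transposition (transposes), replacement (replaces), and insertion (inserts). The splits list is not an error type itself; it holds every way of cutting the word into a left part and a right part, and the four kinds of edit are built from those pairs. Note that every result here is at edit distance 1: a single swap of two adjacent letters, a single missing letter, a single replaced letter, or a single inserted letter. Finally, set() removes the duplicates.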
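To get a feel for how many strings edits1 produces, the sketch below counts each kind of edit separately. The helper name `edit_counts` is made up for this illustration; its body copies the list comprehensions from edits1. For a word of length n there are n deletions, n-1 transpositions, 26n replacements, and 26(n+1) insertions, so 54n+25 strings before duplicates are removed.

```python
def edit_counts(word):
    """Count each kind of single edit, before and after removing duplicates."""
    letters = 'abcdefghijklmnopqrstuvwxyz'
    # For 'cat' the splits are ('', 'cat'), ('c', 'at'), ('ca', 't'), ('cat', '').
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    raw = len(deletes) + len(transposes) + len(replaces) + len(inserts)
    unique = len(set(deletes + transposes + replaces + inserts))
    return len(deletes), len(transposes), len(replaces), len(inserts), raw, unique

counts = edit_counts('cat')
print(counts[:5])   # (3, 2, 78, 104, 187), i.e. 54 * 3 + 25 strings in total
print(counts[5])    # number of distinct strings, smaller than the raw count
```

The distinct count is lower than the raw count because different edits can produce the same string; for instance, replacing a letter with itself gives back the original word.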

    def edits2(word):
        "All edits that are two edits away from `word`."
        return (e2 for e1 in edits1(word) for e2 in edits1(e1))

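edits2 uses two nested for clauses inside a generator expression: it applies edits1 to every result of edits1, which yields the strings within edit distance 2 of the input.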
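One detail is easy to miss: because replacing a letter with itself is counted as a "replacement", a non-empty word is always a member of its own edits1 set, so everything one edit away also shows up among the edits2 results. A small sketch to check this, using 'teh' as an arbitrary example input:

```python
def edits1(word):
    "All edits that are one edit away from `word`."
    letters = 'abcdefghijklmnopqrstuvwxyz'
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    "All edits that are two edits away from `word` (a lazy generator)."
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))

one_away = edits1('teh')
two_away = set(edits2('teh'))   # materialize the generator to remove duplicates

print('the' in one_away)        # True: swap the adjacent letters 'e' and 'h'
print('then' in one_away)       # False: 'then' needs a swap plus an insertion
print('then' in two_away)       # True
print(one_away <= two_away)     # True: the distance-1 strings are included
```

Since edits2 returns a generator, nothing is computed until something iterates over it; known() in the full program does that iteration.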

    WORDS = Counter(words(open('big.txt').read()))

big.txt is a very large text corpus. Here we use Counter() to count how often each word occurs in it.

    >>> len(WORDS)
    32192
    >>> sum(WORDS.values())
    1115504
    >>> WORDS.most_common(20)
    [('the', 79808), ('of', 40024), ('and', 38311), ('to', 28765), ('in', 22020),
     ('a', 21124), ('that', 12512), ('he', 12401), ('was', 11410), ('it', 10681),
     ('his', 10034), ('is', 9773), ('with', 9739), ('as', 8064), ('i', 7679),
     ('had', 7383), ('for', 6938), ('at', 6789), ('by', 6735), ('on', 6639)]

This gives a rough picture of the statistics: there are 32,192 distinct words, which occur 1,115,504 times in total, and the most frequent one is "the".
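If you don't have big.txt at hand, the same counting can be tried on a tiny string. This is only a sketch: the `sample` sentence is made up, and `counts` stands in for the real WORDS.

```python
import re
from collections import Counter

def words(text):
    return re.findall(r'\w+', text.lower())

# A made-up miniature "corpus" standing in for big.txt.
sample = "The cat saw the dog. The dog saw nothing."
counts = Counter(words(sample))

print(len(counts))             # 5 distinct words
print(sum(counts.values()))    # 9 word occurrences in total
print(counts.most_common(1))   # [('the', 3)]
print(counts['unicorn'])       # 0: a Counter returns 0 for missing keys
```

The last line matters for the spelling checker: looking up a word that never appeared gives 0 instead of raising KeyError, which is why P() can be called on any string.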

    def known(words):
        "The subset of `words` that appear in the dictionary of WORDS."
        return set(w for w in words if w in WORDS)

The known function filters the generated strings down to real, known words by checking whether each one appears in WORDS.

    def candidates(word):
        "Generate possible spelling corrections for word."
        return (known([word]) or known(edits1(word)) or known(edits2(word)) or [word])

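The candidates() function then gives us the candidate corrections c. It does not merge the groups: the chain of `or` returns the first non-empty set in priority order (the word itself if it is known, then known words one edit away, then known words two edits away), and falls back to the original word when nothing is known.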
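The `or` chain in candidates() works because an empty set is falsy in Python, so each group is only consulted when the previous one came back empty. Below is a small sketch with a three-word toy vocabulary; the contents of this `WORDS` are hypothetical and chosen just so that each branch gets exercised.

```python
from collections import Counter

# Hypothetical toy vocabulary standing in for the counts from big.txt.
WORDS = Counter({'the': 5, 'then': 2, 'ten': 1})

def known(words):
    return set(w for w in words if w in WORDS)

def edits1(word):
    letters = 'abcdefghijklmnopqrstuvwxyz'
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))

def candidates(word):
    # Each `or` moves on only if the set to its left is empty.
    return known([word]) or known(edits1(word)) or known(edits2(word)) or [word]

print(candidates('the'))    # the word itself is known, nothing else is tried
print(candidates('teh'))    # known words one edit away: 'the' and 'ten'
print(candidates('teeh'))   # nothing one edit away, so distance 2 is used
print(candidates('zzzzz'))  # nothing known within two edits: ['zzzzz']
```

Note that 'teh' does not return 'then', although it is only two edits away: once a closer group is non-empty, the farther groups are never looked at.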

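Language model

    def words(text): return re.findall(r'\w+', text.lower())

The words function relies on a regular expression. re.findall() returns every non-overlapping piece of the text that matches the pattern, and the pattern \w+ is used to pull out all the words. \w+ matches a run of one or more word characters (letters, digits, and the underscore) and stops at the first character that is not one of these, such as a space or a punctuation mark. text.lower() converts all letters to lowercase first.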
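A quick sketch of what \w+ does and does not treat as part of a word. The sample sentence is invented for the illustration.

```python
import re

def words(text):
    return re.findall(r'\w+', text.lower())

tokens = words("Don't STOP-me now_ok 42!")
print(tokens)   # ['don', 't', 'stop', 'me', 'now_ok', '42']
```

The apostrophe and the hyphen split words apart, while the underscore and the digits are kept, so this tokenizer is deliberately crude. It is good enough for counting word frequencies in a large corpus.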

    def P(word, N=sum(WORDS.values())):
        "Probability of `word`."
        return WORDS[word] / N

    def correction(word):
        "Most probable spelling correction for word."
        return max(candidates(word), key=P)

The correction() function picks the candidate c for which P() is largest.
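One Python subtlety in P() is worth a small experiment: the default value N=sum(WORDS.values()) is evaluated once, when the function is defined, not on every call. That is what makes P() cheap, but it also means N goes stale if WORDS changes afterwards. A sketch with a hypothetical toy counter:

```python
from collections import Counter

# Hypothetical toy counts; the total is 10.
WORDS = Counter({'the': 6, 'of': 3, 'and': 1})

def P(word, N=sum(WORDS.values())):
    "Probability of `word`; N is fixed at definition time."
    return WORDS[word] / N

p_before = P('the')                        # 6 / 10
WORDS['the'] += 10                         # total is now 20, but N is still 10
p_after = P('the')                         # 16 / 10, not a valid probability
p_fresh = P('the', N=sum(WORDS.values()))  # 16 / 20 with an up-to-date N

# max() with key=P returns the argument with the highest probability.
best = max(['of', 'and', 'xyz'], key=P)
print(p_before, p_after, p_fresh, best)
```

In the original program WORDS is built once and never modified, so the cached N stays correct there.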

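Error model

In fact, after reading through the whole program, you'll find that the formula actually used is

$$\operatorname*{argmax}_{c \in \text{candidates}} P(c)$$

The P(w|c) term is never multiplied in. The reason is that this is a flawed, simplified model: it simply assumes that, among known words, one at edit distance 0 is always more probable than one at edit distance 1, and one at edit distance 1 is always more probable than one at edit distance 2. Within a single priority level, P(w|c) is treated as equal for every candidate, so the term is dropped.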
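The sketch below shows one consequence of this fixed priority, using a hypothetical two-word vocabulary in which 'thew' is rare and 'the' is a thousand times more frequent. The counts are invented for the illustration.

```python
from collections import Counter

# Hypothetical toy counts, not taken from big.txt.
WORDS = Counter({'the': 1000, 'thew': 1})

def P(word, N=sum(WORDS.values())):
    return WORDS[word] / N

def known(words):
    return set(w for w in words if w in WORDS)

def edits1(word):
    letters = 'abcdefghijklmnopqrstuvwxyz'
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))

def candidates(word):
    return known([word]) or known(edits1(word)) or known(edits2(word)) or [word]

def correction(word):
    return max(candidates(word), key=P)

rare_but_known = correction('thew')   # stays 'thew': distance 0 always wins
typo = correction('thw')              # both words are one edit away, P decides
print(rare_but_known, typo)
```

For 'thew' the far more frequent 'the' is never even considered, because a known word at distance 0 ends the search. For 'thw' both words sit at distance 1, so the word frequency decides and 'the' is returned.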

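Performance evaluation

Forgive me, I have an exam tomorrow, so I'll just show the author's own tests here instead of writing test cases myself.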

    def unit_tests():
        assert correction('speling') == 'spelling'              # insert
        assert correction('korrectud') == 'corrected'           # replace 2
        assert correction('bycycle') == 'bicycle'               # replace
        assert correction('inconvient') == 'inconvenient'       # insert 2
        assert correction('arrainged') == 'arranged'            # delete
        assert correction('peotry') =='poetry'                  # transpose
        assert correction('peotryy') =='poetry'                 # transpose + delete
        assert correction('word') == 'word'                     # known
        assert correction('quintessential') == 'quintessential' # unknown
        assert words('This is a TEST.') == ['this', 'is', 'a', 'test']
        assert Counter(words('This is a test. 123; A TEST this is.')) == (
               Counter({'123': 1, 'a': 2, 'is': 2, 'test': 2, 'this': 2}))
        assert len(WORDS) == 32192
        assert sum(WORDS.values()) == 1115504
        assert WORDS.most_common(10) == [
         ('the', 79808),
         ('of', 40024),
         ('and', 38311),
         ('to', 28765),
         ('in', 22020),
         ('a', 21124),
         ('that', 12512),
         ('he', 12401),
         ('was', 11410),
         ('it', 10681)]
        assert WORDS['the'] == 79808
        assert P('quintessential') == 0
        assert 0.07 < P('the') < 0.08
        return 'unit_tests pass'

    def spelltest(tests, verbose=False):
        "Run correction(wrong) on all (right, wrong) pairs; report results."
        import time
        # time.clock() was removed in Python 3.8; perf_counter() replaces it
        start = time.perf_counter()
        good, unknown = 0, 0
        n = len(tests)
        for right, wrong in tests:
            w = correction(wrong)
            good += (w == right)
            if w != right:
                unknown += (right not in WORDS)
                if verbose:
                    print('correction({}) => {} ({}); expected {} ({})'
                          .format(wrong, w, WORDS[w], right, WORDS[right]))
        dt = time.perf_counter() - start
        print('{:.0%} of {} correct ({:.0%} unknown) at {:.0f} words per second '
              .format(good / n, n, unknown / n, n / dt))

    def Testset(lines):
        "Parse 'right: wrong1 wrong2' lines into [('right', 'wrong1'), ('right', 'wrong2')] pairs."
        return [(right, wrong)
                for (right, wrongs) in (line.split(':') for line in lines)
                for wrong in wrongs.split()]

    print(unit_tests())
    spelltest(Testset(open('spell-testset1.txt'))) # Development set
    spelltest(Testset(open('spell-testset2.txt'))) # Final test set

Running the program above on the Birkbeck spelling error corpus, a corpus of spelling mistakes, gives an accuracy of about 70%.

    unit_tests pass
    75% of 270 correct at 41 words per second
    68% of 400 correct at 35 words per second
    None

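For such a simply built little application, this level of performance is already good enough. I won't go into why the misrecognitions happen here; for that, see the references.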
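The test files use a "right: wrong1 wrong2" line format, which Testset turns into (right, wrong) pairs. The sketch below feeds it two made-up lines instead of the real spell-testset files, and scores a stand-in corrector. The names `sample_lines` and `fake_fixes` are hypothetical and exist only so the accuracy arithmetic can be tried without big.txt.

```python
def Testset(lines):
    "Parse 'right: wrong1 wrong2' lines into (right, wrong) pairs."
    return [(right, wrong)
            for (right, wrongs) in (line.split(':') for line in lines)
            for wrong in wrongs.split()]

# Made-up lines in the same format as the test set files.
sample_lines = ["access: acess acces\n", "poetry: peotry\n"]
pairs = Testset(sample_lines)
print(pairs)

# Stand-in for correction(): a lookup table that knows only two fixes.
fake_fixes = {'acess': 'access', 'peotry': 'poetry'}
good = sum(fake_fixes.get(wrong, wrong) == right for right, wrong in pairs)
accuracy = good / len(pairs)
print('{:.0%} of {} correct'.format(accuracy, len(pairs)))
```

Each misspelling becomes its own test case, so a correct word with many recorded misspellings weighs more heavily in the final percentage.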

References

How to Write a Spelling Corrector
21行python代码实现拼写检查器 (A spelling checker in 21 lines of Python code)

That's all for this time. I'm Buzhi, who loves life, study, and work, and next time I'll show you how to make cool, flashy word clouds with Python.
