SpellChecker
Source: Internet. Editor: 程序博客网. Date: 2024/06/06 00:50
A spell checker suggests a list of words similar to a misspelled word. This implementation is based on David Spencer's code and uses the n-gram method together with the Levenshtein distance.
Structure of a dictionary index
A dictionary index containing all the possible words must first be created as a Lucene index. For 3-grams and 4-grams, the index has the following structure:
Index Structure    Example
word               kings
gram3              kin, ing, ngs
gram4              king, ings
start3             kin
start4             king
end3               ngs
end4               ings
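The field layout above can be reproduced with a few lines of plain Java. This is only a minimal sketch of how the gram, start and end values could be derived from a word; the class and method names (NGramFieldsSketch, grams, fields) are hypothetical and are not part of the SpellChecker API, and no Lucene classes are used.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene: derives the dictionary fields for one word.
final class NGramFieldsSketch {

    // All contiguous substrings of length n, in order of appearance.
    static List<String> grams(String word, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= word.length(); i++) {
            out.add(word.substring(i, i + n));
        }
        return out;
    }

    // Maps each field name (word, gramN, startN, endN) to its value(s)
    // for every gram size from minGram to maxGram.
    static Map<String, List<String>> fields(String word, int minGram, int maxGram) {
        Map<String, List<String>> doc = new LinkedHashMap<>();
        doc.put("word", Collections.singletonList(word));
        for (int n = minGram; n <= maxGram; n++) {
            List<String> g = grams(word, n);
            if (g.isEmpty()) {
                continue; // the word is shorter than n characters
            }
            doc.put("gram" + n, g);
            doc.put("start" + n, Collections.singletonList(g.get(0)));
            doc.put("end" + n, Collections.singletonList(g.get(g.size() - 1)));
        }
        return doc;
    }

    public static void main(String[] args) {
        // Prints one "field = values" line per field for the word "kings".
        fields("kings", 3, 4).forEach((name, values) -> System.out.println(name + " = " + values));
    }
}
```

For "kings" with gram sizes 3 to 4, this yields the same values as the table: gram3 is kin, ing, ngs and end4 is ings.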
Import: Adding Words to the Dictionary
Words can be added from a Lucene index (more precisely, from a set of Lucene fields) and from a text file containing a list of words.
- Example: adding all the keywords of a given Lucene field of an index.
SpellChecker spell = new SpellChecker(dictionaryDirectory);
spell.indexDictionary(new LuceneDictionary(my_luceneReader, my_fieldname));
Getting a List of Suggested Words
The suggestSimilar method returns a list of suggested words sorted by:
- the Levenshtein distance (the most similar word to the misspelled word is the first in the list).
- (optionally) the popularity of the word in a given Lucene Field.
Furthermore, that list can be restricted only to the words present in a given Lucene Field.
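To make the first sorting criterion concrete, here is a minimal sketch of the classic dynamic-programming Levenshtein distance in plain Java. It is a textbook version written for illustration, not the code used inside SpellChecker, and the class name LevenshteinSketch is hypothetical.

```java
// Hypothetical helper, not part of Lucene: textbook Levenshtein edit distance.
final class LevenshteinSketch {

    // Minimum number of single-character insertions, deletions and
    // substitutions needed to turn a into b.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1), // insertion, deletion
                        prev[j - 1] + cost);                    // substitution or match
            }
            int[] tmp = prev; // reuse the two rows instead of a full matrix
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("sevanty", "seventy")); // prints 1
    }
}
```

A smaller distance means a closer match, so "seventy" (one substitution away from "sevanty") would rank ahead of any word that is two or more edits away.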
- First example: the suggestSimilar(misspelled_word, num_list) method.
num_list is the maximum number of words returned. In this example the list is sorted by the Levenshtein distance only.
String[] l = spellChecker.suggestSimilar("sevanty", 2); // l[0] = "seventy"
- Second example: the suggestSimilar(misspelled_word, num_list, myIndexReader, myField, morePopular) method.
Note: if myIndexReader and myField are null, this method behaves the same as the first method.
- The returned words are restricted to the words present in the field myField of the Lucene index myIndexReader.
- The list is also sorted by a second criterion: the popularity (the frequency) of the word in the user field.
- If morePopular is true and the misspelled word exists in the user field, only the words more frequent than the misspelled word are returned.
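The restriction and the two sorting criteria described for the second method can be sketched in plain Java as follows. This is an illustration under stated assumptions, not the actual SpellChecker logic: the Candidate type, the rank method and its parameters are hypothetical, and the distances and frequencies are supplied by hand instead of being read from a Lucene index. It models only the case where a user field is given.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Lucene: filters and orders candidate words.
final class SuggestionRankingSketch {

    // A candidate word, its edit distance to the misspelled word,
    // and its frequency in the user field (0 means absent from the field).
    static final class Candidate {
        final String word;
        final int distance;
        final int freq;

        Candidate(String word, int distance, int freq) {
            this.word = word;
            this.distance = distance;
            this.freq = freq;
        }
    }

    // misspelledFreq is the frequency of the misspelled word in the user field.
    static List<String> rank(List<Candidate> candidates, int numSug,
                             int misspelledFreq, boolean morePopular) {
        List<Candidate> kept = new ArrayList<>();
        for (Candidate c : candidates) {
            if (c.freq <= 0) {
                continue; // keep only words present in the user field
            }
            if (morePopular && misspelledFreq > 0 && c.freq <= misspelledFreq) {
                continue; // keep only words more frequent than the misspelled word
            }
            kept.add(c);
        }
        // First criterion: smaller distance first. Second criterion: higher frequency first.
        kept.sort((x, y) -> x.distance != y.distance
                ? Integer.compare(x.distance, y.distance)
                : Integer.compare(y.freq, x.freq));
        List<String> out = new ArrayList<>();
        for (Candidate c : kept.subList(0, Math.min(numSug, kept.size()))) {
            out.add(c.word);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Candidate> candidates = Arrays.asList(
                new Candidate("servant", 2, 10),
                new Candidate("seventh", 2, 90),
                new Candidate("seventy", 1, 40));
        System.out.println(rank(candidates, 2, 0, false)); // prints [seventy, seventh]
        System.out.println(rank(candidates, 2, 50, true)); // prints [seventh]
    }
}
```

In the second call the misspelled word is assumed to occur 50 times in the user field, so only "seventh" (frequency 90) survives the morePopular filter.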
Changes
Version 1.1:
- sort fixed (the sort was inverted!)
- set the gram size dynamically (depending on the length of the word)
- use the FuzzyQuery score: ((edit distance)/(length of word))
- new Dictionary interface, with LuceneDictionary and PlaintextDictionary implementations
- replace the addWords method with indexDictionary(Dictionary dic)
- add a new public method: boolean exist(word)
- add a build.xml
Credits
- Maisonneuve Nicolas
- Spencer David