Lucene SpellChecker: Spelling Correction

Source: Internet | Editor: 程序博客网 | Date: 2024/05/16 11:41

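SpellChecker corrects the search text that a user enters. It is included in Lucene's suggest package.

Steps for implementing spell checking with Lucene

1. Build the index that the spellchecker needs

The spellchecker relies on a Lucene index of its own, which has to be built before any word can be checked.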

PlainTextDictionary

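import java.io.File;
import java.io.IOException;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.spell.PlainTextDictionary;
import org.apache.lucene.search.spell.SpellChecker;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

/**
 * Builds the index used by the spellchecker from a dictionary file.
 *
 * @param spellIndexPath path of the spellchecker index
 * @param idcFilePath    path of the source dictionary file
 * @throws IOException if the dictionary or the index cannot be read or written
 */
public void createSpellIndex(String spellIndexPath, String idcFilePath)
        throws IOException {
    Directory spellIndexDir = FSDirectory.open(new File(spellIndexPath));
    SpellChecker spellChecker = new SpellChecker(spellIndexDir);
    try {
        // The analyzer is left null; the last argument (false) skips merging the index afterwards.
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_35, null);
        spellChecker.indexDictionary(
                new PlainTextDictionary(new File(idcFilePath)), config, false);
    } finally {
        // Close the spellchecker before the directory it works on.
        spellChecker.close();
        spellIndexDir.close();
    }
}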

LuceneDictionary

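// In addition to the imports above:
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.spell.LuceneDictionary;

/**
 * Builds the spellchecker index from one field of an existing index.
 *
 * @param oriIndexPath   path of the existing (source) index
 * @param fieldName      field whose terms form the dictionary
 * @param spellIndexPath path of the spellchecker index
 * @throws IOException if either index cannot be read or written
 */
public void createSpellIndex(String oriIndexPath, String fieldName, String spellIndexPath)
        throws IOException {
    IndexReader oriIndex = IndexReader.open(FSDirectory.open(new File(oriIndexPath)));
    Directory spellIndexDir = FSDirectory.open(new File(spellIndexPath));
    SpellChecker spellChecker = new SpellChecker(spellIndexDir);
    try {
        LuceneDictionary dict = new LuceneDictionary(oriIndex, fieldName);
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_35, null);
        spellChecker.indexDictionary(dict, config, true);
    } finally {
        // Release the spellchecker, its directory, and the source index reader.
        spellChecker.close();
        spellIndexDir.close();
        oriIndex.close();
    }
}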

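The words used to build the index can come from several kinds of Dictionary:
1. PlainTextDictionary
One word per line.
2. FileDictionary
One entry per line, with its fields separated by tabs.
3. LuceneDictionary
Builds the index from the terms of an existing index.
4. HighFrequencyDictionary
Builds on LuceneDictionary, but the spellchecker uses a term only if the number of documents it appears in reaches a given threshold.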
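
To make the HighFrequencyDictionary idea concrete, here is a minimal sketch of the filtering step in plain Java. It is not Lucene's implementation. The class name HighFrequencyFilterSketch, the method frequentTerms, and the sample counts are hypothetical, and the sketch assumes the threshold is expressed as a fraction of all documents.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: shows the filtering idea, not Lucene's HighFrequencyDictionary.
class HighFrequencyFilterSketch {

    /**
     * Keeps the terms whose document frequency, taken as a fraction of all
     * documents, reaches the threshold (assumed to lie in [0, 1]).
     */
    static List<String> frequentTerms(Map<String, Integer> docFreqByTerm,
                                      int numDocs, float threshold) {
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, Integer> e : docFreqByTerm.entrySet()) {
            float fraction = (float) e.getValue() / numDocs;
            if (fraction >= threshold) {
                kept.add(e.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up document frequencies for an index of 10 documents.
        Map<String, Integer> docFreq = new LinkedHashMap<>();
        docFreq.put("lucene", 8);
        docFreq.put("lucenne", 1); // a rare misspelling that should be dropped
        docFreq.put("index", 5);
        System.out.println(frequentTerms(docFreq, 10, 0.5f));
    }
}
```

The point of the filter is that rare terms in an existing index are often typos themselves, so leaving them out keeps them from being offered as corrections.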

2. Checking spelling with the spellchecker

Using the index built in step 1, call spellChecker.suggestSimilar to check spelling.

Returning suggestions

Directory directory = FSDirectory.open(new File(spellcheckindexpath));
SpellChecker spellchecker = new SpellChecker(directory);
IndexReader oriIndex = IndexReader.open(FSDirectory.open(new File(oriIndexPath)));
LuceneDictionary dict = new LuceneDictionary(oriIndex, fieldName);
// Set the accuracy, i.e. the minimum similarity a suggestion must reach (the argument is a float)
spellchecker.setAccuracy(0.5f);
// suggestionNumber: maximum number of suggestions to return
String[] suggestion = spellchecker.suggestSimilar(queryString, suggestionNumber);
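
The roles of the accuracy setting and of suggestionNumber can be illustrated without Lucene. The following is a rough sketch, assuming suggestions are filtered by a minimum similarity score and then ranked from most to least similar. The names SuggestSketch, Similarity, PREFIX, and suggest are hypothetical, and the toy prefix-based score only stands in for a real similarity measure.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative only: mimics the effect of an accuracy threshold and a result limit.
class SuggestSketch {

    /** Hypothetical stand-in for a similarity measure returning a score in [0, 1]. */
    interface Similarity {
        float score(String a, String b);
    }

    /** Toy measure: length of the shared prefix divided by the longer word's length. */
    static final Similarity PREFIX = (a, b) -> {
        int max = Math.max(a.length(), b.length());
        if (max == 0) {
            return 1f;
        }
        int min = Math.min(a.length(), b.length());
        int i = 0;
        while (i < min && a.charAt(i) == b.charAt(i)) {
            i++;
        }
        return (float) i / max;
    };

    /** Returns at most maxSuggestions dictionary words scoring at least accuracy, best first. */
    static List<String> suggest(String word, List<String> dictionary, Similarity sim,
                                float accuracy, int maxSuggestions) {
        List<String> candidates = new ArrayList<>();
        for (String w : dictionary) {
            // Skip the query word itself and anything below the accuracy threshold.
            if (!w.equals(word) && sim.score(word, w) >= accuracy) {
                candidates.add(w);
            }
        }
        candidates.sort(Comparator.comparingDouble((String w) -> sim.score(word, w)).reversed());
        int n = Math.min(maxSuggestions, candidates.size());
        return new ArrayList<>(candidates.subList(0, n));
    }

    public static void main(String[] args) {
        List<String> dictionary = List.of("search", "searching", "index", "sea");
        System.out.println(suggest("searc", dictionary, PREFIX, 0.5f, 2));
    }
}
```

Raising the accuracy makes the result list shorter and stricter; lowering it admits more distant words, which the limit then trims to the best few.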

Checking whether queryString exists

Iterator<String> ite = dict.getWordsIterator();
while (ite.hasNext()) {
    if (ite.next().equals(queryString)) {
        return true;
    }
}
return false;

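Algorithms and principles

1. Similarity computation

In the spellchecker, the StringDistance interface expresses how similar two words are. It has three implementations: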

JaroWinklerDistance
LevensteinDistance
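Edit distance: the minimum number of edit operations needed to turn one string into the other. The permitted operations are replacing one character with another, inserting a character, and deleting a character.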
NGramDistance
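
The edit distance described under LevensteinDistance can be computed with a standard dynamic-programming table. Below is a minimal sketch in plain Java, not Lucene's own code. The names EditDistanceSketch, editDistance, and similarity are hypothetical, and normalizing the distance by the longer word's length to get a score between 0 and 1 is an assumption made for illustration.

```java
// Illustrative only: textbook edit distance, kept to two rows of the DP table.
class EditDistanceSketch {

    /** Minimum number of substitutions, insertions, and deletions turning a into b. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Turning the empty prefix of a into the first j characters of b takes j insertions.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // deleting the first i characters of a
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                int substitution = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insertion, deletion), substitution);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Assumed normalization: 1 means identical, 0 means nothing in common. */
    static float similarity(String a, String b) {
        int max = Math.max(a.length(), b.length());
        if (max == 0) {
            return 1f;
        }
        return 1f - (float) editDistance(a, b) / max;
    }

    public static void main(String[] args) {
        System.out.println(editDistance("kitten", "sitting")); // prints 3
        System.out.println(similarity("lucene", "lucen"));
    }
}
```

A score of this kind is what an accuracy threshold can be compared against: "lucen" is one edit away from the six-letter "lucene", so it scores well above 0.5.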

0 0