English Stemming Algorithm (Porter stemmer)
Source: Internet  Editor: 程序博客网  Date: 2024/04/20 11:06
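Preface
Recently I needed to process English text and reduce each word to its base form, for example turning "boys" into "boy".
Introduction
I found a good tool for this, the Porter stemmer. Its home page is http://tartarus.org/~martin/PorterStemmer/. It has been implemented in many languages, including C, Java, and Perl.
Here is a brief introduction to it: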
Stemming, in the parlance of searching and information retrieval, is the operation of stripping the suffixes from a word, leaving its stem. Google, for instance, uses stemming to search for web pages containing the words connected, connecting, connection and connections when you ask for a web page that contains the word connect.
There are basically two ways to implement stemming. The first approach is to create a big dictionary that maps words to their stems. The advantage of this approach is that it works perfectly (insofar as the stem of a word can be defined perfectly); the disadvantages are the space required by the dictionary and the investment required to maintain the dictionary as new words appear. The second approach is to use a set of rules that extract stems from words. The advantages of this approach are that the code is typically small, and it can gracefully handle new words; the disadvantage is that it occasionally makes mistakes. But, since stemming is imperfectly defined, anyway, occasional mistakes are tolerable, and the rule-based approach is the one that is generally chosen.
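The trade-off between the two approaches can be sketched in a few lines of Python. Both the tiny dictionary `STEM_DICT` and the suffix list `SUFFIXES` are made up for this illustration; they are not taken from any real stemmer, and the function names are my own.

```python
# Toy comparison of dictionary-based and rule-based stemming.
# STEM_DICT and SUFFIXES are illustrative assumptions, not real stemmer data.

STEM_DICT = {
    "connected": "connect",
    "connecting": "connect",
    "connection": "connect",
    "connections": "connect",
}

# Longer suffixes come first so that "ions" is tried before "s".
SUFFIXES = ["ions", "ing", "ion", "ed", "s"]


def dict_stem(word):
    """Dictionary approach: exact for known words, None for unknown ones."""
    return STEM_DICT.get(word.lower())


def rule_stem(word):
    """Rule approach: strip the first matching suffix, keeping >= 3 letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


if __name__ == "__main__":
    for w in ["connections", "selected", "string"]:
        print(w, dict_stem(w), rule_stem(w))
```

The dictionary knows nothing about "selected", while the rules handle it without any extra data. The same rules also turn "string" into "str", which is the kind of occasional mistake the passage above describes.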
In 1979, Martin Porter developed a stemming algorithm that, with minor modifications, is still in use today; it uses a set of rules to extract stems from words, and though it makes some mistakes, most common words seem to work out right. Porter describes his algorithm and provides a reference implementation in C at http://tartarus.org/~martin/PorterStemmer/index.html.
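To give a feel for what such rules look like, here is a sketch of only the first step of the published algorithm (Step 1a, which handles plural endings). All later steps are omitted, so this is not a complete stemmer. The function name `step_1a` is my own, and the input is assumed to be a lowercase word.

```python
def step_1a(word):
    """Apply only Step 1a of the Porter algorithm (plural endings).

    Assumes `word` is already lowercase. Later steps are not implemented.
    """
    # The reference implementations leave words of one or two letters alone.
    if len(word) <= 2:
        return word
    if word.endswith("sses"):
        return word[:-2]   # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]   # ponies -> poni
    if word.endswith("ss"):
        return word        # caress -> caress
    if word.endswith("s"):
        return word[:-1]   # cats -> cat
    return word


if __name__ == "__main__":
    for w in ["caresses", "ponies", "caress", "cats", "boys"]:
        print(w, "->", step_1a(w))
```

Note that "ponies" becomes "poni" rather than "pony": even this first step is not trying to produce a dictionary word.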
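I had tried this algorithm once before, but gave up on it for the following reason.
Given "create" and "created" as input, the result is "creat". That was a big disappointment: it does not restore the word to its original form at all!
This time I had no choice, because I really needed this feature. After a long search on Google I found that Lucene has an English analysis module, but it is too complex for a simple application like mine. Only later did I learn that Lucene actually uses this very same method.
So I bit the bullet and read through his home page, and a passage in the FAQ made everything click.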
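Example
Say I input "create" and "created"; the stemmer produces "creat" for both.
So all that is needed is to apply the same processing at query time! For a query such as "create created", the lookup in the database only has to search for "creat", since the stored text was stemmed the same way.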
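This idea can be sketched as a tiny inverted index that stems words both when indexing and when querying. `toy_stem` is a deliberately crude stand-in that I made up for the example. It is not the Porter algorithm; a real application would plug an actual Porter implementation in its place. The class name `StemmedIndex` is likewise hypothetical.

```python
from collections import defaultdict


def toy_stem(word):
    """Crude stand-in for a real stemmer (NOT the Porter algorithm)."""
    word = word.lower()
    for suffix in ("ed", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


class StemmedIndex:
    """Inverted index keyed by stems rather than by surface forms."""

    def __init__(self, stem=toy_stem):
        self.stem = stem
        self.postings = defaultdict(set)  # stem -> set of document ids

    def add(self, doc_id, text):
        """Stem every token of `text` and record `doc_id` under each stem."""
        for token in text.split():
            self.postings[self.stem(token)].add(doc_id)

    def search(self, query):
        """Return ids of documents matching any query term, after stemming."""
        hits = set()
        for token in query.split():
            hits |= self.postings.get(self.stem(token), set())
        return hits


if __name__ == "__main__":
    index = StemmedIndex()
    index.add(1, "We create tables")
    index.add(2, "Tables were created yesterday")
    index.add(3, "Nothing relevant here")
    print(sorted(index.search("create created")))
```

Because the index only ever stores and looks up stems, it does not matter that "creat" is not a real word. What matters is that "create" and "created" land on the same key.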
Appendix
Before-and-after comparison for a sample vocabulary: http://snowball.tartarus.org/algorithms/porter/diffs.txt
Main program (remarkably compact): http://tartarus.org/~martin/PorterStemmer/java.txt
(End of article)