English Stemming Algorithm (Porter stemmer)
Source: Internet  Editor: 程序博客网  Time: 2024/04/26 19:27
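Preface
Recently I needed to process English text and wanted to reduce English words to their base form, for example turning boys into boy.
Introduction
I found a nice tool, the Porter stemmer, whose home page is http://tartarus.org/~martin/PorterStemmer/. It has been implemented in many languages, including C, Java, and Perl.
Here is a short introduction to it: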
Stemming, in the parlance of searching and information retrieval, is the operation of stripping the suffixes from a word, leaving its stem. Google, for instance, uses stemming to search for web pages containing the words connected, connecting, connection and connections when you ask for a web page that contains the word connect.
There are basically two ways to implement stemming. The first approach is to create a big dictionary that maps words to their stems. The advantage of this approach is that it works perfectly (insofar as the stem of a word can be defined perfectly); the disadvantages are the space required by the dictionary and the investment required to maintain the dictionary as new words appear. The second approach is to use a set of rules that extract stems from words. The advantages of this approach are that the code is typically small, and it can gracefully handle new words; the disadvantage is that it occasionally makes mistakes. But, since stemming is imperfectly defined, anyway, occasional mistakes are tolerable, and the rule-based approach is the one that is generally chosen.
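The two approaches can be contrasted with a minimal Python sketch. Everything in it is illustrative: `STEM_DICT`, `dict_stem`, `TOY_SUFFIXES`, and `rule_stem` are hypothetical names, and the suffix list is a toy rule set, not Porter's actual rules.

```python
# Approach 1: dictionary lookup. Exact for known words, no help for new ones.
STEM_DICT = {
    "connected": "connect",
    "connecting": "connect",
    "connection": "connect",
    "connections": "connect",
}


def dict_stem(word):
    """Return the stem from the dictionary, or the lowercased word if unknown."""
    w = word.lower()
    return STEM_DICT.get(w, w)


# Approach 2: suffix-stripping rules (toy rules, NOT Porter's rule set).
# Longer suffixes come first so that "ions" is tried before "s".
TOY_SUFFIXES = ("ions", "ion", "ing", "ed", "s")


def rule_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    w = word.lower()
    for suffix in TOY_SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w


for w in ["connected", "connections", "connects", "news"]:
    print(w, dict_stem(w), rule_stem(w))
```

The dictionary leaves the unseen word "connects" unchanged, while the rules reduce it to "connect"; the rules also turn "news" into "new", which is the kind of occasional mistake the passage describes.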
In 1979, Martin Porter developed a stemming algorithm that, with minor modifications, is still in use today; it uses a set of rules to extract stems from words, and though it makes some mistakes, most common words seem to work out right. Porter describes his algorithm and provides a reference implementation in C at http://tartarus.org/~martin/PorterStemmer/index.html.
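I had tried this algorithm before, but gave up on it for the following reason.
For example, the inputs "create" and "created" both produce "creat". That was a big disappointment: the words were not restored to their original form at all.
This time I had no choice, because I still needed this functionality. After searching Google for a long time, I found that Lucene has an English analysis module, but it was too complex for my simple use case. Only later did I learn that Lucene actually uses this same method.
So I forced myself to read through the Porter stemmer home page, and a sentence in its FAQ made everything clear.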
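Example
For example, given the inputs "create" and "created", it produces "creat" for both.
So all you need to do is apply the same processing at query time. For the query "create created", the database lookup only needs to search for "creat", provided the stored text was stemmed the same way.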
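The idea of stemming both the stored text and the query can be sketched in Python as follows. This is a minimal illustration under stated assumptions: `toy_stem` is a hypothetical stand-in that happens to map "create" and "created" to "creat", not the Porter algorithm, and an in-memory dict stands in for the database.

```python
import re
from collections import defaultdict


def toy_stem(word):
    """Hypothetical stand-in for a real stemmer (NOT Porter's rules)."""
    w = word.lower()
    for suffix in ("ing", "ed", "es", "s", "e"):
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w


def tokens(text):
    """Split text into alphabetic words and stem each one."""
    return [toy_stem(t) for t in re.findall(r"[A-Za-z]+", text)]


def build_index(docs):
    """Map each stem to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for stem in tokens(text):
            index[stem].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents matching any stemmed query term."""
    hits = set()
    for stem in tokens(query):
        hits |= index.get(stem, set())
    return hits


docs = {
    1: "She created a file",
    2: "Create the table first",
    3: "Delete the table",
}
index = build_index(docs)
print(sorted(search(index, "create created")))  # prints [1, 2]
```

Because the same function normalizes both sides, it does not matter that "creat" is not a real word; it only has to be the same key at index time and at query time.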
Appendix
Before-and-after comparison for a simple vocabulary: http://snowball.tartarus.org/algorithms/porter/diffs.txt
Main program (remarkably compact): http://tartarus.org/martin/PorterStemmer/java.txt
Reposted from: http://blog.csdn.net/whuslei/article/details/7398443