
Source: Internet  Editor: 程序博客网  Date: 2024/06/15 08:18



[[analyzers]]
method = "ngram-word"
ngram = 1

[[analyzers.filter]]
type = "whitespace-tokenizer"

[[analyzers.filter]]
type = "lowercase"

[[analyzers.filter]]
type = "alpha"

[[analyzers.filter]]
type = "length"
min = 2
max = 35

[[analyzers.filter]]
type = "list"
file = "lemur-stopwords.txt"

[[analyzers.filter]]
type = "porter2-stemmer"

This tells MeTA how to process the text before indexing the documents. “ngram = 1” configures MeTA to use unigrams (single words). Each “[[analyzers.filter]]” entry defines a text filter that applies one transformation to the text. The filters are chained together in the order listed: the text is first split by a whitespace tokenizer, which separates words on whitespace, and then every token is converted to lowercase. Next come the “alpha” and “length” filters, and the chain ends with stopword removal and stemming. The filters can usually be changed depending on the application. For more information on how to use and configure the filters in MeTA, see MeTA's Analyzers and Filters documentation.
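To make the idea of a filter chain concrete, here is a minimal Python sketch of the first five steps. It is not MeTA's implementation: the function names, the tiny STOPWORDS set (a hypothetical stand-in for lemur-stopwords.txt), and the assumption that the "alpha" filter strips non-letter characters are all illustrative. Stemming is left out because a Porter2 stemmer is too long to reproduce here.

```python
import re

# Hypothetical stand-in for lemur-stopwords.txt; the real list is much longer.
STOPWORDS = {"a", "about", "above", "the", "is", "of", "and", "it"}

def whitespace_tokenize(text):
    # Split the raw text on runs of whitespace.
    return text.split()

def lowercase(tokens):
    return [t.lower() for t in tokens]

def alpha(tokens):
    # Assumption: strip non-letter characters and drop tokens left empty.
    cleaned = (re.sub(r"[^A-Za-z]", "", t) for t in tokens)
    return [t for t in cleaned if t]

def length(tokens, min_len=2, max_len=35):
    # Keep tokens whose length lies within [min_len, max_len].
    return [t for t in tokens if min_len <= len(t) <= max_len]

def remove_stopwords(tokens, stopwords=STOPWORDS):
    return [t for t in tokens if t not in stopwords]

def analyze(text):
    # Apply the filters in the same order as in the config file.
    tokens = whitespace_tokenize(text)
    for step in (lowercase, alpha, length, remove_stopwords):
        tokens = step(tokens)
    return tokens

print(analyze("The Index of 2 documents is READY, and MeTA built it!"))
# prints ['index', 'documents', 'ready', 'meta', 'built']
```

Each step only sees the output of the previous one, which is why order matters: here the stopword check runs after lowercasing, so "The" is matched against "the".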

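1. Tokenization: whitespace (type = "whitespace-tokenizer") is used as the delimiter. With ngram = 1 (unigrams), each token is a single word; with ngram = 2 (bigrams), each token would be a pair of adjacent words. Chinese text has no spaces between words, so it needs a dedicated word segmentation tool.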

2. Lowercasing: type = "lowercase" converts letters to lowercase, so that "Me" and "me" are treated as the same word.

3. Other filters can be chosen depending on the application, such as type = "length" and type = "alpha".

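4. Stop word and function word removal: file = "lemur-stopwords.txt". Put simply, stop words are words that carry little meaning on their own (a, about, above, 上午 "morning", 下午 "afternoon", 中午 "noon"). Stop word lists exist for both English and Chinese.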

5. Stemming: type = "porter2-stemmer". The examples below make this clear, with each word on the left and its stem on the right.

abandon                       abandon
abandoned                     abandon
abandoning                    abandon
abandonment                   abandon
abandons                      abandon
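
Returning to step 1, the difference between ngram = 1 and ngram = 2 can be sketched in a few lines of Python. The helper name word_ngrams and the sample tokens are hypothetical, and joining the words with a space is only one possible way to represent an n-gram, not necessarily how MeTA stores it.

```python
def word_ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["information", "retrieval", "with", "meta"]

print(word_ngrams(tokens, 1))
# prints ['information', 'retrieval', 'with', 'meta']
print(word_ngrams(tokens, 2))
# prints ['information retrieval', 'retrieval with', 'with meta']
```

With n = 2 the windows overlap, so four tokens produce three bigrams rather than two.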



