Common Preprocessing Methods for Text Modeling

Source: Internet. Editor: 程序博客网. Time: 2024/06/15 08:18

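I have been reading about text modeling recently. Given a large chunk of text, how do you model it?

Take a MeTA configuration as an example: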

[[analyzers]]
method = "ngram-word"
ngram = 1
    [[analyzers.filter]]
    type = "whitespace-tokenizer"
    [[analyzers.filter]]
    type = "lowercase"
    [[analyzers.filter]]
    type = "alpha"
    [[analyzers.filter]]
    type = "length"
    min = 2
    max = 35
    [[analyzers.filter]]
    type = "list"
    file = "lemur-stopwords.txt"
    [[analyzers.filter]]
    type = "porter2-stemmer"

This tells MeTA how to process the text before indexing the documents. “ngram=1” configures MeTA to use unigrams (single words). Each “[[analyzers.filter]]” tag defines a text filter that applies a special function on the text. These filters are being “chained” together; text will first be processed by a whitespace tokenizer which separates words based on white spaces, then all the tokenized words will be converted to lowercase. This is followed by a couple of filters that end up with stopword removal and stemming. These filters can be usually changed depending on the application. For more information on how to use and configure the filters in MeTA see MeTA's Analyzers and Filters documentation.
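To make the idea of a filter chain concrete, here is a minimal Python sketch of the same sequence of steps. It is not MeTA's implementation: the function names, the tiny stopword set, and the exact behavior of the alpha step (strip non-letters, drop tokens that become empty) are assumptions for illustration, and stemming is left out because it needs a real stemmer library.

```python
import re

# Tiny stand-in stopword set (assumption); a real run would load
# the words from a file such as lemur-stopwords.txt instead.
STOPWORDS = {"a", "about", "above", "the", "is", "of"}

def whitespace_tokenize(text):
    # Split on any run of whitespace.
    return text.split()

def lowercase(tokens):
    return [t.lower() for t in tokens]

def alpha(tokens):
    # Assumed behavior: strip non-letter characters, drop empty results.
    stripped = (re.sub(r"[^A-Za-z]", "", t) for t in tokens)
    return [t for t in stripped if t]

def length(tokens, min_len=2, max_len=35):
    # Keep only tokens whose length lies in [min_len, max_len].
    return [t for t in tokens if min_len <= len(t) <= max_len]

def remove_stopwords(tokens, stopwords=STOPWORDS):
    return [t for t in tokens if t not in stopwords]

def analyze(text):
    # Each filter consumes the output of the previous one, in order.
    tokens = whitespace_tokenize(text)
    for f in (lowercase, alpha, length, remove_stopwords):
        tokens = f(tokens)
    return tokens

print(analyze("The Me and me, about 42 cats!"))
```

Because every step takes a token list and returns a token list, reordering or swapping filters for a different application only means editing the tuple inside `analyze`.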

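1. Tokenization: whitespace (type = "whitespace-tokenizer") is the delimiter, so the text is split into words at spaces. With method = "ngram-word" and ngram = 1, each term is a single word (a unigram); with ngram = 2, each term would be a pair of adjacent words (a bigram). Chinese text has no spaces between words, so it needs a dedicated word segmentation tool.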
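The difference between ngram = 1 and ngram = 2 can be shown in a few lines of Python. This is only a sketch of word n-grams in general, with a hypothetical helper name `word_ngrams`; it does not reproduce MeTA's internal token format.

```python
def word_ngrams(tokens, n):
    # Slide a window of n consecutive tokens over the list.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "text modeling needs preprocessing".split()
unigrams = word_ngrams(tokens, 1)  # one word per term
bigrams = word_ngrams(tokens, 2)   # two adjacent words per term
print(unigrams)
print(bigrams)
```

The tokenizer is the same in both cases; n only controls how many consecutive tokens are joined into one term.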

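2. Lowercasing: type = "lowercase", so that Me and me are treated as the same word.

3. Other filters can be chosen depending on the application, such as type = "length" (here keeping only tokens between 2 and 35 characters long) and type = "alpha" (which deals with non-alphabetic characters).

4. Removing stopwords and function words: file = "lemur-stopwords.txt". Put simply, these are words that carry little content of their own (a, about, above, 上午 "morning", 下午 "afternoon", 中午 "noon"). Stopword lists exist for both Chinese and English.

5. Stemming: type = "porter2-stemmer". The examples below make it clear: each word on the left is reduced to the stem on the right.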

abandon abandon
abandoned abandon
abandoning abandon
abandonment abandon
abandons abandon

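(Steps 2-5 map onto the Weibo analysis I did a while ago, where the processing included replacing every image with picture and every hyperlink with http. Be sure to apply filters that suit your own data.)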
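An application-specific filter of this kind could look like the following Python sketch. The `[img:...]` marker is purely hypothetical, since how images appear in the raw text depends on how the posts were collected; the URL pattern is also a simplification.

```python
import re

# Simplified URL pattern: http:// or https:// followed by non-space characters.
URL_RE = re.compile(r"https?://\S+")
# Hypothetical image marker (assumption); adapt to the real data format.
IMAGE_RE = re.compile(r"\[img:[^\]]*\]")

def normalize_post(text):
    # Replace images first so a URL inside an image marker is not
    # rewritten separately and left as a broken marker.
    text = IMAGE_RE.sub("picture", text)
    text = URL_RE.sub("http", text)
    return text

print(normalize_post("nice view [img:abc.jpg] see http://t.cn/xyz now"))
```

Collapsing all links and all images into single placeholder tokens keeps the signal that a post contains them without adding thousands of unique terms to the vocabulary.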

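6. Selecting useful terms: using information gain, the chi-square (CHI) test, or other methods such as TF-IDF.

The text modeling series will keep being updated...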
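As one illustration of term selection, here is a minimal Python sketch of the chi-square score for a term against a class, computed from the usual 2x2 table of document counts. The function names and the toy documents are assumptions for the example, and real projects would normally rely on an existing library.

```python
def chi_square(a, b, c, d):
    # a: docs in the class that contain the term
    # b: docs outside the class that contain the term
    # c: docs in the class without the term
    # d: docs outside the class without the term
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # term or class is degenerate; no evidence either way
    return n * (a * d - b * c) ** 2 / denom

def chi_square_scores(docs, labels, target):
    # docs: list of token lists; labels: one class label per document.
    vocab = {w for tokens in docs for w in tokens}
    scores = {}
    for term in vocab:
        a = b = c = d = 0
        for tokens, label in zip(docs, labels):
            has_term = term in tokens
            in_class = label == target
            if has_term and in_class:
                a += 1
            elif has_term:
                b += 1
            elif in_class:
                c += 1
            else:
                d += 1
        scores[term] = chi_square(a, b, c, d)
    return scores

docs = [["good", "movie"], ["good", "plot"], ["bad", "movie"], ["bad", "plot"]]
labels = ["pos", "pos", "neg", "neg"]
scores = chi_square_scores(docs, labels, "pos")
print(sorted(scores.items()))
```

Terms spread evenly across classes, like movie here, score zero, while terms concentrated in one class score high; keeping only the top-scoring terms shrinks the feature set.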