Mahout Bayes Algorithm Source Code Analysis (2-1)


seq2sparse corresponds to org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles in Mahout. From the job-monitoring page of yesterday's run we can see that this step consists of seven Jobs: (1) DocumentTokenizer, (2) WordCount, (3) MakePartialVectors, (4) MergePartialVectors, (5) VectorTfIdf Document Frequency Count, (6) MakePartialVectors, (7) MergePartialVectors. Printing the parameter help of SparseVectorsFromSequenceFiles shows the following:

Usage:
 [--minSupport <minSupport> --analyzerName <analyzerName> --chunkSize <chunkSize>
  --output <output> --input <input> --minDF <minDF> --maxDFSigma <maxDFSigma>
  --maxDFPercent <maxDFPercent> --weight <weight> --norm <norm> --minLLR <minLLR>
  --numReducers <numReducers> --maxNGramSize <ngramSize> --overwrite --help
  --sequentialAccessVector --namedVector --logNormalize]
Options
  --minSupport (-s) minSupport       (Optional) Minimum Support. Default
                                     Value: 2
  --analyzerName (-a) analyzerName   The class name of the analyzer
  --chunkSize (-chunk) chunkSize     The chunkSize in MegaBytes. 100-10000 MB
  --output (-o) output               The directory pathname for output.
  --input (-i) input                 Path to job input directory.
  --minDF (-md) minDF                The minimum document frequency. Default
                                     is 1
  --maxDFSigma (-xs) maxDFSigma      What portion of the tf (tf-idf) vectors
                                     to be used, expressed in times the
                                     standard deviation (sigma) of the
                                     document frequencies of these vectors.
                                     Can be used to remove really high
                                     frequency terms. Expressed as a double
                                     value. Good value to be specified is 3.0.
                                     In case the value is less than 0 no
                                     vectors will be filtered out. Default is
                                     -1.0. Overrides maxDFPercent
  --maxDFPercent (-x) maxDFPercent   The max percentage of docs for the DF.
                                     Can be used to remove really high
                                     frequency terms. Expressed as an integer
                                     between 0 and 100. Default is 99. If
                                     maxDFSigma is also set, it will override
                                     this value.
  --weight (-wt) weight              The kind of weight to use. Currently TF
                                     or TFIDF
  --norm (-n) norm                   The norm to use, expressed as either a
                                     float or "INF" if you want to use the
                                     Infinite norm. Must be greater or equal
                                     to 0. The default is not to normalize
  --minLLR (-ml) minLLR              (Optional) The minimum Log Likelihood
                                     Ratio (Float). Default is 1.0
  --numReducers (-nr) numReducers    (Optional) Number of reduce tasks.
                                     Default Value: 1
  --maxNGramSize (-ng) ngramSize     (Optional) The maximum size of ngrams to
                                     create (2 = bigrams, 3 = trigrams, etc).
                                     Default Value: 1
  --overwrite (-ow)                  If set, overwrite the output directory
  --help (-h)                        Print out help
  --sequentialAccessVector (-seq)    (Optional) Whether output vectors should
                                     be SequentialAccessVectors. If set true
                                     else false
  --namedVector (-nv)                (Optional) Whether output vectors should
                                     be NamedVectors. If set true else false
  --logNormalize (-lnorm)            (Optional) Whether output vectors should
                                     be logNormalize. If set true else false
In the terminal output of yesterday's run, this step was invoked with the following command:

./bin/mahout seq2sparse -i /home/mahout/mahout-work-mahout/20news-seq -o /home/mahout/mahout-work-mahout/20news-vectors -lnorm -nv -wt tfidf
Let's look only at the parameters we actually used. First, -lnorm means the output vectors are normalized with the log function (true if the flag is set). -nv means the output vectors are NamedVectors; in Mahout a NamedVector simply wraps a vector together with a String name (typically the document key), so the identifier travels with the vector through later jobs. Finally, -wt tfidf selects the weighting scheme; see http://zh.wikipedia.org/wiki/TF-IDF for details.
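A minimal sketch of what -nv produces (NamedVector and RandomAccessSparseVector come from mahout-math; the indices, weights, and document name below are made up purely for illustration):

import org.apache.mahout.math.NamedVector;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

public class NamedVectorDemo {
  public static void main(String[] args) {
    // A sparse vector of cardinality 10; in seq2sparse the indices would be
    // dictionary positions of terms and the values their (tf-idf) weights.
    Vector v = new RandomAccessSparseVector(10);
    v.set(0, 1.0);
    v.set(3, 2.5);
    // With -nv each output vector is wrapped in a NamedVector, so the
    // document key (a hypothetical id here) is stored alongside the data.
    NamedVector nv = new NamedVector(v, "doc-4096");
    System.out.println(nv.getName() + " => " + nv);
  }
}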

Step (1) is triggered at line 253 of SparseVectorsFromSequenceFiles:

DocumentProcessor.tokenizeDocuments(inputDir, analyzerClass, tokenizedPath, conf);
Stepping into this call, we can see that the Mapper used is SequenceFileTokenizerMapper and that no Reducer is used. The Mapper code is as follows:

protected void map(Text key, Text value, Context context) throws IOException, InterruptedException {
  // Obtain a reusable TokenStream from the analyzer configured in setup()
  TokenStream stream = analyzer.reusableTokenStream(key.toString(), new StringReader(value.toString()));
  CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);
  StringTuple document = new StringTuple();
  stream.reset();
  // Collect every non-empty token into the StringTuple
  while (stream.incrementToken()) {
    if (termAtt.length() > 0) {
      document.add(new String(termAtt.buffer(), 0, termAtt.length()));
    }
  }
  // Emit: document key -> tokenized document
  context.write(key, document);
}
The Mapper's setup function mainly instantiates the Analyzer; for the Analyzer API see http://lucene.apache.org/core/3_0_3/api/core/org/apache/lucene/analysis/Analyzer.html . The method used in map is reusableTokenStream(String fieldName, Reader reader): "Creates a TokenStream that is allowed to be re-used from the previous time that the same thread called this method."
I wrote the following test program:

package mahout.fansy.test.bayes;

import java.io.IOException;
import java.io.StringReader;

import org.apache.hadoop.io.Text;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.mahout.common.ClassUtils;
import org.apache.mahout.common.StringTuple;

public class TestSequenceFileTokenizerMapper {

  // Instantiate the same analyzer that the Mapper's setup() would configure
  private static Analyzer analyzer = ClassUtils.instantiateAs(
      "org.apache.mahout.vectorizer.DefaultAnalyzer", Analyzer.class);

  public static void main(String[] args) throws IOException {
    testMap();
  }

  // Replays the body of SequenceFileTokenizerMapper.map() on a sample record
  public static void testMap() throws IOException {
    Text key = new Text("4096");
    Text value = new Text("today is also late.what about tomorrow?");
    TokenStream stream = analyzer.reusableTokenStream(key.toString(),
        new StringReader(value.toString()));
    CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);
    StringTuple document = new StringTuple();
    stream.reset();
    while (stream.incrementToken()) {
      if (termAtt.length() > 0) {
        document.add(new String(termAtt.buffer(), 0, termAtt.length()));
      }
    }
    System.out.println("key:" + key.toString() + ",document" + document);
  }
}
It produces the following result:

key:4096,document[today, also, late.what, about, tomorrow]
Note that the TokenStream has a stopwords attribute whose value is: [but, be, with, such, then, for, no, will, not, are, and, their, if, this, on, into, a, or, there, in, that, they, was, is, it, an, the, as, at, these, by, to, of]. Whenever one of these words is encountered it is dropped rather than added to the document, which is why "is" is missing from the output above.
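These are Lucene's default English stop words. A quick way to verify this, assuming DefaultAnalyzer delegates to Lucene's StandardAnalyzer (whose default stop set is exposed as the public constant STOP_WORDS_SET in Lucene 3.x):

import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class StopWordsDemo {
  public static void main(String[] args) {
    // Prints the default English stop word set that StandardAnalyzer (and
    // hence, presumably, Mahout's DefaultAnalyzer) filters out during
    // tokenization; it should match the list quoted above.
    System.out.println(StandardAnalyzer.STOP_WORDS_SET);
  }
}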

Ugh, it's late again. I'm already sleepy; off to brush my teeth and floss...



Share, be happy, grow.


Please credit the source when reposting: http://blog.csdn.net/fansy1990