Mahout Bayes Algorithm Source Code Analysis (2-1)
seq2sparse corresponds to org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles in Mahout. From the job-monitoring page of yesterday's run, we can see that this step consists of 7 Jobs: (1) DocumentTokenizer, (2) WordCount, (3) MakePartialVectors, (4) MergePartialVectors, (5) VectorTfIdf Document Frequency Count, (6) MakePartialVectors, (7) MergePartialVectors. Printing the parameter help of SparseVectorsFromSequenceFiles shows the following:
Usage: [--minSupport <minSupport> --analyzerName <analyzerName> --chunkSize <chunkSize>
        --output <output> --input <input> --minDF <minDF> --maxDFSigma <maxDFSigma>
        --maxDFPercent <maxDFPercent> --weight <weight> --norm <norm> --minLLR <minLLR>
        --numReducers <numReducers> --maxNGramSize <ngramSize> --overwrite --help
        --sequentialAccessVector --namedVector --logNormalize]
Options
  --minSupport (-s) minSupport        (Optional) Minimum Support. Default Value: 2
  --analyzerName (-a) analyzerName    The class name of the analyzer
  --chunkSize (-chunk) chunkSize      The chunkSize in MegaBytes. 100-10000 MB
  --output (-o) output                The directory pathname for output.
  --input (-i) input                  Path to job input directory.
  --minDF (-md) minDF                 The minimum document frequency. Default is 1
  --maxDFSigma (-xs) maxDFSigma       What portion of the tf (tf-idf) vectors to be used,
                                      expressed in times the standard deviation (sigma) of
                                      the document frequencies of these vectors. Can be used
                                      to remove really high frequency terms. Expressed as a
                                      double value. Good value to be specified is 3.0. In
                                      case the value is less than 0 no vectors will be
                                      filtered out. Default is -1.0. Overrides maxDFPercent
  --maxDFPercent (-x) maxDFPercent    The max percentage of docs for the DF. Can be used to
                                      remove really high frequency terms. Expressed as an
                                      integer between 0 and 100. Default is 99. If maxDFSigma
                                      is also set, it will override this value.
  --weight (-wt) weight               The kind of weight to use. Currently TF or TFIDF
  --norm (-n) norm                    The norm to use, expressed as either a float or "INF"
                                      if you want to use the Infinite norm. Must be greater
                                      or equal to 0. The default is not to normalize
  --minLLR (-ml) minLLR               (Optional) The minimum Log Likelihood Ratio (Float).
                                      Default is 1.0
  --numReducers (-nr) numReducers     (Optional) Number of reduce tasks. Default Value: 1
  --maxNGramSize (-ng) ngramSize      (Optional) The maximum size of ngrams to create
                                      (2 = bigrams, 3 = trigrams, etc). Default Value: 1
  --overwrite (-ow)                   If set, overwrite the output directory
  --help (-h)                         Print out help
  --sequentialAccessVector (-seq)     (Optional) Whether output vectors should be
                                      SequentialAccessVectors. If set true else false
  --namedVector (-nv)                 (Optional) Whether output vectors should be
                                      NamedVectors. If set true else false
  --logNormalize (-lnorm)             (Optional) Whether output vectors should be
                                      logNormalize. If set true else false

In the terminal output of yesterday's run, the command invoked for this step was:
./bin/mahout seq2sparse -i /home/mahout/mahout-work-mahout/20news-seq -o /home/mahout/mahout-work-mahout/20news-vectors -lnorm -nv -wt tfidf

Looking only at the flags we actually pass: -lnorm means the output vectors are normalized with a log function (true when the flag is set); -nv means the output vectors are NamedVectors (what exactly "named" means here is not yet clear to me); -wt tfidf selects the weighting scheme, TF-IDF; for details see http://zh.wikipedia.org/wiki/TF-IDF .
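As a rough illustration of the TF-IDF weighting that -wt tfidf selects, here is a minimal sketch of one common variant of the formula, tf * log(N/df). The class and method names below are mine for illustration, not Mahout's; Mahout's actual weight implementation may use a slightly different smoothing.

```java
// Minimal sketch of one common tf-idf variant: tf * log(N / df).
// Class and method names are illustrative, not Mahout's.
public class TfIdfSketch {

    // tf: term count in this document; df: number of documents
    // containing the term; numDocs: total number of documents.
    static double tfIdf(int tf, int df, int numDocs) {
        return tf * Math.log((double) numDocs / df);
    }

    public static void main(String[] args) {
        // A term seen 3 times here but present in only 10 of 1000
        // documents outweighs a term seen 5 times that appears in 900.
        System.out.println(tfIdf(3, 10, 1000));   // rare term: high weight
        System.out.println(tfIdf(5, 900, 1000));  // common term: low weight
    }
}
```

The point of the idf factor is visible above: frequency inside one document is damped by how widespread the term is across the whole corpus.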
Step (1) is launched at line 253 of SparseVectorsFromSequenceFiles:

DocumentProcessor.tokenizeDocuments(inputDir, analyzerClass, tokenizedPath, conf);

Stepping into this call, we can see that the Mapper used is SequenceFileTokenizerMapper, and no Reducer is used. The Mapper's code is:
protected void map(Text key, Text value, Context context) throws IOException, InterruptedException {
  TokenStream stream = analyzer.reusableTokenStream(key.toString(), new StringReader(value.toString()));
  CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);
  StringTuple document = new StringTuple();
  stream.reset();
  while (stream.incrementToken()) {
    if (termAtt.length() > 0) {
      document.add(new String(termAtt.buffer(), 0, termAtt.length()));
    }
  }
  context.write(key, document);
}

This Mapper's setup() mainly configures the Analyzer; for the Analyzer API see http://lucene.apache.org/core/3_0_3/api/core/org/apache/lucene/analysis/Analyzer.html . The method used in map() is reusableTokenStream(String fieldName, Reader reader): "Creates a TokenStream that is allowed to be re-used from the previous time that the same thread called this method."
I wrote the following test program:
package mahout.fansy.test.bayes;

import java.io.IOException;
import java.io.StringReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.mahout.common.ClassUtils;
import org.apache.mahout.common.StringTuple;
import org.apache.mahout.vectorizer.DefaultAnalyzer;
import org.apache.mahout.vectorizer.DocumentProcessor;

public class TestSequenceFileTokenizerMapper {

  private static Analyzer analyzer = ClassUtils.instantiateAs(
      "org.apache.mahout.vectorizer.DefaultAnalyzer", Analyzer.class);

  public static void main(String[] args) throws IOException {
    testMap();
  }

  public static void testMap() throws IOException {
    Text key = new Text("4096");
    Text value = new Text("today is also late.what about tomorrow?");
    TokenStream stream = analyzer.reusableTokenStream(key.toString(),
        new StringReader(value.toString()));
    CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);
    StringTuple document = new StringTuple();
    stream.reset();
    while (stream.incrementToken()) {
      if (termAtt.length() > 0) {
        document.add(new String(termAtt.buffer(), 0, termAtt.length()));
      }
    }
    System.out.println("key:" + key.toString() + ",document" + document);
  }
}

The output is:
key:4096,document[today, also, late.what, about, tomorrow]

Note that the TokenStream has a stopwords attribute whose value is: [but, be, with, such, then, for, no, will, not, are, and, their, if, this, on, into, a, or, there, in, that, they, was, is, it, an, the, as, at, these, by, to, of], so whenever one of these words is encountered it is simply skipped (which is why "is" is missing from the output above).
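The effect of that stop-word list can be mimicked in a few lines of plain Java. This is a toy sketch using a hand-picked subset of the list above; in reality the filtering happens inside Lucene's analyzer chain, not in Mahout code, and the class here is mine for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy sketch of stop-word filtering; the real work is done by
// Lucene's analyzer chain, not by Mahout itself.
public class StopwordSketch {

    // A hand-picked subset of the stop-word list shown above.
    static final Set<String> STOPWORDS = new HashSet<String>(Arrays.asList(
        "is", "the", "a", "and", "of", "to", "in", "that", "it"));

    // Keep only the tokens that are not stop words.
    static List<String> filter(List<String> tokens) {
        List<String> kept = new ArrayList<String>();
        for (String t : tokens) {
            if (!STOPWORDS.contains(t)) {
                kept.add(t);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // "is" is a stop word, so it is dropped, matching the
        // mapper output we saw above.
        System.out.println(filter(Arrays.asList("today", "is", "also", "late")));
        // -> [today, also, late]
    }
}
```

This also explains why "about" survives in the mapper's output: it is simply not in the stop-word set.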
Ah, it's late again. I'm sleepy; off to brush my teeth...
Share, enjoy, grow.

When reposting, please credit the source: http://blog.csdn.net/fansy1990