Mahout SparseVectorsFromSequenceFiles Explained (5)
Source: Internet  Editor: 程序博客网  Time: 2024/06/06 14:10
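This part covers createDictionaryChunks.
Parameters
wordCountPath: the input directory, i.e. the wordcount directory described earlier
dictionaryPathBase: the output directory
The remaining parameters are self-explanatory.
The code is straightforward: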
    List<Path> chunkPaths = Lists.newArrayList();

    Configuration conf = new Configuration(baseConf);
    FileSystem fs = FileSystem.get(wordCountPath.toUri(), conf);

    // Convert the chunk size limit from megabytes to bytes.
    long chunkSizeLimit = chunkSizeInMegabytes * 1024L * 1024L;
    int chunkIndex = 0;
    Path chunkPath = new Path(dictionaryPathBase, DICTIONARY_FILE + chunkIndex);
    chunkPaths.add(chunkPath);

    SequenceFile.Writer dictWriter = new SequenceFile.Writer(fs, conf, chunkPath, Text.class, IntWritable.class);

    try {
      long currentChunkSize = 0;
      Path filesPattern = new Path(wordCountPath, OUTPUT_FILES_PATTERN);
      int i = 0;
      for (Pair<Writable,Writable> record
           : new SequenceFileDirIterable<Writable,Writable>(filesPattern, PathType.GLOB, null, null, true, conf)) {
        // The current chunk has passed the limit: close it and start the next one.
        if (currentChunkSize > chunkSizeLimit) {
          Closeables.closeQuietly(dictWriter);
          chunkIndex++;

          chunkPath = new Path(dictionaryPathBase, DICTIONARY_FILE + chunkIndex);
          chunkPaths.add(chunkPath);

          dictWriter = new SequenceFile.Writer(fs, conf, chunkPath, Text.class, IntWritable.class);
          currentChunkSize = 0;
        }

        Writable key = record.getFirst();
        // Estimated record size: fixed overhead + 2 bytes per character of the term + 4 bytes for the int id.
        int fieldSize = DICTIONARY_BYTE_OVERHEAD + key.toString().length() * 2 + Integer.SIZE / 8;
        currentChunkSize += fieldSize;
        // The value is the running counter, i.e. the term's dimension index.
        dictWriter.append(key, new IntWritable(i++));
      }
      // Total number of terms, which is the dimensionality of the vectors.
      maxTermDimension[0] = i;
    } finally {
      Closeables.closeQuietly(dictWriter);
    }

    return chunkPaths;
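In short, this writes the vocabulary out as a sequence file. Because a sequence file stores key-value pairs, each term (the key) is paired with an auto-incrementing integer (the value), which identifies the dimension of the vector that the term maps to.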
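To illustrate what the integer value is for, here is a minimal plain-Java sketch. It uses no Hadoop or Mahout classes, and the names `TermIndexSketch`, `buildDictionary`, and `toSparseVector` are hypothetical, chosen only for this example. It assigns increasing ids to terms the same way `new IntWritable(i++)` does, then uses that mapping to turn a token list into sparse (dimension, count) pairs.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class TermIndexSketch {
    // Give each distinct term the next free integer id, in first-seen order.
    // (The wordcount output already has unique keys; the containsKey check
    // only keeps this sketch safe for arbitrary input.)
    static Map<String, Integer> buildDictionary(List<String> terms) {
        Map<String, Integer> dict = new LinkedHashMap<>();
        int i = 0;
        for (String term : terms) {
            if (!dict.containsKey(term)) {
                dict.put(term, i++);
            }
        }
        return dict;
    }

    // Count tokens per dimension; tokens missing from the dictionary are skipped.
    static Map<Integer, Integer> toSparseVector(List<String> tokens, Map<String, Integer> dict) {
        Map<Integer, Integer> vector = new TreeMap<>();
        for (String token : tokens) {
            Integer dimension = dict.get(token);
            if (dimension != null) {
                vector.merge(dimension, 1, Integer::sum);
            }
        }
        return vector;
    }

    public static void main(String[] args) {
        Map<String, Integer> dict = buildDictionary(Arrays.asList("apple", "banana", "cherry"));
        // prints {0=1, 2=2}
        System.out.println(toSparseVector(Arrays.asList("cherry", "apple", "cherry", "durian"), dict));
    }
}
```

Here "apple" lands in dimension 0 and "cherry" in dimension 2, while "durian" is dropped because it has no dictionary entry.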
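Once a chunk is full, a new chunk is started. Note that the size check runs before each append, so a chunk can exceed the limit by the size of its last record.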
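The rollover logic can be tried without a Hadoop cluster using the following sketch, which replaces each SequenceFile chunk with an in-memory map. The class name `DictionaryChunkSketch` and the method `chunk` are hypothetical, and the value 4 for `DICTIONARY_BYTE_OVERHEAD` is an assumption, since the article does not show the constant's definition. The size estimate and the check-before-append order follow the code above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class DictionaryChunkSketch {
    // Assumed per-record overhead in bytes; the article does not show the real value.
    static final int DICTIONARY_BYTE_OVERHEAD = 4;

    // Splits terms into chunks; each in-memory map stands in for one dictionary file.
    static List<Map<String, Integer>> chunk(List<String> terms, long chunkSizeLimit) {
        List<Map<String, Integer>> chunks = new ArrayList<>();
        Map<String, Integer> current = new LinkedHashMap<>();
        chunks.add(current); // the first chunk always exists, even for empty input
        long currentChunkSize = 0;
        int i = 0;
        for (String term : terms) {
            // Roll over only after the limit has already been exceeded.
            if (currentChunkSize > chunkSizeLimit) {
                current = new LinkedHashMap<>();
                chunks.add(current);
                currentChunkSize = 0;
            }
            // Same estimate as the original: overhead + 2 bytes per char + 4 bytes for the id.
            currentChunkSize += DICTIONARY_BYTE_OVERHEAD + term.length() * 2 + Integer.SIZE / 8;
            current.put(term, i++); // ids keep increasing across chunks
        }
        return chunks;
    }

    public static void main(String[] args) {
        // Estimated sizes: a=10, bb=12, ccc=14, d=10 bytes; limit is 20 bytes.
        // prints [{a=0, bb=1}, {ccc=2, d=3}]
        System.out.println(chunk(Arrays.asList("a", "bb", "ccc", "d"), 20));
    }
}
```

With a 20-byte limit, the first chunk reaches 22 bytes before the check fires, so "ccc" opens the second chunk. The ids are not reset at the chunk boundary, which keeps every term's dimension unique across all dictionary files.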