Mahout SparseVectorsFromSequenceFiles Explained (5)

This part covers createDictionaryChunks.

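Parameters

wordCountPath: the input directory, i.e. the wordcount directory described earlier.

dictionaryPathBase: the output directory.

The remaining parameters are self-explanatory.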

The code is simple:

    List<Path> chunkPaths = Lists.newArrayList();

    Configuration conf = new Configuration(baseConf);

    FileSystem fs = FileSystem.get(wordCountPath.toUri(), conf);

    long chunkSizeLimit = chunkSizeInMegabytes * 1024L * 1024L;
    int chunkIndex = 0;
    Path chunkPath = new Path(dictionaryPathBase, DICTIONARY_FILE + chunkIndex);
    chunkPaths.add(chunkPath);

    SequenceFile.Writer dictWriter = new SequenceFile.Writer(fs, conf, chunkPath, Text.class, IntWritable.class);

    try {
      long currentChunkSize = 0;
      Path filesPattern = new Path(wordCountPath, OUTPUT_FILES_PATTERN);
      int i = 0;
      for (Pair<Writable,Writable> record
           : new SequenceFileDirIterable<Writable,Writable>(filesPattern, PathType.GLOB, null, null, true, conf)) {
        if (currentChunkSize > chunkSizeLimit) {
          Closeables.closeQuietly(dictWriter);
          chunkIndex++;

          chunkPath = new Path(dictionaryPathBase, DICTIONARY_FILE + chunkIndex);
          chunkPaths.add(chunkPath);

          dictWriter = new SequenceFile.Writer(fs, conf, chunkPath, Text.class, IntWritable.class);
          currentChunkSize = 0;
        }

        Writable key = record.getFirst();
        int fieldSize = DICTIONARY_BYTE_OVERHEAD + key.toString().length() * 2 + Integer.SIZE / 8;
        currentChunkSize += fieldSize;
        dictWriter.append(key, new IntWritable(i++));
      }
      maxTermDimension[0] = i;
    } finally {
      Closeables.closeQuietly(dictWriter);
    }

    return chunkPaths;

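In short, it writes the vocabulary out as a sequence file. Because a sequence file stores key-value pairs, each term (the key) gets an auto-incrementing integer as its value, which indicates which dimension of the vector the term belongs to. After the loop, maxTermDimension[0] holds the total number of terms.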
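The term-to-dimension assignment can be illustrated without Hadoop. The following is only a minimal sketch in plain Java: the class `DictionaryIndexSketch` and the method `assignDimensions` are hypothetical names introduced for illustration, an in-memory map stands in for the sequence file, and the input terms are assumed to be unique (as the keys of a word-count output would be).

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; not part of Mahout.
class DictionaryIndexSketch {

  // Assigns each term the next free integer, in iteration order.
  // Assumes the terms are unique, like the keys of the word-count output.
  static Map<String, Integer> assignDimensions(List<String> terms) {
    Map<String, Integer> dictionary = new LinkedHashMap<>();
    int i = 0;
    for (String term : terms) {
      // Mirrors dictWriter.append(key, new IntWritable(i++)) in the original loop.
      dictionary.put(term, i++);
    }
    return dictionary;
  }

  public static void main(String[] args) {
    Map<String, Integer> dictionary =
        assignDimensions(Arrays.asList("apple", "banana", "cherry"));
    // The map size plays the role of maxTermDimension[0].
    System.out.println(dictionary + " dimensions=" + dictionary.size());
  }
}
```

Under these assumptions, the first term maps to dimension 0, and the number of entries equals the vector dimensionality needed downstream.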

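Once the estimated size of the current chunk exceeds chunkSizeLimit, the writer is closed and a new chunk file is started. The counter i is not reset, so the dimension numbers keep increasing across chunks.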
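The rollover logic can be simulated in memory to see where chunk boundaries fall. This is a rough sketch, not Mahout's code: `ChunkRolloverSketch` and `splitIntoChunks` are hypothetical names, lists of strings stand in for the chunk files, and the value of `ASSUMED_BYTE_OVERHEAD` is an assumption standing in for DICTIONARY_BYTE_OVERHEAD. The size estimate copies the formula from the loop above (overhead, plus two bytes per character, plus four bytes for the int).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration class; not part of Mahout.
class ChunkRolloverSketch {

  // Assumed stand-in for DICTIONARY_BYTE_OVERHEAD; the real value may differ.
  static final int ASSUMED_BYTE_OVERHEAD = 4;

  // Splits terms into chunks using the same check-then-write order as the original loop.
  static List<List<String>> splitIntoChunks(List<String> terms, long chunkSizeLimit) {
    List<List<String>> chunks = new ArrayList<>();
    List<String> current = new ArrayList<>();
    chunks.add(current); // the first chunk always exists, even with no terms
    long currentChunkSize = 0;
    for (String term : terms) {
      // The limit is checked before writing, so a chunk may overshoot it by one record.
      if (currentChunkSize > chunkSizeLimit) {
        current = new ArrayList<>();
        chunks.add(current);
        currentChunkSize = 0;
      }
      int fieldSize = ASSUMED_BYTE_OVERHEAD + term.length() * 2 + Integer.SIZE / 8;
      currentChunkSize += fieldSize;
      current.add(term);
    }
    return chunks;
  }

  public static void main(String[] args) {
    // Each two-character term is estimated at 4 + 4 + 4 = 12 bytes here.
    List<List<String>> chunks =
        splitIntoChunks(Arrays.asList("aa", "bb", "cc", "dd", "ee"), 20L);
    System.out.println(chunks);
  }
}
```

Because the comparison happens before the write, the limit behaves as a soft threshold: with a 20-byte limit and 12-byte records, each chunk in this sketch holds two records (24 bytes) before the next one triggers a rollover.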