Hadoop custom SdfInputFormat: splitting files at a record marker

Source: Internet  Editor: 程序博客网  Time: 2024/06/07 02:51

I need to process SDF files with Hadoop Streaming, and an SDF file has the following format:

    1
      -OEChem-12181003042D
    .....
    $$$$

That is, each record spans multiple lines and ends with a $$$$ line.


Hadoop's default splitting, however, is based on blocks, so a split boundary can fall in the middle of a record:

    for (FileStatus file: files) {
      Path path = file.getPath();
      FileSystem fs = path.getFileSystem(job.getConfiguration());
      long length = file.getLen();
      BlockLocation[] blkLocations = fs.getFileBlockLocations(file, 0, length);
      if ((length != 0) && isSplitable(job, path)) {
        long blockSize = file.getBlockSize();
        long splitSize = computeSplitSize(blockSize, minSize, maxSize);

        long bytesRemaining = length;
        while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
          int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
          splits.add(new FileSplit(path, length-bytesRemaining, splitSize,
                                   blkLocations[blkIndex].getHosts()));
          bytesRemaining -= splitSize;
        }

        if (bytesRemaining != 0) {
          splits.add(new FileSplit(path, length-bytesRemaining, bytesRemaining,
                     blkLocations[blkLocations.length-1].getHosts()));
        }
      } else if (length != 0) {
        splits.add(new FileSplit(path, 0, length, blkLocations[0].getHosts()));
      } else {
        //Create empty hosts array for zero length files
        splits.add(new FileSplit(path, 0, length, new String[0]));
      }
    }
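
To make the split arithmetic in that loop concrete, here is a small standalone sketch that uses only the Java standard library. The class name `SplitMathSketch` and the `long[] {offset, length}` representation are my own illustration, not Hadoop API. The slop value 1.1 and the `max(minSize, min(maxSize, blockSize))` formula are what I believe Hadoop's `FileInputFormat` uses, so treat them as assumptions if your version differs. The sketch covers only the non-empty, splittable branch.

```java
import java.util.ArrayList;
import java.util.List;

class SplitMathSketch {
    // Assumed to match Hadoop's SPLIT_SLOP: the last split may be up to 10% larger.
    static final double SPLIT_SLOP = 1.1;

    // Assumed to match FileInputFormat.computeSplitSize.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Returns {offset, length} pairs for a non-empty, splittable file.
    static List<long[]> computeSplits(long length, long splitSize) {
        List<long[]> splits = new ArrayList<>();
        long bytesRemaining = length;
        // Cut full-size splits while more than 1.1 split sizes remain.
        while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
            splits.add(new long[] {length - bytesRemaining, splitSize});
            bytesRemaining -= splitSize;
        }
        // Whatever is left (possibly slightly more than one split size) is the last split.
        if (bytesRemaining != 0) {
            splits.add(new long[] {length - bytesRemaining, bytesRemaining});
        }
        return splits;
    }

    public static void main(String[] args) {
        long splitSize = computeSplitSize(64, 1, Long.MAX_VALUE);
        for (long len : new long[] {200, 134}) {
            System.out.println("file length " + len + ":");
            for (long[] s : computeSplits(len, splitSize)) {
                System.out.println("  offset=" + s[0] + " length=" + s[1]);
            }
        }
    }
}
```

The offsets depend only on byte counts, never on file content, which is why a $$$$-terminated record can straddle two splits.
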
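At first I wanted to modify the getSplits method of FileInputFormat.java directly. But I am new to both Java and Hadoop, so I had to work from other examples. I found the source of NLineInputFormat, whose splitting (the getSplits method) goes by a specified number of lines, and that gave me some ideas. Then it occurred to me that TextInputFormat reads line by line, so it must do some extra handling when it reads a split. Indeed it does: working on the splits as already computed, it skips the (possibly partial) first line of a split that does not start at offset 0, and it reads past the end of the split to finish the split's last line.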

LineRecordReader.java:

        FileSystem fs = file.getFileSystem(job);
        FSDataInputStream fileIn = fs.open(split.getPath());
        boolean skipFirstLine = false;
        if (codec != null) {
          in = new LineReader(codec.createInputStream(fileIn), job);
          end = Long.MAX_VALUE;
        } else {
          if (start != 0) {
            skipFirstLine = true;
            --start;
            fileIn.seek(start);
          }
          in = new LineReader(fileIn, job);
        }
        if (skipFirstLine) {  // skip first line and re-establish "start".
          start += in.readLine(new Text(), 0,
                               (int)Math.min((long)Integer.MAX_VALUE, end - start));
        }
        this.pos = start;
      }
So I realized that all I needed to do was rewrite TextInputFormat.
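
The same skip-and-read-past-the-end idea can be applied to $$$$ records. The following is only a sketch under my own assumptions, not the code behind the link: it uses plain Java with an in-memory byte array standing in for the HDFS stream, assumes `\n` line endings, and the class name `SdfSplitReaderSketch` is hypothetical. The ownership rule it uses is one possible choice: a record belongs to the split that contains the start of the $$$$ line preceding it, and the first record of the file belongs to the first split.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class SdfSplitReaderSketch {
    static final String DELIMITER = "$$$$";

    private final byte[] data;   // stands in for the file on HDFS
    private final long end;      // exclusive end offset of this split
    private long pos;            // offset of the next unread byte
    private long lastDelimStart; // start offset of the last $$$$ line consumed

    SdfSplitReaderSketch(byte[] data, long start, long end) {
        this.data = data;
        this.end = end;
        if (start == 0) {
            pos = 0;
            lastDelimStart = -1; // the first record belongs to the first split
        } else {
            pos = start - 1;     // back up one byte, as LineRecordReader does
            readLine();          // discard the partial line; pos is now a line start
            lastDelimStart = Long.MAX_VALUE;
            // Skip the tail of the record owned by the previous split.
            while (pos < data.length) {
                long lineStart = pos;
                if (readLine().equals(DELIMITER)) {
                    lastDelimStart = lineStart;
                    break;
                }
            }
        }
    }

    // Reads one line starting at pos (without the trailing '\n') and advances pos.
    private String readLine() {
        int from = (int) pos;
        int i = from;
        while (i < data.length && data[i] != '\n') i++;
        pos = Math.min(i + 1, data.length);
        return new String(data, from, i - from, StandardCharsets.UTF_8);
    }

    // Returns the next record owned by this split, or null when there is none.
    String next() {
        if (lastDelimStart >= end || pos >= data.length) return null;
        StringBuilder record = new StringBuilder();
        lastDelimStart = Long.MAX_VALUE; // kept if EOF arrives before a delimiter
        while (pos < data.length) {      // may read past 'end' to finish the record
            long lineStart = pos;
            String line = readLine();
            record.append(line).append('\n');
            if (line.equals(DELIMITER)) {
                lastDelimStart = lineStart;
                break;
            }
        }
        return record.toString();
    }

    static List<String> readSplit(byte[] data, long start, long end) {
        SdfSplitReaderSketch reader = new SdfSplitReaderSketch(data, start, end);
        List<String> records = new ArrayList<>();
        for (String r = reader.next(); r != null; r = reader.next()) records.add(r);
        return records;
    }

    public static void main(String[] args) {
        byte[] data = "a\nb\n$$$$\nc\n$$$$\ndd\nee\n$$$$\n".getBytes(StandardCharsets.UTF_8);
        int splitSize = 7; // deliberately cuts through records
        for (int start = 0; start < data.length; start += splitSize) {
            long end = Math.min(start + splitSize, data.length);
            System.out.println("[" + start + ", " + end + ") -> " + readSplit(data, start, end));
        }
    }
}
```

With this rule the block-based splits from getSplits can stay untouched. Only the record reader changes, and every record is read exactly once no matter where the split boundaries fall.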

The source code is available here: source code

The runtime environment is Eclipse 3.7 + hadoop-1.0.0 + 64-bit Gentoo.