A custom Hadoop SdfInputFormat: splitting files by a record marker


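I need to process SDF files with Hadoop streaming, and an SDF file looks like this:

    1
      -OEChem-12181003042D
    .....
    $$$$

That is, each record spans multiple lines and ends with a line containing $$$$.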

Hadoop's default splitting, however, is block-based, so a split boundary can fall in the middle of a record:

    for (FileStatus file: files) {
      Path path = file.getPath();
      FileSystem fs = path.getFileSystem(job.getConfiguration());
      long length = file.getLen();
      BlockLocation[] blkLocations = fs.getFileBlockLocations(file, 0, length);
      if ((length != 0) && isSplitable(job, path)) {
        long blockSize = file.getBlockSize();
        long splitSize = computeSplitSize(blockSize, minSize, maxSize);

        long bytesRemaining = length;
        while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
          int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
          splits.add(new FileSplit(path, length-bytesRemaining, splitSize,
                                   blkLocations[blkIndex].getHosts()));
          bytesRemaining -= splitSize;
        }

        if (bytesRemaining != 0) {
          splits.add(new FileSplit(path, length-bytesRemaining, bytesRemaining,
                     blkLocations[blkLocations.length-1].getHosts()));
        }
      } else if (length != 0) {
        splits.add(new FileSplit(path, 0, length, blkLocations[0].getHosts()));
      } else {
        //Create empty hosts array for zero length files
        splits.add(new FileSplit(path, 0, length, new String[0]));
      }
    }
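To see where these boundaries end up, the arithmetic of the loop above can be replayed in plain Java without a cluster. This is only a sketch: the class name SplitOffsetsDemo and the method computeSplits are made up for illustration, block locations are left out, and zero-length files are ignored. The constant 1.1 and the max/min formula follow FileInputFormat in hadoop-1.0.0.

```java
import java.util.ArrayList;
import java.util.List;

class SplitOffsetsDemo {
    // Slack factor used by FileInputFormat: the last split may be up to 10% larger.
    static final double SPLIT_SLOP = 1.1;

    // Same formula as FileInputFormat.computeSplitSize.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Returns {offset, length} pairs for a splittable file of the given length.
    // Hypothetical helper: it keeps only the offset arithmetic of getSplits.
    static List<long[]> computeSplits(long length, long splitSize) {
        List<long[]> splits = new ArrayList<>();
        long bytesRemaining = length;
        while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
            splits.add(new long[] {length - bytesRemaining, splitSize});
            bytesRemaining -= splitSize;
        }
        if (bytesRemaining != 0) {
            splits.add(new long[] {length - bytesRemaining, bytesRemaining});
        }
        return splits;
    }

    public static void main(String[] args) {
        // Assumed numbers: a 64 MB block size and a 150 MB file.
        long splitSize = computeSplitSize(64L << 20, 1L, Long.MAX_VALUE);
        for (long[] s : computeSplits(150L << 20, splitSize)) {
            System.out.println("offset=" + s[0] + " length=" + s[1]);
        }
    }
}
```

With these assumed numbers the file is cut at offsets 0, 64 MB and 128 MB. Nothing in the computation looks at the file content, so a $$$$ line has no influence on where a split begins.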

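At first I wanted to modify the getSplits method in FileInputFormat.java directly. But I am new to both Java and Hadoop, so I had to work from existing examples. I found the source of NLineInputFormat, whose getSplits method splits the input by a specified number of lines, and that gave me some ideas. Then it occurred to me that TextInputFormat reads its input line by line, so it must do some extra handling when it reads a split. Indeed it does: on top of the block-based splits, its record reader skips the incomplete line at the start of a split (for every split except the first) and reads past the end of the split to complete its last line.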

LineRecordReader.java:

      FileSystem fs = file.getFileSystem(job);
      FSDataInputStream fileIn = fs.open(split.getPath());
      boolean skipFirstLine = false;
      if (codec != null) {
        in = new LineReader(codec.createInputStream(fileIn), job);
        end = Long.MAX_VALUE;
      } else {
        if (start != 0) {
          skipFirstLine = true;
          --start;
          fileIn.seek(start);
        }
        in = new LineReader(fileIn, job);
      }
      if (skipFirstLine) {  // skip first line and re-establish "start".
        start += in.readLine(new Text(), 0,
                             (int)Math.min((long)Integer.MAX_VALUE, end - start));
      }
      this.pos = start;
    }
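The effect of this boundary rule is easier to follow on a small in-memory example. The sketch below is not Hadoop code: LineSplitDemo, endOfLine and readSplit are hypothetical names, and a byte array stands in for the HDFS stream. It applies the same idea, namely back up one byte, skip to the next line start, then keep reading lines as long as they begin before the end of the split.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class LineSplitDemo {
    // Index just past the '\n' that ends the line containing pos (data.length at EOF).
    static int endOfLine(byte[] data, int pos) {
        while (pos < data.length && data[pos] != '\n') {
            pos++;
        }
        return Math.min(pos + 1, data.length);
    }

    // Lines owned by the split [start, start + length): every line that begins inside it.
    static List<String> readSplit(byte[] data, int start, int length) {
        int end = start + length;
        int pos = start;
        if (start != 0) {
            // Counterpart of "--start" plus skipping one line: if byte start-1 is
            // a '\n', pos comes back to start and the line beginning there is kept.
            pos = endOfLine(data, start - 1);
        }
        List<String> lines = new ArrayList<>();
        while (pos < end && pos < data.length) {
            int next = endOfLine(data, pos);  // may run past 'end' to finish the line
            int len = next - pos;
            if (data[next - 1] == '\n') {
                len--;                        // drop the line terminator
            }
            lines.add(new String(data, pos, len, StandardCharsets.UTF_8));
            pos = next;
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] data = "aaa\nbbbb\ncc\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(readSplit(data, 0, 5));  // boundary inside "bbbb": [aaa, bbbb]
        System.out.println(readSplit(data, 5, 7));  // [cc]
    }
}
```

Each line is returned by exactly one split, the one in which it begins, even though the boundary at byte 5 falls in the middle of "bbbb".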

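So I realized that adapting TextInputFormat would be enough: the splits stay block-based, and only the reading of each split has to treat a $$$$-terminated block, rather than a single line, as one record.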
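One way the same rule could carry over to $$$$-terminated records is sketched below. This is not the code from the linked source, and it does not use the Hadoop API. SdfSplitDemo, firstRecordStart and readSplit are hypothetical names, and the whole file is assumed to fit in a byte array. The rule assumed here is that a record belongs to the split in which its first byte lies.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class SdfSplitDemo {
    static final String DELIMITER = "$$$$";

    // Index just past the '\n' that ends the line beginning at pos (data.length at EOF).
    static int endOfLine(byte[] data, int pos) {
        while (pos < data.length && data[pos] != '\n') pos++;
        return Math.min(pos + 1, data.length);
    }

    // Text of the line stored in [from, to), without its "\n" or "\r\n" terminator.
    static String lineText(byte[] data, int from, int to) {
        int len = to - from;
        if (len > 0 && data[to - 1] == '\n') len--;
        if (len > 0 && data[from + len - 1] == '\r') len--;
        return new String(data, from, len, StandardCharsets.UTF_8);
    }

    // Offset of the first record that begins at or after 'start'.
    static int firstRecordStart(byte[] data, int start) {
        if (start == 0) return 0;
        int pos = start - 1;
        // Back up to the beginning of the line that holds byte start-1, so that a
        // "$$$$" line ending exactly at 'start' (or straddling it) is recognized.
        while (pos > 0 && data[pos - 1] != '\n') pos--;
        while (pos < data.length) {
            int next = endOfLine(data, pos);
            boolean isDelimiter = lineText(data, pos, next).equals(DELIMITER);
            pos = next;
            if (isDelimiter) break;  // the next record begins right after this line
        }
        return pos;
    }

    // Records owned by the split [start, start + length).
    static List<String> readSplit(byte[] data, int start, int length) {
        int end = start + length;
        int pos = firstRecordStart(data, start);
        List<String> records = new ArrayList<>();
        while (pos < end && pos < data.length) {
            StringBuilder record = new StringBuilder();
            boolean closed = false;
            while (pos < data.length && !closed) {  // may read past 'end'
                int next = endOfLine(data, pos);
                String line = lineText(data, pos, next);
                record.append(line).append('\n');
                closed = line.equals(DELIMITER);
                pos = next;
            }
            // Keep an unterminated tail only if it holds something besides whitespace.
            if (closed || !record.toString().trim().isEmpty()) {
                records.add(record.toString());
            }
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] data = "m1\n a\n$$$$\nm2\n b\n$$$$\nm3\n c\n$$$$\n"
                .getBytes(StandardCharsets.UTF_8);
        for (int start = 0; start < data.length; start += 12) {
            int length = Math.min(12, data.length - start);
            System.out.println("split@" + start + " -> "
                    + readSplit(data, start, length).size() + " record(s)");
        }
    }
}
```

In a real input format the same decisions would live in a record reader working on an HDFS stream, where backing up means seeking rather than indexing into an array. The sketch only shows that every record is read exactly once, regardless of where the block-based boundaries fall.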

The source code is available at: source code

Environment: Eclipse 3.7 + hadoop-1.0.0 + 64-bit Gentoo