The MapReduce Programming Process


Overview

  • Flow diagram
  • Default classes
  • Complete WordCount example

Flow diagram

  • The complete MapReduce flow diagram (image not included here)

Default classes
  • These are the classes that correspond to the stages in the flow diagram above:

    InputFormat         TextInputFormat
    RecordReader        LineRecordReader
    InputSplit          FileSplit
    Map                 Mapper (an identity map; the old API jar uses IdentityMapper)
    Combine             none by default
    Partitioner         HashPartitioner
    GroupingComparator  the key class's comparator; this is where the magic happens, and it can be set either inside the key class or from the driver class
    Reduce              Reducer (an identity reduce; the old API jar uses IdentityReducer)
    OutputFormat        TextOutputFormat (a subclass of FileOutputFormat)
    RecordWriter        LineRecordWriter
    OutputCommitter     FileOutputCommitter
Complete WordCount Code Example

  • The InputFormat stage corresponds to TextInputFormat, which extends FileInputFormat; most of the real substance is in FileInputFormat. The FileInputFormat class is listed first below, followed by TextInputFormat.
  • FileInputFormat source
  • /**  * A base class for file-based {@link InputFormat}s. *  * <p><code>FileInputFormat</code> is the base class for all file-based  * <code>InputFormat</code>s. This provides a generic implementation of * {@link #getSplits(JobContext)}. * Subclasses of <code>FileInputFormat</code> can also override the  * {@link #isSplitable(JobContext, Path)} method to ensure input-files are * not split-up and are processed as a whole by {@link Mapper}s. */public abstract class FileInputFormat<K, V> extends InputFormat<K, V> {  public static enum Counter {     BYTES_READ  }    private static final Log LOG = LogFactory.getLog(FileInputFormat.class);  private static final double SPLIT_SLOP = 1.1;   // 10% slop  private static final PathFilter hiddenFileFilter = new PathFilter(){      public boolean accept(Path p){        String name = p.getName();         return !name.startsWith("_") && !name.startsWith(".");       }    };   static final String NUM_INPUT_FILES = "mapreduce.input.num.files";  /**   * Proxy PathFilter that accepts a path only if all filters given in the   * constructor do. Used by the listPaths() to apply the built-in   * hiddenFileFilter together with a user provided one (if any).   */  private static class MultiPathFilter implements PathFilter {    private List<PathFilter> filters;    public MultiPathFilter(List<PathFilter> filters) {      this.filters = filters;    }    public boolean accept(Path path) {      for (PathFilter filter : filters) {        if (!filter.accept(path)) {          return false;        }      }      return true;    }  }  /**   * Get the lower bound on split size imposed by the format.   * @return the number of bytes of the minimal split for this format   */  protected long getFormatMinSplitSize() {    return 1;  }  /**   * Is the given filename splitable? Usually, true, but if the file is   * stream compressed, it will not be.   *    * <code>FileInputFormat</code> implementations can override this and return   * <code>false</code> to ensure that individual input files are never split-up   * so that {@link Mapper}s process entire files.   *    * @param context the job context   * @param filename the file name to check   * @return is this file splitable?   */  protected boolean isSplitable(JobContext context, Path filename) {    return true;  }  /**   * Set a PathFilter to be applied to the input paths for the map-reduce job.   * @param job the job to modify   * @param filter the PathFilter class use for filtering the input paths.   */  public static void setInputPathFilter(Job job,                                        Class<? 
extends PathFilter> filter) {    job.getConfiguration().setClass("mapred.input.pathFilter.class", filter,                                     PathFilter.class);  }  /**   * Set the minimum input split size   * @param job the job to modify   * @param size the minimum size   */  public static void setMinInputSplitSize(Job job,                                          long size) {    job.getConfiguration().setLong("mapred.min.split.size", size);  }  /**   * Get the minimum split size   * @param job the job   * @return the minimum number of bytes that can be in a split   */  public static long getMinSplitSize(JobContext job) {    return job.getConfiguration().getLong("mapred.min.split.size", 1L);  }  /**   * Set the maximum split size   * @param job the job to modify   * @param size the maximum split size   */  public static void setMaxInputSplitSize(Job job,                                          long size) {    job.getConfiguration().setLong("mapred.max.split.size", size);  }  /**   * Get the maximum split size.   * @param context the job to look at.   * @return the maximum number of bytes a split can include   */  public static long getMaxSplitSize(JobContext context) {    return context.getConfiguration().getLong("mapred.max.split.size",                                               Long.MAX_VALUE);  }  /**   * Get a PathFilter instance of the filter set for the input paths.   *   * @return the PathFilter instance set for the job, NULL if none has been set.   */  public static PathFilter getInputPathFilter(JobContext context) {    Configuration conf = context.getConfiguration();    Class<?> filterClass = conf.getClass("mapred.input.pathFilter.class", null,        PathFilter.class);    return (filterClass != null) ?        (PathFilter) ReflectionUtils.newInstance(filterClass, conf) : null;  }  /** List input directories.   * Subclasses may override to, e.g., select only files matching a regular   * expression.    *    * @param job the job to list input paths for   * @return array of FileStatus objects   * @throws IOException if zero items.   */  protected List<FileStatus> listStatus(JobContext job                                        ) throws IOException {    List<FileStatus> result = new ArrayList<FileStatus>();    Path[] dirs = getInputPaths(job);    if (dirs.length == 0) {      throw new IOException("No input paths specified in job");    }        // get tokens for all the required FileSystems..    TokenCache.obtainTokensForNamenodes(job.getCredentials(), dirs,                                         job.getConfiguration());    List<IOException> errors = new ArrayList<IOException>();        // creates a MultiPathFilter with the hiddenFileFilter and the    // user provided one (if any).    
List<PathFilter> filters = new ArrayList<PathFilter>();    filters.add(hiddenFileFilter);    PathFilter jobFilter = getInputPathFilter(job);    if (jobFilter != null) {      filters.add(jobFilter);    }    PathFilter inputFilter = new MultiPathFilter(filters);        for (int i=0; i < dirs.length; ++i) {      Path p = dirs[i];      FileSystem fs = p.getFileSystem(job.getConfiguration());       FileStatus[] matches = fs.globStatus(p, inputFilter);      if (matches == null) {        errors.add(new IOException("Input path does not exist: " + p));      } else if (matches.length == 0) {        errors.add(new IOException("Input Pattern " + p + " matches 0 files"));      } else {        for (FileStatus globStat: matches) {          if (globStat.isDir()) {            for(FileStatus stat: fs.listStatus(globStat.getPath(),                inputFilter)) {              result.add(stat);            }                    } else {            result.add(globStat);          }        }      }    }    if (!errors.isEmpty()) {      throw new InvalidInputException(errors);    }    LOG.info("Total input paths to process : " + result.size());     return result;  }    /**    * Generate the list of files and make them into FileSplits.   */   public List<InputSplit> getSplits(JobContext job                                    ) throws IOException {    long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job));    long maxSize = getMaxSplitSize(job);    // generate splits    List<InputSplit> splits = new ArrayList<InputSplit>();    List<FileStatus>files = listStatus(job);    for (FileStatus file: files) {      Path path = file.getPath();      FileSystem fs = path.getFileSystem(job.getConfiguration());      long length = file.getLen();      BlockLocation[] blkLocations = fs.getFileBlockLocations(file, 0, length);      if ((length != 0) && isSplitable(job, path)) {         long blockSize = file.getBlockSize();        long splitSize = computeSplitSize(blockSize, minSize, maxSize);        long bytesRemaining = length;        while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {          int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);          splits.add(new FileSplit(path, length-bytesRemaining, splitSize,                                    blkLocations[blkIndex].getHosts()));          bytesRemaining -= splitSize;        }                if (bytesRemaining != 0) {          splits.add(new FileSplit(path, length-bytesRemaining, bytesRemaining,                      blkLocations[blkLocations.length-1].getHosts()));        }      } else if (length != 0) {        splits.add(new FileSplit(path, 0, length, blkLocations[0].getHosts()));      } else {         //Create empty hosts array for zero length files        splits.add(new FileSplit(path, 0, length, new String[0]));      }    }        // Save the number of input files in the job-conf    job.getConfiguration().setLong(NUM_INPUT_FILES, files.size());    LOG.debug("Total # of splits: " + splits.size());    return splits;  }  protected long computeSplitSize(long blockSize, long minSize,                                  long maxSize) {    return Math.max(minSize, Math.min(maxSize, blockSize));  }  protected int getBlockIndex(BlockLocation[] blkLocations,                               long offset) {    for (int i = 0 ; i < blkLocations.length; i++) {      // is the offset inside this block?      
if ((blkLocations[i].getOffset() <= offset) &&          (offset < blkLocations[i].getOffset() + blkLocations[i].getLength())){        return i;      }    }    BlockLocation last = blkLocations[blkLocations.length -1];    long fileLength = last.getOffset() + last.getLength() -1;    throw new IllegalArgumentException("Offset " + offset +                                        " is outside of file (0.." +                                       fileLength + ")");  }  /**   * Sets the given comma separated paths as the list of inputs    * for the map-reduce job.   *    * @param job the job   * @param commaSeparatedPaths Comma separated paths to be set as    *        the list of inputs for the map-reduce job.   */  public static void setInputPaths(Job job,                                    String commaSeparatedPaths                                   ) throws IOException {    setInputPaths(job, StringUtils.stringToPath(                        getPathStrings(commaSeparatedPaths)));  }  /**   * Add the given comma separated paths to the list of inputs for   *  the map-reduce job.   *    * @param job The job to modify   * @param commaSeparatedPaths Comma separated paths to be added to   *        the list of inputs for the map-reduce job.   */  public static void addInputPaths(Job job,                                    String commaSeparatedPaths                                   ) throws IOException {    for (String str : getPathStrings(commaSeparatedPaths)) {      addInputPath(job, new Path(str));    }  }  /**   * Set the array of {@link Path}s as the list of inputs   * for the map-reduce job.   *    * @param job The job to modify    * @param inputPaths the {@link Path}s of the input directories/files    * for the map-reduce job.   */   public static void setInputPaths(Job job,                                    Path... inputPaths) throws IOException {    Configuration conf = job.getConfiguration();    Path path = inputPaths[0].getFileSystem(conf).makeQualified(inputPaths[0]);    StringBuffer str = new StringBuffer(StringUtils.escapeString(path.toString()));    for(int i = 1; i < inputPaths.length;i++) {      str.append(StringUtils.COMMA_STR);      path = inputPaths[i].getFileSystem(conf).makeQualified(inputPaths[i]);      str.append(StringUtils.escapeString(path.toString()));    }    conf.set("mapred.input.dir", str.toString());  }  /**   * Add a {@link Path} to the list of inputs for the map-reduce job.   *    * @param job The {@link Job} to modify   * @param path {@link Path} to be added to the list of inputs for    *            the map-reduce job.   */  public static void addInputPath(Job job,                                   Path path) throws IOException {    Configuration conf = job.getConfiguration();    path = path.getFileSystem(conf).makeQualified(path);    String dirStr = StringUtils.escapeString(path.toString());    String dirs = conf.get("mapred.input.dir");    conf.set("mapred.input.dir", dirs == null ? dirStr : dirs + "," + dirStr);  }    // This method escapes commas in the glob pattern of the given paths.  
private static String[] getPathStrings(String commaSeparatedPaths) {    int length = commaSeparatedPaths.length();    int curlyOpen = 0;    int pathStart = 0;    boolean globPattern = false;    List<String> pathStrings = new ArrayList<String>();        for (int i=0; i<length; i++) {      char ch = commaSeparatedPaths.charAt(i);      switch(ch) {        case '{' : {          curlyOpen++;          if (!globPattern) {            globPattern = true;          }          break;        }        case '}' : {          curlyOpen--;          if (curlyOpen == 0 && globPattern) {            globPattern = false;          }          break;        }        case ',' : {          if (!globPattern) {            pathStrings.add(commaSeparatedPaths.substring(pathStart, i));            pathStart = i + 1 ;          }          break;        }      }    }    pathStrings.add(commaSeparatedPaths.substring(pathStart, length));        return pathStrings.toArray(new String[0]);  }    /**   * Get the list of input {@link Path}s for the map-reduce job.   *    * @param context The job   * @return the list of input {@link Path}s for the map-reduce job.   */  public static Path[] getInputPaths(JobContext context) {    String dirs = context.getConfiguration().get("mapred.input.dir", "");    String [] list = StringUtils.split(dirs);    Path[] result = new Path[list.length];    for (int i = 0; i < list.length; i++) {      result[i] = new Path(StringUtils.unEscapeString(list[i]));    }    return result;  }}
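  • A minimal sketch of how a driver can steer the split arithmetic above (my own example; the class name SplitSizeDemo and the 16 MB figure are made up, not from the original post): capping the maximum split size makes getSplits() produce more, smaller FileSplits.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeDemo {
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "split-size demo");
    // minSize keeps its default of 1 byte; maxSize is capped at 16 MB, so on a
    // 64 MB-block cluster computeSplitSize = max(1, min(16 MB, 64 MB)) = 16 MB.
    FileInputFormat.setMinInputSplitSize(job, 1L);
    FileInputFormat.setMaxInputSplitSize(job, 16L * 1024 * 1024);
  }
}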
  • TextInputFormat source
public class TextInputFormat extends FileInputFormat<LongWritable, Text> {

  @Override
  public RecordReader<LongWritable, Text>
    createRecordReader(InputSplit split, TaskAttemptContext context) {
    return new LineRecordReader();
  }

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    CompressionCodec codec =
      new CompressionCodecFactory(context.getConfiguration()).getCodec(file);
    return codec == null;
  }
}
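  • Note the isSplitable() override above: if a compression codec is registered for the file (a .gz input, for example), it returns false and the whole file goes to a single mapper, because a gzip stream cannot be decompressed starting from an arbitrary offset.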
  • Inside the InputFormat, the original file is first carved into FileSplits, and the RecordReader then processes the data one FileSplit at a time; a worked example of the split arithmetic follows.
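  • A worked example of the getSplits() arithmetic above (the numbers are mine, not from the original post): with the default minSize of 1 byte, maxSize of Long.MAX_VALUE and a 64 MB block size, computeSplitSize() returns 64 MB. For a 200 MB file the loop keeps cutting 64 MB splits while the remainder exceeds SPLIT_SLOP (1.1) times the split size, so the file becomes splits of 64, 64, 64 and 8 MB; had 70 MB remained after the second cut, it would have stayed a single final split, since 70/64 is within the 10% slop.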
  • In my view the real substance is in LineRecordReader; the splitting itself is simple, essentially a file of size 10 being cut into pieces of 4, 4 and 2. Below is the FileSplit class first, then the LineRecordReader class.
public class FileSplit extends InputSplit implements Writable {
  private Path file;
  private long start;
  private long length;
  private String[] hosts;

  FileSplit() {}

  /** Constructs a split with host information
   *
   * @param file the file name
   * @param start the position of the first byte in the file to process
   * @param length the number of bytes in the file to process
   * @param hosts the list of hosts containing the block, possibly null
   */
  public FileSplit(Path file, long start, long length, String[] hosts) {
    this.file = file;
    this.start = start;
    this.length = length;
    this.hosts = hosts;
  }

  /** The file containing this split's data. */
  public Path getPath() { return file; }

  /** The position of the first byte in the file to process. */
  public long getStart() { return start; }

  /** The number of bytes in the file to process. */
  @Override
  public long getLength() { return length; }

  @Override
  public String toString() { return file + ":" + start + "+" + length; }

  ////////////////////////////////////////////
  // Writable methods
  ////////////////////////////////////////////

  @Override
  public void write(DataOutput out) throws IOException {
    Text.writeString(out, file.toString());
    out.writeLong(start);
    out.writeLong(length);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    file = new Path(Text.readString(in));
    start = in.readLong();
    length = in.readLong();
    hosts = null;
  }

  @Override
  public String[] getLocations() throws IOException {
    if (this.hosts == null) {
      return new String[]{};
    } else {
      return this.hosts;
    }
  }
}
  • LineRecordReader source; there is genuinely a lot to learn from this class.
public class LineRecordReader extends RecordReader<LongWritable, Text> {
  private static final Log LOG = LogFactory.getLog(LineRecordReader.class);

  private CompressionCodecFactory compressionCodecs = null;
  private long start;
  private long pos;
  private long end;
  private LineReader in;
  private int maxLineLength;
  private LongWritable key = null;
  private Text value = null;

  public void initialize(InputSplit genericSplit,
                         TaskAttemptContext context) throws IOException {
    FileSplit split = (FileSplit) genericSplit;
    Configuration job = context.getConfiguration();
    this.maxLineLength = job.getInt("mapred.linerecordreader.maxlength",
                                    Integer.MAX_VALUE);
    start = split.getStart();
    end = start + split.getLength();
    final Path file = split.getPath();
    compressionCodecs = new CompressionCodecFactory(job);
    final CompressionCodec codec = compressionCodecs.getCodec(file);

    // open the file and seek to the start of the split
    FileSystem fs = file.getFileSystem(job);
    FSDataInputStream fileIn = fs.open(split.getPath());
    boolean skipFirstLine = false;
    if (codec != null) {
      in = new LineReader(codec.createInputStream(fileIn), job);
      end = Long.MAX_VALUE;
    } else {
      if (start != 0) {
        skipFirstLine = true;
        --start;
        fileIn.seek(start);
      }
      in = new LineReader(fileIn, job);
    }
    if (skipFirstLine) {  // skip first line and re-establish "start".
      start += in.readLine(new Text(), 0,
                           (int)Math.min((long)Integer.MAX_VALUE, end - start));
    }
    this.pos = start;
  }

  public boolean nextKeyValue() throws IOException {
    if (key == null) {
      key = new LongWritable();
    }
    key.set(pos);
    if (value == null) {
      value = new Text();
    }
    int newSize = 0;
    while (pos < end) {
      newSize = in.readLine(value, maxLineLength,
                            Math.max((int)Math.min(Integer.MAX_VALUE, end-pos),
                                     maxLineLength));
      if (newSize == 0) {
        break;
      }
      pos += newSize;
      if (newSize < maxLineLength) {
        break;
      }

      // line too long. try again
      LOG.info("Skipped line of size " + newSize + " at pos " +
               (pos - newSize));
    }
    if (newSize == 0) {
      key = null;
      value = null;
      return false;
    } else {
      return true;
    }
  }

  @Override
  public LongWritable getCurrentKey() {
    return key;
  }

  @Override
  public Text getCurrentValue() {
    return value;
  }

  /**
   * Get the progress within the split
   */
  public float getProgress() {
    if (start == end) {
      return 0.0f;
    } else {
      return Math.min(1.0f, (pos - start) / (float)(end - start));
    }
  }

  public synchronized void close() throws IOException {
    if (in != null) {
      in.close();
    }
  }
}
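  • The trickiest part of LineRecordReader is the split-boundary handling in initialize(): unless the split starts at byte 0, the reader backs up one byte, reads and discards the (possibly partial) first line, and starts from the next full line; meanwhile, nextKeyValue() keeps reading while pos < end, so the reader of the previous split finishes the line that straddles its end. Together these two rules guarantee that every line is read exactly once even when a FileSplit boundary falls in the middle of a line.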
  • The Map class; below is the Mapper that the WordCount job uses.
public class TokenizerMapper
     extends Mapper<Object, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(Object key, Text value, Context context
                  ) throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}
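  • For example, given the input line "hello world hello", TokenizerMapper emits (hello, 1), (world, 1), (hello, 1): one (word, 1) pair per token.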
  • The Combine class. WordCount uses IntSumReducer as its combiner, which is exactly the same class as the reducer.
public class IntSumReducer
     extends Reducer<Text, IntWritable, Text, IntWritable> {

  private IntWritable result = new IntWritable();

  public void reduce(Text key, Iterable<IntWritable> values,
                     Context context
                     ) throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}
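  • Continuing the example above: without a combiner, the map task would ship (hello, 1), (world, 1), (hello, 1) into the shuffle; with IntSumReducer as the combiner it ships (hello, 2), (world, 1) instead. The combiner is a map-side optimization that reduces shuffle traffic without changing the final result.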
  • The HashPartitioner class decides which reducer each key is sent to; a small demo follows the source below.
public class HashPartitioner<K, V> extends Partitioner<K, V> {

  /** Use {@link Object#hashCode()} to partition. */
  public int getPartition(K key, V value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}
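  • As a quick sanity check (my own sketch, not from the original post; the class name PartitionDemo and the sample words are made up), the snippet below applies the same formula as getPartition() to a few Text keys, assuming three reduce tasks:

import org.apache.hadoop.io.Text;

public class PartitionDemo {
  public static void main(String[] args) {
    int numReduceTasks = 3;  // assumed number of reducers
    for (String w : new String[] {"hadoop", "map", "reduce"}) {
      Text key = new Text(w);
      // same arithmetic as HashPartitioner.getPartition()
      int partition = (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
      System.out.println(w + " -> reducer " + partition);
    }
  }
}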
  • GroupingComparator, where the magic happens. WordCount's key is the Text class, and the Comparator inner class of Text is what I mean by the GroupingComparator. Used well, this class lets you control how keys are grouped at the reducer (for example, for secondary sort); a sketch follows the source below.
public static class Comparator extends WritableComparator {
  public Comparator() {
    super(Text.class);
  }

  public int compare(byte[] b1, int s1, int l1,
                     byte[] b2, int s2, int l2) {
    int n1 = WritableUtils.decodeVIntSize(b1[s1]);
    int n2 = WritableUtils.decodeVIntSize(b2[s2]);
    return compareBytes(b1, s1+n1, l1-n1, b2, s2+n2, l2-n2);
  }
}

static {
  // register this comparator
  WritableComparator.define(Text.class, new Comparator());
}
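  • A sketch of what "using this class well" can look like (my own illustration; the WordOnlyGroupingComparator class and the word#year composite-key format are hypothetical, not part of WordCount): a grouping comparator that compares only the word part of a word#year key makes a single reduce() call see the values for every year of that word, while the full composite key still controls the sort order.

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class WordOnlyGroupingComparator extends WritableComparator {

  public WordOnlyGroupingComparator() {
    super(Text.class, true);  // create Text instances so compare() receives objects
  }

  @Override
  public int compare(WritableComparable a, WritableComparable b) {
    // group "word#year" keys by the word part only
    String left = a.toString().split("#")[0];
    String right = b.toString().split("#")[0];
    return left.compareTo(right);
  }
}

// In the driver: job.setGroupingComparatorClass(WordOnlyGroupingComparator.class);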
  • WordCount's reduce class, the same as the combine class above; strictly speaking, it is the combine class that reuses this reducer.
public static class IntSumReducer
     extends Reducer<Text, IntWritable, Text, IntWritable> {

  private IntWritable result = new IntWritable();

  public void reduce(Text key, Iterable<IntWritable> values, Context context
                     ) throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}
  • The TextOutputFormat class. Like TextInputFormat, it has a parent class, FileOutputFormat, and again most of the substance lives in FileOutputFormat.
  • FileOutputFormat source
  • public abstract class FileOutputFormat<K, V> extends OutputFormat<K, V> {    protected static final String BASE_OUTPUT_NAME = "mapreduce.output.basename";  protected static final String PART = "part";  public static enum Counter {     BYTES_WRITTEN  }  /** Construct output file names so that, when an output directory listing is   * sorted lexicographically, positions correspond to output partitions.*/  private static final NumberFormat NUMBER_FORMAT = NumberFormat.getInstance();  static {    NUMBER_FORMAT.setMinimumIntegerDigits(5);    NUMBER_FORMAT.setGroupingUsed(false);  }  private FileOutputCommitter committer = null;  /**   * Set whether the output of the job is compressed.   * @param job the job to modify   * @param compress should the output of the job be compressed?   */  public static void setCompressOutput(Job job, boolean compress) {    job.getConfiguration().setBoolean("mapred.output.compress", compress);  }    /**   * Is the job output compressed?   * @param job the Job to look in   * @return <code>true</code> if the job output should be compressed,   *         <code>false</code> otherwise   */  public static boolean getCompressOutput(JobContext job) {    return job.getConfiguration().getBoolean("mapred.output.compress", false);  }    /**   * Set the {@link CompressionCodec} to be used to compress job outputs.   * @param job the job to modify   * @param codecClass the {@link CompressionCodec} to be used to   *                   compress the job outputs   */  public static void   setOutputCompressorClass(Job job,                            Class<? extends CompressionCodec> codecClass) {    setCompressOutput(job, true);    job.getConfiguration().setClass("mapred.output.compression.codec",                                     codecClass,                                     CompressionCodec.class);  }    /**   * Get the {@link CompressionCodec} for compressing the job outputs.   * @param job the {@link Job} to look in   * @param defaultValue the {@link CompressionCodec} to return if not set   * @return the {@link CompressionCodec} to be used to compress the    *         job outputs   * @throws IllegalArgumentException if the class was specified, but not found   */  public static Class<? extends CompressionCodec>   getOutputCompressorClass(JobContext job,                        Class<? extends CompressionCodec> defaultValue) {    Class<? 
extends CompressionCodec> codecClass = defaultValue;    Configuration conf = job.getConfiguration();    String name = conf.get("mapred.output.compression.codec");    if (name != null) {      try {        codecClass =         conf.getClassByName(name).asSubclass(CompressionCodec.class);      } catch (ClassNotFoundException e) {        throw new IllegalArgumentException("Compression codec " + name +                                            " was not found.", e);      }    }    return codecClass;  }    public abstract RecordWriter<K, V>      getRecordWriter(TaskAttemptContext job                     ) throws IOException, InterruptedException;  public void checkOutputSpecs(JobContext job                               ) throws FileAlreadyExistsException, IOException{    // Ensure that the output directory is set and not already there    Path outDir = getOutputPath(job);    if (outDir == null) {      throw new InvalidJobConfException("Output directory not set.");    }        // get delegation token for outDir's file system    TokenCache.obtainTokensForNamenodes(job.getCredentials(),                                         new Path[] {outDir},                                         job.getConfiguration());    if (outDir.getFileSystem(job.getConfiguration()).exists(outDir)) {      throw new FileAlreadyExistsException("Output directory " + outDir +                                            " already exists");    }  }  /**   * Set the {@link Path} of the output directory for the map-reduce job.   *   * @param job The job to modify   * @param outputDir the {@link Path} of the output directory for    * the map-reduce job.   */  public static void setOutputPath(Job job, Path outputDir) {    job.getConfiguration().set("mapred.output.dir", outputDir.toString());  }  /**   * Get the {@link Path} to the output directory for the map-reduce job.   *    * @return the {@link Path} to the output directory for the map-reduce job.   * @see FileOutputFormat#getWorkOutputPath(TaskInputOutputContext)   */  public static Path getOutputPath(JobContext job) {    String name = job.getConfiguration().get("mapred.output.dir");    return name == null ? null: new Path(name);  }    /**   *  Get the {@link Path} to the task's temporary output directory    *  for the map-reduce job   *     * <h4 id="SideEffectFiles">Tasks' Side-Effect Files</h4>   *    * <p>Some applications need to create/write-to side-files, which differ from   * the actual job-outputs.   *    * <p>In such cases there could be issues with 2 instances of the same TIP    * (running simultaneously e.g. speculative tasks) trying to open/write-to the   * same file (path) on HDFS. Hence the application-writer will have to pick    * unique names per task-attempt (e.g. using the attemptid, say    * <tt>attempt_200709221812_0001_m_000000_0</tt>), not just per TIP.</p>    *    * <p>To get around this the Map-Reduce framework helps the application-writer    * out by maintaining a special    * <tt>${mapred.output.dir}/_temporary/_${taskid}</tt>    * sub-directory for each task-attempt on HDFS where the output of the    * task-attempt goes. On successful completion of the task-attempt the files    * in the <tt>${mapred.output.dir}/_temporary/_${taskid}</tt> (only)    * are <i>promoted</i> to <tt>${mapred.output.dir}</tt>. Of course, the    * framework discards the sub-directory of unsuccessful task-attempts. 
This    * is completely transparent to the application.</p>   *    * <p>The application-writer can take advantage of this by creating any    * side-files required in a work directory during execution    * of his task i.e. via    * {@link #getWorkOutputPath(TaskInputOutputContext)}, and   * the framework will move them out similarly - thus she doesn't have to pick    * unique paths per task-attempt.</p>   *    * <p>The entire discussion holds true for maps of jobs with    * reducer=NONE (i.e. 0 reduces) since output of the map, in that case,    * goes directly to HDFS.</p>    *    * @return the {@link Path} to the task's temporary output directory    * for the map-reduce job.   */  public static Path getWorkOutputPath(TaskInputOutputContext<?,?,?,?> context                                       ) throws IOException,                                                 InterruptedException {    FileOutputCommitter committer = (FileOutputCommitter)       context.getOutputCommitter();    return committer.getWorkPath();  }  /**   * Helper function to generate a {@link Path} for a file that is unique for   * the task within the job output directory.   *   * <p>The path can be used to create custom files from within the map and   * reduce tasks. The path name will be unique for each task. The path parent   * will be the job output directory.</p>ls   *   * <p>This method uses the {@link #getUniqueFile} method to make the file name   * unique for the task.</p>   *   * @param context the context for the task.   * @param name the name for the file.   * @param extension the extension for the file   * @return a unique path accross all tasks of the job.   */  public   static Path getPathForWorkFile(TaskInputOutputContext<?,?,?,?> context,                                  String name,                                 String extension                                ) throws IOException, InterruptedException {    return new Path(getWorkOutputPath(context),                    getUniqueFile(context, name, extension));  }  /**   * Generate a unique filename, based on the task id, name, and extension   * @param context the task that is calling this   * @param name the base filename   * @param extension the filename extension   * @return a string like $name-[mr]-$id$extension   */  public synchronized static String getUniqueFile(TaskAttemptContext context,                                                  String name,                                                  String extension) {    TaskID taskId = context.getTaskAttemptID().getTaskID();    int partition = taskId.getId();    StringBuilder result = new StringBuilder();    result.append(name);    result.append('-');    result.append(taskId.isMap() ? 'm' : 'r');    result.append('-');    result.append(NUMBER_FORMAT.format(partition));    result.append(extension);    return result.toString();  }  /**   * Get the default path and filename for the output format.   * @param context the task context   * @param extension an extension to add to the filename   * @return a full path $output/_temporary/$taskid/part-[mr]-$id   * @throws IOException   */  public Path getDefaultWorkFile(TaskAttemptContext context,                                 String extension) throws IOException{    FileOutputCommitter committer =       (FileOutputCommitter) getOutputCommitter(context);    return new Path(committer.getWorkPath(), getUniqueFile(context,         getOutputName(context), extension));  }    /**   * Get the base output name for the output file.   
*/  protected static String getOutputName(JobContext job) {    return job.getConfiguration().get(BASE_OUTPUT_NAME, PART);  }  /**   * Set the base output name for output file to be created.   */  protected static void setOutputName(JobContext job, String name) {    job.getConfiguration().set(BASE_OUTPUT_NAME, name);  }  public synchronized      OutputCommitter getOutputCommitter(TaskAttemptContext context                                        ) throws IOException {    if (committer == null) {      Path output = getOutputPath(context);      committer = new FileOutputCommitter(output, context);    }    return committer;  }}
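  • One useful detail from getUniqueFile() above: output files are named name-[mr]-id, where name defaults to "part" (BASE_OUTPUT_NAME), m or r marks a map or reduce task, and the id is padded to five digits by NUMBER_FORMAT, which is where the familiar part-r-00000 comes from.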
  • Here is the TextOutputFormat class itself, which is much simpler:
  • public class TextOutputFormat<K, V> extends FileOutputFormat<K, V> {  protected static class LineRecordWriter<K, V>    extends RecordWriter<K, V> {    private static final String utf8 = "UTF-8";    private static final byte[] newline;    static {      try {        newline = "\n".getBytes(utf8);      } catch (UnsupportedEncodingException uee) {        throw new IllegalArgumentException("can't find " + utf8 + " encoding");      }    }    protected DataOutputStream out;    private final byte[] keyValueSeparator;    public LineRecordWriter(DataOutputStream out, String keyValueSeparator) {      this.out = out;      try {        this.keyValueSeparator = keyValueSeparator.getBytes(utf8);      } catch (UnsupportedEncodingException uee) {        throw new IllegalArgumentException("can't find " + utf8 + " encoding");      }    }    public LineRecordWriter(DataOutputStream out) {      this(out, "\t");    }    /**     * Write the object to the byte stream, handling Text as a special     * case.     * @param o the object to print     * @throws IOException if the write throws, we pass it on     */    private void writeObject(Object o) throws IOException {      if (o instanceof Text) {        Text to = (Text) o;        out.write(to.getBytes(), 0, to.getLength());      } else {        out.write(o.toString().getBytes(utf8));      }    }    public synchronized void write(K key, V value)      throws IOException {      boolean nullKey = key == null || key instanceof NullWritable;      boolean nullValue = value == null || value instanceof NullWritable;      if (nullKey && nullValue) {        return;      }      if (!nullKey) {        writeObject(key);      }      if (!(nullKey || nullValue)) {        out.write(keyValueSeparator);      }      if (!nullValue) {        writeObject(value);      }      out.write(newline);    }    public synchronized     void close(TaskAttemptContext context) throws IOException {      out.close();    }  }  public RecordWriter<K, V>          getRecordWriter(TaskAttemptContext job                         ) throws IOException, InterruptedException {    Configuration conf = job.getConfiguration();    boolean isCompressed = getCompressOutput(job);    String keyValueSeparator= conf.get("mapred.textoutputformat.separator",                                       "\t");    CompressionCodec codec = null;    String extension = "";    if (isCompressed) {      Class<? extends CompressionCodec> codecClass =         getOutputCompressorClass(job, GzipCodec.class);      codec = (CompressionCodec) ReflectionUtils.newInstance(codecClass, conf);      extension = codec.getDefaultExtension();    }    Path file = getDefaultWorkFile(job, extension);    FileSystem fs = file.getFileSystem(conf);    if (!isCompressed) {      FSDataOutputStream fileOut = fs.create(file, false);      return new LineRecordWriter<K, V>(fileOut, keyValueSeparator);    } else {      FSDataOutputStream fileOut = fs.create(file, false);      return new LineRecordWriter<K, V>(new DataOutputStream                                        (codec.createOutputStream(fileOut)),                                        keyValueSeparator);    }  }}
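  • Also visible in getRecordWriter() above: the key/value separator is read from mapred.textoutputformat.separator and defaults to a tab, so setting that property in the job configuration (to ",", for example) changes the delimiter on every output line.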
  • Two further classes on the output path are worth a look: LineRecordWriter (used by TextOutputFormat) and FileOutputCommitter (used by FileOutputFormat).
  • LineRecordWriter appears as an inner class of TextOutputFormat, so it is already included above; it is listed again here for completeness.
protected static class LineRecordWriter<K, V>
  extends RecordWriter<K, V> {
  private static final String utf8 = "UTF-8";
  private static final byte[] newline;
  static {
    try {
      newline = "\n".getBytes(utf8);
    } catch (UnsupportedEncodingException uee) {
      throw new IllegalArgumentException("can't find " + utf8 + " encoding");
    }
  }

  protected DataOutputStream out;
  private final byte[] keyValueSeparator;

  public LineRecordWriter(DataOutputStream out, String keyValueSeparator) {
    this.out = out;
    try {
      this.keyValueSeparator = keyValueSeparator.getBytes(utf8);
    } catch (UnsupportedEncodingException uee) {
      throw new IllegalArgumentException("can't find " + utf8 + " encoding");
    }
  }

  public LineRecordWriter(DataOutputStream out) {
    this(out, "\t");
  }

  /**
   * Write the object to the byte stream, handling Text as a special
   * case.
   * @param o the object to print
   * @throws IOException if the write throws, we pass it on
   */
  private void writeObject(Object o) throws IOException {
    if (o instanceof Text) {
      Text to = (Text) o;
      out.write(to.getBytes(), 0, to.getLength());
    } else {
      out.write(o.toString().getBytes(utf8));
    }
  }

  public synchronized void write(K key, V value)
    throws IOException {
    boolean nullKey = key == null || key instanceof NullWritable;
    boolean nullValue = value == null || value instanceof NullWritable;
    if (nullKey && nullValue) {
      return;
    }
    if (!nullKey) {
      writeObject(key);
    }
    if (!(nullKey || nullValue)) {
      out.write(keyValueSeparator);
    }
    if (!nullValue) {
      writeObject(value);
    }
    out.write(newline);
  }

  public synchronized
  void close(TaskAttemptContext context) throws IOException {
    out.close();
  }
}
  • The FileOutputCommitter class handles assorted post-processing: creating the temporary work directories, moving and renaming task output into the final output directory, marking the job as successful, cleaning up, and so on.
  • public class FileOutputCommitter extends OutputCommitter {  private static final Log LOG = LogFactory.getLog(FileOutputCommitter.class);  /**   * Temporary directory name    */  protected static final String TEMP_DIR_NAME = "_temporary";  public static final String SUCCEEDED_FILE_NAME = "_SUCCESS";  static final String SUCCESSFUL_JOB_OUTPUT_DIR_MARKER =    "mapreduce.fileoutputcommitter.marksuccessfuljobs";  private FileSystem outputFileSystem = null;  private Path outputPath = null;  private Path workPath = null;  /**   * Create a file output committer   * @param outputPath the job's output path   * @param context the task's context   * @throws IOException   */  public FileOutputCommitter(Path outputPath,                              TaskAttemptContext context) throws IOException {    if (outputPath != null) {      this.outputPath = outputPath;      outputFileSystem = outputPath.getFileSystem(context.getConfiguration());      workPath = new Path(outputPath,                          (FileOutputCommitter.TEMP_DIR_NAME + Path.SEPARATOR +                           "_" + context.getTaskAttemptID().toString()                           )).makeQualified(outputFileSystem);    }  }  /**   * Create the temporary directory that is the root of all of the task    * work directories.   * @param context the job's context   */  public void setupJob(JobContext context) throws IOException {    if (outputPath != null) {      Path tmpDir = new Path(outputPath, FileOutputCommitter.TEMP_DIR_NAME);      FileSystem fileSys = tmpDir.getFileSystem(context.getConfiguration());      if (!fileSys.mkdirs(tmpDir)) {        LOG.error("Mkdirs failed to create " + tmpDir.toString());      }    }  }  private static boolean shouldMarkOutputDir(Configuration conf) {    return conf.getBoolean(SUCCESSFUL_JOB_OUTPUT_DIR_MARKER,                            true);  }  // Mark the output dir of the job for which the context is passed.  private void markOutputDirSuccessful(JobContext context)  throws IOException {    if (outputPath != null) {      FileSystem fileSys = outputPath.getFileSystem(context.getConfiguration());      if (fileSys.exists(outputPath)) {        // create a file in the folder to mark it        Path filePath = new Path(outputPath, SUCCEEDED_FILE_NAME);        fileSys.create(filePath).close();      }    }  }  /**   * Delete the temporary directory, including all of the work directories.   * This is called for all jobs whose final run state is SUCCEEDED   * @param context the job's context.   */  public void commitJob(JobContext context) throws IOException {    // delete the _temporary folder    cleanupJob(context);    // check if the o/p dir should be marked    if (shouldMarkOutputDir(context.getConfiguration())) {      // create a _success file in the o/p folder      markOutputDirSuccessful(context);    }  }  @Override  @Deprecated  public void cleanupJob(JobContext context) throws IOException {    if (outputPath != null) {      Path tmpDir = new Path(outputPath, FileOutputCommitter.TEMP_DIR_NAME);      FileSystem fileSys = tmpDir.getFileSystem(context.getConfiguration());      if (fileSys.exists(tmpDir)) {        fileSys.delete(tmpDir, true);      }    } else {      LOG.warn("Output path is null in cleanup");    }  }  /**   * Delete the temporary directory, including all of the work directories.   
* @param context the job's context   * @param state final run state of the job, should be FAILED or KILLED   */  @Override  public void abortJob(JobContext context, JobStatus.State state)  throws IOException {    cleanupJob(context);  }    /**   * No task setup required.   */  @Override  public void setupTask(TaskAttemptContext context) throws IOException {    // FileOutputCommitter's setupTask doesn't do anything. Because the    // temporary task directory is created on demand when the     // task is writing.  }  /**   * Move the files from the work directory to the job output directory   * @param context the task context   */  public void commitTask(TaskAttemptContext context)   throws IOException {    TaskAttemptID attemptId = context.getTaskAttemptID();    if (workPath != null) {      context.progress();      if (outputFileSystem.exists(workPath)) {        // Move the task outputs to their final place        moveTaskOutputs(context, outputFileSystem, outputPath, workPath);        // Delete the temporary task-specific output directory        if (!outputFileSystem.delete(workPath, true)) {          LOG.warn("Failed to delete the temporary output" +           " directory of task: " + attemptId + " - " + workPath);        }        LOG.info("Saved output of task '" + attemptId + "' to " +                  outputPath);      }    }  }  /**   * Move all of the files from the work directory to the final output   * @param context the task context   * @param fs the output file system   * @param jobOutputDir the final output direcotry   * @param taskOutput the work path   * @throws IOException   */  private void moveTaskOutputs(TaskAttemptContext context,                               FileSystem fs,                               Path jobOutputDir,                               Path taskOutput)   throws IOException {    TaskAttemptID attemptId = context.getTaskAttemptID();    context.progress();    if (fs.isFile(taskOutput)) {      Path finalOutputPath = getFinalPath(jobOutputDir, taskOutput,                                           workPath);      if (!fs.rename(taskOutput, finalOutputPath)) {        if (!fs.delete(finalOutputPath, true)) {          throw new IOException("Failed to delete earlier output of task: " +                                  attemptId);        }        if (!fs.rename(taskOutput, finalOutputPath)) {          throw new IOException("Failed to save output of task: " +           attemptId);        }      }      LOG.debug("Moved " + taskOutput + " to " + finalOutputPath);    } else if(fs.getFileStatus(taskOutput).isDir()) {      FileStatus[] paths = fs.listStatus(taskOutput);      Path finalOutputPath = getFinalPath(jobOutputDir, taskOutput, workPath);      fs.mkdirs(finalOutputPath);      if (paths != null) {        for (FileStatus path : paths) {          moveTaskOutputs(context, fs, jobOutputDir, path.getPath());        }      }    }  }  /**   * Delete the work directory   */  @Override  public void abortTask(TaskAttemptContext context) {    try {      if (workPath != null) {         context.progress();        outputFileSystem.delete(workPath, true);      }    } catch (IOException ie) {      LOG.warn("Error discarding output" + StringUtils.stringifyException(ie));    }  }  /**   * Find the final name of a given output file, given the job output directory   * and the work directory.   
* @param jobOutputDir the job's output directory   * @param taskOutput the specific task output file   * @param taskOutputPath the job's work directory   * @return the final path for the specific output file   * @throws IOException   */  private Path getFinalPath(Path jobOutputDir, Path taskOutput,                             Path taskOutputPath) throws IOException {    URI taskOutputUri = taskOutput.toUri();    URI relativePath = taskOutputPath.toUri().relativize(taskOutputUri);    if (taskOutputUri == relativePath) {      throw new IOException("Can not get the relative path: base = " +           taskOutputPath + " child = " + taskOutput);    }    if (relativePath.getPath().length() > 0) {      return new Path(jobOutputDir, relativePath.getPath());    } else {      return jobOutputDir;    }  }  /**   * Did this task write any files in the work directory?   * @param context the task's context   */  @Override  public boolean needsTaskCommit(TaskAttemptContext context                                 ) throws IOException {    return workPath != null && outputFileSystem.exists(workPath);  }  /**   * Get the directory that the task should write results into   * @return the work directory   * @throws IOException   */  public Path getWorkPath() throws IOException {    return workPath;  }}
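  • Concretely (the paths are illustrative): a reduce task attempt writes to ${mapred.output.dir}/_temporary/_attempt_..._r_000000_0/part-r-00000; commitTask() renames that file into ${mapred.output.dir} and deletes the attempt directory, and commitJob() removes _temporary and, with the default configuration, creates an empty _SUCCESS marker in the output directory.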
  • This is what writing WordCount really involves; many write-ups online show only the driver class, which gives a rather lopsided picture. Below is WordCount's driver class, the entry point at run time.
public class WordCount {

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
      System.err.println("Usage: wordcount <in> <out>");
      System.exit(2);
    }
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setInputFormatClass(TextInputFormat.class);
    TextInputFormat.addInputPath(job, new Path(otherArgs[0]));
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);
    job.setGroupingComparatorClass(Text.Comparator.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    TextOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
  • The command to run it is as follows:
  • hadoop --config <configuration directory> jar <path to jar> WordCount in out
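  • For example, if the in directory holds a single file containing the line "hello world hello", the out directory ends up with an empty _SUCCESS marker and a part-r-00000 file whose tab-separated contents are "hello 2" and "world 1".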
  • End of the complete WordCount walkthrough.