Hadoop MapReduce on Massive Numbers of Small Files: Using CombineFileInputFormat
If you choose CombineFileInputFormat for processing huge numbers of small files with Hadoop, and it is your first time using it, you may find it confusing: the idea behind the approach is easy to grasp, but various problems come up in practice.
To use CombineFileInputFormat as the input specification for map tasks, you first need to implement a custom RecordReader.
The rough idea of CombineFileInputFormat is that it packs the metadata of multiple input files (the small files) into a single CombineFileSplit. Since each small file occupies a single HDFS block (one file, one block), a CombineFileSplit holds a group of file blocks together with per-file metadata: start offset, length, and block locations. To process a CombineFileSplit, you process each file it contains in turn; note that it does not carry a per-file InputSplit, so when you want to read one small-file block you must construct a FileSplit for it yourself.
When the MapReduce job runs, it must read records from the files (text lines in the simple case, though other formats are possible). For a CombineFileSplit, each contained small-file block needs a matching RecordReader to read its content correctly. Typically the small files all share one format, so it suffices to implement a RecordReader for the CombineFileSplit that internally delegates to another RecordReader for the individual small-file blocks; that way, all the files packed into the CombineFileSplit get read.
Implementation
Given the description above, what we need to build on top of Hadoop's built-in CombineFileInputFormat is clear:
- a RecordReader that reads the file blocks wrapped in a CombineFileSplit
- an input format class extending CombineFileInputFormat that uses our custom RecordReader
- a Mapper implementation that processes the data
- a MapReduce job configuration for the small-file workload
Each step is described in detail below.
- The CombineSmallfileRecordReader class
We implement a RecordReader for CombineFileSplit that internally uses Hadoop's own LineRecordReader to read the text lines of the small files:
package org.shirdrn.kodz.inaction.hadoop.smallfiles.combine;

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

public class CombineSmallfileRecordReader extends RecordReader<LongWritable, BytesWritable> {

    private CombineFileSplit combineFileSplit;
    private LineRecordReader lineRecordReader = new LineRecordReader();
    private Path[] paths;
    private int totalLength;
    private int currentIndex;
    private float currentProgress = 0;
    private LongWritable currentKey;
    private BytesWritable currentValue = new BytesWritable();

    public CombineSmallfileRecordReader(CombineFileSplit combineFileSplit, TaskAttemptContext context, Integer index) throws IOException {
        super();
        this.combineFileSplit = combineFileSplit;
        this.currentIndex = index; // index of the small-file block to process within the CombineFileSplit
    }

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException {
        this.combineFileSplit = (CombineFileSplit) split;
        // To read one small-file block of the CombineFileSplit with LineRecordReader,
        // we must wrap it in a FileSplit first
        FileSplit fileSplit = new FileSplit(combineFileSplit.getPath(currentIndex), combineFileSplit.getOffset(currentIndex), combineFileSplit.getLength(currentIndex), combineFileSplit.getLocations());
        lineRecordReader.initialize(fileSplit, context);

        this.paths = combineFileSplit.getPaths();
        totalLength = paths.length;
        context.getConfiguration().set("map.input.file.name", combineFileSplit.getPath(currentIndex).getName());
    }

    @Override
    public LongWritable getCurrentKey() throws IOException, InterruptedException {
        currentKey = lineRecordReader.getCurrentKey();
        return currentKey;
    }

    @Override
    public BytesWritable getCurrentValue() throws IOException, InterruptedException {
        byte[] content = lineRecordReader.getCurrentValue().getBytes();
        currentValue.set(content, 0, content.length);
        return currentValue;
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        if (currentIndex >= 0 && currentIndex < totalLength) {
            return lineRecordReader.nextKeyValue();
        } else {
            return false;
        }
    }

    @Override
    public float getProgress() throws IOException {
        if (currentIndex >= 0 && currentIndex < totalLength) {
            currentProgress = (float) currentIndex / totalLength;
            return currentProgress;
        }
        return currentProgress;
    }

    @Override
    public void close() throws IOException {
        lineRecordReader.close();
    }
}
If your small files come in different formats, you need to choose a different inner RecordReader for each file type; that dispatch logic also belongs in the class above.
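One simple way to do such per-format dispatch is to choose the inner reader by file extension. The sketch below is a hypothetical, Hadoop-free illustration of that selection logic; the class name and the extension-to-reader mapping are assumptions made for this example, not part of the code above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical dispatch: pick an inner reader per small-file format.
class ReaderDispatch {

    private static final Map<String, String> READER_BY_EXT = new HashMap<String, String>();
    static {
        // The mapping below is illustrative only.
        READER_BY_EXT.put("txt", "LineRecordReader");
        READER_BY_EXT.put("log", "LineRecordReader");
        READER_BY_EXT.put("seq", "SequenceFileRecordReader");
    }

    // Return the reader to delegate to for a given file path;
    // fall back to line-oriented reading for unknown extensions.
    static String readerFor(String path) {
        int dot = path.lastIndexOf('.');
        String ext = (dot < 0) ? "" : path.substring(dot + 1).toLowerCase();
        String reader = READER_BY_EXT.get(ext);
        return (reader != null) ? reader : "LineRecordReader";
    }
}
```

In the real class, `readerFor` would return (or construct) an actual RecordReader instance instead of a name, and `initialize` would delegate to it exactly as it delegates to LineRecordReader above.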
- The CombineSmallfileInputFormat class
With a RecordReader for CombineFileSplit in hand, we need to plug our CombineSmallfileRecordReader into a CombineFileInputFormat. We do this by implementing a CombineFileInputFormat subclass and overriding its createRecordReader method. Our CombineSmallfileInputFormat looks like this:
package org.shirdrn.kodz.inaction.hadoop.smallfiles.combine;

import java.io.IOException;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;

public class CombineSmallfileInputFormat extends CombineFileInputFormat<LongWritable, BytesWritable> {

    @Override
    public RecordReader<LongWritable, BytesWritable> createRecordReader(InputSplit split, TaskAttemptContext context) throws IOException {

        CombineFileSplit combineFileSplit = (CombineFileSplit) split;
        CombineFileRecordReader<LongWritable, BytesWritable> recordReader = new CombineFileRecordReader<LongWritable, BytesWritable>(combineFileSplit, context, CombineSmallfileRecordReader.class);
        try {
            recordReader.initialize(combineFileSplit, context);
        } catch (InterruptedException e) {
            throw new RuntimeException("Failed to initialize CombineSmallfileRecordReader.", e);
        }
        return recordReader;
    }

}
The important point here is that the RecordReader must be created via CombineFileRecordReader, and its constructor takes three arguments of exactly these types, in exactly this order: first a CombineFileSplit, second a TaskAttemptContext, and third a Class&lt;? extends RecordReader&gt;.
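Behind the scenes, CombineFileRecordReader uses that Class object to instantiate your reader reflectively, once per file in the split, invoking a constructor of the form (CombineFileSplit, TaskAttemptContext, Integer) with the file's index as the last argument; this is why CombineSmallfileRecordReader above declares exactly that constructor. A Hadoop-free sketch of the mechanism (Split, Ctx, and all class names here are stand-ins invented so the example runs on its own):

```java
import java.lang.reflect.Constructor;

class ReflectiveReaderDemo {

    // Stand-ins for CombineFileSplit and TaskAttemptContext.
    static class Split { }
    static class Ctx { }

    // Plays the role of the custom per-file reader; note the
    // (Split, Ctx, Integer) constructor, mirroring the required
    // (CombineFileSplit, TaskAttemptContext, Integer) signature.
    static class PerFileReader {
        final int index;
        PerFileReader(Split split, Ctx ctx, Integer index) {
            this.index = index;
        }
    }

    // Roughly what CombineFileRecordReader does for the i-th file:
    // look up the three-argument constructor and invoke it reflectively.
    static PerFileReader create(Split s, Ctx c, int i) {
        try {
            Constructor<PerFileReader> ctor = PerFileReader.class
                    .getDeclaredConstructor(Split.class, Ctx.class, Integer.class);
            return ctor.newInstance(s, c, Integer.valueOf(i));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

If your reader lacks a constructor with that exact signature, the reflective lookup fails at runtime, which is the most common pitfall when wiring up CombineFileInputFormat.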
- The CombineSmallfileMapper class
Next, the Mapper implementation, CombineSmallfileMapper:
package org.shirdrn.kodz.inaction.hadoop.smallfiles.combine;

import java.io.IOException;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CombineSmallfileMapper extends Mapper<LongWritable, BytesWritable, Text, BytesWritable> {

    private Text file = new Text();

    @Override
    protected void map(LongWritable key, BytesWritable value, Context context) throws IOException, InterruptedException {
        String fileName = context.getConfiguration().get("map.input.file.name");
        file.set(fileName);
        context.write(file, value);
    }

}
It is simple: each input line is emitted as a key-value pair, keyed by the name of the file it came from.
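Because every line is keyed by its source file name, the shuffle delivers one reduce group per small file, which is why the job counters later report Reduce input groups=117 for 117 input files. Here is a minimal, Hadoop-free simulation of that grouping (the class name and data are made up for illustration):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ShuffleSimulation {

    // Group (fileName, line) pairs by file name, as the MapReduce
    // shuffle groups the mapper's (Text, BytesWritable) output.
    static Map<String, List<String>> shuffle(String[][] mapOutput) {
        Map<String, List<String>> groups = new LinkedHashMap<String, List<String>>();
        for (String[] kv : mapOutput) {
            List<String> lines = groups.get(kv[0]);
            if (lines == null) {
                lines = new ArrayList<String>();
                groups.put(kv[0], lines);
            }
            lines.add(kv[1]);
        }
        return groups;
    }
}
```

Each resulting group (one per file) is then handed to a reducer; with an identity reducer, as used below, the lines simply pass through to the output unchanged.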
- The CombineSmallfiles class
Finally, the main entry-point class, which configures the MapReduce job using the classes implemented above:
package org.shirdrn.kodz.inaction.hadoop.smallfiles.combine;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.shirdrn.kodz.inaction.hadoop.smallfiles.IdentityReducer;

public class CombineSmallfiles {

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {

        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: combinesmallfiles <in> <out>");
            System.exit(2);
        }

        conf.setInt("mapred.min.split.size", 1);
        conf.setLong("mapred.max.split.size", 26214400); // 25 MB

        conf.setInt("mapred.reduce.tasks", 5);

        Job job = new Job(conf, "combine smallfiles");
        job.setJarByClass(CombineSmallfiles.class);
        job.setMapperClass(CombineSmallfileMapper.class);
        job.setReducerClass(IdentityReducer.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(BytesWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(BytesWritable.class);

        job.setInputFormatClass(CombineSmallfileInputFormat.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));

        int exitFlag = job.waitForCompletion(true) ? 0 : 1;
        System.exit(exitFlag);

    }

}
Running the Program
Below is the result of running the job, which merges the small files into a form that Hadoop MapReduce can compute over much more efficiently.
- Preparation
jar -cvf combine-smallfiles.jar -C ./ org/shirdrn/kodz/inaction/hadoop/smallfiles
xiaoxiang@ubuntu3:~$ cd /opt/comodo/cloud/hadoop-1.0.3
xiaoxiang@ubuntu3:/opt/comodo/cloud/hadoop-1.0.3$ bin/hadoop fs -mkdir /user/xiaoxiang/datasets/smallfiles
xiaoxiang@ubuntu3:/opt/comodo/cloud/hadoop-1.0.3$ bin/hadoop fs -copyFromLocal /opt/comodo/cloud/dataset/smallfiles/* /user/xiaoxiang/datasets/smallfiles
- Run output
xiaoxiang@ubuntu3:/opt/comodo/cloud/hadoop-1.0.3$ bin/hadoop jar combine-smallfiles.jar org.shirdrn.kodz.inaction.hadoop.smallfiles.combine.CombineSmallfiles /user/xiaoxiang/datasets/smallfiles /user/xiaoxiang/output/smallfiles/combine
13/03/23 21:52:09 INFO input.FileInputFormat: Total input paths to process : 117
13/03/23 21:52:09 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/03/23 21:52:09 WARN snappy.LoadSnappy: Snappy native library not loaded
13/03/23 21:52:10 INFO mapred.JobClient: Running job: job_201303111631_0038
13/03/23 21:52:11 INFO mapred.JobClient: map 0% reduce 0%
13/03/23 21:52:29 INFO mapred.JobClient: map 33% reduce 0%
13/03/23 21:52:32 INFO mapred.JobClient: map 55% reduce 0%
13/03/23 21:52:35 INFO mapred.JobClient: map 76% reduce 0%
13/03/23 21:52:38 INFO mapred.JobClient: map 99% reduce 0%
13/03/23 21:52:41 INFO mapred.JobClient: map 100% reduce 0%
13/03/23 21:53:02 INFO mapred.JobClient: map 100% reduce 20%
13/03/23 21:53:05 INFO mapred.JobClient: map 100% reduce 40%
13/03/23 21:53:14 INFO mapred.JobClient: map 100% reduce 60%
13/03/23 21:53:17 INFO mapred.JobClient: map 100% reduce 80%
13/03/23 21:53:32 INFO mapred.JobClient: map 100% reduce 100%
13/03/23 21:53:37 INFO mapred.JobClient: Job complete: job_201303111631_0038
13/03/23 21:53:37 INFO mapred.JobClient: Counters: 28
13/03/23 21:53:37 INFO mapred.JobClient:   Job Counters
13/03/23 21:53:37 INFO mapred.JobClient:     Launched reduce tasks=5
13/03/23 21:53:37 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=33515
13/03/23 21:53:37 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
13/03/23 21:53:37 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
13/03/23 21:53:37 INFO mapred.JobClient:     Launched map tasks=1
13/03/23 21:53:37 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=69085
13/03/23 21:53:37 INFO mapred.JobClient:   File Output Format Counters
13/03/23 21:53:37 INFO mapred.JobClient:     Bytes Written=237510415
13/03/23 21:53:37 INFO mapred.JobClient:   FileSystemCounters
13/03/23 21:53:37 INFO mapred.JobClient:     FILE_BYTES_READ=508266867
13/03/23 21:53:37 INFO mapred.JobClient:     HDFS_BYTES_READ=147037765
13/03/23 21:53:37 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=722417364
13/03/23 21:53:37 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=237510415
13/03/23 21:53:37 INFO mapred.JobClient:   File Input Format Counters
13/03/23 21:53:37 INFO mapred.JobClient:     Bytes Read=0
13/03/23 21:53:37 INFO mapred.JobClient:   Map-Reduce Framework
13/03/23 21:53:37 INFO mapred.JobClient:     Map output materialized bytes=214110010
13/03/23 21:53:37 INFO mapred.JobClient:     Map input records=3510000
13/03/23 21:53:37 INFO mapred.JobClient:     Reduce shuffle bytes=0
13/03/23 21:53:37 INFO mapred.JobClient:     Spilled Records=11840717
13/03/23 21:53:37 INFO mapred.JobClient:     Map output bytes=207089980
13/03/23 21:53:37 INFO mapred.JobClient:     CPU time spent (ms)=64200
13/03/23 21:53:37 INFO mapred.JobClient:     Total committed heap usage (bytes)=722665472
13/03/23 21:53:37 INFO mapred.JobClient:     Combine input records=0
13/03/23 21:53:37 INFO mapred.JobClient:     SPLIT_RAW_BYTES=7914
13/03/23 21:53:37 INFO mapred.JobClient:     Reduce input records=3510000
13/03/23 21:53:37 INFO mapred.JobClient:     Reduce input groups=117
13/03/23 21:53:37 INFO mapred.JobClient:     Combine output records=0
13/03/23 21:53:37 INFO mapred.JobClient:     Physical memory (bytes) snapshot=820969472
13/03/23 21:53:37 INFO mapred.JobClient:     Reduce output records=3510000
13/03/23 21:53:37 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=3257425920
13/03/23 21:53:37 INFO mapred.JobClient:     Map output records=3510000
- Verifying the results
xiaoxiang@ubuntu3:/opt/comodo/cloud/hadoop-1.0.3$ bin/hadoop fs -text /user/xiaoxiang/output/smallfiles/combine/part-r-00000 | head -5
data_50000_000 44 4a 20 32 31 34 34 30 30 39 39 38 37 32 31 36 20 32 31 34 34 30 31 30 30 30 32 30 39 37 20 32 32 31 34 35 32 31 34 35
data_50000_000 44 45 20 32 31 34 34 30 30 39 39 38 37 37 33 32 20 32 31 34 34 30 31 30 30 30 31 32 34 31 20 31 38 32 34 39 37 32 37 34
data_50000_000 42 57 20 32 31 34 34 30 30 39 39 36 39 36 33 30 20 32 31 34 34 30 31 30 30 30 30 33 38 35 20 39 34 35 38 34 39 39 31 37
data_50000_000 50 59 20 32 31 34 34 30 30 39 39 37 37 34 35 34 20 32 31 34 34 30 30 39 39 39 39 35 32 39 20 34 38 37 33 32 33 34 39 37
data_50000_000 4d 4c 20 32 31 34 34 30 30 39 39 37 33 35 35 36 20 32 31 34 34 30 30 39 39 39 38 36 37 33 20 36 33 30 38 36 32 34 36 31
xiaoxiang@ubuntu3:/opt/comodo/cloud/hadoop-1.0.3$ bin/hadoop fs -text /user/xiaoxiang/output/smallfiles/combine/part-r-00000 | tail -5
data_50000_230 43 52 20 32 31 34 38 36 36 38 31 36 38 36 38 36 20 32 31 34 38 36 36 38 31 39 35 30 36 38 20 36 39 35 39 38 38 34 30 33
data_50000_230 50 52 20 32 31 34 38 36 36 38 31 36 35 36 34 36 20 32 31 34 38 36 36 38 31 39 34 36 34 30 20 38 34 30 36 35 31 39 38 38
data_50000_230 53 52 20 32 31 34 38 36 36 38 31 36 36 34 38 37 20 32 31 34 38 36 36 38 31 39 34 36 34 30 20 37 39 32 35 36 38 32 38 30
data_50000_230 4d 43 20 32 31 34 38 36 36 38 31 36 39 32 34 32 20 32 31 34 38 36 36 38 31 39 34 32 31 31 20 36 32 33 34 34 38 32 30 30
data_50000_230 4c 49 20 32 31 34 38 36 36 38 31 38 38 38 38 34 20 32 31 34 38 36 36 38 31 39 33 37 38 33 20 32 34 30 30 33 34 36 38 38
In the output files, the key is the source file name and the value is a line of text from that file.