Hadoop MapReduce on Massive Numbers of Small Files: Using CombineFileInputFormat
If you choose CombineFileInputFormat for processing huge numbers of small files with Hadoop, and it is your first time using it, you may find it confusing: the idea behind the approach is easy to grasp, but various problems come up in practice.
To use CombineFileInputFormat as the input specification for map tasks, you first need to implement a custom RecordReader.
The rough idea of CombineFileInputFormat is that it packs the metadata of multiple input files (the small files) into a single CombineFileSplit. Since each small file occupies a single HDFS block (one file, one block), a CombineFileSplit holds a group of file blocks together with per-file metadata: start offset, length, and block locations. To process a CombineFileSplit, you process each file it contains in turn; note that it does not carry a per-file InputSplit, so when you want to read one small-file block you must construct a FileSplit for it yourself.
When the MapReduce job runs, it must read records from the files (text lines in the simple case, though other formats are possible). For a CombineFileSplit, each contained small-file block needs a matching RecordReader to read its content correctly. Typically the small files all share one format, so it suffices to implement a RecordReader for the CombineFileSplit that internally delegates to another RecordReader for the individual small-file blocks; that way, all the files packed into the CombineFileSplit get read.
Implementation
Given the description above, what we need to build on top of Hadoop's built-in CombineFileInputFormat is clear:
- a RecordReader that reads the file blocks wrapped in a CombineFileSplit
- an input format class extending CombineFileInputFormat that uses our custom RecordReader
- a Mapper implementation that processes the data
- a MapReduce job configuration for the small-file workload
Each step is described in detail below.
- The CombineSmallfileRecordReader class
We implement a RecordReader for CombineFileSplit that internally uses Hadoop's own LineRecordReader to read the text lines of the small files:
package org.shirdrn.kodz.inaction.hadoop.smallfiles.combine;

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

public class CombineSmallfileRecordReader extends RecordReader<LongWritable, BytesWritable> {

    private CombineFileSplit combineFileSplit;
    private LineRecordReader lineRecordReader = new LineRecordReader();
    private Path[] paths;
    private int totalLength;
    private int currentIndex;
    private float currentProgress = 0;
    private LongWritable currentKey;
    private BytesWritable currentValue = new BytesWritable();

    public CombineSmallfileRecordReader(CombineFileSplit combineFileSplit, TaskAttemptContext context, Integer index) throws IOException {
        super();
        this.combineFileSplit = combineFileSplit;
        this.currentIndex = index; // index of the small-file block to process within the CombineFileSplit
    }

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException {
        this.combineFileSplit = (CombineFileSplit) split;
        // To read one small-file block of the CombineFileSplit with LineRecordReader,
        // we must wrap it in a FileSplit first
        FileSplit fileSplit = new FileSplit(combineFileSplit.getPath(currentIndex), combineFileSplit.getOffset(currentIndex), combineFileSplit.getLength(currentIndex), combineFileSplit.getLocations());
        lineRecordReader.initialize(fileSplit, context);

        this.paths = combineFileSplit.getPaths();
        totalLength = paths.length;
        context.getConfiguration().set("map.input.file.name", combineFileSplit.getPath(currentIndex).getName());
    }

    @Override
    public LongWritable getCurrentKey() throws IOException, InterruptedException {
        currentKey = lineRecordReader.getCurrentKey();
        return currentKey;
    }

    @Override
    public BytesWritable getCurrentValue() throws IOException, InterruptedException {
        byte[] content = lineRecordReader.getCurrentValue().getBytes();
        currentValue.set(content, 0, content.length);
        return currentValue;
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        if (currentIndex >= 0 && currentIndex < totalLength) {
            return lineRecordReader.nextKeyValue();
        } else {
            return false;
        }
    }

    @Override
    public float getProgress() throws IOException {
        if (currentIndex >= 0 && currentIndex < totalLength) {
            currentProgress = (float) currentIndex / totalLength;
            return currentProgress;
        }
        return currentProgress;
    }

    @Override
    public void close() throws IOException {
        lineRecordReader.close();
    }
}
If your small files come in different formats, you need to choose a different inner RecordReader for each file type; that dispatch logic also belongs in the class above.
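One simple way to do such per-format dispatch is to choose the inner reader by file extension. The sketch below is a hypothetical, Hadoop-free illustration of that selection logic; the class name and the extension-to-reader mapping are assumptions made for this example, not part of the code above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical dispatch: pick an inner reader per small-file format.
class ReaderDispatch {

    private static final Map<String, String> READER_BY_EXT = new HashMap<String, String>();
    static {
        // The mapping below is illustrative only.
        READER_BY_EXT.put("txt", "LineRecordReader");
        READER_BY_EXT.put("log", "LineRecordReader");
        READER_BY_EXT.put("seq", "SequenceFileRecordReader");
    }

    // Return the reader to delegate to for a given file path;
    // fall back to line-oriented reading for unknown extensions.
    static String readerFor(String path) {
        int dot = path.lastIndexOf('.');
        String ext = (dot < 0) ? "" : path.substring(dot + 1).toLowerCase();
        String reader = READER_BY_EXT.get(ext);
        return (reader != null) ? reader : "LineRecordReader";
    }
}
```

In the real class, `readerFor` would return (or construct) an actual RecordReader instance instead of a name, and `initialize` would delegate to it exactly as it delegates to LineRecordReader above.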
- The CombineSmallfileInputFormat class
With a RecordReader for CombineFileSplit in hand, we need to plug our CombineSmallfileRecordReader into a CombineFileInputFormat. We do this by implementing a CombineFileInputFormat subclass and overriding its createRecordReader method. Our CombineSmallfileInputFormat looks like this:
package org.shirdrn.kodz.inaction.hadoop.smallfiles.combine;

import java.io.IOException;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;

public class CombineSmallfileInputFormat extends CombineFileInputFormat<LongWritable, BytesWritable> {

    @Override
    public RecordReader<LongWritable, BytesWritable> createRecordReader(InputSplit split, TaskAttemptContext context) throws IOException {

        CombineFileSplit combineFileSplit = (CombineFileSplit) split;
        CombineFileRecordReader<LongWritable, BytesWritable> recordReader = new CombineFileRecordReader<LongWritable, BytesWritable>(combineFileSplit, context, CombineSmallfileRecordReader.class);
        try {
            recordReader.initialize(combineFileSplit, context);
        } catch (InterruptedException e) {
            throw new RuntimeException("Failed to initialize CombineSmallfileRecordReader.", e);
        }
        return recordReader;
    }

}
The important point here is that the RecordReader must be created via CombineFileRecordReader, and its constructor takes three arguments of exactly these types, in exactly this order: first a CombineFileSplit, second a TaskAttemptContext, and third a Class&lt;? extends RecordReader&gt;.
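Behind the scenes, CombineFileRecordReader uses that Class object to instantiate your reader reflectively, once per file in the split, invoking a constructor of the form (CombineFileSplit, TaskAttemptContext, Integer) with the file's index as the last argument; this is why CombineSmallfileRecordReader above declares exactly that constructor. A Hadoop-free sketch of the mechanism (Split, Ctx, and all class names here are stand-ins invented so the example runs on its own):

```java
import java.lang.reflect.Constructor;

class ReflectiveReaderDemo {

    // Stand-ins for CombineFileSplit and TaskAttemptContext.
    static class Split { }
    static class Ctx { }

    // Plays the role of the custom per-file reader; note the
    // (Split, Ctx, Integer) constructor, mirroring the required
    // (CombineFileSplit, TaskAttemptContext, Integer) signature.
    static class PerFileReader {
        final int index;
        PerFileReader(Split split, Ctx ctx, Integer index) {
            this.index = index;
        }
    }

    // Roughly what CombineFileRecordReader does for the i-th file:
    // look up the three-argument constructor and invoke it reflectively.
    static PerFileReader create(Split s, Ctx c, int i) {
        try {
            Constructor<PerFileReader> ctor = PerFileReader.class
                    .getDeclaredConstructor(Split.class, Ctx.class, Integer.class);
            return ctor.newInstance(s, c, Integer.valueOf(i));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

If your reader lacks a constructor with that exact signature, the reflective lookup fails at runtime, which is the most common pitfall when wiring up CombineFileInputFormat.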
- The CombineSmallfileMapper class
Next, the Mapper implementation, CombineSmallfileMapper:
package org.shirdrn.kodz.inaction.hadoop.smallfiles.combine;

import java.io.IOException;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CombineSmallfileMapper extends Mapper<LongWritable, BytesWritable, Text, BytesWritable> {

    private Text file = new Text();

    @Override
    protected void map(LongWritable key, BytesWritable value, Context context) throws IOException, InterruptedException {
        String fileName = context.getConfiguration().get("map.input.file.name");
        file.set(fileName);
        context.write(file, value);
    }

}
It is simple: each input line is emitted as a key-value pair, keyed by the name of the file it came from.
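Because every line is keyed by its source file name, the shuffle delivers one reduce group per small file, which is why the job counters later report Reduce input groups=117 for 117 input files. Here is a minimal, Hadoop-free simulation of that grouping (the class name and data are made up for illustration):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ShuffleSimulation {

    // Group (fileName, line) pairs by file name, as the MapReduce
    // shuffle groups the mapper's (Text, BytesWritable) output.
    static Map<String, List<String>> shuffle(String[][] mapOutput) {
        Map<String, List<String>> groups = new LinkedHashMap<String, List<String>>();
        for (String[] kv : mapOutput) {
            List<String> lines = groups.get(kv[0]);
            if (lines == null) {
                lines = new ArrayList<String>();
                groups.put(kv[0], lines);
            }
            lines.add(kv[1]);
        }
        return groups;
    }
}
```

Each resulting group (one per file) is then handed to a reducer; with an identity reducer, as used below, the lines simply pass through to the output unchanged.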
- The CombineSmallfiles class
Finally, the main entry-point class, which configures the MapReduce job using the classes implemented above:
package org.shirdrn.kodz.inaction.hadoop.smallfiles.combine;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.shirdrn.kodz.inaction.hadoop.smallfiles.IdentityReducer;

public class CombineSmallfiles {

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {

        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: combinesmallfiles <in> <out>");
            System.exit(2);
        }

        conf.setInt("mapred.min.split.size", 1);
        conf.setLong("mapred.max.split.size", 26214400); // 25 MB

        conf.setInt("mapred.reduce.tasks", 5);

        Job job = new Job(conf, "combine smallfiles");
        job.setJarByClass(CombineSmallfiles.class);
        job.setMapperClass(CombineSmallfileMapper.class);
        job.setReducerClass(IdentityReducer.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(BytesWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(BytesWritable.class);

        job.setInputFormatClass(CombineSmallfileInputFormat.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));

        int exitFlag = job.waitForCompletion(true) ? 0 : 1;
        System.exit(exitFlag);

    }

}
Running the Program
Below is the result of running the job, which merges the small files into a form that Hadoop MapReduce can compute over much more efficiently.
- Preparation
jar -cvf combine-smallfiles.jar -C ./ org/shirdrn/kodz/inaction/hadoop/smallfiles
xiaoxiang@ubuntu3:~$ cd /opt/comodo/cloud/hadoop-1.0.3
xiaoxiang@ubuntu3:/opt/comodo/cloud/hadoop-1.0.3$ bin/hadoop fs -mkdir /user/xiaoxiang/datasets/smallfiles
xiaoxiang@ubuntu3:/opt/comodo/cloud/hadoop-1.0.3$ bin/hadoop fs -copyFromLocal /opt/comodo/cloud/dataset/smallfiles/* /user/xiaoxiang/datasets/smallfiles
- Run output
xiaoxiang@ubuntu3:/opt/comodo/cloud/hadoop-1.0.3$ bin/hadoop jar combine-smallfiles.jar org.shirdrn.kodz.inaction.hadoop.smallfiles.combine.CombineSmallfiles /user/xiaoxiang/datasets/smallfiles /user/xiaoxiang/output/smallfiles/combine
13/03/23 21:52:09 INFO input.FileInputFormat: Total input paths to process : 117
13/03/23 21:52:09 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/03/23 21:52:09 WARN snappy.LoadSnappy: Snappy native library not loaded
13/03/23 21:52:10 INFO mapred.JobClient: Running job: job_201303111631_0038
13/03/23 21:52:11 INFO mapred.JobClient: map 0% reduce 0%
13/03/23 21:52:29 INFO mapred.JobClient: map 33% reduce 0%
13/03/23 21:52:32 INFO mapred.JobClient: map 55% reduce 0%
13/03/23 21:52:35 INFO mapred.JobClient: map 76% reduce 0%
13/03/23 21:52:38 INFO mapred.JobClient: map 99% reduce 0%
13/03/23 21:52:41 INFO mapred.JobClient: map 100% reduce 0%
13/03/23 21:53:02 INFO mapred.JobClient: map 100% reduce 20%
13/03/23 21:53:05 INFO mapred.JobClient: map 100% reduce 40%
13/03/23 21:53:14 INFO mapred.JobClient: map 100% reduce 60%
13/03/23 21:53:17 INFO mapred.JobClient: map 100% reduce 80%
13/03/23 21:53:32 INFO mapred.JobClient: map 100% reduce 100%
13/03/23 21:53:37 INFO mapred.JobClient: Job complete: job_201303111631_0038
13/03/23 21:53:37 INFO mapred.JobClient: Counters: 28
13/03/23 21:53:37 INFO mapred.JobClient:   Job Counters
13/03/23 21:53:37 INFO mapred.JobClient:     Launched reduce tasks=5
13/03/23 21:53:37 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=33515
13/03/23 21:53:37 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
13/03/23 21:53:37 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
13/03/23 21:53:37 INFO mapred.JobClient:     Launched map tasks=1
13/03/23 21:53:37 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=69085
13/03/23 21:53:37 INFO mapred.JobClient:   File Output Format Counters
13/03/23 21:53:37 INFO mapred.JobClient:     Bytes Written=237510415
13/03/23 21:53:37 INFO mapred.JobClient:   FileSystemCounters
13/03/23 21:53:37 INFO mapred.JobClient:     FILE_BYTES_READ=508266867
13/03/23 21:53:37 INFO mapred.JobClient:     HDFS_BYTES_READ=147037765
13/03/23 21:53:37 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=722417364
13/03/23 21:53:37 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=237510415
13/03/23 21:53:37 INFO mapred.JobClient:   File Input Format Counters
13/03/23 21:53:37 INFO mapred.JobClient:     Bytes Read=0
13/03/23 21:53:37 INFO mapred.JobClient:   Map-Reduce Framework
13/03/23 21:53:37 INFO mapred.JobClient:     Map output materialized bytes=214110010
13/03/23 21:53:37 INFO mapred.JobClient:     Map input records=3510000
13/03/23 21:53:37 INFO mapred.JobClient:     Reduce shuffle bytes=0
13/03/23 21:53:37 INFO mapred.JobClient:     Spilled Records=11840717
13/03/23 21:53:37 INFO mapred.JobClient:     Map output bytes=207089980
13/03/23 21:53:37 INFO mapred.JobClient:     CPU time spent (ms)=64200
13/03/23 21:53:37 INFO mapred.JobClient:     Total committed heap usage (bytes)=722665472
13/03/23 21:53:37 INFO mapred.JobClient:     Combine input records=0
13/03/23 21:53:37 INFO mapred.JobClient:     SPLIT_RAW_BYTES=7914
13/03/23 21:53:37 INFO mapred.JobClient:     Reduce input records=3510000
13/03/23 21:53:37 INFO mapred.JobClient:     Reduce input groups=117
13/03/23 21:53:37 INFO mapred.JobClient:     Combine output records=0
13/03/23 21:53:37 INFO mapred.JobClient:     Physical memory (bytes) snapshot=820969472
13/03/23 21:53:37 INFO mapred.JobClient:     Reduce output records=3510000
13/03/23 21:53:37 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=3257425920
13/03/23 21:53:37 INFO mapred.JobClient:     Map output records=3510000
- Verifying the results
xiaoxiang@ubuntu3:/opt/comodo/cloud/hadoop-1.0.3$ bin/hadoop fs -text /user/xiaoxiang/output/smallfiles/combine/part-r-00000 | head -5
data_50000_000 44 4a 20 32 31 34 34 30 30 39 39 38 37 32 31 36 20 32 31 34 34 30 31 30 30 30 32 30 39 37 20 32 32 31 34 35 32 31 34 35
data_50000_000 44 45 20 32 31 34 34 30 30 39 39 38 37 37 33 32 20 32 31 34 34 30 31 30 30 30 31 32 34 31 20 31 38 32 34 39 37 32 37 34
data_50000_000 42 57 20 32 31 34 34 30 30 39 39 36 39 36 33 30 20 32 31 34 34 30 31 30 30 30 30 33 38 35 20 39 34 35 38 34 39 39 31 37
data_50000_000 50 59 20 32 31 34 34 30 30 39 39 37 37 34 35 34 20 32 31 34 34 30 30 39 39 39 39 35 32 39 20 34 38 37 33 32 33 34 39 37
data_50000_000 4d 4c 20 32 31 34 34 30 30 39 39 37 33 35 35 36 20 32 31 34 34 30 30 39 39 39 38 36 37 33 20 36 33 30 38 36 32 34 36 31
xiaoxiang@ubuntu3:/opt/comodo/cloud/hadoop-1.0.3$ bin/hadoop fs -text /user/xiaoxiang/output/smallfiles/combine/part-r-00000 | tail -5
data_50000_230 43 52 20 32 31 34 38 36 36 38 31 36 38 36 38 36 20 32 31 34 38 36 36 38 31 39 35 30 36 38 20 36 39 35 39 38 38 34 30 33
data_50000_230 50 52 20 32 31 34 38 36 36 38 31 36 35 36 34 36 20 32 31 34 38 36 36 38 31 39 34 36 34 30 20 38 34 30 36 35 31 39 38 38
data_50000_230 53 52 20 32 31 34 38 36 36 38 31 36 36 34 38 37 20 32 31 34 38 36 36 38 31 39 34 36 34 30 20 37 39 32 35 36 38 32 38 30
data_50000_230 4d 43 20 32 31 34 38 36 36 38 31 36 39 32 34 32 20 32 31 34 38 36 36 38 31 39 34 32 31 31 20 36 32 33 34 34 38 32 30 30
data_50000_230 4c 49 20 32 31 34 38 36 36 38 31 38 38 38 38 34 20 32 31 34 38 36 36 38 31 39 33 37 38 33 20 32 34 30 30 33 34 36 38 38
In the output files, the key is the source file name and the value is a line of text from that file.