Hadoop: Implementing a Custom Split Size with a Custom InputFormat
Original post: http://blog.csdn.net/anbo724/article/details/6956286
The previous article showed how to override the RecordReader; this one shows how to implement a custom split size.
The requirement:
(1) A text file in which every line records the path of another file.
(2) Those files need to be processed, and because there are many of them the work should be distributed.
(3) So the idea was to preprocess the input file, reading its first N lines as one split. I never implemented that, because rewriting FileSplit is not that easy; as a shortcut I simply fixed the split size at 1000 bytes, which is enough to chop the input file into pieces. (A line-count based alternative is sketched right after this list.)
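As an aside, later Hadoop releases ship NLineInputFormat (org.apache.hadoop.mapreduce.lib.input.NLineInputFormat), which provides exactly the "N lines per split" behavior described in (3). A minimal sketch, assuming a Hadoop version that includes the new-API class:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

public class NLineDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "n-line-splits");
        // Each split covers (up to) 100 input lines, so every map task
        // receives the same number of file paths to work on.
        NLineInputFormat.setNumLinesPerSplit(job, 100);
        job.setInputFormatClass(NLineInputFormat.class);
        // ... mapper, output types, and input/output paths as usual ...
    }
}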
Here is the code:
InputFormat
/**
 * @file LineInputFormat.java
 * @brief Custom InputFormat that controls the split size
 * @author anbo, anbo724@gmail.com
 * @version 1.0
 * @date 2011-10-18
 */
/* Copyright(C)
 * For free
 * All right reserved
 */
package an.hadoop.test;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

public class LineInputFormat extends FileInputFormat<LongWritable, Text> {

    public long mySplitSize = 1000; // target split size in bytes

    private static final Log LOG = LogFactory.getLog(LineInputFormat.class);
    private static final double SPLIT_SLOP = 1.1; // 10% slop

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
            TaskAttemptContext context) {
        // LineRecordReader copes with byte-based splits: every split except
        // the first skips its first (partial) line, and every split reads
        // past its end to finish its last line, so no line is lost or read
        // twice even though the 1000-byte boundaries cut through lines.
        return new LineRecordReader();
    }

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        CompressionCodec codec =
            new CompressionCodecFactory(context.getConfiguration()).getCodec(file);
        // The stock check is "return codec == null;" because compressed files
        // cannot be split at arbitrary byte offsets.
        return true; // force splitting; safe only for uncompressed text input
    }

    /**
     * Generate the list of files and make them into FileSplits.
     */
    @Override
    public List<InputSplit> getSplits(JobContext job) throws IOException {
        long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job));
        long maxSize = getMaxSplitSize(job);
        // generate splits
        List<InputSplit> splits = new ArrayList<InputSplit>(); // holds the generated splits
        for (FileStatus file : listStatus(job)) { // one FileStatus per input file
            Path path = file.getPath();
            FileSystem fs = path.getFileSystem(job.getConfiguration());
            long length = file.getLen(); // file length in bytes
            BlockLocation[] blkLocations = fs.getFileBlockLocations(file, 0, length); // the file's block locations
            if ((length != 0) && isSplitable(job, path)) { // non-empty and splittable
                long blockSize = file.getBlockSize();
                // The stock implementation uses
                //   long splitSize = computeSplitSize(blockSize, minSize, maxSize);
                // i.e. Math.max(minSize, Math.min(maxSize, blockSize)), so a file
                // larger than one block (64 MB by default) is normally split at
                // block boundaries. Assigning splitSize directly controls the size
                // of each split instead; note that byte-based splitting still cuts
                // mid-line, so the splits do not hold equal numbers of lines.
                long splitSize = mySplitSize;
                long bytesRemaining = length;
                while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
                    // keep emitting full-size splits while more than
                    // SPLIT_SLOP * splitSize bytes remain
                    int blkIndex = getBlockIndex(blkLocations, length - bytesRemaining); // block holding the split's first byte
                    splits.add(new FileSplit(path, length - bytesRemaining, splitSize,
                            blkLocations[blkIndex].getHosts()));
                    // FileSplit(Path file, long start, long length, String[] hosts);
                    // subclassing FileSplit would be another way to meet the
                    // "N lines per split" requirement
                    bytesRemaining -= splitSize;
                }
                if (bytesRemaining != 0) { // tail split with whatever is left
                    splits.add(new FileSplit(path, length - bytesRemaining, bytesRemaining,
                            blkLocations[blkLocations.length - 1].getHosts()));
                }
            } else if (length != 0) { // not splittable: one split for the whole file
                splits.add(new FileSplit(path, 0, length, blkLocations[0].getHosts()));
            } else {
                // create an empty hosts array for zero-length files
                splits.add(new FileSplit(path, 0, length, new String[0]));
            }
        }
        LOG.debug("Total # of splits: " + splits.size());
        return splits;
    }
}
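One refinement the post does not implement: read the split size from the job Configuration instead of hard-coding mySplitSize. The property name custom.split.size below is invented for illustration only:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SplitSizeDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // "custom.split.size" is a made-up property name; inside
        // LineInputFormat.getSplits() the hard-coded constant would become:
        //   long splitSize = job.getConfiguration()
        //                       .getLong("custom.split.size", 1000L);
        conf.setLong("custom.split.size", 4096L);
        Job job = new Job(conf, "configurable-split-size");
        job.setInputFormatClass(LineInputFormat.class);
    }
}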
package an.hadoop.test;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class Test_multi {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: test_multi <in> <out>");
            System.exit(2);
        }
        Job job = new Job(conf, "test_multi");
        job.setJarByClass(Test_multi.class);
        job.setMapperClass(MultiMapper.class);
        // job.setInputFormatClass(LineInputFormat.class); // enable the custom InputFormat
        // job.setCombinerClass(IntSumReducer.class);
        // job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
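MultiMapper is referenced above but never shown in the original post. A minimal placeholder consistent with the job's LongWritable/Text input and Text/Text output might look like the following; the body is an assumption, not the author's actual mapper:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MultiMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // value holds one line of the input, i.e. one file path;
        // real processing of that file would happen here.
        context.write(value, new Text(""));
    }
}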
Now compare the logs.
Without the custom InputFormat, the job log is:
11/11/10 14:54:25 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
11/11/10 14:54:25 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
11/11/10 14:54:25 INFO input.FileInputFormat: Total input paths to process : 1
11/11/10 14:54:25 INFO mapred.JobClient: Running job: job_local_0001
11/11/10 14:54:25 INFO input.FileInputFormat: Total input paths to process : 1
11/11/10 14:54:26 INFO mapred.MapTask: io.sort.mb = 100
11/11/10 14:54:26 INFO mapred.JobClient: map 0% reduce 0%
11/11/10 14:54:26 INFO mapred.MapTask: data buffer = 79691776/99614720
11/11/10 14:54:26 INFO mapred.MapTask: record buffer = 262144/327680
11/11/10 14:54:32 INFO mapred.LocalJobRunner:
11/11/10 14:54:33 INFO mapred.JobClient: map 58% reduce 0%
11/11/10 14:54:34 INFO mapred.MapTask: Starting flush of map output
11/11/10 14:54:35 INFO mapred.LocalJobRunner:
11/11/10 14:54:35 INFO mapred.JobClient: map 100% reduce 0%
11/11/10 14:54:35 INFO mapred.MapTask: Finished spill 0
11/11/10 14:54:35 INFO mapred.TaskRunner: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
11/11/10 14:54:35 INFO mapred.LocalJobRunner:
11/11/10 14:54:35 INFO mapred.TaskRunner: Task 'attempt_local_0001_m_000000_0' done.
11/11/10 14:54:35 INFO mapred.LocalJobRunner:
11/11/10 14:54:35 INFO mapred.Merger: Merging 1 sorted segments
11/11/10 14:54:35 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 2974 bytes
11/11/10 14:54:35 INFO mapred.LocalJobRunner:
11/11/10 14:54:36 INFO mapred.TaskRunner: Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting
11/11/10 14:54:36 INFO mapred.LocalJobRunner:
11/11/10 14:54:36 INFO mapred.TaskRunner: Task attempt_local_0001_r_000000_0 is allowed to commit now
11/11/10 14:54:36 INFO output.FileOutputCommitter: Saved output of task 'attempt_local_0001_r_000000_0' to hdfs://an.local:9100/user/an/out2
11/11/10 14:54:36 INFO mapred.LocalJobRunner: reduce > reduce
11/11/10 14:54:36 INFO mapred.TaskRunner: Task 'attempt_local_0001_r_000000_0' done.
11/11/10 14:54:36 INFO mapred.JobClient: map 100% reduce 100%
11/11/10 14:54:36 INFO mapred.JobClient: Job complete: job_local_0001
11/11/10 14:54:36 INFO mapred.JobClient: Counters: 14
11/11/10 14:54:36 INFO mapred.JobClient: FileSystemCounters
11/11/10 14:54:36 INFO mapred.JobClient: FILE_BYTES_READ=35990
11/11/10 14:54:36 INFO mapred.JobClient: HDFS_BYTES_READ=8052
11/11/10 14:54:36 INFO mapred.JobClient: FILE_BYTES_WRITTEN=72570
11/11/10 14:54:36 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=2642
11/11/10 14:54:36 INFO mapred.JobClient: Map-Reduce Framework
11/11/10 14:54:36 INFO mapred.JobClient: Reduce input groups=165
11/11/10 14:54:36 INFO mapred.JobClient: Combine output records=0
11/11/10 14:54:36 INFO mapred.JobClient: Map input records=165
11/11/10 14:54:36 INFO mapred.JobClient: Reduce shuffle bytes=0
11/11/10 14:54:36 INFO mapred.JobClient: Reduce output records=165
11/11/10 14:54:36 INFO mapred.JobClient: Spilled Records=330
11/11/10 14:54:36 INFO mapred.JobClient: Map output bytes=2642
11/11/10 14:54:36 INFO mapred.JobClient: Combine input records=0
11/11/10 14:54:36 INFO mapred.JobClient: Map output records=165
11/11/10 14:54:36 INFO mapred.JobClient: Reduce input records=165
With the custom InputFormat, the job log is:
11/11/10 14:42:41 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
11/11/10 14:42:41 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
11/11/10 14:42:41 INFO input.FileInputFormat: Total input paths to process : 1
11/11/10 14:42:42 INFO mapred.JobClient: Running job: job_local_0001
11/11/10 14:42:42 INFO input.FileInputFormat: Total input paths to process : 1
11/11/10 14:42:42 INFO mapred.MapTask: io.sort.mb = 100
11/11/10 14:42:43 INFO mapred.JobClient: map 0% reduce 0%
11/11/10 14:42:46 INFO mapred.MapTask: data buffer = 79691776/99614720
11/11/10 14:42:46 INFO mapred.MapTask: record buffer = 262144/327680
11/11/10 14:42:49 INFO mapred.MapTask: Starting flush of map output
11/11/10 14:42:49 INFO mapred.MapTask: Finished spill 0
11/11/10 14:42:49 INFO mapred.TaskRunner: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
11/11/10 14:42:49 INFO mapred.LocalJobRunner:
11/11/10 14:42:49 INFO mapred.TaskRunner: Task 'attempt_local_0001_m_000000_0' done.
11/11/10 14:42:49 INFO mapred.MapTask: io.sort.mb = 100
11/11/10 14:42:50 INFO mapred.MapTask: data buffer = 79691776/99614720
11/11/10 14:42:50 INFO mapred.MapTask: record buffer = 262144/327680
11/11/10 14:42:50 INFO mapred.JobClient: map 100% reduce 0%
11/11/10 14:42:51 INFO mapred.MapTask: Starting flush of map output
11/11/10 14:42:51 INFO mapred.MapTask: Finished spill 0
11/11/10 14:42:51 INFO mapred.TaskRunner: Task:attempt_local_0001_m_000001_0 is done. And is in the process of commiting
11/11/10 14:42:51 INFO mapred.LocalJobRunner:
11/11/10 14:42:51 INFO mapred.TaskRunner: Task 'attempt_local_0001_m_000001_0' done.
11/11/10 14:42:51 INFO mapred.MapTask: io.sort.mb = 100
11/11/10 14:42:51 INFO mapred.MapTask: data buffer = 79691776/99614720
11/11/10 14:42:51 INFO mapred.MapTask: record buffer = 262144/327680
11/11/10 14:42:53 INFO mapred.MapTask: Starting flush of map output
11/11/10 14:42:53 INFO mapred.MapTask: Finished spill 0
11/11/10 14:42:53 INFO mapred.TaskRunner: Task:attempt_local_0001_m_000002_0 is done. And is in the process of commiting
11/11/10 14:42:53 INFO mapred.LocalJobRunner:
11/11/10 14:42:53 INFO mapred.TaskRunner: Task 'attempt_local_0001_m_000002_0' done.
11/11/10 14:42:53 INFO mapred.MapTask: io.sort.mb = 100
11/11/10 14:42:53 INFO mapred.MapTask: data buffer = 79691776/99614720
11/11/10 14:42:53 INFO mapred.MapTask: record buffer = 262144/327680
11/11/10 14:42:54 INFO mapred.MapTask: Starting flush of map output
11/11/10 14:42:54 INFO mapred.MapTask: Finished spill 0
11/11/10 14:42:54 INFO mapred.TaskRunner: Task:attempt_local_0001_m_000003_0 is done. And is in the process of commiting
11/11/10 14:42:54 INFO mapred.LocalJobRunner:
11/11/10 14:42:54 INFO mapred.TaskRunner: Task 'attempt_local_0001_m_000003_0' done.
11/11/10 14:42:54 INFO mapred.LocalJobRunner:
11/11/10 14:42:54 INFO mapred.Merger: Merging 4 sorted segments
11/11/10 14:42:54 INFO mapred.Merger: Down to the last merge-pass, with 4 segments left of total size: 2980 bytes
11/11/10 14:42:54 INFO mapred.LocalJobRunner:
11/11/10 14:42:55 INFO mapred.TaskRunner: Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting
11/11/10 14:42:55 INFO mapred.LocalJobRunner:
11/11/10 14:42:55 INFO mapred.TaskRunner: Task attempt_local_0001_r_000000_0 is allowed to commit now
11/11/10 14:42:55 INFO output.FileOutputCommitter: Saved output of task 'attempt_local_0001_r_000000_0' to hdfs://an.local:9100/user/an/out2
11/11/10 14:42:55 INFO mapred.LocalJobRunner: reduce > reduce
11/11/10 14:42:55 INFO mapred.TaskRunner: Task 'attempt_local_0001_r_000000_0' done.
11/11/10 14:42:55 INFO mapred.JobClient: map 100% reduce 100%
11/11/10 14:42:55 INFO mapred.JobClient: Job complete: job_local_0001
11/11/10 14:42:55 INFO mapred.JobClient: Counters: 14
11/11/10 14:42:55 INFO mapred.JobClient: FileSystemCounters
11/11/10 14:42:55 INFO mapred.JobClient: FILE_BYTES_READ=86081
11/11/10 14:42:55 INFO mapred.JobClient: HDFS_BYTES_READ=40373
11/11/10 14:42:55 INFO mapred.JobClient: FILE_BYTES_WRITTEN=181846
11/11/10 14:42:55 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=2642
11/11/10 14:42:55 INFO mapred.JobClient: Map-Reduce Framework
11/11/10 14:42:55 INFO mapred.JobClient: Reduce input groups=165
11/11/10 14:42:55 INFO mapred.JobClient: Combine output records=0
11/11/10 14:42:55 INFO mapred.JobClient: Map input records=165
11/11/10 14:42:55 INFO mapred.JobClient: Reduce shuffle bytes=0
11/11/10 14:42:55 INFO mapred.JobClient: Reduce output records=165
11/11/10 14:42:55 INFO mapred.JobClient: Spilled Records=330
11/11/10 14:42:55 INFO mapred.JobClient: Map output bytes=2642
11/11/10 14:42:55 INFO mapred.JobClient: Combine input records=0
11/11/10 14:42:55 INFO mapred.JobClient: Map output records=165
11/11/10 14:42:55 INFO mapred.JobClient: Reduce input records=165
Notice that the second log contains four blocks like the following (one per map task):
11/11/10 14:42:42 INFO mapred.MapTask: io.sort.mb = 100
11/11/10 14:42:43 INFO mapred.JobClient: map 0% reduce 0%
11/11/10 14:42:46 INFO mapred.MapTask: data buffer = 79691776/99614720
11/11/10 14:42:46 INFO mapred.MapTask: record buffer = 262144/327680
11/11/10 14:42:49 INFO mapred.MapTask: Starting flush of map output
11/11/10 14:42:49 INFO mapred.MapTask: Finished spill 0
This shows the input was divided into four splits; the custom splitting works.
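This matches the getSplits() logic. With splitSize = 1000 and SPLIT_SLOP = 1.1, the while loop emits a full 1000-byte split as long as bytesRemaining / 1000 > 1.1, and whatever remains becomes one final split. Four splits therefore mean three full 1000-byte splits plus a tail of at most 1100 bytes, i.e. the input file must be between 3101 and 4100 bytes long.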
The next problem:
With multiple input files, the processed output files should keep the same names as their input files, just under a different directory, so that input and output files correspond one to one.
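The original post leaves this open. One way to attack it, offered here as a sketch and not as the author's solution, is the new-API MultipleOutputs (org.apache.hadoop.mapreduce.lib.output.MultipleOutputs, available in later Hadoop releases): in a map-only job (setNumReduceTasks(0)), each mapper can name its output after the file that produced its split. PerFileMapper is a hypothetical name, and note that Hadoop still appends a task suffix such as -m-00000 to each output name:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class PerFileMapper extends Mapper<LongWritable, Text, Text, Text> {
    private MultipleOutputs<Text, Text> out;
    private String baseName;

    @Override
    protected void setup(Context context) {
        out = new MultipleOutputs<Text, Text>(context);
        // Derive the output name from the input file behind this split.
        baseName = ((FileSplit) context.getInputSplit()).getPath().getName();
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        out.write(value, new Text(""), baseName);
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        out.close();
    }
}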