Hadoop0.20+ custom MultipleOutputFormat
Hadoop 0.20.2 does not provide MultipleOutputFormat (multi-file output) in the new API. The old class from 0.19.2, org.apache.hadoop.mapred.lib.MultipleOutputFormat, can still be used in 0.20.2, but everything under org.apache.hadoop.mapred is marked as deprecated and may be removed in a future Hadoop release. Hadoop 0.20.2 recommends replacing JobConf with Configuration, yet the old org.apache.hadoop.mapred.lib.MultipleOutputFormat still depends on JobConf; in other words, there is no replacement for it in the new API yet.
Moreover, Hadoop 0.20.2 is only an intermediate release: not all of the APIs have been migrated to the new style, so anything that is missing has to be written yourself.
Rewriting MultipleOutputFormat requires two classes:
LineRecordWriter
MultipleOutputFormat
PartitionByFilenameOutputFormat is the custom subclass needed in this experiment, so that the results for each file are written to their own output file.
LineRecordWriter:
package cn.xmu.dm;

import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UnsupportedEncodingException;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

/**
 * Writes key/value pairs as plain text lines: key, separator, value, newline.
 * Essentially TextOutputFormat's LineRecordWriter, pulled out so that the
 * MultipleOutputFormat below can reuse it.
 */
public class LineRecordWriter<K, V> extends RecordWriter<K, V> {
    private static final String utf8 = "UTF-8";

    protected DataOutputStream out;
    private final byte[] keyValueSeparator;

    public LineRecordWriter(DataOutputStream out, String keyValueSeparator) {
        this.out = out;
        try {
            this.keyValueSeparator = keyValueSeparator.getBytes(utf8);
        } catch (UnsupportedEncodingException uee) {
            throw new IllegalArgumentException("can't find " + utf8 + " encoding");
        }
    }

    public LineRecordWriter(DataOutputStream out) {
        this(out, "\t");
    }

    private void writeObject(Object o) throws IOException {
        if (o instanceof Text) {
            Text to = (Text) o;
            out.write(to.getBytes(), 0, to.getLength());
        } else {
            out.write(o.toString().getBytes(utf8));
        }
    }

    @Override
    public synchronized void write(K key, V value) throws IOException {
        boolean nullKey = key == null || key instanceof NullWritable;
        boolean nullValue = value == null || value instanceof NullWritable;
        if (nullKey && nullValue) {
            return;
        }
        if (!nullKey) {
            writeObject(key);
        }
        if (!(nullKey || nullValue)) {
            out.write(keyValueSeparator);
        }
        if (!nullValue) {
            writeObject(value);
        }
        out.write("\r\n".getBytes());
    }

    @Override
    public synchronized void close(TaskAttemptContext context) throws IOException {
        out.close();
    }
}
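Outside of a full MapReduce job, the write semantics of LineRecordWriter are easy to check against any DataOutputStream. Below is a minimal sketch; the class name, output file and sample records are illustrative assumptions, not part of the original post:

package cn.xmu.dm;

import java.io.DataOutputStream;
import java.io.FileOutputStream;

import org.apache.hadoop.io.Text;

public class LineRecordWriterDemo {
    public static void main(String[] args) throws Exception {
        DataOutputStream out = new DataOutputStream(new FileOutputStream("demo.txt"));
        LineRecordWriter<Text, Text> writer = new LineRecordWriter<Text, Text>(out, ",");

        writer.write(new Text("key1"), new Text("value1"));  // writes the line "key1,value1"
        writer.write(null, new Text("value2"));              // null key: writes "value2" only

        // close(TaskAttemptContext) only closes the underlying stream,
        // so passing null is fine in this standalone test.
        writer.close(null);
    }
}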
MultipleOutputFormat:
package cn.xmu.dm;

import java.io.DataOutputStream;
import java.io.IOException;
import java.util.HashMap;
import java.util.Iterator;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.OutputCommitter;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.ReflectionUtils;

public abstract class MultipleOutputFormat<K extends WritableComparable<?>, V extends Writable>
        extends FileOutputFormat<K, V> {

    private MultiRecordWriter writer = null;

    @Override
    public RecordWriter<K, V> getRecordWriter(TaskAttemptContext job)
            throws IOException, InterruptedException {
        if (writer == null) {
            writer = new MultiRecordWriter(job, getTaskOutputPath(job));
        }
        return writer;
    }

    private Path getTaskOutputPath(TaskAttemptContext conf) throws IOException {
        Path workPath = null;
        OutputCommitter committer = super.getOutputCommitter(conf);
        if (committer instanceof FileOutputCommitter) {
            workPath = ((FileOutputCommitter) committer).getWorkPath();
        } else {
            Path outputPath = super.getOutputPath(conf);
            if (outputPath == null) {
                throw new IOException("Undefined job output-path");
            }
            workPath = outputPath;
        }
        return workPath;
    }

    /** Subclasses decide which output file a given key/value pair goes to. */
    protected abstract String generateFileNameForKeyValue(K key, V value, Configuration conf);

    public class MultiRecordWriter extends RecordWriter<K, V> {
        /** One RecordWriter per target file name, created lazily. */
        private HashMap<String, RecordWriter<K, V>> recordWriters = null;
        private TaskAttemptContext job = null;
        private Path workPath = null;

        public MultiRecordWriter(TaskAttemptContext job, Path workPath) {
            super();
            this.job = job;
            this.workPath = workPath;
            recordWriters = new HashMap<String, RecordWriter<K, V>>();
        }

        @Override
        public void close(TaskAttemptContext context) throws IOException, InterruptedException {
            Iterator<RecordWriter<K, V>> values = this.recordWriters.values().iterator();
            while (values.hasNext()) {
                values.next().close(context);
            }
            this.recordWriters.clear();
        }

        @Override
        public void write(K key, V value) throws IOException, InterruptedException {
            String baseName = generateFileNameForKeyValue(key, value, job.getConfiguration());
            RecordWriter<K, V> rw = this.recordWriters.get(baseName);
            if (rw == null) {
                rw = getBaseRecordWriter(job, baseName);
                this.recordWriters.put(baseName, rw);
            }
            rw.write(key, value);
        }

        private RecordWriter<K, V> getBaseRecordWriter(TaskAttemptContext job, String baseName)
                throws IOException, InterruptedException {
            Configuration conf = job.getConfiguration();
            boolean isCompressed = getCompressOutput(job);
            String keyValueSeparator = ",";
            RecordWriter<K, V> recordWriter = null;
            if (isCompressed) {
                Class<? extends CompressionCodec> codecClass =
                        getOutputCompressorClass(job, GzipCodec.class);
                CompressionCodec codec = ReflectionUtils.newInstance(codecClass, conf);
                Path file = new Path(workPath, baseName + codec.getDefaultExtension());
                FSDataOutputStream fileOut = file.getFileSystem(conf).create(file, false);
                recordWriter = new LineRecordWriter<K, V>(
                        new DataOutputStream(codec.createOutputStream(fileOut)), keyValueSeparator);
            } else {
                Path file = new Path(workPath, baseName);
                FSDataOutputStream fileOut = file.getFileSystem(conf).create(file, false);
                recordWriter = new LineRecordWriter<K, V>(fileOut, keyValueSeparator);
            }
            return recordWriter;
        }
    }
}
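Two details are worth noting. MultiRecordWriter creates a LineRecordWriter lazily for each distinct base name returned by generateFileNameForKeyValue and caches it in the HashMap, so a single task can write to many files without reopening streams. The files are created under the FileOutputCommitter work path (the task attempt's temporary directory), so they only show up in the final output directory once the task commits.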
PartitionByFilenameOutputFormat:
package cn.xmu.dm;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;

public class PartitionByFilenameOutputFormat extends MultipleOutputFormat<Text, Text> {

    /**
     * The value is expected to begin with the source file name followed by a tab;
     * that prefix becomes the name of the output file the record is written to.
     */
    @Override
    protected String generateFileNameForKeyValue(Text key, Text value, Configuration conf) {
        return value.toString().substring(0, value.toString().indexOf("\t"));
    }
}
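Finally, the output format has to be attached to a job. The driver sketch below shows one way to wire it up with the 0.20 API; the driver and mapper class names, the map-only setup, and the input/output paths are illustrative assumptions, not part of the original post. The mapper prefixes every output value with the name of the file the record came from, which is exactly what PartitionByFilenameOutputFormat expects.

package cn.xmu.dm;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PartitionByFilenameDriver {

    /** Illustrative mapper: tags every line with the name of the file it came from. */
    public static class TagByFilenameMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
            // Key: source file name; value: "fileName\t<line>", so the output format
            // can cut the file name back out with substring(0, indexOf("\t")).
            context.write(new Text(fileName), new Text(fileName + "\t" + line.toString()));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "partition by filename");
        job.setJarByClass(PartitionByFilenameDriver.class);

        job.setMapperClass(TagByFilenameMapper.class);
        job.setNumReduceTasks(0);   // map-only: records go straight to the output format
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // Plug in the custom multi-file output format.
        job.setOutputFormatClass(PartitionByFilenameOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}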
http://irwenqiang.iteye.com/blog/1535275