Hadoop Multiple Outputs Example
Source: Internet. Editor: 程序博客网. Date: 2024/05/16 13:58
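Requirement: sample the raw data at an approximate ratio and split it into a training set and a test set. The training set is written to the train directory under the specified output directory, and the test set is written to the test directory under the same output directory.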
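The sampling rule in this requirement can be tried without a cluster. The following is a minimal sketch in plain Java, assuming an in-memory list of records; `SplitSketch` and `Split` are hypothetical names used only for illustration and are not part of the original job. It shows why the split is only approximately at the requested ratio: every record gets its own independent random draw.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical helper, not part of the original job: it only mirrors the
// per-record random draw that the mapper performs.
final class SplitSketch {
    /** Holds the two partitions produced by one pass over the records. */
    static final class Split {
        final List<String> train = new ArrayList<>();
        final List<String> test = new ArrayList<>();
    }

    static Split split(List<String> records, double ratio, Random random) {
        Split result = new Split();
        for (String record : records) {
            // nextDouble() is uniform in [0.0, 1.0), so this branch is taken
            // with probability `ratio`.
            if (random.nextDouble() < ratio) {
                result.train.add(record);
            } else {
                result.test.add(record);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> records = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            records.add("record-" + i);
        }
        Split s = split(records, 0.2, new Random(42L));
        // The train share is close to, but usually not exactly, 20%.
        System.out.println("train=" + s.train.size() + " test=" + s.test.size());
    }
}
```

Every record ends up in exactly one of the two lists, so no data is lost or duplicated; only the sizes of the two sets vary from run to run.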
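import java.io.IOException;
import java.util.Random;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

class SampleMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
    private double ratio;
    private final Random random = new Random();
    private MultipleOutputs<NullWritable, Text> multipleOutputs;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // The driver stores the sampling ratio in the job configuration under "ratio".
        ratio = Double.parseDouble(context.getConfiguration().get("ratio"));
        multipleOutputs = new MultipleOutputs<NullWritable, Text>(context);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // nextDouble() is uniform in [0.0, 1.0), so a record goes to train with probability ratio.
        // A base path ending in "/" places the files in a subdirectory of the job output path.
        if (random.nextDouble() < ratio) {
            multipleOutputs.write(NullWritable.get(), value, "train/");
        } else {
            multipleOutputs.write(NullWritable.get(), value, "test/");
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Closing flushes and closes every record writer opened by MultipleOutputs.
        multipleOutputs.close();
    }
}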
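import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// Enclosing class assumed from the setJarByClass(Sampler.class) call below.
public class Sampler {
    public static void job(Configuration config, Path inputPath, Path outputPath, String ratio)
            throws IOException {
        // Pass the sampling ratio to the mappers through the job configuration.
        config.set("ratio", ratio);

        Job job = Job.getInstance(config);
        job.setJobName("Random Sample");
        job.setJarByClass(Sampler.class);
        job.setMapperClass(SampleMapper.class);
        job.setMapOutputKeyClass(NullWritable.class);
        job.setMapOutputValueClass(Text.class);
        // Map-only job: the mappers write the final output directly.
        job.setNumReduceTasks(0);

        FileInputFormat.setInputPaths(job, inputPath);
        FileOutputFormat.setOutputPath(job, outputPath);

        // Register the named outputs "train" and "test".
        MultipleOutputs.addNamedOutput(job, "train", TextOutputFormat.class, NullWritable.class, Text.class);
        MultipleOutputs.addNamedOutput(job, "test", TextOutputFormat.class, NullWritable.class, Text.class);

        try {
            job.waitForCompletion(true);
        } catch (ClassNotFoundException | InterruptedException e) {
            e.printStackTrace();
        }
    }
}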
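Key code:
multipleOutputs.write(NullWritable.get(), value, "train/");
multipleOutputs.write(NullWritable.get(), value, "test/");
FileOutputFormat.setOutputPath(job, outputPath);
MultipleOutputs.addNamedOutput(job, "train", TextOutputFormat.class, NullWritable.class, Text.class);
MultipleOutputs.addNamedOutput(job, "test", TextOutputFormat.class, NullWritable.class, Text.class);
Note: the write(key, value, baseOutputPath) overload used here writes through the job's own output format, with paths relative to the job output path. The addNamedOutput registrations matter for the overloads that take a named output as the first argument.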
The sampling ratio, input path, and output path are specified as:
hadoop.sampler.ratio = 0.2
hadoop.sampler.datainputpath = /lgh/data/input
hadoop.sampler.dataoutputpath = /lgh/sampleoutput
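The post does not show how these three keys are read before the job is launched. One possible way, as a sketch only, is to parse them with `java.util.Properties` and validate the ratio before handing it to the driver. `SamplerSettings` and its `parse` method are hypothetical names introduced for this illustration; only the three property keys come from the post.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical loader: the original post does not show how these keys are read.
final class SamplerSettings {
    final double ratio;
    final String inputPath;
    final String outputPath;

    private SamplerSettings(double ratio, String inputPath, String outputPath) {
        this.ratio = ratio;
        this.inputPath = inputPath;
        this.outputPath = outputPath;
    }

    /** Parses text in java.util.Properties format and validates the ratio. */
    static SamplerSettings parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        double ratio = Double.parseDouble(required(props, "hadoop.sampler.ratio"));
        // Written this way so that NaN is rejected as well.
        if (!(ratio >= 0.0 && ratio <= 1.0)) {
            throw new IllegalArgumentException("ratio must be in [0, 1]: " + ratio);
        }
        return new SamplerSettings(
                ratio,
                required(props, "hadoop.sampler.datainputpath"),
                required(props, "hadoop.sampler.dataoutputpath"));
    }

    private static String required(Properties props, String key) {
        String value = props.getProperty(key);
        if (value == null || value.trim().isEmpty()) {
            throw new IllegalArgumentException("missing property: " + key);
        }
        return value.trim();
    }

    public static void main(String[] args) {
        SamplerSettings s = parse(
                "hadoop.sampler.ratio = 0.2\n"
                + "hadoop.sampler.datainputpath = /lgh/data/input\n"
                + "hadoop.sampler.dataoutputpath = /lgh/sampleoutput\n");
        System.out.println(s.ratio + " " + s.inputPath + " " + s.outputPath);
    }
}
```

Checking the ratio up front means a bad value fails fast on the client instead of in every map task's setup.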
Output directories:
/lgh/sampleoutput/train
/lgh/sampleoutput/test
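Inside these directories, the file names depend on the base path passed to `multipleOutputs.write`. As far as I understand Hadoop's `FileOutputFormat` naming, the file name is the base path followed by `-`, a task-type letter (`m` for map), `-`, and a five-digit task number. The sketch below imitates that pattern in plain Java without calling Hadoop; `OutputNameSketch` and `uniqueFile` are hypothetical names for this illustration only.

```java
import java.util.Locale;

// Hypothetical stand-in for illustration; a real job gets these names from Hadoop itself.
final class OutputNameSketch {
    /** Builds base + "-" + task type + "-" + zero-padded task number + extension. */
    static String uniqueFile(String baseOutputPath, char taskType, int partition, String extension) {
        return String.format(Locale.ROOT, "%s-%c-%05d%s", baseOutputPath, taskType, partition, extension);
    }

    public static void main(String[] args) {
        // Base path "train/" as used in the mapper above.
        System.out.println(uniqueFile("train/", 'm', 0, ""));     // prints train/-m-00000
        // A base path such as "train/part" gives a more conventional file name.
        System.out.println(uniqueFile("train/part", 'm', 0, "")); // prints train/part-m-00000
    }
}
```

Under this assumption, the trailing slash in `"train/"` is what creates the subdirectory, and changing the base path to something like `"train/part"` would keep the same directory while giving the files a conventional prefix.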