MapReduce - Using MultipleOutputs for Custom Output to Multiple Directories
Sample input data:

Source1-0001
Source2-0002
Source1-0003
Source2-0004
Source1-0005
Source2-0006
Source3-0007
Source3-0008

Description:
- Records starting with Source1 belong to set A;
- Records starting with Source2 belong to set B;
- Records starting with Source3 belong to both set A and set B.
Output requirements:
- Keep all of set A's data (records starting with Source1 and Source3);
- Keep all of set B's data (records starting with Source2 and Source3).
Program implementation:
import java.io.IOException;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.mahout.common.AbstractJob;

import com.yhd.common.util.HadoopUtil;

/**
 * AbstractJob is Mahout's job template; using it is optional.
 * The core of this example is the MultipleOutputs part.
 *
 * @author ouyangyewei
 */
public class TestMultipleOutputsJob extends AbstractJob {
    @Override
    public int run(String[] args) throws Exception {
        addInputOption();
        addOutputOption();
        Map<String, List<String>> parseArgs = parseArguments(args);
        if (parseArgs == null) {
            return -1;
        }

        HadoopUtil.delete(getConf(), getOutputPath());

        Configuration conf = new Configuration();
        conf.setInt("mapred.reduce.tasks", 4);
        conf.set("mapred.job.queue.name", "pms");
        conf.set("mapred.child.java.opts", "-Xmx3072m");
        conf.set("mapreduce.reduce.shuffle.memory.limit.percent", "0.05");

        Job job = new Job(new Configuration(conf));
        job.setJobName("TestMultipleOutputsJob");
        job.setJarByClass(TestMultipleOutputsJob.class);
        job.setMapperClass(MultipleMapper.class);
        job.setNumReduceTasks(0);
        FileInputFormat.setInputPaths(job, this.getInputPath());
        FileOutputFormat.setOutputPath(job, this.getOutputPath());

        // Output files will be named Source1-m-*****
        MultipleOutputs.addNamedOutput(job, "Source1", TextOutputFormat.class, Text.class, Text.class);
        // Output files will be named Source2-m-*****
        MultipleOutputs.addNamedOutput(job, "Source2", TextOutputFormat.class, Text.class, Text.class);

        boolean succeeded = job.waitForCompletion(true);
        if (!succeeded) {
            return -1;
        }

        return 0;
    }

    /**
     * @author ouyangyewei
     */
    public static class MultipleMapper extends Mapper<LongWritable, Text, Text, Text> {
        private MultipleOutputs<Text, Text> mos = null;

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            mos = new MultipleOutputs<Text, Text>(context);
        }

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            String[] tokenizer = line.split("-");
            if (tokenizer[0].equals("Source1")) {
                // Data belonging to set A
                mos.write("Source1", new Text(tokenizer[0]), tokenizer[1]);
            } else if (tokenizer[0].equals("Source2")) {
                // Data belonging to set B
                mos.write("Source2", new Text(tokenizer[0]), tokenizer[1]);
            }
            // Data belonging to both set A and set B
            if (tokenizer[0].equals("Source3")) {
                mos.write("Source1", new Text(tokenizer[0]), tokenizer[1]);
                mos.write("Source2", new Text(tokenizer[0]), tokenizer[1]);
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            mos.close();
        }
    }

    /**
     * @param args
     */
    public static void main(String[] args) {
        System.setProperty("javax.xml.parsers.DocumentBuilderFactory",
                "com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl");
        System.setProperty("javax.xml.parsers.SAXParserFactory",
                "com.sun.org.apache.xerces.internal.jaxp.SAXParserFactoryImpl");

        TestMultipleOutputsJob instance = new TestMultipleOutputsJob();
        try {
            instance.run(args);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
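As the comment above notes, Mahout's AbstractJob is only a convenience template. For readers who do not want the Mahout dependency, here is a minimal sketch of an equivalent driver built on Hadoop's plain Tool/ToolRunner; the class name PlainMultipleOutputsDriver and the positional input/output arguments are assumptions for illustration, not part of the original job.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

/** Plain Tool-based driver (no Mahout). Usage: <input path> <output path>. */
public class PlainMultipleOutputsDriver extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        Job job = new Job(getConf(), "PlainMultipleOutputsDriver");
        job.setJarByClass(PlainMultipleOutputsDriver.class);
        // Reuse the mapper shown above
        job.setMapperClass(TestMultipleOutputsJob.MultipleMapper.class);
        job.setNumReduceTasks(0);                       // map-only job
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Register the two named outputs exactly as in the AbstractJob version
        MultipleOutputs.addNamedOutput(job, "Source1", TextOutputFormat.class, Text.class, Text.class);
        MultipleOutputs.addNamedOutput(job, "Source2", TextOutputFormat.class, Text.class, Text.class);

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new PlainMultipleOutputsDriver(), args));
    }
}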
Run the packaged jar with the hadoop jar command:
hadoop jar bigdata-datamining-1.0-user-trace-jar-with-dependencies.jar \
com.yhd.datamining.data.usertrack.offline.job.mapred.TestMultipleOutputsJob \
--input /user/pms/workspace/ouyangyewei/testMultipleOutputs \
--output /user/pms/workspace/ouyangyewei/testMultipleOutputs/output
After the job runs, the output is:
[pms@yhd-jqhadoop39 /home/pms/workspace/ouyangyewei]$ hadoop fs -ls /user/pms/workspace/ouyangyewei/testMultipleOutputs/output
Found 4 items
-rw-r--r--   3 pms pms  65 2014-12-16 09:18 /user/pms/workspace/ouyangyewei/testMultipleOutputs/output/Source1-m-00000
-rw-r--r--   3 pms pms  65 2014-12-16 09:18 /user/pms/workspace/ouyangyewei/testMultipleOutputs/output/Source2-m-00000
-rw-r--r--   3 pms pms   0 2014-12-16 09:18 /user/pms/workspace/ouyangyewei/testMultipleOutputs/output/_SUCCESS
-rw-r--r--   3 pms pms   0 2014-12-16 09:18 /user/pms/workspace/ouyangyewei/testMultipleOutputs/output/part-m-00000

[pms@yhd-jqhadoop39 /home/pms/workspace/ouyangyewei]$ hadoop fs -cat /user/pms/workspace/ouyangyewei/testMultipleOutputs/output/Source1-m-00000
Source1	0001
Source1	0003
Source1	0005
Source3	0007
Source3	0008

[pms@yhd-jqhadoop39 /home/pms/workspace/ouyangyewei]$ hadoop fs -cat /user/pms/workspace/ouyangyewei/testMultipleOutputs/output/Source2-m-00000
Source2	0002
Source2	0004
Source2	0006
Source3	0007
Source3	0008
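Note the empty part-m-00000 in the listing: it comes from the job's default output, which is still created even though the mapper only writes through MultipleOutputs. If these empty default files are unwanted, Hadoop's LazyOutputFormat can delay their creation until something is actually written to the standard output. A minimal sketch of the change in the driver (assuming the TestMultipleOutputsJob.run() shown above):

// extra import in the driver:
// import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;

// in run(), after FileOutputFormat.setOutputPath(job, this.getOutputPath()):
// the default part-m-***** file is then only created if the mapper writes
// to the standard output; the MultipleOutputs named outputs are unaffected
LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);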
Addendum (2014-12-18):
The drawback of this approach is that it scatters many files named Source1-* and Source2-* directly under the output directory. A better option is to pass a baseOutputPath to mos.write(), so that, for example, all Source1 files are collected under one subdirectory and can be managed together.
Rewriting the map method above to manage the output by directory:
public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    String line = value.toString();
    String[] tokenizer = line.split("-");
    if (tokenizer[0].equals("Source1")) {
        // Data belonging to set A
        mos.write("Source1", new Text(tokenizer[0]), tokenizer[1], "Source1/part");
    } else if (tokenizer[0].equals("Source2")) {
        // Data belonging to set B
        mos.write("Source2", new Text(tokenizer[0]), tokenizer[1], "Source2/part");
    }
    // Data belonging to both set A and set B
    if (tokenizer[0].equals("Source3")) {
        mos.write("Source1", new Text(tokenizer[0]), tokenizer[1], "Source1/part");
        mos.write("Source2", new Text(tokenizer[0]), tokenizer[1], "Source2/part");
    }
}

After the job runs, the output is:
$ hadoop fs -ls /user/pms/workspace/ouyangyewei/testUsertrack/job1Output
Found 4 items
-rw-r--r--   3 pms pms   0 2014-12-18 14:11 /user/pms/workspace/ouyangyewei/testUsertrack/job1Output/_SUCCESS
-rw-r--r--   3 pms pms   0 2014-12-18 14:11 /user/pms/workspace/ouyangyewei/testUsertrack/job1Output/part-r-00000
drwxr-xr-x   - pms pms   0 2014-12-18 14:11 /user/pms/workspace/ouyangyewei/testUsertrack/job1Output/Source1
drwxr-xr-x   - pms pms   0 2014-12-18 14:11 /user/pms/workspace/ouyangyewei/testUsertrack/job1Output/Source2

[pms@yhd-jqhadoop39 /home/pms/workspace/ouyangyewei/testUsertrack]$ hadoop fs -ls /user/pms/workspace/ouyangyewei/testUsertrack/job1Output/Source1
Found 1 items
-rw-r--r--   3 pms pms  65 2014-12-18 14:11 /user/pms/workspace/ouyangyewei/testUsertrack/job1Output/Source1/part-r-00000

[pms@yhd-jqhadoop39 /home/pms/workspace/ouyangyewei/testUsertrack]$ hadoop fs -ls /user/pms/workspace/ouyangyewei/testUsertrack/job1Output/Source2
Found 1 items
-rw-r--r--   3 pms pms  65 2014-12-18 14:11 /user/pms/workspace/ouyangyewei/testUsertrack/job1Output/Source2/part-r-00000
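With this layout, a downstream job can consume one set without any filtering, simply by pointing its input at the matching subdirectory. A small illustrative fragment (job2 is a hypothetical downstream Job object, not part of the original code):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Read only set A (Source1 plus Source3 records) produced by the previous job
FileInputFormat.addInputPath(job2,
        new Path("/user/pms/workspace/ouyangyewei/testUsertrack/job1Output/Source1"));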
See also: http://dirlt.com/mapred.html