Hadoop Learning Notes (2): MapReduce Hello World
- Data preparation
This article uses the NBER patent data sets: the patent citation data cite75_99.txt and the patent description data apat63_99.txt, which can be downloaded from the National Bureau of Economic Research (patent description data; citation data, parts 1 and 2).
The meanings of the first ten fields of the patent description data are as follows.
The input file looks like this:
"CITING","CITED"
3858241,956203
3858241,1324234
3858241,3398406
3858241,3557384
3858241,3634889
3858242,1515701
3858242,3319261
3858242,3668705
3858242,3707004
Each line means that the patent in the first column cites the patent in the second column; that is, the second column is the cited patent.
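Note that the first line of the file is the quoted header row shown above, and the Mapper below splits every line blindly, so it can help to strip the header before uploading. A minimal sketch (the local file name is assumed to match the download):

```shell
# Remove the header line ("CITING","CITED") so it is not treated as a citation pair
sed '1d' cite75_99.txt > cite75_99.clean.txt
```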
- Code
The job inverts this relation: for each patent, it lists all the patents that cite it. The expected output looks like this:
1 3964859,4647229
10000 4539112
100000 5031388
1000006 4714284
1000007 4766693
1000011 5033339
1000017 3908629
1000026 4043055
1000033 4190903,4975983
1000043 4091523
1000044 4082383,4055371
1000045 4290571
1000046 5918892,5525001
1000049 5996916
For example, patents 4082383 and 4055371 both cite patent 1000044.
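Before turning to Hadoop, the same inversion can be sketched as a local pipeline on the sample pairs above: the first awk is the "map" (swap each pair to cited,citing), sort plays the role of the shuffle (grouping equal keys), and the last awk is the "reduce" (joining all citing patents for one key into a comma-separated list):

```shell
# map: emit (cited,citing); shuffle: sort by key; reduce: join values per key
printf '%s\n' 3858241,956203 4082383,1000044 4055371,1000044 |
awk -F, '{ print $2 "," $1 }' |
sort -t, -k1,1 |
awk -F, '{ if ($1 == k) { v = v "," $2 }
           else { if (k != "") print k "\t" v; k = $1; v = $2 } }
         END { if (k != "") print k "\t" v }'
```

On this sample it prints `1000044` with `4055371,4082383` and `956203` with `3858241`. Note that plain sort breaks ties on the whole line, so the order of values within a group may differ from what Hadoop produces.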
The Mapper class, named PatentMapper, emits the cited patent as the key and the citing patent as the value:
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class PatentMapper extends Mapper<LongWritable, Text, Text, Text> {

    public void map(LongWritable ikey, Text ivalue, Context context)
            throws IOException, InterruptedException {
        // Each input line is "citing,cited"; emit (cited, citing)
        String[] citation = ivalue.toString().split(",");
        context.write(new Text(citation[1]), new Text(citation[0]));
    }
}
The Reducer class, named PatentReducer, gathers together all values that share the same cited-patent key:
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class PatentReducer extends Reducer<Text, Text, Text, Text> {

    public void reduce(Text _key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Join all citing patents for this key into one comma-separated string
        StringBuilder csv = new StringBuilder();
        for (Text val : values) {
            if (csv.length() > 0) {
                csv.append(",");
            }
            csv.append(val.toString());
        }
        context.write(_key, new Text(csv.toString()));
    }
}

Next, write a driver class, named PatentDriver, to configure and submit the job:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class PatentDriver {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "Patent_Job");
        job.setJarByClass(PatentDriver.class);

        job.setMapperClass(PatentMapper.class);
        job.setReducerClass(PatentReducer.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // The input path may be a file or directory;
        // the output directory must not exist yet
        Path in = new Path(args[0]);
        Path out = new Path(args[1]);
        FileInputFormat.setInputPaths(job, in);
        FileOutputFormat.setOutputPath(job, out);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Next, package these three classes into a jar named patent.jar.
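The article does not show the packaging step itself; one way to do it, assuming a Hadoop 2.x installation where the `hadoop classpath` command is available, is roughly:

```shell
# Compile against the Hadoop jars and package the three classes (sketch)
javac -classpath "$(hadoop classpath)" PatentMapper.java PatentReducer.java PatentDriver.java
jar cf patent.jar PatentMapper.class PatentReducer.class PatentDriver.class
```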
- Run
$ sh /usr/local/hadoop/sbin/start-dfs.sh
$ sh /usr/local/hadoop/sbin/start-yarn.sh
Upload the test data file to HDFS:
$ hadoop fs -copyFromLocal cite75_99.txt /input
Then add patent.jar to the classpath:
$ export HADOOP_CLASSPATH=patent.jar
Then run the MapReduce job:
$ hadoop PatentDriver /input/cite75_99.txt /output
The job prints output similar to the following:
14/06/23 15:08:25 INFO client.RMProxy: Connecting to ResourceManager at localhost/127.0.0.1:8032
14/06/23 15:08:26 WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
14/06/23 15:08:27 INFO input.FileInputFormat: Total input paths to process : 1
14/06/23 15:08:27 INFO mapreduce.JobSubmitter: number of splits:2
14/06/23 15:08:27 INFO Configuration.deprecation: user.name is deprecated. Instead, use mapreduce.job.user.name
14/06/23 15:08:27 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
14/06/23 15:08:27 INFO Configuration.deprecation: mapred.output.value.class is deprecated. Instead, use mapreduce.job.output.value.class
14/06/23 15:08:27 INFO Configuration.deprecation: mapreduce.map.class is deprecated. Instead, use mapreduce.job.map.class
14/06/23 15:08:27 INFO Configuration.deprecation: mapred.job.name is deprecated. Instead, use mapreduce.job.name
14/06/23 15:08:27 INFO Configuration.deprecation: mapreduce.reduce.class is deprecated. Instead, use mapreduce.job.reduce.class
14/06/23 15:08:27 INFO Configuration.deprecation: mapreduce.inputformat.class is deprecated. Instead, use mapreduce.job.inputformat.class
14/06/23 15:08:27 INFO Configuration.deprecation: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
14/06/23 15:08:27 INFO Configuration.deprecation: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
14/06/23 15:08:27 INFO Configuration.deprecation: mapreduce.outputformat.class is deprecated. Instead, use mapreduce.job.outputformat.class
14/06/23 15:08:27 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
14/06/23 15:08:27 INFO Configuration.deprecation: mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class
14/06/23 15:08:27 INFO Configuration.deprecation: mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
14/06/23 15:08:27 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1403506647662_0001
14/06/23 15:08:28 INFO impl.YarnClientImpl: Submitted application application_1403506647662_0001 to ResourceManager at localhost/127.0.0.1:8032
14/06/23 15:08:28 INFO mapreduce.Job: The url to track the job: http://ubuntu:8088/proxy/application_1403506647662_0001/
14/06/23 15:08:28 INFO mapreduce.Job: Running job: job_1403506647662_0001
14/06/23 15:08:42 INFO mapreduce.Job: Job job_1403506647662_0001 running in uber mode : false
14/06/23 15:08:42 INFO mapreduce.Job: map 0% reduce 0%
14/06/23 15:09:01 INFO mapreduce.Job: map 2% reduce 0%
14/06/23 15:09:04 INFO mapreduce.Job: map 15% reduce 0%
14/06/23 15:09:07 INFO mapreduce.Job: map 19% reduce 0%
14/06/23 15:09:08 INFO mapreduce.Job: map 23% reduce 0%
14/06/23 15:09:11 INFO mapreduce.Job: map 26% reduce 0%
14/06/23 15:09:23 INFO mapreduce.Job: map 27% reduce 0%
14/06/23 15:09:27 INFO mapreduce.Job: map 33% reduce 0%
14/06/23 15:09:30 INFO mapreduce.Job: map 46% reduce 0%
14/06/23 15:09:33 INFO mapreduce.Job: map 48% reduce 0%
14/06/23 15:09:46 INFO mapreduce.Job: map 58% reduce 0%
14/06/23 15:09:49 INFO mapreduce.Job: map 66% reduce 0%
14/06/23 15:09:53 INFO mapreduce.Job: map 67% reduce 0%
14/06/23 15:10:06 INFO mapreduce.Job: map 68% reduce 0%
14/06/23 15:10:09 INFO mapreduce.Job: map 75% reduce 0%
14/06/23 15:10:12 INFO mapreduce.Job: map 79% reduce 0%
14/06/23 15:10:13 INFO mapreduce.Job: map 82% reduce 0%
14/06/23 15:10:15 INFO mapreduce.Job: map 85% reduce 0%
14/06/23 15:10:16 INFO mapreduce.Job: map 88% reduce 0%
14/06/23 15:10:19 INFO mapreduce.Job: map 96% reduce 0%
14/06/23 15:10:21 INFO mapreduce.Job: map 100% reduce 0%
14/06/23 15:10:35 INFO mapreduce.Job: map 100% reduce 67%
14/06/23 15:10:38 INFO mapreduce.Job: map 100% reduce 69%
14/06/23 15:10:41 INFO mapreduce.Job: map 100% reduce 73%
14/06/23 15:10:44 INFO mapreduce.Job: map 100% reduce 78%
14/06/23 15:10:47 INFO mapreduce.Job: map 100% reduce 82%
14/06/23 15:10:50 INFO mapreduce.Job: map 100% reduce 86%
14/06/23 15:10:53 INFO mapreduce.Job: map 100% reduce 90%
14/06/23 15:10:57 INFO mapreduce.Job: map 100% reduce 94%
14/06/23 15:11:00 INFO mapreduce.Job: map 100% reduce 98%
14/06/23 15:11:01 INFO mapreduce.Job: map 100% reduce 100%
14/06/23 15:11:02 INFO mapreduce.Job: Job job_1403506647662_0001 completed successfully
14/06/23 15:11:02 INFO mapreduce.Job: Counters: 43
File System Counters
    FILE: Number of bytes read=594240678
    FILE: Number of bytes written=893918911
    FILE: Number of read operations=0
    FILE: Number of large read operations=0
    FILE: Number of write operations=0
    HDFS: Number of bytes read=264079739
    HDFS: Number of bytes written=158078539
    HDFS: Number of read operations=9
    HDFS: Number of large read operations=0
    HDFS: Number of write operations=2
Job Counters
    Launched map tasks=2
    Launched reduce tasks=1
    Data-local map tasks=2
    Total time spent by all maps in occupied slots (ms)=193332
    Total time spent by all reduces in occupied slots (ms)=39408
Map-Reduce Framework
    Map input records=16522439
    Map output records=16522439
    Map output bytes=264075431
    Map output materialized bytes=297120321
    Input split bytes=212
    Combine input records=0
    Combine output records=0
    Reduce input groups=3258984
    Reduce shuffle bytes=297120321
    Reduce input records=16522439
    Reduce output records=3258984
    Spilled Records=49567317
    Shuffled Maps =2
    Failed Shuffles=0
    Merged Map outputs=2
    GC time elapsed (ms)=6070
    CPU time spent (ms)=95830
    Physical memory (bytes) snapshot=458739712
    Virtual memory (bytes) snapshot=1995051008
    Total committed heap usage (bytes)=281157632
Shuffle Errors
    BAD_ID=0
    CONNECTION=0
    IO_ERROR=0
    WRONG_LENGTH=0
    WRONG_MAP=0
    WRONG_REDUCE=0
File Input Format Counters
    Bytes Read=264079527
File Output Format Counters
    Bytes Written=158078539
You can see that the output has been generated. Tailing the output files shows the per-patent citation lists.