Hadoop programming conventions (Hadoop patent analysis)

Source: Internet  Editor: 程序博客网  Time: 2024/04/29 02:09

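There are plenty of Hadoop examples online, but even something as simple as WordCount is written differently from one example to the next. We cannot keep running other people's examples forever, so it pays to settle on a routine of our own that we can adapt quickly even when the API changes. Hadoop patent analysis serves as the guinea pig here.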
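Right-click to create a new Map/Reduce project, then right-click the project and create a Mapper, a Reducer, and a MapReduce Driver. In the MapReduce Driver, fill in the class names of the Mapper and Reducer you just created. Once the driver is generated, change its paths to args[0] and args[1]. Then, under Run As, choose Run Configurations, click Java Application, and set Arguments to something like:
hdfs://master:9000/user/input/file1.txt
hdfs://master:9000/user/aa
With that, the routine is in place.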
Next, let's walk through the "patent analysis" case piece by piece.

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class MapClass extends Mapper<LongWritable, Text, Text, Text> {

        @Override
        public void map(LongWritable ikey, Text ivalue, Context context)
                throws IOException, InterruptedException {
            // Each input line holds two comma-separated fields:
            // patent number, cited patent number.
            String[] citation = ivalue.toString().split(",");
            // Swap the fields so the cited patent becomes the key.
            context.write(new Text(citation[1]), new Text(citation[0]));
        }
    }

The source file looks like this:

    Patent number , Cited patent number
    K1            , V1
    K2            , V2
    K3            , V3
    K1            , V3

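LongWritable ikey is the byte offset of the current line within the input file (the driver does not set an input format, so the default TextInputFormat applies); it is not used here.
ivalue is the content of that line.
    String[] citation = ivalue.toString().split(",");
splits the line on commas.
    context.write(new Text(citation[1]), new Text(citation[0]));
Context is the context in which the MapReduce task runs. It exposes the job's information, such as its configuration, and is where output is written.
This call writes a pair whose key is the cited patent number and whose value is the patent number that cites it. Before calling the reducer, Hadoop groups all values that share the same key, so the values are merged automatically, i.e.: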

    Cited patent number    Patent numbers that cite it
    V1                     K1
    V2                     K2
    V3                     K3,K1
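The inversion and grouping can be tried without a cluster. What follows is a minimal sketch in plain Java, not Hadoop code: the class name CitationInverter and its invert method are hypothetical, and a TreeMap stands in for the grouping that Hadoop performs between map and reduce. As an extra assumption, it skips lines that do not have two fields, which the mapper shown earlier does not do.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: imitates map + grouping on an in-memory list of lines.
class CitationInverter {

    // Returns cited patent -> patents that cite it, in input order.
    static Map<String, List<String>> invert(List<String> lines) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String line : lines) {
            String[] citation = line.split(",");
            if (citation.length < 2) {
                continue; // assumption: malformed lines are silently skipped
            }
            String citing = citation[0].trim();
            String cited = citation[1].trim();
            // Same swap as the mapper: the cited patent becomes the key.
            grouped.computeIfAbsent(cited, k -> new ArrayList<>()).add(citing);
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("K1,V1", "K2,V2", "K3,V3", "K1,V3");
        System.out.println(invert(lines)); // {V1=[K1], V2=[K2], V3=[K3, K1]}
    }
}
```

In a real job the order of values within a key is not guaranteed; the sketch keeps input order only because it runs in a single loop.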

    import java.io.IOException;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class Reduce extends Reducer<Text, Text, Text, Text> {

        @Override
        public void reduce(Text _key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // Join every value for this key into one comma-separated string.
            String csv = "";
            for (Text val : values) {
                if (csv.length() > 0) {
                    csv += ",";
                }
                csv += val.toString();
            }
            context.write(_key, new Text(csv));
        }
    }

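Text _key, Iterable<Text> values: this is what the map phase hands over once the values have been grouped by key, for example the key V3 with the values K3 and K1.

    String csv = "";
    for (Text val : values) {
        if (csv.length() > 0) {
            csv += ",";
        }
        csv += val.toString();
    }
    context.write(_key, new Text(csv));

This joins the values with commas so the output is easier to read.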
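The same join can also be written with a StringBuilder, which avoids building a new String on every += inside the loop. This is a sketch in plain Java, assuming the values are ordinary strings rather than Hadoop Text objects; the class name CsvJoin and its join method are hypothetical.

```java
import java.util.Arrays;

// Hypothetical helper mirroring the reducer's loop with a StringBuilder.
class CsvJoin {

    // Concatenates the values, placing a comma between consecutive items.
    static String join(Iterable<String> values) {
        StringBuilder csv = new StringBuilder();
        for (String val : values) {
            if (csv.length() > 0) {
                csv.append(',');
            }
            csv.append(val);
        }
        return csv.toString();
    }

    public static void main(String[] args) {
        System.out.println(join(Arrays.asList("K3", "K1"))); // K3,K1
    }
}
```

Inside the reducer, the equivalent would be to append val.toString() for each Text value and write new Text(csv.toString()) at the end.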

Finally, the driver, which is also auto-generated code. Right-click, choose Run on Hadoop, and select the configuration you just set up.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class Driver {

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "JobName");
            job.setJarByClass(Driver.class);
            job.setMapperClass(MapClass.class);
            job.setReducerClass(Reduce.class);

            // TODO: specify output types
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);

            // TODO: specify input and output DIRECTORIES (not files)
            FileInputFormat.setInputPaths(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            // Report failure through the exit code instead of returning silently.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }
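Because the driver reads its paths from args[0] and args[1], running it with fewer than two arguments fails with an ArrayIndexOutOfBoundsException. Below is a minimal sketch of a check that could run before the job is built, written in plain Java so it can be tried without Hadoop. The class name DriverArgs, the method requirePaths, and the usage message are all hypothetical.

```java
// Hypothetical helper: validates the two path arguments the driver expects.
class DriverArgs {

    // Returns {input, output}, or throws if fewer than two arguments are given.
    static String[] requirePaths(String[] args) {
        if (args == null || args.length < 2) {
            throw new IllegalArgumentException(
                    "Usage: Driver <input path> <output path>");
        }
        return new String[] { args[0], args[1] };
    }

    public static void main(String[] args) {
        String[] paths = requirePaths(new String[] {
            "hdfs://master:9000/user/input/file1.txt",
            "hdfs://master:9000/user/aa"
        });
        System.out.println("input=" + paths[0] + " output=" + paths[1]);
    }
}
```

In the driver itself, the two returned strings would then be wrapped in new Path(...) exactly as args[0] and args[1] are.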

0 0