Some thoughts after migrating to Hadoop 0.20.2

Source: Internet. Editor: 程序博客网. Time: 2024/05/17 23:27


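1. The problem

It has been about three months since Hadoop 0.20.2 was released. Until now I had been using the Hadoop distribution provided by http://www.cloudera.com/, which is based on Hadoop 0.18.3 because that is a fairly stable release. Recently, however, while using hypertable 0.9.2.7, my local JNI calls kept failing with "Hyperspace COMM already connected". The cause is that the hyperspace COMM is already in use, so the connection fails. Searching online, I found that the author acknowledges the problem and says it would not be very hard to fix. Reading the source, the error is thrown by the socket connection code, so fixing it means changing the hyperspace module. hyperspace is built on Oracle's Berkeley DB, which I do not know well, so I left it alone.

Instead I upgraded straight to 0.9.3.1 to see whether the problem had been solved. Disappointingly, it had not. What did change a lot is the thrift side: TableSplit for hypertable tables has been added to the thrift server, which is exactly what I wanted. It also lets me sidestep the hyperspace problem, because the thrift server creates only one Hypertable Client, so the "COMM already connected" error does not occur. Cell has also changed considerably, and the release uses the latest Hadoop 0.20.2.

So there was no choice but to upgrade everything together: Hadoop 0.18.3 -> Hadoop 0.20.2; hypertable 0.9.2.7 -> hypertable 0.9.3.1. My existing TableInputFormat and TableOutputFormat would apparently have to be modified too, which led to the notes below.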

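2. Some changes in Hadoop 0.20.2

The new version differs considerably in both directory layout and API. The layout changed a lot from 0.18 to 0.19 and again from 0.19 to 0.20. My first impression is that the code is becoming more and more modular, and it also looks clearer.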

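2.1 Changes to the directory layout

     There are three main directories: core, hdfs and mapred.
     * Functionality that used to be shared has been pulled into core, including conf, fs, io, ipc, net and record. Unix-like permission support has also been added.
     * hdfs now lives in its own directory, and its configuration has been split out into a file called hdfs-default.xml. The hdfs directory contains protocol, which provides the client-side communication protocols, plus server and tools. server is further divided into balancer, common, datanode, namenode and protocol. This second protocol directory holds the protocol between DataNode and NameNode, the protocol between DataNodes, and so on.
     * mapred is likewise separate, with its configuration split out into mapred-default.xml. It has two subdirectories. The first, mapred, holds some core mapreduce classes plus a number of Deprecated classes kept for backward compatibility; using those interfaces and classes is generally discouraged. The second, mapreduce, holds the public abstract classes and interfaces that you extend for your own needs. Inside it, a directory called lib provides commonly used input, output, map and reduce implementations supplied by the framework.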
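
The split into per-module default files can be pictured as a lookup that falls back from a site-specific override to the shipped default. The following is only a minimal sketch using java.util.Properties, not Hadoop's Configuration class. The idea of a separate "site" layer, the class name LayeredConfigSketch and the sample keys and values are assumptions made for illustration.

```java
import java.util.Properties;

// Hypothetical sketch: models "module defaults + local overrides" with plain
// java.util.Properties. It does not read hdfs-default.xml or use Hadoop code.
class LayeredConfigSketch {

    // Returns a Properties object in which entries from `site` win and any
    // key missing there falls back to `defaults`.
    static Properties overlay(Properties defaults, Properties site) {
        Properties merged = new Properties(defaults);
        for (String name : site.stringPropertyNames()) {
            merged.setProperty(name, site.getProperty(name));
        }
        return merged;
    }

    public static void main(String[] args) {
        // Stand-in for values shipped in a defaults file (illustrative values).
        Properties hdfsDefault = new Properties();
        hdfsDefault.setProperty("dfs.replication", "3");
        hdfsDefault.setProperty("dfs.block.size", "67108864");

        // Stand-in for a local override layer (assumed, not from the article).
        Properties hdfsSite = new Properties();
        hdfsSite.setProperty("dfs.replication", "2");

        Properties conf = overlay(hdfsDefault, hdfsSite);
        System.out.println(conf.getProperty("dfs.replication")); // overridden
        System.out.println(conf.getProperty("dfs.block.size"));  // from defaults
    }
}
```

Keeping one defaults file per module means an hdfs setting and a mapred setting no longer have to share a single large file, which matches the modular layout described above.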

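2.2 API changes

     The API also changed a great deal in Hadoop 0.20.2. Mainly, a number of interfaces became abstract classes to improve extensibility, and some refactoring was done along the way. The example below illustrates the changes.
        2.2.1 A Hadoop example
        This is the WordCount example that ships with Hadoop. It shows how the Map and Reduce interfaces changed, and how JobClient changed.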

        Hadoop 0.18.3
        -------------------------
        /**
         * Counts the words in each line.
         * For each line of input, break the line into words and emit them as
         * (<b>word</b>, <b>1</b>).
         */
        public static class MapClass extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {

          private final static IntWritable one = new IntWritable(1);
          private Text word = new Text();

          public void map(LongWritable key, Text value,
                          OutputCollector<Text, IntWritable> output,
                          Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer itr = new StringTokenizer(line);
            while (itr.hasMoreTokens()) {
              word.set(itr.nextToken());
              output.collect(word, one);
            }
          }
        }

        Hadoop 0.20.2
        --------------------
        public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

          private final static IntWritable one = new IntWritable(1);
          private Text word = new Text();

          // output and reporter are now folded into Context
          public void map(Object key, Text value, Context context
                          ) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
              word.set(itr.nextToken());
              context.write(word, one);
            }
          }
        }

        ===================

        Hadoop 0.18.3
        -------------------
        /**
         * A reducer class that just emits the sum of the input values.
         */
        public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {

          public void reduce(Text key, Iterator<IntWritable> values,
                             OutputCollector<Text, IntWritable> output,
                             Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
              sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
          }
        }

        Hadoop 0.20.2
        -------------------
        public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

          private IntWritable result = new IntWritable();

          // the same change shows up here
          public void reduce(Text key, Iterable<IntWritable> values,
                             Context context
                             ) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
              sum += val.get();
            }
            result.set(sum);
            // the old collect() call has moved into Context (so what exactly is Context?)
            context.write(key, result);
          }
        }

        ===================================

        Hadoop 0.18.3
        --------------------
        /**
         * The main driver for word count map/reduce program.
         * Invoke this method to submit the map/reduce job.
         * @throws IOException When there is communication problems with the
         *                     job tracker.
         */
        public int run(String[] args) throws Exception {
          JobConf conf = new JobConf(getConf(), WordCount.class);
          conf.setJobName("wordcount");

          // the keys are words (strings)
          conf.setOutputKeyClass(Text.class);
          // the values are counts (ints)
          conf.setOutputValueClass(IntWritable.class);

          conf.setMapperClass(MapClass.class);
          conf.setCombinerClass(Reduce.class);
          conf.setReducerClass(Reduce.class);

          List<String> other_args = new ArrayList<String>();
          for (int i = 0; i < args.length; ++i) {
            try {
              if ("-m".equals(args[i])) {
                conf.setNumMapTasks(Integer.parseInt(args[++i]));
              } else if ("-r".equals(args[i])) {
                conf.setNumReduceTasks(Integer.parseInt(args[++i]));
              } else {
                other_args.add(args[i]);
              }
            } catch (NumberFormatException except) {
              System.out.println("ERROR: Integer expected instead of " + args[i]);
              return printUsage();
            } catch (ArrayIndexOutOfBoundsException except) {
              System.out.println("ERROR: Required parameter missing from " +
                                 args[i-1]);
              return printUsage();
            }
          }
          // Make sure there are exactly 2 parameters left.
          if (other_args.size() != 2) {
            System.out.println("ERROR: Wrong number of parameters: " +
                               other_args.size() + " instead of 2.");
            return printUsage();
          }
          FileInputFormat.setInputPaths(conf, other_args.get(0));
          FileOutputFormat.setOutputPath(conf, new Path(other_args.get(1)));

          JobClient.runJob(conf);
          return 0;
        }

        Hadoop 0.20.2
        -------------------
        public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
          if (otherArgs.length != 2) {
            System.err.println("Usage: wordcount <in> <out>");
            System.exit(2);
          }
          // JobConf is no longer used directly; Job takes its place.
          // Job extends JobContext, which wraps a JobConf.
          Job job = new Job(conf, "word count");
          job.setJarByClass(WordCount.class);
          job.setMapperClass(TokenizerMapper.class);    // set the Mapper class
          job.setCombinerClass(IntSumReducer.class);    // set the Combiner class
          job.setReducerClass(IntSumReducer.class);     // set the Reducer class
          job.setOutputKeyClass(Text.class);            // output key type for Map and Reduce
          job.setOutputValueClass(IntWritable.class);   // output value type for Map and Reduce
          FileInputFormat.addInputPath(job, new Path(otherArgs[0]));    // input path
          FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));  // output path
          System.exit(job.waitForCompletion(true) ? 0 : 1);             // run the job
        }
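
To make the role of Context a little more concrete, here is a minimal, self-contained sketch of the idea that one object now does what OutputCollector and Reporter used to do separately. SketchContext and ContextSketch are hypothetical names invented for this illustration. They are not Hadoop classes, and the real Context offers more than this, so treat it only as a model of the signature change.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical stand-in for the new-style Context: a single object that both
// collects output pairs and accepts status updates. Not a Hadoop class.
class SketchContext {
    final List<String> output = new ArrayList<String>();
    String status = "";

    // Plays the role of the old OutputCollector.collect(key, value).
    void write(String key, int value) {
        output.add(key + "\t" + value);
    }

    // Plays the role of the old Reporter.setStatus(status).
    void setStatus(String newStatus) {
        status = newStatus;
    }
}

class ContextSketch {
    // New-style shape: (key, value, context) instead of
    // (key, value, output, reporter).
    static void map(Object key, String value, SketchContext context) {
        StringTokenizer itr = new StringTokenizer(value);
        while (itr.hasMoreTokens()) {
            context.write(itr.nextToken(), 1);
        }
        context.setStatus("processed line " + key);
    }

    public static void main(String[] args) {
        SketchContext ctx = new SketchContext();
        map(0L, "hello hadoop hello", ctx);
        System.out.println(ctx.output); // three (word, 1) pairs
        System.out.println(ctx.status);
    }
}
```

Passing one context object keeps the map and reduce signatures stable: new capabilities can be added to the context without changing every user-written method.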
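        Notes:
        1. Job appears to offer only setNumReduceTasks; there is no setNumMapTasks any more. The method exists in the old JobConf, but JobContext does not expose it.
        2. JobContext holds an instance of org.apache.hadoop.mapred.JobConf (composition), and that JobConf class is already deprecated, so JobConf is probably a transitional piece.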
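
The two notes describe a composition pattern: the job object wraps a configuration object and re-exports only some of its setters. The sketch below models that relationship with hypothetical classes (SketchJobConf, SketchJobContext, SketchJob). None of them are the real Hadoop types, and the default values are made up. The property names mapred.map.tasks and mapred.reduce.tasks are the keys I believe the old JobConf setters write to. Setting the map-task key directly on the wrapped configuration is shown as one possible workaround, on the assumption that the key is still honoured as a hint.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the deprecated JobConf: a bag of properties with
// convenience setters for both task counts.
class SketchJobConf {
    private final Map<String, String> props = new HashMap<String, String>();

    void setInt(String name, int value) {
        props.put(name, Integer.toString(value));
    }

    int getInt(String name, int defaultValue) {
        String v = props.get(name);
        return v == null ? defaultValue : Integer.parseInt(v);
    }

    void setNumMapTasks(int n)    { setInt("mapred.map.tasks", n); }
    void setNumReduceTasks(int n) { setInt("mapred.reduce.tasks", n); }
}

// Stand-in for JobContext: it holds a conf (composition) instead of being one.
class SketchJobContext {
    protected final SketchJobConf conf = new SketchJobConf();

    SketchJobConf getConfiguration() { return conf; }

    int getNumReduceTasks() { return conf.getInt("mapred.reduce.tasks", 1); }
}

// Stand-in for Job: re-exports only the reduce-task setter, as in note 1.
class SketchJob extends SketchJobContext {
    void setNumReduceTasks(int n) { conf.setNumReduceTasks(n); }
    // Deliberately no setNumMapTasks here.
}

class JobCompositionSketch {
    public static void main(String[] args) {
        SketchJob job = new SketchJob();
        job.setNumReduceTasks(4);
        // The map-task count goes through the wrapped configuration instead.
        job.getConfiguration().setInt("mapred.map.tasks", 8);
        System.out.println(job.getNumReduceTasks());
        System.out.println(job.getConfiguration().getInt("mapred.map.tasks", 2));
    }
}
```

Because the wrapper decides which setters to expose, the deprecated class can later be swapped out without breaking code written against the wrapper, which fits the guess that JobConf is transitional.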

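3. Some changes in Hypertable 0.9.3.1

    My main impression is that the thrift Java client changed a lot. To support mapreduce, a lot has been folded into the thrift server. The release adds a MapReduce connector, Hyperspace replication, DUMP TABLE and more. The thrift client gained InputFormat and OutputFormat, as well as TableSplit, which make it convenient to read a Hypertable table as key/value pairs. However, it does nothing with the range_location of each TableSplit and simply uses "localhost" as the host to connect to. I don't know why.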
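
For comparison, this is roughly what using the range location could look like. It is a sketch only: SketchTableSplit is a hypothetical class, not Hypertable's TableSplit, and its fields and the getLocations method are assumptions modelled on the general idea of an input split reporting preferred hosts.

```java
// Hypothetical split type for illustration; not the Hypertable class.
class SketchTableSplit {
    private final String startRow;
    private final String endRow;
    private final String rangeLocation; // host serving this range, may be null

    SketchTableSplit(String startRow, String endRow, String rangeLocation) {
        this.startRow = startRow;
        this.endRow = endRow;
        this.rangeLocation = rangeLocation;
    }

    // Hosts a scheduler could prefer for this split. Falls back to
    // "localhost" when no range location is known, which corresponds to the
    // behaviour described above.
    String[] getLocations() {
        if (rangeLocation == null || rangeLocation.isEmpty()) {
            return new String[] { "localhost" };
        }
        return new String[] { rangeLocation };
    }

    @Override
    public String toString() {
        return "[" + startRow + ", " + endRow + "] @ " + getLocations()[0];
    }

    public static void main(String[] args) {
        System.out.println(new SketchTableSplit("a", "m", "rs1.example.com"));
        System.out.println(new SketchTableSplit("m", "z", null));
    }
}
```

Reporting the real host would let a scheduler place the task next to the data; always answering "localhost" gives up that locality hint.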
    It looks like using kfs with Hypertable still means compiling from source and providing the kfs shared libraries. What a headache.