Using DistributedCache: symlink files so the job need not know their real paths


Source: http://www.cnblogs.com/xuxm2007/archive/2011/06/30/2094397.html

Hadoop's distributed cache mechanism lets every map or reduce task of a job read the same file. When a job is submitted, Hadoop copies the files given with the -files and -archives options to HDFS (the JobTracker's file system). Before a task runs, the TaskTracker copies those files from the JobTracker's file system to local disk as a cache, so the task can read them; the job itself never cares where the files came from. With DistributedCache, localized files are most conveniently accessed through a symbolic link: a file specified by the URI hdfs://namenode/test/input/file1#myfile is symlinked as myfile in the task's current working directory, so the job can open it simply as myfile without knowing the file's actual local path.
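The symlink name is simply the URI fragment, i.e. the part after `#`. A minimal, Hadoop-free sketch using only `java.net.URI` shows how such a cache URI decomposes (the path and link name are taken from the example above; the class name `CacheUriDemo` is just for illustration):

```java
import java.net.URI;

public class CacheUriDemo {
    public static void main(String[] args) throws Exception {
        // The part after '#' is the URI fragment; DistributedCache uses it
        // as the symlink name in the task's working directory.
        URI cacheUri = new URI("hdfs://namenode/test/input/file1#myfile");
        System.out.println("file on HDFS : " + cacheUri.getPath());
        System.out.println("symlink name : " + cacheUri.getFragment());
    }
}
```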

An example follows. The program creates a symbolic link, god.txt, that points to the HDFS file /test/file/file.1. The code can then open god.txt directly, without knowing the real localized path of /test/file/file.1.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount202 {

    // Read the cached file through its symlink name "god.txt"; the task
    // never needs to know where /test/file/file.1 was localized.
    public static void UseDistributedCacheBySymbolicLink() throws Exception {
        FileReader reader = new FileReader("god.txt");
        BufferedReader br = new BufferedReader(reader);
        String s1 = null;
        while ((s1 = br.readLine()) != null) {
            System.out.println(s1);
        }
        br.close();
        reader.close();
    }

    public static class TokenizerMapper extends
            Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        protected void setup(Context context) throws IOException,
                InterruptedException {
            System.out.println("Now, use the distributed cache and symlink");
            try {
                UseDistributedCacheBySymbolicLink();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class IntSumReducer extends
            Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values,
                Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args)
                .getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: wordcount <in> <out>");
            System.exit(2);
        }

        // Append "#god.txt" to the HDFS path so the file is symlinked
        // as god.txt in each task's working directory.
        DistributedCache.createSymlink(conf);
        String path = "/test/file/file.1";
        Path filePath = new Path(path);
        String uriWithLink = filePath.toUri().toString() + "#" + "god.txt";
        DistributedCache.addCacheFile(new URI(uriWithLink), conf);

        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCount202.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

When the job runs, the printed contents of /test/file/file.1 can be seen in the task logs on the JobTracker.

If a program needs many small files, accessing them through symbolic links is very convenient.
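For the many-small-files case, the "path#linkname" URIs can be built mechanically from the files' base names. The sketch below is a Hadoop-free illustration of that idea; the helper name `withSymlinks` is ours, not a Hadoop API, and in a real driver each resulting URI would be passed to DistributedCache.addCacheFile as in the example above:

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

public class SymlinkUris {
    // Build "hdfs-path#basename" URIs so each cached file is symlinked
    // under its own base name in the task's working directory.
    public static List<URI> withSymlinks(List<String> hdfsPaths) throws Exception {
        List<URI> uris = new ArrayList<>();
        for (String p : hdfsPaths) {
            String base = p.substring(p.lastIndexOf('/') + 1);
            uris.add(new URI(p + "#" + base));
        }
        return uris;
    }

    public static void main(String[] args) throws Exception {
        List<URI> uris = withSymlinks(
                List.of("/test/file/file.1", "/test/file/file.2"));
        for (URI u : uris) {
            System.out.println(u);
        }
    }
}
```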