Using DistributedCache: symlink files so the job need not know their real paths


Source: http://www.cnblogs.com/xuxm2007/archive/2011/06/30/2094397.html

Hadoop's distributed cache mechanism lets every map or reduce task of a job read the same file. When a job is submitted, Hadoop copies the files given with the -files and -archives options to HDFS (the JobTracker's file system). Before a task runs, the TaskTracker copies those files from the JobTracker's file system to local disk as a cache, so the task can read them; the job itself never cares where the files came from. With DistributedCache, localized files are most conveniently accessed through a symbolic link: a file specified by the URI hdfs://namenode/test/input/file1#myfile is symlinked as myfile in the task's current working directory, so the job can open it simply as myfile without knowing the file's actual local path.
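The symlink name is simply the URI fragment, i.e. the part after `#`. A minimal, Hadoop-free sketch using only `java.net.URI` shows how such a cache URI decomposes (the path and link name are taken from the example above; the class name `CacheUriDemo` is just for illustration):

```java
import java.net.URI;

public class CacheUriDemo {
    public static void main(String[] args) throws Exception {
        // The part after '#' is the URI fragment; DistributedCache uses it
        // as the symlink name in the task's working directory.
        URI cacheUri = new URI("hdfs://namenode/test/input/file1#myfile");
        System.out.println("file on HDFS : " + cacheUri.getPath());
        System.out.println("symlink name : " + cacheUri.getFragment());
    }
}
```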

An example follows. The program creates a symbolic link, god.txt, that points to the HDFS file /test/file/file.1. The code can then open god.txt directly, without knowing the real localized path of /test/file/file.1.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount202 {

    // Read the cached file through its symlink name "god.txt"; the task
    // never needs to know where /test/file/file.1 was localized.
    public static void UseDistributedCacheBySymbolicLink() throws Exception {
        FileReader reader = new FileReader("god.txt");
        BufferedReader br = new BufferedReader(reader);
        String s1 = null;
        while ((s1 = br.readLine()) != null) {
            System.out.println(s1);
        }
        br.close();
        reader.close();
    }

    public static class TokenizerMapper extends
            Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        protected void setup(Context context) throws IOException,
                InterruptedException {
            System.out.println("Now, use the distributed cache and symlink");
            try {
                UseDistributedCacheBySymbolicLink();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class IntSumReducer extends
            Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values,
                Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args)
                .getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: wordcount <in> <out>");
            System.exit(2);
        }

        // Append "#god.txt" to the HDFS path so the file is symlinked
        // as god.txt in each task's working directory.
        DistributedCache.createSymlink(conf);
        String path = "/test/file/file.1";
        Path filePath = new Path(path);
        String uriWithLink = filePath.toUri().toString() + "#" + "god.txt";
        DistributedCache.addCacheFile(new URI(uriWithLink), conf);

        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCount202.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

When the job runs, the printed contents of /test/file/file.1 can be seen in the task logs on the JobTracker.

If a program needs many small files, accessing them through symbolic links is very convenient.
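For the many-small-files case, the "path#linkname" URIs can be built mechanically from the files' base names. The sketch below is a Hadoop-free illustration of that idea; the helper name `withSymlinks` is ours, not a Hadoop API, and in a real driver each resulting URI would be passed to DistributedCache.addCacheFile as in the example above:

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

public class SymlinkUris {
    // Build "hdfs-path#basename" URIs so each cached file is symlinked
    // under its own base name in the task's working directory.
    public static List<URI> withSymlinks(List<String> hdfsPaths) throws Exception {
        List<URI> uris = new ArrayList<>();
        for (String p : hdfsPaths) {
            String base = p.substring(p.lastIndexOf('/') + 1);
            uris.add(new URI(p + "#" + base));
        }
        return uris;
    }

    public static void main(String[] args) throws Exception {
        List<URI> uris = withSymlinks(
                List.of("/test/file/file.1", "/test/file/file.2"));
        for (URI u : uris) {
            System.out.println(u);
        }
    }
}
```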