Pitfalls When Sharing Archives via the Hadoop Distributed Cache (Using a MapFile as the Example)

Source: Internet | Editor: 程序博客网 | Time: 2024/05/17 00:56

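In Hadoop distributed processing, read-only data that map and reduce tasks need to share can be placed in the job configuration (JobConf). According to Hadoop: The Definitive Guide, however, this is a poor fit when the read-only data shared across compute nodes is large, because the size of the configuration is limited by the memory available on each compute node; in that case the Distributed Cache is the better choice. Before the job starts, files or archives can be added to the Distributed Cache with the following code: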

    // Add a single file
    DistributedCache.addCacheFile(new URI("/myapp/data.data"), jobconf);
    // To share a directory, first pack it into an archive such as .zip or .tar.gz
    DistributedCache.addCacheArchive(new URI("/myapp/archive.zip"), jobconf);

Inside a Mapper or Reducer, the shared files or archives can then be retrieved as follows:

    public static class MapClass extends MapReduceBase
            implements Mapper<K, V, K, V> {

        private Path[] localArchives;
        private Path[] localFiles;

        public void configure(JobConf job) {
            try {
                // Local paths of the shared archives. The DistributedCache unpacks
                // each archive automatically, so every path here is a directory.
                localArchives = DistributedCache.getLocalCacheArchives(job);
                // Local paths of the shared files.
                localFiles = DistributedCache.getLocalCacheFiles(job);
            } catch (IOException e) {
                // Both getters declare IOException, which configure() cannot rethrow as-is.
                throw new RuntimeException("cannot read the distributed cache", e);
            }
        }

        public void map(K key, V value,
                        OutputCollector<K, V> output, Reporter reporter)
                throws IOException {
            // Use data from the cached archives/files here
            // ...
            output.collect(key, value);
        }
    }

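However, when archives are shared through the DistributedCache, the path returned is not exactly the directory produced by unpacking the original archive. It contains other entries as well, so some care is needed when working with it.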

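The following uses a shared MapFile as the example. As the name suggests, a MapFile is an on-disk version of a Map structure that supports fast lookup by key. On disk, a MapFile is stored as a directory.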

In the figure, P is the top-level directory. Beneath it is a single MapFile, part-00000, which consists of a data file and an index file.
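The layout described above can be checked with plain file-system calls before a directory is handed to Hadoop. The following is a minimal sketch that uses only the Java standard library; the class name `MapFileLayout` and the method `looksLikeMapFile` are hypothetical names introduced for illustration, and the check only looks at file names, not at file contents.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class MapFileLayout {

    /** Returns true if dir is a directory holding regular files named "data" and "index". */
    public static boolean looksLikeMapFile(Path dir) {
        return Files.isDirectory(dir)
                && Files.isRegularFile(dir.resolve("data"))
                && Files.isRegularFile(dir.resolve("index"));
    }

    public static void main(String[] args) throws IOException {
        // Build a throwaway P/part-00000/{data,index} tree to exercise the check.
        Path p = Files.createTempDirectory("P");
        Path part = Files.createDirectory(p.resolve("part-00000"));
        Files.createFile(part.resolve("data"));
        Files.createFile(part.resolve("index"));

        System.out.println(looksLikeMapFile(part)); // true: part-00000 holds data and index
        System.out.println(looksLikeMapFile(p));    // false: P itself only holds part-00000
    }
}
```

Note that the check succeeds for part-00000 and not for P: P is only the container that holds one or more MapFiles.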

To add P to the DistributedCache for a MapReduce job, first compress P into a zip file.

Then add the following statement to the code that runs before the job starts:

    // args[4] is the path of P.zip, which normally has to be uploaded to HDFS first
    DistributedCache.addCacheArchive(new URI(args[4]), conf);
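The text does not show how P.zip is produced; any zip tool will do. As one possible approach, the sketch below builds the archive with `java.util.zip` and keeps the directory's own name as the top-level entry prefix, so that unpacking yields a directory named P. The class name `ZipDirectory` and its `zip` method are hypothetical, and uploading the result to HDFS is left out.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class ZipDirectory {

    /** Zips dir into zipFile; entry names start with dir's own name, e.g. "P/part-00000/data". */
    public static void zip(Path dir, Path zipFile) throws IOException {
        Path absDir = dir.toAbsolutePath();
        Path base = absDir.getParent(); // assumes dir is not a file-system root

        // Collect the regular files first, so the archive being written is never walked.
        List<Path> files;
        try (Stream<Path> walk = Files.walk(absDir)) {
            files = walk.filter(Files::isRegularFile).sorted().collect(Collectors.toList());
        }

        try (OutputStream out = Files.newOutputStream(zipFile);
             ZipOutputStream zos = new ZipOutputStream(out)) {
            for (Path file : files) {
                // Zip entry names use '/' regardless of the platform separator.
                String name = base.relativize(file).toString().replace('\\', '/');
                zos.putNextEntry(new ZipEntry(name));
                Files.copy(file, zos);
                zos.closeEntry();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: ZipDirectory <directory> <zip-file>");
            return;
        }
        zip(Paths.get(args[0]), Paths.get(args[1]));
    }
}
```

Keeping the `P/` prefix in the entry names is a deliberate choice here: it is what makes the name of the unpacked directory predictable on the task side.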

Then add code like the following to the Mapper or Reducer to obtain the MapFile readers:

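    // Imports needed: java.io.File, java.io.IOException,
    // org.apache.hadoop.filecache.DistributedCache, org.apache.hadoop.fs.FileSystem,
    // org.apache.hadoop.fs.Path, org.apache.hadoop.io.MapFile, org.apache.hadoop.mapred.*
    public static class MapClass extends MapReduceBase
            implements Mapper<K, V, K, V> {

        private MapFile.Reader[] readers = null;
        private Path P_PATH;
        private Path[] localArchives;
        private String p_path_name;

        public void configure(JobConf conf) {
            // Name of the directory inside the archive (P in this example);
            // it is expected to be present in the job configuration.
            p_path_name = conf.get("P_path_name");
            try {
                localArchives = DistributedCache.getLocalCacheArchives(conf);
            } catch (IOException e) {
                System.out.println("get cache files error");
            }
            // Guard against a failed lookup or an empty cache before dereferencing.
            if (localArchives == null || localArchives.length == 0) {
                System.out.println("no local archives found");
                return;
            }
            System.out.println("num of LocalArchives is " + localArchives.length);
            P_PATH = localArchives[0];

            Path path = this.P_PATH;
            try {
                File zipDir = new File(path.toString());
                String[] zipFiles = zipDir.list();
                String mapfileDir = null;
                for (int i = 0; zipFiles != null && i < zipFiles.length; ++i) {
                    if (zipFiles[i].equals(this.p_path_name)) {
                        // This is the directory that really holds the MapFiles. The
                        // "file://" prefix marks it as local: the DistributedCache puts
                        // shared directories and files on the local file system.
                        mapfileDir = "file://" + path.toString() + "/" + zipFiles[i];
                    }
                    System.out.println("zipFiles[i] is " + zipFiles[i]);
                }
                System.out.println("mapfileDir is " + mapfileDir);
                if (mapfileDir == null) {
                    System.out.println("extracted directory not found");
                    return;
                }
                Path mapFilePath = new Path(mapfileDir);
                FileSystem newFs = mapFilePath.getFileSystem(conf);
                System.out.println("mapfile path is " + mapFilePath.toString());
                // Get the MapFile readers. If the directory holds several MapFiles,
                // one call returns a reader for each of them.
                this.readers = MapFileOutputFormat.getReaders(newFs, mapFilePath, conf);
            } catch (IOException e) {
                System.out.println("fail to create mapfile readers");
            }
        }

        // map(...) is omitted here; it would look up keys through this.readers.
    }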
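The mapper reads `P_path_name` from the job configuration, but the text does not show where that value comes from. One option, sketched below under the assumption that the top-level directory inside the archive has the same name as the archive minus its extension, is to derive it from the archive URI in the driver. `ArchiveNames` and `baseName` are hypothetical names, and the list of suffixes is only a guess at the formats of interest.

```java
import java.net.URI;
import java.util.Locale;

public class ArchiveNames {

    // Longer suffixes come first so that ".tar.gz" wins over ".tar".
    private static final String[] SUFFIXES = {".tar.gz", ".tgz", ".zip", ".tar", ".jar"};

    /** Strips the directory part and a known archive suffix from the URI path. */
    public static String baseName(URI archive) {
        String path = archive.getPath(); // assumes a hierarchical URI with a path
        String file = path.substring(path.lastIndexOf('/') + 1);
        String lower = file.toLowerCase(Locale.ROOT);
        for (String suffix : SUFFIXES) {
            if (lower.endsWith(suffix)) {
                return file.substring(0, file.length() - suffix.length());
            }
        }
        return file; // unknown suffix: return the file name unchanged
    }

    public static void main(String[] args) {
        System.out.println(baseName(URI.create("/myapp/P.zip"))); // P
    }
}
```

The driver could then store the result with `conf.set("P_path_name", ...)` before submitting the job. If the archive was built with a different top-level directory name, the value has to be supplied by hand instead.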

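In this example, the path returned by DistributedCache.getLocalCacheArchives(conf) is named ....../P.zip, but it is a directory. It contains:

P.zip ------ a file: the original zip file

P ------ a directory: the result of unpacking the zip, and the directory we are looking for

    It contains part-00000, which is the actual MapFile

.P.zip.crc ------ a checksum file for P.zip (most likely written by Hadoop's checksummed local file system when the archive was copied to the node, rather than by zip itself)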

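So after obtaining the P.zip path, a further lookup is needed before the final extracted directory can be located. This is the point to watch out for when sharing archives through the DistributedCache.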
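The extra lookup can be isolated in a small helper that depends only on `java.io.File`, which makes it easy to test outside a cluster. This is a sketch, not part of Hadoop: `LocalizedArchive` and `findExtractedDir` are hypothetical names, and the expected directory name (P here) is assumed to be known in advance.

```java
import java.io.File;

public class LocalizedArchive {

    /**
     * Returns the sub-directory called expectedName inside the localized archive
     * directory, or null if there is none. Plain files, such as the original zip
     * and its hidden .crc companion, are skipped.
     */
    public static File findExtractedDir(File localizedArchive, String expectedName) {
        File[] entries = localizedArchive.listFiles(); // null if not a readable directory
        if (entries == null) {
            return null;
        }
        for (File entry : entries) {
            if (entry.isDirectory() && entry.getName().equals(expectedName)) {
                return entry;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        if (args.length != 2) {
            System.err.println("usage: LocalizedArchive <localized-dir> <expected-name>");
            return;
        }
        File found = findExtractedDir(new File(args[0]), args[1]);
        System.out.println(found == null ? "not found" : found.getAbsolutePath());
    }
}
```

Inside `configure()`, the first argument would be `new File(localArchives[0].toString())`. Unlike a comparison on names alone, the `isDirectory()` test also rules out a regular file that happens to carry the expected name.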