Repost: Calculating a Co-Occurrence Matrix with Hadoop


Reposted from: http://juliashine.com/calculating-a-co-occurrence-matrix-with-hadoop/

This article is a translation of "Calculating A Co-Occurrence Matrix with Hadoop".

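This article continues a series on the MapReduce algorithms presented in the book "Data-Intensive Text Processing with MapReduce". This time we will build a word co-occurrence matrix from a corpus of text.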

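A co-occurrence matrix can be described as follows: take some event, place a window of time or space around it, and record which other events occur inside that window. In this article the "events" are the individual words found in a text, and we record which other words appear inside the window, where the window is defined by position relative to the target word. For example, consider the sentence "The quick brown fox jumped over the lazy dog". With a window of 2, the co-occurrences for the word "jumped" are [brown, fox, over, the]. A co-occurrence matrix can be applied to any situation where you need to investigate what else happened when a given event occurred. We will build the text co-occurrence matrix using the Pairs and Stripes algorithms described in Chapter 3 of "Data-Intensive Text Processing with MapReduce". The text used to build the matrix is the complete works of William Shakespeare.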
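The window rule can be checked with a few lines of plain Java. This is only a sketch: the class name WindowExample and the method neighborsOf are made up for illustration, and no Hadoop classes are involved.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original post's code.
class WindowExample {
    // Returns the words within `window` positions of tokens[index],
    // excluding the word at `index` itself.
    static List<String> neighborsOf(String[] tokens, int index, int window) {
        List<String> result = new ArrayList<>();
        int start = Math.max(0, index - window);                 // clamp at the left edge
        int end = Math.min(tokens.length - 1, index + window);   // clamp at the right edge
        for (int j = start; j <= end; j++) {
            if (j != index) {
                result.add(tokens[j]);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String[] tokens = "The quick brown fox jumped over the lazy dog".split("\\s+");
        // "jumped" is at index 4
        System.out.println(neighborsOf(tokens, 4, 2)); // prints [brown, fox, over, the]
    }
}
```

Words near the start or end of a line simply get fewer neighbors, because the window is clamped to the bounds of the array.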

The Pairs Algorithm

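Implementing the pairs algorithm is straightforward. Each call to the map function receives one line of text, which is split on whitespace into an array of strings. Two nested loops follow: the outer loop visits each word in the array, and the inner loop visits the neighbors of the current word. The number of inner-loop iterations depends on the neighbor distance we want to capture around the current word. At the end of each inner-loop iteration we emit a WordPair object (made up of two words, the current word on the left and the neighbor on the right) as the key, with an occurrence count (always one at this stage) as the value. Here is the code for the pairs implementation: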

 
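 import java.io.IOException;
 import org.apache.hadoop.io.IntWritable;
 import org.apache.hadoop.io.LongWritable;
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.mapreduce.Mapper;

 // WordPair is a custom key class that is not shown in this article.
 public class PairsOccurrenceMapper extends Mapper<LongWritable, Text, WordPair, IntWritable> {
     private WordPair wordPair = new WordPair();
     private IntWritable ONE = new IntWritable(1);

     @Override
     protected void map(LongWritable key, Text value, Context context)
                                throws IOException, InterruptedException {
         // Window size: how many positions on each side count as neighbors (default 2).
         int neighbors = context.getConfiguration().getInt("neighbors", 2);
         String[] tokens = value.toString().split("\\s+");
         if (tokens.length > 1) {
             for (int i = 0; i < tokens.length; i++) {
                 wordPair.setWord(tokens[i]);

                 // Clamp the window to the bounds of the line.
                 int start = (i - neighbors < 0) ? 0 : i - neighbors;
                 int end = (i + neighbors >= tokens.length) ? tokens.length - 1 : i + neighbors;
                 for (int j = start; j <= end; j++) {
                     if (j == i) continue;   // skip the current word itself
                     wordPair.setNeighbor(tokens[j]);
                     context.write(wordPair, ONE);
                 }
             }
         }
     }
 }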

The Reducer for the pairs algorithm simply sums the counts for each WordPair key.

 
 import java.io.IOException;
 import org.apache.hadoop.io.IntWritable;
 import org.apache.hadoop.mapreduce.Reducer;

 public class PairsReducer extends Reducer<WordPair, IntWritable, WordPair, IntWritable> {
     private IntWritable totalCount = new IntWritable();

     @Override
     protected void reduce(WordPair key, Iterable<IntWritable> values, Context context)
                               throws IOException, InterruptedException {
         // Sum every count emitted for this (word, neighbor) pair.
         int count = 0;
         for (IntWritable value : values) {
             count += value.get();
         }
         totalCount.set(count);
         context.write(key, totalCount);
     }
 }
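To see what the pairs mapper and reducer produce together without running a cluster, the same logic can be imitated with ordinary Java collections. The sketch below is an approximation under stated assumptions: PairsSketch, its map and reduce methods, and the "word|neighbor" string key are all hypothetical stand-ins for the Hadoop classes and for WordPair.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java imitation of the pairs approach; not Hadoop code.
class PairsSketch {
    // "Map" step: emit one ("word|neighbor", 1) record per co-occurrence in a line.
    static List<Map.Entry<String, Integer>> map(String line, int window) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        String[] tokens = line.trim().split("\\s+");
        for (int i = 0; i < tokens.length; i++) {
            int start = Math.max(0, i - window);
            int end = Math.min(tokens.length - 1, i + window);
            for (int j = start; j <= end; j++) {
                if (j == i) continue; // a word is not its own neighbor
                emitted.add(new SimpleEntry<>(tokens[i] + "|" + tokens[j], 1));
            }
        }
        return emitted;
    }

    // "Reduce" step: sum the counts of all records that share a pair key.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> records) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> record : records) {
            totals.merge(record.getKey(), record.getValue(), Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> records = map("a b a b", 1);
        System.out.println(records.size());  // prints 6
        System.out.println(reduce(records)); // prints {a|b=3, b|a=3}
    }
}
```

Six records are emitted for a four-word line with a window of 1, which hints at how quickly the pairs approach multiplies the volume of intermediate data.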

The Stripes Algorithm

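Implementing the stripes approach to co-occurrence is just as simple. Unlike the pairs approach, however, all the neighbors of a word are collected in a hash map, with the neighbor as the key and the number of times it occurs as the value. Once all neighbors of a word have been visited, the word and its associated hash map are emitted. Here is the code for the stripes implementation: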

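 import java.io.IOException;
 import org.apache.hadoop.io.IntWritable;
 import org.apache.hadoop.io.LongWritable;
 import org.apache.hadoop.io.MapWritable;
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.mapreduce.Mapper;

 public class StripesOccurrenceMapper extends Mapper<LongWritable, Text, Text, MapWritable> {
     private MapWritable occurrenceMap = new MapWritable();
     private Text word = new Text();

     @Override
     protected void map(LongWritable key, Text value, Context context)
                                throws IOException, InterruptedException {
         // Window size: how many positions on each side count as neighbors (default 2).
         int neighbors = context.getConfiguration().getInt("neighbors", 2);
         String[] tokens = value.toString().split("\\s+");
         if (tokens.length > 1) {
             for (int i = 0; i < tokens.length; i++) {
                 word.set(tokens[i]);
                 occurrenceMap.clear();   // start a fresh stripe for this word

                 // Clamp the window to the bounds of the line.
                 int start = (i - neighbors < 0) ? 0 : i - neighbors;
                 int end = (i + neighbors >= tokens.length) ? tokens.length - 1 : i + neighbors;
                 for (int j = start; j <= end; j++) {
                     if (j == i) continue;   // skip the current word itself
                     Text neighbor = new Text(tokens[j]);
                     if (occurrenceMap.containsKey(neighbor)) {
                         IntWritable count = (IntWritable) occurrenceMap.get(neighbor);
                         count.set(count.get() + 1);
                     } else {
                         occurrenceMap.put(neighbor, new IntWritable(1));
                     }
                 }
                 // Emit the word together with its whole stripe as one record.
                 context.write(word, occurrenceMap);
             }
         }
     }
 }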
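The stripe built for a single word can be illustrated with a plain java.util.Map. This is a rough sketch rather than Hadoop code: StripesMapSketch and stripeFor are hypothetical names, and a HashMap of String to Integer stands in for MapWritable.

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java imitation of how one stripe is built; not Hadoop code.
class StripesMapSketch {
    // Builds the stripe (neighbor -> count) for the token at `index`.
    static Map<String, Integer> stripeFor(String[] tokens, int index, int window) {
        Map<String, Integer> stripe = new HashMap<>();
        int start = Math.max(0, index - window);
        int end = Math.min(tokens.length - 1, index + window);
        for (int j = start; j <= end; j++) {
            if (j != index) {
                // Insert 1 for a new neighbor, otherwise add 1 to its count.
                stripe.merge(tokens[j], 1, Integer::sum);
            }
        }
        return stripe;
    }

    public static void main(String[] args) {
        String[] tokens = {"a", "b", "a", "c"};
        // Stripe for "b" (index 1) with a window of 2: "a" appears twice, "c" once.
        Map<String, Integer> stripe = stripeFor(tokens, 1, 2);
        System.out.println(stripe.get("a") + " " + stripe.get("c")); // prints 2 1
    }
}
```

For this word the pairs approach would emit three separate records, while the stripes approach emits a single record carrying the whole map.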

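The Reducer for the stripes algorithm is slightly more involved, because for each key we have to iterate over a collection of maps and then over every entry in each map.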

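 import java.io.IOException;
 import java.util.Set;
 import org.apache.hadoop.io.IntWritable;
 import org.apache.hadoop.io.MapWritable;
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.io.Writable;
 import org.apache.hadoop.mapreduce.Reducer;

 public class StripesReducer extends Reducer<Text, MapWritable, Text, MapWritable> {
     private MapWritable incrementingMap = new MapWritable();

     @Override
     protected void reduce(Text key, Iterable<MapWritable> values, Context context)
                                throws IOException, InterruptedException {
         // Merge every stripe received for this word into one combined stripe.
         incrementingMap.clear();
         for (MapWritable value : values) {
             addAll(value);
         }
         context.write(key, incrementingMap);
     }

     // Element-wise sum: add each neighbor count from mapWritable into incrementingMap.
     private void addAll(MapWritable mapWritable) {
         Set<Writable> keys = mapWritable.keySet();
         for (Writable key : keys) {
             IntWritable fromCount = (IntWritable) mapWritable.get(key);
             if (incrementingMap.containsKey(key)) {
                 IntWritable count = (IntWritable) incrementingMap.get(key);
                 count.set(count.get() + fromCount.get());
             } else {
                 incrementingMap.put(key, fromCount);
             }
         }
     }
 }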
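The merge performed by the stripes reducer is an element-wise sum of maps. A minimal sketch with ordinary Java collections follows; StripesReduceSketch and mergeStripes are hypothetical names, and a TreeMap is used only so that the printed order is predictable.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java imitation of the stripes reduce step; not Hadoop code.
class StripesReduceSketch {
    // Sums the counts of every neighbor across all stripes of one word.
    static Map<String, Integer> mergeStripes(List<Map<String, Integer>> stripes) {
        Map<String, Integer> total = new TreeMap<>();
        for (Map<String, Integer> stripe : stripes) {
            for (Map.Entry<String, Integer> entry : stripe.entrySet()) {
                total.merge(entry.getKey(), entry.getValue(), Integer::sum);
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Two stripes for the same word, e.g. produced from two different lines.
        Map<String, Integer> first = Map.of("fox", 1, "over", 2);
        Map<String, Integer> second = Map.of("over", 1, "the", 3);
        System.out.println(mergeStripes(List.of(first, second))); // prints {fox=1, over=3, the=3}
    }
}
```

Neighbors that appear in only one stripe are copied across unchanged, and neighbors that appear in several have their counts added together.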

Conclusion

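Comparing the two approaches, the pairs algorithm generates more key-value pairs than the stripes algorithm. In addition, the pairs algorithm captures each co-occurrence event individually, whereas the stripes algorithm captures all co-occurrences for a given word in a single record. Both implementations are well suited to a Combiner. Because the results they produce are commutative and associative [Translator's note: data can only be used with a combiner if the operation satisfies the commutative and associative laws; I forget which document stated this], we can simply reuse the reducer as the Combiner. As mentioned earlier, co-occurrence matrices are useful beyond text processing and are an important tool to have at hand. Thanks for reading.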
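Why commutativity and associativity matter for a Combiner can be shown with a small plain-Java check. The class CombinerSketch and its sum method are hypothetical; in a real job the reducer class would normally be registered as the combiner through Hadoop's Job.setCombinerClass, which this sketch does not exercise.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration of pre-aggregation; not Hadoop code.
class CombinerSketch {
    // The same summing logic plays the role of both combiner and reducer.
    static int sum(List<Integer> counts) {
        int total = 0;
        for (int count : counts) {
            total += count;
        }
        return total;
    }

    public static void main(String[] args) {
        // Five occurrences of one key, produced by two different map tasks.
        List<Integer> fromMapperA = Arrays.asList(1, 1);
        List<Integer> fromMapperB = Arrays.asList(1, 1, 1);

        // Without a combiner, the reducer receives all five values.
        int direct = sum(Arrays.asList(1, 1, 1, 1, 1));
        // With a combiner, each map task pre-sums locally and the reducer sums the partial results.
        int combined = sum(Arrays.asList(sum(fromMapperA), sum(fromMapperB)));

        System.out.println(direct + " " + combined); // prints 5 5
    }
}
```

Because addition gives the same total however the values are grouped or ordered, pre-summing on the map side changes only the amount of data shuffled, not the final counts.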

References

Data-Intensive Text Processing with MapReduce by Jimmy Lin and Chris Dyer
Hadoop: The Definitive Guide by Tom White
Source Code and Tests from blog
Hadoop API
MRUnit, for unit testing Apache Hadoop MapReduce jobs

0 0