[Hadoop] Mahout recommender on Hadoop: the co-occurrence matrix and RowSimilarityJob


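Building the co-occurrence matrix is the second major step of RecommenderJob. The first step, building the preference matrix, was analyzed in an earlier post. Before reading on, you may want to look at Mahout's short introduction to RowSimilarityJob (in English). There is also an email discussion between a user and the developer that I personally found quite helpful.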

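This step takes the preference matrix produced by the previous step, PreparePreferenceMatrixJob, as its input and processes it further to build the co-occurrence matrix. It is likewise split into three sub-steps, each carried out by one mapper and one reducer. Each sub-step is analyzed below.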

Sub-step 1:

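The mapper, VectorNormMapper, regroups the vectors of the preference matrix. It takes vectors keyed by itemId whose entries are indexed by userId, (itemId, VectorWritable<userId, pref>), and turns them into vectors keyed by userId whose entries are indexed by itemId, (userId, VectorWritable<itemId, pref>). One of the sub-steps of PreparePreferenceMatrixJob had in fact already produced vectors of the form (userId, VectorWritable<itemId, pref>), but for some reason they were then converted to (itemId, VectorWritable<userId, pref>), and here they are converted back again. I honestly do not understand why. What the code does show is that the mapper calls similarity.normalize and similarity.norm on each item row before transposing it, which may be part of the reason the input is keyed by item.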

Note that the mapper also writes output from its cleanup method, not only from map.

protected void map(IntWritable row, VectorWritable vectorWritable, Context ctx)
    throws IOException, InterruptedException {
  // (itemId, VectorWritable<userId, pref>)
  Vector rowVector = similarity.normalize(vectorWritable.get());

  int numNonZeroEntries = 0;
  double maxValue = Double.MIN_VALUE;

  Iterator<Vector.Element> nonZeroElements = rowVector.iterateNonZero();
  while (nonZeroElements.hasNext()) {
    Vector.Element element = nonZeroElements.next();
    RandomAccessSparseVector partialColumnVector = new RandomAccessSparseVector(Integer.MAX_VALUE);
    partialColumnVector.setQuick(row.get(), element.get());
    ctx.write(new IntWritable(element.index()), new VectorWritable(partialColumnVector));

    numNonZeroEntries++;
    if (maxValue < element.get()) {
      maxValue = element.get();
    }
  }

  if (threshold != NO_THRESHOLD) {
    nonZeroEntries.setQuick(row.get(), numNonZeroEntries);
    maxValues.setQuick(row.get(), maxValue);
  }
  norms.setQuick(row.get(), similarity.norm(rowVector));

  ctx.getCounter(Counters.ROWS).increment(1);
}

@Override
protected void cleanup(Context ctx) throws IOException, InterruptedException {
  super.cleanup(ctx);
  // dirty trick
  ctx.write(new IntWritable(NORM_VECTOR_MARKER), new VectorWritable(norms));
  ctx.write(new IntWritable(NUM_NON_ZERO_ENTRIES_VECTOR_MARKER), new VectorWritable(nonZeroEntries));
  ctx.write(new IntWritable(MAXVALUE_VECTOR_MARKER), new VectorWritable(maxValues));
}
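To make the regrouping easier to picture, here is a minimal sketch of the same transposition in plain Java. It is not Mahout code: the class TransposeSketch and its transpose method are hypothetical, nested maps stand in for Mahout's Vector types, and the normalization, norms and marker rows are left out.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the item-row to user-column regrouping;
// plain maps replace Mahout's Vector / VectorWritable types.
class TransposeSketch {

  // rows: itemId -> (userId -> pref); result: userId -> (itemId -> pref)
  static Map<Integer, Map<Integer, Double>> transpose(Map<Integer, Map<Integer, Double>> rows) {
    Map<Integer, Map<Integer, Double>> columns = new TreeMap<>();
    for (Map.Entry<Integer, Map<Integer, Double>> row : rows.entrySet()) {
      for (Map.Entry<Integer, Double> cell : row.getValue().entrySet()) {
        // Corresponds to emitting (userId, {itemId: pref}); grouping by
        // userId then collects all items of one user into a single vector.
        columns.computeIfAbsent(cell.getKey(), k -> new TreeMap<>())
               .put(row.getKey(), cell.getValue());
      }
    }
    return columns;
  }

  public static void main(String[] args) {
    Map<Integer, Double> item101 = new TreeMap<>();
    item101.put(1, 5.0);
    item101.put(2, 3.0);
    Map<Integer, Double> item102 = new TreeMap<>();
    item102.put(1, 4.0);

    Map<Integer, Map<Integer, Double>> rows = new TreeMap<>();
    rows.put(101, item101);
    rows.put(102, item102);

    // prints {1={101=5.0, 102=4.0}, 2={101=3.0}}
    System.out.println(transpose(rows));
  }
}
```

In the real job the grouping by userId is done by the shuffle and the reducer rather than by a single in-memory map.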

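The reducer, MergeVectorsReducer, merges the mapper output (userId, VectorWritable<itemId, pref>): all vectors with the same userId are combined into one vector, which is then written out directly. A combiner runs before this step and also merges vectors. When writing its output, the reducer behaves differently depending on the key. As the mapper code above shows, the mapper emits several kinds of data, and the reducer writes each kind to a different location.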

protected void reduce(IntWritable row, Iterable<VectorWritable> partialVectors, Context ctx)
    throws IOException, InterruptedException {
  Vector partialVector = Vectors.merge(partialVectors);

  if (row.get() == NORM_VECTOR_MARKER) {
    Vectors.write(partialVector, normsPath, ctx.getConfiguration());
  } else if (row.get() == MAXVALUE_VECTOR_MARKER) {
    Vectors.write(partialVector, maxValuesPath, ctx.getConfiguration());
  } else if (row.get() == NUM_NON_ZERO_ENTRIES_VECTOR_MARKER) {
    Vectors.write(partialVector, numNonZeroEntriesPath, ctx.getConfiguration(), true);
  } else {
    ctx.write(row, new VectorWritable(partialVector));
  }
}
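The routing by marker key can be sketched in plain Java as follows. This is only an illustration under stated assumptions: the class MergeSketch, the concrete marker values and the destination labels are all hypothetical, and merge assumes that no two partial vectors set the same index, which seems to hold here because each item row is handled by exactly one map call.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of "merge partial vectors, then route by key".
class MergeSketch {

  // Assumed marker values: any keys that can never be a real column index
  // would do. The actual constants are defined in the Mahout job itself.
  static final int NORM_MARKER = Integer.MIN_VALUE;
  static final int MAXVALUE_MARKER = Integer.MIN_VALUE + 1;
  static final int NUM_NON_ZERO_MARKER = Integer.MIN_VALUE + 2;

  // Union of sparse partial vectors (index -> value).
  // Assumes the partials never overlap on an index.
  static Map<Integer, Double> merge(List<Map<Integer, Double>> partials) {
    Map<Integer, Double> merged = new TreeMap<>();
    for (Map<Integer, Double> partial : partials) {
      merged.putAll(partial);
    }
    return merged;
  }

  // Marker keys go to side files; every other key is a normal matrix row.
  static String destination(int key) {
    if (key == NORM_MARKER) {
      return "norms";
    } else if (key == MAXVALUE_MARKER) {
      return "maxValues";
    } else if (key == NUM_NON_ZERO_MARKER) {
      return "numNonZeroEntries";
    }
    return "matrix";
  }
}
```

Smuggling the side data through the normal shuffle under reserved keys is what the mapper's "dirty trick" comment refers to: the norms collected by every mapper end up merged in one reduce call without a separate job.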

Sub-step 2:

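The mapper, CooccurrencesMapper, processes the (userId, VectorWritable<itemId, pref>) output of the previous reducer. It loops over each vector and pairs every element with the others, emitting (itemId_index, vector<itemId_index, value>). How the value is computed depends on the similarity measure in use. With a measure based on CountbasedMeasure, the value is 1 whenever two items have been seen by the same user.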

protected void map(IntWritable column, VectorWritable occurrenceVector, Context ctx)
    throws IOException, InterruptedException {
  // (userId, VectorWritable<itemId, pref>)
  Vector.Element[] occurrences = Vectors.toArray(occurrenceVector);
  Arrays.sort(occurrences, BY_INDEX);

  int cooccurrences = 0;
  int prunedCooccurrences = 0;
  // The inner loop starts at m = n, so the m-th element is first combined
  // with the n-th element itself. Doesn't that mean the data written here
  // includes the relation of an item with itself?
  for (int n = 0; n < occurrences.length; n++) {
    Vector.Element occurrenceA = occurrences[n];
    Vector dots = new RandomAccessSparseVector(Integer.MAX_VALUE);
    for (int m = n; m < occurrences.length; m++) {
      Vector.Element occurrenceB = occurrences[m];
      if (threshold == NO_THRESHOLD || consider(occurrenceA, occurrenceB)) {
        // in CountbasedMeasure aggregate always return 1
        dots.setQuick(occurrenceB.index(), similarity.aggregate(occurrenceA.get(), occurrenceB.get()));
        cooccurrences++;
      } else {
        prunedCooccurrences++;
      }
    }
    // Emits item A together with its relation to every item it co-occurs with
    ctx.write(new IntWritable(occurrenceA.index()), new VectorWritable(dots));
  }
  ctx.getCounter(Counters.COOCCURRENCES).increment(cooccurrences);
  ctx.getCounter(Counters.PRUNED_COOCCURRENCES).increment(prunedCooccurrences);
}

Suppose the input looks like this:

column1: row1, row2, row3
column2: row1, row3
column3: row2

The output will be:

for column1:

(row1,row2)
(row1,row3)
(row2,row3)

for column2:

(row1,row3)

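For column3 there is nothing to emit, because it contains only one row. (This listing shows pairs of distinct rows only. Because the inner loop of the mapper above starts at m = n, that code also emits each row paired with itself.)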
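The pairing in this example can be reproduced with a few lines of plain Java. This is a sketch, not the Mahout mapper: PairSketch and its includeSelf flag are hypothetical names, and row indices are plain ints. With includeSelf set to false it yields exactly the pairs listed above, and with true it follows the m = n loop of the mapper and adds each row paired with itself.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the pair enumeration over one sorted column.
class PairSketch {

  // Emits every pair (a, b) of rows in the column with a before b.
  // includeSelf = true starts the inner loop at m = n, like the mapper;
  // includeSelf = false starts at m = n + 1 and skips self-pairs.
  static List<String> pairs(int[] sortedRows, boolean includeSelf) {
    List<String> out = new ArrayList<>();
    for (int n = 0; n < sortedRows.length; n++) {
      for (int m = includeSelf ? n : n + 1; m < sortedRows.length; m++) {
        out.add("(row" + sortedRows[n] + ",row" + sortedRows[m] + ")");
      }
    }
    return out;
  }

  public static void main(String[] args) {
    // prints [(row1,row2), (row1,row3), (row2,row3)]
    System.out.println(pairs(new int[] {1, 2, 3}, false));
    // prints [(row1,row3)]
    System.out.println(pairs(new int[] {1, 3}, false));
    // prints []
    System.out.println(pairs(new int[] {2}, false));
    // prints [(row2,row2)]
    System.out.println(pairs(new int[] {2}, true));
  }
}
```

Sorting the entries first and only pairing each one with those at or after it means every unordered pair is produced once per column, so only the upper triangle of the matrix is computed at this stage.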

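The reducer, SimilarityReducer, aggregates these individual outputs, computes the similarity between each pair of items, and produces the final co-occurrence matrix.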

protected void reduce(IntWritable row, Iterable<VectorWritable> partialDots, Context ctx)
    throws IOException, InterruptedException {
  // (itemIdA, VectorWritable<itemIdB, 1>)
  Iterator<VectorWritable> partialDotsIterator = partialDots.iterator();
  Vector dots = partialDotsIterator.next().get();
  while (partialDotsIterator.hasNext()) {
    Vector toAdd = partialDotsIterator.next().get();
    Iterator<Vector.Element> nonZeroElements = toAdd.iterateNonZero();
    while (nonZeroElements.hasNext()) {
      Vector.Element nonZeroElement = nonZeroElements.next();
      // Accumulate the scores of the same itemId that is related to row
      dots.setQuick(nonZeroElement.index(), dots.getQuick(nonZeroElement.index()) + nonZeroElement.get());
    }
  }

  Vector similarities = dots.like();
  double normA = norms.getQuick(row.get());
  Iterator<Vector.Element> dotsWith = dots.iterateNonZero();
  while (dotsWith.hasNext()) {
    Vector.Element b = dotsWith.next();
    double similarityValue = similarity.similarity(b.get(), normA, norms.getQuick(b.index()), numberOfColumns);
    if (similarityValue >= treshold) {
      similarities.set(b.index(), similarityValue);
    }
  }
  if (excludeSelfSimilarity) {
    similarities.setQuick(row.get(), 0);
  }
  ctx.write(row, new VectorWritable(similarities));
}
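The two phases of this reducer, summing the partial dots and then turning each sum into a similarity, can be sketched in plain Java. Everything here is an assumption made for illustration: SimilaritySketch is a hypothetical class, maps replace Mahout vectors, and the measure is hard-coded as plain cosine, dot / (normA * normB), whereas the real job delegates to whichever similarity measure is configured and may compute the value differently.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of "sum partial dots, then compute similarities".
class SimilaritySketch {

  // Adds up the partial dot products emitted for one row (item A).
  static Map<Integer, Double> sum(List<Map<Integer, Double>> partialDots) {
    Map<Integer, Double> dots = new TreeMap<>();
    for (Map<Integer, Double> partial : partialDots) {
      for (Map.Entry<Integer, Double> e : partial.entrySet()) {
        dots.merge(e.getKey(), e.getValue(), Double::sum);
      }
    }
    return dots;
  }

  // Assumed measure: cosine similarity from summed dots and Euclidean norms.
  static Map<Integer, Double> similarities(int row, Map<Integer, Double> dots,
      Map<Integer, Double> norms, double threshold, boolean excludeSelf) {
    Map<Integer, Double> result = new TreeMap<>();
    double normA = norms.get(row);
    for (Map.Entry<Integer, Double> b : dots.entrySet()) {
      if (excludeSelf && b.getKey() == row) {
        continue; // drop the self-pair produced by the m = n loop
      }
      double value = b.getValue() / (normA * norms.get(b.getKey()));
      if (value >= threshold) {
        result.put(b.getKey(), value);
      }
    }
    return result;
  }
}
```

As a usage example, take item 1 with preferences (3, 4) from two users and item 2 with (3, 0). The partial dots for row 1 are {1: 9, 2: 9} from the first user and {1: 16} from the second, which sum to {1: 25, 2: 9}. With norms 5 and 3, the similarity of items 1 and 2 is 9 / 15 = 0.6, and the self-similarity 25 / 25 = 1 is dropped when excludeSelf is true.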