A Deep Dive into Mahout's Hadoop-Based Collaborative Filtering Pipeline


The Mahout version used here is mahout-distribution-0.9.

The driver class for Mahout's Hadoop-based collaborative filtering (item-based) is org.apache.mahout.cf.taste.hadoop.item.RecommenderJob.

RecommenderJob takes roughly the following parameters:

--input <input>
--output <output>
--numRecommendations              Number of recommendations per user
--usersFile                       File of users to recommend for
--itemsFile                       File of items to recommend for
--filterFile                      File containing comma-separated userID,itemID pairs. Used to exclude the item from the recommendations for that user (optional)
--booleanData                     Treat input as without pref values
--similarityClassname (-s)        One of [SIMILARITY_COOCCURRENCE, SIMILARITY_LOGLIKELIHOOD, SIMILARITY_TANIMOTO_COEFFICIENT, SIMILARITY_CITY_BLOCK, SIMILARITY_COSINE, SIMILARITY_PEARSON_CORRELATION, SIMILARITY_EUCLIDEAN_DISTANCE]
--maxPrefsPerUser                 Maximum number of preferences considered per user in the final recommendation phase
--minPrefsPerUser                 Ignore users with fewer preferences than this in the similarity computation
--maxSimilaritiesPerItem          Maximum number of similarities considered per item
--maxPrefsInItemSimilarity        Maximum number of preferences to consider per user or item in the item similarity computation phase; users or items with more preferences will be sampled down
--threshold <threshold>           Discard item pairs with a similarity value below this
--outputPathForSimilarityMatrix   Write the item similarity matrix to this path (optional)
--randomSeed                      Use this seed for sampling
--sequencefileOutput              Write the output into a SequenceFile instead of a text file
--help
--tempDir                         Intermediate output directory
--startPhase                      First phase to run (default 0)
--endPhase                        Last phase to run
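For orientation, a typical invocation might look like the following (the jar name and all paths are illustrative and depend on your installation):

hadoop jar mahout-core-0.9-job.jar org.apache.mahout.cf.taste.hadoop.item.RecommenderJob \
  --input /path/to/ratings.csv \
  --output /path/to/output \
  --numRecommendations 10 \
  --similarityClassname SIMILARITY_EUCLIDEAN_DISTANCE \
  --tempDir /path/to/temp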

Mahout's collaborative filtering takes input in the format (userid,itemid,preference). The RecommenderJob pipeline consists of the following parts, which we analyze in turn.
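For concreteness, a minimal hypothetical input file in this format might look like this (these numbers are reused in the worked examples below):

1,101,5.0
1,102,3.0
2,101,2.0
2,103,4.0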

1. PreparePreferenceMatrixJob

The full class name is org.apache.mahout.cf.taste.hadoop.preparation.PreparePreferenceMatrixJob. Opening this class shows that it consists of three sub-jobs.

1.1 itemIDIndex

This job runs ItemIDIndexMapper.class as its map class and ItemIDIndexReducer.class as its reduce class.

ItemIDIndexMapper.class converts each long itemid into an internal int index. The input format is (userid,itemid,preference); the output format is (index,itemid).
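The long-to-int conversion is hash-based; in Mahout it is implemented by TasteHadoopUtils.idToIndex. Below is a minimal sketch of the idea (the class is hypothetical and assumes a fold-the-halves long hash; consult the Mahout 0.9 source for the authoritative version):

// Sketch: fold a 64-bit item ID into a non-negative 32-bit index.
// Assumption: mirrors the hash-based mapping style of TasteHadoopUtils.idToIndex.
public final class IdToIndexSketch {

  static int idToIndex(long itemID) {
    int hash = (int) (itemID ^ (itemID >>> 32)); // fold the high bits into the low bits
    return 0x7FFFFFFF & hash;                    // clear the sign bit => non-negative index
  }

  public static void main(String[] args) {
    System.out.println(idToIndex(101L)); // prints 101 for small IDs
  }
}

Because this is a hash, collisions are possible; the (index,itemid) pairs this job persists are what allow the final phase to map indices back to real item IDs.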

The result is saved to tempDir/preparePreferenceMatrix/itemIDIndex.

1.2 toUserVectors

This job runs ToItemPrefsMapper.class as its map class and ToUserVectorsReducer.class as its reduce class.

toUserVectors assembles the preferences into one preference vector per user. The input format is (userid,itemid,preference); the output format is (userId, VectorWritable<itemId,preference>).
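Continuing the hypothetical input above, toUserVectors would group the triples by user into (1, {101:5.0, 102:3.0}) and (2, {101:2.0, 103:4.0}) (raw item IDs are shown for readability; internally the int indices from 1.1 are used).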

The result is saved to tempDir/preparePreferenceMatrix/userVectors.

1.3 toItemVectors

This job runs ToItemVectorsMapper.class as its map class and ToItemVectorsReducer.class as its reduce class.

toItemVectors builds the item rating matrix. Its input is the userVectors output of toUserVectors, in the format (userId, VectorWritable<itemId,preference>); its output is (itemId, VectorWritable<userId,preference>).
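On the same hypothetical data, this transposes the user vectors into (101, {1:5.0, 2:2.0}), (102, {1:3.0}) and (103, {2:4.0}).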

The result is saved to tempDir/preparePreferenceMatrix/ratingMatrix.

2. RowSimilarityJob

The full class name is org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob. Opening this class, we can analyze its phases.

2.1 countObservations

This job runs CountObservationsMapper.class as its map class and SumObservationsReducer.class as its reduce class.

countObservations counts how many items each user has rated. Its input is the ratingMatrix from 1.3, in the format (itemId, VectorWritable<userId,preference>); its output is a vector of (userid,count) pairs.
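On the running hypothetical example this yields {1:2, 2:2}, since each of the two users has rated two items.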

The result is saved to tempDir/observationsPerColumn.bin (reading this file requires the read function of org.apache.mahout.math.hadoop.similarity.cooccurrence.Vectors).

2.2 normsAndTranspose

This job's map class is VectorNormMapper.class and its reduce class is MergeVectorsReducer.class.

normsAndTranspose first normalizes the ratingMatrix data (itemId, VectorWritable<userId,preference>) and then writes it out transposed, in the format (userId, VectorWritable<itemId,preference>). During the map phase, the norm of each item's preference vector is also computed and stored in tempDir/norms.bin. How the norm is computed depends on the chosen similarity measure; see the norm function of the corresponding similarity class. Taking EuclideanDistanceSimilarity as an example:

@Override
public double norm(Vector vector) {
  double norm = 0;
  for (Vector.Element e : vector.nonZeroes()) {
    double value = e.get();
    norm += value * value;
  }
  return norm;
}
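For example, for an item vector {1.0, 2.0, 3.0} this returns 1 + 4 + 9 = 14. Note that this is the squared L2 norm; as the comment in the similarity function in 2.3 points out, normA and normB are consumed there as squared norms.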

How normalize is computed also depends on the chosen similarity measure. For SIMILARITY_EUCLIDEAN_DISTANCE it is simply the identity:

@Override
public Vector normalize(Vector vector) {
  return vector;
}
By contrast, a measure that centers its vectors, such as SIMILARITY_PEARSON_CORRELATION, normalizes as follows:

@Override
public Vector normalize(Vector vector) {
  if (vector.getNumNondefaultElements() == 0) {
    return vector;
  }
  // center non-zero elements
  double average = vector.norm(1) / vector.getNumNonZeroElements();
  for (Vector.Element e : vector.nonZeroes()) {
    e.set(e.get() - average);
  }
  return super.normalize(vector);
}

The result is saved to tempDir/weights.

2.3 pairwiseSimilarity

This job runs CooccurrencesMapper.class as its map class and SimilarityReducer.class as its reduce class.

pairwiseSimilarity computes the similarity between items. Its input is the weights file, in the format (userId, VectorWritable<itemId,preference>); its output is (itemM, VectorWritable<itemM+,similarscore>), where every item index in the value vector is strictly greater than the key item's index (for item M, only M+1, M+2, ... appear). In other words, this job produces the upper triangle of the similarity matrix.

Similarity is computed in two steps. The map phase performs the aggregate computation, whose definition depends on the similarity measure; for SIMILARITY_EUCLIDEAN_DISTANCE it is:

@Override
public double aggregate(double valueA, double nonZeroValueB) {
  return valueA * nonZeroValueB;
}

In the reduce phase, the aggregate values from the map phase are combined with the norms from job 2.2 to obtain the item-item similarities. Again, the formula depends on the similarity measure; for EuclideanDistanceSimilarity it is:

@Override
public double similarity(double dots, double normA, double normB, int numberOfColumns) {
  // Arg can't be negative in theory, but can in practice due to rounding, so cap it.
  // Also note that normA / normB are actually the squares of the norms.
  double euclideanDistance = Math.sqrt(Math.max(0.0, normA - 2 * dots + normB));
  return 1.0 / (1.0 + euclideanDistance);
}
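As a quick hypothetical check: for item vectors a = {u1:1.0, u2:2.0} and b = {u1:2.0, u2:1.0}, the map phase produces dots = 1*2 + 2*1 = 4, and the norms from 2.2 are normA = 1*1 + 2*2 = 5 and normB = 5. The reducer then computes euclideanDistance = sqrt(5 - 2*4 + 5) = sqrt(2) ≈ 1.414, giving similarity ≈ 1/(1 + 1.414) ≈ 0.414.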

The result is saved to tempDir/pairwiseSimilarity.

2.4 asMatrix

This job's map class is UnsymmetrifyMapper.class and its reduce class is MergeToTopKSimilaritiesReducer.class.

asMatrix assembles the triangular matrix from 2.3 into the complete similarity matrix. Its input is the pairwiseSimilarity file from 2.3, in the format (itemM, VectorWritable<itemM+,similarscore>); its output is (itemM, VectorWritable((item1,similarscore),(item2,similarscore),...)).

asMatrix works as follows: in the map phase, the maxSimilaritiesPerRow (default 100) most similar items for each item in pairwiseSimilarity are selected using TopElementsQueue. In the reduce phase, the triangular halves are first merged into the full matrix, and then Vectors.topKElements() extracts the maxSimilaritiesPerRow most similar items per row. The map phase thus acts as pre-processing for the reduce phase: trimming the triangular matrix up front reduces the reduce-side workload.
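The top-K selection itself is a standard min-heap pattern. The following simplified stand-in (a hypothetical class, not Mahout's TopElementsQueue or Vectors.topKElements()) illustrates the idea:

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Simplified stand-in for the top-K selection used in asMatrix: keep the k
// highest similarity scores in a min-heap, evicting the smallest as new
// entries arrive. Entries are (itemIndex, score) pairs packed into a double[2].
final class TopKSketch {

  static List<double[]> topK(Map<Integer, Double> similarities, int k) {
    PriorityQueue<double[]> heap =
        new PriorityQueue<>(k, Comparator.comparingDouble((double[] a) -> a[1]));
    for (Map.Entry<Integer, Double> e : similarities.entrySet()) {
      heap.offer(new double[] {e.getKey(), e.getValue()});
      if (heap.size() > k) {
        heap.poll(); // drop the currently smallest score
      }
    }
    List<double[]> result = new ArrayList<>(heap);
    result.sort((a, b) -> Double.compare(b[1], a[1])); // descending by score
    return result;
  }
}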

The result is saved to tempDir/similarityMatrix.

2.5 outputSimilarityMatrix

This job runs only if the outputPathForSimilarityMatrix parameter is set.

outputSimilarityMatrix saves the result produced in 2.4 to the directory the user specified via the outputPathForSimilarityMatrix parameter.

3. partialMultiply

This job is defined inline in the run() method of org.apache.mahout.cf.taste.hadoop.item.RecommenderJob. The relevant code is:

if (shouldRunNextPhase(parsedArgs, currentPhase)) {
  Job partialMultiply = new Job(getConf(), "partialMultiply");
  Configuration partialMultiplyConf = partialMultiply.getConfiguration();

  MultipleInputs.addInputPath(partialMultiply, similarityMatrixPath, SequenceFileInputFormat.class,
                              SimilarityMatrixRowWrapperMapper.class);
  MultipleInputs.addInputPath(partialMultiply, new Path(prepPath, PreparePreferenceMatrixJob.USER_VECTORS),
      SequenceFileInputFormat.class, UserVectorSplitterMapper.class);
  partialMultiply.setJarByClass(ToVectorAndPrefReducer.class);
  partialMultiply.setMapOutputKeyClass(VarIntWritable.class);
  partialMultiply.setMapOutputValueClass(VectorOrPrefWritable.class);
  partialMultiply.setReducerClass(ToVectorAndPrefReducer.class);
  partialMultiply.setOutputFormatClass(SequenceFileOutputFormat.class);
  partialMultiply.setOutputKeyClass(VarIntWritable.class);
  partialMultiply.setOutputValueClass(VectorAndPrefsWritable.class);
  partialMultiplyConf.setBoolean("mapred.compress.map.output", true);
  partialMultiplyConf.set("mapred.output.dir", partialMultiplyPath.toString());

  if (usersFile != null) {
    partialMultiplyConf.set(UserVectorSplitterMapper.USERS_FILE, usersFile);
  }
  partialMultiplyConf.setInt(UserVectorSplitterMapper.MAX_PREFS_PER_USER_CONSIDERED, maxPrefsPerUser);

  boolean succeeded = partialMultiply.waitForCompletion(true);
  if (!succeeded) {
    return -1;
  }
}

From this code we can see that the job registers SimilarityMatrixRowWrapperMapper.class and UserVectorSplitterMapper.class as two map-side inputs via MultipleInputs.addInputPath(), and runs ToVectorAndPrefReducer.class in the reduce phase.

3.1 SimilarityMatrixRowWrapperMapper.class

This mapper wraps each row of the similarity matrix in a VectorOrPrefWritable. Its input format is (itemid, VectorWritable<itemid,similarscore>); its output format is (itemid, VectorOrPrefWritable). The VectorOrPrefWritable class has three fields: a Vector, a long userid, and a float value. In this mapper only the vector field is populated.

3.2 UserVectorSplitterMapper.class

This mapper converts each user vector into VectorOrPrefWritable form. Its input is the output of job 1.2, the tempDir/preparePreferenceMatrix/userVectors file; its output format is (itemid, VectorOrPrefWritable). In this mapper only the userid and value fields are populated.

3.3 ToVectorAndPrefReducer

This is the reduce phase of the partialMultiply job. Its input is the combined output of the mappers in 3.1 and 3.2, in the format (itemid, VectorOrPrefWritable); its output format is (itemid, VectorAndPrefsWritable). The reducer routes the incoming records into the three fields of VectorAndPrefsWritable: List<Long> userIDs and List<Float> prefValues (from 3.2), and Vector similarityMatrixColumn (from 3.1).
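As an illustrative example with hypothetical values: for item 101, the mapper in 3.1 contributes (101, VectorOrPrefWritable(vector = {102:0.4, 103:0.2})), while the mapper in 3.2 contributes one (101, VectorOrPrefWritable(userid = 1, value = 5.0)) record per user who rated item 101. The reducer merges these into (101, VectorAndPrefsWritable(userIDs = [1], prefValues = [5.0], similarityMatrixColumn = {102:0.4, 103:0.2})).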

The result is saved to tempDir/partialMultiply.

4. itemFiltering

This job runs only when the user supplies a filterFile. Its purpose is to exclude the (userID,itemID) pairs listed in that file from the recommendations.

5. aggregateAndRecommend

This job's map class is PartialMultiplyMapper and its reduce class is AggregateAndRecommendReducer.

This job extracts the final recommendations, which are written to the configured output directory.

When no filterFile is set, the input is the output of 3.3, the tempDir/partialMultiply file, in the format (itemid, VectorAndPrefsWritable); the output format is (userid, RecommendedItemsWritable).

The job works as follows: the map phase emits, for each user, the user's preference values together with the corresponding item similarity columns, in the form (userIDWritable, PrefAndSimilarityColumnWritable), where PrefAndSimilarityColumnWritable holds a float prefValue and a Vector similarityColumn.

The reduce phase first checks the booleanData parameter: if true, it runs reduceBooleanData; if false (the default), it runs reduceNonBooleanData. The reduce-phase computation implements the prediction formula documented in the source:

/**
 * <p>computes prediction values for each user</p>
 *
 * <pre>
 * u = a user
 * i = an item not yet rated by u
 * N = all items similar to i (where similarity is usually computed by pairwisely comparing the item-vectors
 *     of the user-item matrix)
 *
 * Prediction(u,i) = sum(all n from N: similarity(i,n) * rating(u,n)) / sum(all n from N: abs(similarity(i,n)))
 * </pre>
 */
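Plugging hypothetical numbers into this formula: if user u has rated two items similar to i, with similarity(i,n1) = 0.4, rating(u,n1) = 4.0 and similarity(i,n2) = 0.2, rating(u,n2) = 2.0, then Prediction(u,i) = (0.4*4.0 + 0.2*2.0) / (0.4 + 0.2) = 2.0 / 0.6 ≈ 3.33.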
The reduce-side code of AggregateAndRecommendReducer that computes these predictions is:

@Override
protected void reduce(VarLongWritable userID,
                      Iterable<PrefAndSimilarityColumnWritable> values,
                      Context context) throws IOException, InterruptedException {
  if (booleanData) {
    reduceBooleanData(userID, values, context);
  } else {
    reduceNonBooleanData(userID, values, context);
  }
}

private void reduceBooleanData(VarLongWritable userID,
                               Iterable<PrefAndSimilarityColumnWritable> values,
                               Context context) throws IOException, InterruptedException {
  /* having boolean data, each estimated preference can only be 1,
   * however we can't use this to rank the recommended items,
   * so we use the sum of similarities for that. */
  Iterator<PrefAndSimilarityColumnWritable> columns = values.iterator();
  Vector predictions = columns.next().getSimilarityColumn();
  while (columns.hasNext()) {
    predictions.assign(columns.next().getSimilarityColumn(), Functions.PLUS);
  }
  writeRecommendedItems(userID, predictions, context);
}

private void reduceNonBooleanData(VarLongWritable userID,
                      Iterable<PrefAndSimilarityColumnWritable> values,
                      Context context) throws IOException, InterruptedException {
  /* each entry here is the sum in the numerator of the prediction formula */
  Vector numerators = null;
  /* each entry here is the sum in the denominator of the prediction formula */
  Vector denominators = null;
  /* each entry here is the number of similar items used in the prediction formula */
  Vector numberOfSimilarItemsUsed = new RandomAccessSparseVector(Integer.MAX_VALUE, 100);

  for (PrefAndSimilarityColumnWritable prefAndSimilarityColumn : values) {
    Vector simColumn = prefAndSimilarityColumn.getSimilarityColumn();
    float prefValue = prefAndSimilarityColumn.getPrefValue();
    /* count the number of items used for each prediction */
    for (Element e : simColumn.nonZeroes()) {
      int itemIDIndex = e.index();
      numberOfSimilarItemsUsed.setQuick(itemIDIndex, numberOfSimilarItemsUsed.getQuick(itemIDIndex) + 1);
    }

    if (denominators == null) {
      denominators = simColumn.clone();
    } else {
      denominators.assign(simColumn, Functions.PLUS_ABS);
    }

    if (numerators == null) {
      numerators = simColumn.clone();
      if (prefValue != BOOLEAN_PREF_VALUE) {
        numerators.assign(Functions.MULT, prefValue);
      }
    } else {
      if (prefValue != BOOLEAN_PREF_VALUE) {
        simColumn.assign(Functions.MULT, prefValue);
      }
      numerators.assign(simColumn, Functions.PLUS);
    }
  }

  if (numerators == null) {
    return;
  }

  Vector recommendationVector = new RandomAccessSparseVector(Integer.MAX_VALUE, 100);
  for (Element element : numerators.nonZeroes()) {
    int itemIDIndex = element.index();
    /* preference estimations must be based on at least 2 datapoints */
    if (numberOfSimilarItemsUsed.getQuick(itemIDIndex) > 1) {
      /* compute normalized prediction */
      double prediction = element.get() / denominators.getQuick(itemIDIndex);
      recommendationVector.setQuick(itemIDIndex, prediction);
    }
  }

  writeRecommendedItems(userID, recommendationVector, context);
}

/**
 * find the top entries in recommendationVector, map them to the real itemIDs and write back the result
 */
private void writeRecommendedItems(VarLongWritable userID, Vector recommendationVector, Context context)
  throws IOException, InterruptedException {
  TopItemsQueue topKItems = new TopItemsQueue(recommendationsPerUser);

  for (Element element : recommendationVector.nonZeroes()) {
    int index = element.index();
    long itemID;
    if (indexItemIDMap != null && !indexItemIDMap.isEmpty()) {
      itemID = indexItemIDMap.get(index);
    } else { // we don't have any mappings, so just use the original
      itemID = index;
    }
    if (itemsToRecommendFor == null || itemsToRecommendFor.contains(itemID)) {
      float value = (float) element.get();
      if (!Float.isNaN(value)) {
        MutableRecommendedItem topItem = topKItems.top();
        if (value > topItem.getValue()) {
          topItem.set(itemID, value);
          topKItems.updateTop();
        }
      }
    }
  }

  List<RecommendedItem> topItems = topKItems.getTopItems();
  if (!topItems.isEmpty()) {
    recommendedItems.set(topItems);
    context.write(userID, recommendedItems);
  }
}


