Understanding Mahout's Hadoop-based Collaborative Filtering Pipeline
Source: Internet | Editor: 程序博客网 | Date: 2024/06/10 22:02
The Mahout version used here is mahout-distribution-0.9.
Mahout's Hadoop-based (item-based) collaborative filtering is driven by org.apache.mahout.cf.taste.hadoop.item.RecommenderJob.
Running RecommenderJob accepts roughly the following parameters:
- --input <input>: input path
- --output <output>: output path
- --numRecommendations: number of recommendations per user
- --usersFile: file of users to recommend for
- --itemsFile: file of items to recommend for
- --filterFile: file containing comma-separated userID,itemID pairs; used to exclude the item from the recommendations for that user (optional)
- --booleanData: treat input as without pref values
- --similarityClassname (-s): one of SIMILARITY_COOCCURRENCE, SIMILARITY_LOGLIKELIHOOD, SIMILARITY_TANIMOTO_COEFFICIENT, SIMILARITY_CITY_BLOCK, SIMILARITY_COSINE, SIMILARITY_PEARSON_CORRELATION, SIMILARITY_EUCLIDEAN_DISTANCE
- --maxPrefsPerUser: maximum number of preferences considered per user in the final recommendation phase
- --minPrefsPerUser: ignore users with fewer preferences than this in the similarity computation
- --maxSimilaritiesPerItem: maximum number of similarities considered per item
- --maxPrefsInItemSimilarity: max number of preferences to consider per user or item in the item similarity computation phase; users or items with more preferences will be sampled down
- --threshold <threshold>: discard item pairs with a similarity value below this
- --outputPathForSimilarityMatrix: write the item similarity matrix to this path (optional)
- --randomSeed: use this seed for sampling
- --sequencefileOutput: write the output into a SequenceFile instead of a text file
- --help
- --tempDir: intermediate output directory
- --startPhase: first phase to run (default 0)
- --endPhase: last phase to run
Mahout's collaborative filtering input format is (userid,itemid,preference). The RecommenderJob pipeline consists of the following stages, which we analyze in turn.
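For concreteness, a minimal input file (with invented user IDs, item IDs, and ratings) might look like:

```text
1,101,5.0
1,102,3.0
2,101,2.0
2,103,4.5
3,102,4.0
3,103,2.5
```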
1. The PreparePreferenceMatrixJob job
This class lives in org.apache.mahout.cf.taste.hadoop.preparation.PreparePreferenceMatrixJob; opening it shows that the job comprises three sub-jobs.
1.1 itemIDIndex
This job runs ItemIDIndexMapper.class as its map class and ItemIDIndexReducer.class as its reduce class.
ItemIDIndexMapper.class maps each long itemid to an int internal index. The input format is (userid,itemid,preference); the output format is (index,itemid).
The result is saved to tempDir/preparePreferenceMatrix/itemIDIndex.
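The long-to-int conversion can be sketched as follows. The XOR-fold hash below is an illustrative assumption standing in for Mahout's TasteHadoopUtils.idToIndex, not a verbatim copy of it:

```java
// Sketch: fold a 64-bit item ID into a non-negative 32-bit index.
public class ItemIndex {

    public static int idToIndex(long itemID) {
        // XOR the high and low 32 bits, then clear the sign bit so the
        // result is always a valid non-negative internal index.
        int hash = (int) (itemID ^ (itemID >>> 32));
        return hash & 0x7FFFFFFF;
    }

    public static void main(String[] args) {
        System.out.println(idToIndex(42L));
        System.out.println(idToIndex(-1L));
        System.out.println(idToIndex(Long.MAX_VALUE));
    }
}
```

Because such a mapping is a hash, distinct item IDs can collide; emitting the (index, itemid) pairs, as this sub-job does, is what later allows the internal indices to be translated back to real item IDs.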
1.2 toUserVectors
This job runs ToItemPrefsMapper.class as its map class and ToUserVectorsReducer.class as its reduce class.
toUserVectors turns the user preferences into one preference vector per user. The input format is (userid,itemid,preference); the output format is (userId, VectorWritable<itemId,preference>).
The result is saved to tempDir/preparePreferenceMatrix/userVectors.
1.3 toItemVectors
This job runs ToItemVectorsMapper.class as its map class and ToItemVectorsReducer.class as its reduce class.
toItemVectors builds the item rating matrix. The input is the userVectors output of toUserVectors, with format (userId, VectorWritable<itemId,preference>); the output is (itemId, VectorWritable<userId,preference>).
The result is saved to tempDir/preparePreferenceMatrix/ratingMatrix.
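The user-vector and item-vector stages are two views of the same sparse matrix, so toItemVectors is essentially a transpose. A sketch with plain HashMaps standing in for Mahout's VectorWritable (all names and data here are illustrative):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the toItemVectors transpose: user vectors
// (userId -> {itemIndex -> pref}) become item vectors
// (itemIndex -> {userId -> pref}).
public class Transpose {

    public static Map<Integer, Map<Long, Float>> toItemVectors(
            Map<Long, Map<Integer, Float>> userVectors) {
        Map<Integer, Map<Long, Float>> itemVectors = new HashMap<>();
        for (Map.Entry<Long, Map<Integer, Float>> user : userVectors.entrySet()) {
            for (Map.Entry<Integer, Float> pref : user.getValue().entrySet()) {
                itemVectors.computeIfAbsent(pref.getKey(), k -> new HashMap<>())
                           .put(user.getKey(), pref.getValue());
            }
        }
        return itemVectors;
    }

    public static void main(String[] args) {
        Map<Long, Map<Integer, Float>> userVectors = new HashMap<>();
        userVectors.put(1L, new HashMap<>(Map.of(101, 5.0f, 102, 3.0f)));
        userVectors.put(2L, new HashMap<>(Map.of(101, 2.0f)));
        System.out.println(toItemVectors(userVectors));
    }
}
```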
2、 RowSimilarityJob作业
所在包为package org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob,打开此类进行分析。
2.1 countObservations
此作业执行CountObservationsMapper.class作为map类,执行SumObservationsReducer.class作为reduce类。
countObservations作用是计算出每个用户的评定物品数。输入为1.3中的结果ratingMatrix格式为(itemId,VectorWritable<userId,preference>),输出格式为(vector<userid,count>)。
计算结果保存在tempDir/observationsPerColumn.bin文件中(此文件读取需要使用org.apache.mahout.math.hadoop.similarity.cooccurrence.Vectors中的read函数)。
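Counting observations per user amounts to summing, over the item rows, how often each user appears. A sketch with HashMaps in place of the Hadoop machinery and the Vectors read/write helpers:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of countObservations: from the item-to-user rating matrix,
// count how many items each user has rated.
public class CountObservations {

    public static Map<Long, Integer> observationsPerUser(
            Map<Integer, Map<Long, Float>> ratingMatrix) {
        Map<Long, Integer> counts = new HashMap<>();
        for (Map<Long, Float> userPrefs : ratingMatrix.values()) {
            for (Long userId : userPrefs.keySet()) {
                counts.merge(userId, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<Integer, Map<Long, Float>> ratingMatrix = new HashMap<>();
        ratingMatrix.put(101, Map.of(1L, 5.0f, 2L, 2.0f));
        ratingMatrix.put(102, Map.of(1L, 3.0f));
        System.out.println(observationsPerUser(ratingMatrix));
    }
}
```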
2.2 normsAndTranspose
This job's map class is VectorNormMapper.class and its reduce class is MergeVectorsReducer.class.
normsAndTranspose first normalizes the ratingMatrix data (itemId, VectorWritable<userId,preference>) and then writes it back out in (userId, VectorWritable<itemId,preference>) form. In the map phase this step also computes each item's preference norm and stores the values in tempDir/norms.bin. How the norm is computed depends on the similarity measure; see the norm function of the measure class. Taking EuclideanDistanceSimilarity as an example:

```java
@Override
public double norm(Vector vector) {
  double norm = 0;
  for (Vector.Element e : vector.nonZeroes()) {
    double value = e.get();
    norm += value * value;
  }
  return norm;
}
```
The normalize computation likewise depends on the chosen similarity measure. For SIMILARITY_EUCLIDEAN_DISTANCE it is the identity:

```java
@Override
public Vector normalize(Vector vector) {
  return vector;
}
```

For comparison, SIMILARITY_PEARSON_CORRELATION centers the non-zero elements before delegating to its superclass's normalization:

```java
@Override
public Vector normalize(Vector vector) {
  if (vector.getNumNondefaultElements() == 0) {
    return vector;
  }
  // center non-zero elements
  double average = vector.norm(1) / vector.getNumNonZeroElements();
  for (Vector.Element e : vector.nonZeroes()) {
    e.set(e.get() - average);
  }
  return super.normalize(vector);
}
```
The result is saved to tempDir/weights.
2.3 pairwiseSimilarity
This job uses CooccurrencesMapper.class as its map class and SimilarityReducer.class as its reduce class.
pairwiseSimilarity computes the similarity between items. The input is the weights file, with format (userId, VectorWritable<itemId,preference>); the output is (itemM, VectorWritable<itemM+,similarscore>) (in each row keyed by an item, the other items' indices are strictly greater than the key's, i.e. row M holds M+1, M+2, ...). The job's output is therefore the upper triangle of the similarity matrix.
The similarity is computed as follows: the map phase performs the aggregate computation, whose form depends on the similarity measure. For SIMILARITY_EUCLIDEAN_DISTANCE:
```java
@Override
public double aggregate(double valueA, double nonZeroValueB) {
  return valueA * nonZeroValueB;
}
```
The reduce phase then combines the aggregate values from the map phase with the norms from job 2.2 to obtain the item-item similarity. The formula again depends on the measure; for EuclideanDistanceSimilarity:
```java
@Override
public double similarity(double dots, double normA, double normB, int numberOfColumns) {
  // Arg can't be negative in theory, but can in practice due to rounding, so cap it.
  // Also note that normA / normB are actually the squares of the norms.
  double euclideanDistance = Math.sqrt(Math.max(0.0, normA - 2 * dots + normB));
  return 1.0 / (1.0 + euclideanDistance);
}
```
The result is saved to tempDir/pairwiseSimilarity.
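Putting 2.2 and 2.3 together for SIMILARITY_EUCLIDEAN_DISTANCE: the per-user aggregate values sum to the dot product of two item columns, and the reduce combines that with the squared norms as in the similarity function above. A self-contained sketch with dense arrays (invented data) in place of sparse vectors:

```java
// Sketch of the Euclidean pairwise-similarity math: dot product from
// the map-side aggregates, squared norms from norms.bin, combined as
// 1 / (1 + ||a - b||), since normA - 2*dots + normB = ||a - b||^2.
public class EuclideanSim {

    public static double dot(double[] a, double[] b) {
        double d = 0;
        for (int i = 0; i < a.length; i++) {
            d += a[i] * b[i];
        }
        return d;
    }

    public static double normSquared(double[] v) {
        return dot(v, v);
    }

    public static double similarity(double dots, double normA, double normB) {
        double dist = Math.sqrt(Math.max(0.0, normA - 2 * dots + normB));
        return 1.0 / (1.0 + dist);
    }

    public static void main(String[] args) {
        double[] a = {3, 0};
        double[] b = {0, 4};
        // distance between a and b is 5, so similarity = 1 / 6
        System.out.println(similarity(dot(a, b), normSquared(a), normSquared(b)));
    }
}
```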
2.4 asMatrix
This job's map class is UnsymmetrifyMapper.class and its reduce class is MergeToTopKSimilaritiesReducer.class.
asMatrix expands the triangular matrix from 2.3 into the full similarity matrix. The input is the pairwiseSimilarity file from 2.3, with format (itemM, VectorWritable<itemM+,similarscore>); the output is (itemM, VectorWritable<(item1,similarscore),(item2,similarscore),...>).
asMatrix works as follows: in the map phase, for each item in pairwiseSimilarity it selects the maxSimilaritiesPerRow (default 100) most similar items using TopElementsQueue. In the reduce phase it first merges the triangular halves into the full matrix and then uses Vectors.topKElements() to keep the maxSimilaritiesPerRow most similar items. The map phase thus acts as pre-processing for the reduce: it prunes the triangular matrix and cuts the reduce phase's workload.
The result is saved to tempDir/similarityMatrix.
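The top-K pruning done on both the map and reduce sides can be sketched with a size-bounded min-heap; this is a stand-in for Mahout's TopElementsQueue / Vectors.topKElements(), not their actual implementation:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Sketch: keep only the k most similar items from a row of
// {itemIndex, similarity} pairs using a min-heap of size k.
public class TopK {

    public static List<double[]> topK(double[][] pairs, int k) {
        PriorityQueue<double[]> heap =
            new PriorityQueue<>(Comparator.comparingDouble(p -> p[1]));
        for (double[] p : pairs) {
            heap.offer(p);
            if (heap.size() > k) {
                heap.poll(); // drop the currently smallest similarity
            }
        }
        List<double[]> top = new ArrayList<>(heap);
        top.sort((x, y) -> Double.compare(y[1], x[1])); // best first
        return top;
    }

    public static void main(String[] args) {
        double[][] pairs = {{1, 0.2}, {2, 0.9}, {3, 0.5}};
        for (double[] p : topK(pairs, 2)) {
            System.out.println((int) p[0] + " " + p[1]);
        }
    }
}
```

The min-heap makes each row cost O(n log k) instead of sorting the whole row, which matters when rows are long and k (maxSimilaritiesPerRow) is small.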
2.5 outputSimilarityMatrix
This job runs only if the outputPathForSimilarityMatrix parameter was set.
outputSimilarityMatrix saves the result produced in 2.4 to the directory the user specified via the outputPathForSimilarityMatrix parameter.
3. The partialMultiply job
This job also lives in org.apache.mahout.cf.taste.hadoop.item.RecommenderJob, inside the run function. Its code:
```java
if (shouldRunNextPhase(parsedArgs, currentPhase)) {
  Job partialMultiply = new Job(getConf(), "partialMultiply");
  Configuration partialMultiplyConf = partialMultiply.getConfiguration();

  MultipleInputs.addInputPath(partialMultiply, similarityMatrixPath,
      SequenceFileInputFormat.class, SimilarityMatrixRowWrapperMapper.class);
  MultipleInputs.addInputPath(partialMultiply,
      new Path(prepPath, PreparePreferenceMatrixJob.USER_VECTORS),
      SequenceFileInputFormat.class, UserVectorSplitterMapper.class);

  partialMultiply.setJarByClass(ToVectorAndPrefReducer.class);
  partialMultiply.setMapOutputKeyClass(VarIntWritable.class);
  partialMultiply.setMapOutputValueClass(VectorOrPrefWritable.class);
  partialMultiply.setReducerClass(ToVectorAndPrefReducer.class);
  partialMultiply.setOutputFormatClass(SequenceFileOutputFormat.class);
  partialMultiply.setOutputKeyClass(VarIntWritable.class);
  partialMultiply.setOutputValueClass(VectorAndPrefsWritable.class);
  partialMultiplyConf.setBoolean("mapred.compress.map.output", true);
  partialMultiplyConf.set("mapred.output.dir", partialMultiplyPath.toString());

  if (usersFile != null) {
    partialMultiplyConf.set(UserVectorSplitterMapper.USERS_FILE, usersFile);
  }
  partialMultiplyConf.setInt(UserVectorSplitterMapper.MAX_PREFS_PER_USER_CONSIDERED,
      maxPrefsPerUser);

  boolean succeeded = partialMultiply.waitForCompletion(true);
  if (!succeeded) {
    return -1;
  }
}
```
From the code we can see that this job uses MultipleInputs.addInputPath() to register SimilarityMatrixRowWrapperMapper.class and UserVectorSplitterMapper.class as map classes whose outputs feed the reduce phase, which runs ToVectorAndPrefReducer.class.
3.1 SimilarityMatrixRowWrapperMapper.class
This mapper wraps each row of the similarity matrix into VectorOrPrefWritable form. Its input format is (itemid, VectorWritable<itemid,similarscore>); its output format is (itemid, VectorOrPrefWritable). The VectorOrPrefWritable class has three members: vector, long userid, and float value. This map phase populates only the vector field of VectorOrPrefWritable.
3.2 UserVectorSplitterMapper.class
This mapper converts user vectors into VectorOrPrefWritable objects. Its input is the output of job 1.2, i.e. the tempDir/preparePreferenceMatrix/userVectors file; its output format is (itemid, VectorOrPrefWritable). This map phase populates only the userid and value fields of VectorOrPrefWritable.
3.3 ToVectorAndPrefReducer
This is the reduce phase of the partialMultiply job. Its input is the combined output of the 3.1 and 3.2 map phases, with format (itemid, VectorOrPrefWritable); its output format is (itemid, VectorAndPrefsWritable). The phase writes the data arriving from the 3.1 and 3.2 maps into the List<Long> userIDs, List<Float> prefValues, and Vector similarityMatrixColumn fields of VectorAndPrefsWritable, respectively.
The result is saved to tempDir/partialMultiply.
4. The itemFiltering job
This job runs only when the user has set the filterFile parameter. Its purpose is to filter out the items listed in that file.
5. The aggregateAndRecommend job
This job's map class is PartialMultiplyMapper and its reduce class is AggregateAndRecommendReducer.
This job extracts the final recommendations, which are saved to the configured output directory.
When no filterFile is set, the job's input is the tempDir/partialMultiply output of job 3.3, with format (itemid, VectorAndPrefsWritable); the output format is (userid, RecommendedItemsWritable).
The job works as follows: the map phase emits, for each user, the preference values and the corresponding item similarity vectors, i.e. (userIDWritable, PrefAndSimilarityColumnWritable), where PrefAndSimilarityColumnWritable records a float prefValue and a Vector similarityColumn.
The reduce phase first checks the booleanData parameter: if true it runs reduceBooleanData, if false (the default) reduceNonBooleanData. The reduce computation is shown below:
The class javadoc documents the prediction formula:

```java
/**
 * <p>computes prediction values for each user</p>
 *
 * <pre>
 * u = a user
 * i = an item not yet rated by u
 * N = all items similar to i (where similarity is usually computed by
 *     pairwisely comparing the item-vectors of the user-item matrix)
 *
 * Prediction(u,i) = sum(all n from N: similarity(i,n) * rating(u,n)) /
 *                   sum(all n from N: abs(similarity(i,n)))
 * </pre>
 */
```

The reduce-side code of AggregateAndRecommendReducer:

```java
@Override
protected void reduce(VarLongWritable userID,
                      Iterable<PrefAndSimilarityColumnWritable> values,
                      Context context) throws IOException, InterruptedException {
  if (booleanData) {
    reduceBooleanData(userID, values, context);
  } else {
    reduceNonBooleanData(userID, values, context);
  }
}

private void reduceBooleanData(VarLongWritable userID,
                               Iterable<PrefAndSimilarityColumnWritable> values,
                               Context context) throws IOException, InterruptedException {
  /* having boolean data, each estimated preference can only be 1,
   * however we can't use this to rank the recommended items,
   * so we use the sum of similarities for that. */
  Iterator<PrefAndSimilarityColumnWritable> columns = values.iterator();
  Vector predictions = columns.next().getSimilarityColumn();
  while (columns.hasNext()) {
    predictions.assign(columns.next().getSimilarityColumn(), Functions.PLUS);
  }
  writeRecommendedItems(userID, predictions, context);
}

private void reduceNonBooleanData(VarLongWritable userID,
                                  Iterable<PrefAndSimilarityColumnWritable> values,
                                  Context context) throws IOException, InterruptedException {
  /* each entry here is the sum in the numerator of the prediction formula */
  Vector numerators = null;
  /* each entry here is the sum in the denominator of the prediction formula */
  Vector denominators = null;
  /* each entry here is the number of similar items used in the prediction formula */
  Vector numberOfSimilarItemsUsed = new RandomAccessSparseVector(Integer.MAX_VALUE, 100);

  for (PrefAndSimilarityColumnWritable prefAndSimilarityColumn : values) {
    Vector simColumn = prefAndSimilarityColumn.getSimilarityColumn();
    float prefValue = prefAndSimilarityColumn.getPrefValue();
    /* count the number of items used for each prediction */
    for (Element e : simColumn.nonZeroes()) {
      int itemIDIndex = e.index();
      numberOfSimilarItemsUsed.setQuick(itemIDIndex,
          numberOfSimilarItemsUsed.getQuick(itemIDIndex) + 1);
    }

    if (denominators == null) {
      denominators = simColumn.clone();
    } else {
      denominators.assign(simColumn, Functions.PLUS_ABS);
    }

    if (numerators == null) {
      numerators = simColumn.clone();
      if (prefValue != BOOLEAN_PREF_VALUE) {
        numerators.assign(Functions.MULT, prefValue);
      }
    } else {
      if (prefValue != BOOLEAN_PREF_VALUE) {
        simColumn.assign(Functions.MULT, prefValue);
      }
      numerators.assign(simColumn, Functions.PLUS);
    }
  }

  if (numerators == null) {
    return;
  }

  Vector recommendationVector = new RandomAccessSparseVector(Integer.MAX_VALUE, 100);
  for (Element element : numerators.nonZeroes()) {
    int itemIDIndex = element.index();
    /* preference estimations must be based on at least 2 datapoints */
    if (numberOfSimilarItemsUsed.getQuick(itemIDIndex) > 1) {
      /* compute normalized prediction */
      double prediction = element.get() / denominators.getQuick(itemIDIndex);
      recommendationVector.setQuick(itemIDIndex, prediction);
    }
  }

  writeRecommendedItems(userID, recommendationVector, context);
}

/**
 * find the top entries in recommendationVector, map them to the real itemIDs
 * and write back the result
 */
private void writeRecommendedItems(VarLongWritable userID, Vector recommendationVector,
                                   Context context) throws IOException, InterruptedException {
  TopItemsQueue topKItems = new TopItemsQueue(recommendationsPerUser);

  for (Element element : recommendationVector.nonZeroes()) {
    int index = element.index();
    long itemID;
    if (indexItemIDMap != null && !indexItemIDMap.isEmpty()) {
      itemID = indexItemIDMap.get(index);
    } else {
      // we don't have any mappings, so just use the original
      itemID = index;
    }
    if (itemsToRecommendFor == null || itemsToRecommendFor.contains(itemID)) {
      float value = (float) element.get();
      if (!Float.isNaN(value)) {
        MutableRecommendedItem topItem = topKItems.top();
        if (value > topItem.getValue()) {
          topItem.set(itemID, value);
          topKItems.updateTop();
        }
      }
    }
  }

  List<RecommendedItem> topItems = topKItems.getTopItems();
  if (!topItems.isEmpty()) {
    recommendedItems.set(topItems);
    context.write(userID, recommendedItems);
  }
}
```
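A worked sketch of the prediction formula from the javadoc, with plain arrays of invented values in place of the sparse numerator/denominator vectors:

```java
// Prediction(u,i) = sum_n sim(i,n) * rating(u,n) / sum_n |sim(i,n)|
public class Predict {

    public static double predict(double[] similarities, double[] ratings) {
        double numerator = 0.0;
        double denominator = 0.0;
        for (int n = 0; n < similarities.length; n++) {
            numerator += similarities[n] * ratings[n];
            denominator += Math.abs(similarities[n]);
        }
        return numerator / denominator;
    }

    public static void main(String[] args) {
        // Two similar items rated 4 and 2, each with similarity 0.5:
        // the prediction is their similarity-weighted average, 3.0.
        System.out.println(predict(new double[]{0.5, 0.5}, new double[]{4, 2}));
    }
}
```

Note that the actual reducer only emits a prediction when at least two similar items contributed (numberOfSimilarItemsUsed > 1), so estimates backed by a single data point are dropped.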
- Understanding Mahout's Hadoop-based collaborative filtering pipeline in depth
- An analysis of Mahout's Hadoop-based item-based collaborative filtering pipeline
- A flow analysis of Apache Mahout's item-based collaborative filtering algorithm
- Mahout user-based collaborative filtering (UserCF)
- Mahout item-based collaborative filtering commands
- Steps of Mahout's item-based collaborative filtering
- Understanding the basic concepts of collaborative filtering in Mahout
- A code analysis of Apache Mahout Taste's Hadoop-based collaborative filtering recommendation engine
- Efficient collaborative filtering recommendations with Apache Mahout
- [Machine Learning] Mahout user recommendations with collaborative filtering (CF)
- pairwiseSimilarity in Mahout's item-based collaborative filtering
- asMatrix in Mahout's item-based collaborative filtering
- Distributed Mahout development: item-based collaborative filtering (ItemCF)
- A source-code analysis of Mahout's item-based collaborative filtering
- Implementing user-based collaborative filtering in Mahout
- **Efficient collaborative filtering movie recommendations with Apache Mahout**