Mahout Collaborative Filtering Source Code Analysis (3-1): QR Decomposition Data Flow
Mahout version: 0.7, Hadoop version: 1.0.4, JDK: 1.7.0_25 64-bit.
Continuing from the previous post, this one analyzes the following line of code:

Vector solve(Matrix Ai, Matrix Vi) {
    return new QRDecomposition(Ai).solve(Vi).viewColumn(0);
}

It is only one line, but it involves quite a lot... (I should have paid more attention in linear algebra class back then...)
To trace the data flow of this code clearly, some preparation is needed first, namely obtaining Ai and Vi.
The dataset is listing 2.1 from "Mahout in Action":
1,101,5.0
1,102,3.0
1,103,2.5
2,101,2.0
2,102,2.5
2,103,5.0
2,104,2.0
3,101,2.5
3,104,4.0
3,105,4.5
3,107,5.0
4,101,5.0
4,103,3.0
4,104,4.5
4,106,4.0
5,101,4.0
5,102,3.0
5,103,2.0
5,104,4.0
5,105,3.5
5,106,4.0

To obtain Vi and Ai, first write the code below, set a breakpoint after the initializeM function of ParallelALSFactorizationJob, and run it:
package mahout.fansy.als.test;

import org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJob;

public class TestParallelALSFactorizationJob {
    /**
     * Run ParallelALSFactorizationJob on the small dataset
     * (listing 2.1, page 15 of "Mahout in Action"); set a breakpoint
     * after the initializeM function to capture the data needed
     * before analyzing the QR data flow.
     * @throws Exception
     */
    public static void main(String[] args) throws Exception {
        String[] arg = new String[]{"-jt", "ubuntu:9001", "-fs", "ubuntu:9000",
            "-i", "hdfs://ubuntu:9000/test/input/user_item",
            "-o", "hdfs://ubuntu:9000/test/output",
            "--lambda", "0.065", "--numFeatures", "3", "--numIterations", "3",
            "--tempDir", "hdfs://ubuntu:9000/test/temp"};
        ParallelALSFactorizationJob.main(arg);
    }
}

The input data is simply the triples above, uploaded to the corresponding HDFS path. Then run the following code (a mock of SolveExplicitFeedbackMapper, the same as in the previous post except for the adjusted paths):
package mahout.fansy.als;

import java.io.IOException;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Writable;
import org.apache.mahout.math.SequentialAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;
import org.apache.mahout.math.als.AlternatingLeastSquaresSolver;
import org.apache.mahout.math.map.OpenIntObjectHashMap;

import com.google.common.collect.Lists;

import mahout.fansy.utils.read.ReadArbiKV;

public class SolveExplicitFeedbackMapperFollow_1 {
    /**
     * Mock of the first call to SolveExplicitFeedbackMapper,
     * using the small dataset.
     */
    private static double lambda = 0.065;
    private static int numFeatures = 3;
    private static OpenIntObjectHashMap<Vector> UorM;
    private static AlternatingLeastSquaresSolver solver;

    public static void main(String[] args) throws IOException {
        setup();
        map();
    }

    /**
     * Read the map input file.
     */
    public static Map<Writable, Writable> getMapData() throws IOException {
        String fPath = "hdfs://ubuntu:9000/test/output/userRatings/part-r-00000";
        Map<Writable, Writable> mapData = ReadArbiKV.readFromFile(fPath);
        return mapData;
    }

    /**
     * Mock of the setup function.
     */
    public static void setup() {
        solver = new AlternatingLeastSquaresSolver();
        UorM = ALSUtilsFollow.readMatrixByRows(
            new Path("hdfs://ubuntu:9000/test/temp/M--1/part-m-00000"), getConf());
    }

    public static void map() throws IOException {
        Map<Writable, Writable> map = getMapData();
        for (Iterator<Entry<Writable, Writable>> iter = map.entrySet().iterator(); iter.hasNext();) {
            Entry<Writable, Writable> entry = iter.next();
            IntWritable userOrItemID = (IntWritable) entry.getKey();
            VectorWritable ratingsWritable = (VectorWritable) entry.getValue();
            // source code
            Vector ratings = new SequentialAccessSparseVector(ratingsWritable.get());
            List<Vector> featureVectors = Lists.newArrayList();
            Iterator<Vector.Element> interactions = ratings.iterateNonZero();
            while (interactions.hasNext()) {
                int index = interactions.next().index();
                featureVectors.add(UorM.get(index));
            }
            Vector uiOrmj = solver.solve(featureVectors, ratings, lambda, numFeatures);
            System.out.println(userOrItemID + "," + new VectorWritable(uiOrmj));
        }
    }

    /**
     * Build the configuration (the job tracker runs on port 9001, matching
     * the "-jt" argument above).
     */
    private static Configuration getConf() {
        Configuration conf = new Configuration();
        conf.set("mapred.job.tracker", "ubuntu:9001");
        return conf;
    }
}

Set a breakpoint at the solver.solve(...) line; stopping there lets us inspect the initial variable values:
userRatings:

[1={101:5.0,102:3.0,103:2.5},
 2={101:2.0,102:2.5,103:5.0,104:2.0},
 3={101:2.5,104:4.0,105:4.5,107:5.0},
 4={101:5.0,103:3.0,104:4.5,106:4.0},
 5={101:4.0,102:3.0,103:2.0,104:4.0,105:3.5,106:4.0}]

UorM (the first column is the item's average rating; the other columns are random values in (0,1)):

[101->{0:3.7,1:0.8671164945911651,2:0.34569609436188886},
 102->{0:2.833333333333333,1:0.26849761474873923,2:0.25305280900447447},
 103->{0:3.125,1:0.03761210458127495,2:0.8249152283326323},
 104->{0:3.625,1:0.7549644739393445,2:0.1152736727230218},
 105->{0:4.0,1:0.12274350577015558,2:0.862849667838315},
 106->{0:4.0,1:0.5113672636264076,2:0.5790585002437059},
 107->{0:5.0,1:0.4732039618109546,2:0.5447453232014403}]

Only the first user (userid 1) is analyzed here; the relevant variable values are:
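Before following user 1 through solve, the claim that the first column of UorM is the per-item average can be checked directly against the listing 2.1 triples. A minimal self-contained sketch in plain Java (no Mahout dependency; the class and method names here are mine, not Mahout's):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ItemAverages {

    // user,item,rating triples from listing 2.1
    static final double[][] TRIPLES = {
        {1, 101, 5.0}, {1, 102, 3.0}, {1, 103, 2.5},
        {2, 101, 2.0}, {2, 102, 2.5}, {2, 103, 5.0}, {2, 104, 2.0},
        {3, 101, 2.5}, {3, 104, 4.0}, {3, 105, 4.5}, {3, 107, 5.0},
        {4, 101, 5.0}, {4, 103, 3.0}, {4, 104, 4.5}, {4, 106, 4.0},
        {5, 101, 4.0}, {5, 102, 3.0}, {5, 103, 2.0}, {5, 104, 4.0},
        {5, 105, 3.5}, {5, 106, 4.0}
    };

    // average rating per item id
    static Map<Integer, Double> itemAverages() {
        Map<Integer, double[]> sumCount = new LinkedHashMap<>();
        for (double[] t : TRIPLES) {
            double[] sc = sumCount.computeIfAbsent((int) t[1], k -> new double[2]);
            sc[0] += t[2];  // running sum of ratings for this item
            sc[1] += 1;     // number of ratings for this item
        }
        Map<Integer, Double> avg = new LinkedHashMap<>();
        sumCount.forEach((item, sc) -> avg.put(item, sc[0] / sc[1]));
        return avg;
    }

    public static void main(String[] args) {
        itemAverages().forEach((item, a) -> System.out.println(item + " -> " + a));
    }
}
```

The output matches the first column of UorM above (101 → 3.7, 103 → 3.125, 104 → 3.625, and so on); the remaining columns are random, so they differ from run to run.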
user1Ratings:

{101:5.0,102:3.0,103:2.5}

user1_featureVectors (the entries of UorM for the items appearing in user1Ratings):

[{0:3.7,1:0.8671164945911651,2:0.34569609436188886},                 --> item101
 {0:2.833333333333333,1:0.26849761474873923,2:0.25305280900447447},  --> item102
 {0:3.125,1:0.03761210458127495,2:0.8249152283326323}]               --> item103

Then we step into the solve function:
user1_MiIi (user1_featureVectors transposed, rows and columns swapped; the columns correspond to items 101, 102, 103):

[[3.7, 2.833333333333333, 3.125],
 [0.8671164945911651, 0.26849761474873923, 0.03761210458127495],
 [0.34569609436188886, 0.25305280900447447, 0.8249152283326323]]

RiIiMaybeTransposed (user1Ratings with the item ids dropped, transposed into a column vector; dropping the ids makes its rows line up with the columns of user1_MiIi):

[[5.0],  --> item101
 [3.0],  --> item102
 [2.5]]  --> item103

Ai: MiIi multiplied by the transpose of MiIi, after which each diagonal entry (row == col) is updated to its original value plus lambda times the number of items user 1 rated.

The transpose of MiIi (which is just user1_featureVectors again):

[[3.7, 0.8671164945911651, 0.34569609436188886],
 [2.833333333333333, 0.26849761474873923, 0.25305280900447447],
 [3.125, 0.03761210458127495, 0.8249152283326323]]

MiIi times its transpose, using the matrix product formula (AB)ij = ai1*b1j + ai2*b2j + ... + ain*bnj:

[[31.483402777777776, 4.08661209859189, 4.573918596524476],
 [4.08661209859189, 0.8253966547288653, 0.3987296589988406],
 [4.573918596524476, 0.3987296589988406, 0.864026647737198]]

After the diagonal update, with lambda*nui = 0.065*3 = 0.195, Ai becomes:

[[31.678402777777777, 4.08661209859189, 4.573918596524476],
 [4.08661209859189, 1.0203966547288652, 0.3987296589988406],
 [4.573918596524476, 0.3987296589988406, 1.059026647737198]]

Vi: the product of MiIi and RiIiMaybeTransposed:

[[34.8125],
 [5.235105578655231],
 [4.549926969654448]]

With that, Ai and Vi are fully initialized. The next post analyzes new QRDecomposition(Ai).solve(Vi).viewColumn(0) in detail.
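All of these numbers can be reproduced without Mahout. The sketch below (plain Java; the class name, array layout, and method names are mine) computes Ai = MiIi * MiIi^T with lambda*nui added on the diagonal, and Vi = MiIi * RiIi, for user 1:

```java
import java.util.Arrays;

public class User1AiVi {

    // MiIi: user 1's feature submatrix, one column per rated item (101, 102, 103)
    static final double[][] MI = {
        {3.7,                 2.833333333333333,   3.125},
        {0.8671164945911651,  0.26849761474873923, 0.03761210458127495},
        {0.34569609436188886, 0.25305280900447447, 0.8249152283326323}
    };
    // RiIi: user 1's ratings for items 101, 102, 103
    static final double[] RATINGS = {5.0, 3.0, 2.5};
    static final double LAMBDA = 0.065;

    // Ai = MiIi * MiIi^T + lambda * nui * I
    static double[][] ai() {
        int f = MI.length;       // number of features (3)
        int n = RATINGS.length;  // nui: number of items user 1 rated (3)
        double[][] a = new double[f][f];
        for (int i = 0; i < f; i++)
            for (int j = 0; j < f; j++)
                for (int k = 0; k < n; k++)
                    a[i][j] += MI[i][k] * MI[j][k];  // (A * A^T)_ij = sum_k a_ik * a_jk
        for (int i = 0; i < f; i++)
            a[i][i] += LAMBDA * n;                   // diagonal update: 0.065 * 3 = 0.195
        return a;
    }

    // Vi = MiIi * RiIi
    static double[] vi() {
        double[] v = new double[MI.length];
        for (int i = 0; i < MI.length; i++)
            for (int k = 0; k < RATINGS.length; k++)
                v[i] += MI[i][k] * RATINGS[k];
        return v;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.deepToString(ai()));
        System.out.println(Arrays.toString(vi()));
    }
}
```

Up to floating-point rounding in the last digits, this prints the Ai and Vi shown above, e.g. Ai[0][0] = 31.678402777777777 and Vi = [34.8125, 5.235105578655231, 4.549926969654448].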
Share, grow, be happy.
Please credit this blog when reposting: http://blog.csdn.net/fansy1990