hadoop学习-Netflix电影推荐系统

来源：互联网发布：销售提成算法编辑：程序博客网时间：2024/05/16 17:55

1、推荐系统概述

电子商务网站是推荐系统应用的重要领域之一，当当网的图书推荐，大众点评的美食推荐，QQ好友推荐等等，推荐无处不在。

从企业角度，推荐系统的应用可以增加销售额等等，对于用户而言，系统仿佛知道我们的喜好并给出推荐也是非常美妙的事情。

推荐算法分类：

按数据使用划分：

协同过滤算法：UserCF, ItemCF, ModelCF
基于内容的推荐: 用户内容属性和物品内容属性
社会化过滤：基于用户的社会网络关系

按模型划分：

最近邻模型:基于距离的协同过滤算法
Latent Factor Mode(SVD)：基于矩阵分解的模型
Graph：图模型，社会网络图模型

本文采用协同过滤算法来实现电影推荐。下面介绍下基于用户的协同过滤算法UserCF和基于物品的协同过滤算法ItemCF原理。

基于用户的协同过滤算法UserCF

基于用户的协同过滤，通过不同用户对物品的评分来评测用户之间的相似性，基于用户之间的相似性做出推荐。简单来讲就是：给用户推荐和他兴趣相似的其他用户喜欢的物品。

更多关于算法实现可参考Mahout In Action这本书

基于物品的协同过滤算法ItemCF
基于item的协同过滤，通过用户对不同item的评分来评测item之间的相似性，基于item之间的相似性做出推荐。简单来讲就是：给用户推荐和他之前喜欢的物品相似的物品。
用例说明：

更多关于算法实现可参考Mahout In Action这本书

目前商用较多采用该算法。

2、数据源

切入正题，本文采用的数据源是Netflix公司的电影评分数据。Netflix是一家以在线电影租赁为生的公司。他们根据网友对电影的打分来判断用户有可能喜欢什么电影，并结合会员看过的电影以及口味偏好设置做出判断，混搭出各种电影风格的需求。

Netflix数据下载：

完整数据集：http://www.lifecrunch.biz/wp-content/uploads/2011/04/nf_prize_dataset.tar.gz

3、算法模型 MapReduce实现

先看下数据格式；

1,101,5.01,102,3.01,103,2.52,101,2.02,102,2.52,103,5.02,104,2.03,101,2.03,104,4.03,105,4.53,107,5.04,101,5.04,103,3.04,104,4.54,106,4.05,101,4.05,102,3.05,103,2.05,104,4.05,105,3.55,106,4.0

每行3个字段，依次是用户ID,电影ID,用户对电影的评分(0-5分，每0.5为一个评分点！)

算法实现过程：

建立物品的同现矩阵
建立用户对物品的评分矩阵
矩阵计算推荐结果

1)建立物品的同现矩阵

按照用户选择，在每个用户选择的物品中，将两两同时出现的物品次数记录下来。

      [101] [102] [103] [104] [105] [106] [107][101]   5     3     4     4     2     2     1[102]   3     3     3     2     1     1     0[103]   4     3     4     3     1     2     0[104]   4     2     3     4     2     2     1[105]   2     1     1     2     2     1     1[106]   2     1     2     2     1     2     0[107]   1     0     0     1     1     0     1

该矩阵表示 ID为101 102的电影同时被一个用户评分的次数为3。该矩阵是一个对称矩阵。

2)建立用户对物品的评分矩阵

按用户分组，找到每个用户所选的物品及评分

       3[101] 2.0[102] 0.0[103] 0.0[104] 4.0[105] 4.5[106] 0.0[107] 5.0

表示ID为3的用户的评分数据

3)将以上2个矩阵相乘

      [101] [102] [103] [104] [105] [106] [107]              3         R[101]   5     3     4     4     2     2     1               2.0       40.0[102]   3     3     3     2     1     1     0               0.0       18.5[103]   4     3     4     3     1     2     0               0.0       24.5[104]   4     2     3     4     2     2     1       ×      4.0   =   40.0[105]   2     1     1     2     2     1     1               4.5       26.0[106]   2     1     2     2     1     2     0               0.0       16.5[107]   1     0     0     1     1     0     1               5.0       16.5

R列分数最高的就是推荐结果；

这里为什么是2个矩阵相乘，结果的分数越高就是推荐结果呢。

简单分析如下：

当前ID为3的用户对ID为107的电影评分较高，可以理解为该用户喜欢这种类型的电影。

因此如果有一部电影和ID为107电影同时出现的次数越高，则可以理解为该电影和107电影比较类似，那么我们就可以将其推荐给用户3，也许他会喜欢。

这里同时出现的次数指的是很多用户都看过这2部电影，也就是我们说的这2部电影比较类似。

上面的矩阵中，我们可以看到104电影和107电影同时出现的次数为1，假设次数为10 ，那么R矩阵的结果会是：

 R40.018.524.5<span style="color:#ff0000;">85.0</span>26.016.516.5

最大值变成85.0，那么我们可以推荐104电影给用户3观看。

以上是该算法的简单分析，详细原理介绍可以看Mahout In Action这本书

4、源代码实现

Recommend.java

[java] view plaincopy

import java.util.HashMap;
import java.util.Map;
public class Recommend {
public static void main(String[] args) throws Exception {
Map<String, String> path = new HashMap<String, String>();
path.put("data", args[0]);
path.put("Step1Input", args[1]);
path.put("Step1Output", path.get("Step1Input") + "/step1");
path.put("Step2Input", path.get("Step1Output"));
path.put("Step2Output", path.get("Step1Input") + "/step2");
path.put("Step3Input1", path.get("Step1Output"));
path.put("Step3Output1", path.get("Step1Input") + "/step3_1");
path.put("Step3Input2", path.get("Step2Output"));
path.put("Step3Output2", path.get("Step1Input") + "/step3_2");
path.put("Step5Input1", path.get("Step3Output1"));
path.put("Step5Input2", path.get("Step3Output2"));
path.put("Step5Output", path.get("Step1Input") + "/step5");
path.put("Step6Input", path.get("Step5Output"));
path.put("Step6Output", path.get("Step1Input") + "/step6");
Step1.step1Run(path);
Step2.step2Run(path);
Step3.step3Run1(path);
Step3.step3Run2(path);
Step4_1.run(path);
Step4_2.run(path);
System.exit(0);
}
}

Step1.java

[java] view plaincopy

import java.io.IOException;
import java.util.Iterator;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.io.IntWritable;
public class Step1 {
public static class MapClass
extends Mapper<Object, Text, IntWritable, Text > {
public void map(Object key, Text value,
Context context ) throws IOException,
InterruptedException {
String[] list = value.toString().split(",");
context.write(new IntWritable(Integer.parseInt(list[0])),new Text(list[1] + ":" +list[2]));
}
}
public static class Reduce extends Reducer<IntWritable, Text, IntWritable, Text> {
private Text value = new Text();
public void reduce(IntWritable key, Iterable<Text> values,
Context context) throws IOException,InterruptedException {
StringBuilder sb = new StringBuilder();
for (Text val : values) {
sb.append( "," + val.toString());
}
value.set(sb.toString().replaceFirst(",", ""));
context.write(key, new Text(value));
}
}
public static void step1Run(Map<String, String> path) throws Exception {
Configuration conf = new Configuration();
String input = path.get("data");
String output = path.get("Step1Output");
Job job = new Job(conf, "step1Run");
job.setJarByClass(Step1.class);
job.setMapperClass(MapClass.class);
job.setCombinerClass(Reduce.class);
job.setReducerClass(Reduce.class);
job.setOutputKeyClass(IntWritable.class);
job.setOutputValueClass(Text.class);
FileInputFormat.setInputPaths(job, new Path(input));
FileOutputFormat.setOutputPath(job, new Path(output));
job.waitForCompletion(true);
}
}

Step2.java

[java] view plaincopy

import java.io.IOException;
import java.util.Iterator;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.io.IntWritable;
public class Step1 {
public static class MapClass
extends Mapper<Object, Text, IntWritable, Text > {
public void map(Object key, Text value,
Context context ) throws IOException,
InterruptedException {
String[] list = value.toString().split(",");
context.write(new IntWritable(Integer.parseInt(list[0])),new Text(list[1] + ":" +list[2]));
}
}
public static class Reduce extends Reducer<IntWritable, Text, IntWritable, Text> {
private Text value = new Text();
public void reduce(IntWritable key, Iterable<Text> values,
Context context) throws IOException,InterruptedException {
StringBuilder sb = new StringBuilder();
for (Text val : values) {
sb.append( "," + val.toString());
}
value.set(sb.toString().replaceFirst(",", ""));
context.write(key, new Text(value));
}
}
public static void step1Run(Map<String, String> path) throws Exception {
Configuration conf = new Configuration();
String input = path.get("data");
String output = path.get("Step1Output");
Job job = new Job(conf, "step1Run");
job.setJarByClass(Step1.class);
job.setMapperClass(MapClass.class);
job.setCombinerClass(Reduce.class);
job.setReducerClass(Reduce.class);
job.setOutputKeyClass(IntWritable.class);
job.setOutputValueClass(Text.class);
FileInputFormat.setInputPaths(job, new Path(input));
FileOutputFormat.setOutputPath(job, new Path(output));
job.waitForCompletion(true);
}
}

Step3.java

[java] view plaincopy

import java.io.IOException;
import java.util.Iterator;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.io.IntWritable;
public class Step3 {
public static class Map1
extends Mapper<Object, Text, IntWritable, Text> {
private IntWritable k = new IntWritable();
private Text v = new Text();
public void map(Object key, Text value,
Context context ) throws IOException,
InterruptedException {
String[] list = value.toString().split("\\\t|,");
for(int i = 1;i<list.length ; i++)
{
String[] vector = list[i].split(":");
int nItemID = Integer.parseInt(vector[0]);
k.set(nItemID);
v.set(list[0] + ":" + vector[1]);
context.write(k,v);
}
}
}
public static class Map2
extends Mapper<Object, Text, Text, IntWritable > {
private IntWritable v = new IntWritable();
private Text k = new Text();
public void map(Object key, Text value,
Context context ) throws IOException,
InterruptedException {
String[] list = value.toString().split("\\\t|,");
k.set(list[0]);
v.set(Integer.parseInt(list[1]));
context.write(k,v);
}
}
public static void step3Run1(Map<String, String> path) throws Exception {
Configuration conf = new Configuration();
String input = path.get("Step3Input1");
String output = path.get("Step3Output1");
Job job = new Job(conf, "step3Run1");
job.setJarByClass(Step3.class);
job.setMapperClass(Map1.class);
job.setOutputKeyClass(IntWritable.class);
job.setOutputValueClass(Text.class);
FileInputFormat.setInputPaths(job, new Path(input));
FileOutputFormat.setOutputPath(job, new Path(output));
job.waitForCompletion(true);
}
public static void step3Run2(Map<String, String> path) throws Exception {
Configuration conf = new Configuration();
//String[] otherArgs = new GenericOptionsParser(conf,args).getRemainingArgs();
//if(otherArgs.length != 2){
// System.err.println("Usage: KPI <in> <out>");
// System.exit(2);
//}
String input = path.get("Step3Input2");
String output = path.get("Step3Output2");
Job job = new Job(conf, "step3Run2");
job.setJarByClass(Step3.class);
job.setMapperClass(Map2.class);
//job.setCombinerClass(Reduce.class);
//job.setReducerClass(Reduce.class);
//job.setInputFormat(KeyValueTextInputFormat.class);
//job.setOutputFormat(TextOutputFormat.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.setInputPaths(job, new Path(input));
FileOutputFormat.setOutputPath(job, new Path(output));
job.waitForCompletion(true);
}
}

Step4_1.java

[java] view plaincopy

import java.io.IOException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class Step4_1 {
public static class Step4_1_Mapper extends
Mapper<Object, Text, Text, Text> {
private String flag;// A同现矩阵 or B评分矩阵
protected void setup(Context context) throws IOException, InterruptedException {
FileSplit split = (FileSplit) context.getInputSplit();
flag = split.getPath().getParent().getName();// 判断读的数据集
// System.out.println(flag);
}
public void map(Object key, Text values, Context context) throws IOException, InterruptedException {
String[] tokens = values.toString().split("\\\t|,");
if (flag.equals("step3_2")) {// 同现矩阵
String[] v1 = tokens[0].split(":");
String itemID1 = v1[0];
String itemID2 = v1[1];
String num = tokens[1];
Text k = new Text(itemID1);
Text v = new Text("A:" + itemID2 + "," + num);
context.write(k, v);
} else if (flag.equals("step3_1")) {// 评分矩阵
String[] v2 = tokens[1].split(":");
String itemID = tokens[0];
String userID = v2[0];
String pref = v2[1];
Text k = new Text(itemID);
Text v = new Text("B:" + userID + "," + pref);
context.write(k, v);
}
}
}
public static class Step4_1_Reducer extends Reducer<Text, Text, Text, Text> {
public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
Map<String, String> mapA = new HashMap<String, String>();
Map<String, String> mapB = new HashMap<String, String>();
for (Text line : values) {
String val = line.toString();
if (val.startsWith("A:")) {
String[] kv = val.substring(2).split("\\\t|,");
mapA.put(kv[0], kv[1]);
} else if (val.startsWith("B:")) {
String[] kv = val.substring(2).split("\\\t|,");
mapB.put(kv[0], kv[1]);
}
}
double result = 0;
Iterator iter = mapA.keySet().iterator();
while (iter.hasNext()) {
String mapk = (String) iter.next();// itemID
int num = Integer.parseInt(mapA.get(mapk));
Iterator iterb = mapB.keySet().iterator();
while (iterb.hasNext()) {
String mapkb = (String) iterb.next();// userID
double pref = Double.parseDouble(mapB.get(mapkb));
result = num * pref;// 矩阵乘法相乘计算
Text k = new Text(mapkb);
Text v = new Text(mapk + "," + result);
context.write(k, v);
}
}
}
}
public static void run(Map<String, String> path) throws IOException, InterruptedException, ClassNotFoundException {
Configuration conf = new Configuration();
String input1 = path.get("Step5Input1");
String input2 = path.get("Step5Input2");
String output = path.get("Step5Output");
Job job = new Job(conf,"Step4_1");
job.setJarByClass(Step4_1.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setMapperClass(Step4_1_Mapper.class);
job.setReducerClass(Step4_1_Reducer.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.setInputPaths(job, new Path(input1), new Path(input2));
FileOutputFormat.setOutputPath(job, new Path(output));
job.waitForCompletion(true);
}
}

Step4_2.java

[java] view plaincopy

import java.io.IOException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class Step4_2 {
public static class Step4_2_Mapper extends
Mapper<Object, Text, Text, Text> {
public void map(Object key, Text values, Context context) throws IOException, InterruptedException {
String[] tokens = values.toString().split("\\\t|,");
Text k = new Text(tokens[0]);
Text v = new Text(tokens[1]+","+tokens[2]);
context.write(k, v);
}
}
public static class Step4_2_Reducer extends Reducer<Text, Text, Text, Text> {
public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
Map<String, Double> map = new HashMap<String, Double>();
for (Text line : values) {
String[] tokens = line.toString().split("\\\t|,");
String itemID = tokens[0];
Double score = Double.parseDouble(tokens[1]);
if (map.containsKey(itemID)) {
map.put(itemID, map.get(itemID) + score);// 矩阵乘法求和计算
} else {
map.put(itemID, score);
}
}
Iterator<String> iter = map.keySet().iterator();
while (iter.hasNext()) {
String itemID = iter.next();
double score = map.get(itemID);
Text v = new Text(itemID + "," + score);
context.write(key, v);
}
}
}
public static void run(Map<String, String> path) throws IOException, InterruptedException, ClassNotFoundException {
Configuration conf = new Configuration();
String input = path.get("Step6Input");
String output = path.get("Step6Output");
Job job = new Job(conf,"Step4_2");
job.setJarByClass(Step4_2.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setMapperClass(Step4_2_Mapper.class);
job.setReducerClass(Step4_2_Reducer.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.setInputPaths(job, new Path(input));
FileOutputFormat.setOutputPath(job, new Path(output));
job.waitForCompletion(true);
}
}

以上代码还有很多需要完善的地方，下次再重新整理。
附上github地址：https://github.com/y521263/Hadoop_in_Action

参考资料：

http://blog.fens.me/hadoop-mapreduce-recommend/

0 0