Use Apache Pig to process the MovieLens datasets --- recommender system
MovieLens datasets website :
http://grouplens.org/datasets/movielens/
Pig script :
-- 1. register the jar file piggybank.jar so the loader org.apache.pig.piggybank.storage.MyRegExLoader can be used
register '/opt/ibm/biginsights/pig/contrib/piggybank/java/piggybank.jar';
-- 2. load the source dataset into a relation named source
source = load '/user/liuhuaj/ml10m/my.train.bz2' using org.apache.pig.piggybank.storage.MyRegExLoader('^(\\S+)::(\\S+)::(\\S+)::\\S+') as (userid:int, itemid:int, score:double);
-- 3. compute score^2 so the norm can be computed in the next step.
source_tmp = foreach source generate TOTUPLE(userid,score) as userid_score, itemid, score, score*score as norm;
-- 4. group by itemid
itemid_group = group source_tmp by itemid;
-- 5. compute the average score and the norm (sqrt(score1^2 + score2^2 + ...)) for each item
get_avg_norm_tmp = foreach itemid_group generate flatten(source_tmp.userid_score), group as itemid, AVG(source_tmp.score) as avg,SQRT(SUM(source_tmp.norm)) as norm;
get_avg_norm_1 = foreach get_avg_norm_tmp generate null::userid_score.userid as userid, itemid, null::userid_score.score as score, avg, norm;
-- 6. group by userid
merge_2 = group get_avg_norm_1 by userid;
--store merge_2 into '/user/liuhuaj/ml10m/merge_2';
-- 7. register my own Pig UDF jar, which is written in Java
register '/mnt/home/liuhuaj/pig_exec/pig_udf.jar';
-- 8. generate rows (itemid1, norm1, itemid2, norm2, norm_multiply, score_multiply) using my own function pig_udf.Sim_3_mapper
sim_3_mapper_tmp = foreach merge_2 generate flatten(pig_udf.Sim_3_mapper(get_avg_norm_1));
-- 9. filter out the null rows
sim_3_mapper_del_null = filter sim_3_mapper_tmp by sim::itemid1 is not null;
sim_3_mapper = foreach sim_3_mapper_del_null generate sim::itemid1 as itemid1, sim::norm1 as norm1, sim::itemid2 as itemid2, sim::norm2 as norm2, sim::score1 as score1, sim::score2 as score2, sim::norm_multiply as norm_multiply, sim::score_multiply as score_multiply;
-- 10. group by (itemid1, itemid2)
sim_3_mapper_group = group sim_3_mapper by (itemid1,itemid2);
-- 11. compute the similarity of each item pair
sim_3_reduce_tmp = foreach sim_3_mapper_group generate flatten(group) as (itemid1:int, itemid2:int), SUM(sim_3_mapper.score_multiply)/SUM(sim_3_mapper.norm_multiply) as (sim:double);
-- 12. group by itemid1 and keep only the top 100 by similarity
sim_3_reduce_group_for_item1 = group sim_3_reduce_tmp by itemid1;
sim_3_reduce_group_for_item1_order_limit = foreach sim_3_reduce_group_for_item1 { tmp_order = order sim_3_reduce_tmp by sim desc; first100 = limit tmp_order 100; generate group, first100; };
-- 13. group by itemid2 and keep only the top 100 by similarity
sim_3_reduce_tmp2 = foreach sim_3_reduce_tmp generate itemid2 as itemid1, itemid1 as itemid2, sim;
sim_3_reduce_group_for_item2 = group sim_3_reduce_tmp2 by itemid1;
sim_3_reduce_group_for_item2_order_limit = foreach sim_3_reduce_group_for_item2 { tmp_order = order sim_3_reduce_tmp2 by sim desc; first100 = limit tmp_order 100; generate group, first100; };
-- 14. union item1 group and item2 group
sim_3_reduce_group = union sim_3_reduce_group_for_item1_order_limit, sim_3_reduce_group_for_item2_order_limit;
--store sim_3_reduce_group into '/user/liuhuaj/ml10m/sim_3_reduce_group';
-- 15. extract itemid and average score from get_avg_norm_1 (step 5)
avg_4_tmp = foreach get_avg_norm_1 generate itemid, avg;
avg_4 = distinct avg_4_tmp;
--store avg_4 into '/user/liuhuaj/ml10m/avg_4';
-- 16. load the test dataset into a relation named testset
testset = load '/user/liuhuaj/ml10m/my.test.bz2' using org.apache.pig.piggybank.storage.MyRegExLoader('^(\\S+)::(\\S+)::(\\S+)::\\S+') as (userid:int, itemid:int, real_score:double);
-- 17. join the test dataset and avg_4 (step 15) by itemid
merge_4_and_test_tmp = join avg_4 by itemid, testset by itemid;
merge_4_and_test = foreach merge_4_and_test_tmp generate avg_4::itemid as itemid, avg_4::avg as avg, testset::userid as userid, testset::real_score as real_score;
-- 18. join the test dataset and the similarity result from step 14 by itemid
merge_5_and_3 = join merge_4_and_test by itemid, sim_3_reduce_group by group;
merge_5_and_3_reduce_for_6 = foreach merge_5_and_3 generate merge_4_and_test::userid as userid_testset, merge_4_and_test::avg as avg, merge_4_and_test::itemid as itemid_testset, merge_4_and_test::real_score as real_score, sim_3_reduce_group::first100 as sim_first_100;
--store merge_5_and_3_reduce_for_6 into '/user/liuhuaj/ml10m/merge_5_and_3_reduce_for_6';
-- 19. join merge_5_and_3_reduce_for_6 (test data, step 18) and merge_2 (step 6) by userid
merge_6_and_2 = join merge_5_and_3_reduce_for_6 by userid_testset, merge_2 by group;
--store merge_6_and_2 into '/user/liuhuaj/ml10m/merge_6_and_2';
-- 20. compute the forecast score using my own function pig_udf.Forecast_score
Forecast_score_tmp = foreach merge_6_and_2 generate pig_udf.Forecast_score(merge_5_and_3_reduce_for_6::userid_testset, merge_5_and_3_reduce_for_6::avg, merge_5_and_3_reduce_for_6::itemid_testset, merge_5_and_3_reduce_for_6::real_score, merge_5_and_3_reduce_for_6::sim_first_100, merge_2::get_avg_norm_1);
--store Forecast_score_tmp into '/user/liuhuaj/ml10m/Forecast_score_tmp';
Forecast_score = foreach Forecast_score_tmp generate mapper_8.key as key, mapper_8.userid as userid, mapper_8.itemid_testset as itemid_testset, mapper_8.real_score as real_score, mapper_8.forecast_score as forecast_score, mapper_8.diff_value as diff_value;
-- 21. filter out the null values
Forecast_score_without_null = filter Forecast_score by key is not null;
--store Forecast_score_without_null into '/user/liuhuaj/ml10m/Forecast_score_without_null';
-- 22. group by key; every row has the same key '1', so all the data lands in one group for computing the MAE value
Forecast_score_group = group Forecast_score_without_null by key;
-- 23. compute the MAE value
mae = foreach Forecast_score_group generate SUM(Forecast_score_without_null.diff_value) / COUNT(Forecast_score_without_null.key) as (mae_value:double);
-- 24. store mae, or use dump mae to check the value
store mae into '/user/liuhuaj/ml10m/mae';
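The regular expression passed to MyRegExLoader in the load statements can be sanity-checked outside Pig. A minimal Python sketch (the sample line is hypothetical; note that Python raw strings need only single backslashes where Pig's string literal doubles them):

```python
import re

# The pattern from the load statements above. MovieLens rating lines
# look like: userid::itemid::score::timestamp
pattern = re.compile(r'^(\S+)::(\S+)::(\S+)::\S+')

line = '1::122::5.0::838985046'  # a sample line in the ml-10m ratings format
m = pattern.match(line)
userid, itemid, score = int(m.group(1)), int(m.group(2)), float(m.group(3))
print(userid, itemid, score)  # 1 122 5.0
```

The timestamp field is matched but not captured, which is why the Pig schema only declares (userid, itemid, score).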
Result :
grunt> cat part-r-00001
0.8431270314028648
Thinking :
1. transform the dataset into the format below (step 5 above):
Userid, Itemid, Score, Average, Norm
......
2. transform the dataset into the format below (step 6 above):
Userid(group), {(Userid, Itemid, Score, Average, Norm), (Userid, Itemid, Score, Average, Norm), .............}
......
3. transform the dataset into the format below (step 8 above):
Itemid1, Norm1, Itemid2, Norm2, Norm1 * Norm2, Score1 * Score2
Itemid2, Norm2, Itemid1, Norm1, Norm2 * Norm1, Score2 * Score1
......
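The per-user pair generation that pig_udf.Sim_3_mapper performs can be sketched like this in Python (toy values; the field layout is assumed from the comment in step 8, and the exact ordering convention lives inside the UDF):

```python
from itertools import combinations

# One user's bag after the group-by in step 6: (itemid, score, norm), where
# norm is the item's sqrt-sum-of-squares from step 5 (toy numbers here).
user_items = [(10, 4.0, 6.0), (20, 3.0, 5.0), (30, 5.0, 7.0)]

rows = []
for (i1, s1, n1), (i2, s2, n2) in combinations(user_items, 2):
    # Emit both orderings of each pair, matching the two row shapes above.
    rows.append((i1, n1, i2, n2, n1 * n2, s1 * s2))
    rows.append((i2, n2, i1, n1, n2 * n1, s2 * s1))
```

Three rated items give three pairs, hence six output rows; summing score_multiply and norm_multiply per (itemid1, itemid2) group later yields the cosine similarity.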
4. transform the dataset into the format below, limited to 100 items (step 14 above):
Item1(group), {(Item1, Item2, Similarity), (Item1, Item3, Similarity)..........}
Item2(group), {(Item2, Item1, Similarity), (Item2, Item3, Similarity)..........}
......
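The nested foreach in steps 12-13 (order by sim desc, then limit 100) corresponds roughly to this Python sketch (hypothetical similarity triples):

```python
from collections import defaultdict

TOP_N = 100
# (itemid1, itemid2, sim) triples, as produced by step 11 (toy data)
sims = [(1, 2, 0.9), (1, 3, 0.7), (2, 3, 0.8)]

groups = defaultdict(list)
for i1, i2, s in sims:
    groups[i1].append((i2, s))
    groups[i2].append((i1, s))  # the mirrored relation built in step 13

# Per item: sort neighbours by similarity descending and keep the top 100.
top = {item: sorted(neigh, key=lambda t: t[1], reverse=True)[:TOP_N]
       for item, neigh in groups.items()}
```

The mirroring step is what guarantees every item appears as a group key, not only the items that happened to sort first in a pair.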
5. transform the dataset into the format below (step 15 above):
Itemid1, Average
Itemid2, Average
......
6. transform the dataset into the format below (step 17 above):
Itemid, Average, Userid, Real_Score
Itemid, Average, Userid, Real_Score
......
7. transform the dataset into the format below (step 18 above):
userid_testset, avg, itemid_testset, real_score, {(Item1, Item2, Similarity), (Item1, Item3, Similarity)..........}
userid_testset, avg, itemid_testset, real_score, {(Item1, Item2, Similarity), (Item1, Item3, Similarity)..........}
......
8. transform the dataset into the format below (step 19 above):
userid_testset, avg, itemid_testset, real_score, {(Item1, Item2, Similarity), (Item1, Item3, Similarity)..........}, {(Userid, Itemid, Score, Average, Norm), (Userid, Itemid, Score, Average, Norm), .............}
userid_testset, avg, itemid_testset, real_score, {(Item1, Item2, Similarity), (Item1, Item3, Similarity)..........}, {(Userid, Itemid, Score, Average, Norm), (Userid, Itemid, Score, Average, Norm), .............}
......
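The joins in steps 6-8 can be sketched as dictionary lookups in Python (all data hypothetical): each test rating is enriched with the item's average, its neighbour list, and the user's training ratings, mirroring the Pig joins on itemid and userid.

```python
item_avg = {10: 3.5, 20: 4.0}                    # avg_4: itemid -> average score
neighbours = {10: [(20, 0.9)], 20: [(10, 0.9)]}  # step 14: itemid -> [(itemid2, sim)]
user_ratings = {7: [(20, 5.0)]}                  # merge_2: userid -> [(itemid, score)]
testset = [(7, 10, 4.0)]                         # (userid, itemid, real_score)

# Inner joins, like Pig's join: rows missing a key on either side are dropped.
joined = [(u, i, r, item_avg[i], neighbours[i], user_ratings[u])
          for u, i, r in testset
          if i in item_avg and i in neighbours and u in user_ratings]
```

Each joined row now carries everything Forecast_score needs: the real score, the item average, the top-100 neighbours, and the user's own ratings.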
9. compute the forecast score (step 20 above):
use the Pig UDF in pig_udf.jar.
10. compute the MAE value (step 22 above):
the purpose of using key '1' is that all rows go to the same reducer because they share the same key.
Formula :
Similarity:
Compute the similarity of item i and item j. The script uses cosine similarity (step 11 divides SUM(score_multiply) by SUM(norm_multiply)):
sim(i, j) = sum_u(r_ui * r_uj) / (sqrt(sum_u(r_ui^2)) * sqrt(sum_u(r_uj^2)))
r_ui is the rating which user u gives to item i.
r_uj is the rating which user u gives to item j.
Forecast score:
P_uj is the forecast score of item j for user u. A common mean-centred form (the exact weighting is implemented in pig_udf.Forecast_score) is:
P_uj = r_j_avg + sum_i(sim(i, j) * (r_ui - r_i_avg)) / sum_i(|sim(i, j)|), summed over the neighbour items i that user u has rated.
r_i_avg is the average score of item i.
Mae value:
MAE -- mean absolute error.
MAE = (1/N) * sum_i(|p_i - r_i|)
N is the count of forecast scores.
p_i is the forecast score which item i gets.
r_i is the real score which item i gets.
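A toy end-to-end check of the three computations in Python (hypothetical ratings; the forecast weighting below follows the common mean-centred item-based form and may differ in detail from pig_udf.Forecast_score):

```python
from math import sqrt

# Training ratings: userid -> {itemid: score} (toy data)
train = {1: {10: 4.0, 20: 5.0},
         2: {10: 3.0, 20: 4.0}}

def cosine_sim(i, j):
    # sim(i, j) = sum_u r_ui * r_uj / (||r_i|| * ||r_j||), as in the Pig script
    num = sum(r[i] * r[j] for r in train.values() if i in r and j in r)
    ni = sqrt(sum(r[i] ** 2 for r in train.values() if i in r))
    nj = sqrt(sum(r[j] ** 2 for r in train.values() if j in r))
    return num / (ni * nj)

def avg(i):
    scores = [r[i] for r in train.values() if i in r]
    return sum(scores) / len(scores)

def forecast(u, j, neighbours):
    # P_uj = avg_j + sum_i sim(i, j) * (r_ui - avg_i) / sum_i |sim(i, j)|
    num = den = 0.0
    for i in neighbours:
        if i in train[u] and i != j:
            s = cosine_sim(i, j)
            num += s * (train[u][i] - avg(i))
            den += abs(s)
    return avg(j) + (num / den if den else 0.0)

# MAE over a toy test set of (userid, itemid, real_score)
test = [(1, 10, 4.0)]
preds = [forecast(u, j, [10, 20]) for u, j, _ in test]
mae = sum(abs(p - r) for p, (_, _, r) in zip(preds, test)) / len(test)
```

With these two users, user 1 rates item 20 half a point above its average, so item 10's forecast lands half a point above its own average, matching the real score.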
The formulas follow the paper below:
戚丽丽, 孙静宇, 陈俊杰. 基于均模型的IBCF 算法研究[J]. 山东大学学报(理学版), 2013 (11).
Resources:
pig_udf.jar :
http://pan.baidu.com/s/1gdELGYn
ml-10m datasets :
http://pan.baidu.com/s/1gdzmDpT