A Simple Guide to Using Mahout itemCF

Source: Internet  Editor: 程序博客网  Time: 2024/06/07 09:20

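I. Testing itemCF

Mahout version: 0.10.0

Mahout provides many algorithms. One of the more commonly used is itemCF (item-based collaborative filtering), and this post records how to use it.

1. Data preparation. I use some behavior data that I collected myself. There is not much of it, but it is enough to produce results:

The three columns are user_id, item_id, preference.

Store the following data on HDFS. The path I used is /mahout/itemcf/data1/itemdata.data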

  0162381440670851711,4,7.0
  0162381440670851711,11,4.0
  0162381440670851711,32,1.0
  0162381440670851711,176,27.0
  0162381440670851711,183,11.0
  0162381440670851711,184,5.0
  0162381440670851711,207,9.0
  0162381440670851711,256,3.0
  0162381440670851711,258,4.0
  0162381440670851711,259,16.0
  0162381440670851711,260,8.0
  0162381440670851711,261,18.0
  0162381440670851711,301,1.0
  0162381440670851711,307,1.0
  0162381440670851711,477,1.0
  0162381440670851711,518,1.0
  0162381440670851711,549,3.0
  0162381440670851711,570,1.0
  0162381440670851711,826,2.0
  0357211441096952115,207,1.0
  0617721441096186493,184,1.0
  0617721441096186493,207,1.0
  1205421441071459451,5,1.0
  1214361441096861254,207,1.0
  1401731441095483081,258,1.0
  1401731441095483081,814,4.0
  1401731441095483081,826,1.0
  1917281441163686119,259,10.0
  1917281441163686119,260,1.0
  1917281441163686119,261,3.0
  1966141441163860798,176,1.0
  2294491441095342047,176,1.0
  2441031440670827430,4,13.0
  2441031440670827430,259,29.0
  2441031440670827430,261,14.0
  2441031440670827430,460,2.0
  2441031440670827430,477,6.0
  2441031440670827430,570,1.0
  2441031440670827430,577,6.0
  2441031440670827430,702,1.0
  2441031440670827430,758,2.0
  2441031440670827430,809,1.0
  2475791441161318569,176,1.0
  2987091441068878630,261,1.0
  3114261440726814722,549,1.0
  3445831441096810087,207,1.0
  3846061441096937902,207,1.0
  4266911441160164599,176,1.0
  4698311441097046150,176,2.0
  4698311441097046150,183,2.0
  4698311441097046150,184,4.0
  4698311441097046150,207,6.0
  4946291441097563245,183,1.0
  4956331440750398178,159,1.0
  4956331440750398178,160,1.0
  5307571441160362208,4,1.0
  5307571441160362208,176,1.0
  5719691441098504387,176,5.0
  5719691441098504387,184,1.0
  5719691441098504387,207,1.0
  5813281441095425044,184,2.0
  5813281441095425044,258,1.0
  5894601441095265604,184,1.0
  5981521441096106535,207,1.0
  6292291441096870187,207,1.0
  6533651441161410910,176,1.0
  6810691441096902907,207,1.0
  6836071440729632252,4,3.0
  6836071440729632252,49,1.0
  6836071440729632252,259,2.0
  6836071440729632252,570,1.0
  6836071440729632252,577,2.0
  6964141441160527746,176,1.0
  7495291441096796843,207,1.0
  7616681441095305067,183,1.0
  7616681441095305067,184,2.0
  7616681441095305067,258,2.0
  7616681441095305067,261,1.0
  7732211441095211112,183,1.0
  7732211441095211112,259,2.0
  7732211441095211112,260,9.0
  7732211441095211112,261,1.0
  7732211441095211112,632,6.0
  8211761441096060717,176,1.0
  8211761441096060717,183,1.0
  8305691441168039389,259,3.0
  8305691441168039389,260,2.0
  8305691441168039389,261,1.0
  8375281440837772178,527,1.0
  8432311440724457499,290,1.0
  8641451441097297246,183,1.0
  8641451441097297246,184,1.0
  8641451441097297246,207,1.0
  8641451441097297246,259,1.0
  8641451441097297246,263,1.0
  8641451441097297246,838,1.0
  8641451441097297246,839,1.0
  8641451441097297246,840,1.0
  8651081441095283643,176,2.0
  8651081441095283643,183,7.0
  8753221441095342356,176,1.0
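
Before uploading, it can help to confirm that every line really has the three comma-separated columns described above. The following is only a minimal sketch and not part of the original workflow: `check_prefs` is a hypothetical helper name, the local file name itemdata.data is assumed, and the upload commands (shown only as comments) assume the `hdfs` client is on the PATH. The numeric check on the IDs reflects the fact that Mahout's Taste recommenders work with numeric (long) IDs.

```shell
#!/usr/bin/env bash

# check_prefs FILE
# Prints every malformed line (with its line number) and returns non-zero
# if any line is not of the form: numeric_user_id,numeric_item_id,number
check_prefs() {
  awk -F',' '
    NF != 3 || $1 !~ /^[0-9]+$/ || $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+(\.[0-9]+)?$/ {
      print "line " NR ": " $0
      bad = 1
    }
    END { exit (bad ? 1 : 0) }
  ' "$1"
}

# Possible usage (assumed local file name, HDFS path taken from the article):
#   check_prefs itemdata.data &&
#     hdfs dfs -mkdir -p /mahout/itemcf/data1 &&
#     hdfs dfs -put itemdata.data /mahout/itemcf/data1/
```

If the check prints nothing, the file matches the expected layout and can be uploaded as is.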

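2. Run collaborative filtering with Mahout's built-in implementation:

The command is as follows:

  bin/hadoop jar /home/lin/hadoop/mahout-distribution-0.10.0/mahout-examples-0.10.0-job.jar org.apache.mahout.cf.taste.hadoop.item.RecommenderJob -i /mahout/itemcf/data1 -o /mahout/itemcf/result1 -s SIMILARITY_LOGLIKELIHOOD --tempDir /mahout/itemcf/temp1

Where:

  -i is the input path, i.e. the directory holding the test data given above;

  -o is the output path for the results. You do not need to create this directory; Mahout creates it automatically, and the job fails if it already exists;

  --tempDir is where Mahout stores its own intermediate output. Mahout creates this path automatically, and the job fails if it already exists;

  -s selects the similarity measure to use; choose whichever one suits your needs.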
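
Because the job fails when the output or temp directory already exists, re-running it usually means deleting both first. A minimal sketch of such a wrapper follows. The function name `run_itemcf` and the `RUN` dry-run switch are hypothetical, and the sketch assumes `hadoop` and `hdfs` are on the PATH (the article instead calls `bin/hadoop` from the Hadoop directory). The jar path and class name are the ones used above.

```shell
#!/usr/bin/env bash

# Jar path taken from the article; adjust to your installation.
MAHOUT_JOB_JAR=/home/lin/hadoop/mahout-distribution-0.10.0/mahout-examples-0.10.0-job.jar

# Set RUN=echo to print the commands instead of executing them (dry run).
RUN=${RUN:-}

# run_itemcf INPUT OUTPUT TEMP
# Removes stale output/temp directories, then starts RecommenderJob.
run_itemcf() {
  local input=$1 output=$2 temp=$3
  # -f keeps rm quiet when a path does not exist yet
  $RUN hdfs dfs -rm -r -f "$output" "$temp"
  $RUN hadoop jar "$MAHOUT_JOB_JAR" \
    org.apache.mahout.cf.taste.hadoop.item.RecommenderJob \
    -i "$input" -o "$output" \
    -s SIMILARITY_LOGLIKELIHOOD --tempDir "$temp"
}

# Possible usage with the article's paths:
#   run_itemcf /mahout/itemcf/data1 /mahout/itemcf/result1 /mahout/itemcf/temp1
```

Running with `RUN=echo` first is a cheap way to confirm which paths would be deleted before anything is actually removed.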

The full help output is as follows:

  Job-Specific Options:
    --input (-i) input
        Path to job input directory.
    --output (-o) output
        The directory pathname for output.
    --similarityClassname (-s) similarityClassname
        Name of distributed similarity measures class to instantiate,
        alternatively use one of the predefined similarities
        ([SIMILARITY_COOCCURRENCE, SIMILARITY_LOGLIKELIHOOD,
        SIMILARITY_TANIMOTO_COEFFICIENT, SIMILARITY_CITY_BLOCK,
        SIMILARITY_COSINE, SIMILARITY_PEARSON_CORRELATION,
        SIMILARITY_EUCLIDEAN_DISTANCE])
    --maxSimilaritiesPerItem (-m) maxSimilaritiesPerItem
        try to cap the number of similar items per item to this number
        (default: 100)
    --maxPrefs (-mppu) maxPrefs
        max number of preferences to consider per user or item, users or
        items with more preferences will be sampled down (default: 500)
    --minPrefsPerUser (-mp) minPrefsPerUser
        ignore users with less preferences than this (default: 1)
    --booleanData (-b) booleanData
        Treat input as without pref values
    --threshold (-tr) threshold
        discard item pairs with a similarity value below this
    --randomSeed randomSeed
        use this seed for sampling
    --help (-h)
        Print out help
    --tempDir tempDir
        Intermediate output directory
    --startPhase startPhase
        First phase to run
    --endPhase endPhase
        Last phase to run

3. After running the command above and waiting for it to finish, the following data can be found in the directory /mahout/itemcf/result1:

  162381440670851711[809:13.535571,702:13.535571,460:13.535571,758:13.535571,632:13.182321,577:12.929438,49:11.368558,307:10.562227,32:10.562227,518:10.562227]
  617721441096186493[839:1.0,259:1.0,518:1.0,826:1.0,11:1.0,260:1.0,4:1.0,32:1.0,176:1.0,840:1.0]
  1401731441095483081[11:1.0,570:1.0,518:1.0,307:1.0,260:1.0,259:1.0,549:1.0,32:1.0,207:1.0,184:1.0]
  1917281441163686119[577:7.365086,702:6.5,809:6.5,758:6.5,460:6.5,184:5.9840446,176:5.981493,4:5.577299,570:5.3220325,477:4.9567957]
  2441031440670827430[632:21.5,176:18.084661,183:15.684914,260:14.2175,207:13.510652,11:12.28147,307:12.28147,32:12.28147,518:12.28147,256:12.28147]
  4698311441097046150[263:3.9337947,839:3.9337947,840:3.9337947,838:3.9337947,11:3.4747553,307:3.4747553,32:3.4747553,518:3.4747553,256:3.4747553,301:3.4747553]
  5307571441160362208[826:1.0,259:1.0,518:1.0,307:1.0,11:1.0,260:1.0,549:1.0,32:1.0,207:1.0,184:1.0]
  5719691441098504387[4:3.6454906,259:3.6147578,260:2.67091,261:2.6694102,183:2.517088,307:2.2876854,11:2.2876854,32:2.2876854,518:2.2876854,256:2.2876854]
  5813281441095425044[207:1.8607497,259:1.6642486,183:1.5539461,301:1.4806436,11:1.4806436,307:1.4806436,32:1.4806436,518:1.4806436,256:1.4806436,549:1.4099455]
  6836071440729632252[207:2.6088793,176:2.3617313,477:1.9966183,460:1.9945599,758:1.9945599,809:1.9945599,702:1.9945599,11:1.9926376,307:1.9926376,32:1.9926376]
  7616681441095305067[826:1.5790755,207:1.5721571,549:1.535743,301:1.50748,307:1.50748,11:1.50748,32:1.50748,518:1.50748,256:1.50748,839:1.5]
  7732211441095211112[826:3.7059078,549:3.7059078,307:3.3461132,256:3.3461132,518:3.3461132,11:3.3461132,301:3.3461132,32:3.3461132,570:3.1800203,477:3.1795032]
  8211761441096060717[826:1.0,259:1.0,518:1.0,307:1.0,11:1.0,260:1.0,549:1.0,32:1.0,207:1.0,184:1.0]
  8305691441168039389[577:2.2471673,4:2.083036,570:2.0549815,809:2.0,460:2.0,11:2.0,826:2.0,32:2.0,307:2.0,549:2.0]
  8641451441097297246[11:1.0,632:1.0,518:1.0,826:1.0,260:1.0,570:1.0,549:1.0,32:1.0,307:1.0,477:1.0]
  8651081441095283643[184:6.597979,258:6.1955295,260:6.1955295,826:5.5266876,549:5.5266876,477:5.5266876,259:4.662548,261:4.662548,11:4.626224,307:4.626224]

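This gives the recommended items for each user: each line holds a user ID followed by a bracketed list of item_id:score pairs.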
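
For loading the recommendations into a database or spreadsheet, a flat one-row-per-recommendation layout is often easier to handle than the bracketed list. The sketch below shows one way to do that with sed and awk. `flatten_recs` is a hypothetical helper name. It tolerates an optional tab or space between the user ID and the opening bracket, since the separator is not visible in the listing above. Note also that the user IDs in the result no longer carry the leading zeros present in the input file.

```shell
#!/usr/bin/env bash

# flatten_recs
# Reads lines like  <user>[item:score,item:score,...]  on stdin and
# writes one "user,item,score" line per recommendation.
flatten_recs() {
  sed -e 's/[[:space:]]//g' -e 's/]$//' -e 's/\[/ /' |
    awk '{
      # $1 is the user ID, $2 the comma-separated item:score list
      n = split($2, recs, ",")
      for (i = 1; i <= n; i++) {
        split(recs[i], kv, ":")
        print $1 "," kv[1] "," kv[2]
      }
    }'
}

# Possible usage; the part-r-* file naming is an assumption about the job output:
#   hdfs dfs -cat /mahout/itemcf/result1/part-r-* | flatten_recs > recs.csv
```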

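Mahout has another frequently used job, item similarity, whose result is the similarity between items:

  mahout itemsimilarity -i /mahout/itemcf/data1 -o /mahout/itemcf/result1 -s SIMILARITY_LOGLIKELIHOOD --tempDir /mahout/itemcf/temp1

Note that this command reuses the output and temp paths of the RecommenderJob run above. As mentioned earlier, the job fails if those paths already exist, so delete them first or pick different paths.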
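
Once the item similarity job has finished, a common follow-up is to look up the items most similar to a given one. The sketch below is hedged on one assumption that the article does not show: that each output line has the form `itemA<TAB>itemB<TAB>similarity`. Check a few lines of your own output before relying on it. `similar_to` is a hypothetical helper name; it matches the item in either column, so it works whether or not each pair is listed in both directions.

```shell
#!/usr/bin/env bash

# similar_to ITEM [N]
# Reads "itemA itemB similarity" lines (whitespace-separated, assumed format)
# on stdin and prints the N items most similar to ITEM, best first.
similar_to() {
  local item=$1 n=${2:-5}
  awk -v item="$item" '
    $1 == item { print $2, $3 }   # ITEM is the first item of the pair
    $2 == item { print $1, $3 }   # ITEM is the second item of the pair
  ' |
    LC_ALL=C sort -k2,2gr |       # sort by similarity, descending
    awk '!seen[$1]++' |           # keep each neighbour only once
    head -n "$n"
}

# Possible usage; the part-r-* file naming is an assumption about the job output:
#   hdfs dfs -cat /mahout/itemcf/result1/part-r-* | similar_to 259 5
```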