Mahout on Hadoop 2 in practice


1. Thanks to sunshine_junge's post on about, 《hadoop2.2+mahout0.9实战》 (Hadoop 2.2 + Mahout 0.9 in action), which got me past this problem:

java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected
    at org.apache.mahout.common.HadoopUtil.getCustomJobName(HadoopUtil.java:174)
    at org.apache.mahout.common.AbstractJob.prepareJob(AbstractJob.java:614)
    at org.apache.mahout.cf.taste.hadoop.preparation.PreparePreferenceMatrixJob.run(PreparePreferenceMatrixJob.java:73)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
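The error is a binary-compatibility break: in Hadoop 1, org.apache.hadoop.mapreduce.JobContext is a concrete class, while in Hadoop 2 it became an interface, so Mahout bytecode compiled against Hadoop 1 fails as soon as it calls into it on a Hadoop 2 cluster. A rough illustration of the kind of call site that trips it (my own sketch, not the actual Mahout source):

import org.apache.hadoop.mapreduce.JobContext;

public class JobNameProbe {
    // Compiled against Hadoop 1, the call below is emitted as invokevirtual
    // (JobContext is a class there); on a Hadoop 2 classpath JobContext is an
    // interface, which requires invokeinterface, so the JVM throws
    // IncompatibleClassChangeError when the same .class file is executed.
    public static String customJobName(String prefix, JobContext context) {
        return prefix + "-" + context.getJobName();
    }
}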

The fix is as follows:

This happens because Mahout currently only supports Hadoop 1. The solution can be found at https://issues.apache.org/jira/browse/MAHOUT-1329: essentially, modify the pom files to change Mahout's Hadoop dependencies. You can download the modified source package (http://download.csdn.net/detail/fansy1990/7165957) and build Mahout yourself (mvn clean install -Dhadoop2 -Dhadoop.2.version=2.2.0 -DskipTests), or download the already-built jars directly (http://download.csdn.net/detail/fansy1990/7166017, http://download.csdn.net/detail/fansy1990/7166055).

At the time I did not read the build command carefully and just ran mvn clean package; the build succeeded, but the result still did not support Hadoop 2. In the end I downloaded the author's pre-built packages and ran with those.


2. The other problem was rather odd, and I solved it myself:

At runtime a commons-cli2 jar is required. I obtained it by downloading the source and building it myself; the official site has no ready-to-use package either. After downloading the source, I removed the <parent>...</parent> section from the project's pom.xml, otherwise it would not build. It is best to also set the environment variable:

export HADOOP_CLASSPATH=$(hadoop classpath)
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/home/hdfs/go1233recom/lib/commons-cli2-2.0-SNAPSHOT.jar

The command to run the job is as follows:

hadoop jar  /home/hdfs/mahout-core-0.9-job.jar  org.apache.mahout.cf.taste.hadoop.item.RecommenderJob --input /input/user.csv -s SIMILARITY_EUCLIDEAN_DISTANCE --output output1
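For reference, the same job can also be launched from Java through ToolRunner (a minimal sketch; RecommenderJob implements Hadoop's Tool interface, and the paths and similarity option mirror the command above):

import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.cf.taste.hadoop.item.RecommenderJob;

public class RunRecommenderJob {
    public static void main(String[] args) throws Exception {
        // Same arguments as the `hadoop jar ... RecommenderJob ...` command above;
        // --similarityClassname is the long form of -s.
        String[] jobArgs = {
            "--input", "/input/user.csv",
            "--output", "output1",
            "--similarityClassname", "SIMILARITY_EUCLIDEAN_DISTANCE"
        };
        System.exit(ToolRunner.run(new RecommenderJob(), jobArgs));
    }
}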

A snippet from the tail end of the run:

15/11/27 15:24:08 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1447409744056_0022
15/11/27 15:24:09 INFO impl.YarnClientImpl: Submitted application application_1447409744056_0022
15/11/27 15:24:09 INFO mapreduce.Job: The url to track the job: http://bd2.com:8088/proxy/application_1447409744056_0022/
15/11/27 15:24:09 INFO mapreduce.Job: Running job: job_1447409744056_0022
15/11/27 15:24:20 INFO mapreduce.Job: Job job_1447409744056_0022 running in uber mode : false
15/11/27 15:24:20 INFO mapreduce.Job:  map 0% reduce 0%
15/11/27 15:24:30 INFO mapreduce.Job:  map 50% reduce 0%
15/11/27 15:24:31 INFO mapreduce.Job:  map 100% reduce 0%
15/11/27 15:24:38 INFO mapreduce.Job:  map 100% reduce 100%
15/11/27 15:24:38 INFO mapreduce.Job: Job job_1447409744056_0022 completed successfully
15/11/27 15:24:38 INFO mapreduce.Job: Counters: 49
        File System Counters
                FILE: Number of bytes read=326
                FILE: Number of bytes written=381232
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=1489
                HDFS: Number of bytes written=572
                HDFS: Number of read operations=11
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=2
        Job Counters
                Launched map tasks=2
                Launched reduce tasks=1
                Data-local map tasks=2
                Total time spent by all maps in occupied slots (ms)=18295
                Total time spent by all reduces in occupied slots (ms)=9074
                Total time spent by all map tasks (ms)=18295
                Total time spent by all reduce tasks (ms)=4537
                Total vcore-seconds taken by all map tasks=18295
                Total vcore-seconds taken by all reduce tasks=4537
                Total megabyte-seconds taken by all map tasks=9367040
                Total megabyte-seconds taken by all reduce tasks=4645888
        Map-Reduce Framework
                Map input records=12
                Map output records=28
                Map output bytes=453
                Map output materialized bytes=324
                Input split bytes=647
                Combine input records=0
                Combine output records=0
                Reduce input groups=7
                Reduce shuffle bytes=324
                Reduce input records=28
                Reduce output records=7
                Spilled Records=56
                Shuffled Maps =2
                Failed Shuffles=0
                Merged Map outputs=2
                GC time elapsed (ms)=207
                CPU time spent (ms)=2940
                Physical memory (bytes) snapshot=1109094400
                Virtual memory (bytes) snapshot=3818016768
                Total committed heap usage (bytes)=938999808
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=0
        File Output Format Counters
                Bytes Written=572
15/11/27 15:24:38 INFO impl.TimelineClientImpl: Timeline service address: http://bd2.com:8188/ws/v1/timeline/
15/11/27 15:24:38 INFO client.RMProxy: Connecting to ResourceManager at bd2.com/10.252.169.250:8050
15/11/27 15:24:39 INFO input.FileInputFormat: Total input paths to process : 1
15/11/27 15:24:39 INFO mapreduce.JobSubmitter: number of splits:1
15/11/27 15:24:39 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1447409744056_0023
15/11/27 15:24:39 INFO impl.YarnClientImpl: Submitted application application_1447409744056_0023
15/11/27 15:24:39 INFO mapreduce.Job: The url to track the job: http://bd2.com:8088/proxy/application_1447409744056_0023/
15/11/27 15:24:39 INFO mapreduce.Job: Running job: job_1447409744056_0023
15/11/27 15:24:48 INFO mapreduce.Job: Job job_1447409744056_0023 running in uber mode : false
15/11/27 15:24:48 INFO mapreduce.Job:  map 0% reduce 0%
15/11/27 15:24:55 INFO mapreduce.Job:  map 100% reduce 0%
15/11/27 15:25:02 INFO mapreduce.Job:  map 100% reduce 100%
15/11/27 15:25:02 INFO mapreduce.Job: Job job_1447409744056_0023 completed successfully
15/11/27 15:25:03 INFO mapreduce.Job: Counters: 49
        File System Counters
                FILE: Number of bytes read=306
                FILE: Number of bytes written=254265
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=887
                HDFS: Number of bytes written=192
                HDFS: Number of read operations=10
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=2
        Job Counters
                Launched map tasks=1
                Launched reduce tasks=1
                Data-local map tasks=1
                Total time spent by all maps in occupied slots (ms)=4088
                Total time spent by all reduces in occupied slots (ms)=9604
                Total time spent by all map tasks (ms)=4088
                Total time spent by all reduce tasks (ms)=4802
                Total vcore-seconds taken by all map tasks=4088
                Total vcore-seconds taken by all reduce tasks=4802
                Total megabyte-seconds taken by all map tasks=2093056
                Total megabyte-seconds taken by all reduce tasks=4917248
        Map-Reduce Framework
                Map input records=7
                Map output records=21
                Map output bytes=927
                Map output materialized bytes=298
                Input split bytes=128
                Combine input records=0
                Combine output records=0
                Reduce input groups=5
                Reduce shuffle bytes=298
                Reduce input records=21
                Reduce output records=5
                Spilled Records=42
                Shuffled Maps =1
                Failed Shuffles=0
                Merged Map outputs=1
                GC time elapsed (ms)=97
                CPU time spent (ms)=1910
                Physical memory (bytes) snapshot=589557760
                Virtual memory (bytes) snapshot=2723651584
                Total committed heap usage (bytes)=455606272
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=572
        File Output Format Counters
                Bytes Written=192
[hdfs@bd4 ~]$
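
The recommendations themselves end up as plain-text part files under output1, one line per user (the user ID followed by a list of itemID:score pairs). A small sketch for printing them from HDFS, assuming the same output path as in the command above:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PrintRecommendations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // RecommenderJob writes text part files: userID<TAB>[itemID:score,...]
        for (FileStatus part : fs.globStatus(new Path("output1/part-r-*"))) {
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(fs.open(part.getPath())))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }
}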


Other references:

《RecommenderJob源码分析(Step by Step)》 (a step-by-step analysis of the RecommenderJob source code)




