Packaging a Hadoop Project with Maven (Including Third-Party JARs)



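Background:

1. When a MapReduce program uses third-party JARs, how do you package the project and submit it to the server for execution?

2. Mahout's item-based algorithm requires the uid to be mapped from a string to a long.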

 

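The specific task I implemented here:

Mahout's item-based algorithm expects input in the format uid,vid,score, where uid and vid must be numeric (long) and score may be an integer or a decimal.

In my data, each record also has the fields uid,vid,score, but the uid contains letters, so I have to map uid from a string to a long.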
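To make the format requirement concrete, here is a minimal sketch in plain Java that checks whether a line fits the "long,long,number" shape. The class name PreferenceLineCheck and the sample records are made up for illustration; this is not Mahout's own parser.

```java
// Sketch only: checks a line against the "uid,vid,score" shape described above.
// PreferenceLineCheck is a hypothetical helper, not part of Mahout.
final class PreferenceLineCheck {

    /** Returns true if the line has exactly three fields: long, long, number. */
    static boolean isNumericRecord(String line) {
        String[] fields = line.split(",");
        if (fields.length != 3) {
            return false;
        }
        try {
            Long.parseLong(fields[0].trim());     // uid must parse as a long
            Long.parseLong(fields[1].trim());     // vid must parse as a long
            Double.parseDouble(fields[2].trim()); // score may be integer or decimal
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // A record with a numeric uid, then one whose uid contains letters.
        System.out.println(isNumericRecord("1001,2002,4.5"));
        System.out.println(isNumericRecord("u1a2b3,2002,4.5"));
    }
}
```

A record whose uid contains letters fails the first Long.parseLong call, which is exactly why the uid has to be converted before the data can be used.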

For speed, I do this conversion with a distributed program.

The program also directly calls a class from Mahout:

org.apache.mahout.cf.taste.impl.model.MemoryIDMigrator
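The conversion can be spread over many mappers only if the string-to-long mapping is deterministic, so that every mapper produces the same long for the same uid without sharing any state. The sketch below shows one way to get that property in plain Java: hash the string with MD5 and fold the first 8 bytes into a long. I believe Mahout's ID migrators work along similar lines, but treat that as an assumption; StringIdHasher is a hypothetical stand-in and is not claimed to reproduce the output of MemoryIDMigrator.toLongID.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch only: a deterministic string-to-long mapping based on MD5.
// StringIdHasher is a hypothetical class used for illustration.
final class StringIdHasher {

    /** Folds the first 8 bytes of the MD5 digest of id into a long. */
    static long toLongId(String id) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(id.getBytes(StandardCharsets.UTF_8));
            long hash = 0L;
            for (int i = 0; i < 8; i++) {
                // Shift in one byte at a time, treating each byte as unsigned.
                hash = (hash << 8) | (digest[i] & 0xFFL);
            }
            return hash;
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The same input always yields the same long, on any machine.
        System.out.println(toLongId("u1a2b3"));
        System.out.println(toLongId("u1a2b3"));
    }
}
```

Note that a hash like this can be negative and is one-way: getting the original string back requires keeping the string-to-long pairs somewhere.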


Create a standard Java project with Maven

mvn archetype:generate -DarchetypeGroupId=org.apache.maven.archetypes -DgroupId=org.linger.mahout -DartifactId=mahoutProject -DpackageName=org.linger.mahout -Dversion=1.0 -DinteractiveMode=false

 

Run mvn clean install to initialize the project. Note that a pom.xml file has already been generated automatically by the archetype.

 

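Edit pom.xml:

1. Remove the junit dependency first. If the generated project still contains a test class that uses junit, delete it as well, or the tests will fail to compile.

 

2. Add the Mahout dependency JARs to pom.xml (how this dependency list was derived is not covered here).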

<properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <mahout.version>0.8</mahout.version>
</properties>

<dependencies>
    <dependency>
        <groupId>org.apache.mahout</groupId>
        <artifactId>mahout-core</artifactId>
        <version>${mahout.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.mahout</groupId>
        <artifactId>mahout-integration</artifactId>
        <version>${mahout.version}</version>
        <exclusions>
            <exclusion>
                <groupId>org.mortbay.jetty</groupId>
                <artifactId>jetty</artifactId>
            </exclusion>
            <exclusion>
                <groupId>org.apache.cassandra</groupId>
                <artifactId>cassandra-all</artifactId>
            </exclusion>
            <exclusion>
                <groupId>me.prettyprint</groupId>
                <artifactId>hector-core</artifactId>
            </exclusion>
        </exclusions>
    </dependency>
</dependencies>



3. Configure the JAR packaging options in pom.xml

<build>
    <plugins>
        <plugin>
            <artifactId>maven-assembly-plugin</artifactId>
            <configuration>
                <archive>
                    <manifest>
                        <mainClass>org.linger.mahout.mapreducer.UserVideoFormat</mainClass>
                    </manifest>
                </archive>
                <descriptorRefs>
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
            </configuration>
            <executions>
                <execution>
                    <id>make-assembly</id>
                    <phase>package</phase>
                    <goals>
                        <goal>single</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>


 

My MapReduce code

package org.linger.mahout.mapreducer;

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.mahout.cf.taste.impl.model.MemoryIDMigrator;

public class UserVideoFormat {

    public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> {
        private Text userId = new Text();
        private Text lefts = new Text();
        private MemoryIDMigrator thing2long = new MemoryIDMigrator();

        public void map(LongWritable key, Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
            String line = value.toString();
            // Split at the first comma: uid on the left, "vid,score" on the right.
            int spliter = line.indexOf(',');
            String userStr = line.substring(0, spliter);
            String leftsStr = line.substring(spliter + 1);

            // Replace the string uid with its long ID.
            userId.set(Long.toString(thing2long.toLongID(userStr)));
            lefts.set(leftsStr);
            output.collect(userId, lefts);
        }
    }

    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(UserVideoFormat.class);
        conf.setJobName("UserVideoFormat");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);

        conf.setMapperClass(Map.class);
        // Separate key and value with a comma instead of the default tab.
        conf.set("mapred.textoutputformat.separator", ",");

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        // Strip generic Hadoop options; the remaining args are the input and output paths.
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        FileInputFormat.setInputPaths(conf, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(conf, new Path(otherArgs[1]));

        JobClient.runJob(conf);
    }
}
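The per-line transformation in the mapper can be tried without a Hadoop cluster. The sketch below repeats the same split-at-the-first-comma logic in plain Java and takes the ID mapping as a parameter, so no Mahout dependency is needed. UidLineRewriter and the stand-in mapper are hypothetical. One deliberate difference: a line without a comma returns null here, whereas the substring call in the mapper above would throw on such a line.

```java
import java.util.function.ToLongFunction;

// Sketch only: the mapper's line rewrite, isolated from Hadoop and Mahout.
// UidLineRewriter is a hypothetical helper used for illustration.
final class UidLineRewriter {

    /** Rewrites "uid,rest" to "longId,rest"; returns null if there is no comma. */
    static String rewrite(String line, ToLongFunction<String> idMapper) {
        int split = line.indexOf(',');
        if (split < 0) {
            return null; // malformed line: let the caller skip it
        }
        String user = line.substring(0, split);
        String rest = line.substring(split + 1);
        return idMapper.applyAsLong(user) + "," + rest;
    }

    public static void main(String[] args) {
        // Stand-in mapping for illustration; the real job calls MemoryIDMigrator.toLongID.
        ToLongFunction<String> standIn = s -> (long) s.length();
        System.out.println(rewrite("u1a2b3,2002,4.5", standIn));
    }
}
```

Keeping the rewrite in a small static method like this also makes it easy to unit test before packaging the job.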


 

 

Run mvn package to build the JAR.

This automatically generates mahoutProject-1.0-jar-with-dependencies.jar in the target directory.
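Before submitting the job, it can be worth confirming that the third-party classes really ended up inside the assembled JAR. A minimal sketch using only java.util.jar follows; JarEntryCheck is a hypothetical helper, and the JAR path and class name are whatever you pass on the command line.

```java
import java.io.IOException;
import java.util.jar.JarFile;

// Sketch only: checks whether a JAR contains a given class file.
// JarEntryCheck is a hypothetical helper used for illustration.
final class JarEntryCheck {

    /** Returns true if the JAR at jarPath contains the fully qualified class. */
    static boolean containsClass(String jarPath, String className) throws IOException {
        // Class files are stored under their package path, e.g. a/b/C.class.
        String entryName = className.replace('.', '/') + ".class";
        try (JarFile jar = new JarFile(jarPath)) {
            return jar.getJarEntry(entryName) != null;
        }
    }

    public static void main(String[] args) throws IOException {
        // args[0]: path to the JAR, args[1]: fully qualified class name
        System.out.println(containsClass(args[0], args[1]));
    }
}
```

For this project the arguments would be target/mahoutProject-1.0-jar-with-dependencies.jar and org.apache.mahout.cf.taste.impl.model.MemoryIDMigrator.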

 

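hadoop jar mahoutProject-1.0-jar-with-dependencies.jar input output

Note that because pom.xml specifies the JAR's main class, it does not need to be given on the command line here.

Otherwise, the main class is normally specified right after the JAR file name.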
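The main class configured in pom.xml ends up as the Main-Class attribute in the JAR's manifest. If the job complains about a missing or wrong main class, a quick way to see what was actually written is to read that attribute back. The sketch below does this with java.util.jar; MainClassReader is a hypothetical helper.

```java
import java.io.IOException;
import java.util.jar.JarFile;
import java.util.jar.Manifest;

// Sketch only: reads the Main-Class attribute from a JAR's manifest.
// MainClassReader is a hypothetical helper used for illustration.
final class MainClassReader {

    /** Returns the Main-Class value, or null if the JAR has no manifest or no such attribute. */
    static String mainClassOf(String jarPath) throws IOException {
        try (JarFile jar = new JarFile(jarPath)) {
            Manifest manifest = jar.getManifest();
            if (manifest == null) {
                return null;
            }
            return manifest.getMainAttributes().getValue("Main-Class");
        }
    }

    public static void main(String[] args) throws IOException {
        // args[0]: path to the JAR to inspect
        System.out.println(mainClassOf(args[0]));
    }
}
```

If this returns null for your JAR, the main class has to be passed to hadoop jar explicitly after the JAR file name.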

 

References:

Building a Mahout project with Maven

http://blog.fens.me/hadoop-mahout-maven-eclipse/

Using third-party dependency JARs in a Hadoop job

http://shiyanjun.cn/archives/373.html

Using string-typed uid and pid for Mahout recommendations

http://blog.csdn.net/pan12jian/article/details/38703569



Article link: http://blog.csdn.net/lingerlanlan/article/details/42086623

Author: linger

