Big Data Basics (9): Building a Hadoop Log-Cleaning Project with Maven (Part 1)



Maven Hadoop Log-Cleaning Project (Part 1)


Environment: Hadoop 2.7.2



References:
Maven Hadoop:
http://www.cnblogs.com/Leo_wl/p/4862820.html
http://blog.csdn.net/kongxx/article/details/42339581
Log cleaning:
http://www.cnblogs.com/edisonchou/p/4458219.html


1. Create a Maven Project


In Eclipse: New -> Maven Project. Dependency coordinates can be looked up at:
http://mvnrepository.com/search?q=hadoop-mapreduce-client


groupId: com
artifactId: first




Dependencies:
hadoop-common
hadoop-hdfs
hadoop-mapreduce-client-core
hadoop-mapreduce-client-jobclient
hadoop-mapreduce-client-common
I also added hadoop-yarn-common, but it is optional.




pom.xml (note: change the versions to your own; the jdk.tools entry below points at the JDK's local tools.jar, a common workaround when Eclipse/m2e reports a missing jdk.tools artifact):
<dependencies>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>2.7.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>2.7.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-core</artifactId>
            <version>2.7.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-jobclient</artifactId>
            <version>2.7.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-common</artifactId>
            <version>2.7.2</version>
        </dependency>
        <dependency>
            <groupId>jdk.tools</groupId>
            <artifactId>jdk.tools</artifactId>
            <version>1.8</version>
            <scope>system</scope>
            <systemPath>${JAVA_HOME}/lib/tools.jar</systemPath>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-yarn-common</artifactId>
            <version>2.7.2</version>
        </dependency>
</dependencies>


Save the file to start the build.
Once it finishes, the dependencies appear under Maven Dependencies.




2. Create the LogCleanJob Class
The code is in the appendix (for a detailed explanation of the code, see the original article: http://www.cnblogs.com/edisonchou/p/4458219.html).
Note: the assembly plugin must be added to pom.xml; exporting the jar straight from Eclipse kept failing with errors and I never found the cause.
Also, the @Override on the run method in the original article would not compile, so I commented it out.
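For what it's worth, the @Override failure is typical when the project's compiler compliance level is still 1.5, which rejects @Override on methods that implement an interface method. A minimal sketch (my guess at the cause; it compiles once the level is 1.6 or higher):

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;

// At compiler compliance 1.5 the @Override below is a compile error, matching
// the symptom above; at 1.6+ it compiles cleanly.
class OverrideCheck extends Configured implements Tool {
    @Override
    public int run(String[] args) {
        return 0;
    }
}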


3. Package the Jar
E:\fm-workspace\workspace_2\first>mvn assembly:assembly
Then cd into first\target; the fat jar is first-0.0.1-SNAPSHOT-jar-with-dependencies.jar:
E:\fm-workspace\workspace_2\first\target>dir
2016/08/13  18:21    <DIR>          .
2016/08/13  18:21    <DIR>          ..
2016/08/13  18:19    <DIR>          archive-tmp
2016/08/13  17:34    <DIR>          classes
2016/08/13  18:21        42,996,951 first-0.0.1-SNAPSHOT-jar-with-dependencies.jar
2016/08/13  18:21             9,266 first-0.0.1-SNAPSHOT.jar
2016/08/13  18:19    <DIR>          maven-archiver
2016/08/13  17:31    <DIR>          maven-status
2016/08/13  18:19    <DIR>          surefire-reports
2016/08/13  17:34    <DIR>          test-classes
               2 File(s)     43,006,217 bytes
               8 Dir(s)  113,821,888,512 bytes free


4. Copy the Jar to Linux
Rename first-0.0.1-SNAPSHOT-jar-with-dependencies.jar to first.jar and copy it to the Linux server:
root@py-server:/projects/data# ll
total 42008
drwxr-xr-x 4 root root     4096 Aug 13 18:52 ./
drwxr-xr-x 7 root root     4096 Aug 11 16:29 ../
-rw-r--r-- 1 root root 42996951 Aug 13 18:21 first.jar
drwxr-xr-x 2 root root     4096 Aug 13 15:36 hadoop-logs/
drwxr-xr-x 2 root root     4096 Aug  3 21:04 test/




5. Upload the Data to HDFS
Get the data files from the original article (http://www.cnblogs.com/edisonchou/p/4458219.html); they total roughly 200 MB.
You can also use your own log files, as long as the format matches.
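For reference, the parser in the appendix expects standard access-log lines like the sample hard-coded in LogParser.main below:

27.19.74.143 - - [30/May/2013:17:38:20 +0800] "GET /static/image/common/faq.gif HTTP/1.1" 200 1127

i.e. ip, two dashes, a bracketed timestamp, the quoted request, the status code, and the traffic (bytes).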
root@py-server:/projects/data/hadoop-logs# ll
total 213056
drwxr-xr-x 2 root root      4096 Aug 13 15:36 ./
drwxr-xr-x 4 root root      4096 Aug 13 18:25 ../
-rw-r--r-- 1 root root  61084192 Apr 26  2015 access_2013_05_30.log
-rw-r--r-- 1 root root 157069653 Apr 26  2015 access_2013_05_31.log


The default HDFS working directory is /user/root/:
root@py-server:/projects/data# hadoop fs -put hadoop-logs/ .
root@py-server:/projects/data# hadoop fs -ls 


Found 14 items
drwxr-xr-x   - root supergroup          0 2016-08-09 23:59 .sparkStaging
drwxr-xr-x   - root supergroup          0 2016-08-13 15:38 hadoop-logs
-rw-r--r--   2 root supergroup      85285 2016-08-06 07:59 imdb_labelled.txt
-rw-r--r--   2 root supergroup         72 2016-08-04 09:29 kmeans_data.txt
drwxr-xr-x   - root supergroup          0 2016-08-09 23:59 kmeans_result
drwxr-xr-x   - root supergroup          0 2016-08-05 16:16 kmeans_result.txt
-rw-r--r--   2 root supergroup      43914 2016-08-04 12:33 ks_aio.py
drwxr-xr-x   - root supergroup          0 2016-08-09 10:51 mymlresult
drwxr-xr-x   - root supergroup          0 2016-08-09 10:28 naive_bayes_result
-rw-r--r--   2 root supergroup      66288 2016-08-09 23:57 price_data.txt
-rw-r--r--   2 root supergroup       1619 2016-08-08 17:54 price_data2.txt
-rw-r--r--   2 root supergroup       1619 2016-08-09 09:13 price_train_data.txt
-rw-r--r--   2 root supergroup        120 2016-08-04 09:24 sample_kmeans_data.txt
-rw-r--r--   2 root supergroup     104736 2016-08-08 17:14 sample_libsvm_data.txt






6. Run the Job on Hadoop
root@py-server:/projects/data# hadoop jar first.jar /user/root/hadoop-logs/ /user/root/logcleanjob_output

Result: blazing fast, just 36 seconds!

Check it in the Hadoop UI (mine is at py-server:8088):

User: root
Name: LogCleanJob
Application Type: MAPREDUCE
Application Tags:
YarnApplicationState: FINISHED
FinalStatus Reported by AM: SUCCEEDED
Started: Sat Aug 13 18:46:18 +0800 2016
Elapsed: 36sec
Tracking URL: History
Diagnostics:

Clean process success!
root@py-server:/projects/data# hadoop fs -ls /user/root/
Found 15 items
drwxr-xr-x   - root supergroup          0 2016-08-09 23:59 /user/root/.sparkStaging
drwxr-xr-x   - root supergroup          0 2016-08-13 18:45 /user/root/hadoop-logs
-rw-r--r--   2 root supergroup      85285 2016-08-06 07:59 /user/root/imdb_labelled.txt
-rw-r--r--   2 root supergroup         72 2016-08-04 09:29 /user/root/kmeans_data.txt
drwxr-xr-x   - root supergroup          0 2016-08-09 23:59 /user/root/kmeans_result
drwxr-xr-x   - root supergroup          0 2016-08-05 16:16 /user/root/kmeans_result.txt
-rw-r--r--   2 root supergroup      43914 2016-08-04 12:33 /user/root/ks_aio.py
drwxr-xr-x   - root supergroup          0 2016-08-13 18:46 /user/root/logcleanjob_output
drwxr-xr-x   - root supergroup          0 2016-08-09 10:51 /user/root/mymlresult
drwxr-xr-x   - root supergroup          0 2016-08-09 10:28 /user/root/naive_bayes_result
-rw-r--r--   2 root supergroup      66288 2016-08-09 23:57 /user/root/price_data.txt
-rw-r--r--   2 root supergroup       1619 2016-08-08 17:54 /user/root/price_data2.txt
-rw-r--r--   2 root supergroup       1619 2016-08-09 09:13 /user/root/price_train_data.txt
-rw-r--r--   2 root supergroup        120 2016-08-04 09:24 /user/root/sample_kmeans_data.txt
-rw-r--r--   2 root supergroup     104736 2016-08-08 17:14 /user/root/sample_libsvm_data.txt


root@py-server:/projects/data# hadoop fs -ls /user/root/logcleanjob_output
Found 2 items
-rw-r--r--   2 root supergroup          0 2016-08-13 18:46 /user/root/logcleanjob_output/_SUCCESS
-rw-r--r--   2 root supergroup   50810594 2016-08-13 18:46 /user/root/logcleanjob_output/part-r-00000


root@py-server:/projects/data# hadoop fs -cat /user/root/logcleanjob_output/part-r-00000




Each line carries the three fields kept by the mapper (tab-separated): ip, time (yyyyMMddHHmmss), url:

118.112.191.88    20130530204006    source/plugin/wsh_wx/img/wsh_zk.css
113.107.237.31    20130530204005    thread-10500-1-1.html
110.251.129.203    20130531081904    forum.php?mod=ajax&action=forumchecknew&fid=111&time=1369959258&inajax=yes
118.112.191.88    20130530204006    data/cache/style_1_common.css?y7a
220.231.55.69    20130530204005    home.php?mod=spacecp&ac=pm&op=checknewpm&rand=1369917603
110.75.174.58    20130531081903    thread-21066-1-1.html
118.112.191.88    20130530204006    data/cache/style_1_forum_viewthread.css?y7a
110.75.174.55    20130531081904    home.php?do=thread&from=space&mod=space&uid=71469&view=me
14.17.29.89    20130530204006    home.php?mod=misc&ac=sendmail&rand=1369917604
121.25.131.148    20130531081906    data/attachment/common/c2/common_12_usergroup_icon.jpg
59.174.191.135    20130530204003    forum.php?mod=forumdisplay&fid=111&page=1&filter=author&orderby=dateline
118.112.191.88    20130530204007    data/attachment/common/65/common_11_usergroup_icon.jpg
121.25.131.148    20130531081905    home.php?mod=misc&ac=sendmail&rand=1369959541
101.229.199.98    20130530204007    data/cache/style_1_widthauto.css?y7a
59.174.191.135    20130530204005    home.php?mod=space&uid=71081&do=profile&from=space












#######################################
Troubleshooting:
1. What if the Maven build is interrupted?
http://www.cnblogs.com/tangyanbo/p/4329303.html
Right-click the project: Maven -> Update Project and tick the Force option. With Force ticked you do not have to delete the leftovers of failed downloads first, which helps when a large number of jars failed to download.
Then rebuild.
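A rough command-line equivalent (my assumption: -U, i.e. --update-snapshots, makes Maven re-check artifacts whose download previously failed):

mvn -U clean package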
2. Do not put the main class name after first.jar when running hadoop jar: the assembly plugin already wrote com.first.LogCleanJob into the manifest, so an extra class-name argument is treated as the input path and Hadoop reports that the input folder cannot be found. The working invocation:
hadoop jar first.jar /user/root/hadoop-logs/ /user/root/logcleanjob_output
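For contrast, this variant fails with an input-path error, because com.first.LogCleanJob is passed through to the program as args[0]:

hadoop jar first.jar com.first.LogCleanJob /user/root/hadoop-logs/ /user/root/logcleanjob_output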




#######################################
Appendix: LogCleanJob.java


package com.first;


//package techbbs;


import java.net.URI;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;


import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;


public class LogCleanJob extends Configured implements Tool {


    public static void main(String[] args) {
        Configuration conf = new Configuration();
        try {
            int res = ToolRunner.run(conf, new LogCleanJob(), args);
            System.exit(res);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }


    //@Override  // commented out: would not compile at the project's compiler level (see note in section 2)
    public int run(String[] args) throws Exception {
        final Job job = new Job(new Configuration(),
                LogCleanJob.class.getSimpleName());
        // allow the job to be packaged and run from a jar
        job.setJarByClass(LogCleanJob.class);
        FileInputFormat.setInputPaths(job, args[0]);
        job.setMapperClass(MyMapper.class);
        job.setMapOutputKeyClass(LongWritable.class);
        job.setMapOutputValueClass(Text.class);
        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // delete the output directory if it already exists
        FileSystem fs = FileSystem.get(new URI(args[0]), getConf());
        Path outPath = new Path(args[1]);
        if (fs.exists(outPath)) {
            fs.delete(outPath, true);
        }
        
        boolean success = job.waitForCompletion(true);
        if(success){
            System.out.println("Clean process success!");
        }
        else{
            System.out.println("Clean process failed!");
        }
        return success ? 0 : 1;  // surface the job status as the process exit code
    }


    static class MyMapper extends
            Mapper<LongWritable, Text, LongWritable, Text> {
        LogParser logParser = new LogParser();
        Text outputValue = new Text();


        protected void map(
                LongWritable key,
                Text value,
                org.apache.hadoop.mapreduce.Mapper<LongWritable, Text, LongWritable, Text>.Context context)
                throws java.io.IOException, InterruptedException {
            final String[] parsed = logParser.parse(value.toString());


            // step 1: filter out requests for static resources
            if (parsed[2].startsWith("GET /static/")
                    || parsed[2].startsWith("GET /uc_server")) {
                return;
            }
            // step 2: strip the leading "GET /" or "POST /" prefix
            if (parsed[2].startsWith("GET /")) {
                parsed[2] = parsed[2].substring("GET /".length());
            } else if (parsed[2].startsWith("POST /")) {
                parsed[2] = parsed[2].substring("POST /".length());
            }
            // step 3: strip the trailing " HTTP/1.1"
            if (parsed[2].endsWith(" HTTP/1.1")) {
                parsed[2] = parsed[2].substring(0, parsed[2].length()
                        - " HTTP/1.1".length());
            }
            // step 4: write out only the first three fields: ip, time, url
            outputValue.set(parsed[0] + "\t" + parsed[1] + "\t" + parsed[2]);
            context.write(key, outputValue);
        }
    }


    static class MyReducer extends
            Reducer<LongWritable, Text, Text, NullWritable> {
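        // Keys are the mapper's input byte offsets; the reducer drops them and
        // writes each cleaned record on its own line.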
        protected void reduce(
                LongWritable k2,
                java.lang.Iterable<Text> v2s,
                org.apache.hadoop.mapreduce.Reducer<LongWritable, Text, Text, NullWritable>.Context context)
                throws java.io.IOException, InterruptedException {
            for (Text v2 : v2s) {
                context.write(v2, NullWritable.get());
            }
        };
    }


    /*
     * Log parser
     */
    static class LogParser {
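        // FORMAT matches the raw timestamp, e.g. 30/May/2013:17:38:20 (English
        // month names); dateformat1 rewrites it as yyyyMMddHHmmss, e.g. 20130530173820.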
        public static final SimpleDateFormat FORMAT = new SimpleDateFormat(
                "d/MMM/yyyy:HH:mm:ss", Locale.ENGLISH);
        public static final SimpleDateFormat dateformat1 = new SimpleDateFormat(
                "yyyyMMddHHmmss");


        public static void main(String[] args) throws ParseException {
            final String S1 = "27.19.74.143 - - [30/May/2013:17:38:20 +0800] \"GET /static/image/common/faq.gif HTTP/1.1\" 200 1127";
            LogParser parser = new LogParser();
            final String[] array = parser.parse(S1);
            System.out.println("Sample data: " + S1);
            System.out.format(
                    "Parsed result: ip=%s, time=%s, url=%s, status=%s, traffic=%s",
                    array[0], array[1], array[2], array[3], array[4]);
        }


        /**
         * Parse the English-locale time string
         * 
         * @param string
         * @return
         */
        private Date parseDateFormat(String string) {
            Date parse = null;
            try {
                parse = FORMAT.parse(string);
            } catch (ParseException e) {
                e.printStackTrace();
            }
            return parse;
        }


        /**
         * Parse one line of the log
         * 
         * @param line
         * @return an array of 5 elements: ip, time, url, status, traffic
         */
        public String[] parse(String line) {
            String ip = parseIP(line);
            String time = parseTime(line);
            String url = parseURL(line);
            String status = parseStatus(line);
            String traffic = parseTraffic(line);


            return new String[] { ip, time, url, status, traffic };
        }


        private String parseTraffic(String line) {
            final String trim = line.substring(line.lastIndexOf("\"") + 1)
                    .trim();
            String traffic = trim.split(" ")[1];
            return traffic;
        }


        private String parseStatus(String line) {
            final String trim = line.substring(line.lastIndexOf("\"") + 1)
                    .trim();
            String status = trim.split(" ")[0];
            return status;
        }


        private String parseURL(String line) {
            final int first = line.indexOf("\"");
            final int last = line.lastIndexOf("\"");
            String url = line.substring(first + 1, last);
            return url;
        }


        private String parseTime(String line) {
            final int first = line.indexOf("[");
            final int last = line.indexOf("+0800]");
            String time = line.substring(first + 1, last).trim();
            Date date = parseDateFormat(time);
            return dateformat1.format(date);
        }


        private String parseIP(String line) {
            String ip = line.split("- -")[0].trim();
            return ip;
        }
    }
}
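
A side note: new Job(Configuration, String), used in run() above, is deprecated on Hadoop 2.x. A minimal sketch of the non-deprecated setup (a drop-in replacement for the first statement of run(); the helper class name is mine):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

class JobFactorySketch {
    // Job.getInstance copies the passed Configuration, so later changes to
    // conf do not leak into the job; the job name matches run() above.
    static Job newCleanJob(Configuration conf) throws IOException {
        return Job.getInstance(conf, LogCleanJob.class.getSimpleName());
    }
}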




######################################################
Complete pom.xml
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>


  <groupId>com</groupId>
  <artifactId>first</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <packaging>jar</packaging>


  <name>first</name>
  <url>http://maven.apache.org</url>


  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  </properties>


  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>3.8.1</version>
      <scope>test</scope>
    </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>2.7.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>2.7.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-core</artifactId>
            <version>2.7.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-jobclient</artifactId>
            <version>2.7.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-common</artifactId>
            <version>2.7.2</version>
        </dependency>
        <dependency>
            <groupId>jdk.tools</groupId>
            <artifactId>jdk.tools</artifactId>
            <version>1.8</version>
            <scope>system</scope>
            <systemPath>${JAVA_HOME}/lib/tools.jar</systemPath>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-yarn-common</artifactId>
            <version>2.7.2</version>
        </dependency>
  </dependencies>
   <build>
  <defaultGoal>compile</defaultGoal>
  <plugins>  
            <plugin>  
                <artifactId>maven-assembly-plugin</artifactId>  
                <configuration>  
                    <archive>  
                        <manifest>  
                            <mainClass>com.first.LogCleanJob</mainClass>  
                        </manifest>  
                    </archive>  
                    <descriptorRefs>  
                        <descriptorRef>jar-with-dependencies</descriptorRef>  
                    </descriptorRefs>  
                </configuration>  
            </plugin>  
        </plugins>  
  </build>
</project>

