Hadoop Friend Recommendation System: Deduplicating the Raw Data (Including MapReduce Job Monitoring)

Source: Internet | Editor: 程序博客网 | Date: 2024/06/06 05:56

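Project index: Overview of the Hadoop-Based Friend Recommendation System Project

I. Real-Time Monitoring of MapReduce Jobs

1. Front end

JSP page

When a MapReduce job is started, a monitoring page opens automatically. Its JSP markup is as follows: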

    <table>
        <tr>
            <td><label for="name">Total jobs:</label></td>
            <td><input class="easyui-validatebox" type="text"
                id="jobnums" data-options="required:true" value="#" /></td>
        </tr>
        <tr>
            <td><label for="name">Current job:</label></td>
            <td><input class="easyui-validatebox" type="text"
                id="currjob" data-options="required:true" value="#" /></td>
        </tr>
        <tr>
            <td><label for="name">JobID:</label></td>
            <td><input class="easyui-validatebox" type="text"
                id="jobid" data-options="required:true" style="width:300px" value="#" /></td>
        </tr>
        <tr>
            <td><label for="name">JobName:</label></td>
            <td><input class="easyui-validatebox" type="text"
                id="jobname" data-options="required:true" style="width:600px"
                value="#" /></td>
        </tr>
        <tr>
            <td><label for="name">Map progress:</label></td>
            <td><input class="easyui-validatebox" type="text"
                id="mapprogress" data-options="required:true"
                value="0.0%" /></td>
        </tr>
        <tr>
            <td><label for="name">Reduce progress:</label></td>
            <td><input class="easyui-validatebox" type="text"
                id="redprogress" data-options="required:true"
                value="0.0%" /></td>
        </tr>
        <tr>
            <td><label for="name">Job state:</label></td>
            <td><input class="easyui-validatebox" type="text"
                id="state" data-options="required:true"
                value="#" /></td>
        </tr>
    </table>
    </div>

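The page shows each monitored item: the total number of jobs, the job ID, the job state, and so on.

JavaScript logic

First, how the monitoring page is opened automatically.
Every job started from the front end goes through the callByAJax function, which is defined as follows: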

    function callByAJax(url, data_) {
        $.ajax({
            url: url,
            data: data_,
            async: true,
            dataType: "json",
            context: document.body,
            success: function (data) {
                // $.messager.progress('close');
                closeProgressbar();
                console.info("data.flag:" + data.flag);
                var retMsg;
                if ("true" == data.flag) {
                    retMsg = 'Operation succeeded!';
                } else {
                    retMsg = 'Operation failed! Reason: ' + data.msg;
                }
                $.messager.show({
                    title: 'Notice',
                    msg: retMsg
                });
                if ("true" == data.flag && "true" == data.monitor) { // add the monitoring page
                    // open it in a separate tab
                    layout_center_addTabFun({
                        title: 'MR Algorithm Monitor',
                        closable: true,
                        // iconCls: node.iconCls,
                        href: 'cluster/monitor_one.jsp'
                    });
                }
            }
        });
    }

    function layout_center_addTabFun(opts) {
        var t = $('#layout_center_tabs');
        if (t.tabs('exists', opts.title)) {
            t.tabs('select', opts.title);
        } else {
            t.tabs('add', opts);
        }
        console.info("Opened page: " + opts.title);
    }

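As the logic above shows, the monitoring page opens in its own tab only when the action that handles the submission URL returns data.flag equal to "true" (the job was handed to the back end successfully) and data.monitor equal to "true" (the job needs a monitoring page).

Next, how the front end obtains the runtime data of a MapReduce job.
(1) Enable automatic refresh in the monitoring page

    <script type="text/javascript">
        // refresh automatically on a timer
        var monitor_cf_interval = setInterval("monitor_one_refresh()", 3000);
    </script>

(2) Request the data from the back end in the JavaScript logic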

    /**
     * Refresh; triggered from the page by
     * var monitor_cf_interval = setInterval("monitor_one_refresh()", 3000);
     */
    function monitor_one_refresh() {
        $.ajax({ // submit via ajax
            url: 'cloud/cloud_monitorone.action',
            dataType: "json",
            success: function (data) {
                if (data.finished == 'error') { // a job failed or its status could not be read
                    clearInterval(monitor_cf_interval);
                    setJobInfoValues(data);
                    console.info("monitor,finished:" + data.finished);
                    $.messager.show({
                        title: 'Notice',
                        msg: 'The job failed!'
                    });
                } else if (data.finished == 'true') {
                    // all jobs succeeded, so stop the timer
                    console.info('monitor,data.finished=' + data.finished);
                    setJobInfoValues(data);
                    clearInterval(monitor_cf_interval);
                    $.messager.show({
                        title: 'Notice',
                        msg: 'All jobs completed successfully!'
                    });
                } else {
                    // still running: update the job information shown on the page
                    setJobInfoValues(data);
                }
            }
        });
    }

    function setJobInfoValues(data) {
        $('#jobnums').val(data.jobnums);
        $('#currjob').val(data.currjob);
        // The back end sends no "rows" entry when no job has been found yet or
        // when it hit an exception; without this guard data.rows.jobId throws.
        if (!data.rows) {
            return;
        }
        $('#jobid').val(data.rows.jobId);
        $('#jobname').val(data.rows.jobName);
        // (n*100).toFixed(2)+"%" keeps two decimals and formats as a percentage
        $('#mapprogress').val((data.rows.mapProgress * 100).toFixed(2) + '%');
        $('#redprogress').val((data.rows.redProgress * 100).toFixed(2) + '%');
        $('#state').val(data.rows.runState);
    }

As the code shows, the request URL is cloud/cloud_monitorone.action, which maps to the corresponding action on the back end.
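The polling contract used by monitor_one_refresh (ask for the status every few seconds, stop as soon as "finished" is "true" or "error") can be written down in Java as well. This is only a sketch of the pattern, not code from the project: PollSketch, pollUntilDone and the maxPolls limit are names and parameters assumed for illustration, and the status source is a plain Supplier instead of an HTTP call.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.function.Supplier;

final class PollSketch {
    /**
     * Polls until the status is terminal ("true" or "error") or maxPolls is reached.
     * Returns the last status seen, or "false" if no terminal status arrived.
     */
    static String pollUntilDone(Supplier<String> fetchFinished, long intervalMillis, int maxPolls) {
        for (int i = 0; i < maxPolls; i++) {
            String finished = fetchFinished.get();
            if ("true".equals(finished) || "error".equals(finished)) {
                return finished; // same stop condition as clearInterval(...) in the page
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return "false";
            }
        }
        return "false";
    }

    public static void main(String[] args) {
        // Made-up sequence of replies: running, running, done.
        Iterator<String> replies = Arrays.asList("false", "false", "true").iterator();
        System.out.println(pollUntilDone(replies::next, 10, 5)); // true
    }
}
```

Unlike the page, the sketch gives up after maxPolls attempts; the JSP version keeps polling until a terminal status arrives.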

2. Back-end logic

Action layer

    /**
     * Monitors a single group of jobs.
     * @throws IOException
     */
    public void monitorone() throws IOException {
        Map<String, Object> jsonMap = new HashMap<String, Object>();
        try {
            List<CurrentJobInfo> currJobList = HUtils.getJobs();
            jsonMap.put("jobnums", HUtils.JOBNUM);
            // "finished" takes one of three values:
            //   true  - all jobs have completed
            //   false - jobs are still running
            //   error - a job failed or was killed, so monitoring stops
            // All jobs are complete only when the number of jobs found reaches
            // JOBNUM and the last job has succeeded.
            if (currJobList.isEmpty()) {
                jsonMap.put("finished", "false");
            } else {
                // state of the most recent job
                CurrentJobInfo last = currJobList.get(currJobList.size() - 1);
                String status = HUtils.hasFinished(last);
                if ("fail".equals(status) || "kill".equals(status)) {
                    jsonMap.put("finished", "error");
                    HUtils.setJobStartTime(System.currentTimeMillis());
                    last.setRunState("Error!");
                } else if (currJobList.size() >= HUtils.JOBNUM && "success".equals(status)) {
                    jsonMap.put("finished", "true");
                    // finished: reset JobStartTime so the next run is judged correctly
                    HUtils.setJobStartTime(System.currentTimeMillis());
                } else {
                    jsonMap.put("finished", "false");
                }
                jsonMap.put("rows", last);
            }
            jsonMap.put("currjob", currJobList.size());
        } catch (Exception e) {
            e.printStackTrace();
            jsonMap.put("finished", "error");
            HUtils.setJobStartTime(System.currentTimeMillis());
        }
        System.out.println(new java.util.Date() + ":" + JSON.toJSONString(jsonMap));
        Utils.write2PrintWriter(JSON.toJSONString(jsonMap)); // send the data as JSON
    }
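The branching in monitorone boils down to one decision: given the run states of the jobs found so far and the expected number of jobs, which "finished" value goes back to the page? The sketch below isolates that decision as a pure function so that it can be checked without a cluster. MonitorStateSketch and finishedFlag are hypothetical names, and the states are passed as plain strings ("RUNNING", "SUCCEEDED", "FAILED", "KILLED") rather than CurrentJobInfo objects.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class MonitorStateSketch {
    /** Maps the run states seen so far to the "finished" flag: "true", "false" or "error". */
    static String finishedFlag(List<String> runStates, int expectedJobs) {
        if (runStates.isEmpty()) {
            return "false"; // no job has been registered yet
        }
        String last = runStates.get(runStates.size() - 1);
        if ("FAILED".equals(last) || "KILLED".equals(last)) {
            return "error"; // stop monitoring as soon as the latest job fails
        }
        boolean allStarted = runStates.size() >= expectedJobs;
        if (allStarted && "SUCCEEDED".equals(last)) {
            return "true"; // every expected job was seen and the last one succeeded
        }
        return "false"; // still running, or more jobs are expected
    }

    public static void main(String[] args) {
        System.out.println(finishedFlag(Collections.<String>emptyList(), 2));          // false
        System.out.println(finishedFlag(Arrays.asList("SUCCEEDED"), 2));               // false
        System.out.println(finishedFlag(Arrays.asList("SUCCEEDED", "SUCCEEDED"), 2));  // true
        System.out.println(finishedFlag(Arrays.asList("KILLED"), 2));                  // error
    }
}
```

Note the second case: a succeeded job does not end monitoring while fewer than JOBNUM jobs have been seen, which is what allows a chain of several MapReduce jobs to be tracked as one run.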

The HUtils methods used by the action are as follows:

    /**
     * Selects jobs by start time and returns their status for monitoring.
     * A job's start time is on the same scale as System.currentTimeMillis(),
     * so the result contains the jobs that started after jobStartTime.
     *
     * @return the matching jobs, sorted
     * @throws IOException
     */
    public static List<CurrentJobInfo> getJobs() throws IOException {
        // getAllJobs() returns every job, whether it failed or succeeded
        JobStatus[] jss = getJobClient().getAllJobs();
        List<CurrentJobInfo> jsList = new ArrayList<CurrentJobInfo>();
        for (JobStatus js : jss) {
            if (js.getStartTime() > jobStartTime) { // only jobs started after jobStartTime
                jsList.add(new CurrentJobInfo(getJobClient().getJob(
                        js.getJobID()), js.getStartTime(), js.getRunState()));
            }
        }
        Collections.sort(jsList);
        return jsList;
    }

    /**
     * @return the jobClient
     */
    public static JobClient getJobClient() {
        if (jobClient == null) {
            try {
                jobClient = new JobClient(getConf());
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        return jobClient;
    }

    /**
     * Reports whether an MR job has finished.
     *
     * @param currentJobInfo
     * @return "success", "fail", "kill" or "running"
     */
    public static String hasFinished(CurrentJobInfo currentJobInfo) {
        if (currentJobInfo != null) {
            if ("SUCCEEDED".equals(currentJobInfo.getRunState())) {
                return "success";
            }
            if ("FAILED".equals(currentJobInfo.getRunState())) {
                return "fail";
            }
            if ("KILLED".equals(currentJobInfo.getRunState())) {
                return "kill";
            }
        }
        return "running";
    }

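With these methods in place, the front end polls on a timer to keep fetching the run state of the MapReduce jobs, which gives remote job monitoring.

II. Implementing the Deduplication Job (Remote MapReduce Submission)

1. Front end

JSP page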

    <table>
        <tr>
            <td><label for="name">Input path:</label></td>
            <td><input class="easyui-validatebox" type="text"
                id="dedup_input_id" data-options="required:true" style="width:300px"
                value="/user/root/_source/source_users.xml" /></td>
        </tr>
        <tr>
            <td><label for="name">Output path:</label></td>
            <td><input class="easyui-validatebox" type="text"
                id="dedup_output_id" data-options="required:true" style="width:300px"
                value="/user/root/_filter/deduplicate" /></td>
        </tr>
        <tr>
            <td></td>
            <td><a id="dedup_submit_id" href="" class="easyui-linkbutton"
                data-options="iconCls:'icon-door_in'">Deduplicate</a></td>
        </tr>
    </table>

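The JSP page specifies the job's input and output paths. The default input path is /user/root/_source/source_users.xml, which matches the output location of the data upload step. The default output directory is /user/root/_filter/deduplicate.

JavaScript logic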

    // ===== dedup_submit_id: data deduplication
    $('#dedup_submit_id').bind('click', function () {
        var input_i = $('#dedup_input_id').val();
        var output_i = $('#dedup_output_id').val();
        // show the progress dialog
        popupProgressbar('Submit job', 'Submitting the job to the cloud platform...', 1000);
        // submit the job asynchronously via ajax
        callByAJax('cloud/cloud_deduplicate.action', {input: input_i, output: output_i});
    });

The job is submitted to the URL cloud/cloud_deduplicate.action.

2. Back-end logic

Action layer

The corresponding action follows from the URL passed in callByAJax('cloud/cloud_deduplicate.action', ...).

    /**
     * Submits the deduplication job.
     */
    public void deduplicate() {
        Map<String, Object> map = new HashMap<String, Object>();
        try {
            // Set the job start time. Subtracting 10000 ms compensates for delay:
            // it moves the recorded submission time earlier so that the job's
            // actual start time falls after JobStartTime.
            HUtils.setJobStartTime(System.currentTimeMillis() - 10000);
            HUtils.JOBNUM = 1; // number of jobs in this run
            new Thread(new Deduplicate(input, output)).start(); // start the job thread
            map.put("flag", "true");    // the job was launched (not that it has finished)
            map.put("monitor", "true"); // tells the page to open the monitoring tab
        } catch (Exception e) {
            e.printStackTrace();
            map.put("flag", "false");
            map.put("monitor", "false");
            map.put("msg", e.getMessage());
        }
        Utils.write2PrintWriter(JSON.toJSONString(map));
    }
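The 10-second offset only makes sense together with the start-time filter in getJobs: a job is monitored when its start time is strictly later than the recorded marker. The sketch below replays that filter on plain numbers to show why the marker is moved back. JobWindowSketch and startedAfter are hypothetical names, the timestamps are made up, and the assumption that the cluster clock can lag behind the web server clock is mine; the source only says the offset removes the effect of delay.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class JobWindowSketch {
    /** Keeps the start times strictly after the marker, sorted in ascending order. */
    static List<Long> startedAfter(List<Long> startTimes, long marker) {
        List<Long> result = new ArrayList<>();
        for (long t : startTimes) {
            if (t > marker) {
                result.add(t);
            }
        }
        Collections.sort(result);
        return result;
    }

    public static void main(String[] args) {
        long submitTime = 1_000_000L;        // web server clock at submission (made-up value)
        long marker = submitTime - 10_000L;  // same 10 s offset as in the action
        // 900_000: an older job; 995_000: the new job, stamped by a clock 5 s behind
        List<Long> all = Arrays.asList(995_000L, 900_000L);
        System.out.println(startedAfter(all, marker));     // [995000]
        System.out.println(startedAfter(all, submitTime)); // []
    }
}
```

The trade-off is that any unrelated job started on the cluster during those 10 seconds would also pass the filter and be counted as part of the run.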

The key statement in the deduplicate action is

    new Thread(new Deduplicate(input, output)).start(); // start the job thread

It launches the MapReduce job on a new thread. Deduplicate is defined as follows:

    import org.apache.hadoop.util.ToolRunner;
    // HUtils is the project's own helper class.

    /**
     * Deduplication
     */
    public class Deduplicate implements Runnable {
        private String input;
        private String output;

        public Deduplicate(String input, String output) {
            this.input = input;
            this.output = output;
        }

        @Override
        public void run() {
            String[] args = {
                    HUtils.getHDFSPath(input),  // resolve the input path
                    HUtils.getHDFSPath(output)  // resolve the output path
            };
            try {
                ToolRunner.run(HUtils.getConf(), new DeduplicateJob(), args);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

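Deduplicate implements the Runnable interface and its run method so that the job can execute on its own thread. The key statement is

    ToolRunner.run(HUtils.getConf(), new DeduplicateJob(), args);

This uses Hadoop's ToolRunner utility class. ToolRunner is simple to use and is not covered in detail here; see the Hadoop documentation. Only the arguments of ToolRunner.run are described:
(1) HUtils.getConf() supplies the basic Hadoop configuration;
(2) DeduplicateJob, as in new DeduplicateJob(), is the job class. It must implement the Tool interface, which is usually done by extending Configured;
(3) args holds the run arguments, equivalent to the arguments typed after the job on the command line.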
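The combination described here (a Runnable wrapper around a tool whose run(args) returns an exit code) can be tried out without Hadoop. The sketch below is a stand-in under stated assumptions: MiniTool is a hypothetical one-method interface that only mimics the shape of Hadoop's Tool, and ToolTask plays the role of Deduplicate. Unlike the original class, it records the exit code instead of discarding it, which is one way a caller could later tell a failed submission from a successful one.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class BackgroundSubmitSketch {
    /** Hypothetical stand-in for Hadoop's Tool interface: run(args) returns an exit code. */
    interface MiniTool {
        int run(String[] args) throws Exception;
    }

    /** Plays the role of Deduplicate: wraps a tool so that it can run on its own thread. */
    static final class ToolTask implements Runnable {
        private final MiniTool tool;
        private final String[] args;
        private final AtomicInteger exitCode = new AtomicInteger(-1); // -1 = not finished

        ToolTask(MiniTool tool, String... args) {
            this.tool = tool;
            this.args = args;
        }

        @Override
        public void run() {
            try {
                exitCode.set(tool.run(args));
            } catch (Exception e) {
                exitCode.set(1); // record the failure instead of only printing it
            }
        }

        int exitCode() {
            return exitCode.get();
        }
    }

    /** Starts the task on a new thread, waits for it, and returns its exit code. */
    static int runAndWait(ToolTask task) {
        Thread worker = new Thread(task);
        worker.start(); // a web action would return right after this call
        try {
            worker.join(); // the sketch waits only so that it can report the result
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return task.exitCode();
    }

    public static void main(String[] args) {
        MiniTool needsTwoArgs = a -> a.length == 2 ? 0 : 2;
        System.out.println(runAndWait(new ToolTask(needsTwoArgs, "/in", "/out"))); // 0
        System.out.println(runAndWait(new ToolTask(needsTwoArgs, "/in")));         // 2
    }
}
```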

The code of HUtils.getConf() is as follows:

    public static Configuration getConf() {
        if (conf == null) {
            conf = new Configuration();
            // get configuration from db or file
            // enable cross-platform job submission
            conf.setBoolean("mapreduce.app-submission.cross-platform", "true"
                    .equals(Utils.getKey(
                            "mapreduce.app-submission.cross-platform", flag)));
            // the NameNode
            conf.set("fs.defaultFS", Utils.getKey("fs.defaultFS", flag));
            // use the YARN framework
            conf.set("mapreduce.framework.name",
                    Utils.getKey("mapreduce.framework.name", flag));
            // the ResourceManager
            conf.set("yarn.resourcemanager.address",
                    Utils.getKey("yarn.resourcemanager.address", flag));
            // the scheduler
            conf.set("yarn.resourcemanager.scheduler.address", Utils.getKey(
                    "yarn.resourcemanager.scheduler.address", flag));
            conf.set("mapreduce.jobhistory.address",
                    Utils.getKey("mapreduce.jobhistory.address", flag));
        }
        return conf;
    }
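getConf reads six keys through Utils.getKey, whose storage is described only as "db or file". Assuming a properties-style file, the values might be loaded and validated as sketched below. ClusterSettingsSketch and load are hypothetical names, the host name master is a placeholder, and the port numbers are only examples; none of these values come from the original project.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

final class ClusterSettingsSketch {
    /** The six keys that getConf() copies into the Hadoop Configuration. */
    static final String[] REQUIRED_KEYS = {
        "mapreduce.app-submission.cross-platform",
        "fs.defaultFS",
        "mapreduce.framework.name",
        "yarn.resourcemanager.address",
        "yarn.resourcemanager.scheduler.address",
        "mapreduce.jobhistory.address"
    };

    /** Placeholder values; replace host and ports with those of the real cluster. */
    static final String SAMPLE =
        "mapreduce.app-submission.cross-platform=true\n"
        + "fs.defaultFS=hdfs://master:8020\n"
        + "mapreduce.framework.name=yarn\n"
        + "yarn.resourcemanager.address=master:8032\n"
        + "yarn.resourcemanager.scheduler.address=master:8030\n"
        + "mapreduce.jobhistory.address=master:10020\n";

    /** Parses properties text and fails fast when a required key is missing. */
    static Properties load(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        for (String key : REQUIRED_KEYS) {
            if (props.getProperty(key) == null) {
                throw new IllegalArgumentException("Missing setting: " + key);
            }
        }
        return props;
    }

    public static void main(String[] args) {
        Properties props = load(SAMPLE);
        System.out.println(props.getProperty("fs.defaultFS")); // hdfs://master:8020
    }
}
```

Failing fast on a missing key is a design choice of the sketch: conf.set throws when handed a null value, so a forgotten key would otherwise surface as a less descriptive error inside getConf.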

DeduplicateJob is defined as follows:

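    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.util.GenericOptionsParser;
    import org.apache.hadoop.util.Tool;

    /**
     * users.xml
     * Removes duplicate records.
     */
    public class DeduplicateJob extends Configured implements Tool {
        @Override
        public int run(String[] args) throws Exception {
            Configuration conf = HUtils.getConf();
            // parse the command-line arguments
            String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
            if (otherArgs.length != 2) { // exactly two arguments: input and output path
                System.err.println("Usage: com.kang.filter.DeduplicateJob <in> <out>");
                // Return instead of calling System.exit(2): the job runs inside the
                // web application, and System.exit would shut down the whole server.
                return 2;
            }
            Job job = Job.getInstance(conf,
                    "Deduplicate input  :" + otherArgs[0] + " to " + otherArgs[1]);
            job.setJarByClass(DeduplicateJob.class);
            job.setMapperClass(DeduplicateMapper.class);
            job.setReducerClass(DeduplicateReducer.class);
            // job.setNumReduceTasks(0);
            job.setNumReduceTasks(1);
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(Text.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(NullWritable.class);
            // job.setOutputFormatClass(SequenceFileOutputFormat.class);
            FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
            FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
            // delete the output directory before running the job
            FileSystem.get(conf).delete(new Path(otherArgs[1]), true);
            return job.waitForCompletion(true) ? 0 : 1;
        }
    }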

Implementation of the deduplication MapReduce job

First, the format of the input file:

    <?xml version="1.0" encoding="utf-8"?>
    <users>
      <row Id="-1" Reputation="9" CreationDate="2010-07-28T16:38:27.683" DisplayName="Community" EmailHash="a007be5a61f6aa8f3e85ae2fc18dd66e" LastAccessDate="2010-07-28T16:38:27.683" Location="on the server farm" AboutMe="&lt;p&gt;Hi, I'm not really a person.&lt;/p&gt;&#xD;&#xA;&lt;p&gt;I'm a background process that helps keep this site clean!&lt;/p&gt;&#xD;&#xA;&lt;p&gt;I do things like&lt;/p&gt;&#xD;&#xA;&lt;ul&gt;&#xD;&#xA;&lt;li&gt;Randomly poke old unanswered questions every hour so they get some attention&lt;/li&gt;&#xD;&#xA;&lt;li&gt;Own community questions and answers so nobody gets unnecessary reputation from them&lt;/li&gt;&#xD;&#xA;&lt;li&gt;Own downvotes on spam/evil posts that get permanently deleted&#xD;&#xA;&lt;/ul&gt;" Views="0" UpVotes="142" DownVotes="119" />
      <row Id="2" Reputation="101" CreationDate="2010-07-28T17:09:21.300" DisplayName="Geoff Dalgas" EmailHash="b437f461b3fd27387c5d8ab47a293d35" LastAccessDate="2011-09-01T23:16:56.353" WebsiteUrl="http://stackoverflow.com" Location="Corvallis, OR" Age="34" AboutMe="&lt;p&gt;Developer on the StackOverflow team.  Find me on&lt;/p&gt;&#xA;&#xA;&lt;p&gt;&lt;a href=&quot;http://www.twitter.com/SuperDalgas&quot; rel=&quot;nofollow&quot;&gt;Twitter&lt;/a&gt;&#xA;&lt;br&gt;&lt;br&gt;&#xA;&lt;a href=&quot;http://blog.stackoverflow.com/2009/05/welcome-stack-overflow-valued-associate-00003/&quot; rel=&quot;nofollow&quot;&gt;Stack Overflow Valued Associate #00003&lt;/a&gt; &lt;/p&gt;&#xA;" Views="25" UpVotes="7" DownVotes="0" />
      <row Id="3" Reputation="101" CreationDate="2010-07-28T18:00:10.977" DisplayName="Jarrod Dixon" EmailHash="2dfa19bf5dc5826c1fe54c2c049a1ff1" LastAccessDate="2011-08-30T00:36:54.630" WebsiteUrl="http://stackoverflow.com" Location="New York, NY" Age="32" AboutMe="&lt;p&gt;&lt;a href=&quot;http://blog.stackoverflow.com/2009/01/welcome-stack-overflow-valued-associate-00002/&quot; rel=&quot;nofollow&quot;&gt;Developer on the Stack Overflow team&lt;/a&gt;.&lt;/p&gt;&#xA;&#xA;&lt;p&gt;Was dubbed &lt;strong&gt;SALTY SAILOR&lt;/strong&gt; by Jeff Atwood, as filth and flarn would oft-times fly when dealing with a particularly nasty bug!&lt;/p&gt;&#xA;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Twitter me: &lt;a href=&quot;http://twitter.com/jarrod_dixon&quot; rel=&quot;nofollow&quot;&gt;jarrod_dixon&lt;/a&gt;&lt;/li&gt;&#xA;&lt;li&gt;Email me: jarrod.m.dixon@gmail.com&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;" Views="15" UpVotes="10" DownVotes="0" />

The input is an XML file. Records are deduplicated by EmailHash: when several records share the same EmailHash value, only the one with the largest Reputation is kept.
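Before looking at the MapReduce version, the rule itself can be stated in a few lines of plain Java: group by EmailHash and keep the record with the highest Reputation. This is a single-machine sketch for illustration only; DedupRuleSketch, User and keepHighestReputation are hypothetical names, and the hash values and reputations are shortened, made-up data.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class DedupRuleSketch {
    /** Minimal record holding only the fields the rule needs. */
    static final class User {
        final String emailHash;
        final int reputation;
        final String displayName;

        User(String emailHash, int reputation, String displayName) {
            this.emailHash = emailHash;
            this.reputation = reputation;
            this.displayName = displayName;
        }
    }

    /** Keeps, for every EmailHash, the user with the largest Reputation. */
    static Map<String, User> keepHighestReputation(List<User> users) {
        Map<String, User> best = new LinkedHashMap<>();
        for (User u : users) {
            User current = best.get(u.emailHash);
            if (current == null || u.reputation > current.reputation) {
                best.put(u.emailHash, u);
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<User> users = Arrays.asList(
            new User("a007", 9, "Community"),
            new User("b437", 101, "Geoff"),
            new User("a007", 50, "Community (duplicate)"));
        for (User u : keepHighestReputation(users).values()) {
            System.out.println(u.displayName + " " + u.reputation);
        }
        // prints:
        // Community (duplicate) 50
        // Geoff 101
    }
}
```

In the MapReduce job the grouping step is done by the shuffle (the mapper emits EmailHash as the key) and the "keep the largest" step is done in the reducer.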

Implementation of the job's map method:

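    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    /**
     * Emits the EmailHash as the key and the original record as the value.
     */
    public class DeduplicateMapper extends Mapper<LongWritable, Text, Text, Text> {
        private Text emailHashKey = new Text();
        private String keyAttr = "EmailHash";

        @Override
        public void map(LongWritable key, Text value, Context cxt)
                throws InterruptedException, IOException {
            // skip lines that are not data rows
            if (!value.toString().trim().startsWith("<row")) {
                return;
            }
            String emailHash = Utils.getAttrValInLine(value.toString(), keyAttr);
            if (emailHash == null) {
                // A row without EmailHash cannot be compared; it is skipped here,
                // since Text.set(null) would throw a NullPointerException.
                return;
            }
            emailHashKey.set(emailHash);
            cxt.write(emailHashKey, value);
        }
    }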

Utils.getAttrValInLine is implemented as follows:

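    /**
     * Returns the value of one attribute in a line.
     * @param line the text of one row
     * @param attr the attribute name
     * @return the attribute value, or null if the attribute is absent
     */
    public static String getAttrValInLine(String line, String attr) {
        String tmpAttr = " " + attr + "=\"";
        int start = line.indexOf(tmpAttr);
        if (start == -1) {
            return null;
        }
        start += tmpAttr.length();
        int end = line.indexOf("\"", start);
        if (end == -1) { // unterminated attribute value
            return null;
        }
        return line.substring(start, end);
    }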
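A quick way to see what getAttrValInLine returns is to run it against a shortened row. The sketch below carries its own copy of the method in a hypothetical class, AttrSketch, so that it runs on its own; the sample rows are abbreviated and partly made up (AccountId is not taken from the data set).

```java
final class AttrSketch {
    /** Copy of the lookup logic: finds ' attr="' and returns the text up to the next quote. */
    static String getAttrValInLine(String line, String attr) {
        String tmpAttr = " " + attr + "=\"";
        int start = line.indexOf(tmpAttr);
        if (start == -1) {
            return null; // attribute not present
        }
        start += tmpAttr.length();
        int end = line.indexOf("\"", start);
        if (end == -1) {
            return null; // unterminated value
        }
        return line.substring(start, end);
    }

    public static void main(String[] args) {
        String row = "<row Id=\"2\" Reputation=\"101\" EmailHash=\"b437\" />";
        System.out.println(getAttrValInLine(row, "Reputation")); // 101
        System.out.println(getAttrValInLine(row, "EmailHash"));  // b437
        System.out.println(getAttrValInLine(row, "Age"));        // null

        // The leading space in the search pattern keeps "Id" from matching
        // inside a longer attribute name such as "AccountId".
        String other = "<row AccountId=\"7\" Id=\"2\" />";
        System.out.println(getAttrValInLine(other, "Id"));       // 2
    }
}
```

Because the method is a plain substring search rather than an XML parser, it returns the raw attribute text: entities such as &amp;lt; are not decoded, which is fine for EmailHash and Reputation.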

Implementation of the job's reduce method:

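    import java.io.IOException;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    /**
     * Among the records that share an EmailHash, keeps only the one with the
     * largest Reputation. If no record has a usable Reputation, the first
     * record is written instead.
     */
    public class DeduplicateReducer extends Reducer<Text, Text, Text, NullWritable> {
        @Override
        public void reduce(Text key, Iterable<Text> values, Context cxt)
                throws InterruptedException, IOException {
            Text first = null; // fallback when no Reputation can be parsed
            Text best = null;  // record with the largest Reputation so far
            int bestRep = Integer.MIN_VALUE;
            for (Text t : values) {
                // Hadoop reuses the same Text instance for every value, so a
                // record that must be kept has to be copied with new Text(t).
                if (first == null) {
                    first = new Text(t);
                }
                String attrV = Utils.getAttrValInLine(t.toString(), "Reputation");
                try {
                    int rep = Integer.parseInt(attrV);
                    // compare with ">" so that the LARGEST Reputation wins
                    if (best == null || rep > bestRep) {
                        bestRep = rep;
                        best = new Text(t);
                    }
                } catch (NumberFormatException e) {
                    // missing or malformed Reputation: never preferred
                }
            }
            cxt.write(best != null ? best : first, NullWritable.get());
        }
    }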

III. Screenshots of a Job Run

1. Job submission

[screenshot]

2. Job monitoring

[screenshot]

3. Back-end console output

[screenshot]

4. HDFS directory

[screenshot]
