Tracking Down a Strange Hadoop MR Error


Reposted from http://f.dataguru.cn/thread-56030-1-1.html

I'm using Hadoop 1.0.2.

Over the last couple of days I have been writing a small MR job that scans an HBase table, plus a small merge job. When the MR job runs, the client output shows a rather strange error. Sometimes it appears right when the MR task starts, sometimes partway through execution:

java.lang.Throwable: Child Error
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)
Caused by: java.io.IOException: Task process exit with nonzero status of 1.
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:251)
13/01/24 10:35:09 WARN mapred.JobClient: Error reading task outputhttp://master:50060/tasklog?plaintext=true&attemptid=attempt_attempt_201301241037_0001_r_000002_0&filter=stdout
13/01/24 10:35:09 WARN mapred.JobClient: Error reading task outputhttp://master:50060/tasklog?plaintext=true&attemptid=attempt_attempt_201301241037_0001_r_000002_0&filter=stderr
13/01/24 10:35:13 INFO mapred.JobClient: Task Id : attempt_attempt_201301241037_0001_r_000002_0, Status : FAILED

I dug through quite a lot of material online, and opinions vary widely. Most say the cause is too many files under ${HADOOP_HOME}/logs/userlogs, and some people did indeed fix it that way. In reality, though, the actual error is best found by following the tasklog URLs shown in the error output above. Opening those URLs directly usually shows nothing useful; instead, look under ${HADOOP_HOME}/logs/userlogs/{jobid}/{attemptid}, which is usually the accurate place.
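For example, for the failed reduce attempt in the trace above, the per-attempt logs would sit somewhere like the path below (sketched from the attempt id in the output, so treat the exact directory names as an illustration rather than a copy from my cluster):

${HADOOP_HOME}/logs/userlogs/job_201301241037_0001/attempt_201301241037_0001_r_000002_0/stderr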

In my case the error corresponds to the following code in org.apache.hadoop.mapred.TaskRunner.run:

launchJvmAndWait(setupCmds, vargs, stdout, stderr, logSize, workDir);
tracker.getTaskTrackerInstrumentation().reportTaskEnd(t.getTaskID());
if (exitCodeSet) {
  if (!killed && exitCode != 0) {
    if (exitCode == 65) {
      tracker.getTaskTrackerInstrumentation().taskFailedPing(t.getTaskID());
    }
    throw new IOException("Task process exit with nonzero status of " +
                          exitCode + ".");
  }
}

This makes it obvious: the decision is driven by exitCode. Only 65 gets special treatment (an extra taskFailedPing); any other nonzero code simply ends up as the IOException above. So keep tracing downward.

Looking at the launchJvmAndWait method, it turns out the actual MR task runs inside a freshly created child JVM (I'm not sure what the official name is, so let's just call it that). Depending on the task type there is a separate map and reduce JvmManager. This is exactly what the JVM-related settings in ${HADOOP_HOME}/conf/mapred-site.xml are there for (see src/mapred/mapred-default.xml for the available options): with no configuration you get defaults, but if you want map tasks to run better and faster it is worth tuning the child JVM parameters. JVMs are then created in line with the number of tasks. In 1.0.2 the relevant code is the launchJvm method at line 122 of org.apache.hadoop.mapred.JvmManager, shown below, followed by a small configuration sketch:

public void launchJvm(TaskRunner t, JvmEnv env)
    throws IOException, InterruptedException {
  if (t.getTask().isMapTask()) {
    mapJvmManager.reapJvm(t, env);
  } else {
    reduceJvmManager.reapJvm(t, env);
  }
}
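As mentioned above, the child JVM can be tuned per job. Here is a minimal sketch of that idea; the property names are the usual Hadoop 1.x ones as I recall them (mapred.child.java.opts and mapred.job.reuse.jvm.num.tasks), so double-check them against mapred-default.xml for your build:

import org.apache.hadoop.mapred.JobConf;

public class ChildJvmTuning {
  public static void main(String[] args) {
    JobConf conf = new JobConf();
    // Heap size and GC flags handed to every spawned child JVM
    // (assumed Hadoop 1.x property name).
    conf.set("mapred.child.java.opts", "-Xmx512m -XX:+UseParallelGC");
    // Let one child JVM run several tasks of the same job instead of
    // spawning a fresh JVM per task; -1 means unlimited reuse.
    conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);
    System.out.println("child opts = " + conf.get("mapred.child.java.opts"));
  }
}

The same keys can of course be set cluster-wide in mapred-site.xml instead of per job.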

Next, look at the reapJvm method, which iterates over the existing child JVMs to get one running for this task. The logic is fairly tangled, with checks for all kinds of states, but the inline comments are actually quite thorough; read through it if you're interested:

private synchronized void reapJvm(TaskRunner t, JvmEnv env)
    throws IOException, InterruptedException {
  if (t.getTaskInProgress().wasKilled()) {
    // the task was killed in-flight
    // no need to do the rest of the operations
    return;
  }
  boolean spawnNewJvm = false;
  JobID jobId = t.getTask().getJobID();
  // Check whether there is a free slot to start a new JVM.
  // ,or, Kill a (idle) JVM and launch a new one
  // When this method is called, we *must*
  //  (1) spawn a new JVM (if we are below the max)
  //  (2) find an idle JVM (that belongs to the same job), or,
  //  (3) kill an idle JVM (from a different job)
  //  (the order of return is in the order above)
  int numJvmsSpawned = jvmIdToRunner.size();
  JvmRunner runnerToKill = null;
  if (numJvmsSpawned >= maxJvms) {
    // go through the list of JVMs for all jobs.
    Iterator<Map.Entry<JVMId, JvmRunner>> jvmIter =
        jvmIdToRunner.entrySet().iterator();
    while (jvmIter.hasNext()) {
      JvmRunner jvmRunner = jvmIter.next().getValue();
      JobID jId = jvmRunner.jvmId.getJobId();
      // look for a free JVM for this job; if one exists then just break
      if (jId.equals(jobId) && !jvmRunner.isBusy() && !jvmRunner.ranAll()) {
        setRunningTaskForJvm(jvmRunner.jvmId, t); // reserve the JVM
        LOG.info("No new JVM spawned for jobId/taskid: " + jobId + "/" +
            t.getTask().getTaskID() + ". Attempting to reuse: " +
            jvmRunner.jvmId);
        return;
      }
      // Cases when a JVM is killed:
      // (1) the JVM under consideration belongs to the same job
      //     (passed in the argument). In this case, kill only when
      //     the JVM ran all the tasks it was scheduled to run (in terms
      //     of count).
      // (2) the JVM under consideration belongs to a different job and is
      //     currently not busy
      // But in both the above cases, we see if we can assign the current
      // task to an idle JVM (hence we continue the loop even on a match)
      if ((jId.equals(jobId) && jvmRunner.ranAll()) ||
          (!jId.equals(jobId) && !jvmRunner.isBusy())) {
        runnerToKill = jvmRunner;
        spawnNewJvm = true;
      }
    }
  } else {
    spawnNewJvm = true;
  }
  if (spawnNewJvm) {
    if (runnerToKill != null) {
      LOG.info("Killing JVM: " + runnerToKill.jvmId);
      killJvmRunner(runnerToKill);
    }
    spawnNewJvm(jobId, env, t);
    return;
  }
  // *MUST* never reach this
  LOG.fatal("Inconsistent state!!! " +
      "JVM Manager reached an unstable state " +
      "while reaping a JVM for task: " + t.getTask().getTaskID() +
      " " + getDetails() + ". Aborting. ");
  System.exit(-1);
}
Everything ultimately lands in spawnNewJvm. That method starts a thread, a JvmRunner thread, which launches the child JVM; inside the thread's run method there is a runChild method, which contains:

exitCode = tracker.getTaskController().launchTask(user,
    jvmId.jobId.toString(), taskAttemptIdStr, env.setup,
    env.vargs, env.workDir, env.stdout.toString(),
    env.stderr.toString());

This is the call that produces the JVM's exit code. The tracker here is the task tracking object; in plain terms it records the task's running state and runs the task by delegating to a task controller. TaskController is an abstract class with two concrete implementations, and you can pick one in mapred-site.xml: if you configure nothing, the default is org.apache.hadoop.mapred.DefaultTaskController, and the other is org.apache.hadoop.mapred.LinuxTaskController. You can also write your own controller class and point mapred-site.xml at it.
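As a side note, here is a minimal sketch of reading back which controller is configured. The property name mapred.task.tracker.task-controller is my recollection of the Hadoop 1.x key used in mapred-site.xml, not something taken from this post, so verify it against your mapred-default.xml:

import org.apache.hadoop.conf.Configuration;

public class ShowTaskController {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Pick up the cluster's mapred-site.xml if it is on the classpath.
    conf.addResource("mapred-site.xml");
    // Assumed Hadoop 1.x key; DefaultTaskController is used when nothing is set.
    String controller = conf.get("mapred.task.tracker.task-controller",
        "org.apache.hadoop.mapred.DefaultTaskController");
    System.out.println("Task controller in effect: " + controller);
  }
}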
Now look at DefaultTaskController's launchTask method, which is where that exit code is ultimately produced: it builds a ShellCommandExecutor, runs the launch script through bash, and exitCode is simply whatever status bash returns:

String commandFile = writeCommand(cmdLine, rawFs, p);
rawFs.setPermission(p, TaskController.TASK_LAUNCH_SCRIPT_PERMISSION);
shExec = new ShellCommandExecutor(new String[]{"bash", "-c", commandFile},
                                  currentWorkDirectory);
shExec.execute();
} catch (Exception e) {
  if (shExec == null) {
    return -1;
  }
  int exitCode = shExec.getExitCode();
  LOG.warn("Exit code from task is : " + exitCode);
  LOG.info("Output from DefaultTaskController's launchTask follows:");
  logOutput(shExec.getOutput());
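To make the plumbing concrete, here is a small standalone sketch of my own (assuming hadoop-core 1.0.2 on the classpath; the "exit 1" stands in for a broken task launch script) that drives the same ShellCommandExecutor utility and reads back the exit code:

import org.apache.hadoop.util.Shell.ShellCommandExecutor;

public class ExitCodeDemo {
  public static void main(String[] args) {
    // Run a command the same way DefaultTaskController does: "bash -c <script>".
    ShellCommandExecutor shExec =
        new ShellCommandExecutor(new String[] {"bash", "-c", "exit 1"});
    try {
      shExec.execute();  // throws an exception when the command exits nonzero
    } catch (Exception e) {
      System.out.println("Exit code from task is : " + shExec.getExitCode());
      System.out.println(shExec.getOutput());
    }
  }
}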

The thread then reports the result through a call to updateOnJvmExit(jvmId, exitCode). The iteration we saw earlier is really a loop that keeps checking the state of the task threads; when something goes wrong it records the JVM's exit code and then kills the task.
In the task logs under logs/ the error looks like the output below. The first two lines are clearly printed by the code above. Because the underlying implementation is a thread, a system-level block can sometimes prevent the JvmManager from actually killing the child JVM, in which case it only removes it from its bookkeeping; that is what the fourth log line below says:

2013-01-24 10:41:36,101 WARN org.apache.hadoop.mapred.DefaultTaskController: Exit code from task is : 1
2013-01-24 10:41:36,101 INFO org.apache.hadoop.mapred.DefaultTaskController: Output from DefaultTaskController's launchTask follows:
2013-01-24 10:41:36,101 INFO org.apache.hadoop.mapred.TaskController: 
2013-01-24 10:41:36,101 INFO org.apache.hadoop.mapred.JvmManager: JVM Not killed jvm_201301241037_0001_r_-1615671564 but just removed
2013-01-24 10:41:36,101 INFO org.apache.hadoop.mapred.JvmManager: JVM : jvm_201301241037_0001_r_-1615671564 exited with exit code 1. Number of tasks it ran: 0
2013-01-24 10:41:36,101 WARN org.apache.hadoop.mapred.TaskRunner: attempt_201301241037_0001_r_000002_0 : Child Error
java.io.IOException: Task process exit with nonzero status of 1.
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)

That wraps up the root-cause chase. To summarize:


Submitting an MR job produces a job with two sides, the map side and the reduce side. The job is split into a number of tasks, and underneath each task is really a thread, managed by a dedicated TaskController that keeps checking its state and how it is executing. It looks simple, but the implementation is quite involved.
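As a rough recap, the path traced above through the 1.0.2 source is:

TaskRunner.run()
  -> launchJvmAndWait(...)
    -> JvmManager.launchJvm() -> reapJvm() -> spawnNewJvm()
      -> JvmRunner.run() -> runChild()
        -> TaskController.launchTask()   // DefaultTaskController runs "bash -c" via ShellCommandExecutor
  <- exit code reported back through updateOnJvmExit(jvmId, exitCode)
  -> IOException("Task process exit with nonzero status of " + exitCode)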

