Analysis of the hadoop jar execution flow
The project is wrapping up, and lately I have been organizing its documentation. When we submitted jobs with hadoop jar **.jar, we added some shared dependency jars to the CLASSPATH, so that job jars did not have to bundle a large number of jars at packaging time.
We set those jars both in hadoop-env.sh and in mapreduce.application.classpath and yarn.application.classpath, so running the hadoop jar command locally no longer failed with missing-dependency errors. But I was not clear on how these settings actually work, so I took the opportunity to dig into how Hadoop runs. This post analyzes the first step: how hadoop jar submits a task.
I. For the command line hadoop jar ***, the hadoop script invoked is /usr/local/hadoop
#!/bin/bash
# Reference: http://stackoverflow.com/questions/59895/can-a-bash-script-tell-what-directory-its-stored-in
SOURCE="${BASH_SOURCE[0]}"
BIN_DIR="$( dirname "$SOURCE" )"
while [ -h "$SOURCE" ]
do
  SOURCE="$(readlink "$SOURCE")"
  [[ $SOURCE != /* ]] && SOURCE="$DIR/$SOURCE"
  BIN_DIR="$( cd -P "$( dirname "$SOURCE" )" && pwd )"
done
BIN_DIR="$( cd -P "$( dirname "$SOURCE" )" && pwd )"
LIB_DIR=$BIN_DIR/../lib

# Autodetect JAVA_HOME if not defined.
. $LIB_DIR/bigtop-utils/bigtop-detect-javahome

export HADOOP_LIBEXEC_DIR=//$LIB_DIR/hadoop/libexec

exec $LIB_DIR/hadoop/bin/hadoop "$@"

This wrapper detects JAVA_HOME, sets the HADOOP_LIBEXEC_DIR variable, and then execs the actual hadoop script, which lives under /opt/cloudera/parcels/CDH/lib/hadoop/bin/ ($LIB_DIR resolves to /opt/cloudera/parcels/CDH/lib).
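To confirm which script the wrapper ultimately runs on a given machine, a quick check along these lines works (a sketch only; the paths assume the CDH parcel layout described above):

# Resolve the wrapper's symlink chain and confirm the real script exists.
readlink -f /usr/local/hadoop
ls -l /opt/cloudera/parcels/CDH/lib/hadoop/bin/hadoop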
II. Execution flow of the /opt/cloudera/parcels/CDH/lib/hadoop/bin/hadoop script
1. It runs the hadoop-config.sh script to perform the related setup (for a detailed walkthrough of hadoop-config.sh, see this other post: http://blog.csdn.net/a822631129/article/details/50038883)
(1) Sets variables for the hadoop, hdfs, yarn, and mapred directories
(2) Sets the configuration file directory
(3) Sources the hadoop-env.sh file; this is where our project added its dependency jars (the job of hadoop-env.sh is to set the variables used when running a user's MapReduce program; setting CLASSPATH there lets the submitting node find the dependencies when executing the user program, while the default dependencies used inside containers must be handled by other configuration; see the sketch after this list)
(4) Sets JAVA_HOME, JAVA_HEAP_SIZE, and other variables.
(5) Sets the class to be loaded and run, CLASS (that is, the CLASSPATH variable, to which the jars from COMMON, HDFS, YARN, MAPREDUCE, and HADOOP_CLASSPATH are all added) and the HADOOP_OPTS parameters (hadoop.log.dir, hadoop.log.file, hadoop.home.dir, hadoop.root.logger, java.library.path, hadoop.policy.file).
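As an illustration of step (3), a minimal sketch of the kind of hadoop-env.sh addition our project relied on might look like this; the /opt/shared-libs directory is a hypothetical placeholder, not a path from the original setup:

# Hypothetical hadoop-env.sh snippet: append shared dependency jars to
# HADOOP_CLASSPATH so the submitting node can resolve them when
# hadoop-config.sh sources this file. /opt/shared-libs is a placeholder.
for jar in /opt/shared-libs/*.jar; do
  export HADOOP_CLASSPATH="${HADOOP_CLASSPATH:+$HADOOP_CLASSPATH:}$jar"
done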
2. It reads the user command COMMAND and dispatches on it; in this case the command is jar.
It determines that the CLASS to execute is org.apache.hadoop.util.RunJar.
3. export CLASSPATH=$CLASSPATH
exec "$JAVA" $JAVA_HEAP_MAX $HADOOP_OPTS $CLASS "$@"
That is, it exports CLASSPATH, gathers the remaining arguments, and invokes java on the class (RunJar).
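For concreteness, the exec line might expand to something like the following for a jar submission; every path and option value here is an illustrative assumption, not output captured from a real cluster:

# Illustrative expansion of the exec line (paths and values are made up).
/usr/java/jdk1.7.0/bin/java -Xmx1000m \
  -Dhadoop.log.dir=/var/log/hadoop \
  -Dhadoop.root.logger=INFO,console \
  -Djava.library.path=/opt/cloudera/parcels/CDH/lib/hadoop/lib/native \
  org.apache.hadoop.util.RunJar myjob.jar com.example.MyJob arg1 arg2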
The full /opt/cloudera/parcels/CDH/lib/hadoop/bin/hadoop script:
# This script runs the hadoop core commands.

bin=`which $0`
bin=`dirname ${bin}`
bin=`cd "$bin"; pwd`

DEFAULT_LIBEXEC_DIR="$bin"/../libexec
HADOOP_LIBEXEC_DIR=${HADOOP_LIBEXEC_DIR:-$DEFAULT_LIBEXEC_DIR}
. $HADOOP_LIBEXEC_DIR/hadoop-config.sh

function print_usage(){
  echo "Usage: hadoop [--config confdir] COMMAND"
  echo "       where COMMAND is one of:"
  echo "  fs                   run a generic filesystem user client"
  echo "  version              print the version"
  echo "  jar <jar>            run a jar file"
  echo "  checknative [-a|-h]  check native hadoop and compression libraries availability"
  echo "  distcp <srcurl> <desturl> copy file or directories recursively"
  echo "  archive -archiveName NAME -p <parent path> <src>* <dest> create a hadoop archive"
  echo "  classpath            prints the class path needed to get the"
  echo "  credential           interact with credential providers"
  echo "                       Hadoop jar and the required libraries"
  echo "  daemonlog            get/set the log level for each daemon"
  echo "  trace                view and modify Hadoop tracing settings"
  echo " or"
  echo "  CLASSNAME            run the class named CLASSNAME"
  echo ""
  echo "Most commands print help when invoked w/o parameters."
}

if [ $# = 0 ]; then
  print_usage
  exit
fi

COMMAND=$1
case $COMMAND in
  # usage flags
  --help|-help|-h)
    print_usage
    exit
    ;;

  #hdfs commands
  namenode|secondarynamenode|datanode|dfs|dfsadmin|fsck|balancer|fetchdt|oiv|dfsgroups|portmap|nfs3)
    echo "DEPRECATED: Use of this script to execute hdfs command is deprecated." 1>&2
    echo "Instead use the hdfs command for it." 1>&2
    echo "" 1>&2
    #try to locate hdfs and if present, delegate to it.
    shift
    if [ -f "${HADOOP_HDFS_HOME}"/bin/hdfs ]; then
      exec "${HADOOP_HDFS_HOME}"/bin/hdfs ${COMMAND/dfsgroups/groups} "$@"
    elif [ -f "${HADOOP_PREFIX}"/bin/hdfs ]; then
      exec "${HADOOP_PREFIX}"/bin/hdfs ${COMMAND/dfsgroups/groups} "$@"
    else
      echo "HADOOP_HDFS_HOME not found!"
      exit 1
    fi
    ;;

  #mapred commands for backwards compatibility
  pipes|job|queue|mrgroups|mradmin|jobtracker|tasktracker|mrhaadmin|mrzkfc|jobtrackerha)
    echo "DEPRECATED: Use of this script to execute mapred command is deprecated." 1>&2
    echo "Instead use the mapred command for it." 1>&2
    echo "" 1>&2
    #try to locate mapred and if present, delegate to it.
    shift
    if [ -f "${HADOOP_MAPRED_HOME}"/bin/mapred ]; then
      exec "${HADOOP_MAPRED_HOME}"/bin/mapred ${COMMAND/mrgroups/groups} "$@"
    elif [ -f "${HADOOP_PREFIX}"/bin/mapred ]; then
      exec "${HADOOP_PREFIX}"/bin/mapred ${COMMAND/mrgroups/groups} "$@"
    else
      echo "HADOOP_MAPRED_HOME not found!"
      exit 1
    fi
    ;;

  #core commands
  *)
    # the core commands
    if [ "$COMMAND" = "fs" ] ; then
      CLASS=org.apache.hadoop.fs.FsShell
    elif [ "$COMMAND" = "version" ] ; then
      CLASS=org.apache.hadoop.util.VersionInfo
    elif [ "$COMMAND" = "jar" ] ; then
      CLASS=org.apache.hadoop.util.RunJar
    elif [ "$COMMAND" = "key" ] ; then
      CLASS=org.apache.hadoop.crypto.key.KeyShell
    elif [ "$COMMAND" = "checknative" ] ; then
      CLASS=org.apache.hadoop.util.NativeLibraryChecker
    elif [ "$COMMAND" = "distcp" ] ; then
      CLASS=org.apache.hadoop.tools.DistCp
      CLASSPATH=${CLASSPATH}:${TOOL_PATH}
    elif [ "$COMMAND" = "daemonlog" ] ; then
      CLASS=org.apache.hadoop.log.LogLevel
    elif [ "$COMMAND" = "archive" ] ; then
      CLASS=org.apache.hadoop.tools.HadoopArchives
      CLASSPATH=${CLASSPATH}:${TOOL_PATH}
    elif [ "$COMMAND" = "credential" ] ; then
      CLASS=org.apache.hadoop.security.alias.CredentialShell
    elif [ "$COMMAND" = "trace" ] ; then
      CLASS=org.apache.hadoop.tracing.TraceAdmin
    elif [ "$COMMAND" = "classpath" ] ; then
      if [ "$#" -eq 1 ]; then
        # No need to bother starting up a JVM for this simple case.
        echo $CLASSPATH
        exit
      else
        CLASS=org.apache.hadoop.util.Classpath
      fi
    elif [[ "$COMMAND" = -* ]] ; then
      # class and package names cannot begin with a -
      echo "Error: No command named \`$COMMAND' was found. Perhaps you meant \`hadoop ${COMMAND#-}'"
      exit 1
    else
      CLASS=$COMMAND
    fi
    shift

    # Always respect HADOOP_OPTS and HADOOP_CLIENT_OPTS
    HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"

    #make sure security appender is turned off
    HADOOP_OPTS="$HADOOP_OPTS -Dhadoop.security.logger=${HADOOP_SECURITY_LOGGER:-INFO,NullAppender}"

    export CLASSPATH=$CLASSPATH
    exec "$JAVA" $JAVA_HEAP_MAX $HADOOP_OPTS $CLASS "$@"
    ;;
esac
III. Execution flow of org.apache.hadoop.util.RunJar:
1. Gets the program's jar file name from the command-line arguments, then obtains the main class name from the jar; if the jar has no manifest, the second argument is read as the main class name.
2. Prepares the runtime environment: creates a hadoop-unjar* directory under hadoop.tmp.dir to serve as the job's working directory, then calls the unJar method to extract the jar file into it.
3. Builds a ClassLoader and loads the CLASSPATH, adding the extracted contents of the job jar (classes, lib, and so on) to it.
4. Uses Java reflection to invoke the main method of the jar's main class.
The main method (lightly annotated):
/** Run a Hadoop job jar. If the main class is not in the jar's manifest,
* then it must be provided on the command line. */
public static void main(String[] args) throws Throwable {
// Get the jar file name and obtain the main class name from the jar;
// if the jar has no manifest, read the second argument as the main class name.
    String usage = "RunJar jarFile [mainClass] args...";

    if (args.length < 1) {
      System.err.println(usage);
      System.exit(-1);
    }

    int firstArg = 0;
    String fileName = args[firstArg++];
    File file = new File(fileName);
    if (!file.exists() || !file.isFile()) {
      System.err.println("Not a valid JAR: " + file.getCanonicalPath());
      System.exit(-1);
    }
    String mainClassName = null;

    JarFile jarFile;
    try {
      jarFile = new JarFile(fileName);
    } catch (IOException io) {
      throw new IOException("Error opening job jar: " + fileName)
        .initCause(io);
    }

    Manifest manifest = jarFile.getManifest();
    if (manifest != null) {
      mainClassName = manifest.getMainAttributes().getValue("Main-Class");
    }
    jarFile.close();

    if (mainClassName == null) {
      if (args.length < 2) {
        System.err.println(usage);
        System.exit(-1);
      }
      mainClassName = args[firstArg++];
    }
    mainClassName = mainClassName.replaceAll("/", ".");

    // Step 2: create the hadoop-unjar* working directory under hadoop.tmp.dir.
    File tmpDir = new File(new Configuration().get("hadoop.tmp.dir"));
    ensureDirectory(tmpDir);

    final File workDir;
    try {
      workDir = File.createTempFile("hadoop-unjar", "", tmpDir);
    } catch (IOException ioe) {
      // If user has insufficient perms to write to tmpDir, default
      // "Permission denied" message doesn't specify a filename.
      System.err.println("Error creating temp dir in hadoop.tmp.dir "
                         + tmpDir + " due to " + ioe.getMessage());
      System.exit(-1);
      return;
    }

    if (!workDir.delete()) {
      System.err.println("Delete failed for " + workDir);
      System.exit(-1);
    }
    ensureDirectory(workDir);

    ShutdownHookManager.get().addShutdownHook(
      new Runnable() {
        @Override
        public void run() {
          FileUtil.fullyDelete(workDir);
        }
      }, SHUTDOWN_HOOK_PRIORITY);

    unJar(file, workDir);

    // Step 3: build a classpath from the unpacked jar (top level, classes/, lib/).
    ArrayList<URL> classPath = new ArrayList<URL>();
    classPath.add(new File(workDir + "/").toURI().toURL());
    classPath.add(file.toURI().toURL());
    classPath.add(new File(workDir, "classes/").toURI().toURL());
    File[] libs = new File(workDir, "lib").listFiles();
    if (libs != null) {
      for (int i = 0; i < libs.length; i++) {
        classPath.add(libs[i].toURI().toURL());
      }
    }
    ClassLoader loader =
      new URLClassLoader(classPath.toArray(new URL[0]));

    // Step 4: load the main class through the new loader and invoke main() reflectively.
    Thread.currentThread().setContextClassLoader(loader);
    Class<?> mainClass = Class.forName(mainClassName, true, loader);
    Method main = mainClass.getMethod("main", new Class[] {
      Array.newInstance(String.class, 0).getClass()
    });
    String[] newArgs = Arrays.asList(args)
      .subList(firstArg, args.length).toArray(new String[0]);
    try {
      main.invoke(null, new Object[] { newArgs });
    } catch (InvocationTargetException e) {
      throw e.getTargetException();
    }
  }
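In usage terms (the jar and class names below are hypothetical), the two ways RunJar resolves the main class correspond to:

# Jar with Main-Class in its manifest: no class argument needed.
hadoop jar myjob.jar /input /output
# Jar without a manifest Main-Class: pass the class as the second argument.
hadoop jar myjob.jar com.example.MyJob /input /output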
IV. From the analysis above: when hadoop-config.sh runs, it sources hadoop-env.sh, so the CLASSPATH set in hadoop-env.sh is loaded into the environment in which the jar executes. Whatever mapreduce.application.classpath and yarn.application.classpath are set to, however, is not loaded at this point; those settings should come into play for Hadoop's container tasks, which I will analyze in detail later. The $HADOOP_HOME/bin/hadoop script implements many hadoop commands, but many others are executed through $HADOOP_HOME/bin/mapred or $HADOOP_HOME/bin/hdfs; those scripts are worth a look, but they are not analyzed here. In short, this post analyzed how Hadoop executes a jar file: it parses the jar to obtain the main class, then uses Java reflection to call that class's main method, which in turn runs the program. How Hadoop then submits the job will be covered in the next post.
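For reference, those two properties are ordinary *-site.xml entries; a sketch of the shape they take (the values shown are illustrative upstream-style defaults, not our cluster's actual settings):

<!-- mapred-site.xml (illustrative): classpath handed to MapReduce containers -->
<property>
  <name>mapreduce.application.classpath</name>
  <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*,$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
</property>

<!-- yarn-site.xml (illustrative): classpath handed to YARN containers -->
<property>
  <name>yarn.application.classpath</name>
  <value>$HADOOP_CONF_DIR,$HADOOP_COMMON_HOME/share/hadoop/common/*,$HADOOP_COMMON_HOME/share/hadoop/common/lib/*,$HADOOP_HDFS_HOME/share/hadoop/hdfs/*,$HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*,$HADOOP_YARN_HOME/share/hadoop/yarn/*,$HADOOP_YARN_HOME/share/hadoop/yarn/lib/*</value>
</property>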