Oozie scheduling exception JA009: Filesystem closed
Source: Internet. Editor: 程序博客网. Time: 2024/06/05 01:32
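My first guess was that this was related to the state (stability) of the Hadoop cluster, but the Hadoop operations staff told me the cluster had not been changed in recent days and had shown no anomalies.
Screenshot of the suspended job:
The Error Code and Error Message show that this action failed with the JA009 Filesystem closed exception.
To track the problem down, start with the Oozie log: go to the Oozie installation directory, find the oozie.log file, and search for JA009: Filesystem closed. Sure enough, the exception appears many times: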
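The log search described above can also be scripted. The following is a minimal sketch, not part of Oozie: the class name `OozieLogScan` and the default path `logs/oozie.log` are assumptions for illustration, and the real log location depends on the installation.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class OozieLogScan {
    // Returns every line of the log file that contains the marker text.
    static List<String> findLines(Path log, String marker) throws IOException {
        try (Stream<String> lines = Files.lines(log, StandardCharsets.UTF_8)) {
            return lines.filter(line -> line.contains(marker))
                        .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumed default location; pass the real path as the first argument.
        Path log = Paths.get(args.length > 0 ? args[0] : "logs/oozie.log");
        List<String> hits = findLines(log, "JA009: Filesystem closed");
        System.out.println(hits.size() + " matching lines");
        // Print only the first few hits to keep the output short.
        hits.stream().limit(5).forEach(System.out::println);
    }
}
```

Counting the hits gives a rough idea of how often the action start fails, which helps tell a one-off glitch from a recurring problem.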
- org.apache.oozie.action.ActionExecutorException: JA009: Filesystem closed
- at org.apache.oozie.action.ActionExecutor.convertException(ActionExecutor.java:361)
- at org.apache.oozie.action.hadoop.JavaActionExecutor.prepareActionDir(JavaActionExecutor.java:390)
- at org.apache.oozie.action.hadoop.JavaActionExecutor.start(JavaActionExecutor.java:636)
- at org.apache.oozie.command.wf.ActionStartCommand.call(ActionStartCommand.java:128)
- at org.apache.oozie.command.wf.ActionStartCommand.execute(ActionStartCommand.java:249)
- at org.apache.oozie.command.wf.ActionStartCommand.execute(ActionStartCommand.java:47)
- at org.apache.oozie.command.Command.call(Command.java:202)
- at org.apache.oozie.service.CallableQueueService$CompositeCallable.call(CallableQueueService.java:211)
- at org.apache.oozie.service.CallableQueueService$CallableWrapper.run(CallableQueueService.java:128)
- at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
- at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
- at java.lang.Thread.run(Thread.java:662)
- Caused by: java.io.IOException: Filesystem closed
- at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:232)
- at org.apache.hadoop.hdfs.DFSClient.delete(DFSClient.java:648)
- at org.apache.hadoop.hdfs.DistributedFileSystem.delete(DistributedFileSystem.java:255)
- at org.apache.oozie.action.hadoop.JavaActionExecutor.prepareActionDir(JavaActionExecutor.java:383)
- ... 10 more
With this stack trace, locating the problem is much easier. Import the Oozie source project into Eclipse, open JavaActionExecutor.java, and go to the prepareActionDir method:
- void prepareActionDir(FileSystem actionFs, Context context) throws ActionExecutorException {
- try {
- Path actionDir = context.getActionDir();
- Path tempActionDir = new Path(actionDir.getParent(), actionDir.getName() + ".tmp");
- if (!actionFs.exists(actionDir)) {
- try {
- actionFs.copyFromLocalFile(new Path(getOozieRuntimeDir(), getLauncherJarName()), new Path(
- tempActionDir, getLauncherJarName()));
- actionFs.rename(tempActionDir, actionDir);
- }
- catch (IOException ex) {
- actionFs.delete(tempActionDir, true);
- actionFs.delete(actionDir, true);
- throw ex;
- }
- }
- }
- catch (Exception ex) {
- throw convertException(ex);
- }
- }
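The prepareActionDir method shows that any use of actionFs can raise the Filesystem closed exception: if the actionFs it receives has already been closed, the call will naturally throw.
Next, see how the FileSystem actionFs parameter is passed in by finding the method that calls prepareActionDir: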
- @Override
- public void start(Context context, WorkflowAction action) throws ActionExecutorException {
- try {
- XLog.getLog(getClass()).debug("Starting action " + action.getId() + " getting Action File System");
- FileSystem actionFs = getActionFileSystem(context, action);
- XLog.getLog(getClass()).debug("Preparing action Dir through copying " + context.getActionDir());
- prepareActionDir(actionFs, context);
- XLog.getLog(getClass()).debug("Action Dir is ready. Submitting the action ");
- submitLauncher(context, action);
- XLog.getLog(getClass()).debug("Action submit completed. Performing check ");
- check(context, action);
- XLog.getLog(getClass()).debug("Action check is done after submission");
- }
- catch (Exception ex) {
- throw convertException(ex);
- }
- }
The line FileSystem actionFs = getActionFileSystem(context, action); shows that actionFs comes from getActionFileSystem, which in turn delegates to the context's getAppFileSystem method. Both are shown here:
- protected FileSystem getActionFileSystem(Context context, Element actionXml) throws ActionExecutorException {
- try {
- return context.getAppFileSystem();
- }
- catch (Exception ex) {
- throw convertException(ex);
- }
- }
- public FileSystem getAppFileSystem() throws HadoopAccessorException, IOException, URISyntaxException {
- WorkflowJob workflow = getWorkflow();
- XConfiguration jobConf = new XConfiguration(new StringReader(workflow.getConf()));
- Configuration fsConf = new Configuration();
- XConfiguration.copy(jobConf, fsConf);
- return Services.get().get(HadoopAccessorService.class).createFileSystem(workflow.getUser(),
- workflow.getGroup(), new URI(getWorkflow().getAppPath()), fsConf);
- }
So actionFs is ultimately obtained through HadoopAccessorService. Here is its createFileSystem method:
- public FileSystem createFileSystem(String user, String group, URI uri, Configuration conf)
- throws HadoopAccessorException {
- validateNameNode(uri.getAuthority());
- conf = createConfiguration(user, group, conf);
- try {
- return FileSystem.get(uri, conf);
- }
- catch (IOException e) {
- throw new HadoopAccessorException(ErrorCode.E0902, e);
- }
- }
Now the picture is clear: Oozie obtains the FileSystem by calling Hadoop's FileSystem.get(uri, conf).
Next, the source of FileSystem.get(uri, conf):
- public static FileSystem get(URI uri, Configuration conf) throws IOException {
- String scheme = uri.getScheme();
- String authority = uri.getAuthority();
- if (scheme == null) { // no scheme: use default FS
- return get(conf);
- }
- if (authority == null) { // no authority
- URI defaultUri = getDefaultUri(conf);
- if (scheme.equals(defaultUri.getScheme()) // if scheme matches default
- && defaultUri.getAuthority() != null) { // & default has authority
- return get(defaultUri, conf); // return default
- }
- }
- String disableCacheName = String.format("fs.%s.impl.disable.cache", scheme);
- if (conf.getBoolean(disableCacheName, false)) {
- return createFileSystem(uri, conf);
- }
- return CACHE.get(uri, conf); // served from the cache
- }
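FileSystem.get(uri, conf) uses the value of conf.getBoolean(disableCacheName, false) to decide whether to create a new FileSystem or fetch one from the cache. By default that value is false (unless disableCacheName is explicitly set to true), so the instance comes from the cache. This is exactly where the problem lies: our Oozie job runs hourly and consists of several action nodes, and each action node obtains its FileSystem from the cache when it runs. That FileSystem may already have been closed, because of a network problem or for some other reason, while still being served from the cache, so the action that receives it gets an IOException when it uses it.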
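The failure mode can be reproduced without Hadoop. The sketch below is a simplified simulation, not Hadoop code: `FakeFs` and the map-based `CACHE` are hypothetical stand-ins, and it assumes that two callers end up holding the same cached instance and that one of them closes it.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

class SharedHandleDemo {
    // Hypothetical stand-in for a file system client; not a Hadoop class.
    static class FakeFs {
        private boolean open = true;

        void close() {
            open = false;
        }

        boolean exists(String path) throws IOException {
            // Mirrors the idea of a "check open" guard before every operation.
            if (!open) {
                throw new IOException("Filesystem closed");
            }
            return false;
        }
    }

    // Hypothetical cache: one shared instance per URI string.
    static final Map<String, FakeFs> CACHE = new HashMap<>();

    static FakeFs get(String uri) {
        return CACHE.computeIfAbsent(uri, key -> new FakeFs());
    }

    // Two callers obtain the same instance; the first closes it, the second then uses it.
    static String secondCallerOutcome() {
        CACHE.clear();
        FakeFs first = get("hdfs://namenode:8020");
        FakeFs second = get("hdfs://namenode:8020");
        first.close();
        try {
            second.exists("/tmp/action-dir");
            return "ok";
        } catch (IOException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(secondCallerOutcome()); // prints "Filesystem closed"
    }
}
```

The second caller did nothing wrong. It simply shares an object whose lifetime is controlled by someone else, which is why the error can appear intermittently.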
With the cause identified, the fix is simple: make conf.getBoolean(disableCacheName, false) return true. A new FileSystem is then created on every call, so a stale instance is never taken from the cache.
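To make the switch concrete, here is a minimal sketch of the same decision using plain Java collections. The class `DisableCacheSwitch`, the `Map` standing in for Hadoop's Configuration, and the `Object` standing in for a FileSystem are all assumptions for illustration. Only the key format `fs.%s.impl.disable.cache` is taken from the FileSystem.get source quoted earlier.

```java
import java.util.HashMap;
import java.util.Map;

class DisableCacheSwitch {
    // Hypothetical cache keyed by URI string.
    static final Map<String, Object> CACHE = new HashMap<>();

    // Builds the per-scheme property name, as in the quoted FileSystem.get source.
    static String disableCacheKey(String scheme) {
        return String.format("fs.%s.impl.disable.cache", scheme);
    }

    // conf is a plain map standing in for a Hadoop Configuration.
    static Object get(String scheme, String uri, Map<String, String> conf) {
        boolean disabled =
                Boolean.parseBoolean(conf.getOrDefault(disableCacheKey(scheme), "false"));
        if (disabled) {
            return new Object(); // fresh instance on every call
        }
        return CACHE.computeIfAbsent(uri, key -> new Object()); // shared instance
    }

    public static void main(String[] args) {
        String uri = "hdfs://namenode:8020";
        Map<String, String> defaults = new HashMap<>();
        Map<String, String> noCache = new HashMap<>();
        noCache.put(disableCacheKey("hdfs"), "true");

        System.out.println(disableCacheKey("hdfs"));
        System.out.println(get("hdfs", uri, defaults) == get("hdfs", uri, defaults)); // true
        System.out.println(get("hdfs", uri, noCache) == get("hdfs", uri, noCache));   // false
    }
}
```

One trade-off to keep in mind: when every call returns a fresh instance, nothing else is expected to close it, so the caller that obtained it presumably becomes responsible for closing it.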
Configure the Oozie workflow as follows:
- <configuration>
- <property>
- <name>oozie.launcher.fs.hdfs.impl.disable.cache</name>
- <value>true</value>
- </property>
- </configuration>
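The source code also shows that Oozie retries transient errors that occur while scheduling an action node. The default is 3 retries; I raised it to 10 when submitting the job: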
- oozie.wf.action.max.retries=10
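The general idea of retrying transient errors up to a fixed limit can be sketched as follows. This is an illustration of the pattern only, not Oozie's implementation: the class `TransientRetry`, the method `callWithRetries`, and the choice to treat IOException as the transient error are assumptions.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

class TransientRetry {
    // Runs the task; on IOException retries up to maxRetries times, waiting between attempts.
    static <T> T callWithRetries(Callable<T> task, int maxRetries, long waitMillis)
            throws Exception {
        int retries = 0;
        while (true) {
            try {
                return task.call();
            } catch (IOException e) {
                if (retries >= maxRetries) {
                    throw e; // retry budget exhausted: surface the last error
                }
                retries++;
                Thread.sleep(waitMillis);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Fails twice with a transient error, then succeeds on the third call.
        String result = callWithRetries(() -> {
            calls[0]++;
            if (calls[0] < 3) {
                throw new IOException("Filesystem closed");
            }
            return "done";
        }, 10, 10L);
        System.out.println(result + " after " + calls[0] + " calls"); // prints "done after 3 calls"
    }
}
```

A higher limit only helps when the error really is transient. If every attempt receives the same closed instance, the retries would just repeat the failure.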
After these changes, Oozie scheduling became more robust ^_^
This article comes from the “yyj0531” blog; please keep this attribution: http://yaoyinjie.blog.51cto.com/3189782/762342