Jobs fail when running across nodes over the IB network
Source: Internet  Published: 人人商城 源码  Editor: 程序博客网  Date: 2024/05/20 22:35
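1. Symptom: the customer's HPC job runs normally over the Gigabit Ethernet network, but fails with the following errors when it runs over the IB (InfiniBand) network: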
psolid.x 00000000005F55AB mpp_init_ 68 mpp_init.F
psolid.x 0000000000519C2D xmp_init_ 91 xmp_init.F
psolid.x 00000000005164BF pamcsm_ 88 pamcsm.F
psolid.x 0000000000515D90 MAIN__ 26 pcrash.F
psolid.x 0000000000515D1C Unknown Unknown Unknown
libc.so.6 00007FCB4A9B4C36 Unknown Unknown Unknown
psolid.x 0000000000515C29 Unknown Unknown Unknown
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
libpthread.so.0 00007F4E28BFC7E0 Unknown Unknown Unknown
libibverbs.so.1 00007F4E23E4BC79 Unknown Unknown Unknown
libibverbs.so.1 00007F4E23E4CC08 Unknown Unknown Unknown
libmpi.so.1 00007F4E2669A8CF Unknown Unknown Unknown
libmpi.so.1 00007F4E2654CC45 Unknown Unknown Unknown
libmpi.so.1 00007F4E26514B3B Unknown Unknown Unknown
libmpi.so.1 00007F4E26516935 Unknown Unknown Unknown
libmpi.so.1 00007F4E26517BCF Unknown Unknown Unknown
libmpi.so.1 00007F4E265AD3BC Unknown Unknown Unknown
libmpm_platform-9 00007F4E26867F63 MPM_Mod_F_Init 23 MPM_Mod_F_Init.c
libmpm.so 00007F4E30B5209A mpi_init_ 44 MPM_Lib_F_Init.c
psolid.x 00000000005F55AB mpp_init_ 68 mpp_init.F
psolid.x 0000000000519C2D xmp_init_ 91 xmp_init.F
psolid.x 00000000005164BF pamcsm_ 88 pamcsm.F
psolid.x 0000000000515D90 MAIN__ 26 pcrash.F
psolid.x 0000000000515D1C Unknown Unknown Unknown
libc.so.6 00007F4E28890C36 Unknown Unknown Unknown
psolid.x 0000000000515C29 Unknown Unknown Unknown
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
libpthread.so.0 00007FFBB1E877E0 Unknown Unknown Unknown
libibverbs.so.1 00007FFBAD0D6C79 Unknown Unknown Unknown
libibverbs.so.1 00007FFBAD0D7C08 Unknown Unknown Unknown
libmpi.so.1 00007FFBAF9258CF Unknown Unknown Unknown
libmpi.so.1 00007FFBAF7D7C45 Unknown Unknown Unknown
libmpi.so.1 00007FFBAF79FB3B Unknown Unknown Unknown
libmpi.so.1 00007FFBAF7A1935 Unknown Unknown Unknown
libmpi.so.1 00007FFBAF7A2BCF Unknown Unknown Unknown
libmpi.so.1 00007FFBAF8383BC Unknown Unknown Unknown
libmpm_platform-9 00007FFBAFAF2F63 MPM_Mod_F_Init 23 MPM_Mod_F_Init.c
libmpm.so 00007FFBB9DDD09A mpi_init_ 44 MPM_Lib_F_Init.c
psolid.x 00000000005F55AB mpp_init_ 68 mpp_init.F
psolid.x 0000000000519C2D xmp_init_ 91 xmp_init.F
psolid.x 00000000005164BF pamcsm_ 88 pamcsm.F
psolid.x 0000000000515D90 MAIN__ 26 pcrash.F
psolid.x 0000000000515D1C Unknown Unknown Unknown
libc.so.6 00007FFBB1B1BC36 Unknown Unknown Unknown
psolid.x 0000000000515C29 Unknown Unknown Unknown
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
libpthread.so.0 00007F9653FB07E0 Unknown Unknown Unknown
libibverbs.so.1 00007F964F1FFC79 Unknown Unknown Unknown
libibverbs.so.1 00007F964F200C08 Unknown Unknown Unknown
libmpi.so.1 00007F9651A4E8CF Unknown Unknown Unknown
no matching hostkey found
ssh_keysign: no reply
key_sign failed
psolid.x: Rank 0:2: MPI_Init: Could not pin pre-pinned rdma region 0
psolid.x: Rank 0:2: MPI_Init: hpmp_rdmaregion_alloc() failed
psolid.x: Rank 0:2: MPI_Init: make_world_rdmaenvelope() failed
psolid.x: Rank 0:2: MPI_Init: Internal Error: Processes cannot connect to rdma device
psolid.x: Rank 0:0: MPI_Init: Could not pin pre-pinned rdma region 0
psolid.x: Rank 0:0: MPI_Init: hpmp_rdmaregion_alloc() failed
psolid.x: Rank 0:0: MPI_Init: make_world_rdmaenvelope() failed
psolid.x: Rank 0:0: MPI_Init: Internal Error: Processes cannot connect to rdma device
psolid.x: Rank 0:3: MPI_Init: Could not pin pre-pinned rdma region 0
psolid.x: Rank 0:3: MPI_Init: hpmp_rdmaregion_alloc() failed
psolid.x: Rank 0:3: MPI_Init: make_world_rdmaenvelope() failed
psolid.x: Rank 0:3: MPI_Init: Internal Error: Processes cannot connect to rdma device
psolid.x: Rank 0:1: MPI_Init: Could not pin pre-pinned rdma region 0
psolid.x: Rank 0:1: MPI_Init: hpmp_rdmaregion_alloc() failed
psolid.x: Rank 0:1: MPI_Init: make_world_rdmaenvelope() failed
psolid.x: Rank 0:1: MPI_Init: Internal Error: Processes cannot connect to rdma device
MPI Application rank 0 exited before MPI_Finalize() with status 1
MPI Application rank 2 exited before MPI_Finalize() with status 1
pamcrash : Error :
==============================================================================
This process has exited with a nonzero exit code, indicating an error
termination.
You may have some unmerged files left behind like VW331-4CS_K_SAD_China-NCAP-MDB_51_40_v045_xxx.{LIS,msg}
in /CAE/home/tpbrls/pam2014.3_test_new directory, containing some relevant informations regarding this error
condition.
Please refer to your documentation, or contact you technical support for this
merging purpose.
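2. Solution: at first I assumed some library files had not been installed, but the cause turned out to be a resource limit. RDMA has to pin (lock) its buffers in memory, and the "Could not pin pre-pinned rdma region" messages show that this step failed, which points to the locked-memory limit being too low. Appending the following two entries to /etc/security/limits.conf and rebooting resolved the problem: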
admin:~ # cat /etc/security/limits.conf
* soft memlock unlimited
* hard memlock unlimited
Note: memlock is the max locked-in-memory address space (KB).
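
To confirm that the new limit is actually in effect, a small check along the following lines could be run on each compute node, ideally from the same kind of session that launches the MPI job. This is only a sketch and not part of the original fix: the helper names `format_limit` and `memlock_entries` are hypothetical, and it assumes a Linux node with Python 3 available. It reads the locked-memory limit of the current process (reported by the kernel in bytes, shown here in KB to match `ulimit -l`) and can also list the memlock entries found in limits.conf-style text.

```python
import resource


def format_limit(value):
    """Render an rlimit value given in bytes as KB, or 'unlimited'."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{value // 1024} KB"


def memlock_entries(limits_text):
    """Return (domain, type, value) for each memlock line in limits.conf text."""
    entries = []
    for line in limits_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and surrounding spaces
        fields = line.split()
        # A limits.conf entry has four fields: <domain> <type> <item> <value>
        if len(fields) == 4 and fields[2] == "memlock":
            entries.append((fields[0], fields[1], fields[3]))
    return entries


if __name__ == "__main__":
    # Limits of the current process, as inherited from the login session
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    print("soft memlock:", format_limit(soft))
    print("hard memlock:", format_limit(hard))
```

If both values print as "unlimited" on every node, the two entries above have taken effect; a small finite value (for example 64 KB) would suggest the session was started before the change or that the limits are not being applied to it.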