Cross-node jobs fail when run over the IB network


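1.  Symptom: the customer's HPC job runs normally over the Gigabit Ethernet network, but fails with the following errors when run over the IB (InfiniBand) network.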

 psolid.x           00000000005F55AB  mpp_init_                  68  mpp_init.F
psolid.x           0000000000519C2D  xmp_init_                  91  xmp_init.F
psolid.x           00000000005164BF  pamcsm_                    88  pamcsm.F
psolid.x           0000000000515D90  MAIN__                     26  pcrash.F
psolid.x           0000000000515D1C  Unknown               Unknown  Unknown
libc.so.6          00007FCB4A9B4C36  Unknown               Unknown  Unknown
psolid.x           0000000000515C29  Unknown               Unknown  Unknown
forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source
libpthread.so.0    00007F4E28BFC7E0  Unknown               Unknown  Unknown
libibverbs.so.1    00007F4E23E4BC79  Unknown               Unknown  Unknown
libibverbs.so.1    00007F4E23E4CC08  Unknown               Unknown  Unknown
libmpi.so.1        00007F4E2669A8CF  Unknown               Unknown  Unknown
libmpi.so.1        00007F4E2654CC45  Unknown               Unknown  Unknown
libmpi.so.1        00007F4E26514B3B  Unknown               Unknown  Unknown
libmpi.so.1        00007F4E26516935  Unknown               Unknown  Unknown
libmpi.so.1        00007F4E26517BCF  Unknown               Unknown  Unknown
libmpi.so.1        00007F4E265AD3BC  Unknown               Unknown  Unknown
libmpm_platform-9  00007F4E26867F63  MPM_Mod_F_Init             23  MPM_Mod_F_Init.c
libmpm.so          00007F4E30B5209A  mpi_init_                  44  MPM_Lib_F_Init.c
psolid.x           00000000005F55AB  mpp_init_                  68  mpp_init.F
psolid.x           0000000000519C2D  xmp_init_                  91  xmp_init.F
psolid.x           00000000005164BF  pamcsm_                    88  pamcsm.F
psolid.x           0000000000515D90  MAIN__                     26  pcrash.F
psolid.x           0000000000515D1C  Unknown               Unknown  Unknown
libc.so.6          00007F4E28890C36  Unknown               Unknown  Unknown
psolid.x           0000000000515C29  Unknown               Unknown  Unknown
forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source
libpthread.so.0    00007FFBB1E877E0  Unknown               Unknown  Unknown
libibverbs.so.1    00007FFBAD0D6C79  Unknown               Unknown  Unknown
libibverbs.so.1    00007FFBAD0D7C08  Unknown               Unknown  Unknown
libmpi.so.1        00007FFBAF9258CF  Unknown               Unknown  Unknown
libmpi.so.1        00007FFBAF7D7C45  Unknown               Unknown  Unknown
libmpi.so.1        00007FFBAF79FB3B  Unknown               Unknown  Unknown
libmpi.so.1        00007FFBAF7A1935  Unknown               Unknown  Unknown
libmpi.so.1        00007FFBAF7A2BCF  Unknown               Unknown  Unknown
libmpi.so.1        00007FFBAF8383BC  Unknown               Unknown  Unknown
libmpm_platform-9  00007FFBAFAF2F63  MPM_Mod_F_Init             23  MPM_Mod_F_Init.c
libmpm.so          00007FFBB9DDD09A  mpi_init_                  44  MPM_Lib_F_Init.c
psolid.x           00000000005F55AB  mpp_init_                  68  mpp_init.F
psolid.x           0000000000519C2D  xmp_init_                  91  xmp_init.F
psolid.x           00000000005164BF  pamcsm_                    88  pamcsm.F
psolid.x           0000000000515D90  MAIN__                     26  pcrash.F
psolid.x           0000000000515D1C  Unknown               Unknown  Unknown
libc.so.6          00007FFBB1B1BC36  Unknown               Unknown  Unknown
psolid.x           0000000000515C29  Unknown               Unknown  Unknown
forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source
libpthread.so.0    00007F9653FB07E0  Unknown               Unknown  Unknown
libibverbs.so.1    00007F964F1FFC79  Unknown               Unknown  Unknown
libibverbs.so.1    00007F964F200C08  Unknown               Unknown  Unknown
libmpi.so.1        00007F9651A4E8CF  Unknown               Unknown  Unknown 

 no matching hostkey found
ssh_keysign: no reply
key_sign failed
psolid.x: Rank 0:2: MPI_Init: Could not pin pre-pinned rdma region 0
psolid.x: Rank 0:2: MPI_Init: hpmp_rdmaregion_alloc() failed
psolid.x: Rank 0:2: MPI_Init: make_world_rdmaenvelope() failed
psolid.x: Rank 0:2: MPI_Init: Internal Error: Processes cannot connect to rdma device
psolid.x: Rank 0:0: MPI_Init: Could not pin pre-pinned rdma region 0
psolid.x: Rank 0:0: MPI_Init: hpmp_rdmaregion_alloc() failed
psolid.x: Rank 0:0: MPI_Init: make_world_rdmaenvelope() failed
psolid.x: Rank 0:0: MPI_Init: Internal Error: Processes cannot connect to rdma device
psolid.x: Rank 0:3: MPI_Init: Could not pin pre-pinned rdma region 0
psolid.x: Rank 0:3: MPI_Init: hpmp_rdmaregion_alloc() failed
psolid.x: Rank 0:3: MPI_Init: make_world_rdmaenvelope() failed
psolid.x: Rank 0:3: MPI_Init: Internal Error: Processes cannot connect to rdma device
psolid.x: Rank 0:1: MPI_Init: Could not pin pre-pinned rdma region 0
psolid.x: Rank 0:1: MPI_Init: hpmp_rdmaregion_alloc() failed
psolid.x: Rank 0:1: MPI_Init: make_world_rdmaenvelope() failed
psolid.x: Rank 0:1: MPI_Init: Internal Error: Processes cannot connect to rdma device
MPI Application rank 0 exited before MPI_Finalize() with status 1
MPI Application rank 2 exited before MPI_Finalize() with status 1

pamcrash : Error :
==============================================================================


This process has exited with a nonzero exit code, indicating an error
termination.
You may have some unmerged files left behind like VW331-4CS_K_SAD_China-NCAP-MDB_51_40_v045_xxx.{LIS,msg}
in /CAE/home/tpbrls/pam2014.3_test_new directory, containing some relevant informations regarding this error
condition.
Please refer to your documentation, or contact you technical support for this
merging purpose.

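2.  Solution: at first this looked like a missing library, but the cause turned out to be a resource limit. Appending the following two entries to the end of /etc/security/limits.conf and rebooting resolved the problem.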

admin:~ # cat /etc/security/limits.conf

*                soft    memlock         unlimited
*                hard    memlock         unlimited
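
To confirm that both entries are really present and not commented out, a small check along these lines may help. This is a minimal sketch assuming bash and awk; `count_memlock_unlimited` is a hypothetical helper name introduced here for illustration, and it only inspects the one file passed to it (files under /etc/security/limits.d/, if any, are not considered).

```shell
#!/bin/bash
# Hypothetical helper: count active (non-comment) lines in a limits file
# whose item column is "memlock" and whose value column is "unlimited".
count_memlock_unlimited() {
    awk '!/^[[:space:]]*#/ && $3 == "memlock" && $4 == "unlimited" { n++ }
         END { print n + 0 }' "$1"
}

conf=/etc/security/limits.conf
if [ -r "$conf" ]; then
    # Two matches are expected once the soft and hard entries are added.
    echo "memlock unlimited entries: $(count_memlock_unlimited "$conf")"
fi
```

A result lower than 2 would suggest that one of the soft/hard entries is missing or misspelled.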


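Note: memlock means "max locked-in-memory address space (KB)". RDMA requires the memory regions it registers to be pinned (locked) in RAM, which is why a low memlock limit leads to the "Could not pin pre-pinned rdma region" error above.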
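
The effective limit can be checked with the shell's `ulimit -l` builtin, which reports the locked-memory limit in KB or `unlimited`. The following is a minimal sketch assuming bash; `memlock_status` is a hypothetical helper name, not part of any MPI toolkit. It should be run in the same kind of session the MPI ranks start in (for example a login on each compute node), since limits are applied per session.

```shell
#!/bin/bash
# Hypothetical helper: classify a value as printed by `ulimit -l`
# ("unlimited" or a number of KB).
memlock_status() {
    if [ "$1" = "unlimited" ]; then
        echo "ok"
    else
        echo "limited:$1"
    fi
}

# -S selects the soft limit, -H the hard limit.
echo "soft: $(memlock_status "$(ulimit -S -l)")"
echo "hard: $(memlock_status "$(ulimit -H -l)")"
```

If either line reports `limited:<KB>`, the new limits.conf entries have not taken effect for that session.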
