NWChem 6.1.1 CCSD(T) parallel running


http://www.nwchem-sw.org/index.php/Special:AWCforum/st/id887/NWChem_6.1.1_CCSD%28T%29_parallel_ru....html

Hi, I am trying to run NWChem 6.1.1 on a cluster. I compiled NWChem in my local user directory. Here are the environment variables I used to compile:

export NWCHEM_TOP="/home/diego/Software/NWchem/nwchem-6.1.1"
export TARGET=LINUX64
export LARGE_FILES=TRUE
export ENABLE_COMPONENT=yes
export TCGRSH=/usr/bin/ssh
export NWCHEM_TARGET=LINUX64
export NWCHEM_MODULES="all python"
export LIB_DEFINES="-DDFLT_TOT_MEM=16777216"
export USE_MPI=y
export USE_MPIF=y
export USE_MPIF4=y
export IB_HOME=/usr
export IB_INCLUDE=$IB_HOME/include/infiniband
export IB_LIB=$IB_HOME/lib64
export IB_LIB_NAME="-libumad -libverbs -lpthread -lrt"
export ARMCI_NETWORK=OPENIB
export MKLROOT="/opt/intel/mkl"
export MKL_INCLUDE=$MKLROOT/include/intel64/ilp64
export BLAS_LIB="-L$MKLROOT/lib/intel64 -lmkl_intel_ilp64 -lmkl_sequential -lmkl_core -lpthread -lm"
export BLASOPT="$BLAS_LIB"
export BLAS_SIZE=8
export SCALAPACK_SIZE=8
export SCALAPACK="-L$MKLROOT/lib/intel64 -lmkl_scalapack_ilp64 -lmkl_intel_ilp64 -lmkl_sequential -lmkl_core -lmkl_blacs_intelmpi_ilp64 -lpthread -lm"
export SCALAPACK_LIB="$SCALAPACK"
export USE_SCALAPACK=y
export MPI_HOME=/opt/intel/impi/4.0.3.008
export MPI_LOC=$MPI_HOME
export MPI_LIB=$MPI_LOC/lib64
export MPI_INCLUDE=$MPI_LOC/include64
export LIBMPI="-lmpigf -lmpigi -lmpi_ilp64 -lmpi"
export CXX=/opt/intel/bin/icpc
export CC=/opt/intel/bin/icc
export FC=/opt/intel/bin/ifort
export PYTHONPATH="/usr"
export PYTHONHOME="/usr"
export PYTHONVERSION="2.6"
export USE_PYTHON64=y
export PYTHONLIBTYPE=so
export MPICXX=$MPI_LOC/bin/mpiicpc
export MPICC=$MPI_LOC/bin/mpiicc
export MPIF77=$MPI_LOC/bin/mpiifort
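For reference, the build itself follows the standard NWChem sequence; a minimal sketch, assuming the variables above are already exported in the same shell:

cd $NWCHEM_TOP/src
make nwchem_config            # picks up NWCHEM_MODULES="all python"
make > make.log 2>&1          # compilers come from FC/CC/CXX above; the binary ends up in $NWCHEM_TOP/bin/LINUX64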


Input file:

start
memory global 1000 mb heap 100 mb stack 600 mb
title "ZrB10 CCSD(T) single point"
echo
scratch_dir /scratch/users
charge -1
geometry units angstrom
Zr          0.00001        -0.00002         0.12043
B           2.46109         0.44546        -0.10200
B           2.25583        -1.07189        -0.09994
B           1.19305        -2.20969        -0.10354
B          -0.32926        -2.46629        -0.09796
B          -1.72755        -1.82109        -0.10493
B          -2.46111        -0.44543        -0.10198
B          -2.25583         1.07193        -0.09983
B          -1.19306         2.20972        -0.10337
B           0.32924         2.46632        -0.09779
B           1.72753         1.82112        -0.10485
end
scf
  DOUBLET; UHF
  THRESH 1.0e-10
  TOL2E 1.0e-8
  maxiter 200
end
tce
  ccsd(t)
  maxiter 200
  freeze atomic
end
basis
  Zr library def2-tzvp
  B library def2-tzvp
end
ecp
  Zr library def2-ecp
end
task tce energy


PBS submit file:

#!/bin/bash
#PBS -N ZrB10_UHF
#PBS -l nodes=10:ppn=16
#PBS -q CA
BIN=/home/diego/Software/NWchem/nwchem-6.1.1/bin/LINUX64
source /opt/intel/impi/4.0.3.008/bin/mpivars.sh
source /home/diego/Software/NWchem/vars
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/intel/composer_xe_2011_sp1.6.233/mkl/lib/intel64:/opt/intel/impi/4.0.3/intel64/lib
#ulimit -s unlimited
#ulimit -d unlimited
#ulimit -l unlimited
#ulimit -n 32767
export ARMCI_DEFAULT_SHMMAX=8000
#export MA_USE_ARMCI_MEM=TRUE
cd $PBS_O_WORKDIR
NP=`(wc -l < $PBS_NODEFILE) | awk '{print $1}'`
cat $PBS_NODEFILE | sort | uniq > mpd.hosts
time mpirun -f mpd.hosts -np $NP $BIN/nwchem ZrB10.nw > ZrB10.log
exit 0
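One diagnostic worth running before the nwchem launch (not part of the original script, just a hedged suggestion) is to print the locked-memory limit each rank actually sees, since registered-memory failures on Infiniband are often tied to a low "ulimit -l" on the compute nodes:

mpirun -f mpd.hosts -np $NP bash -c 'echo "$(hostname): max locked memory = $(ulimit -l)"'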


Memory per processor is 2 GB of RAM, i.e. 16 processors sharing 32 GB of RAM on each of the 10 nodes.
Other system settings:

kernel.shmmax = 68719476736
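That is the standard way to raise the maximum size of a single SysV shared-memory segment; for reference, it is checked and applied with sysctl (root required for the write):

sysctl kernel.shmmax                                    # show the current value, in bytes
echo "kernel.shmmax = 68719476736" >> /etc/sysctl.conf  # persist the setting
sysctl -p                                               # reload /etc/sysctl.conf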


The error reported in the output file is:

Last System Error Message from Task 32:: Cannot allocate memory


(rank:32 hostname:node32 pid:27391):ARMCI DASSERT fail. ../../ga-5-1/armci/src/devices/openib/openib.c:armci_pin_contig_hndl():1142 cond:(memhdl->memhndl!=((void *)0))


Varying stack, heap, global, or ARMCI_DEFAULT_SHMMAX does not really change anything (if I set them too low, a different error occurs). Setting MA_USE_ARMCI_MEM=y/n does not have any effect either.

ldd /home/diego/Software/NWchem/nwchem-6.1.1/bin/LINUX64/nwchem:
        linux-vdso.so.1 =>  (0x00007ffff7ffe000)
        libpython2.6.so.1.0 => /usr/lib64/libpython2.6.so.1.0 (0x0000003f3aa00000)
        libmkl_scalapack_ilp64.so => not found
        libmkl_intel_ilp64.so => not found
        libmkl_sequential.so => not found
        libmkl_core.so => not found
        libmkl_blacs_intelmpi_ilp64.so => not found
        libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003f39200000)
        libm.so.6 => /lib64/libm.so.6 (0x0000003f38600000)
        libmpigf.so.4 => not found
        libmpi_ilp64.so.4 => not found
        libmpi.so.4 => not found
        libibumad.so.3 => /usr/lib64/libibumad.so.3 (0x000000308aa00000)
        libibverbs.so.1 => /usr/lib64/libibverbs.so.1 (0x000000308a600000)
        librt.so.1 => /lib64/librt.so.1 (0x0000003f39a00000)
        libutil.so.1 => /lib64/libutil.so.1 (0x0000003f3c200000)
        libdl.so.2 => /lib64/libdl.so.2 (0x0000003f38e00000)
        libc.so.6 => /lib64/libc.so.6 (0x0000003f38a00000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007ffff7dce000)
        /lib64/ld-linux-x86-64.so.2 (0x0000003f38200000)
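The "not found" entries are the MKL and Intel MPI shared libraries; they were simply not on LD_LIBRARY_PATH in the shell where ldd was run, and at job time they are expected to be resolved through the environment the submit script sets up, i.e.:

source /opt/intel/impi/4.0.3.008/bin/mpivars.sh
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/intel/composer_xe_2011_sp1.6.233/mkl/lib/intel64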

So what could be the reason for the failure? Any help would be appreciated.

Diego 
Edoapra (forum admin) replied, 12:34:55 PM PDT, Fri, Jul 19th 2013:

Diego,
I have managed to get this input working on an Infiniband cluster using NWChem 6.3.
Here are some details of what I did for a run using 224 processors (16 processors on each of the 14 nodes).

1) Increased the global memory in the input to 1.6 GB:
memory global 1600 mb heap 100 mb stack 600 mb

2) Set ARMCI_DEFAULT_SHMMAX=8192
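(In practice this is just an environment variable exported in the job script before the mpirun line, as in the submit file above, only with the larger value:)

export ARMCI_DEFAULT_SHMMAX=8192   # maximum ARMCI shared-memory segment size, in MB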

3) You need to have the system administrators modify some of the kernel driver options for your Infiniband hardware.
Here are some webpages related to this topic:
http://www.open-mpi.org/faq/?category=openfabrics#ib-low-reg-mem
http://community.mellanox.com/docs/DOC-1120

In my case, the cluster I am using has the following parameters for the mlx4_core driver (but older hardware might require different settings, as mentioned in the two webpages above):
log_num_mtt=20
log_mtts_per_seg=4
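For completeness, a sketch of how a system administrator would typically apply these mlx4_core options, following the Open MPI FAQ linked above (the config file name is just a common convention, and the driver must be reloaded or the node rebooted for the change to take effect):

# as root on each compute node
echo "options mlx4_core log_num_mtt=20 log_mtts_per_seg=4" > /etc/modprobe.d/mlx4_core.conf
/etc/init.d/openibd restart   # on Mellanox OFED installs; otherwise reboot the node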
