Introduction to SLURM Clusters and Deployment

Source: Internet  Editor: 程序博客网  Time: 2024/06/06 10:55
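1. Concepts
SLURM is a highly scalable cluster manager and job scheduler for large clusters of compute nodes. It maintains a queue of pending work, manages the overall resource utilization of that work, and dispatches each job to a set of allocated nodes for execution.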

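2. Environment Setup
1) Three CentOS 6.6 machines: one control node and two compute nodes
2) Set up a yum mirror located in China
① Edit /etc/yum.repos.d/CentOS-Base.repo to use the Shanghai Jiao Tong University (SJTU) yum mirror
  # vi /etc/yum.repos.d/CentOS-Base.repo
  Note: see Appendix 1 for the CentOS-Base.repo configuration file
② Import the GPG key
# rpm --import http://ftp.sjtu.edu.cn/centos/RPM-GPG-KEY-CentOS-6
③ Update the system
# yum update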
3) Install munge (provides authenticated communication between components)
① Install the build dependencies
# yum install -y rpm-build rpmdevtools bzip2-devel openssl-devel zlib-devel gcc
② Build and install the munge packages
Download: https://github.com/dun/munge
# rpmbuild -tb --clean munge-0.5.11.tar.bz2
# cd /root/rpmbuild/RPMS/x86_64
# rpm --install munge*.rpm
③ Set the directory permissions
# chmod -Rf 700 /etc/munge
# chmod -Rf 711 /var/lib/munge
# chmod -Rf 700 /var/log/munge
# chmod -Rf 0755 /var/run/munge
④ Copy the munge key generated on the control node to the other nodes
# scp /etc/munge/munge.key root@<IP>:/etc/munge
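Step ④ has to be repeated for every compute node, and a mismatched key only shows up later as authentication failures. The following is a minimal sketch of how the copy and a check might be scripted. The node names in NODES, the DRY_RUN switch, and the helper function names are assumptions made for illustration, and the init script name `munge` is assumed to be the one installed by the RPM. By default the script only prints the commands it would run.

```shell
#!/bin/sh
# Sketch: push the control node's MUNGE key to each compute node and test it.
# NODES holds hypothetical host names; replace them with the real addresses.
NODES="${NODES:-node1 node2}"
# DRY_RUN=1 (the default) only prints the commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Copy the key (preserving its mode) and restart munged on every node.
# The key must stay readable by the user that munged runs as.
distribute_munge_key() {
    for node in $NODES; do
        run scp -p /etc/munge/munge.key "root@${node}:/etc/munge/munge.key"
        run ssh "root@${node}" "service munge restart"
    done
}

# Encode a credential locally and decode it on the remote node; decoding
# succeeds only when both nodes share the same key.
verify_munge_key() {
    for node in $NODES; do
        if [ "$DRY_RUN" = "1" ]; then
            echo "+ munge -n | ssh root@${node} unmunge"
        else
            munge -n | ssh "root@${node}" unmunge
        fi
    done
}

distribute_munge_key
verify_munge_key
```

Running it with DRY_RUN=0 on the control node would perform the copy for real; the `munge -n | ssh <node> unmunge` pipeline is the usual way to confirm that two hosts agree on the key.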
4) Create the slurm user
# useradd slurm
# passwd slurm
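Slurm expects users and groups (UIDs and GIDs) to be synchronized across the cluster, so it can help to create the slurm account with an explicit ID rather than letting each node pick its own. A minimal sketch, assuming the ID 1001 is free on every node (SLURM_UID, DRY_RUN, and the function name are illustrative, not part of the original steps):

```shell
#!/bin/sh
# Sketch: create the slurm account with a fixed UID/GID so it matches on every node.
# SLURM_UID is an assumed value; pick any ID that is unused on all nodes.
SLURM_UID="${SLURM_UID:-1001}"
# DRY_RUN=1 (the default) only prints the commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Create the group and user with the same numeric ID, then show the result.
create_slurm_user() {
    run groupadd -g "$SLURM_UID" slurm
    run useradd -u "$SLURM_UID" -g slurm slurm
    run id slurm
}

create_slurm_user
```

Run the same script on the control node and on each compute node, then compare the output of `id slurm` between them.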
5) Install slurm
Download: http://slurm.schedmd.com/
This guide uses slurm-16.05.8.tar.bz2.
① Install the dependencies
# yum -y install readline-devel pam-devel perl-DBI perl-ExtUtils-MakeMaker
② Build and install
# rpmbuild -ta --clean slurm-16.05.8.tar.bz2
# cd /root/rpmbuild/RPMS/x86_64
# rpm --install slurm*.rpm
③ Change the owner and group of the spool directory
# sudo chown slurm:slurm /var/spool
6) On the control node, edit /etc/slurm/slurm.conf, then copy it to the compute nodes
Note: see Appendix 2 for the slurm.conf source file
7) Start the slurm service
# /etc/init.d/slurm start
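Steps 6) and 7) can be combined into one pass from the control node. The sketch below is one possible way to do it, not the author's procedure: the node names in NODES, the DRY_RUN switch, and the function name are assumptions. It copies the same slurm.conf to each compute node, starts the service everywhere, and then asks the controller for its status. By default it only prints the commands.

```shell
#!/bin/sh
# Sketch: copy slurm.conf to the compute nodes, start the daemons, and check them.
# NODES holds hypothetical compute node names; replace them with real ones.
NODES="${NODES:-node1 node2}"
# DRY_RUN=1 (the default) only prints the commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

deploy_and_start() {
    # Every node must see the same configuration file.
    for node in $NODES; do
        run scp /etc/slurm/slurm.conf "root@${node}:/etc/slurm/slurm.conf"
    done
    # Start the service on the control node first, then on the compute nodes.
    run /etc/init.d/slurm start
    for node in $NODES; do
        run ssh "root@${node}" "/etc/init.d/slurm start"
    done
    # scontrol ping reports whether slurmctld responds; sinfo lists node states.
    run scontrol ping
    run sinfo
}

deploy_and_start
```

If a node does not show up as expected in the `sinfo` output, the log files named in slurm.conf (SlurmctldLogFile and SlurmdLogFile) are the first place to look.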
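3. Usage
1) After installation, run the following command to list the partitions and nodes
# sinfo

2) Commands for viewing cluster state
# scontrol show config
# scontrol show partition
# scontrol show node
# scontrol show jobs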
3) Submit jobs (-N sets the number of nodes, -l labels each output line with its task number)
# srun hostname
# srun -N 2 -l hostname
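srun runs a command interactively; for work that should wait in the queue, the same command can be wrapped in a batch script and handed to sbatch. The sketch below writes such a script and submits it only when sbatch is present. The file name hello.sbatch, the job name, and the node count of 2 are assumptions chosen for illustration.

```shell
#!/bin/sh
# Sketch: write a small batch script and submit it with sbatch if available.
# JOB_SCRIPT is a hypothetical file name.
JOB_SCRIPT="${JOB_SCRIPT:-hello.sbatch}"

# The quoted heredoc delimiter keeps the script text exactly as written.
cat > "$JOB_SCRIPT" <<'EOF'
#!/bin/sh
#SBATCH --job-name=hello
#SBATCH --nodes=2
#SBATCH --output=hello-%j.out
# Run hostname on the allocated nodes, labelling each line with its task number.
srun -l hostname
EOF

# Submit only when the Slurm client tools are installed on this machine.
if command -v sbatch >/dev/null 2>&1; then
    sbatch "$JOB_SCRIPT"
else
    echo "sbatch not found; wrote $JOB_SCRIPT without submitting"
fi
```

In the output file name, %j is replaced by the job ID, so repeated submissions do not overwrite each other's results.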
4) Query and cancel jobs
# squeue -a
# scancel <job_id>
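When several jobs are queued, looking up each job ID by hand becomes tedious. As a small sketch, the helper below (its name, jobs_in_state, is made up for this example) filters job IDs by state from two-column squeue output. It is demonstrated on canned input so it can be tried without a cluster.

```shell
#!/bin/sh
# Sketch: pick the job IDs that are in a given state out of squeue output.
# Expected input on stdin: lines of "<job_id> <state>", as printed by
#   squeue -h -o "%i %T"
# where -h drops the header, %i is the job ID and %T is the job state.
jobs_in_state() {
    awk -v want="$1" '$2 == want { print $1 }'
}

# Example with canned input instead of a live cluster; prints 102 and 103.
printf '%s\n' "101 RUNNING" "102 PENDING" "103 PENDING" | jobs_in_state PENDING
```

On a live cluster the filter could be fed directly, for example `squeue -h -o "%i %T" | jobs_in_state PENDING`, and the resulting IDs passed to scancel. Note that scancel also accepts `--user` and `--state` options, which cover the common case without any scripting.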


Appendix 1
CentOS-Base.repo
# CentOS-Base.repo
#
# The mirror system uses the connecting IP address of the client and the
# update status of each mirror to pick mirrors that are updated to and
# geographically close to the client.  You should use this for CentOS updates
# unless you are manually picking other mirrors.
#
# If the mirrorlist= does not work for you, as a fall back you can try the 
# remarked out baseurl= line instead.
#
#


[base]
name=CentOS-$releasever - Base
#mirrorlist=http://mirrorlist.centos.org/?release=$releasever&arch=$basearch&repo=os
baseurl=http://ftp.sjtu.edu.cn/centos/$releasever/os/$basearch/
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-6


#released updates 
[updates]
name=CentOS-$releasever - Updates
#mirrorlist=http://mirrorlist.centos.org/?release=$releasever&arch=$basearch&repo=updates
baseurl=http://ftp.sjtu.edu.cn/centos/$releasever/updates/$basearch/
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-6


#additional packages that may be useful
[extras]
name=CentOS-$releasever - Extras
#mirrorlist=http://mirrorlist.centos.org/?release=$releasever&arch=$basearch&repo=extras
baseurl=http://ftp.sjtu.edu.cn/centos/$releasever/extras/$basearch/
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-6


#additional packages that extend functionality of existing packages
[centosplus]
name=CentOS-$releasever - Plus
#mirrorlist=http://mirrorlist.centos.org/?release=$releasever&arch=$basearch&repo=centosplus
baseurl=http://ftp.sjtu.edu.cn/centos/$releasever/centosplus/$basearch/
gpgcheck=1
enabled=0
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-6


#contrib - packages by Centos Users
[contrib]
name=CentOS-$releasever - Contrib
#mirrorlist=http://mirrorlist.centos.org/?release=$releasever&arch=$basearch&repo=contrib
baseurl=http://ftp.sjtu.edu.cn/centos/$releasever/contrib/$basearch/
gpgcheck=1
enabled=0
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-6


Appendix 2
slurm.conf
#
# Example slurm.conf file. Please run configurator.html
# (in doc/html) to build a configuration file customized
# for your environment.
#
#
# slurm.conf file generated by configurator.html.
#
# See the slurm.conf man page for more information.
#
ClusterName=localhost.localdomain
# Cluster name
ControlMachine=localhost.localdomain
# Host name of the control node
ControlAddr=192.168.11.125
# IP address of the control node
#BackupController=
#BackupAddr=
#
SlurmUser=slurm
#SlurmdUser=root
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
# Authentication mechanism for communication between components; munge is used here
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
StateSaveLocation=/tmp
SlurmdSpoolDir=/tmp/slurmd
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
ProctrackType=proctrack/pgid
#PluginDir=
#FirstJobId=
ReturnToService=0
#MaxJobCount=
#PlugStackConfig=
#PropagatePrioProcess=
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#Prolog=
#Epilog=
#SrunProlog=
#SrunEpilog=
#TaskProlog=
#TaskEpilog=
#TaskPlugin=
#TrackWCKey=no
#TreeWidth=50
#TmpFS=
#UsePAM=
#
# TIMERS
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
#
# SCHEDULING
SchedulerType=sched/backfill
#SchedulerAuth=
#SchedulerPort=
#SchedulerRootFilter=
SelectType=select/linear
FastSchedule=1
#PriorityType=priority/multifactor
#PriorityDecayHalfLife=14-0
#PriorityUsageResetPeriod=14-0
#PriorityWeightFairshare=100000
#PriorityWeightAge=1000
#PriorityWeightPartition=10000
#PriorityWeightJobSize=1000
#PriorityMaxAge=1-0
#
# LOGGING
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurmd.log
JobCompType=jobcomp/none
#JobCompLoc=
#
# ACCOUNTING
#JobAcctGatherType=jobacct_gather/linux
#JobAcctGatherFrequency=30
#
#AccountingStorageType=accounting_storage/slurmdbd
#AccountingStorageHost=
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStorageUser=
#
# COMPUTE NODES
NodeName=localhost.localdomain CPUs=8 RealMemory=10000 Sockets=8 CoresPerSocket=8 ThreadsPerCore=8  Procs=1 State=IDLE
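# NodeName is the node name. CPUs, Sockets, CoresPerSocket and ThreadsPerCore describe the
# processor layout and can be read from lscpu. RealMemory is the amount of memory (in MB)
# made available to Slurm. Procs is the actual number of CPUs, as listed in /proc/cpuinfo.
# A node set to State=UNKNOWN is reported as unknown when the cluster first starts and
# becomes idle afterwards.
# CPUs normally equals Sockets x CoresPerSocket x ThreadsPerCore; replace the sample values
# on these NodeName lines with the figures reported for your own hardware.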
NodeName=localhost.localdomain2 CPUs=8 RealMemory=10000 Sockets=8 CoresPerSocket=8 ThreadsPerCore=8  Procs=1 State=IDLE
NodeName=localhost.localdomain3 CPUs=8 RealMemory=10000 Sockets=8 CoresPerSocket=8 ThreadsPerCore=8  Procs=1 State=IDLE
PartitionName=control Nodes=localhost.localdomain Default=NO MaxTime=INFINITE State=UP
# The nodes are split into two partitions, control and compute. Default=YES marks the
# partition that jobs use for computation by default; localhost.localdomain2 and
# localhost.localdomain3 are the two compute nodes placed in it.
PartitionName=compute Nodes=localhost.localdomain2,localhost.localdomain3 Default=YES MaxTime=INFINITE State=UP
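The processor fields on a NodeName line have to agree with each other, which is easy to get wrong when typing them by hand. As a minimal sketch, the helper below (node_line is a made-up name, and the example topology of 2 sockets with 4 cores each is hypothetical) derives CPUs from the socket, core, and thread counts so the line stays consistent.

```shell
#!/bin/sh
# Sketch: build a NodeName line whose CPU count matches the stated topology.
# Arguments: name sockets cores_per_socket threads_per_core real_memory_mb
node_line() {
    # CPUs is computed as the product of the three topology figures.
    cpus=$(( $2 * $3 * $4 ))
    echo "NodeName=$1 CPUs=$cpus Sockets=$2 CoresPerSocket=$3 ThreadsPerCore=$4 RealMemory=$5 State=UNKNOWN"
}

# Example values are hypothetical: 2 sockets, 4 cores each, 1 thread per core.
node_line localhost.localdomain2 2 4 1 10000
```

The topology figures themselves can be taken from lscpu, as the comments in the file suggest. Running `slurmd -C` on a compute node is another option: it prints the configuration that slurmd detects on that machine in slurm.conf format.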