Introduction to SLURM Clusters and Deployment
Source: Internet  Editor: 程序博客网  Date: 2024/06/06 10:55
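1. Concepts
SLURM is a highly scalable cluster manager and job scheduler for large clusters of compute nodes. It maintains a queue of pending work, manages the overall resource use of that work, and dispatches each job to a set of allocated nodes for execution.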
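2. Environment Setup
1) Three CentOS 6.6 machines: one control node and two compute nodes
2) Configure a yum mirror located in China
① Edit /etc/yum.repos.d/CentOS-Base.repo to use the Shanghai Jiao Tong University (SJTU) mirror
# vi /etc/yum.repos.d/CentOS-Base.repo
Note: see Appendix 1 for the CentOS-Base.repo file
② Import the GPG key
# rpm --import http://ftp.sjtu.edu.cn/centos/RPM-GPG-KEY-CentOS-6
③ Update the system
# yum update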
3) Install munge (provides authenticated communication between components)
① Install the build dependencies
# yum install -y rpm-build rpmdevtools bzip2-devel openssl-devel zlib-devel gcc
② Build and install the munge packages
Download: https://github.com/dun/munge
# rpmbuild -tb --clean munge-0.5.11.tar.bz2
# cd /root/rpmbuild/RPMS/x86_64
# rpm --install munge*.rpm
③ Set the directory permissions
# chmod -Rf 700 /etc/munge
# chmod -Rf 711 /var/lib/munge
# chmod -Rf 700 /var/log/munge
# chmod -Rf 0755 /var/run/munge
④ Copy the munge key generated on the control node to every other node
# scp /etc/munge/munge.key root@<IP>:/etc/munge/
4) Create the slurm user
# useradd slurm
# passwd slurm
5) Install slurm
① Install the dependencies
# yum -y install readline-devel pam-devel perl-DBI perl-ExtUtils-MakeMaker
② Build and install
Download: http://slurm.schedmd.com/ (this guide uses slurm-16.05.8.tar.bz2)
# rpmbuild -ta --clean slurm-16.05.8.tar.bz2
# cd /root/rpmbuild/RPMS/x86_64
# rpm --install slurm*.rpm
③ Change the owner and group of the spool directory
# sudo chown slurm:slurm /var/spool
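6) On the control node, edit /etc/slurm/slurm.conf, then copy it to the compute nodes
Note: see Appendix 2 for the slurm.conf file
7) Start the slurm service
Note: because AuthType is auth/munge, the munge daemon (munged) must already be running on every node.
# /etc/init.d/slurm start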
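Steps 3) and 6) both copy a file from the control node to every compute node. The loop below is a minimal sketch of that distribution step, not part of the original procedure. The function name print_copy_cmds and the host names node1 and node2 are placeholders. It only prints the scp commands, so they can be reviewed before anything is copied.

```shell
#!/bin/bash
# Hypothetical helper: print (not run) the scp commands that would copy
# the munge key and slurm.conf from the control node to each host given
# as an argument. Host names are placeholders, not names from this guide.
print_copy_cmds() {
    local host file
    for host in "$@"; do
        for file in /etc/munge/munge.key /etc/slurm/slurm.conf; do
            # -p keeps the mode and timestamps of the source file
            echo "scp -p ${file} root@${host}:${file}"
        done
    done
}

# Dry run for two placeholder hosts; pipe the output to "sh" to execute it.
print_copy_cmds node1 node2
```

After copying, it is worth checking on each node that munge.key is still owned by the munge user and is not readable by other users, since munged is strict about the key file's ownership and permissions.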
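3. Usage
1) After installation, check that the cluster responds
# sinfo
2) View the cluster status
# scontrol show config
# scontrol show partition
# scontrol show node
# scontrol show jobs
3) Submit jobs (-N sets the number of nodes, which is 2 here because this cluster has two compute nodes; -l prefixes each output line with its task number)
# srun hostname
# srun -N 2 -l hostname
4) Query and cancel jobs
# squeue -a
# scancel <job_id>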
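The srun commands above run interactively. For work that should wait in the queue, the same command can be wrapped in a batch script and submitted with sbatch. The sketch below writes such a script. The helper name write_job_script, the job name, and the file names are illustrative assumptions. The node count of 2 assumes the two compute nodes described in this guide.

```shell
#!/bin/bash
# Hypothetical helper: write a minimal sbatch job script to the given path.
# The job name and output file pattern are illustrative choices.
write_job_script() {
    local path="$1"
    cat > "$path" <<'EOF'
#!/bin/bash
#SBATCH --job-name=hostname-test
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --output=hostname-test-%j.out
# %j in the output file name is replaced by the job ID.
srun -l hostname
EOF
}

job_script="${TMPDIR:-/tmp}/hostname-test.sh"
write_job_script "$job_script"
# Submit with:   sbatch "$job_script"
# Watch it with: squeue -a
# Cancel with:   scancel <job_id>
```

sbatch prints the job ID on submission, and that ID is what squeue lists and scancel expects.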
Appendix 1
CentOS-Base.repo
# CentOS-Base.repo
#
# The mirror system uses the connecting IP address of the client and the
# update status of each mirror to pick mirrors that are updated to and
# geographically close to the client. You should use this for CentOS updates
# unless you are manually picking other mirrors.
#
# If the mirrorlist= does not work for you, as a fall back you can try the
# remarked out baseurl= line instead.
#
#
[base]
name=CentOS-$releasever - Base
#mirrorlist=http://mirrorlist.centos.org/?release=$releasever&arch=$basearch&repo=os
baseurl=http://ftp.sjtu.edu.cn/centos/$releasever/os/$basearch/
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-6
#released updates
[updates]
name=CentOS-$releasever - Updates
#mirrorlist=http://mirrorlist.centos.org/?release=$releasever&arch=$basearch&repo=updates
baseurl=http://ftp.sjtu.edu.cn/centos/$releasever/updates/$basearch/
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-6
#additional packages that may be useful
[extras]
name=CentOS-$releasever - Extras
#mirrorlist=http://mirrorlist.centos.org/?release=$releasever&arch=$basearch&repo=extras
baseurl=http://ftp.sjtu.edu.cn/centos/$releasever/extras/$basearch/
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-6
#additional packages that extend functionality of existing packages
[centosplus]
name=CentOS-$releasever - Plus
#mirrorlist=http://mirrorlist.centos.org/?release=$releasever&arch=$basearch&repo=centosplus
baseurl=http://ftp.sjtu.edu.cn/centos/$releasever/centosplus/$basearch/
gpgcheck=1
enabled=0
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-6
#contrib - packages by Centos Users
[contrib]
name=CentOS-$releasever - Contrib
#mirrorlist=http://mirrorlist.centos.org/?release=$releasever&arch=$basearch&repo=contrib
baseurl=http://ftp.sjtu.edu.cn/centos/$releasever/contrib/$basearch/
gpgcheck=1
enabled=0
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-6
Appendix 2
slurm.conf
#
# Example slurm.conf file. Please run configurator.html
# (in doc/html) to build a configuration file customized
# for your environment.
#
#
# slurm.conf file generated by configurator.html.
#
# See the slurm.conf man page for more information.
#
ClusterName=localhost.localdomain
# Cluster name
ControlMachine=localhost.localdomain
# Host name of the control node
ControlAddr=192.168.11.125
# Address of the control node
#BackupController=
#BackupAddr=
#
SlurmUser=slurm
#SlurmdUser=root
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
# Authentication method for communication between components; munge is used here
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
StateSaveLocation=/tmp
SlurmdSpoolDir=/tmp/slurmd
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
ProctrackType=proctrack/pgid
#PluginDir=
#FirstJobId=
ReturnToService=0
#MaxJobCount=
#PlugStackConfig=
#PropagatePrioProcess=
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#Prolog=
#Epilog=
#SrunProlog=
#SrunEpilog=
#TaskProlog=
#TaskEpilog=
#TaskPlugin=
#TrackWCKey=no
#TreeWidth=50
#TmpFS=
#UsePAM=
#
# TIMERS
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
#
# SCHEDULING
SchedulerType=sched/backfill
#SchedulerAuth=
#SchedulerPort=
#SchedulerRootFilter=
SelectType=select/linear
FastSchedule=1
#PriorityType=priority/multifactor
#PriorityDecayHalfLife=14-0
#PriorityUsageResetPeriod=14-0
#PriorityWeightFairshare=100000
#PriorityWeightAge=1000
#PriorityWeightPartition=10000
#PriorityWeightJobSize=1000
#PriorityMaxAge=1-0
#
# LOGGING
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurmd.log
JobCompType=jobcomp/none
#JobCompLoc=
#
# ACCOUNTING
#JobAcctGatherType=jobacct_gather/linux
#JobAcctGatherFrequency=30
#
#AccountingStorageType=accounting_storage/slurmdbd
#AccountingStorageHost=
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStorageUser=
#
# COMPUTE NODES
NodeName=localhost.localdomain CPUs=8 RealMemory=10000 State=UNKNOWN
# NodeName is the node's host name. CPUs is the number of logical CPUs, and RealMemory is the memory (in MB) made available to Slurm. Procs is an older name for CPUs, so it must not be given a different value. If you add Sockets, CoresPerSocket and ThreadsPerCore, take them from lscpu (or /proc/cpuinfo); their product should normally equal CPUs. State=UNKNOWN is the state when the cluster first starts; the node then becomes idle.
NodeName=localhost.localdomain2 CPUs=8 RealMemory=10000 State=UNKNOWN
NodeName=localhost.localdomain3 CPUs=8 RealMemory=10000 State=UNKNOWN
PartitionName=control Nodes=localhost.localdomain Default=NO MaxTime=INFINITE State=UP
# The nodes are split into two partitions, control and compute. Default=YES marks the partition that jobs use when none is requested, so the two compute nodes, localhost.localdomain2 and localhost.localdomain3, go into the default compute partition.
PartitionName=compute Nodes=localhost.localdomain2,localhost.localdomain3 Default=YES MaxTime=INFINITE State=UP
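The node comment in slurm.conf says to read the CPU topology from lscpu. When Sockets, CoresPerSocket and ThreadsPerCore are set on a NodeName line, CPUs is normally their product (some sites instead set it to Sockets × CoresPerSocket so that hardware threads are not scheduled). The helper below is a minimal sketch for checking the usual case before starting the daemons. The name check_topology is an assumption and is not part of Slurm.

```shell
#!/bin/bash
# Hypothetical helper: check that a NodeName line's CPU count matches its
# topology, i.e. CPUs = Sockets * CoresPerSocket * ThreadsPerCore.
# Arguments: CPUs Sockets CoresPerSocket ThreadsPerCore
check_topology() {
    local cpus="$1" sockets="$2" cores="$3" threads="$4"
    local expected=$(( sockets * cores * threads ))
    if [ "$cpus" -eq "$expected" ]; then
        echo "ok: CPUs=${cpus}"
    else
        echo "mismatch: CPUs=${cpus}, topology gives ${expected}"
        return 1
    fi
}

# One socket with 8 single-threaded cores is consistent with CPUs=8.
check_topology 8 1 8 1
# 8 sockets x 8 cores x 8 threads describes 512 logical CPUs, not 8.
check_topology 8 8 8 8 || true
```

On a node where Slurm is already installed, running slurmd -C prints the hardware it detects in NodeName format, which can be compared with the values in slurm.conf.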