Batch Jobs



 

  • commands: qsub, qstat, qdel
    • qsub
    • qstat
    • qdel
  • examples
    • serial programme
    • parallel: MPI
    • parallel: OpenMP

 

commands: qsub, qstat, qdel

On the alibaba cluster, batch jobs are managed by the batch queuing system Torque. Torque is an open source resource manager that provides control over batch jobs running on the compute nodes.

The most important commands are qsub for submitting a job, qstat for monitoring its status, and qdel for deleting a job.

For the description of these and related other commands:

qalter, pbs_alterjob, pbs_statjob, pbs_statque, pbs_statserver, pbs_submit, pbs_job_attributes, pbs_queue_attributes, pbs_server_attributes, pbs_resources_*

see http://www.clusterresources.com/wiki/doku.php?id=torque:torque_wiki or the corresponding man-pages.

 

qsub

The qsub command is usually called with the filename of a script as its parameter. That script holds the job parameters as well as the call of the actual programme. Parameters are placed in comment lines ("#") at the top of the script. Each such line starts with the directive prefix "#PBS" (without a space after the "#"), followed by the parameter setting, e.g.

...
#PBS -l walltime=6:10:00
...

to set the maximum amount of real time during which the job can be in the running state (here 6 hours and 10 minutes). Parameters can also be specified as command-line arguments, e.g.

 > qsub -l nodes=12  

to request 12 nodes. Command line arguments take precedence over parameters set in the script.
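The interplay between script directives and command-line arguments can be sketched as follows. This is only an illustration under assumptions: the job name demo, the programme ./my_programme and the temporary file are hypothetical, and the qsub command is printed rather than executed.

```shell
#!/bin/sh
# Sketch: write a small job script, then build a qsub call whose
# command-line option overrides the walltime set inside the script.
# "demo" and "./my_programme" are hypothetical placeholders.

job_script=$(mktemp)

# The quoted 'EOF' keeps $PBS_O_WORKDIR unexpanded until the job runs.
cat > "$job_script" <<'EOF'
#PBS -N demo
#PBS -l walltime=6:10:00
cd "$PBS_O_WORKDIR"
./my_programme
EOF

# The script asks for 6:10:00, the command line asks for 1:00:00.
# Command-line arguments take precedence, so the job would get one hour.
submit_cmd="qsub -l walltime=1:00:00 $job_script"

# Dry run: only print the command instead of submitting the job.
echo "would run: $submit_cmd"
```

Printing the command first is a cheap way to check the options before anything is actually placed in the queue.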

Important options are:

Option           Description
-N name          job name
-q queue         destination queue
-S shell         path to the shell that interprets the job script
-j oe            join stdout and stderr streams
-o filename      file or directory to which the (joined) output is written
-m a             send an e-mail notification in case of a job abort
-M e-mail        e-mail address to which the notification is sent
-l resource(s)   see below

The important resource parameters are:

Resource   Format                                       Description
nodes      number-of-nodes[:ppn=processes-per-node]     Number and type of nodes to be reserved for exclusive use by the job.
walltime   seconds or [[HH:]MM:]SS                      Maximum amount of real time during which the job can be in the running state.
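Since walltime accepts either plain seconds or the [[HH:]MM:]SS form, it can help to convert between the two when comparing limits. A minimal sketch, assuming a POSIX shell with awk available. The function name walltime_to_seconds is made up for this example and is not part of Torque.

```shell
#!/bin/sh
# Convert a walltime given as [[HH:]MM:]SS into plain seconds.
# walltime_to_seconds is a hypothetical helper, not a Torque command.
walltime_to_seconds() {
    printf '%s\n' "$1" | awk -F: '
        NF == 1 { print $1 + 0; next }                      # SS
        NF == 2 { print $1 * 60 + $2; next }                # MM:SS
        NF == 3 { print $1 * 3600 + $2 * 60 + $3; next }    # HH:MM:SS
        { exit 1 }                                          # anything else
    '
}

walltime_to_seconds 6:10:00   # prints 22200
walltime_to_seconds 100       # prints 100
```

The value 6:10:00 used earlier corresponds to 6 * 3600 + 10 * 60 = 22200 seconds.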

At run time the following environment variables are set:

Variable name   Used for
PBS_JOBNAME     user specified jobname
PBS_O_WORKDIR   job's submission directory
PBS_TASKNUM     number of tasks requested
PBS_O_HOME      home directory of submitting user
PBS_O_LOGNAME   name of submitting user
PBS_JOBCOOKIE   job cookie
PBS_NODENUM     node offset number
PBS_O_SHELL     script shell
PBS_O_JOBID     unique pbs job id
PBS_O_HOST      host on which job script is currently running
PBS_QUEUE       job queue
PBS_NODEFILE    file containing line delimited list of nodes allocated to the job
PBS_O_PATH      path variable used to locate executables within job script
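A job script can use these variables directly. The following sketch shows a possible script preamble. The fallback values and the helper count_slots are assumptions added for illustration, so that the snippet also runs outside the batch system. It further assumes that the file named by PBS_NODEFILE holds one line per allocated slot.

```shell
#!/bin/sh
# Sketch of a job-script preamble using variables from the table above.
# The fallbacks after ":-" are assumptions for runs outside the batch system.

job_name="${PBS_JOBNAME:-interactive}"
work_dir="${PBS_O_WORKDIR:-$PWD}"

# Count the non-empty lines of a node file (hypothetical helper).
count_slots() {
    awk 'NF { n++ } END { print n + 0 }' "$1"
}

# Change to the directory from which the job was submitted.
cd "$work_dir" || exit 1
echo "job $job_name running in $work_dir"

# Report the allocation only when the batch system provided a node file.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    echo "allocated slots: $(count_slots "$PBS_NODEFILE")"
fi
```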

 

 

qstat

The qstat command is used to monitor submitted jobs. Although not all job information is shown to a normal user, one can get information such as the job ID, the job name and the queuing status. The output is available in different formats and levels of verbosity.

To get an overview in table form, type qstat without any argument.

Job id   Name                        User         Time Use        S                    Queue
<ID>     <Jobname> (given by user)   <Username>   Used CPU time   Status (see below)   <Jobqueue>

Status can be

C   job is completed after having run
E   job is exiting after having run
H   job is held
Q   job is queued, eligible to run or routed
R   job is running
T   job is being moved to new location
W   job is waiting for its execution time (-a option) to be reached
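In scripts that process qstat output, the one-letter status can be translated back into words. A minimal sketch: the function name describe_state and the shortened descriptions are made up for this example, while the letters are those listed above.

```shell
#!/bin/sh
# Map a one-letter job status from the qstat table to a short description.
# describe_state is a hypothetical helper, not part of Torque.
describe_state() {
    case "$1" in
        C) echo "completed" ;;
        E) echo "exiting" ;;
        H) echo "held" ;;
        Q) echo "queued" ;;
        R) echo "running" ;;
        T) echo "being moved" ;;
        W) echo "waiting" ;;
        *) echo "unknown"; return 1 ;;
    esac
}

describe_state R   # prints "running"
```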

More detailed information can be requested by using the "-f" option:

> qstat -f [job_id]

For more information, see the qstat man-page or the online man page in the Torque documentation.

 

qdel

The qdel command is used to delete a job, which has to be specified by its job-identifier, that is, type

> qdel <job_id>

to delete the job with the id <job_id>. Once the command has been issued, a "Delete Job" batch request is sent to the batch server that owns the job. See the man-page for more information.

 

 

examples

Simple examples are given for serial and parallel batch jobs.

 

serial programme

 

#PBS -N test1
#PBS -j oe
#PBS -o /home/user/test/test1.log
#PBS -l walltime=100
#
set -x
cd /work/user
/home/user/test/a.out
exit

The batch job executes /home/user/test/a.out in the directory /work/user and writes a log file to /home/user/test/test1.log.

 

parallel: MPI

 

#PBS -N test2
#PBS -j oe
#PBS -o /home/user/test/test2.log
#PBS -l nodes=3:ppn=8
#PBS -l walltime=100
#
set -x
cd /work/user
/home/user/test/a.out -np 24
exit

The job executes on 3 nodes using all 8 cores of each node, i.e. 3 x 8 = 24 processes in total. One has to make sure that the "-l nodes=...:ppn=..." and "-np ..." specifications match. In parallel jobs one should always request 8 cores per node (i.e. ppn=8). Otherwise the nodes would be shared with other users, which should be avoided.
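The number of processes that belongs to a given node request can be computed rather than typed by hand. A minimal sketch: the function name total_procs is made up for this example, and treating a missing ppn as 1 is an assumption about the default.

```shell
#!/bin/sh
# Compute nodes * ppn from a resource spec such as "nodes=3:ppn=8".
# total_procs is a hypothetical helper; ppn is assumed to default to 1.
total_procs() {
    spec=$1

    # Take the text after "nodes=" up to the next ":".
    nodes=${spec#nodes=}
    nodes=${nodes%%:*}

    # Take the text after ":ppn=" if present.
    case "$spec" in
        *:ppn=*) ppn=${spec##*:ppn=}; ppn=${ppn%%:*} ;;
        *)       ppn=1 ;;
    esac

    echo $(( $nodes * $ppn ))
}

total_procs nodes=3:ppn=8   # prints 24
```

The result is the value that the "-np ..." specification has to match.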

A special case is ppn=4 (or smaller). In that case one should also specify the interconnect to be used. This can be done by adding #PBS -q gbe for Gigabit Ethernet or #PBS -q ib for InfiniBand. (For ppn larger than 4 the system will automatically use the large nodes and InfiniBand.)

 

 

parallel: OpenMP

 

#PBS -N test3
#PBS -j oe
#PBS -o /home/user/test/test3.log
#PBS -l nodes=1:ppn=T
#PBS -l walltime=100
#
set -x
cd /work/user
export OMP_NUM_THREADS=T
/home/user/test/a.out
exit

Here T stands for the number of threads. In parallel jobs one should always request T = 8 cores per node (i.e. ppn=8). Otherwise the node would be shared with other users, which should be avoided. Consequently, OMP_NUM_THREADS should also be set to 8.

An alternative is to request 4 threads. In that case one should explicitly request the small nodes by adding #PBS -q gbe.
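To keep ppn and OMP_NUM_THREADS from drifting apart, the thread count can be derived inside the job instead of being written twice. The sketch below assumes that the file named by PBS_NODEFILE contains one line per requested core, so that a nodes=1:ppn=T request yields T lines. The helper omp_threads is made up for this example and falls back to 1 thread outside the batch system.

```shell
#!/bin/sh
# Derive the OpenMP thread count from the node file of a single-node job.
# omp_threads is a hypothetical helper. Assumption: one line per core.
omp_threads() {
    if [ -n "${1:-}" ] && [ -r "$1" ]; then
        # Count non-empty lines, but never report fewer than 1 thread.
        awk 'NF { n++ } END { if (n < 1) n = 1; print n }' "$1"
    else
        # No node file available, e.g. when run outside the batch system.
        echo 1
    fi
}

OMP_NUM_THREADS=$(omp_threads "${PBS_NODEFILE:-}")
export OMP_NUM_THREADS
echo "using $OMP_NUM_THREADS OpenMP threads"
```

With this in place, only the "-l nodes=1:ppn=..." line has to be edited when the thread count changes.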
