Using the Job Scheduler Slurm

Slurm Workload Manager Basics

The Benefit AI Lab Cluster uses Slurm as its scheduler and workload manager. As a warning: on a cluster, you do not run computations on the login node. Computations belong on the compute nodes; when and where they run is decided by the scheduler (here, Slurm). In the Benefit AI Lab cluster, the login node is the master node, hayrat.

After logging in to hayrat, you can submit a job with Slurm, and it will run on the compute or GPU nodes that you specify in the submission script.

The workload manager distributes resources according to the cluster's rules. The resources managed by Slurm include:

  • CPU cores
  • RAM
  • GPUs

You can request these resources through Slurm in a submission script passed to sbatch, or directly on the command line with srun, but the final allocation decision is made by the workload manager.
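As an illustration, a small interactive request might look like the following (a minimal sketch; the exact partitions and limits available to you depend on your group's privileges):

$ srun --partition=standard --ntasks=1 --cpus-per-task=2 --mem=4G --time=00:30:00 --pty bash

This asks the scheduler for one task with two CPU cores and 4 GB of RAM for 30 minutes and, once the allocation is granted, opens an interactive shell on the assigned compute node.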

The system administrator also divides the cluster into partitions, and each user group has access to some of these partitions depending on its privileges. A partition is a set of compute nodes (computers dedicated to… computing) grouped logically, based either on physical properties of the hardware or on job scheduling policies. Once a submitted job has executed, its output is written to disk (storage).

Gathering Cluster Information

Slurm provides the sinfo command to get an overview of the resources offered by the cluster. By default, sinfo lists the partitions that are available.

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
standard*    up   12:00:00      4   idle cn[01-04]
compute      up 1-00:00:00      8   idle cn[01-08]
gpu          up 3-00:00:00      2  alloc gpu[01-02]

As you can see from the output of the basic sinfo command, there are three partitions in this cluster: standard (the default, marked with *) with four compute nodes cn01 to cn04, compute with eight nodes, and gpu with the two GPU nodes.

You can output node-level information using sinfo -Nl. With the -l argument, more information about the nodes is provided, among which the number of “CPUs” (CPUS), which is the number of processing units that jobs can use. It should generally correspond to the number of sockets (S) times the number of cores per socket (C) times the number of hardware threads per core (T in the S:C:T column), but can be lower in case some CPUs are reserved for system use.

NODELIST   NODES PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON              
cn01           1 standard*        idle 24     2:12:1 385443        0      1   (null) none                
cn01           1   compute        idle 24     2:12:1 385443        0      1   (null) none                
cn02           1 standard*        idle 24     2:12:1 385443        0      1   (null) none                
cn02           1   compute        idle 24     2:12:1 385443        0      1   (null) none                
cn03           1 standard*        idle 24     2:12:1 385443        0      1   (null) none                
cn03           1   compute        idle 24     2:12:1 385443        0      1   (null) none                
cn04           1 standard*        idle 24     2:12:1 385443        0      1   (null) none                
cn04           1   compute        idle 24     2:12:1 385443        0      1   (null) none                
cn05           1   compute        idle 24     2:12:1 385443        0      1   (null) none                
cn06           1   compute        idle 24     2:12:1 385443        0      1   (null) none                
cn07           1   compute        idle 24     2:12:1 385443        0      1   (null) none                
cn08           1   compute        idle 24     2:12:1 385443        0      1   (null) none                
gpu01          1       gpu   allocated 192    2:48:2 103162        0      1 gpu,cent none                
gpu02          1       gpu   allocated 192    2:48:2 103162        0      1 gpu,cent none

The other columns report the volatile working memory (RAM – MEMORY), the size of the local temporary disk (also called local scratch space – TMP_DISK), and the node “weight” (an internal parameter specifying preferences in nodes for allocations when there are multiple possibilities).
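If you only need a few of these fields, sinfo also accepts an output format string. For example (the format specifiers below are standard sinfo options; adjust them to the fields you care about):

$ sinfo -N -o "%N %P %c %m %G"

Here %N is the node name, %P its partition, %c the number of CPUs, %m the memory in megabytes, and %G any generic resources such as GPUs.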

Running Jobs on Slurm

Jobs are made of one or more sequential steps, each consisting of one or more parallel tasks that may be dispatched to distinct nodes. Each task is allocated CPUs, memory, and possibly other generic resources in an exclusive manner by Slurm.

Two jobs cannot share the same resources unless explicitly forced by the administrators, which is generally not the case. A job can therefore only start when all the resources it needs are free and not required by another, higher-priority job. Jobs are assigned a priority when they are submitted, which can depend on multiple factors.
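If one of your jobs stays pending and you want to see how its priority was computed, the sprio command reports the individual priority factors (which factors apply, and their weights, depends on how the administrators have configured the scheduler):

$ sprio -l            # priority breakdown for all pending jobs
$ sprio -j <jobid>    # priority breakdown for a single job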

For the scheduling process to work properly, you will need to describe your job before you submit it:

  • What the steps are (i.e. which program must be run and how);
  • How many tasks there will be;
  • What resources each task needs (CPU, memory, etc.); and
  • How long the job is supposed to run.

All of these, along with any additional submission options, can be described in a submission script. The #SBATCH directive is used to specify these parameters in the script.

As an example, the following should be entered in a file named submit.sh:

#!/bin/bash
#
#SBATCH --job-name=test
#SBATCH --output=res.txt
#SBATCH --partition=compute
#
#SBATCH --time=10:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=100

srun hostname
srun sleep 60

This job does not do a lot. It will only display the hostname of the compute node it is running on and then sleep for 60 seconds. Note that these programs are run as job steps using the Slurm command srun. You should use an editor like nano to write this job script in a file (here called submit.sh) and save it. Then, to submit the job, use the sbatch command as follows:

$ sbatch submit.sh 
Submitted batch job 4306
$ squeue --me
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              4306   compute     test  username  R       0:02      1 cn01

Note that the squeue command shows the job queue, including each job's assigned node(s) and current status.
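For a more detailed view of a single job, whether pending or running, scontrol can print all of its parameters:

$ scontrol show job 4306

The output includes the requested and allocated resources, the time limit, and the job's working directory.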

Since we specified in the job script that the output should be stored in a file called res.txt, you can view the output after the job has finished, for example with cat res.txt:

==========================================
SLURM_CLUSTER_NAME = linux
SLURM_ARRAY_JOB_ID = 
SLURM_ARRAY_TASK_ID = 
SLURM_ARRAY_TASK_COUNT = 
SLURM_ARRAY_TASK_MAX = 
SLURM_ARRAY_TASK_MIN = 
SLURM_JOB_ACCOUNT = faculty
SLURM_JOB_ID = 4306
SLURM_JOB_NAME = test
SLURM_JOB_NODELIST = cn01
SLURM_JOB_USER = test
SLURM_JOB_UID = 1132
SLURM_JOB_PARTITION = compute
SLURM_TASK_PID = 2206380
SLURM_SUBMIT_DIR = /home/nfs/test
SLURM_CPUS_ON_NODE = 24
SLURM_NTASKS = 1
SLURM_TASK_PID = 2206380
==========================================
cn01

The res.txt file contains a brief report on the job, followed by the job's own output, which in this case is just the name of the compute node (the result of running the hostname command).
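The SLURM_* entries in that report are environment variables that Slurm sets for every job, and you can read them from your own submission scripts as well. Below is a minimal sketch (the echo lines are only an illustration; the variable names are standard Slurm ones):

#!/bin/bash
#SBATCH --job-name=envdemo
#SBATCH --output=envdemo.txt
#SBATCH --ntasks=1
#SBATCH --time=05:00

# Print a few of the variables Slurm sets for this job
echo "Job ${SLURM_JOB_ID} (${SLURM_JOB_NAME}) running on ${SLURM_JOB_NODELIST}"
echo "CPUs available on this node: ${SLURM_CPUS_ON_NODE}"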

This was just a basic example, and there are several other commands you can use to monitor and control your submitted jobs, such as squeue, scancel, scontrol, and sview. For more information, please refer to the Slurm documentation: https://slurm.schedmd.com/documentation.html.
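As a quick reference, a running or pending job can be cancelled with scancel, and sacct reports accounting information once a job has finished (provided job accounting is enabled on the cluster):

$ scancel 4306        # cancel the job with ID 4306
$ scancel -u $USER    # cancel all of your own jobs
$ sacct -j 4306 --format=JobID,JobName,Elapsed,State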