Running jobs on CREATE HPC¶
This page covers what you need to know to get your code running on the CREATE HPC cluster: quick-start steps for beginners with simple jobs, configuration steps for GPU and MPI jobs, and general advice on using the scheduler that will help you get the most out of the system. Only trivial code examples are used here; specific scientific software can be accessed using modules or containers.
Quickstart¶
Prerequisites¶
These instructions assume the following to be true:
- Your account has been enabled.
- You have an ssh client installed on a computer with internet connectivity or on the KCL network.
- You are able to successfully access the cluster login nodes.
- You know how to create files with a Linux-based text editor.
What is the CREATE HPC cluster?¶
The CREATE HPC cluster is a collection of highly spec'd compute nodes (computers) with a shared network and storage servers. More information about the hardware specifications of the nodes is available on the Compute Nodes page. The scheduler (Slurm) allows users of the cluster to submit jobs (software applications) to run on this pool of hardware, and for the underlying hardware to be efficiently allocated according to compute requirements, access policies and priorities based on the framework set out in the Scheduler Policy.
Loading software dependencies¶
The cluster makes use of Module Environments to provide a means for loading specific versions of scientific software or development tools on the cluster. When submitting jobs to the cluster you should load the software modules required for your program to run.
Identify your partition¶
The scheduler is configured to group the compute nodes into a number of partitions in order to apply a sharing policy. The first task when submitting jobs to the cluster is to identify which partition you should use from the tables on Compute Nodes.
Based on which User Group you belong to in those tables, you must use the associated Partition Name with the -p option of the srun and sbatch commands in the following examples. The _gpu partitions should be used when submitting GPU jobs.
Running an interactive job¶
Note
4 hours is the maximum runtime for interactive jobs; you must request a time limit of no more than 4 hours with the --time <minutes> option.
Running an interactive job is the cluster equivalent of executing a command on a standard Linux command line.
This means you will be able to provide input and read output (via the terminal) in real-time.
This is often used as a means to test that code runs before submitting a batch job.
You can start a new interactive job by using the srun
command; the scheduler will search for an available compute node, and provide you with an interactive login shell on the node if one is available.
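A minimal sketch of an interactive session follows; the partition name and time limit are examples, and the compute node you are allocated will differ:

```bash
# request a 30-minute interactive session on the shared partition
srun -p public_cpu --time 30 --pty /bin/bash -l

# the prompt moves to the allocated compute node (e.g. erc-hpc-comp-021),
# where standard Linux commands can be run
hostname

# in a separate login-node shell, list your running jobs
squeue -u <username>

# leave the interactive session; the allocation is released
exit
```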
In the above example, the srun command is used together with several options: --time <minutes>, -p public_cpu, --pty and /bin/bash -l.
The -p option specifies the partition; here the shared public_cpu partition is used.
The --time option sets a limit on the total runtime of the job allocation and can be extended.
The --pty option executes the task in pseudo-terminal mode, allowing the session to act like a standard terminal session.
The /bin/bash option is the command to be run, in this case the default Linux shell, bash.
Once srun is run, a terminal session is acquired where you can run standard Linux commands (echo, cd, ls, etc.) on the allocated compute node (erc-hpc-comp-021 in this example). squeue -u <username> shows the details of the running interactive job and reports an empty list once we have exited bash.
Submitting a batch job¶
Batch (or non-interactive) jobs allow users to leverage one of the main benefits of having a cluster scheduler; jobs can be queued up with instructions on how to run them and then executed across the cluster while the user does something else. Users submit jobs as scripts, which include instructions on how to run the job - the output of the job (stdout and stderr in Linux terminology) is written to a file on disk for review later on. You can write a batch job that does anything that can be typed on the command-line.
We'll start with a basic example - the following script is written in bash. You can create the script yourself using your editor of choice. The script does nothing more than print some messages to the screen (the echo lines) and sleep for 15 seconds.
We've saved the script to a file called helloworld.sh
- the .sh
extension helps to remind us that this is a shell script, but adding a filename extension isn't strictly necessary for Linux.
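A sketch of such a script (the exact wording of the messages is illustrative):

```bash
#!/bin/bash -l
#SBATCH --output=/scratch/users/%u/%j.out
echo "Starting running on host $HOSTNAME"
sleep 15
echo "Finished running - goodbye from $HOSTNAME"
```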
Note
We use the -l
option to bash on the first line of the script to request a login session. This ensures that Environment Modules can be loaded from your script.
Note
SBATCH --output=/scratch/users/%u/%j.out
is specified to direct the output of the job to the fast scratch storage (ceph). We recommend using this configuration for all your jobs.
If needed you can create sub-directories within your scratch storage area to keep things tidy between different projects.
We can execute that script directly on the login node with the command bash helloworld.sh, and we get output like the following:
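For example (the login-node hostname shown is illustrative, and the second message appears only after the 15-second sleep):

```bash
$ bash helloworld.sh
Starting running on host erc-hpc-login1
Finished running - goodbye from erc-hpc-login1
```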
To submit your job script to the cluster job scheduler, use the command sbatch -p <partition> helloworld.sh, where <partition> is your partition name.
The job scheduler should immediately report the job ID for your job; this is a unique identifier which can be used when viewing or controlling queued jobs:
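A sketch of a submission, with an illustrative job ID; once the job has finished, its output can be read back from the file named after that ID in your scratch space:

```bash
$ sbatch -p <partition> helloworld.sh
Submitted batch job 1123581

$ cat /scratch/users/<username>/1123581.out
Starting running on host erc-hpc-comp-021
Finished running - goodbye from erc-hpc-comp-021
```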
Default resources¶
The job launched above didn't make any explicit requests for resources (e.g. CPU cores, memory) or specify a runtime and so inherited the cluster defaults. If more resources are needed (this is HPC after all) the batch job should include instructions to request more resources. It is important to remember that if you exhaust the resource limits (e.g. runtime or memory) your job will be killed.
Job monitoring¶
Viewing and controlling queued jobs¶
Once your job has been submitted, use the squeue command to view the status of the job queue (adding -u <your_username> to see only your jobs).
If there are available compute nodes with the resources you've requested, your job should be shown in the R (running) state; if not, your job may be shown in the PD (pending) state until resources are available to run it.
If a job is in PD
state - the reason for being unable to run will be displayed in the NODELIST(REASON)
column of the squeue
output.
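For example:

```bash
# list only your own jobs; the ST column shows R (running) or PD (pending),
# and NODELIST(REASON) shows the allocated node or the reason the job is pending
squeue -u <username>
```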
You can use the scancel <jobid>
command to delete a job you've submitted, whether it's running or still in the queued state.
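For example (the job ID is illustrative):

```bash
# cancel a queued or running job by its job ID
scancel 1123581

# confirm it has disappeared from the queue
squeue -u <username>
```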
Checking job resource usage¶
To check the resource usage of a running or completed job, you can use the sacct
command.
Looking at the actual time taken, memory used, etc. of a completed job can be very helpful for optimising the amount of resources you request in future jobs.
Use the -j <job_id>
option to specify the ID of the job you want to investigate, and use --format=<fields>
to specify which job information you want to output.
Some useful fields included in the suggested command below are:
- the requested and elapsed time (Timelimit and Elapsed; CPUTime is the time elapsed multiplied by the number of CPUs used)
- the requested and allocated number of CPUs (ReqCPUS and NCPUS)
- the requested and maximum memory usage (ReqMem and MaxRSS)
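A suggested command covering the fields above (substitute your own job ID):

```bash
sacct -j <job_id> --format=JobID,JobName,Partition,State,Timelimit,Elapsed,CPUTime,ReqCPUS,NCPUS,ReqMem,MaxRSS
```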
The full list of available fields is given by sacct -e
.
More information on how to interpret the fields is available by running man sacct
.
To save remembering all the fields of interest, you can create an alias by adding the following line to your .profile
:
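For example (the alias name jobstats is arbitrary):

```bash
alias jobstats='sacct --format=JobID,JobName,Partition,State,Timelimit,Elapsed,CPUTime,ReqCPUS,NCPUS,ReqMem,MaxRSS -j'
```

After reloading your profile, jobstats <job_id> then prints the summary for a given job.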
Scheduler instructions¶
Job instructions can be provided in two ways; they are:
- On the command line, as parameters to your sbatch or srun command. For example, you can set the name of your job using the --job-name=[name] | -J [name] option:
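For example, reusing the helloworld.sh script from above:

```bash
# the job name then appears in the NAME column of squeue output
sbatch -p <partition> --job-name=hello-world helloworld.sh
```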
- In your job script, by including scheduler directives at the top of your job script - you can achieve the same effect as providing options with the sbatch or srun commands. To add the --job-name directive to our previous example:
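A sketch of the equivalent job script:

```bash
#!/bin/bash -l
#SBATCH --job-name=hello-world
#SBATCH --output=/scratch/users/%u/%j.out
echo "Starting running on host $HOSTNAME"
sleep 15
echo "Finished running - goodbye from $HOSTNAME"
```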
Including job scheduler instructions in your job-scripts is often the most convenient method of working for batch jobs - follow the guidelines below for the best experience:
- Lines in your script that include job-scheduler directives must start with #SBATCH at the beginning of the line
- You can have multiple lines starting with #SBATCH in your job-script
- You can put multiple instructions separated by a space on a single line starting with #SBATCH
- The scheduler will parse the script from top to bottom and set instructions in order; if you set the same parameter twice, the second value will be used.
- Instructions are parsed at job submission time, before the job itself has actually run.
- This means you can't, for example, tell the scheduler to put your job output in a directory that you create in the job-script itself - the directory will not exist when the job starts running, and your job will fail with an error.
- You can use dynamic variables in your instructions (see below)
Dynamic scheduler variables¶
When writing submission scripts you will often need to reference values set by the scheduler (e.g. jobid), values inherited from the OS (e.g. username), or values you set elsewhere in your script (e.g. ntasks in the SBATCH directives).
For this purpose a number of dynamic variables are made available. In the list below variables starting %
can be referenced in the SBATCH directives and those starting $
from the body of the shell script.
- %u / $USER: The Linux username of the submitting user
- %a / $SLURM_ARRAY_TASK_ID: Job array ID (index) number
- %A / $SLURM_ARRAY_JOB_ID: Job allocation number for an array job
- %j / $SLURM_JOB_ID: Job allocation number
- %x / $SLURM_JOB_NAME: Job name
- $SLURM_NTASKS: Number of CPU cores requested with -n, --ntasks; this can be provided to your code to make use of the allocated CPU cores
Simple scheduler instruction examples¶
Here are some commonly used scheduler instructions, along with some examples of their usage:
Setting output file location¶
To set the output file location for your job, use the -o [file_name] | --output=[file_name] option - both standard-out and standard-error from your job-script, including any output generated by applications launched by the script, will be saved in the file you specify.
By default, the scheduler stores data relative to your home-directory - but to avoid confusion, we recommend specifying a full path to the filename to be used. Although Linux can support several jobs writing to the same output file, the result is likely to be garbled - it's common practice to include something unique about the job (e.g. its job ID) in the output filename to make sure your job's output is clear and easy to read.
Note
The directory used to store your job output file must exist and be writable by your user before you submit your job to the scheduler. Your job may fail to run if the scheduler cannot create the output file in the directory requested.
The following example uses the --output=[file_name]
instruction to set the output file location:
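For example:

```bash
#!/bin/bash -l
#SBATCH --output=/scratch/users/%u/%j.out
echo "Starting running on host $HOSTNAME"
sleep 15
echo "Finished running - goodbye from $HOSTNAME"
```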
Note
SBATCH --output=/scratch/users/%u/%j.out
is specified to direct the output of the job to the fast scratch storage (ceph).
We recommend using this configuration for all your jobs. If needed you can create sub-directories within your scratch storage area to keep things tidy between different projects.
Setting working directory for your job¶
By default, jobs are executed from your home-directory on the cluster (i.e. /home/<your-user-name>
, $HOME
or ~
).
You can include cd
commands in your job-script to change to different directories; alternatively, you can provide an instruction to the scheduler to change to a different directory to run your job. The available options are:
-D | --chdir=[directory path]
- instruct the job scheduler to move into the directory specified before starting to run the job on a compute node
For example, to change your working directory in a batch job:
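A sketch, where the directory is a placeholder for one that already exists in your scratch space:

```bash
#SBATCH --chdir=/scratch/users/<username>/my_project
```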
Note
The directory specified must exist and be accessible by the compute node in order for the job you submitted to run.
Waiting for a previous job before running¶
You can instruct the scheduler to wait for an existing job to finish before starting to run the job you are submitting with the -d [state:job_id] | --depend=[state:job_id]
option.
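For example, to run a follow-up script only once job 77 has completed successfully (the afterok dependency type), something like the following could be used; next_step.sh is a placeholder for your own script:

```bash
sbatch -p <partition> --depend=afterok:77 next_step.sh
```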
Running task array jobs¶
A common workload is having a large number of jobs to run which basically do the same thing, aside perhaps from having different input data. You could generate a job-script for each of them and submit it, but that's not very convenient - especially if you have many hundreds or thousands of tasks to complete. Such jobs are known as task arrays - an embarrassingly parallel job will often fit into this category.
A convenient way to run such jobs on a cluster is to use a task array, using the -a [array_spec] | --array=[array_spec]
directive.
Your job-script can then use the pseudo environment variables created by the scheduler to refer to data used by each task in the job. The following job-script uses the $SLURM_ARRAY_TASK_ID
/%a
variable to echo its current task ID to an output file:
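A minimal sketch, assuming a 10-task array:

```bash
#!/bin/bash -l
#SBATCH --array=1-10
#SBATCH --output=/scratch/users/%u/%A_%a.out
# each task writes its own index to its own output file
echo "I am task $SLURM_ARRAY_TASK_ID of array job $SLURM_ARRAY_JOB_ID"
```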
All tasks in an array job are given a job ID with the format [job_ID]_[task_number]
e.g. 77_81
would be job number 77, array task 81.
Array jobs can easily be cancelled using the scancel command - the following examples show various levels of control over an array job:
- scancel 77: Cancels all array tasks under the job ID 77
- scancel 77_[100-200]: Cancels array tasks 100-200 under the job ID 77
- scancel 77_5: Cancels array task 5 under the job ID 77
A common use case for array jobs is running the same script on many different input files. One way to accomplish this with an array job is to create a text file listing all the input files, then use the array ID to select the corresponding line from the file.
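For example, a hypothetical list saved as input_files.txt, with one input file per line (the filenames are placeholders):

```
sample_01.fastq
sample_02.fastq
sample_03.fastq
```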
The job submission script could look something like this:
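A sketch of such a script, assuming the list above was saved as input_files.txt; my_program is a placeholder for your own command:

```bash
#!/bin/bash -l
#SBATCH --array=1-3
#SBATCH --output=/scratch/users/%u/%A_%a.out
# select the line of input_files.txt matching this task's array index
INPUT_FILE=$(sed -n "${SLURM_ARRAY_TASK_ID}p" input_files.txt)
echo "Task $SLURM_ARRAY_TASK_ID processing $INPUT_FILE"
my_program "$INPUT_FILE"
```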
Requesting more resources¶
By default, jobs are constrained to the cluster defaults (see table below) - users can use scheduler instructions to request more resources for their jobs as needed. The following documentation shows how these requests can be made.
| CPU cores | Memory | Runtime |
|---|---|---|
| 1 core | 1GB | 24 hours |
In order to promote best-use of the cluster scheduler, particularly in a shared environment, it is recommended that you inform the scheduler of the amount of time, memory and CPU cores your job is expected to need.
This helps the scheduler appropriately place jobs on the available nodes in the cluster and should minimise any time spent queuing for resources to become available (to this end you should always request the minimal amount of resources you require to run).
See the section on checking job resource usage with sacct for tips on how to identify the resources used by past jobs - you can use this as a guide for resource requests for new jobs.
Requesting a longer runtime¶
Note
48 hours is the maximum runtime for batch jobs. If you have a job which requires longer and cannot be checkpointed please email support@er.kcl.ac.uk to discuss your requirements.
You can inform the cluster scheduler of the expected runtime using the -t, --time=<time>
option. For example - to submit a job that runs for 2 hours, the following example job script could be used:
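A sketch, where my_program is a placeholder for the command you want to run:

```bash
#!/bin/bash -l
#SBATCH --output=/scratch/users/%u/%j.out
#SBATCH --time=0-2:00
my_program
```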
You can then see any time limits assigned to running jobs using the command squeue --long
:
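For example:

```bash
# the long output format includes a TIME_LIMIT column for each job
squeue --long -u <username>
```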
Requesting more memory¶
You can specify the maximum amount of memory required per submitted job with the --mem=<MB>
option.
This informs the scheduler of the memory required for the submitted job. Optionally - you can also request an amount of memory per CPU core rather than a total amount of memory required per job.
To specify an amount of memory to allocate per core, use the --mem-per-cpu=<MB>
option.
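For example, a minimal sketch (the values are illustrative, and --mem and --mem-per-cpu should not be combined in the same job):

```bash
# request 8192 MB for the whole job...
#SBATCH --mem=8192
# ...or, alternatively, 2048 MB per allocated CPU core:
# #SBATCH --mem-per-cpu=2048
```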
Note
When running a job across multiple compute hosts, the --mem=<MB>
option informs the scheduler of the required memory per node
Running multi-threaded jobs¶
If you want to use multiple cores on a compute node to run a multi-threaded application, you need to inform the scheduler. Using multiple CPU cores is achieved by specifying the -n, --ntasks=<number> option in either your submission command or the scheduler directives in your job script.
The --ntasks
option informs the scheduler of the number of cores you wish to reserve for use. If the parameter is omitted, the default --ntasks=1
is assumed. You could specify the option -n 4
to request 4 CPU cores for your job.
Besides the number of tasks, you will need to add --nodes=1 to your scheduler command or at the top of your job script with #SBATCH --nodes=1; this sets the maximum number of nodes used to 1 and prevents the job from selecting cores across multiple nodes (multi-node jobs require MPI). A sketch combining these options is given after the notes below.
Note
If you request more cores than are available on a node in your cluster, the job will not run until a node capable of fulfilling your request becomes available. The scheduler will display the error in the output of the squeue
command.
Note
Just asking for more cores will not necessarily mean your code makes use of them. It is generally required to inform your application of how many cores to use. This can be done using the $SLURM_NTASKS variable in your submission script.
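Putting this together, a sketch of a multi-threaded job script; my_threaded_program and its --threads flag are placeholders for your own application:

```bash
#!/bin/bash -l
#SBATCH --output=/scratch/users/%u/%j.out
#SBATCH --nodes=1
#SBATCH --ntasks=4
# tell the application how many cores were actually allocated
my_threaded_program --threads "$SLURM_NTASKS"
```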
Running parallel (MPI) jobs¶
Note
It is important to note that applications will not necessarily support being run across multiple nodes. They must explicitly support MPI for this purpose, as demonstrated with the mpirun launcher in this example.
If you want to run parallel jobs via a message passing interface (MPI), you need to inform the scheduler - this allows jobs to be efficiently spread over compute nodes to get the best possible performance.
Using multiple CPU cores across multiple nodes is achieved by specifying the -N, --nodes=<minnodes[-maxnodes]>
option - which requests a minimum (and optional maximum) number of nodes to allocate to the submitted job.
If only the minnodes
count is specified - then this is used for both the minimum and maximum node count for the job.
You can request multiple cores over multiple nodes using a combination of scheduler directives, either in your job submission command or within your job script. The following examples demonstrate how you can obtain cores across different resources:
- --nodes=2 --ntasks=16: Requests 16 cores across 2 compute nodes
- --nodes=2: Requests all available cores of 2 compute nodes
- --ntasks=16: Requests 16 cores across any available compute nodes
For example, to use 30 CPU cores on the cluster for a single application, the instruction --ntasks=30 can be used. The following example uses the mpirun command to test MPI functionality across 30 CPU cores.
Sample code
Please copy the example code below and save it with file name "mpi_hello_world.c".
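A standard MPI hello-world program in C is suitable for this test; a sketch:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    // initialise the MPI environment
    MPI_Init(&argc, &argv);

    // find out how many ranks there are and which one we are
    int world_size, world_rank;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    // report the host this rank is running on
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    int name_len;
    MPI_Get_processor_name(processor_name, &name_len);

    printf("Hello world from processor %s, rank %d out of %d processors\n",
           processor_name, world_rank, world_size);

    MPI_Finalize();
    return 0;
}
```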
To compile the sample program:
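For example, using the mpicc compiler wrapper provided by the OpenMPI module (the module name is assumed; check module avail for the exact name):

```bash
module load openmpi
mpicc mpi_hello_world.c -o mpi_hello_world
```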
In this example the job is scheduled over two compute nodes. The jobscript loads the default OpenMPI
module which makes the mpirun command available.
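A sketch of such a job script, assuming the module is named openmpi:

```bash
#!/bin/bash -l
#SBATCH --output=/scratch/users/%u/%j.out
#SBATCH --nodes=2
#SBATCH --ntasks=30
module load openmpi   # assumed module name - check with `module avail`
mpirun -n "$SLURM_NTASKS" ./mpi_hello_world
```

Each of the 30 ranks should print a line reporting its rank and the compute node it ran on.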
Running GPU jobs¶
Lots of scientific software is starting to make use of Graphics Processing Units (GPUs) for computation instead of traditional CPU cores. This is because GPUs out-perform CPUs for certain mathematical operations.
If you wish to schedule your job on a GPU, you need to provide the --gres=gpu option in your submission script. The following example schedules a job on a GPU node and then lists the GPU card it was assigned.
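A minimal sketch; submit it with -p set to one of the _gpu partitions:

```bash
#!/bin/bash -l
#SBATCH --output=/scratch/users/%u/%j.out
#SBATCH --gres=gpu
# list the GPU card(s) assigned to this job
nvidia-smi -L
```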
Note
The maximum number of GPUs that can be requested is now 8 on public_gpu.
Note
Your GPU-enabled application will most likely make use of the NVIDIA CUDA libraries; to load the CUDA module, use module load cuda in your job submission script.
Testing on the GPU¶
Available alongside the shared public GPU queue, the interruptible_gpu partition gives access to all GPUs in CREATE. This larger pool serves both as a mechanism for testing GPU scheduling and as a way of making use of unused private resources, as detailed in our scheduler policy.
It is important to note, however, that although scheduling may be faster, jobs on this partition may be cancelled at any time.
Additionally, as the interruptible_gpu partition can be made up of a broad mix of GPU architectures, it may be useful to provide the --constraint scheduler option with your job submissions:
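For example (the constraint value is a placeholder for one of the GPU architecture features defined on the cluster):

```bash
#SBATCH --constraint=<gpu_architecture>
```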
Further documentation¶
This guide is a quick overview of some of the many available options of the SLURM cluster scheduler. For more information, you may wish to consult the following documentation for the SLURM commands demonstrated here:
- Use the man squeue command to see a full list of scheduler queue instructions
- Use the man sbatch or man srun commands to see a full list of scheduler submission instructions
- Online documentation for the SLURM scheduler is available here