
Running jobs on CREATE HPC

This page covers what you need to know to get your code running on the CREATE HPC cluster: quick start steps for beginners with simple jobs, configuration steps for GPU and MPI jobs, and general advice on using the scheduler that will help you get the most out of the system. Only trivial code examples are used here; specific scientific software can be accessed using modules or containers.

Quickstart

Prerequisites

These instructions assume the following to be true:

  • Your account has been enabled.
  • You have an ssh client installed on a computer with internet connectivity or on the KCL network.
  • You are able to successfully access the cluster login nodes.
  • You know how to create files with a Linux based text editor.

What is the CREATE HPC cluster?

The CREATE HPC cluster is a collection of highly spec'd compute nodes (computers) with a shared network and storage servers. More information about the hardware specifications of the nodes is available on the Compute Nodes page. The scheduler (Slurm) allows users of the cluster to submit jobs (software applications) to run on this pool of hardware, and for the underlying hardware to be efficiently allocated according to compute requirements, access policies and priorities based on the framework set out in the Scheduler Policy.

Loading software dependencies

The cluster makes use of Module Environments to provide a means for loading specific versions of scientific software or development tools on the cluster. When submitting jobs to the cluster you should load the software modules required for your program to run.
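For example, a job script can list what is available and load what it needs before running your program. The module name and Python script below are hypothetical; use module avail to see what is actually installed on CREATE.

# List the modules available on the cluster
module avail

# Load a specific version of a tool before running your program
# (the module name below is hypothetical - check module avail for real names)
module load python/3.11.4
python my_analysis.py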

Identify your partition

The scheduler is configured to group the compute nodes into a number of partitions in order to apply a sharing policy. The first task when submitting jobs to the cluster is to identify which partition you should use from the tables on Compute Nodes.

Based on which User Group you belong to in those tables, you must use the associated Partition Name with the -p option of the srun command in the following examples. The _gpu partitions should be used when submitting GPU jobs.
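If you want to check which partitions are visible from your account, the standard Slurm sinfo command summarises them. This is a minimal sketch; the exact partition names you see will depend on your access.

# Summarise the partitions visible to your account, one line per partition
sinfo -s

# Show the nodes and state of a single partition, e.g. the shared cpu partition
sinfo -p cpu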

Running an interactive job

Running an interactive job is the cluster equivalent of executing a command on a standard Linux command line. This means you will be able to provide input and read output (via the terminal) in real-time. This is often used as a means to test that code runs before submitting a batch job. You can start a new interactive job by using the srun command; the scheduler will search for an available compute node, and provide you with an interactive login shell on the node if one is available.

k1234567@erc-hpc-login1:~$ srun -p cpu --pty /bin/bash -l
k1234567@erc-hpc-comp-021:$ echo 'Hello, World!'
Hello, World!
k1234567@erc-hpc-comp-021:$ squeue -u k1234567
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              8178       cpu     bash k1234567  R       4:44      1 erc-hpc-comp-021
k1234567@erc-hpc-comp-021:$ exit
exit
k1234567@erc-hpc-login1:~$ squeue -u k1234567
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
k1234567@erc-hpc-login1:~$

In the above example, the srun command is used together with three options: -p cpu, --pty and /bin/bash. -p cpu specifies the shared cpu partition. The --pty option executes the task in pseudo terminal mode, allowing the session to act like a standard terminal session. /bin/bash is the command to be run, in this case the default Linux shell, bash. Once srun has allocated a node, a terminal session is acquired where you can run standard Linux commands (echo, cd, ls, etc.) on the allocated compute node (erc-hpc-comp-021 here). squeue -u <username> shows the details of the running interactive job and reports an empty list once we have exited bash.

Submitting a batch job

Batch (or non-interactive) jobs allow users to leverage one of the main benefits of having a cluster scheduler; jobs can be queued up with instructions on how to run them and then executed across the cluster while the user does something else. Users submit jobs as scripts, which include instructions on how to run the job - the output of the job (stdout and stderr in Linux terminology) is written to a file on disk for review later on. You can write a batch job that does anything that can be typed on the command-line.

We'll start with a basic example - the following script is written in bash. You can create the script yourself using your editor of choice. The script does nothing more than print some messages to the screen (the echo lines), and sleeps for 15 seconds. We've saved the script to a file called helloworld.sh - the .sh extension helps to remind us that this is a shell script, but adding a filename extension isn't strictly necessary for Linux.

  #!/bin/bash -l
  #SBATCH --output=/scratch/users/%u/%j.out
  echo "Hello, World! From $HOSTNAME"
  sleep 15
  echo "Goodbye, World! From $HOSTNAME"

Note

We use the -l option to bash on the first line of the script to request a login session. This ensures that Environment Modules can be loaded from your script.

Note

SBATCH --output=/scratch/users/%u/%j.out is specified to direct the output of the job to the fast scratch storage (ceph). We recommend using this configuration for all your jobs. If needed you can create sub-directories within your scratch storage area to keep things tidy between different projects.

We can execute that script directly on the login node using the command bash helloworld.sh, which gives the following output:

k1234567@erc-hpc-login1:~$ bash helloworld.sh
Hello, World! From erc-hpc-login1
Goodbye, World! From erc-hpc-login1

To submit your job script to the cluster job scheduler, use the command sbatch -p <partition> helloworld.sh, where <partition> is your partition name. The job scheduler should immediately report the job ID for your job; your job ID is a unique identifier which can be used when viewing or controlling queued jobs.

k1234567@erc-hpc-login1:~$ sbatch -p cpu helloworld.sh
Submitted batch job 8256
k1234567@erc-hpc-login1:~$ squeue -u k1234567
             JOBID  PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              8256        cpu hellowor k1234567  R       0:11      1 nodea01
k1234567@erc-hpc-login1:~$ squeue -u k1234567
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
k1234567@erc-hpc-login1:~$ ls -l /scratch/users/k1234567/
total 5
-rw-r--r-- 1 k1234567 clusterusers 112 Oct 17 17:08 8244.out
-rw-r--r-- 1 k1234567 clusterusers 112 Oct 17 17:21 8256.out
k1234567@erc-hpc-login1:~$ cat /scratch/users/k1234567/8256.out
Hello, World! From erc-hpc-comp-018
Goodbye, World! From erc-hpc-comp-018

Default resources

The job launched above didn't make any explicit requests for resources (e.g. CPU cores, memory) or specify a runtime and so inherited the cluster defaults. If more resources are needed (this is HPC after all) the batch job should include instructions to request more resources. It is important to remember that if you exhaust the resource limits (e.g. runtime or memory) your job will be killed.

Viewing and controlling queued jobs

Once your job has been submitted, use the squeue command to view the status of the job queue (adding -u <your_username> to see only your jobs). If there are available compute nodes with the resources you've requested, your job should be shown in the R (running) state; if not, your job may be shown in the PD (pending) state until resources are available to run it. If a job is in the PD state, the reason it cannot run will be displayed in the NODELIST(REASON) column of the squeue output.

k1234567@erc-hpc-login1:~$ squeue
             JOBID  PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              8306        cpu 2D_36_si k1802890 PD       0:00      5 (Resources)
              8307        cpu 2D_36_si k1802890 PD       0:00      5 (Priority)
              8291        cpu  tvmc.sh k1623514  R    3:19:54      5 nodek[03-07]
              8312        cpu test.slu k1898460  R    2:00:38      1 nodek54

You can use the scancel <jobid> command to delete a job you've submitted, whether it's running or still in the queued state.

k1234567@erc-hpc-login1:~$ squeue -u k1234567
             JOBID  PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              8327        cpu hellowor k1234567  R       0:00      1 nodea01
              8325        cpu hellowor k1234567  R       0:03      1 nodea01
              8326        cpu hellowor k1234567  R       0:03      1 nodea01
k1234567@erc-hpc-login1:~$ scancel 8325
k1234567@erc-hpc-login1:~$ squeue -u k1234567
             JOBID  PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              8325        cpu hellowor k1234567 CG       0:11      1 nodea01
              8327        cpu hellowor k1234567  R       0:09      1 nodea01
              8326        cpu hellowor k1234567  R       0:12      1 nodea01

Scheduler instructions

Job instructions can be provided in two ways; they are:

  1. On the command line, as parameters to your sbatch or srun command. For example, you can set the name of your job using the --job-name=[name] | -J [name] option:
k1234567@erc-hpc-login1:~$ sbatch -p cpu --job-name hello helloworld.sh
Submitted batch job 8333
k1234567@erc-hpc-login1:~$ squeue -u k1234567
            JOBID  PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             8333        cpu    hello k1234567  R       0:09      1 nodea01
  2. In your job script, by including scheduler directives at the top of the script - you can achieve the same effect as providing options with the sbatch or srun commands. To add the --job-name to our previous example:
#!/bin/bash -l
#SBATCH --output=/scratch/users/%u/%j.out
#SBATCH --job-name=hello
echo "Hello, World! From $HOSTNAME"
sleep 15
echo "Goodbye, World! From $HOSTNAME"

Including job scheduler instructions in your job-scripts is often the most convenient method of working for batch jobs - follow the guidelines below for the best experience:

  • Lines in your script that include job-scheduler directives must start with #SBATCH at the beginning of the line
  • You can have multiple lines starting with #SBATCH in your job-script
  • You can put multiple instructions separated by a space on a single line starting with #SBATCH
  • The scheduler will parse the script from top to bottom and set instructions in order; if you set the same parameter twice, the second value will be used.
  • Instructions are parsed at job submission time, before the job itself has actually run.
    • This means you can't, for example, tell the scheduler to put your job output in a directory that you create in the job-script itself - the directory will not exist when the job starts running, and your job will fail with an error.
  • You can use dynamic variables in your instructions (see below)

Dynamic scheduler variables

When writing submission scripts you will often need to reference values set by the scheduler (e.g. the job ID), values inherited from the OS (e.g. your username), or values you set elsewhere in your script (e.g. ntasks in the SBATCH directives). For this purpose a number of dynamic variables are made available; a short example follows the list. In the list below, variables starting with % can be referenced in the SBATCH directives and those starting with $ from the body of the shell script.

  • %u / $USER The Linux username of the submitting user
  • %a / $SLURM_ARRAY_TASK_ID Job array ID (index) number
  • %A / $SLURM_ARRAY_JOB_ID Job allocation number for an array job
  • %j / $SLURM_JOB_ID Job allocation number
  • %x / $SLURM_JOB_NAME Job name
  • $SLURM_NTASKS Number of CPU cores requested with -n, --ntasks; this can be passed to your code so that it makes use of the allocated CPU cores
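As a small illustration of the variables above, the following sketch names its output file with %u, %x and %j in the directives and reads $SLURM_JOB_ID, $SLURM_JOB_NAME and $SLURM_NTASKS from the script body (my_program is a hypothetical application):

#!/bin/bash -l
#SBATCH --job-name=vars-demo
#SBATCH --ntasks=4
#SBATCH --output=/scratch/users/%u/%x.%j.out
echo "Job $SLURM_JOB_ID ($SLURM_JOB_NAME) was allocated $SLURM_NTASKS cores"
# Pass the allocated core count on to your own program (my_program is hypothetical)
./my_program --threads "$SLURM_NTASKS"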

Simple scheduler instruction examples

Here are some commonly used scheduler instructions, along with some examples of their usage:

Setting output file location

To set the output file location for your job, use the -o [file_name] | --output=[file_name] option - both standard-out and standard-error from your job-script, including any output generated by applications it launches, will be saved in the filename you specify.

By default, the scheduler stores data relative to your home-directory - but to avoid confusion, we recommend specifying a full path to the filename to be used. Although Linux can support several jobs writing to the same output file, the result is likely to be garbled - it's common practice to include something unique about the job (e.g. its job ID) in the output filename to make sure your job's output is clear and easy to read.

Note

The directory used to store your job output file must exist and be writable by your user before you submit your job to the scheduler. Your job may fail to run if the scheduler cannot create the output file in the directory requested.
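For example, to keep a project's output in its own sub-directory of scratch, create the directory before submitting (the project name here is purely illustrative) and reference it in the --output directive:

# Create the output directory ahead of submission; -p does nothing if it already exists
mkdir -p /scratch/users/$USER/my_project

# The matching directive in the job script would then be:
#   #SBATCH --output=/scratch/users/%u/my_project/%j.out
sbatch -p cpu helloworld.sh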

The following example uses the --output=[file_name] instruction to set the output file location:

  #!/bin/bash -l
  #SBATCH --output=/scratch/users/%u/%j.out
  echo "Hello, World! From $HOSTNAME"
  sleep 15
  echo "Goodbye, World! From $HOSTNAME"

Note

SBATCH --output=/scratch/users/%u/%j.out is specified to direct the output of the job to the fast scratch storage (ceph). We recommend using this configuration for all your jobs. If needed you can create sub-directories within your scratch storage area to keep things tidy between different projects.

Setting working directory for your job

By default, jobs are executed from your home-directory on the cluster (i.e. /home/<your-user-name>, $HOME or ~). You can include cd commands in your job-script to change to different directories; alternatively, you can provide an instruction to the scheduler to change to a different directory to run your job. The available options are:

  • -D | --chdir=[directory path] - instruct the job scheduler to move into the directory specified before starting to run the job on a compute node

For example, to change your working directory in a batch job:

#SBATCH --chdir=/scratch/k1234567/research/ai/

Note

The directory specified must exist and be accessible by the compute node in order for the job you submitted to run.

Waiting for a previous job before running

You can instruct the scheduler to wait for an existing job to finish before starting to run the job you are submitting with the -d [state:job_id] | --depend=[state:job_id] option.
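For example, assuming two job scripts of your own (the names below are hypothetical), a post-processing job can be held until the first job completes successfully using the afterok dependency type:

# Submit the first job and note the job ID that sbatch reports
sbatch -p cpu simulation.sh

# Submit the follow-up job so it only starts once the first has completed successfully
# (replace <job_id> with the ID reported for simulation.sh)
sbatch -p cpu -d afterok:<job_id> postprocess.sh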

Running task array jobs

A common workload is having a large number of jobs to run which basically do the same thing, aside perhaps from having different input data. You could generate a job-script for each of them and submit it, but that's not very convenient - especially if you have many hundreds or thousands of tasks to complete. Such jobs are known as task arrays - an embarrassingly parallel job will often fit into this category.

A convenient way to run such jobs on a cluster is to use a task array, using the -a [array_spec] | --array=[array_spec] directive. Your job-script can then use the pseudo environment variables created by the scheduler to refer to data used by each task in the job. The following job-script uses the $SLURM_ARRAY_TASK_ID/%a variable to echo its current task ID to an output file:

  #!/bin/bash -l
  #SBATCH --job-name=array
  #SBATCH --output=output.array.%A.%a
  #SBATCH --array=1-1000
  echo "I am $SLURM_ARRAY_TASK_ID from job $SLURM_ARRAY_JOB_ID"

All tasks in an array job are given a job ID with the format [job_ID]_[task_number] e.g. 77_81 would be job number 77, array task 81.
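A common pattern, sketched below with hypothetical input file names and program, is to use $SLURM_ARRAY_TASK_ID to select which input each task processes; the optional %50 suffix limits how many tasks run at the same time:

#!/bin/bash -l
#SBATCH --job-name=array-inputs
#SBATCH --output=/scratch/users/%u/%A_%a.out
#SBATCH --array=1-100%50
# Each task picks a different input file, e.g. input_1.dat ... input_100.dat
# (the file names and my_program are hypothetical)
INPUT=/scratch/users/$USER/inputs/input_${SLURM_ARRAY_TASK_ID}.dat
./my_program "$INPUT"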

Array jobs can easily be cancelled using the scancel command - the following examples show various levels of control over an array job:

scancel 77 Cancels all array tasks under the job ID 77

scancel 77_[100-200] Cancels array tasks 100-200 under the job ID 77

scancel 77_5 Cancels array task 5 under the job ID 77

Requesting more resources

By default, jobs are constrained to the cluster defaults (see table below) - users can use scheduler instructions to request more resources for their jobs as needed. The following documentation shows how these requests can be made.

CPU cores    Memory    Runtime
1 core       1GB       24 hours

In order to promote best-use of the cluster scheduler, particularly in a shared environment, it is recommended that you inform the scheduler of the amount of time, memory and CPU cores your job is expected to need. This helps the scheduler appropriately place jobs on the available nodes in the cluster and should minimise any time spent queuing for resources to become available (to this end you should always request the minimal amount of resources you require to run).
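Putting these together, a minimal sketch of a job script that declares runtime, memory and CPU cores (the values and the program are illustrative):

#!/bin/bash -l
#SBATCH --job-name=sized-job
#SBATCH --output=/scratch/users/%u/%j.out
# Request 4 hours of runtime, 4GB of memory and 2 CPU cores on a single node
#SBATCH --time=0-4:00
#SBATCH --mem=4G
#SBATCH --nodes=1
#SBATCH --ntasks=2
# my_program is a hypothetical application
./my_program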

Requesting a longer runtime

Note

48 hours is the maximum runtime. If you have a job which requires longer and cannot be checkpointed, please email support@er.kcl.ac.uk to discuss your requirements.

You can inform the cluster scheduler of the expected runtime using the -t, --time=<time> option. For example - to submit a job that runs for 2 hours, the following example job script could be used:

  #!/bin/bash -l
  #SBATCH --job-name=sleep
  #SBATCH --time=0-2:00
  sleep 7200

You can then see any time limits assigned to running jobs using the command squeue --long:

k1234567@erc-hpc-login1:~$ squeue --long -u k1234567
Fri Oct 18 16:21:39 2019
             JOBID  PARTITION     NAME     USER    STATE       TIME TIME_LIMI  NODES NODELIST(REASON)
             11439        cpu sleep.sh k1234567  RUNNING       0:29   2:00:00      1 nodea01

Requesting more memory

You can specify the maximum amount of memory required per submitted job with the --mem=<MB> option. This informs the scheduler of the memory required for the submitted job. Optionally - you can also request an amount of memory per CPU core rather than a total amount of memory required per job. To specify an amount of memory to allocate per core, use the --mem-per-cpu=<MB> option.

Note

When running a job across multiple compute hosts, the --mem=<MB> option informs the scheduler of the required memory per node
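For example, a sketch of a single-core job that needs 8GB of memory (the program is hypothetical; --mem accepts a plain number of megabytes or a size suffix such as G):

#!/bin/bash -l
#SBATCH --job-name=bigmem
#SBATCH --output=/scratch/users/%u/%j.out
#SBATCH --mem=8G
# Alternatively, request memory per core rather than per job/node:
#   #SBATCH --mem-per-cpu=2G
# my_memory_hungry_program is a hypothetical application
./my_memory_hungry_program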

Running multi-threaded jobs

If you want to use multiple cores on a compute node to run a multi-threaded application, you need to inform the scheduler. Using multiple CPU cores is achieved by specifying the -n, --ntasks=<number> option in either your submission command or the scheduler directives in your job script. The --ntasks option informs the scheduler of the number of cores you wish to reserve; if the parameter is omitted, the default --ntasks=1 is assumed. You could specify the option -n 4 to request 4 CPU cores for your job. Besides the number of tasks, you will need to add --nodes=1 to your scheduler command or at the top of your job script with #SBATCH --nodes=1; this sets the maximum number of nodes to 1 and prevents the job from selecting cores on multiple nodes (multi-node jobs require MPI).

Note

If you request more cores than are available on a node in your cluster, the job will not run until a node capable of fulfilling your request becomes available. The scheduler will display the reason in the NODELIST(REASON) column of the squeue output.

Note

Just asking for more cores will not necessarily mean your code makes use of them. It is generally required to inform your application of how many cores to use. This can be done using the $SLURM_NTASKS variable in your submission script.
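Putting this together, a minimal multi-threaded sketch (the application is hypothetical; OpenMP programs read the OMP_NUM_THREADS variable, other applications may take a command-line flag instead):

#!/bin/bash -l
#SBATCH --job-name=threads
#SBATCH --output=/scratch/users/%u/%j.out
#SBATCH --nodes=1
#SBATCH --ntasks=4
# Tell the application how many cores it has been allocated
export OMP_NUM_THREADS=$SLURM_NTASKS
# my_threaded_program is a hypothetical application
./my_threaded_program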

Running parallel (MPI) jobs

Note

It is important to note that applications will not necessarily support being run across multiple nodes; they must explicitly support MPI, as demonstrated with the mpirun launcher in this example.

If you want to run parallel jobs via the Message Passing Interface (MPI), you need to inform the scheduler - this allows jobs to be efficiently spread over compute nodes to get the best possible performance. Using multiple CPU cores across multiple nodes is achieved by specifying the -N, --nodes=<minnodes[-maxnodes]> option - which requests a minimum (and optional maximum) number of nodes to allocate to the submitted job. If only the minnodes count is specified, then this is used for both the minimum and maximum node count for the job.

You can request multiple cores over multiple nodes using a combination of scheduler directives, either in your job submission command or within your job script. The following examples demonstrate how you can obtain cores across different resources:

  • --nodes=2 --ntasks=16 Requests 16 cores across 2 compute nodes
  • --nodes=2 Requests all available cores of 2 compute nodes
  • --ntasks=16 Requests 16 cores across any available compute nodes

For example, to use 30 CPU cores on the cluster for a single application, the instruction --ntasks=30 can be used. The following example uses the mpirun command to test MPI functionality across 30 CPU cores.

Sample code

Please copy the example code below and save it with file name "mpi_hello_world.c".

#include <stdio.h>
#include <mpi.h>

int main ( int argc, char *argv[] )
{
   int rank,size;

   MPI_Init(&argc,&argv);
   MPI_Comm_size(MPI_COMM_WORLD,&size);
   MPI_Comm_rank(MPI_COMM_WORLD,&rank);

   printf("Hello world from processor %d out of %d.\n",rank,size);

   MPI_Finalize();

   return 0;
}

To compile the sample program (the mpicc wrapper is provided by the OpenMPI module, so you may need to run module load openmpi on the login node first):

mpicc -o mpi_hello_world_c mpi_hello_world.c

In this example the job is scheduled over two compute nodes. The jobscript loads the default OpenMPI module, which makes the mpirun command available. The terminal session that follows the script shows the job being submitted and the combined output.

#!/bin/bash -l
#SBATCH -n 30
#SBATCH --job-name=openmpi
#SBATCH --output=openmpi.out.%j
module load openmpi
mpirun -n 30 mpi_hello_world_c
k1234567@erc-hpc-login1:~$ sbatch -p cpu openmpi.sh
Submitted batch job 11449
k1234567@erc-hpc-login1:~$ squeue -u k1234567
             JOBID  PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             11449        cpu  openmpi k1234567  R       0:02      2 nodea[13-14]
k1234567@erc-hpc-login1:~$ cat openmpi.out.11449
Hello world from processor 23 out of 30.
Hello world from processor 25 out of 30.
Hello world from processor 13 out of 30.
Hello world from processor 15 out of 30.
Hello world from processor 7 out of 30.
Hello world from processor 9 out of 30.
Hello world from processor 10 out of 30.
Hello world from processor 19 out of 30.
Hello world from processor 21 out of 30.
Hello world from processor 29 out of 30.
Hello world from processor 17 out of 30.
Hello world from processor 18 out of 30.
Hello world from processor 22 out of 30.
Hello world from processor 2 out of 30.
Hello world from processor 3 out of 30.
Hello world from processor 5 out of 30.
Hello world from processor 28 out of 30.
Hello world from processor 1 out of 30.
Hello world from processor 11 out of 30.
Hello world from processor 16 out of 30.
Hello world from processor 14 out of 30.
Hello world from processor 24 out of 30.
Hello world from processor 26 out of 30.
Hello world from processor 27 out of 30.
Hello world from processor 0 out of 30.
Hello world from processor 4 out of 30.
Hello world from processor 6 out of 30.
Hello world from processor 8 out of 30.
Hello world from processor 12 out of 30.
Hello world from processor 20 out of 30.

Running GPU jobs

Lots of scientific software is starting to make use of Graphics Processing Units (GPUs) for computation instead of traditional CPU cores, because GPUs out-perform CPUs for certain mathematical operations. If you wish to schedule your job on a GPU you need to provide the --gres=gpu option in your submission script. The following example schedules a job on a GPU node and then lists the GPU cards visible on the node it was assigned; the terminal session after the script shows the submission and its output.

#!/bin/bash -l
#SBATCH --output=/scratch/users/%u/%j.out
#SBATCH --job-name=gpu
#SBATCH --gres=gpu
echo "Hello, World! From $HOSTNAME"
nvidia-debugdump -l
sleep 15
echo "Goodbye, World! From $HOSTNAME"
k1234567@erc-hpc-login1:~$ sbatch -p gpu hellogpu.sh
Submitted batch job 12087
k1234567@erc-hpc-login1:~$ cat /scratch/users/k1234567/12087.out
Hello, World! From erc-hpc-comp-032
Found 2 NVIDIA devices
  Device ID:              0
  Device name:            NVIDIA A100-PCIE-40GB
  GPU internal ID:        1324120038830

  Device ID:              1
  Device name:            NVIDIA A100-PCIE-40GB
  GPU internal ID:        1324120040003

Goodbye, World! From erc-hpc-comp-032

Note

The maximum number of GPUs that can be requested on the public_gpu partition is now 8.

Note

Your GPU-enabled application will most likely make use of the NVIDIA CUDA libraries; to load the CUDA module, add module load cuda to your job submission script.
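For example, a sketch of a job script that requests two GPUs and loads CUDA before running a hypothetical GPU-enabled application:

#!/bin/bash -l
#SBATCH --output=/scratch/users/%u/%j.out
#SBATCH --job-name=gpu-app
#SBATCH --gres=gpu:2
module load cuda
# my_cuda_app is a hypothetical GPU-enabled application
./my_cuda_app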

Testing on the GPU

Alongside the shared public gpu queue, the interruptible_gpu partition gives access to all GPUs in CREATE. This larger pool is useful both for testing GPU scheduling and for making use of otherwise idle private resources, as detailed in our scheduler policy. Note, however, that although jobs may be scheduled sooner, they can be cancelled at any time. Additionally, because interruptible_gpu is made up of a broad mix of GPU architectures, it may be useful to provide the --constraint scheduler option with your job submissions:

k1234567@erc-hpc-login1:~$ srun -p interruptible_gpu --gres gpu --constraint a40 --pty /bin/bash -l

Further documentation

This guide is a quick overview of some of the many available options of the SLURM cluster scheduler. For more information on the available options, you may wish to consult the following documentation for the demonstrated SLURM commands:

  • Use the man squeue command to see a full list of scheduler queue instructions
  • Use the man sbatch and man srun commands to see a full list of scheduler submission instructions
  • Online documentation for the SLURM scheduler is available here