Running jobs on CREATE HPC¶
This page covers what you need to know to get your code running on the CREATE HPC cluster: quick-start steps for beginners with simple jobs, configuration steps for GPU and MPI jobs, and general advice on using the scheduler that will help you get the most out of the system. Only trivial code examples are used here; specific scientific software can be accessed using modules or containers.
These instructions assume the following to be true:
- Your account has been enabled.
- You have an ssh client installed on a computer with internet connectivity or on the KCL network.
- You are able to successfully access the cluster login nodes.
- You know how to create files with a Linux based text editor.
What is the CREATE HPC cluster?¶
The CREATE HPC cluster is a collection of highly spec'd compute nodes (computers) with a shared network and storage servers. More information about the hardware specifications of the nodes is available on the Compute Nodes page. The scheduler (Slurm) allows users of the cluster to submit jobs (software applications) to run on this pool of hardware, and for the underlying hardware to be efficiently allocated according to compute requirements, access policies and priorities based on the framework set out in the Scheduler Policy.
Loading software dependencies¶
The cluster makes use of Module Environments to provide a means for loading specific versions of scientific software or development tools on the cluster. When submitting jobs to the cluster you should load the software modules required for your program to run.
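For example, you can list the installed modules and load one before running your program (the module name below is illustrative; run module avail to see what is actually installed on CREATE):

```
$ module avail
$ module load openmpi
```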
Identify your partition¶
The scheduler is configured to group the compute nodes into a number of partitions in order to apply a sharing policy. The first task when submitting jobs to the cluster is to identify which partition you should use from the tables on Compute Nodes.
Based on which User Group you belong to in those tables, you must use the associated Partition Name with the -p option of the srun command in the following examples. The _gpu partitions should be used when submitting GPU jobs.
Running an interactive job¶
Running an interactive job is the cluster equivalent of executing a command on a standard Linux command line.
This means you will be able to provide input and read output (via the terminal) in real-time.
This is often used as a means to test that code runs before submitting a batch job.
You can start a new interactive job by using the
srun command; the scheduler will search for an available compute node, and provide you with an interactive login shell on the node if one is available.
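A session might look like the following sketch (the node name, prompts and job ID are illustrative, not actual output):

```
$ srun -p cpu --pty /bin/bash
[user@erc-hpc-comp-021 ~]$ hostname
erc-hpc-comp-021
[user@erc-hpc-comp-021 ~]$ squeue -u <username>
  JOBID PARTITION     NAME     USER ST   TIME  NODES NODELIST(REASON)
     77       cpu     bash <userna>  R   0:05      1 erc-hpc-comp-021
[user@erc-hpc-comp-021 ~]$ exit
$ squeue -u <username>
  JOBID PARTITION     NAME     USER ST   TIME  NODES NODELIST(REASON)
```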
In the above example, the
srun command is used together with three options:
-p cpu specifies the shared partition.
--pty option executes the task in pseudo terminal mode, allowing the session to act like a standard terminal session.
/bin/bash option is the command to be run, in this case the default Linux shell, bash.
Once srun is run, a terminal session is acquired where you can run standard Linux commands (echo, cd, ls, etc.) on the allocated compute node (erc-hpc-comp-021 here).
squeue -u <username> shows the details of the running interactive job and reports an empty list once we have exited bash.
Submitting a batch job¶
Batch (or non-interactive) jobs allow users to leverage one of the main benefits of having a cluster scheduler; jobs can be queued up with instructions on how to run them and then executed across the cluster while the user does something else. Users submit jobs as scripts, which include instructions on how to run the job - the output of the job (stdout and stderr in Linux terminology) is written to a file on disk for review later on. You can write a batch job that does anything that can be typed on the command-line.
We'll start with a basic example - the following script is written in bash. You can create the script yourself using your editor of choice. The script does nothing more than print some messages to the screen (the echo lines), and sleeps for 15 seconds.
We've saved the script to a file called
helloworld.sh - the
.sh extension helps to remind us that this is a shell script, but adding a filename extension isn't strictly necessary for Linux.
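A minimal version of the script might look like this (the exact echo messages are illustrative):

```bash
#!/bin/bash -l
#SBATCH --output=/scratch/users/%u/%j.out
echo "Starting job on $(hostname)"
sleep 15
echo "Finished job"
```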
We use the
-l option to bash on the first line of the script to request a login session. This ensures that Environment Modules can be loaded from your script.
#SBATCH --output=/scratch/users/%u/%j.out is specified to direct the output of the job to the fast scratch storage (ceph). We recommend using this configuration for all your jobs.
If needed you can create sub-directories within your scratch storage area to keep things tidy between different projects.
We can execute the script directly on the login node using the command bash helloworld.sh, which gives the following output:
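With echo messages like those described above, the output would be along these lines (messages and hostname are illustrative):

```
Starting job on erc-hpc-login1
Finished job
```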
To submit your job script to the cluster job scheduler, use the command sbatch -p <partition> helloworld.sh, where <partition> is your partition name.
The job scheduler should immediately report the job ID for your job; your job ID is a unique identifier which can be used when viewing or controlling queued jobs.
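A submission might look like this sketch (job ID, partition and node names are illustrative):

```
$ sbatch -p cpu helloworld.sh
Submitted batch job 77
$ squeue -u <username>
  JOBID PARTITION     NAME     USER ST   TIME  NODES NODELIST(REASON)
     77       cpu hellowor <userna>  R   0:03      1 erc-hpc-comp-021
```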
The job launched above didn't make any explicit requests for resources (e.g. CPU cores, memory) or specify a runtime and so inherited the cluster defaults. If more resources are needed (this is HPC after all) the batch job should include instructions to request more resources. It is important to remember that if you exhaust the resource limits (e.g. runtime or memory) your job will be killed.
Viewing and controlling queued jobs¶
Once your job has been submitted, use the squeue command to view the status of the job queue (add -u <your_username> to see only your jobs).
If there are available compute nodes with the resources you've requested, your job should be shown in the R (running) state; if not, your job may be shown in the PD (pending) state until resources are available to run it.
If a job is in the PD state, the reason it is unable to run will be displayed in the NODELIST(REASON) column of the squeue output:
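For example, the pending job in this illustrative listing is waiting for resources to become free:

```
$ squeue -u <username>
  JOBID PARTITION     NAME     USER ST   TIME  NODES NODELIST(REASON)
     78       cpu hellowor <userna> PD   0:00      1 (Resources)
```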
You can use the
scancel <jobid> command to delete a job you've submitted, whether it's running or still in the queued state.
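For example (job ID and listing are illustrative):

```
$ squeue -u <username>
  JOBID PARTITION     NAME     USER ST   TIME  NODES NODELIST(REASON)
     77       cpu hellowor <userna>  R   0:10      1 erc-hpc-comp-021
$ scancel 77
$ squeue -u <username>
  JOBID PARTITION     NAME     USER ST   TIME  NODES NODELIST(REASON)
```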
Job instructions can be provided in two ways:
- On the command line, as parameters to your srun command. For example, you can set the name of your job using the --job-name=[name] | -J [name] option:
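For example, naming the job at submission time (the job name and ID are illustrative):

```
$ sbatch -p cpu --job-name=hello helloworld.sh
Submitted batch job 79
```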
- In your job script, by including scheduler directives at the top of the script; this achieves the same effect as providing options with the srun command. To add the --job-name option to our previous example:
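A sketch of the previous script with the job name set by a directive (echo messages are illustrative):

```bash
#!/bin/bash -l
#SBATCH --job-name=hello
#SBATCH --output=/scratch/users/%u/%j.out
echo "Starting job on $(hostname)"
sleep 15
echo "Finished job"
```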
Including job scheduler instructions in your job-scripts is often the most convenient method of working for batch jobs - follow the guidelines below for the best experience:
- Lines in your script that include job-scheduler directives must start with #SBATCH at the beginning of the line
- You can have multiple lines starting with #SBATCH in your job-script
- You can put multiple instructions separated by a space on a single line starting with #SBATCH
- The scheduler will parse the script from top to bottom and set instructions in order; if you set the same parameter twice, the second value will be used.
- Instructions are parsed at job submission time, before the job itself has actually run.
- This means you can't, for example, tell the scheduler to put your job output in a directory that you create in the job-script itself - the directory will not exist when the job starts running, and your job will fail with an error.
- You can use dynamic variables in your instructions (see below)
Dynamic scheduler variables¶
When writing submission scripts you will often need to reference values set by the scheduler (e.g. jobid), values inherited from the OS (e.g. username), or values you set elsewhere in your script (e.g. ntasks in the SBATCH directives).
For this purpose a number of dynamic variables are made available. In the list below, variables starting with % can be referenced in the SBATCH directives, and those starting with $ from the body of the shell script.
- %u / $USER: the Linux username of the submitting user
- %a / $SLURM_ARRAY_TASK_ID: job array ID (index) number
- %A / $SLURM_ARRAY_JOB_ID: job allocation number for an array job
- %j / $SLURM_JOB_ID: job allocation number
- %x / $SLURM_JOB_NAME: job name
- $SLURM_NTASKS: number of CPU cores requested with -n, --ntasks; this can be passed to your code to make use of the allocated CPU cores
Simple scheduler instruction examples¶
Here are some commonly used scheduler instructions, along with some examples of their usage:
Setting output file location¶
To set the output file location for your job, use the
-o [file_name] | --output=[file_name] option - both standard-out and standard-error from your job-script, including any output generated by applications launched by your job-script will be saved in the filename you specify.
By default, the scheduler stores data relative to your home-directory - but to avoid confusion, we recommend specifying a full path to the filename to be used. Although Linux can support several jobs writing to the same output file, the result is likely to be garbled - it's common practice to include something unique about the job (e.g. its job ID) in the output filename to make sure your job's output is clear and easy to read.
The directory used to store your job output file must exist and be writable by your user before you submit your job to the scheduler. Your job may fail to run if the scheduler cannot create the output file in the directory requested.
The following example uses the
--output=[file_name] instruction to set the output file location:
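A sketch of such a script (the echo message is illustrative):

```bash
#!/bin/bash -l
#SBATCH --output=/scratch/users/%u/%j.out
echo "This output is written to the scratch output file"
```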
#SBATCH --output=/scratch/users/%u/%j.out is specified to direct the output of the job to the fast scratch storage (ceph).
We recommend using this configuration for all your jobs. If needed you can create sub-directories within your scratch storage area to keep things tidy between different projects.
Setting working directory for your job¶
By default, jobs are executed from your home-directory on the cluster (i.e. $HOME). You can include cd commands in your job-script to change to different directories; alternatively, you can provide an instruction to the scheduler to change to a different directory to run your job. The available options are:
-D | --chdir=[directory path]- instruct the job scheduler to move into the directory specified before starting to run the job on a compute node
For example, to change your working directory in a batch job:
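A sketch of a job script using --chdir (the directory path is illustrative):

```bash
#!/bin/bash -l
#SBATCH --chdir=/scratch/users/<username>/myproject
#SBATCH --output=/scratch/users/%u/%j.out
pwd
```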
The directory specified must exist and be accessible by the compute node in order for the job you submitted to run.
Waiting for a previous job before running¶
You can instruct the scheduler to wait for an existing job to finish before starting to run the job you are submitting with the
-d [state:job_id] | --depend=[state:job_id] option.
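For example, to start a job only once job 12345 has completed successfully (the job ID and script name are illustrative; afterok is one of several dependency types):

```
$ sbatch --depend=afterok:12345 next_step.sh
```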
Running task array jobs¶
A common workload is having a large number of jobs to run which basically do the same thing, aside perhaps from having different input data. You could generate a job-script for each of them and submit it, but that's not very convenient - especially if you have many hundreds or thousands of tasks to complete. Such jobs are known as task arrays - an embarrassingly parallel job will often fit into this category.
A convenient way to run such jobs on a cluster is to use a task array, using the
-a [array_spec] | --array=[array_spec] directive.
Your job-script can then use the pseudo environment variables created by the scheduler to refer to data used by each task in the job. The following job-script uses the
%a variable to echo its current task ID to an output file:
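A sketch of such an array job script (the array range is illustrative; the output path follows the earlier scratch-storage recommendation):

```bash
#!/bin/bash -l
#SBATCH --array=1-10
#SBATCH --output=/scratch/users/%u/%A_%a.out
echo "I am task $SLURM_ARRAY_TASK_ID"
```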
All tasks in an array job are given a job ID with the format <job_id>_<task_id>; for example, 77_81 would be job number 77, array task 81.
Array jobs can easily be cancelled using the scancel command; the following examples show various levels of control over an array job:
- scancel <jobid>: cancels all array tasks under the job ID
- scancel <jobid>_[100-200]: cancels array tasks 100-200 under the job ID
- scancel <jobid>_5: cancels array task 5 under the job ID
Requesting more resources¶
By default, jobs are constrained to the cluster defaults (see table below) - users can use scheduler instructions to request more resources for their jobs as needed. The following documentation shows how these requests can be made.
| CPU cores | Memory | Runtime |
| --- | --- | --- |
| 1 core | 1GB | 24 hours |
In order to promote best-use of the cluster scheduler, particularly in a shared environment, it is recommended that you inform the scheduler of the amount of time, memory and CPU cores your job is expected to need. This helps the scheduler appropriately place jobs on the available nodes in the cluster and should minimise any time spent queuing for resources to become available (to this end you should always request the minimal amount of resources you require to run).
Requesting a longer runtime¶
48 hours is the maximum runtime. If you have a job which requires longer and cannot be checkpointed, please email firstname.lastname@example.org to discuss your requirements.
You can inform the cluster scheduler of the expected runtime using the
-t, --time=<time> option. For example - to submit a job that runs for 2 hours, the following example job script could be used:
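A sketch of such a script (the echo message is illustrative):

```bash
#!/bin/bash -l
#SBATCH --time=02:00:00
#SBATCH --output=/scratch/users/%u/%j.out
echo "Starting a job with a 2 hour time limit"
```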
You can then see any time limits assigned to running jobs using the squeue -l command, whose long-format output includes a time limit column:
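Illustrative output (columns abridged):

```
$ squeue -l -u <username>
  JOBID PARTITION     NAME     USER    STATE       TIME TIME_LIMI  NODES NODELIST(REASON)
     80       cpu twohours <userna>  RUNNING       0:07   2:00:00      1 erc-hpc-comp-021
```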
Requesting more memory¶
You can specify the maximum amount of memory required per submitted job with the --mem=<MB> option.
This informs the scheduler of the memory required for the submitted job. Optionally - you can also request an amount of memory per CPU core rather than a total amount of memory required per job.
To specify an amount of memory to allocate per core, use the --mem-per-cpu=<MB> option.
When running a job across multiple compute hosts, the --mem=<MB> option informs the scheduler of the required memory per node.
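For example, a sketch requesting 4096MB (4GB) for the whole job:

```bash
#!/bin/bash -l
#SBATCH --mem=4096
#SBATCH --output=/scratch/users/%u/%j.out
echo "Job with 4GB of memory requested"
```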
Running multi-threaded jobs¶
If you want to use multiple cores on a compute node to run a multi-threaded application, you need to inform the scheduler. Using multiple CPU cores is achieved by specifying the
-n, --ntasks=<number> option in either your submission command or the scheduler directives in your job script.
The --ntasks option informs the scheduler of the number of cores you wish to reserve for use. If the parameter is omitted, the default --ntasks=1 is assumed. For example, you could specify -n 4 to request 4 CPU cores for your job.
Besides the number of tasks, you will need to add --nodes=1 to your scheduler command, or #SBATCH --nodes=1 at the top of your job script. This sets the maximum number of nodes used to 1 and prevents the job selecting cores from multiple nodes (multi-node jobs require MPI).
If you request more cores than are available on a node in your cluster, the job will not run until a node capable of fulfilling your request becomes available; the reason will be shown in the output of the squeue command.
Just asking for more cores will not necessarily mean your code makes use of them. It is generally required to inform your application of how many cores to use. This can be done using the $SLURM_NTASKS variable in your submission script.
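A sketch of a multi-threaded job script; the application name and its threads flag are hypothetical, substitute those of your own application:

```bash
#!/bin/bash -l
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --output=/scratch/users/%u/%j.out
# Pass the allocated core count to the application (hypothetical flag)
my_threaded_app --threads $SLURM_NTASKS
```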
Running parallel (MPI) jobs¶
It is important to note that applications will not necessarily support being run across multiple nodes. They must explicitly support MPI for this purpose as seen in the mpirun application in this example.
If you want to run parallel jobs via a message passing interface (MPI), you need to inform the scheduler; this allows jobs to be efficiently spread over compute nodes to get the best possible performance.
Using multiple CPU cores across multiple nodes is achieved by specifying the
-N, --nodes=<minnodes[-maxnodes]> option - which requests a minimum (and optional maximum) number of nodes to allocate to the submitted job.
If only the minnodes count is specified, then it is used for both the minimum and maximum node count for the job.
You can request multiple cores over multiple nodes using a combination of scheduler directives, either in your job submission command or within your job script. The following examples demonstrate how you can obtain cores across different resources:
- --nodes=2 --ntasks=16: requests 16 cores across 2 compute nodes
- --nodes=2: requests all available cores of 2 compute nodes
- --ntasks=16: requests 16 cores across any available compute nodes
For example, to use 30 CPU cores on the cluster for a single application, the instruction --ntasks=30 can be used. The following example uses the mpirun command to test MPI functionality across 30 CPU cores.
Please copy the example code below and save it with file name "mpi_hello_world.c".
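A canonical MPI hello-world fits the purpose here; each rank prints its processor (node) name, its rank, and the total number of ranks:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
    // Initialise the MPI environment
    MPI_Init(&argc, &argv);

    // Get the number of processes and this process's rank
    int world_size, world_rank;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    // Get the name of the processor (compute node)
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    int name_len;
    MPI_Get_processor_name(processor_name, &name_len);

    printf("Hello world from processor %s, rank %d out of %d processors\n",
           processor_name, world_rank, world_size);

    MPI_Finalize();
    return 0;
}
```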
To compile the sample program:
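Assuming the default OpenMPI module provides the mpicc wrapper (the module name may differ; check module avail):

```
$ module load openmpi
$ mpicc -o mpi_hello_world mpi_hello_world.c
```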
In this example the job is scheduled over two compute nodes. The jobscript loads the default
OpenMPI module which makes the mpirun command available.
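A sketch of the jobscript (the module name and binary path are assumptions based on the compile step above):

```bash
#!/bin/bash -l
#SBATCH --nodes=2
#SBATCH --ntasks=30
#SBATCH --output=/scratch/users/%u/%j.out
module load openmpi
mpirun ./mpi_hello_world
```

Each of the 30 ranks prints a line identifying its processor and rank, so the job output contains 30 hello-world lines spread across the two nodes.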
Running GPU jobs¶
Lots of scientific software is starting to make use of Graphics Processing Units (GPUs) for computation instead of traditional CPU cores. This is because GPUs out-perform CPUs for certain mathematical operations.
If you wish to schedule your job on a GPU you need to provide the
--gres=gpu option in your submissions script. The following example schedules a job on a GPU node then lists the GPU card it was assigned.
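A sketch of such a jobscript (the partition name depends on your User Group; see the tables on Compute Nodes):

```bash
#!/bin/bash -l
#SBATCH -p <partition>_gpu
#SBATCH --gres=gpu
#SBATCH --output=/scratch/users/%u/%j.out
nvidia-smi
```

The nvidia-smi output in the job's output file lists the GPU card the job was assigned.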
The maximum number of GPUs that can be requested is now 8 on public_gpu.
Your GPU-enabled application will most likely make use of the NVIDIA CUDA libraries; to load the CUDA module, use module load cuda in your job submission script.
Testing on the GPU¶
Available alongside the shared public GPU queue, the interruptible_gpu partition gives access to all GPUs in CREATE. This larger pool serves well both for testing GPU scheduling and for making use of otherwise idle private resources, as detailed in our scheduler policy. Note, however, that although jobs may be scheduled faster, they may be cancelled at any time.
Additionally, as interruptible_gpu can be made up of a broad mix of GPU architectures, it may be useful to provide the --constraint scheduler option with your job submissions:
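For example (the feature name a100 and the script name are hypothetical; check which features are defined on your cluster before using one):

```
$ sbatch --constraint=a100 gpu_job.sh
```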
This guide is a quick overview of some of the many available options of the SLURM cluster scheduler. For more information on the available options, you may wish to reference some of the following available documentation for the demonstrated SLURM commands;
- Use the man squeue command to see a full list of scheduler queue instructions
- Use the man sbatch and man srun commands to see a full list of scheduler submission instructions
Online documentation for the SLURM scheduler is available on the SchedMD website.