Running jobs on CREATE HPC¶
This page covers what you need to know to get your code running on the CREATE HPC cluster: quick-start steps for beginners with simple jobs, configuration steps for GPU and MPI jobs, and general advice on using the scheduler that will help you get the most out of the system. Only trivial code examples are used here; specific scientific software can be accessed using modules or containers.
Quickstart¶
Prerequisites¶
These instructions assume the following to be true:
- Your account has been enabled.
- You have an ssh client installed on a computer with internet connectivity or on the KCL network.
- You are able to successfully access the cluster login nodes.
- You know how to create files with a Linux-based text editor.
What is the CREATE HPC cluster?¶
The CREATE HPC cluster is a collection of highly spec'd compute nodes (computers) with a shared network and storage servers. More information about the hardware specifications of the nodes is available on the Compute Nodes page. The scheduler (Slurm) allows users of the cluster to submit jobs (software applications) to run on this pool of hardware, and for the underlying hardware to be efficiently allocated according to compute requirements, access policies and priorities based on the framework set out in the Scheduler Policy.
Loading software dependencies¶
The cluster makes use of Module Environments to provide a means for loading specific versions of scientific software or development tools on the cluster. When submitting jobs to the cluster you should load the software modules required for your program to run.
Identify your partition¶
The scheduler is configured to group the compute nodes into a number of partitions in order to apply a sharing policy. The first task when submitting jobs to the cluster is to identify which partition you should use from the tables on Compute Nodes.
Based on which User Group you belong to in those tables, you must use the associated Partition Name with the -p option of the srun and sbatch commands in the following examples. The _gpu partitions should be used when submitting GPU jobs.
Running an interactive job¶
Note
4 hours is the maximum runtime for interactive jobs; you must request a time limit of no more than 4 hours with the --time <minutes> option.
Running an interactive job is the cluster equivalent of executing a command on a standard Linux command line.
This means you will be able to provide input and read output (via the terminal) in real-time.
This is often used as a means to test that code runs before submitting a batch job.
You can start a new interactive job by using the srun
command; the scheduler will search for an available compute node, and provide you with an interactive login shell on the node if one is available.
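A minimal sketch of an interactive session follows; the partition name and time limit are examples, and the compute node you are allocated will differ:

```bash
# request a 30-minute interactive session on the shared partition
srun -p public_cpu --time 30 --pty /bin/bash -l

# the prompt moves to the allocated compute node (e.g. erc-hpc-comp-021),
# where standard Linux commands can be run
hostname

# in a separate login-node shell, list your running jobs
squeue -u <username>

# leave the interactive session; the allocation is released
exit
```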
In the above example, the srun command is used together with several options: --time <minutes>, -p public_cpu, --pty and /bin/bash -l.
The -p option specifies the partition; here the shared public_cpu partition is used.
The --time option sets a limit on the total runtime of the job allocation and can be extended.
The --pty option executes the task in pseudo-terminal mode, allowing the session to act like a standard terminal session.
The /bin/bash option is the command to be run, in this case the default Linux shell, bash.
Once srun is run, a terminal session is acquired where you can run standard Linux commands (echo, cd, ls, etc.) on the allocated compute node (erc-hpc-comp-021 in this example). squeue -u <username> shows the details of the running interactive job and reports an empty list once we have exited bash.
Submitting a batch job¶
Batch (or non-interactive) jobs allow users to leverage one of the main benefits of having a cluster scheduler; jobs can be queued up with instructions on how to run them and then executed across the cluster while the user does something else. Users submit jobs as scripts, which include instructions on how to run the job - the output of the job (stdout and stderr in Linux terminology) is written to a file on disk for review later on. You can write a batch job that does anything that can be typed on the command-line.
We'll start with a basic example - the following script is written in bash. You can create the script yourself using your editor of choice. The script does nothing more than print some messages to the screen (the echo lines) and sleep for 15 seconds.
We've saved the script to a file called helloworld.sh
- the .sh
extension helps to remind us that this is a shell script, but adding a filename extension isn't strictly necessary for Linux.
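A sketch of such a script (the exact wording of the messages is illustrative):

```bash
#!/bin/bash -l
#SBATCH --output=/scratch/users/%u/%j.out
echo "Starting running on host $HOSTNAME"
sleep 15
echo "Finished running - goodbye from $HOSTNAME"
```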
Note
We use the -l
option to bash on the first line of the script to request a login session. This ensures that Environment Modules can be loaded from your script.
Note
SBATCH --output=/scratch/users/%u/%j.out
is specified to direct the output of the job to the fast scratch storage (ceph). We recommend using this configuration for all your jobs.
If needed you can create sub-directories within your scratch storage area to keep things tidy between different projects.
We can execute that script directly on the login node with the command bash helloworld.sh, and we get output like the following:
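For example (the login-node hostname shown is illustrative, and the second message appears only after the 15-second sleep):

```bash
$ bash helloworld.sh
Starting running on host erc-hpc-login1
Finished running - goodbye from erc-hpc-login1
```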
To submit your job script to the cluster job scheduler, use the command sbatch -p <partition> helloworld.sh, where <partition> is your partition name.
The job scheduler should immediately report the job ID for your job; this is a unique identifier which can be used when viewing or controlling queued jobs:
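A sketch of a submission, with an illustrative job ID; once the job has finished, its output can be read back from the file named after that ID in your scratch space:

```bash
$ sbatch -p <partition> helloworld.sh
Submitted batch job 1123581

$ cat /scratch/users/<username>/1123581.out
Starting running on host erc-hpc-comp-021
Finished running - goodbye from erc-hpc-comp-021
```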
Default resources¶
The job launched above didn't make any explicit requests for resources (e.g. CPU cores, memory) or specify a runtime and so inherited the cluster defaults. If more resources are needed (this is HPC after all) the batch job should include instructions to request more resources. It is important to remember that if you exhaust the resource limits (e.g. runtime or memory) your job will be killed.
Job monitoring¶
Viewing and controlling queued jobs¶
Once your job has been submitted, use the squeue command to view the status of the job queue (adding -u <your_username> to see only your jobs).
If there are available compute nodes with the resources you've requested, your job should be shown in the R (running) state; if not, your job may be shown in the PD (pending) state until resources are available to run it.
If a job is in PD
state - the reason for being unable to run will be displayed in the NODELIST(REASON)
column of the squeue
output.
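For example:

```bash
# list only your own jobs; the ST column shows R (running) or PD (pending),
# and NODELIST(REASON) shows the allocated node or the reason the job is pending
squeue -u <username>
```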
You can use the scancel <jobid>
command to delete a job you've submitted, whether it's running or still in the queued state.
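For example (the job ID is illustrative):

```bash
# cancel a queued or running job by its job ID
scancel 1123581

# confirm it has disappeared from the queue
squeue -u <username>
```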
Checking job resource usage¶
To check the resource usage of a running or completed job, you can use the sacct
command.
Looking at the actual time taken, memory used, etc. of a completed job can be very helpful for optimising the amount of resources you request in future jobs.
Use the -j <job_id>
option to specify the ID of the job you want to investigate, and use --format=<fields>
to specify which job information you want to output.
Some useful fields included in the suggested command below are:
- the requested and elapsed time (Timelimit and Elapsed; CPUTime is the time elapsed multiplied by the number of CPUs used)
- the requested and allocated number of CPUs (ReqCPUS and NCPUS)
- the requested and maximum memory usage (ReqMem and MaxRSS)
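A suggested command covering the fields above (substitute your own job ID):

```bash
sacct -j <job_id> --format=JobID,JobName,Partition,State,Timelimit,Elapsed,CPUTime,ReqCPUS,NCPUS,ReqMem,MaxRSS
```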
The full list of available fields is given by sacct -e
.
More information on how to interpret the fields is available by running man sacct
.
To save remembering all the fields of interest, you can create an alias by adding the following line to your .profile
:
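For example (the alias name jobstats is arbitrary):

```bash
alias jobstats='sacct --format=JobID,JobName,Partition,State,Timelimit,Elapsed,CPUTime,ReqCPUS,NCPUS,ReqMem,MaxRSS -j'
```

After reloading your profile, jobstats <job_id> then prints the summary for a given job.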
Scheduler instructions¶
Job instructions can be provided in two ways; they are:
- On the command line, as parameters to your sbatch or srun command. For example, you can set the name of your job using the --job-name=[name] | -J [name] option:
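For example, reusing the helloworld.sh script from above:

```bash
# the job name then appears in the NAME column of squeue output
sbatch -p <partition> --job-name=hello-world helloworld.sh
```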
- In your job script, by including scheduler directives at the top of your job script - you can achieve the same effect as providing options with the sbatch or srun commands. To add the --job-name directive to our previous example:
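A sketch of the equivalent job script:

```bash
#!/bin/bash -l
#SBATCH --job-name=hello-world
#SBATCH --output=/scratch/users/%u/%j.out
echo "Starting running on host $HOSTNAME"
sleep 15
echo "Finished running - goodbye from $HOSTNAME"
```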
Including job scheduler instructions in your job-scripts is often the most convenient method of working for batch jobs - follow the guidelines below for the best experience:
- Lines in your script that include job-scheduler directives must start with #SBATCH at the beginning of the line
- You can have multiple lines starting with #SBATCH in your job-script
- You can put multiple instructions separated by a space on a single line starting with #SBATCH
- The scheduler will parse the script from top to bottom and set instructions in order; if you set the same parameter twice, the second value will be used.
- Instructions are parsed at job submission time, before the job itself has actually run.
- This means you can't, for example, tell the scheduler to put your job output in a directory that you create in the job-script itself - the directory will not exist when the job starts running, and your job will fail with an error.
- You can use dynamic variables in your instructions (see below)
Dynamic scheduler variables¶
When writing submission scripts you will often need to reference values set by the scheduler (e.g. jobid), values inherited from the OS (e.g. username), or values you set elsewhere in your script (e.g. ntasks in the SBATCH directives).
For this purpose a number of dynamic variables are made available. In the list below variables starting %
can be referenced in the SBATCH directives and those starting $
from the body of the shell script.
- %u / $USER: The Linux username of the submitting user
- %a / $SLURM_ARRAY_TASK_ID: Job array ID (index) number
- %A / $SLURM_ARRAY_JOB_ID: Job allocation number for an array job
- %j / $SLURM_JOB_ID: Job allocation number
- %x / $SLURM_JOB_NAME: Job name
- $SLURM_NTASKS: Number of CPU cores requested with -n, --ntasks; this can be provided to your code to make use of the allocated CPU cores
Simple scheduler instruction examples¶
Here are some commonly used scheduler instructions, along with some examples of their usage:
Setting output file location¶
To set the output file location for your job, use the -o [file_name] | --output=[file_name] option - both standard-out and standard-error from your job-script, including any output generated by applications launched by the script, will be saved in the file you specify.
By default, the scheduler stores data relative to your home-directory - but to avoid confusion, we recommend specifying a full path to the filename to be used. Although Linux can support several jobs writing to the same output file, the result is likely to be garbled - it's common practice to include something unique about the job (e.g. its job ID) in the output filename to make sure your job's output is clear and easy to read.
Note
The directory used to store your job output file must exist and be writable by your user before you submit your job to the scheduler. Your job may fail to run if the scheduler cannot create the output file in the directory requested.
The following example uses the --output=[file_name]
instruction to set the output file location:
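For example:

```bash
#!/bin/bash -l
#SBATCH --output=/scratch/users/%u/%j.out
echo "Starting running on host $HOSTNAME"
sleep 15
echo "Finished running - goodbye from $HOSTNAME"
```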
Note
SBATCH --output=/scratch/users/%u/%j.out
is specified to direct the output of the job to the fast scratch storage (ceph).
We recommend using this configuration for all your jobs. If needed you can create sub-directories within your scratch storage area to keep things tidy between different projects.
Setting working directory for your job¶
By default, jobs are executed from your home-directory on the cluster (i.e. /home/<your-user-name>
, $HOME
or ~
).
You can include cd
commands in your job-script to change to different directories; alternatively, you can provide an instruction to the scheduler to change to a different directory to run your job. The available options are:
-D | --chdir=[directory path]
- instruct the job scheduler to move into the directory specified before starting to run the job on a compute node
For example, to change your working directory in a batch job:
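A sketch, where the directory is a placeholder for one that already exists in your scratch space:

```bash
#SBATCH --chdir=/scratch/users/<username>/my_project
```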
Note
The directory specified must exist and be accessible by the compute node in order for the job you submitted to run.
Waiting for a previous job before running¶
You can instruct the scheduler to wait for an existing job to finish before starting to run the job you are submitting with the -d [state:job_id] | --depend=[state:job_id]
option.
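For example, to run a follow-up script only once job 77 has completed successfully (the afterok dependency type), something like the following could be used; next_step.sh is a placeholder for your own script:

```bash
sbatch -p <partition> --depend=afterok:77 next_step.sh
```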
Running task array jobs¶
A common workload is having a large number of jobs to run which basically do the same thing, aside perhaps from having different input data. You could generate a job-script for each of them and submit it, but that's not very convenient - especially if you have many hundreds or thousands of tasks to complete. Such jobs are known as task arrays - an embarrassingly parallel job will often fit into this category.
A convenient way to run such jobs on a cluster is to use a task array, using the -a [array_spec] | --array=[array_spec]
directive.
Your job-script can then use the pseudo environment variables created by the scheduler to refer to data used by each task in the job. The following job-script uses the $SLURM_ARRAY_TASK_ID
/%a
variable to echo its current task ID to an output file:
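A minimal sketch, assuming a 10-task array:

```bash
#!/bin/bash -l
#SBATCH --array=1-10
#SBATCH --output=/scratch/users/%u/%A_%a.out
# each task writes its own index to its own output file
echo "I am task $SLURM_ARRAY_TASK_ID of array job $SLURM_ARRAY_JOB_ID"
```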
All tasks in an array job are given a job ID with the format [job_ID]_[task_number]
e.g. 77_81
would be job number 77, array task 81.
Array jobs can easily be cancelled using the scancel command - the following examples show various levels of control over an array job:
- scancel 77: Cancels all array tasks under the job ID 77
- scancel 77_[100-200]: Cancels array tasks 100-200 under the job ID 77
- scancel 77_5: Cancels array task 5 under the job ID 77
A common use case for array jobs is running the same script on many different input files. One way to accomplish this with an array job is to create a text file listing all the input files, then use the array ID to select the corresponding line from the file.
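For example, a hypothetical list saved as input_files.txt, with one input file per line (the filenames are placeholders):

```
sample_01.fastq
sample_02.fastq
sample_03.fastq
```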
The job submission script could look something like this:
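A sketch of such a script, assuming the list above was saved as input_files.txt; my_program is a placeholder for your own command:

```bash
#!/bin/bash -l
#SBATCH --array=1-3
#SBATCH --output=/scratch/users/%u/%A_%a.out
# select the line of input_files.txt matching this task's array index
INPUT_FILE=$(sed -n "${SLURM_ARRAY_TASK_ID}p" input_files.txt)
echo "Task $SLURM_ARRAY_TASK_ID processing $INPUT_FILE"
my_program "$INPUT_FILE"
```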
Requesting more resources¶
By default, jobs are constrained to the cluster defaults (see table below) - users can use scheduler instructions to request more resources for their jobs as needed. The following documentation shows how these requests can be made.
| CPU cores | Memory | Runtime |
|---|---|---|
| 1 core | 1GB | 24 hours |
In order to promote best-use of the cluster scheduler, particularly in a shared environment, it is recommended that you inform the scheduler of the amount of time, memory and CPU cores your job is expected to need.
This helps the scheduler appropriately place jobs on the available nodes in the cluster and should minimise any time spent queuing for resources to become available (to this end you should always request the minimal amount of resources you require to run).
See the section on checking job resource usage with sacct for tips on how to identify the resources used by past jobs - you can use this as a guide for resource requests for new jobs.
Requesting a longer runtime¶
Note
48 hours is the maximum runtime for batch jobs. If you have a job which requires longer and cannot be checkpointed please email support@er.kcl.ac.uk to discuss your requirements.
You can inform the cluster scheduler of the expected runtime using the -t, --time=<time>
option. For example - to submit a job that runs for 2 hours, the following example job script could be used:
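A sketch, where my_program is a placeholder for the command you want to run:

```bash
#!/bin/bash -l
#SBATCH --output=/scratch/users/%u/%j.out
#SBATCH --time=0-2:00
my_program
```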
You can then see any time limits assigned to running jobs using the command squeue --long
:
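For example:

```bash
# the long output format includes a TIME_LIMIT column for each job
squeue --long -u <username>
```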
Requesting more memory¶
You can specify the maximum amount of memory required per submitted job with the --mem=<MB>
option.
This informs the scheduler of the memory required for the submitted job. Optionally - you can also request an amount of memory per CPU core rather than a total amount of memory required per job.
To specify an amount of memory to allocate per core, use the --mem-per-cpu=<MB>
option.
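For example, a minimal sketch (the values are illustrative, and --mem and --mem-per-cpu should not be combined in the same job):

```bash
# request 8192 MB for the whole job...
#SBATCH --mem=8192
# ...or, alternatively, 2048 MB per allocated CPU core:
# #SBATCH --mem-per-cpu=2048
```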
Note
When running a job across multiple compute hosts, the --mem=<MB>
option informs the scheduler of the required memory per node
Running multi-threaded jobs¶
If you want to use multiple cores on a compute node to run a multi-threaded application, you need to inform the scheduler. Using multiple CPU cores is achieved by specifying the -n, --ntasks=<number> option in either your submission command or the scheduler directives in your job script.
The --ntasks
option informs the scheduler of the number of cores you wish to reserve for use. If the parameter is omitted, the default --ntasks=1
is assumed. You could specify the option -n 4
to request 4 CPU cores for your job.
Besides the number of tasks, you will need to add --nodes=1 to your scheduler command or at the top of your job script with #SBATCH --nodes=1; this sets the maximum number of nodes used to 1 and prevents the job from selecting cores across multiple nodes (multi-node jobs require MPI). A sketch combining these options is given after the notes below.
Note
If you request more cores than are available on a node in your cluster, the job will not run until a node capable of fulfilling your request becomes available. The scheduler will display the error in the output of the squeue
command.
Note
Just asking for more cores will not necessarily mean your code makes use of them. It is generally required to inform your application of how many cores to use. This can be done using the $SLURM_NTASKS variable in your submission script.
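Putting this together, a sketch of a multi-threaded job script; my_threaded_program and its --threads flag are placeholders for your own application:

```bash
#!/bin/bash -l
#SBATCH --output=/scratch/users/%u/%j.out
#SBATCH --nodes=1
#SBATCH --ntasks=4
# tell the application how many cores were actually allocated
my_threaded_program --threads "$SLURM_NTASKS"
```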
Running parallel (MPI) jobs¶
Note
It is important to note that applications will not necessarily support being run across multiple nodes. They must explicitly support MPI for this purpose, as demonstrated with the mpirun launcher in this example.
If you want to run parallel jobs via a message passing interface (MPI), you need to inform the scheduler - this allows jobs to be efficiently spread over compute nodes to get the best possible performance.
Using multiple CPU cores across multiple nodes is achieved by specifying the -N, --nodes=<minnodes[-maxnodes]>
option - which requests a minimum (and optional maximum) number of nodes to allocate to the submitted job.
If only the minnodes
count is specified - then this is used for both the minimum and maximum node count for the job.
You can request multiple cores over multiple nodes using a combination of scheduler directives, either in your job submission command or within your job script. The following examples demonstrate how you can obtain cores across different resources:
- --nodes=2 --ntasks=16: Requests 16 cores across 2 compute nodes
- --nodes=2: Requests all available cores of 2 compute nodes
- --ntasks=16: Requests 16 cores across any available compute nodes
For example, to use 30 CPU cores on the cluster for a single application, the instruction --ntasks=30 can be used. The following example uses the mpirun command to test MPI functionality across 30 CPU cores.
Sample code
Please copy the example code below and save it with file name "mpi_hello_world.c".
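A standard MPI hello-world program in C is suitable for this test; a sketch:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    // initialise the MPI environment
    MPI_Init(&argc, &argv);

    // find out how many ranks there are and which one we are
    int world_size, world_rank;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    // report the host this rank is running on
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    int name_len;
    MPI_Get_processor_name(processor_name, &name_len);

    printf("Hello world from processor %s, rank %d out of %d processors\n",
           processor_name, world_rank, world_size);

    MPI_Finalize();
    return 0;
}
```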
To compile the sample program:
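For example, using the mpicc compiler wrapper provided by the OpenMPI module (the module name is assumed; check module avail for the exact name):

```bash
module load openmpi
mpicc mpi_hello_world.c -o mpi_hello_world
```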
In this example the job is scheduled over two compute nodes. The jobscript loads the default OpenMPI
module which makes the mpirun command available.
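A sketch of such a job script, assuming the module is named openmpi:

```bash
#!/bin/bash -l
#SBATCH --output=/scratch/users/%u/%j.out
#SBATCH --nodes=2
#SBATCH --ntasks=30
module load openmpi   # assumed module name - check with `module avail`
mpirun -n "$SLURM_NTASKS" ./mpi_hello_world
```

Each of the 30 ranks should print a line reporting its rank and the compute node it ran on.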
Running GPU jobs¶
Lots of scientific software is starting to make use of Graphics Processing Units (GPUs) for computation instead of traditional CPU cores. This is because GPUs out-perform CPUs for certain mathematical operations.
If you wish to schedule your job on a GPU, you need to provide the --gres=gpu option in your submission script. The following example schedules a job on a GPU node and then lists the GPU card it was assigned.
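A minimal sketch; submit it with -p set to one of the _gpu partitions:

```bash
#!/bin/bash -l
#SBATCH --output=/scratch/users/%u/%j.out
#SBATCH --gres=gpu
# list the GPU card(s) assigned to this job
nvidia-smi -L
```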
Note
The maximum number of GPUs that can be requested is now 8 on public_gpu.
Note
Your GPU-enabled application will most likely make use of the NVIDIA CUDA libraries; to load the CUDA module, use module load cuda in your job submission script.
Testing on the GPU¶
Available alongside the shared public GPU queue, the interruptible_gpu partition gives access to all GPUs in CREATE. This larger pool serves both as a mechanism for testing GPU scheduling and as a way of making use of unused private resources, as detailed in our scheduler policy.
It is important to note, however, that although scheduling may be faster, jobs on this partition may be cancelled at any time.
Additionally, as the interruptible_gpu partition can be made up of a broad mix of GPU architectures, it may be useful to provide the --constraint scheduler option with your job submissions:
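For example (the constraint value is a placeholder for one of the GPU architecture features defined on the cluster):

```bash
#SBATCH --constraint=<gpu_architecture>
```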
Further documentation¶
This guide is a quick overview of some of the many available options of the SLURM cluster scheduler. For more information, you may wish to consult the following documentation for the SLURM commands demonstrated here:
- Use the man squeue command to see a full list of scheduler queue instructions
- Use the man sbatch or man srun commands to see a full list of scheduler submission instructions
- Online documentation for the SLURM scheduler is available here