Requesting Access to the Long Partitions¶
The long_cpu and long_gpu partitions provide extended runtime capabilities for computational jobs that cannot be completed within the standard 48-hour limit of the regular cpu and gpu partitions. These partitions are available on an as-needed basis and require approval for access.
Overview¶
The long partitions consist of:
long_cpu partition
- 5 compute nodes (erc-hpc-comp001-005)
- 640 CPU cores
- No GPUs (CPU-only compute resources)
- 7-day maximum runtime
long_gpu partition
- 3 compute nodes (erc-hpc-comp031, erc-hpc-comp048, erc-hpc-comp202)
- 400 CPU cores
- 16 GPUs
- 10-day maximum runtime
When to Request Access¶
Access to the long partitions can be requested when:
- A computational job requires more than 48 hours to complete
- An application cannot be checkpointed or interrupted and resumed
- For long_gpu: a job requires GPU resources for extended periods
Note
Before requesting access to the long partitions, please investigate native checkpointing in your application, or other approaches that can be used to optimise run times.
Other ways to improve run times include breaking work into smaller chunks and increasing parallelisation (see the job array sketch below).
Monitoring resource usage may also suggest changes to resource requests (e.g. the number of cores) that can help to speed up jobs.
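As a rough illustration, one way to break work into smaller chunks is a SLURM job array, where each array task processes an independent chunk within the standard 48-hour limit. The resource requests, chunk count and process_chunk.sh script below are placeholders, not a recipe for any particular application:

```bash
#!/bin/bash
# Hypothetical sketch: splitting one long workload into independent chunks
# with a SLURM job array, so each chunk fits within the standard 48-hour limit.
# The resource requests, chunk count and "process_chunk.sh" are placeholders.
#SBATCH --job-name=chunked_run
#SBATCH --partition=cpu
#SBATCH --time=24:00:00
#SBATCH --cpus-per-task=8
#SBATCH --array=1-10

# Each array task handles the chunk identified by its index
./process_chunk.sh --chunk "${SLURM_ARRAY_TASK_ID}"
```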
How to Request Access¶
Access to the long partitions can be requested through the e-Research support team. Please send an email to support@er.kcl.ac.uk with the following information:
- Which partition is required (long_cpu or long_gpu)
- The filepath to the job script and any command line parameters passed when invoking the script
- A brief description of the computational job/workflow
- Estimated runtime required
- Explanation of why additional runtime is required and any previous optimisation attempts
The request will be reviewed; additional information may be requested, and suggestions may be made on how to meet your objectives without access to the long partitions.
Submitting Jobs to Long Partitions¶
Once access has been granted, jobs can be submitted to the long partitions using the standard SLURM batch job commands with the appropriate partition specified (-p long_cpu or -p long_gpu).
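For example, a minimal long_cpu batch script might look like the sketch below; the job name, resource requests and application command are placeholders to adapt to your own workflow:

```bash
#!/bin/bash
# Hypothetical long_cpu batch script; adjust the job name, resources
# and command to your own workflow.
#SBATCH --job-name=long_run
#SBATCH --partition=long_cpu
#SBATCH --time=5-00:00:00        # 5 days (long_cpu allows up to 7 days)
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=64G

./my_long_running_application --input data.in
```

The script is submitted in the usual way, e.g. sbatch long_cpu_job.sh.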
Resource considerations when using long partitions¶
- Request only the resources you need
- Consider memory requirements; long-running jobs may have different memory patterns
- Monitor job progress, using squeue to check job status and sacct to review resource utilisation
- Be mindful of other users; the long partitions have limited resources shared among approved users
- For GPU jobs, ensure the application effectively utilises GPU resources for the entire runtime
Please see Checking job resource usage for more details on monitoring resource usage.
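As an illustration, the commands below show one way to check job status and review resource usage after completion; the job ID and the chosen sacct fields are examples rather than required settings:

```bash
# List your queued and running jobs
squeue -u $USER

# Review elapsed time, CPU time and peak memory of a completed job
# (123456 is a placeholder job ID)
sacct -j 123456 --format=JobID,JobName,Partition,Elapsed,TotalCPU,MaxRSS,State
```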
Best Practices¶
- Test with shorter jobs first by validating workflows on the standard cpu or gpu partitions before submitting jobs to long partitions
- Use appropriate time limits by specifying realistic time estimates with the #SBATCH --time directive (up to a maximum of 7 days for long_cpu and 10 days for long_gpu)
- Implement progress monitoring, e.g. add logging or progress indicators to track job advancement
- Plan for contingencies by considering what happens if a job fails
- Monitor memory, CPU and GPU (where applicable) utilisation to ensure efficient resource usage
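For instance, a long_gpu job requesting an extended time limit might look like the sketch below; the GPU count, other resource requests and the application command are placeholders:

```bash
#!/bin/bash
# Hypothetical long_gpu batch script; GPU count, resources and the
# application command are placeholders for your own workflow.
#SBATCH --job-name=long_gpu_run
#SBATCH --partition=long_gpu
#SBATCH --time=10-00:00:00       # long_gpu allows up to 10 days
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G

# Writing periodic progress to a log file helps track job advancement
./my_gpu_application --input data.in --log progress.log
```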
Troubleshooting¶
- Verify node availability in the partition, e.g. use sinfo -p long_cpu or sinfo -p long_gpu
- Review resource requests and ensure job requirements do not exceed available resources
- For GPU jobs, verify GPU availability with sinfo -p long_gpu -o "%P %A %C %G"
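For reference, these checks can be run as follows; in the sinfo format string, %P is the partition, %A the allocated/idle node counts, %C the CPU states (allocated/idle/other/total) and %G the generic resources (GPUs):

```bash
# Show node states for the long partitions
sinfo -p long_cpu
sinfo -p long_gpu

# Partition, allocated/idle nodes, CPU states and GRES (GPUs)
sinfo -p long_gpu -o "%P %A %C %G"
```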
For further troubleshooting, please raise a ticket via email to support@er.kcl.ac.uk.
Further Information¶
- Running Jobs - Introduction - General job submission guide
- Running GPU Jobs - Guide for GPU job submission
- Scheduler Policy - Overall cluster usage policies
- Compute Nodes - Hardware specifications and partition details
- Getting Help - How to get support