Requesting Access to the Long Partitions¶
The long_cpu and long_gpu partitions provide extended runtime capabilities for computational jobs that cannot be completed within the standard 48-hour limit of the regular cpu and gpu partitions. These partitions are available on an as-needed basis and require approval for access.
Overview¶
The long partitions consist of:
long_cpu partition
- 5 compute nodes (erc-hpc-comp001-005)
- 640 CPU cores
- No GPUs (CPU-only compute resources)
- 7-day maximum runtime
long_gpu partition
- 3 compute nodes (erc-hpc-comp031, erc-hpc-comp048, erc-hpc-comp202)
- 400 CPU cores
- 16 GPUs
- 10-day maximum runtime
When to Request Access¶
Access to the long partitions can be requested when:
- A computational job requires more than 48 hours to complete
- An application cannot be checkpointed or interrupted and resumed
- For long_gpu: a job requires GPU resources for extended periods
Note
Before requesting access to the long partitions, please investigate native checkpointing in your application, or other approaches that can be used to optimise run times.
Other ways to improve run times include breaking work into smaller chunks and increasing parallelisation (see the job array sketch below).
Monitoring resource usage may also suggest changes to resource requests (e.g. the number of cores) that can help to speed up jobs.
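As a rough illustration, one way to break work into smaller chunks is a SLURM job array, where each array task processes an independent chunk within the standard 48-hour limit. The resource requests, chunk count and process_chunk.sh script below are placeholders, not a recipe for any particular application:

```bash
#!/bin/bash
# Hypothetical sketch: splitting one long workload into independent chunks
# with a SLURM job array, so each chunk fits within the standard 48-hour limit.
# The resource requests, chunk count and "process_chunk.sh" are placeholders.
#SBATCH --job-name=chunked_run
#SBATCH --partition=cpu
#SBATCH --time=24:00:00
#SBATCH --cpus-per-task=8
#SBATCH --array=1-10

# Each array task handles the chunk identified by its index
./process_chunk.sh --chunk "${SLURM_ARRAY_TASK_ID}"
```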
How to Request Access¶
Access to the long partitions can be requested through the e-Research support team. Please send an email to support@er.kcl.ac.uk with the following information:
- Which partition is required (long_cpu or long_gpu)
- The filepath to the job script and any command line parameters passed when invoking the script
- A brief description of the computational job/workflow
- Estimated runtime required
- Explanation of why additional runtime is required and any previous optimisation attempts
The request will be reviewed; additional information may be requested, and suggestions may be made on how to meet your objectives without access to the long partitions.
Submitting Jobs to Long Partitions¶
Once access has been granted, jobs can be submitted to the long partitions using the standard SLURM batch job commands with the appropriate partition specified (-p long_cpu or -p long_gpu).
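For example, a minimal long_cpu batch script might look like the sketch below; the job name, resource requests and application command are placeholders to adapt to your own workflow:

```bash
#!/bin/bash
# Hypothetical long_cpu batch script; adjust the job name, resources
# and command to your own workflow.
#SBATCH --job-name=long_run
#SBATCH --partition=long_cpu
#SBATCH --time=5-00:00:00        # 5 days (long_cpu allows up to 7 days)
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=64G

./my_long_running_application --input data.in
```

The script is submitted in the usual way, e.g. sbatch long_cpu_job.sh.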
Resource considerations when using long partitions¶
- Request only the resources you need
- Consider memory requirements; long-running jobs may have different memory patterns
- Monitor job progress, using squeue to check job status and sacct to review resource utilisation
- Be mindful of other users; the long partitions have limited resources shared among approved users
- For GPU jobs, ensure the application effectively utilises GPU resources for the entire runtime
Please see Checking job resource usage for more details on monitoring resource usage.
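As an illustration, the commands below show one way to check job status and review resource usage after completion; the job ID and the chosen sacct fields are examples rather than required settings:

```bash
# List your queued and running jobs
squeue -u $USER

# Review elapsed time, CPU time and peak memory of a completed job
# (123456 is a placeholder job ID)
sacct -j 123456 --format=JobID,JobName,Partition,Elapsed,TotalCPU,MaxRSS,State
```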
Best Practices¶
- Test with shorter jobs first by validating workflows on the standard cpu or gpu partitions before submitting jobs to long partitions
- Use appropriate time limits by specifying realistic time estimates with the #SBATCH --time directive (up to a maximum of 7 days for long_cpu and 10 days for long_gpu)
- Implement progress monitoring, e.g. add logging or progress indicators to track job advancement
- Plan for contingencies by considering what happens if a job fails
- Monitor memory, CPU and GPU (where applicable) utilisation to ensure efficient resource usage
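For instance, a long_gpu job requesting an extended time limit might look like the sketch below; the GPU count, other resource requests and the application command are placeholders:

```bash
#!/bin/bash
# Hypothetical long_gpu batch script; GPU count, resources and the
# application command are placeholders for your own workflow.
#SBATCH --job-name=long_gpu_run
#SBATCH --partition=long_gpu
#SBATCH --time=10-00:00:00       # long_gpu allows up to 10 days
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G

# Writing periodic progress to a log file helps track job advancement
./my_gpu_application --input data.in --log progress.log
```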
Troubleshooting¶
- Verify node availability in the partition, e.g. use sinfo -p long_cpu or sinfo -p long_gpu
- Review resource requests and ensure job requirements do not exceed available resources
- For GPU jobs, verify GPU availability with sinfo -p long_gpu -o "%P %A %C %G"
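For reference, these checks can be run as follows; in the sinfo format string, %P is the partition, %A the allocated/idle node counts, %C the CPU states (allocated/idle/other/total) and %G the generic resources (GPUs):

```bash
# Show node states for the long partitions
sinfo -p long_cpu
sinfo -p long_gpu

# Partition, allocated/idle nodes, CPU states and GRES (GPUs)
sinfo -p long_gpu -o "%P %A %C %G"
```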
For further troubleshooting, please raise a ticket via email to support@er.kcl.ac.uk.
Further Information¶
- Running Jobs - Introduction - General job submission guide
- Running GPU Jobs - Guide for GPU job submission
- Scheduler Policy - Overall cluster usage policies
- Compute Nodes - Hardware specifications and partition details
- Getting Help - How to get support