Requesting Access to the Long Partitions

The long_cpu and long_gpu partitions provide extended runtime capabilities for computational jobs that cannot be completed within the standard 48-hour limit of the regular cpu and gpu partitions. These partitions are available on an as-needed basis and require approval for access.

Overview

The long partitions consist of:

long_cpu partition

  • 5 compute nodes (erc-hpc-comp001-005)
  • 640 CPU cores
  • No GPUs (CPU-only compute resources)
  • 7-day maximum runtime

long_gpu partition

  • 3 compute nodes (erc-hpc-comp031, erc-hpc-comp048, erc-hpc-comp202)
  • 400 CPU cores
  • 16 GPUs
  • 10-day maximum runtime

When to Request Access

Access to the long partitions can be requested when:

  • A computational job requires more than 48 hours to complete
  • An application cannot be checkpointed or interrupted and resumed
  • For long_gpu: A job requires GPU resources for extended periods

Note

Before requesting access to the long partitions, please investigate whether your application supports native checkpointing, or other approaches that can be used to reduce run times.

Other ways to improve run times include breaking the work into smaller chunks and increasing parallelisation.

Monitoring resource usage may also suggest changes to resource requests (e.g. the number of cores) that can help to speed up jobs.

How to Request Access

Access to the long partitions can be requested through the e-Research support team. Please send an email to support@er.kcl.ac.uk with the following information:

  • Which partition is required (long_cpu or long_gpu)
  • The filepath to the job script and any command line parameters passed when invoking the script
  • A brief description of the computational job/workflow
  • Estimated runtime required
  • Explanation of why additional runtime is required and any previous optimisation attempts

The request will be reviewed; additional information may be requested, and suggestions may be made on how to meet your objectives without access to the long partitions.

Submitting Jobs to Long Partitions

Once access has been granted, jobs can be submitted to the long partitions using the standard SLURM batch job commands with the appropriate partition specified (-p long_cpu or -p long_gpu).
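
As an illustration, a minimal long_cpu job script might look like the following; the job name, resource values and application command are placeholders to be replaced with your own:

    #!/bin/bash
    #SBATCH --job-name=my_long_job        # placeholder job name
    #SBATCH --partition=long_cpu          # or long_gpu for GPU jobs
    #SBATCH --time=5-00:00:00             # 5 days; must not exceed the partition limit
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=8             # example core request - ask only for what you need
    #SBATCH --mem=16G                     # example memory request

    # Replace with your own application command
    srun ./my_long_running_application

The script is then submitted in the usual way, e.g. sbatch my_long_job.sh.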

Resource considerations when using long partitions

  • Request only the resources you need
  • Consider memory requirements; long-running jobs may have different memory usage patterns
  • Monitor job progress using squeue to check job status and sacct to review resource utilisation (see the example below)
  • Be mindful of other users; the long partitions have limited resources shared among approved users
  • For GPU jobs, ensure the application utilises GPU resources effectively for the entire runtime

Please see Checking job resource usage for more details on monitoring resource usage.
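
For example, assuming a job ID of 123456 (a placeholder), the status and resource usage of a running or completed job could be checked along these lines:

    # List your own queued and running jobs
    squeue -u $USER

    # Summarise resource usage for a specific job (123456 is a placeholder job ID)
    sacct -j 123456 --format=JobID,JobName,Partition,Elapsed,TotalCPU,MaxRSS,State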

Best Practices

  • Test with shorter jobs first by validating workflows on the standard cpu or gpu partitions before submitting jobs to long partitions
  • Use appropriate time limits by specifying realistic time estimates with the #SBATCH --time directive (up to a maximum of 7 days for long_cpu and 10 days for long_gpu), as in the sketch after this list
  • Implement progress monitoring, e.g. add logging or progress indicators to track job advancement
  • Plan for contingencies by considering what happens if a job fails
  • Monitor memory, CPU and GPU (where applicable) utilisation to ensure efficient resource usage
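
As a sketch of the time-limit and progress-logging points above, a long_gpu job script might include directives and logging along these lines (the GPU request, stage names and commands are illustrative only, and assume GPUs are requested with --gres):

    #SBATCH --partition=long_gpu
    #SBATCH --time=10-00:00:00          # maximum runtime allowed on long_gpu
    #SBATCH --gres=gpu:1                # example GPU request

    # Simple progress logging written to the job's output file
    echo "$(date): starting stage 1"
    srun ./stage1_command               # placeholder command
    echo "$(date): stage 1 complete, starting stage 2"
    srun ./stage2_command               # placeholder command
    echo "$(date): job finished"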

Troubleshooting

  • Verify node availability in the partition, e.g. use sinfo -p long_cpu or sinfo -p long_gpu, as shown below
  • Review resource requests and ensure job requirements do not exceed available resources
  • For GPU jobs, verify GPU availability with sinfo -p long_gpu -o "%P %A %C %G"
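
For example, the commands from the list above can be run together to check partition and GPU availability:

    # Node availability and state in the long partitions
    sinfo -p long_cpu
    sinfo -p long_gpu

    # Allocated/idle node counts, CPU usage and GPU resources for long_gpu
    sinfo -p long_gpu -o "%P %A %C %G"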

For further troubleshooting, please raise a ticket via email to support@er.kcl.ac.uk

Further Information