Running jobs on multiple nodes
Note
Applications will not necessarily support being run across multiple nodes: they must explicitly support the Message Passing Interface (MPI) for this purpose, as seen in the examples provided below.
MPI is used to run parallel jobs on multiple nodes. The scheduler (SLURM) needs to be made aware of the resources required so that jobs can be allocated efficiently across compute nodes for the best possible performance.
Using multiple CPU cores across multiple nodes is achieved by specifying the -N, --nodes=<minnodes[-maxnodes]> option, which requests a minimum (and optionally a maximum) number of nodes to allocate to the submitted job. If only the minnodes count is specified, it is used as both the minimum and maximum node count for the job.
You can request multiple cores over multiple nodes using a combination of scheduler directives, either in your job submission command or within your job script; a sketch of both approaches follows the list below. The examples below demonstrate how cores can be requested across different resources:
--nodes=2 --ntasks=16
    Requests 16 cores across 2 compute nodes
--nodes=2
    Requests all available cores on 2 compute nodes
--ntasks=16
    Requests 16 cores across any available compute nodes
--nodes=2 --ntasks-per-node=16
    Requests 16 cores on each of 2 nodes (32 cores in total)
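For illustration, the same request can be supplied on the sbatch command line or embedded in the job script; a minimal sketch, where the script name my_job.sh and the cpu partition are assumptions for this example:
sbatch -p cpu --nodes=2 --ntasks=16 my_job.sh
or, equivalently, inside my_job.sh itself:
#!/bin/bash -l
#SBATCH --nodes=2
#SBATCH --ntasks=16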
MPI examples
Two examples are provided below. As well as providing an introduction to MPI, they can be used to debug MPI issues.
To see initial user-level debugging output from mpirun, add the -d flag to the mpirun command.
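For example, using the hello world binary built in Example 1 below (the process count here is illustrative only):
mpirun -d -n 32 mpi_hello_world_c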
Additional information (including much more detail about how to debug MPI applications) is available in the Open MPI documentation at: https://docs.open-mpi.org/en/main/index.html
Example 1: "Hello world"
This is a very simple example that spawns processes on multiple nodes but does not require them to communicate with each other.
Copy the code below and save it as mpi_hello_world.c.
#include <stdio.h>
#include <mpi.h>
int main(int argc, char** argv) {
// Initialize the MPI environment
MPI_Init(NULL, NULL);
// Get the number of processes
int world_size;
MPI_Comm_size(MPI_COMM_WORLD, &world_size);
// Get the rank of the process
int world_rank;
MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
// Get the name of the processor
char processor_name[MPI_MAX_PROCESSOR_NAME];
int name_len;
MPI_Get_processor_name(processor_name, &name_len);
// Print off a hello world message
printf("Hello world from processor %s, rank %d out of %d processors\n",
processor_name, world_rank, world_size);
// Finalize the MPI environment.
MPI_Finalize();
}
To compile the sample program:
mpicc -o mpi_hello_world_c mpi_hello_world.c
Using the submission script below, the job is scheduled over two compute nodes with 16 cores on each node, and mpirun is invoked with 32 processes.
#!/bin/bash -l
#SBATCH --ntasks-per-node 16
#SBATCH -N 2
#SBATCH --job-name=openmpi
#SBATCH --output=openmpi.out.%j
mpirun -n 32 mpi_hello_world_c
k1234567@erc-hpc-login1:~$ sbatch -p cpu openmpi.sh
Submitted batch job 11449
k1234567@erc-hpc-login1:~$ squeue -u k1234567
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
11449 cpu openmpi k1234567 R 0:02 2 erc-hpc-comp[184-185]
k1234567@erc-hpc-login1:~$ cat openmpi.out.11449
Hello world from processor erc-hpc-comp184, rank 7 out of 32 processors
Hello world from processor erc-hpc-comp184, rank 8 out of 32 processors
Hello world from processor erc-hpc-comp184, rank 13 out of 32 processors
Hello world from processor erc-hpc-comp184, rank 14 out of 32 processors
Hello world from processor erc-hpc-comp184, rank 15 out of 32 processors
Hello world from processor erc-hpc-comp184, rank 4 out of 32 processors
Hello world from processor erc-hpc-comp184, rank 0 out of 32 processors
Hello world from processor erc-hpc-comp184, rank 3 out of 32 processors
Hello world from processor erc-hpc-comp184, rank 5 out of 32 processors
Hello world from processor erc-hpc-comp184, rank 6 out of 32 processors
Hello world from processor erc-hpc-comp184, rank 1 out of 32 processors
Hello world from processor erc-hpc-comp184, rank 2 out of 32 processors
Hello world from processor erc-hpc-comp184, rank 9 out of 32 processors
Hello world from processor erc-hpc-comp184, rank 10 out of 32 processors
Hello world from processor erc-hpc-comp184, rank 12 out of 32 processors
Hello world from processor erc-hpc-comp185, rank 17 out of 32 processors
Hello world from processor erc-hpc-comp185, rank 21 out of 32 processors
Hello world from processor erc-hpc-comp184, rank 11 out of 32 processors
Hello world from processor erc-hpc-comp185, rank 29 out of 32 processors
Hello world from processor erc-hpc-comp185, rank 30 out of 32 processors
Hello world from processor erc-hpc-comp185, rank 16 out of 32 processors
Hello world from processor erc-hpc-comp185, rank 20 out of 32 processors
Hello world from processor erc-hpc-comp185, rank 22 out of 32 processors
Hello world from processor erc-hpc-comp185, rank 24 out of 32 processors
Hello world from processor erc-hpc-comp185, rank 25 out of 32 processors
Hello world from processor erc-hpc-comp185, rank 27 out of 32 processors
Hello world from processor erc-hpc-comp185, rank 28 out of 32 processors
Hello world from processor erc-hpc-comp185, rank 19 out of 32 processors
Hello world from processor erc-hpc-comp185, rank 26 out of 32 processors
Hello world from processor erc-hpc-comp185, rank 18 out of 32 processors
Hello world from processor erc-hpc-comp185, rank 23 out of 32 processors
Hello world from processor erc-hpc-comp185, rank 31 out of 32 processors
Example 2: "Hello from..."
This is still a fairly simple program, but it does check that pairs of spawned processes can send and receive messages.
Create a file named mpi_comm.c with the following contents:
#include <stdio.h>
#include <mpi.h>
#include <string.h>
int main(int argc, char** argv) {
int rank, n_ranks, my_pair;
// First call MPI_Init
MPI_Init(&argc, &argv);
// Get my name
char my_processor_name[MPI_MAX_PROCESSOR_NAME];
int name_len;
MPI_Get_processor_name(my_processor_name, &name_len);
// Get the number of ranks
MPI_Comm_size(MPI_COMM_WORLD,&n_ranks);
// Get my rank
MPI_Comm_rank(MPI_COMM_WORLD,&rank);
// Figure out my pair
if( rank%2 == 1 ){
my_pair = rank-1;
} else {
my_pair = rank+1;
}
// Run only if my pair exists
if( my_pair < n_ranks ){
if( rank%2 == 0 ){
char message[32];
strcpy(message, "Hello from ");
strcat(message, my_processor_name);
strcat(message, "\n");
MPI_Send(message, 32, MPI_CHAR, my_pair, 0, MPI_COMM_WORLD);
}
if( rank%2 == 1 ){
char message[32];
MPI_Status status;
printf("From rank %d to %d out of %d: ",my_pair,rank,n_ranks);
MPI_Recv(message, 32, MPI_CHAR, my_pair, 0, MPI_COMM_WORLD, &status);
printf("%s",message);
}
}
// Call finalize at the end
return MPI_Finalize();
}
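To build and launch it, a sequence analogous to Example 1 can be used; a minimal sketch, assuming a submission script named mpi_comm.sh (the binary name mpi_comm_c matches the interactive example further down this page, and the job name and output file are illustrative):
mpicc -o mpi_comm_c mpi_comm.c
with mpi_comm.sh containing, for example:
#!/bin/bash -l
#SBATCH --ntasks-per-node 16
#SBATCH -N 2
#SBATCH --job-name=mpi_comm
#SBATCH --output=mpi_comm.out.%j
mpirun -n 32 mpi_comm_c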
After compiling and running the program, the expected output is something like:
From rank 6 to 7 out of 32: Hello from erc-hpc-comp183
From rank 8 to 9 out of 32: Hello from erc-hpc-comp183
From rank 0 to 1 out of 32: Hello from erc-hpc-comp183
From rank 2 to 3 out of 32: Hello from erc-hpc-comp183
From rank 12 to 13 out of 32: Hello from erc-hpc-comp183
From rank 14 to 15 out of 32: Hello from erc-hpc-comp183
From rank 4 to 5 out of 32: Hello from erc-hpc-comp183
From rank 10 to 11 out of 32: Hello from erc-hpc-comp183
From rank 22 to 23 out of 32: Hello from erc-hpc-comp184
From rank 30 to 31 out of 32: Hello from erc-hpc-comp184
From rank 28 to 29 out of 32: Hello from erc-hpc-comp184
From rank 16 to 17 out of 32: Hello from erc-hpc-comp184
From rank 26 to 27 out of 32: Hello from erc-hpc-comp184
From rank 20 to 21 out of 32: Hello from erc-hpc-comp184
From rank 18 to 19 out of 32: Hello from erc-hpc-comp184
From rank 24 to 25 out of 32: Hello from erc-hpc-comp184
Running MPI in an interactive session
--mpi=pmix can be added to the srun invocation to allow MPI jobs to be run, e.g.
srun -p cpu --ntasks-per-node 28 -N 2 -t 30:00 --mpi=pmix --pty /bin/bash -l
mpirun can then be invoked in the same way as in a submission script, e.g. mpirun -n 56 mpi_comm_c
Known issues
Occasionally a bug results in error messages reporting:
PMIX ERROR: ERROR in file ../../../../../../src/mca/gds/ds12/gds_ds12_lock_pthread.c at line 169
If these errors are seen, either on the command line or from a submission script, export PMIX_MCA_gds=hash before invoking mpirun. This only needs to be done once per session.
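For example, in an interactive session (or near the top of a submission script):
# Work around the PMIX gds error above before launching the MPI program
export PMIX_MCA_gds=hash
mpirun -n 56 mpi_comm_c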