Scaling measurement

Here, we measure the acceleration of the llg3d program as the number of MPI processes increases. Two platforms are tested:

  • An Apple M3 Max laptop with 16 CPU cores (12 are performance cores)

  • A SLURM cluster with 6 nodes of 128 CPU cores (AMD EPYC 7713 64-Core Processor)

Acceleration on an Apple M3 Max

A bash script scaling.sh is used to run the program on a variable number of MPI processes:

#!/bin/bash

# Scale the iteration count N with the process count np
# to keep the total run time roughly constant
echo -e "np\t| N\t| time/ite [s] "
echo -e "-------------------------------------"

for np in 1 2 3 4 5 6 8 10 12 15; do
    N=$((100 * np))
    mpiexec -n "$np" llg3d --N "$N" --Jx 600 --n_mean 0 --result_file "run_$np.npz" > /dev/null 2>&1
    time_per_ite=$(llg3d.extract "run_$np.npz" results/metrics/time_per_ite)
    echo -e "$np\t| $N\t| $time_per_ite"
done

To keep the total execution time roughly constant, the number of iterations N is scaled with the number of processes. The execution time per iteration is then extracted from the result files.
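The extraction step can also be reproduced directly with numpy, since the result files are .npz archives. A minimal sketch, assuming the metric is stored under a flat key (the actual key layout that llg3d.extract resolves from the results/metrics/time_per_ite path is an assumption here):

```python
import numpy as np

# Create a result file containing the metric (key name assumed here;
# llg3d.extract addresses it via the path results/metrics/time_per_ite)
np.savez("run_1.npz", time_per_ite=0.027)

# Read the metric back, as llg3d.extract presumably does
with np.load("run_1.npz") as data:
    time_per_ite = float(data["time_per_ite"])

print(time_per_ite)
```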

The script is called from a run directory:

$ # Create a new directory for the run and move into it
$ mkdir run_m3
$ cd run_m3
$ # Run the scaling script
$ ../scaling.sh
np	| N	| time/ite [s] 
-------------------------------------
1	| 100	| 0.027003140000160783
2	| 200	| 0.013229978544986807
3	| 300	| 0.008791839303448797
4	| 400	| 0.006845292499929201
5	| 500	| 0.005322440499905497
6	| 600	| 0.004086350763488251
8	| 800	| 0.003354668906249572
10	| 1000	| 0.0030978774579707535
12	| 1200	| 0.0031420966666579867
15	| 1500	| 0.0038488076106489946

The plot_acceleration.py script is used to visualize the speed-up: it compares the parallel time per iteration with the sequential one (np = 1) and reports the parallel efficiency (speed-up divided by the number of MPI processes).

$ python plot_acceleration.py run_m3
===================
run_m3
-------------------
n_proc | efficiency
-------|-----------
1      | 1.000
2      | 1.021
3      | 1.024
4      | 0.986
5      | 1.015
6      | 1.101
8      | 1.006
10     | 0.872
12     | 0.716
15     | 0.468
Image saved to run_m3/scaling.png

Scaling on Apple M3 Max

The acceleration is close to the ideal linear scaling up to 8 MPI processes, then degrades, especially above 12 processes, where the slower efficiency cores start to be used.
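The efficiencies above can be recomputed by hand from the timing table: with t_np the time per iteration on np processes, the speed-up is t_1 / t_np and the parallel efficiency is t_1 / (np * t_np). A minimal sketch using a few of the M3 Max timings (this reproduces the values plot_acceleration.py prints; its internals are not shown here):

```python
# Time per iteration measured on the Apple M3 Max (from the table above)
timings = {
    1: 0.027003140000160783,
    2: 0.013229978544986807,
    8: 0.003354668906249572,
    15: 0.0038488076106489946,
}

t1 = timings[1]  # sequential reference time
for nproc, t in timings.items():
    speedup = t1 / t
    efficiency = speedup / nproc
    print(f"{nproc:<6} | {efficiency:.3f}")
```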

Acceleration on a SLURM cluster

The scaling.slurm script submits an array of jobs to the SLURM scheduler on the cluster:

#!/bin/bash

#SBATCH -p public            # targeting the public partition
#SBATCH --ntasks-per-core=1  # disabling multithreading
#SBATCH --exclusive          # exclusive access to the node
#SBATCH -w gaya[1-4]         # specifying nodes to use
#SBATCH -n 200               # reserving 200 compute cores
#SBATCH -J scaling           # naming the job
#SBATCH --array=0-11         # creating a SLURM job array of 12 sub-jobs

NPS=(1 2 4 12 20 25 40 50 100 120 150 200)
np=${NPS[$SLURM_ARRAY_TASK_ID]}  # one process count per array sub-job
N=$((40 * np))                   # scale the iteration count with np

mpirun -np "$np" llg3d --N "$N" --Jx 3000 --Jy 21 --Jz 21 --n_mean 0 --solver mpi --result_file "run_${np}.npz" --profiling

Each job runs the llg3d program with a different number of MPI processes, from 1 to 200. The domain is 5 times longer than in the previous test (Jx = 3000 instead of 600), to keep the computation-to-communication ratio high enough for good scaling. In this first version, the communication is non-blocking, so it can overlap with computation.

This script is submitted to SLURM on a 6-node cluster with 128 cores per node (jobs with more than 128 processes therefore span two nodes):

$ # Create a new directory for the SLURM run and submit the jobs array
$ mkdir -p run_slurm_non_blocking
$ cd run_slurm_non_blocking
$ sbatch ../scaling.slurm
Submitted batch job 45077

Then we submit a blocking version of the same script to compare the scaling:

$ # Create a new directory for the SLURM run and submit the jobs array
$ mkdir -p run_slurm_blocking
$ cd run_slurm_blocking
$ sbatch ../scaling_blocking.slurm
Submitted batch job 45078

Finally, we plot the acceleration of the blocking and non-blocking versions:

$ python plot_acceleration.py run_slurm_non_blocking run_slurm_blocking
===================
run_slurm_non_blocking
-------------------
n_proc | efficiency
-------|-----------
1      | 1.000
2      | 0.824
4      | 1.044
12     | 0.965
20     | 0.722
25     | 0.696
40     | 0.798
50     | 0.848
100    | 1.017
120    | 0.964
150    | 0.911
200    | 0.835
===================
run_slurm_blocking
-------------------
n_proc | efficiency
-------|-----------
1      | 1.000
2      | 0.785
4      | 0.777
12     | 0.686
20     | 0.576
25     | 0.697
40     | 0.643
50     | 0.724
100    | 0.834
120    | 0.811
150    | 0.742
200    | 0.640
Image saved to run_slurm_non_blocking/scaling.png

Scaling on SLURM cluster

In the non-blocking case, the efficiency fluctuates at intermediate process counts (down to about 70% around 20-25 processes) but recovers to above 96% at 100-120 MPI processes, before degrading to 83% at 200 MPI processes. In the blocking case, the efficiency is consistently lower, and at 200 MPI processes it is only 64%, showing the advantage of using non-blocking communications.