Scaling measurement
Here, we measure the acceleration of the llg3d program when increasing the number of MPI processes on a SLURM cluster.
Two platforms are tested:

- an Apple M3 Max laptop with 16 CPU cores (12 of them performance cores)
- a SLURM cluster with 6 nodes of 128 CPU cores each (AMD EPYC 7713 64-Core Processor)
Acceleration on an Apple M3 Max
A bash script scaling.sh is used to run the program on a variable number of MPI processes:
#!/bin/bash
echo -e "np\t| N\t| time/ite [s] "
echo -e "-------------------------------------"
for np in 1 2 3 4 5 6 8 10 12 15; do
N=$((100 * np))
mpiexec -n "$np" llg3d --N "$N" --Jx 600 --n_mean 0 --result_file run_$np.npz > /dev/null 2>&1
time_per_ite=$(llg3d.extract run_$np.npz results/metrics/time_per_ite)
echo -e "$np\t| $N\t| $time_per_ite"
done
To keep the total execution time roughly constant, the number of iterations N is scaled with the number of processes. The execution time per iteration is then extracted from the result files.
The script is called from a run directory:
$ # Create a new directory for the run and move into it
$ mkdir run_m3
$ cd run_m3
$ # Run the scaling script
$ ../scaling.sh
np | N | time/ite [s]
-------------------------------------
1 | 100 | 0.027003140000160783
2 | 200 | 0.013229978544986807
3 | 300 | 0.008791839303448797
4 | 400 | 0.006845292499929201
5 | 500 | 0.005322440499905497
6 | 600 | 0.004086350763488251
8 | 800 | 0.003354668906249572
10 | 1000 | 0.0030978774579707535
12 | 1200 | 0.0031420966666579867
15 | 1500 | 0.0038488076106489946
The plot_acceleration.py script is used to visualize the speed-up: it compares each parallel execution with the sequential one (np = 1) as a function of the number of MPI processes.
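The efficiency it reports can be reproduced from the per-iteration timings above; a minimal sketch of the formula (this is not the actual plot_acceleration.py, which reads the timings from the .npz result files):

```python
# Parallel efficiency from per-iteration timings: eff = t_seq / (np * t_par).
# Timings copied from the run_m3 table above.
timings = {1: 0.027003140000160783, 2: 0.013229978544986807,
           8: 0.003354668906249572, 15: 0.0038488076106489946}

t_seq = timings[1]  # sequential reference (np = 1)
for n_proc, t_par in timings.items():
    eff = t_seq / (n_proc * t_par)
    print(f"{n_proc:>2} | {eff:.3f}")
```

With these values, the printed efficiencies match the table above (1.021 for 2 processes, 1.006 for 8, 0.468 for 15).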
$ python plot_acceleration.py run_m3
===================
run_m3
-------------------
n_proc | efficiency
-------|-----------
1 | 1.000
2 | 1.021
3 | 1.024
4 | 0.986
5 | 1.015
6 | 1.101
8 | 1.006
10 | 0.872
12 | 0.716
15 | 0.468
Image saved to run_m3/scaling.png

The acceleration stays close to the ideal linear scaling up to 8 MPI processes, then degrades, especially above 12 processes, where the slower efficiency cores come into play.
Acceleration on a SLURM cluster
The scaling.slurm script submits an array of jobs to the SLURM scheduler on the cluster:
#!/bin/bash
#SBATCH -p public # targeting the public partition
#SBATCH --ntasks-per-core=1 # disabling multithreading
#SBATCH --exclusive # exclusive access to the node
#SBATCH -w gaya[1-4] # specifying nodes to use
#SBATCH -n 200 # reserving 200 compute cores
#SBATCH -J scaling # naming the job
#SBATCH --array=0-11 # creating a SLURM job array of 12 sub-jobs
NPS=(1 2 4 12 20 25 40 50 100 120 150 200)
np=${NPS[$SLURM_ARRAY_TASK_ID]}
N=$((40 * np))
mpirun -np $np llg3d --N ${N} --Jx 3000 --Jy 21 --Jz 21 --n_mean 0 --solver mpi --result_file run_${np}.npz --profiling
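The mapping from array index to process count can be checked outside SLURM; a small Python sketch of the same mapping, with the values copied from the script above:

```python
# Same index -> (np, N) mapping as in scaling.slurm, reproduced for clarity.
NPS = [1, 2, 4, 12, 20, 25, 40, 50, 100, 120, 150, 200]

for task_id, n_proc in enumerate(NPS):
    N = 40 * n_proc  # iteration count grows with the process count
    print(f"task {task_id:>2} -> np={n_proc:>3}, N={N}")
```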
Each job runs the llg3d program with a different number of MPI processes, from 1 to 200.
The domain is 5 times longer than in the previous test (Jx = 3000 instead of 600) in order to keep a computation-to-communication ratio compatible with good scaling.
A first run performs the communication in a non-blocking way to maximize the scaling.
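The effect of the longer domain on the computation-to-communication ratio can be estimated with a back-of-the-envelope calculation, assuming a 1-D slab decomposition along x (an assumption; llg3d's actual decomposition may differ):

```python
# Rough compute-to-communication ratio per rank for a 1-D split along x:
# each rank updates (Jx / np) * Jy * Jz cells but only exchanges its two
# boundary faces of Jy * Jz cells with its neighbours.
def ratio(Jx, Jy, Jz, n_proc):
    cells = (Jx // n_proc) * Jy * Jz  # cells updated by one rank
    halo = 2 * Jy * Jz                # cells exchanged per iteration
    return cells / halo

print(ratio(600, 21, 21, 200))   # short domain at 200 ranks: 1.5
print(ratio(3000, 21, 21, 200))  # 5x longer domain at 200 ranks: 7.5
```

At 200 processes, the 5× longer domain keeps 7.5 interior cells per exchanged halo cell instead of 1.5, which is why the larger Jx is needed for good scaling.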
This script is submitted to SLURM on the 6-node cluster with 128 cores per node, so the runs with up to 128 processes fit on a single node, while the 150- and 200-process runs span two nodes:
$ # Create a new directory for the SLURM run and submit the jobs array
$ mkdir -p run_slurm_non_blocking
$ cd run_slurm_non_blocking
$ sbatch ../scaling.slurm
Submitted batch job 45077
Then we submit a blocking version of the same script to compare the scaling:
$ # Create a new directory for the SLURM run and submit the jobs array
$ mkdir -p run_slurm_blocking
$ cd run_slurm_blocking
$ sbatch ../scaling_blocking.slurm
Submitted batch job 45078
Finally, we plot the acceleration of the blocking and non-blocking versions:
$ python plot_acceleration.py run_slurm_non_blocking run_slurm_blocking
===================
run_slurm_non_blocking
-------------------
n_proc | efficiency
-------|-----------
1 | 1.000
2 | 0.824
4 | 1.044
12 | 0.965
20 | 0.722
25 | 0.696
40 | 0.798
50 | 0.848
100 | 1.017
120 | 0.964
150 | 0.911
200 | 0.835
===================
run_slurm_blocking
-------------------
n_proc | efficiency
-------|-----------
1 | 1.000
2 | 0.785
4 | 0.777
12 | 0.686
20 | 0.576
25 | 0.697
40 | 0.643
50 | 0.724
100 | 0.834
120 | 0.811
150 | 0.742
200 | 0.640
Image saved to run_slurm_non_blocking/scaling.png

In the non-blocking case, the acceleration is close to the ideal linear scaling up to 120 MPI processes (above 96% parallel efficiency) and then degrades to 83% parallel efficiency at 200 MPI processes. In the blocking case, the acceleration starts to degrade above 25 MPI processes. At 200 MPI processes, the parallel efficiency is only 64%, showing the advantage of using non-blocking communications.