Efficiency measurements
The efficiency is defined as the time taken per iteration per point in the simulation domain. This metric allows us to compare the performance of different solvers and configurations.
Note
The lower the efficiency value, the better the performance.
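As an illustration, the metric can be computed as follows (a hypothetical helper, not part of llg3d):

```python
def efficiency(elapsed_s, iterations, domain_shape):
    """Time per iteration per grid point, in seconds; lower is better."""
    points = 1
    for n in domain_shape:
        points *= n
    return elapsed_s / (iterations * points)

# Example: 10 s of compute for 1000 iterations on a 128 x 24 x 24 domain.
print(efficiency(10.0, 1000, (128, 24, 24)))
```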
The efficiency is expected to be:

- for the NumPy solver: almost constant with respect to the domain size;
- for the MPI solver at a constant number of CPU cores: almost constant with respect to the domain size, provided the domain is large enough to amortize the communication overhead;
- for the OpenCL solver: decreasing with respect to the domain size, as the GPU is better utilized with larger domains.
We run the llg3d.bench.efficiency script, which executes the llg3d.benchmarks.efficiency module, to measure the efficiency of the different solvers on different hardware configurations:

- parallelepipedic domains of size \((J_x \times 24 \times 24)\) with \(J_x = 128, 256, \ldots, 4096\)
- solvers:
  - NumPy solver
  - MPI solver with 8 or 32 processes
  - OpenCL solver with a GPU device
The number of iterations is varied according to the domain size to keep the total computation time approximately constant. Tests are performed in single precision for compatibility with the Apple M3 GPU.
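The benchmark matrix described above can be sketched as follows; every name and constant here is illustrative (in particular the base iteration count), not the tool's actual configuration:

```python
from itertools import product

sizes = [128 * 2**k for k in range(6)]      # Jx = 128, 256, ..., 4096
solvers = ["numpy", "mpi", "opencl"]

def iterations_for(jx, base_jx=128, base_iters=8192):
    """Halve the iteration count each time Jx doubles, so the total work
    (iterations * points) stays roughly constant."""
    return max(1, base_iters * base_jx // jx)

configs = [(s, jx, iterations_for(jx)) for s, jx in product(solvers, sizes)]
print(len(configs))                         # 3 solvers x 6 sizes = 18 runs
```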
Performance on an Apple M3 Max
We run the benchmark with 8 MPI processes and 10 repeats for each domain size.
$ llg3d.bench.efficiency run --np 8 --repeats 10 --csv m3/bench_efficiency.csv
It produces the CSV file m3/bench_efficiency.csv that can
be used to generate the report from the m3 directory:
$ llg3d.bench.efficiency report bench_efficiency.csv
CPU: Apple M3 Max | GPU: Apple M3 Max
Domain size NumPy (1 CPU core) MPI (8 CPU cores) (Accel) OpenCL (1 GPU) (Accel)
------------- -------------------- --------------------------- ------------------------
128 3.8e-08 6.4e-09 ( 5.9x) 7.8e-10 ( 48.5x)
256 4.3e-08 5.9e-09 ( 7.4x) 4.5e-10 ( 96.9x)
512 4.3e-08 6.2e-09 ( 7.0x) 2.8e-10 (154.6x)
1024 4.7e-08 7.6e-09 ( 6.2x) 2.4e-10 (193.8x)
2048 5.1e-08 1.1e-08 ( 4.7x) 3.1e-10 (163.1x)
4096 4.1e-08 1.3e-08 ( 3.2x) 3.6e-10 (116.1x)
The corresponding plot is generated with:
$ llg3d.bench.efficiency plot bench_efficiency.csv

The NumPy solver shows an almost constant efficiency, as expected.
The MPI solver with 8 processes shows an efficiency value that increases (i.e., performance degrades) as the domain size grows. The reason for this behavior is not clear.
The OpenCL solver shows a sharply decreasing efficiency value as the domain size increases from 128 to 1024, where a minimum is reached; the efficiency value then increases again with the domain size, while remaining far below that of the NumPy solver.
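The acceleration columns in the report are simply ratios of efficiencies. Recomputing the \(J_x = 1024\) speedup from the displayed values gives a slightly different figure than the reported 193.8x, presumably because the displayed efficiencies are rounded:

```python
numpy_eff = 4.7e-08    # NumPy efficiency at Jx = 1024 (s / iter / point)
opencl_eff = 2.4e-10   # OpenCL efficiency at Jx = 1024
print(f"{numpy_eff / opencl_eff:.0f}x")  # 196x vs the reported 193.8x
```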
Acceleration
For \(J_x = 1024\), OpenCL is almost \(200 \times\) faster than NumPy!
Performance on a multi-core server with GPUs
We now test the performance on a server with 32 CPU cores equipped with 3 AMD
Instinct MI210 GPUs (only one GPU is used here) using the following
utils/efficiency/sbatch.slurm script:
#!/bin/bash
#SBATCH -p gpu # targeting the gpu partition
#SBATCH --ntasks-per-core=1 # disabling multithreading
#SBATCH -n 32 # number of CPU cores
#SBATCH --gres=gpu:1 # ask for 1 GPU
#SBATCH -J llg3d-gpu # job name
# Activating the Python virtual environment
source .venv/bin/activate
# Launching the computation
llg3d.bench.efficiency run --repeats 10 --csv gaya/bench_efficiency.csv
We submit the job with:
$ sbatch sbatch.slurm
It produces the CSV file gaya/bench_efficiency.csv that can
be used to generate the report from the gaya directory:
$ llg3d.bench.efficiency report bench_efficiency.csv
CPU: AMD EPYC 7313 16-Core Processor | GPU: AMD Instinct MI210
Domain size NumPy (1 CPU core) MPI (32 CPU cores) (Accel) OpenCL (1 GPU) (Accel)
------------- -------------------- ---------------------------- ------------------------
128 7.9e-08 5.8e-09 ( 13.6x) 4.5e-10 (175.8x)
256 8e-08 4.0e-09 ( 19.7x) 2.9e-10 (272.0x)
512 8.4e-08 3.6e-09 ( 23.7x) 2.3e-10 (372.6x)
1024 8.5e-08 3.3e-09 ( 25.5x) 1.9e-10 (439.9x)
2048 8.5e-08 3.0e-09 ( 28.0x) 1.6e-10 (517.7x)
4096 9.4e-08 3.5e-09 ( 26.6x) 1.5e-10 (643.2x)
The corresponding plot is generated with:
$ llg3d.bench.efficiency plot bench_efficiency.csv

The NumPy solver shows an almost constant efficiency, as expected.
The MPI solver with 32 processes shows a decreasing efficiency value with the domain size up to \(J_x = 2048\), followed by a slight increase at 4096. Compared to the NumPy executions, the acceleration ranges from about \(24\times\) to \(28\times\) over the 512-4096 range, approaching the ideal factor of 32.
On this MI210 GPU, the OpenCL solver shows a monotonically decreasing efficiency value, improving by a factor of 3 from 128 to 4096.
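The factor of 3 can be checked directly against the MI210 OpenCL column of the table above:

```python
# Efficiency values (s / iter / point) copied from the report table.
opencl_mi210 = {128: 4.5e-10, 256: 2.9e-10, 512: 2.3e-10,
                1024: 1.9e-10, 2048: 1.6e-10, 4096: 1.5e-10}

# The values decrease monotonically with Jx...
values = list(opencl_mi210.values())
assert all(a > b for a, b in zip(values, values[1:]))

# ...and improve by a factor of 3 overall between 128 and 4096.
print(round(opencl_mi210[128] / opencl_mi210[4096], 1))  # 3.0
```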
Acceleration
For \(1024 \leq J_x \leq 4096\), OpenCL is between \(440\times\) and \(640\times\) faster than NumPy!
Comparison of three GPU platforms
We compare the efficiency of the OpenCL solver on three different GPU platforms in single precision:

- Apple M3 Max (64GB RAM, 40 GPU cores)
- AMD Instinct MI210
- NVIDIA V100
$ llg3d.bench.efficiency compare m3/bench_efficiency.csv gaya/bench_efficiency.csv v100/bench_efficiency.csv --solver opencl
Saved comparison plot to bench_efficiency_comparison_opencl.png

Note
The efficiency benchmark runs with no data recording: only the compute kernels are measured. In practice, the total execution time of a simulation includes additional overheads such as data recording, which can significantly impact the overall performance, especially for OpenCL where the compute time is very low.
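To illustrate the point, here is a minimal sketch (with stand-in functions, not llg3d code) of timing compute and recording separately, under the assumption that one recording step is much slower than one kernel launch:

```python
import time

def compute_step():
    pass                       # stand-in for a fast GPU kernel launch

def record_step():
    time.sleep(0.001)          # stand-in for comparatively slow data recording

t_compute = t_record = 0.0
for _ in range(100):
    t0 = time.perf_counter()
    compute_step()
    t_compute += time.perf_counter() - t0
    t0 = time.perf_counter()
    record_step()
    t_record += time.perf_counter() - t0

# With a very cheap kernel, recording dominates the total execution time.
print(t_record > t_compute)
```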