Profiling

Using the internal profiler

NumPy-based profiling

Each function decorated with llg3d.solvers.profiling.timeit() records its execution time in run.npz. Launch LLG3D with the internal profiler enabled:

$ llg3d --N 100 --solver numpy --n_mean 1 --Jx 150 --Jy 11 --Jz 11 --profiling ; rm -f run.npz
...
total_time [s]        = 0.514
time/ite [s/ite]      = 5.141e-03
efficiency [s/ite/pt] = 2.833e-07
CFL                   = 7.543e-02
Profiling info:
Function           | Calls | total_time (s) |      % |  Avg Time (s)
--------------------------------------------------------------------
_compute_slope     |   200 |       0.321811 |  62.6% |      0.001609
_compute_laplacian |   200 |       0.217171 |  42.2% |      0.001086
_laplacian_3d      |   600 |       0.206513 |  40.2% |      0.000344
_get_R_random      |   100 |       0.096505 |  18.8% |      0.000965
_normalize         |   100 |       0.011847 |   2.3% |      0.000118
_xyz_average       |   101 |       0.008457 |   1.6% |      0.000084

Saving run.npz

This simple profiling shows that _compute_slope dominates the run (62.6% of the total time), most of which is spent in _compute_laplacian and its three _laplacian_3d calls per invocation (percentages overlap because these functions are nested).
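A decorator like timeit() can be sketched as follows. This is a minimal, hypothetical version for illustration only; the actual llg3d.solvers.profiling implementation may accumulate and save its timings differently:

```python
"""Minimal sketch of a timeit-style profiling decorator (hypothetical,
not the actual llg3d.solvers.profiling implementation)."""
import functools
import time
from collections import defaultdict

# Cumulative time and call count per function name
STATS = defaultdict(lambda: {"calls": 0, "total_time": 0.0})

def timeit(func):
    """Record the cumulative execution time and call count of func."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        STATS[func.__name__]["calls"] += 1
        STATS[func.__name__]["total_time"] += time.perf_counter() - start
        return result
    return wrapper

@timeit
def _normalize(x):
    return x  # placeholder body

for _ in range(100):
    _normalize(1.0)

print(STATS["_normalize"]["calls"])  # 100
```

A report like the table above can then be produced by sorting STATS by total_time and dividing by the call count for the average.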

OpenCL-based profiling

Each kernel is profiled through OpenCL events. At the end of the time loop, the events are processed and the cumulative time and call count of each kernel and function are stored.
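The post-loop event processing can be sketched as follows. A mock event class stands in for real pyopencl events here, so the sketch runs without an OpenCL device; OpenCL reports event timestamps in nanoseconds:

```python
"""Sketch of accumulating per-kernel times from OpenCL-style profiling
events (mock events; real code would read pyopencl event.profile)."""
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class MockEvent:
    """Stand-in for a pyopencl event: timestamps in nanoseconds."""
    kernel_name: str
    start: int
    end: int

def accumulate(events):
    """Return {kernel: [calls, total_time_s]} from a list of events."""
    stats = defaultdict(lambda: [0, 0.0])
    for ev in events:
        stats[ev.kernel_name][0] += 1
        stats[ev.kernel_name][1] += (ev.end - ev.start) * 1e-9
    return dict(stats)

events = [MockEvent("kernel_predict", 0, 12_000),
          MockEvent("kernel_predict", 20_000, 31_000),
          MockEvent("kernel_correct_and_normalize", 40_000, 51_000)]
stats = accumulate(events)
print(stats["kernel_predict"])  # 2 calls, ~2.3e-05 s
```

Deferring this processing to after the time loop avoids synchronizing the command queue at every iteration.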

$ llg3d --N 1000 --solver opencl --precision single --n_mean 0 --T 1000 --profiling ; rm -f run.npz
...
total_time [s]        = 0.265
time/ite [s/ite]      = 2.652e-04
efficiency [s/ite/pt] = 2.005e-09
CFL                   = 7.543e-02
Profiling info:
Function                     | Calls | total_time (s) |      % |  Avg Time (s)
------------------------------------------------------------------------------
_compute_R_random            |  1000 |       0.016763 |   6.3% |      0.000017
kernel_predict               |  1000 |       0.011867 |   4.5% |      0.000012
kernel_correct_and_normalize |  1000 |       0.011297 |   4.3% |      0.000011

Saving run.npz

This simple profiling shows that an iteration step without space averaging is divided into three parts of comparable cost: _compute_R_random, kernel_predict, and kernel_correct_and_normalize.

In reality, these values correspond to the time spent inside the kernels and do not include the overhead of the OpenCL driver. That's why the total time of an iteration step is much higher than the sum of the kernel times.

If we now enable space averaging with --n_mean 1:

$ llg3d --N 1000 --solver opencl --precision single --n_mean 1 --T 1000 --profiling ; rm -f run.npz
...
total_time [s]        = 0.474
time/ite [s/ite]      = 4.739e-04
efficiency [s/ite/pt] = 3.582e-09
CFL                   = 7.543e-02
Profiling info:
Function                     | Calls | total_time (s) |      % |  Avg Time (s)
------------------------------------------------------------------------------
_xyz_average                 |  1001 |       0.127841 |  27.0% |      0.000128
_compute_R_random            |  1000 |       0.017378 |   3.7% |      0.000017
kernel_predict               |  1000 |       0.012006 |   2.5% |      0.000012
kernel_correct_and_normalize |  1000 |       0.011546 |   2.4% |      0.000012
kernel_sum_m1_weighted       |  1001 |       0.010019 |   2.1% |      0.000010
kernel_reduce_partial_sums   |  1001 |       0.007663 |   1.6% |      0.000008

Saving run.npz

We can see that the _xyz_average step (which makes the calls to kernel_sum_m1_weighted and kernel_reduce_partial_sums of llg3d.solvers.opencl.OpenCLSolver) accounts for almost half of the total time: the time per iteration nearly doubles, from 2.652e-04 s to 4.739e-04 s.
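The two-kernel reduction behind _xyz_average can be illustrated in NumPy. The array shape, weights, and group size below are hypothetical; the real kernels perform the same pattern on the device:

```python
"""NumPy illustration of a two-stage weighted sum: partial sums per
work-group, then a reduction of the partial sums (the pattern behind
kernel_sum_m1_weighted and kernel_reduce_partial_sums)."""
import numpy as np

rng = np.random.default_rng(0)
m1 = rng.random(1024)      # flattened magnetization component (hypothetical)
w = rng.random(1024)       # averaging weights (hypothetical)
group_size = 64            # hypothetical work-group size

# Stage 1: each "work-group" computes a weighted partial sum of its chunk
partial = (m1 * w).reshape(-1, group_size).sum(axis=1)

# Stage 2: reduce the partial sums to a single scalar
total = partial.sum()

# The two-stage result matches the direct weighted sum
assert np.isclose(total, np.dot(m1, w))
```

On a GPU, stage 1 runs in parallel with one partial sum per work-group, and stage 2 reduces the small array of partial sums.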

Using cProfile

Launch with cProfile:

mpirun -np 4 python -m cProfile -o llg3d_np4.prof -m llg3d -N 5000

Visualize with snakeviz:

snakeviz llg3d_np4.prof

To produce one profile file per MPI process (with the command above, all ranks write to the same file), you can use the following script:

"""Profile using cProfile and dump one file per process."""
import cProfile
from mpi4py.MPI import COMM_WORLD as comm
from llg3d import main

pr = cProfile.Profile()
pr.enable()
main.main(["--solver", "mpi", "--N", "500"])
pr.disable()
pr.dump_stats(f'cpu_{comm.rank}.prof')
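The resulting per-rank files can also be inspected without snakeviz, using the standard pstats module. A self-contained sketch, with a dummy workload standing in for an actual llg3d run:

```python
"""Inspect a cProfile dump with pstats (dummy workload for illustration)."""
import cProfile
import pstats

def work():
    """Dummy workload standing in for the solver run."""
    return sum(i * i for i in range(10_000))

# Produce a profile file, as the per-rank script above would
pr = cProfile.Profile()
pr.enable()
work()
pr.disable()
pr.dump_stats("cpu_0.prof")

# Load the dump and print the 10 most expensive entries
stats = pstats.Stats("cpu_0.prof")
stats.sort_stats("cumulative").print_stats(10)
```

sort_stats("tottime") shows time spent in each function excluding its callees, which is often more useful for finding hot spots.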