Profiling¶
Using the internal profiler¶
NumPy-based profiling¶
Each function decorated with llg3d.solvers.profiling.timeit() records its execution time in run.npz.
Launch LLG3D with the internal profiler enabled:
$ llg3d --N 100 --solver numpy --n_mean 1 --Jx 150 --Jy 11 --Jz 11 --profiling ; rm -f run.npz
...
total_time [s] = 0.514
time/ite [s/ite] = 5.141e-03
efficiency [s/ite/pt] = 2.833e-07
CFL = 7.543e-02
Profiling info:
Function | Calls | total_time (s) | % | Avg Time (s)
--------------------------------------------------------------------
_compute_slope | 200 | 0.321811 | 62.6% | 0.001609
_compute_laplacian | 200 | 0.217171 | 42.2% | 0.001086
_laplacian_3d | 600 | 0.206513 | 40.2% | 0.000344
_get_R_random | 100 | 0.096505 | 18.8% | 0.000965
_normalize | 100 | 0.011847 | 2.3% | 0.000118
_xyz_average | 101 | 0.008457 | 1.6% | 0.000084
Saving run.npz
This simple profiling shows that:
An iteration step is mostly spent in the
llg3d.solvers.numpy.NumPySolver._compute_slope()method which is called twice per iteration (for the prediction and correction steps).Most of the time in this method is spent in the
llg3d.solvers.numpy.NumPySolver._compute_laplacian()method.The
llg3d.solvers.base.BaseSolver._get_R_random()method takes about the sixth of the time of the whole iteration step, while the rest is negligible.
OpenCL-based profiling¶
Each kernel is profiled using the OpenCL profiler using OpenCL events. The events are processed at the end of the time loop and the cumulative time and call counts for each kernel and function are stored.
$ llg3d --N 1000 --solver opencl --precision single --n_mean 0 --T 1000 --profiling ; rm -f run.npz
...
total_time [s] = 0.265
time/ite [s/ite] = 2.652e-04
efficiency [s/ite/pt] = 2.005e-09
CFL = 7.543e-02
Profiling info:
Function | Calls | total_time (s) | % | Avg Time (s)
------------------------------------------------------------------------------
_compute_R_random | 1000 | 0.016763 | 6.3% | 0.000017
kernel_predict | 1000 | 0.011867 | 4.5% | 0.000012
kernel_correct_and_normalize | 1000 | 0.011297 | 4.3% | 0.000011
Saving run.npz
This simple profiling shows that an iteration step without space averaging is divided into 3 equivalent parts:
random field generation with
llg3d.solvers.opencl.OpenCLSolver._compute_R_random(),prediction step with
kernel_predictofllg3d.solvers.opencl.OpenCLSolver,correction step with
kernel_correct_and_normalizeofllg3d.solvers.opencl.OpenCLSolver.
In reality, this values correspond to the time spent in the kernels, and do not include the overhead the OpenCL driver. That’s why the total time of the iteration step is much higher than the sum of the kernel times.
If we now enable space averaging with --n_mean 1:
$ llg3d --N 1000 --solver opencl --precision single --n_mean 1 --T 1000 --profiling ; rm -f run.npz
...
total_time [s] = 0.474
time/ite [s/ite] = 4.739e-04
efficiency [s/ite/pt] = 3.582e-09
CFL = 7.543e-02
Profiling info:
Function | Calls | total_time (s) | % | Avg Time (s)
------------------------------------------------------------------------------
_xyz_average | 1001 | 0.127841 | 27.0% | 0.000128
_compute_R_random | 1000 | 0.017378 | 3.7% | 0.000017
kernel_predict | 1000 | 0.012006 | 2.5% | 0.000012
kernel_correct_and_normalize | 1000 | 0.011546 | 2.4% | 0.000012
kernel_sum_m1_weighted | 1001 | 0.010019 | 2.1% | 0.000010
kernel_reduce_partial_sums | 1001 | 0.007663 | 1.6% | 0.000008
Saving run.npz
We can see that the _xyz_average step (that makes the calls to kernel_sum_m1_weighted
and kernel_reduce_partial_sums of llg3d.solvers.opencl.OpenCLSolver)
takes almost a half of the total time.
Using cProfile¶
Launch with cProfile:
mpirun -np 4 python -m cProfile -o llg3d_np4.prof -m llg3d -N 5000
Visualize with snakeviz:
snakeviz llg3d_np4.prof
To produce one profile file per MPI process, you can use the following script:
"""Profile using cProfile and dump one file per process."""
import cProfile
from mpi4py.MPI import COMM_WORLD as comm
from llg3d import main
pr = cProfile.Profile()
pr.enable()
main.main(["--solver", "mpi", "--N", "500"])
pr.disable()
pr.dump_stats(f'cpu_{comm.rank}.prof')