Execute on a SLURM cluster¶
A computing cluster is the preferred environment for running LLG3D simulations, which can benefit from:
multiple CPU cores using MPI
GPU acceleration using OpenCL
a large number of compute nodes and GPU devices for parametric studies
Here, we illustrate how to run LLG3D on a SLURM cluster using job arrays to perform simulations at various temperatures in parallel.
Install LLG3D on the Cluster¶
First create a working directory and clone the LLG3D sources:
mkdir work
cd work
git clone git@gitlab.math.unistra.fr:llg3d/llg3d.git
Then install LLG3D into a virtual environment, either with uv:
uv venv
source .venv/bin/activate
uv sync --all-extras --active --directory ../llg3d
or with virtualenv and pip:
virtualenv .venv
source .venv/bin/activate
# Install LLG3D with the MPI and OpenCL extras, in editable mode:
pip install -e "../llg3d[mpi,opencl]"
Parallel Execution¶
Create a Run Directory¶
Create a run directory (outside the cloned repository) and move into it:
mkdir run
cd run
The work directory structure is now as follows:
work/
├── .venv/ # Python virtual environment
├── llg3d/ # LLG3D source code
└── run/ # Run directory
Create an sbatch File¶
In this example, we create a SLURM job array across multiple temperatures for both OpenCL and MPI on the gaya cluster, which has six 128‑core CPU nodes (public partition) and a 3‑GPU node (gpu partition).
Copy the utils/slurm/opencl/sbatch_jobarrays.slurm file into the run directory:
cp ../llg3d/utils/slurm/opencl/sbatch_jobarrays.slurm .
Its content is as follows:
#!/bin/bash
#SBATCH -p gpu # targeting the gpu partition
#SBATCH --ntasks-per-core=1 # disabling multithreading
#SBATCH --gres=gpu:1 # ask for 1 GPU
#SBATCH -J llg3d-gpu # job name
#SBATCH --array=0-12 # creating a SLURM job array of 13 sub-jobs
# Array of temperatures
TEMPERATURES=(1000 1100 1200 1300 1350 1390 1400 1410 1450 1500 1550 1700 1900)
# If the number of SLURM tasks is different from the size of TEMPERATURES, we exit
if [ $SLURM_ARRAY_TASK_COUNT -ne ${#TEMPERATURES[@]} ]
then
echo "number of tasks != number of temperatures"
echo "($SLURM_ARRAY_TASK_COUNT != ${#TEMPERATURES[@]})"
exit 1
fi
# JOB TASK ID with 3 zero padding
id=$(printf %03d $SLURM_ARRAY_TASK_ID)
# Run temperature
let "temperature = ${TEMPERATURES[$SLURM_ARRAY_TASK_ID]}"
# Activating the Python virtual environment
source ../.venv/bin/activate
# Launching the computation
llg3d --solver opencl --N 20000 --start_averaging 12000 --Jx 3000 --dx 1e-9 --T $temperature --result_file run_T${temperature}K.npz
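Before submitting, it can help to preview which sub-job will produce which result file. The sketch below mirrors the mapping encoded in the script (the temperature list is copied from the `TEMPERATURES` array, the file-name pattern from the `--result_file` option); it is a local dry run, not part of LLG3D:

```python
# Preview the task-ID -> temperature -> result-file mapping of the job array.
temperatures = [1000, 1100, 1200, 1300, 1350, 1390, 1400, 1410, 1450,
                1500, 1550, 1700, 1900]

# The script's guard: #SBATCH --array=0-12 must match the list length.
assert len(temperatures) == 13

for task_id, T in enumerate(temperatures):
    # Each SLURM_ARRAY_TASK_ID selects one temperature and one output file.
    print(f"sub-job {task_id:03d} -> T = {T} K -> run_T{T}K.npz")
```

Running this prints one line per sub-job, which can be checked against the `run_T*K.npz` files expected after the array completes.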
Copy the utils/slurm/mpi/sbatch_jobarrays.slurm file into the run directory:
cp ../llg3d/utils/slurm/mpi/sbatch_jobarrays.slurm .
Its content is as follows:
#!/bin/bash
#SBATCH -p public # targeting the public partition
#SBATCH --ntasks-per-core=1 # disabling multithreading
#SBATCH -n 40 # ask for 40 compute cores
#SBATCH -J llg3d-mpi # naming the job
#SBATCH --array=0-12 # creating a SLURM job array of 13 sub-jobs
# Array of temperatures
TEMPERATURES=(1000 1100 1200 1300 1350 1390 1400 1410 1450 1500 1550 1700 1900)
# If the number of SLURM tasks is different from the size of TEMPERATURES, we exit
if [ $SLURM_ARRAY_TASK_COUNT -ne ${#TEMPERATURES[@]} ]
then
echo "number of tasks != number of temperatures"
echo "($SLURM_ARRAY_TASK_COUNT != ${#TEMPERATURES[@]})"
exit 1
fi
# JOB TASK ID with 3 zero padding
id=$(printf %03d $SLURM_ARRAY_TASK_ID)
# Run temperature
let "temperature = ${TEMPERATURES[$SLURM_ARRAY_TASK_ID]}"
# Activating the Python virtual environment
source ../.venv/bin/activate
# Launching the computation
mpirun -np $SLURM_NTASKS llg3d --solver mpi --N 20000 --start_averaging 17500 --Jx 3000 --dx 1e-9 --T $temperature --result_file run_T${temperature}K.npz
Submit the Job Array¶
(run) $ sbatch sbatch_jobarrays.slurm
Submitted batch job 50221
The execution will create a SLURM job array where each sub-job corresponds to a temperature.
Monitor Job Execution¶
(run) $ squeue -u boileau
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
45116_[3-12] gpu llg3d-gp boileau PD 0:00 1 (Resources)
45116_1 gpu llg3d-gp boileau R 0:04 1 gaya-gpu
45116_2 gpu llg3d-gp boileau R 0:04 1 gaya-gpu
45116_0 gpu llg3d-gp boileau R 0:05 1 gaya-gpu
Sub-jobs 0-2 have already started (state R, running), while sub-jobs 3-12 are still waiting for resources (state PD, pending).
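When scripting around the queue, the state column can also be tallied programmatically. A minimal sketch, using the squeue output shown above as input (the ST column holds the state code; note that pending array tasks are aggregated into a single `[3-12]` row):

```python
# Count sub-job states from squeue output; field 5 (ST) is the state code.
squeue_output = """\
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
45116_[3-12] gpu llg3d-gp boileau PD 0:00 1 (Resources)
45116_1 gpu llg3d-gp boileau R 0:04 1 gaya-gpu
45116_2 gpu llg3d-gp boileau R 0:04 1 gaya-gpu
45116_0 gpu llg3d-gp boileau R 0:05 1 gaya-gpu"""

# Skip the header line, then take the 5th whitespace-separated field.
states = [line.split()[4] for line in squeue_output.splitlines()[1:]]
print(states.count("R"), "running,", states.count("PD"), "pending row(s)")
```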
(run) $ squeue -u boileau
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
50221_5 public llg3d-mp boileau R 0:04 1 gaya2
50221_6 public llg3d-mp boileau R 0:04 1 gaya3
50221_7 public llg3d-mp boileau R 0:04 1 gaya3
50221_8 public llg3d-mp boileau R 0:04 1 gaya3
50221_9 public llg3d-mp boileau R 0:04 1 gaya4
50221_10 public llg3d-mp boileau R 0:04 1 gaya4
50221_11 public llg3d-mp boileau R 0:04 1 gaya4
50221_12 public llg3d-mp boileau R 0:04 1 gaya5
50221_0 public llg3d-mp boileau R 0:05 1 gaya1
50221_1 public llg3d-mp boileau R 0:05 1 gaya1
50221_2 public llg3d-mp boileau R 0:05 1 gaya1
50221_3 public llg3d-mp boileau R 0:05 1 gaya2
50221_4 public llg3d-mp boileau R 0:05 1 gaya2
This time, all 13 sub-jobs are already running (state R).
When the jobs are finished, they leave the queue:
(run) $ squeue -u boileau
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
The execution produces the following directory structure:
(run) $ tree
.
├── run_T1000K.npz
├── run_T1100K.npz
├── run_T1200K.npz
├── run_T1300K.npz
├── run_T1350K.npz
├── run_T1390K.npz
├── run_T1400K.npz
├── run_T1410K.npz
├── run_T1450K.npz
├── run_T1500K.npz
├── run_T1550K.npz
├── run_T1700K.npz
├── run_T1900K.npz
├── sbatch_jobarrays.slurm
├── slurm-50221_0.out
├── slurm-50221_10.out
├── slurm-50221_11.out
├── slurm-50221_12.out
├── slurm-50221_1.out
├── slurm-50221_2.out
├── slurm-50221_3.out
├── slurm-50221_4.out
├── slurm-50221_5.out
├── slurm-50221_6.out
├── slurm-50221_7.out
├── slurm-50221_8.out
└── slurm-50221_9.out
1 directory, 27 files
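The `run_T<temp>K.npz` naming convention makes the temperature recoverable from each file name, which is handy for iterating over results in temperature order. The helper below is a hypothetical sketch, not part of llg3d (and `run_T900K.npz` is only an illustrative name, not one of the files above):

```python
import re
from pathlib import Path

def temperature_from_name(path):
    """Extract the temperature in kelvin from a run_T<temp>K.npz file name."""
    match = re.fullmatch(r"run_T(\d+)K\.npz", Path(path).name)
    if match is None:
        raise ValueError(f"unexpected result file name: {path}")
    return int(match.group(1))

# Numeric sort by temperature, regardless of glob/lexicographic order:
files = ["run_T1350K.npz", "run_T1000K.npz", "run_T900K.npz"]
print(sorted(files, key=temperature_from_name))
```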
Process the Results¶
Use the llg3d.m1_vs_T command, which calls the llg3d.post.m1_vs_T.plot_m1_vs_T() function to gather the results stored in the run_*.npz files and plot the average magnetization as a function of temperature (shown here for the OpenCL execution):
(run) $ llg3d.m1_vs_T run_*.npz -i m1_vs_T_opencl.png
Processing file: run_T1000K.npz
Processing file: run_T1100K.npz
Processing file: run_T1200K.npz
Processing file: run_T1300K.npz
Processing file: run_T1350K.npz
Processing file: run_T1390K.npz
Processing file: run_T1400K.npz
Processing file: run_T1410K.npz
Processing file: run_T1450K.npz
Processing file: run_T1500K.npz
Processing file: run_T1550K.npz
Processing file: run_T1700K.npz
Processing file: run_T1900K.npz
T_Curie = 1393 K
Image saved in m1_vs_T_opencl.png
The Curie temperature is computed as the value where the average magnetization drops below 0.1.
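That criterion can be sketched as follows. This is an illustrative reimplementation, not the llg3d source, and the data values are made up for the example (the real values live in the run_*.npz files):

```python
import numpy as np

def curie_temperature(temperatures, magnetizations, threshold=0.1):
    """Return the lowest temperature at which the average magnetization
    drops below the threshold (the criterion described above)."""
    order = np.argsort(temperatures)
    T = np.asarray(temperatures)[order]
    m1 = np.asarray(magnetizations)[order]
    below = np.flatnonzero(m1 < threshold)
    if below.size == 0:
        raise ValueError("magnetization never drops below the threshold")
    return T[below[0]]

# Illustrative values only:
T = [1300, 1350, 1390, 1400, 1410]
m1 = [0.45, 0.30, 0.12, 0.08, 0.05]
print(curie_temperature(T, m1))  # first temperature with m1 < 0.1
```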
The resulting plot, saved as m1_vs_T_opencl.png, shows the average magnetization as a function of temperature.