Execute on a SLURM cluster

A computing cluster is the preferred environment for running LLG3D simulations, which may benefit from:

  • Multiple CPU cores using MPI

  • GPU acceleration using OpenCL

  • A large number of compute nodes and GPU devices for parametric studies

Here, we illustrate how to run LLG3D on a SLURM cluster using job arrays to perform simulations at various temperatures in parallel.

Install LLG3D on the Cluster

First create a working directory and clone the LLG3D sources:

mkdir work
cd work
git clone git@gitlab.math.unistra.fr:llg3d/llg3d.git

Then install LLG3D, either with uv:

uv venv
source .venv/bin/activate
uv sync --all-extras --active --directory ../llg3d

or with virtualenv and pip:

virtualenv .venv
source .venv/bin/activate
# Install LLG3D in editable mode, with the MPI and OpenCL extras:
pip install -e "../llg3d[mpi,opencl]"

Parallel Execution

Move to a Run Directory

Create a run directory (outside the cloned repository) and move into it:

mkdir run
cd run

The work directory structure is now as follows:

work/
├── .venv/  # Python virtual environment
├── llg3d/  # LLG3D source code
└── run/    # Run directory

Create an sbatch File

In this example, we create a SLURM job array across multiple temperatures for both OpenCL and MPI on the gaya cluster, which has six 128-core CPU nodes (public partition) and a 3-GPU node (gpu partition).

Copy the utils/slurm/opencl/sbatch_jobarrays.slurm file into the run directory:

cp ../llg3d/utils/slurm/opencl/sbatch_jobarrays.slurm .

Its content is as follows:

#!/bin/bash

#SBATCH -p gpu               # target the gpu partition
#SBATCH --ntasks-per-core=1  # disable multithreading
#SBATCH --gres=gpu:1         # request 1 GPU
#SBATCH -J llg3d-gpu         # job name
#SBATCH --array=0-12         # create a SLURM job array of 13 sub-jobs

# Array of temperatures
TEMPERATURES=(1000 1100 1200 1300 1350 1390 1400 1410 1450 1500 1550 1700 1900)

# Exit if the number of array tasks differs from the number of temperatures
if [ "$SLURM_ARRAY_TASK_COUNT" -ne "${#TEMPERATURES[@]}" ]
then
    echo "number of tasks != number of temperatures"
    echo "($SLURM_ARRAY_TASK_COUNT != ${#TEMPERATURES[@]})"
    exit 1
fi

# Task ID zero-padded to 3 digits
id=$(printf %03d "$SLURM_ARRAY_TASK_ID")

# Temperature for this sub-job
temperature=${TEMPERATURES[$SLURM_ARRAY_TASK_ID]}

# Activate the Python virtual environment
source ../.venv/bin/activate

# Launching the computation
llg3d --solver opencl --N 20000 --start_averaging 12000 --Jx 3000 --dx 1e-9 --T $temperature --result_file run_T${temperature}K.npz
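Before submitting, the index-to-temperature mapping used above can be sanity-checked locally by setting SLURM_ARRAY_TASK_ID by hand (a standalone sketch; no SLURM required):

```shell
# Emulate what SLURM does for sub-job 5 of the array (no cluster needed)
TEMPERATURES=(1000 1100 1200 1300 1350 1390 1400 1410 1450 1500 1550 1700 1900)
SLURM_ARRAY_TASK_ID=5
id=$(printf %03d "$SLURM_ARRAY_TASK_ID")
temperature=${TEMPERATURES[$SLURM_ARRAY_TASK_ID]}
echo "sub-job $id runs at T=${temperature} K"   # → sub-job 005 runs at T=1390 K
```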

Copy the utils/slurm/mpi/sbatch_jobarrays.slurm file into the run directory:

cp ../llg3d/utils/slurm/mpi/sbatch_jobarrays.slurm .

Its content is as follows:

#!/bin/bash

#SBATCH -p public            # target the public partition
#SBATCH --ntasks-per-core=1  # disable multithreading
#SBATCH -n 40                # request 40 compute cores
#SBATCH -J llg3d-mpi         # job name
#SBATCH --array=0-12         # create a SLURM job array of 13 sub-jobs

# Array of temperatures
TEMPERATURES=(1000 1100 1200 1300 1350 1390 1400 1410 1450 1500 1550 1700 1900)

# Exit if the number of array tasks differs from the number of temperatures
if [ "$SLURM_ARRAY_TASK_COUNT" -ne "${#TEMPERATURES[@]}" ]
then
    echo "number of tasks != number of temperatures"
    echo "($SLURM_ARRAY_TASK_COUNT != ${#TEMPERATURES[@]})"
    exit 1
fi

# Task ID zero-padded to 3 digits
id=$(printf %03d "$SLURM_ARRAY_TASK_ID")

# Temperature for this sub-job
temperature=${TEMPERATURES[$SLURM_ARRAY_TASK_ID]}

# Activate the Python virtual environment
source ../.venv/bin/activate

# Launching the computation
mpirun -np $SLURM_NTASKS llg3d --solver mpi --N 20000 --start_averaging 17500 --Jx 3000 --dx 1e-9 --T $temperature --result_file run_T${temperature}K.npz
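The guard at the top of both scripts can also be exercised outside SLURM; in this sketch the values are deliberately mismatched (4 tasks, 3 temperatures) to trigger the error message (the real scripts additionally exit 1):

```shell
# Deliberate mismatch: 4 array tasks but only 3 temperatures
TEMPERATURES=(1000 1100 1200)
SLURM_ARRAY_TASK_COUNT=4
if [ "$SLURM_ARRAY_TASK_COUNT" -ne "${#TEMPERATURES[@]}" ]
then
    echo "number of tasks != number of temperatures"
    echo "($SLURM_ARRAY_TASK_COUNT != ${#TEMPERATURES[@]})"  # (4 != 3)
fi
```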

Submit the Job Array

(run) $ sbatch sbatch_jobarrays.slurm 
Submitted batch job 50221

The execution will create a SLURM job array where each sub-job corresponds to a temperature.

Monitor Job Execution

The output below shows the OpenCL job array running on the gpu partition:

(run) $ squeue -u boileau
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
      45116_[3-12]       gpu llg3d-gp  boileau PD       0:00      1 (Resources)
           45116_1       gpu llg3d-gp  boileau  R       0:04      1 gaya-gpu
           45116_2       gpu llg3d-gp  boileau  R       0:04      1 gaya-gpu
           45116_0       gpu llg3d-gp  boileau  R       0:05      1 gaya-gpu

It can be seen that jobs [0-2] have already started (R for running) while jobs [3-12] are still waiting for resources (PD for pending).

The output below shows the MPI job array running on the public partition:

(run) $ squeue -u boileau
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           50221_5    public llg3d-mp  boileau  R       0:04      1 gaya2
           50221_6    public llg3d-mp  boileau  R       0:04      1 gaya3
           50221_7    public llg3d-mp  boileau  R       0:04      1 gaya3
           50221_8    public llg3d-mp  boileau  R       0:04      1 gaya3
           50221_9    public llg3d-mp  boileau  R       0:04      1 gaya4
          50221_10    public llg3d-mp  boileau  R       0:04      1 gaya4
          50221_11    public llg3d-mp  boileau  R       0:04      1 gaya4
          50221_12    public llg3d-mp  boileau  R       0:04      1 gaya5
           50221_0    public llg3d-mp  boileau  R       0:05      1 gaya1
           50221_1    public llg3d-mp  boileau  R       0:05      1 gaya1
           50221_2    public llg3d-mp  boileau  R       0:05      1 gaya1
           50221_3    public llg3d-mp  boileau  R       0:05      1 gaya2
           50221_4    public llg3d-mp  boileau  R       0:05      1 gaya2

It can be seen that all the jobs have already started (R for running).

When the jobs are finished, they leave the queue:

(run) $ squeue -u boileau
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)

The execution produces the following directory structure:

(run) $ tree
.
├── run_T1000K.npz
├── run_T1100K.npz
├── run_T1200K.npz
├── run_T1300K.npz
├── run_T1350K.npz
├── run_T1390K.npz
├── run_T1400K.npz
├── run_T1410K.npz
├── run_T1450K.npz
├── run_T1500K.npz
├── run_T1550K.npz
├── run_T1700K.npz
├── run_T1900K.npz
├── sbatch_jobarrays.slurm
├── slurm-50221_0.out
├── slurm-50221_10.out
├── slurm-50221_11.out
├── slurm-50221_12.out
├── slurm-50221_1.out
├── slurm-50221_2.out
├── slurm-50221_3.out
├── slurm-50221_4.out
├── slurm-50221_5.out
├── slurm-50221_6.out
├── slurm-50221_7.out
├── slurm-50221_8.out
└── slurm-50221_9.out

1 directory, 27 files
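To confirm that every sub-job produced its result file, the run_T<T>K.npz files can be counted against the temperature list. The sketch below demonstrates the check on empty placeholder files created in a scratch directory (in a real run, you would run only the last two lines from the run directory):

```shell
# Create placeholder result files in a scratch directory for the demonstration
tmpdir=$(mktemp -d)
cd "$tmpdir"
TEMPERATURES=(1000 1100 1200 1300 1350 1390 1400 1410 1450 1500 1550 1700 1900)
for T in "${TEMPERATURES[@]}"; do touch "run_T${T}K.npz"; done

# The check itself: count result files and compare with the temperature list
count=$(ls run_T*K.npz | wc -l | tr -d ' ')
echo "found $count of ${#TEMPERATURES[@]} result files"  # → found 13 of 13 result files
```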

Process the Results

Use the llg3d.m1_vs_T command, which calls the llg3d.post.m1_vs_T.plot_m1_vs_T() function, to gather the results stored in the run_*.npz files and plot the average magnetization as a function of temperature. Example for the OpenCL execution:

(run) $ llg3d.m1_vs_T run_*.npz -i m1_vs_T_opencl.png
Processing file: run_T1000K.npz
Processing file: run_T1100K.npz
Processing file: run_T1200K.npz
Processing file: run_T1300K.npz
Processing file: run_T1350K.npz
Processing file: run_T1390K.npz
Processing file: run_T1400K.npz
Processing file: run_T1410K.npz
Processing file: run_T1450K.npz
Processing file: run_T1500K.npz
Processing file: run_T1550K.npz
Processing file: run_T1700K.npz
Processing file: run_T1900K.npz
T_Curie = 1393 K
Image saved in m1_vs_T_opencl.png

The Curie temperature is estimated as the temperature at which the average magnetization drops below 0.1.
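The criterion itself is easy to state on (temperature, magnetization) pairs: T_Curie is the first temperature at which m falls below 0.1. A minimal sketch with illustrative, made-up values (not simulation output):

```shell
# Made-up (T, m) pairs, sorted by temperature; print the first T with m < 0.1
printf '%s\n' "1300 0.62" "1390 0.15" "1400 0.08" "1450 0.03" |
awk '$2 < 0.1 { print "T_Curie = " $1 " K"; exit }'   # → T_Curie = 1400 K
```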

The plotted graph of m = f(T) looks like this:

[Graph m = f(T)]