Quick Start: Your First GPU Job
Here’s what you came for - a working SBATCH script for multi-GPU PyTorch training. Save this as train.sbatch:
```bash
#!/bin/bash
#SBATCH --job-name=pytorch-ddp
#SBATCH --partition=gpu
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --gres=gpu:a100:4
#SBATCH --cpus-per-task=8
#SBATCH --mem=256G
#SBATCH --time=24:00:00
#SBATCH --output=logs/%j.out
#SBATCH --error=logs/%j.err

# Load environment
module load cuda/12.1
source /opt/conda/bin/activate pytorch-env

# Master node configuration: first hostname in the allocated node list
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=29500
export NCCL_DEBUG=INFO

# Launch distributed training (srun starts one task per GPU)
srun python train.py \
    --distributed \
    --num-nodes=$SLURM_NNODES \
    --node-rank=$SLURM_NODEID \
    --master-addr=$MASTER_ADDR \
    --master-port=$MASTER_PORT \
    --batch-size=64 \
    --epochs=100
```
Create the log directory first (mkdir -p logs - Slurm won't create it for you, and a missing directory means no output files), then submit with sbatch train.sbatch and you're running distributed training across 8 A100s. Now let's build the infrastructure to make this work.
Installing Slurm on Your Cluster
You need one controller node and N compute nodes. I recommend Ubuntu 22.04 for all nodes because the package management is straightforward.
On the controller node:
```bash
# Install Slurm controller and database
apt update && apt install -y slurm-wlm slurm-wlm-torque mariadb-server

# Start MariaDB and create the Slurm accounting database
systemctl enable --now mariadb
mysql -u root <<EOF
CREATE DATABASE slurm_acct_db;
CREATE USER 'slurm'@'localhost' IDENTIFIED BY 'slurmdbpass';
GRANT ALL ON slurm_acct_db.* TO 'slurm'@'localhost';
FLUSH PRIVILEGES;
EOF

# Install Slurm accounting daemon
apt install -y slurmdbd
systemctl enable slurmdbd
```
On each compute node:
```bash
# Install Slurm compute daemon and client tools
apt update && apt install -y slurmd slurm-client

# Install NVIDIA drivers and utilities (if not already done)
apt install -y nvidia-driver-535 nvidia-utils-535
```
Make sure all nodes can resolve each other by hostname. Edit /etc/hosts or use DNS - Slurm relies on hostnames, not IPs.
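For a small cluster, a shared /etc/hosts on every node is enough. A sketch using the node names from the config below - the addresses are placeholders to adjust for your network:

```
10.0.0.1    controller01
10.0.0.11   gpu01
10.0.0.12   gpu02
10.0.0.13   gpu03
10.0.0.14   gpu04
```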
Configuring GPU Partitions with GRES
The magic happens in /etc/slurm/slurm.conf. This file lives on the controller and gets synced to all compute nodes. Here’s a production-ready config:
```
# slurm.conf - GPU cluster configuration
ClusterName=ml-cluster
SlurmctldHost=controller01

# Accounting
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=controller01
AccountingStorageTRES=gres/gpu

# Scheduling - optimize for ML workloads
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory

# GPU resource tracking (CRITICAL!)
GresTypes=gpu

# Partitions
PartitionName=gpu Nodes=gpu[01-04] Default=YES MaxTime=7-00:00:00 State=UP

# Node definitions with GPU GRES
NodeName=gpu01 Gres=gpu:a100:4 CPUs=64 RealMemory=512000 State=UNKNOWN
NodeName=gpu02 Gres=gpu:a100:4 CPUs=64 RealMemory=512000 State=UNKNOWN
NodeName=gpu03 Gres=gpu:v100:8 CPUs=128 RealMemory=1024000 State=UNKNOWN
NodeName=gpu04 Gres=gpu:v100:8 CPUs=128 RealMemory=1024000 State=UNKNOWN
```
The Gres=gpu:a100:4 syntax tells Slurm this node has 4 A100 GPUs. Users request GPUs with --gres=gpu:a100:2 to get 2 A100s specifically, or --gres=gpu:2 for any 2 GPUs.
Now create /etc/slurm/gres.conf on each compute node to map GPUs to devices:
```
# On gpu01 (4x A100)
NodeName=gpu01 Name=gpu Type=a100 File=/dev/nvidia0
NodeName=gpu01 Name=gpu Type=a100 File=/dev/nvidia1
NodeName=gpu01 Name=gpu Type=a100 File=/dev/nvidia2
NodeName=gpu01 Name=gpu Type=a100 File=/dev/nvidia3

# On gpu03 (8x V100) - ranges are allowed
NodeName=gpu03 Name=gpu Type=v100 File=/dev/nvidia[0-7]
```
Restart Slurm everywhere: systemctl restart slurmctld slurmdbd on controller, systemctl restart slurmd on compute nodes. Check nodes with sinfo -Nel - they should show GPU counts.
Writing SBATCH Scripts for Distributed Training
The header matters more than you think. Each directive controls how Slurm allocates resources. Here’s what each line does in the quick-start script:
- --nodes=2 - Request 2 physical machines
- --ntasks-per-node=4 - Launch 4 processes per node (one per GPU)
- --gres=gpu:a100:4 - Allocate 4 A100 GPUs per node
- --cpus-per-task=8 - Each GPU process gets 8 CPU cores for data loading
- --mem=256G - Total memory per node, not per task
For PyTorch DistributedDataParallel, you want ntasks-per-node = GPUs per node. Slurm’s srun will launch your script once per task with SLURM_PROCID set correctly.
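To see how the Slurm variables relate, here's a small sketch (plain Python, no Slurm needed) of the rank layout for the quick-start job's 2 nodes x 4 tasks, assuming Slurm's default block task distribution:

```python
# With block distribution, the global rank (SLURM_PROCID) is
# node_id * ntasks_per_node + local_id (SLURM_LOCALID), and
# SLURM_NTASKS is the world size.
nodes, ntasks_per_node = 2, 4
world_size = nodes * ntasks_per_node  # SLURM_NTASKS = 8

ranks = [(node, local, node * ntasks_per_node + local)
         for node in range(nodes)
         for local in range(ntasks_per_node)]

# Every global rank is unique across the job...
assert len({r for _, _, r in ranks}) == world_size
# ...while local ranks repeat on each node, one per GPU
assert [local for _, local, _ in ranks] == [0, 1, 2, 3] * 2
```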
Here’s the training script (train.py) that works with the SBATCH header:
```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler


def setup_distributed():
    """Initialize PyTorch distributed from Slurm env vars."""
    # Slurm sets these automatically
    world_size = int(os.environ['SLURM_NTASKS'])
    rank = int(os.environ['SLURM_PROCID'])
    local_rank = int(os.environ['SLURM_LOCALID'])

    # Master address comes from the SBATCH script
    master_addr = os.environ['MASTER_ADDR']
    master_port = os.environ['MASTER_PORT']

    dist.init_process_group(
        backend='nccl',
        init_method=f'tcp://{master_addr}:{master_port}',
        world_size=world_size,
        rank=rank,
    )
    torch.cuda.set_device(local_rank)
    return local_rank, rank, world_size


def main():
    local_rank, rank, world_size = setup_distributed()

    # Load model and wrap with DDP
    model = YourModel().cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    # Use DistributedSampler to split data across ranks
    dataset = YourDataset()
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = DataLoader(dataset, batch_size=64, sampler=sampler, num_workers=8)

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for epoch in range(100):
        sampler.set_epoch(epoch)  # Shuffle differently each epoch
        for batch in loader:
            loss = model(batch)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
        if rank == 0:
            print(f"Epoch {epoch} complete")


if __name__ == '__main__':
    main()
```
The srun in the SBATCH script handles process launching - no need for torchrun or manual mpirun.
Managing Job Queues and Priorities
Check what’s running with squeue:
```bash
squeue -u $USER  # Your jobs
squeue -p gpu -o "%.18i %.9P %.50j %.8u %.2t %.10M %.6D %.6C %.8m %R"  # Detailed GPU queue view
```
Cancel jobs with scancel:
```bash
scancel 12345          # Cancel job 12345
scancel -u $USER       # Cancel all your jobs
scancel -n pytorch-ddp # Cancel by job name
```
Set job dependencies to chain experiments:
```bash
# Run evaluation after training completes successfully
TRAIN_JOB=$(sbatch --parsable train.sbatch)
sbatch --dependency=afterok:$TRAIN_JOB eval.sbatch
```
Priority matters when the cluster is full. Configure fair-share in slurm.conf:
```
PriorityType=priority/multifactor
PriorityWeightAge=1000
PriorityWeightFairshare=10000
PriorityWeightJobSize=500
```
This prioritizes users who haven’t run jobs recently (fair-share) while giving small jobs a boost to fill idle GPUs.
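As a rough sketch of how those weights combine: Slurm normalizes each factor to [0, 1] and takes a weighted sum (the real formula includes more factors, such as partition and QOS, omitted here):

```python
def job_priority(age, fairshare, jobsize,
                 w_age=1000, w_fairshare=10000, w_jobsize=500):
    """Simplified multifactor priority: weighted sum of normalized factors."""
    return int(w_age * age + w_fairshare * fairshare + w_jobsize * jobsize)

# With fair-share weighted 10x age, a light user's fresh job
# outranks a heavy user's week-old job
heavy_user_old_job = job_priority(age=1.0, fairshare=0.1, jobsize=0.2)
light_user_new_job = job_priority(age=0.0, fairshare=0.9, jobsize=0.2)
assert light_user_new_job > heavy_user_old_job
```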
Hyperparameter Sweeps with Array Jobs
Array jobs are perfect for hyperparameter tuning - one SBATCH launches hundreds of independent runs:
```bash
#!/bin/bash
#SBATCH --job-name=hp-sweep
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --array=0-99
#SBATCH --time=4:00:00
#SBATCH --output=logs/sweep_%A_%a.out

# SLURM_ARRAY_TASK_ID ranges from 0 to 99
module load cuda/12.1
source /opt/conda/bin/activate pytorch-env

# Map array index to hyperparameters (config files are 1-indexed by line)
LR=$(awk "NR==$((SLURM_ARRAY_TASK_ID + 1))" configs/learning_rates.txt)
BS=$(awk "NR==$((SLURM_ARRAY_TASK_ID + 1))" configs/batch_sizes.txt)
WD=$(awk "NR==$((SLURM_ARRAY_TASK_ID + 1))" configs/weight_decays.txt)

python train.py \
    --learning-rate=$LR \
    --batch-size=$BS \
    --weight-decay=$WD \
    --run-id=$SLURM_ARRAY_TASK_ID \
    --output-dir=results/$SLURM_ARRAY_JOB_ID/$SLURM_ARRAY_TASK_ID
```
Generate the config files with Python:
```python
import numpy as np

# Grid search
lrs = np.logspace(-5, -2, 10)
batch_sizes = [16, 32, 64, 128]
weight_decays = [0, 1e-4, 1e-3, 1e-2]

# 10 * 4 * 4 = 160 combinations, so use --array=0-159
# Write all three files in lockstep: line i+1 holds combination i
with open('configs/learning_rates.txt', 'w') as f_lr, \
     open('configs/batch_sizes.txt', 'w') as f_bs, \
     open('configs/weight_decays.txt', 'w') as f_wd:
    for lr in lrs:
        for bs in batch_sizes:
            for wd in weight_decays:
                f_lr.write(f"{lr}\n")
                f_bs.write(f"{bs}\n")
                f_wd.write(f"{wd}\n")
```
Submit with sbatch sweep.sbatch and Slurm schedules one task per array index (use --array=0-159 to cover the full 160-combination grid), running as many in parallel as GPUs allow.
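If you'd rather skip the config files entirely, the array index can be decoded into a grid position with divmod. A sketch matching the nested-loop order of the generator above (pure Python, with the logspace grid written out by hand):

```python
# Same grid as the generator: index -> (lr, bs, wd), nested-loop order
lrs = [10 ** (-5 + 3 * i / 9) for i in range(10)]  # np.logspace(-5, -2, 10)
batch_sizes = [16, 32, 64, 128]
weight_decays = [0, 1e-4, 1e-3, 1e-2]

def combo(task_id):
    """Decode SLURM_ARRAY_TASK_ID into a hyperparameter combination."""
    lr_idx, rest = divmod(task_id, len(batch_sizes) * len(weight_decays))
    bs_idx, wd_idx = divmod(rest, len(weight_decays))
    return lrs[lr_idx], batch_sizes[bs_idx], weight_decays[wd_idx]

# Task 0 is the first combination, task 159 the last
assert combo(0) == (lrs[0], 16, 0)
assert combo(159) == (lrs[-1], 128, 1e-2)
```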
Monitoring GPU Utilization
Install nvidia-dcgm-exporter on compute nodes and Prometheus on the controller for real-time metrics. But for quick checks, use these commands:
On any compute node:
```bash
nvidia-smi dmon -s pucvmet  # Live GPU usage: power, utilization, clocks, memory, temp
```
From the controller, check all nodes:
```bash
srun --nodes=4 --ntasks=4 nvidia-smi --query-gpu=index,name,utilization.gpu,memory.used,memory.total --format=csv
```
Check job GPU allocation:
```bash
squeue -u $USER -o "%.18i %.9P %.8T %b"  # Shows GRES allocation in last column
```
See historical usage with Slurm accounting:
```bash
sacct -j 12345 --format=JobID,Elapsed,ReqTRES,AllocTRES,MaxRSS,MaxVMSize
```
For deep debugging, enable GPU frequency and power monitoring in jobs:
```bash
#!/bin/bash
#SBATCH --gres=gpu:2

# Log GPU stats every second in the background
nvidia-smi dmon -s pucvmet -o TD -f gpu_stats_$SLURM_JOB_ID.csv &
MONITOR_PID=$!

# Run training
python train.py

# Stop monitoring
kill $MONITOR_PID
```
Common Errors and Fixes
“Unable to allocate resources: Invalid generic resource (gres) specification”
You requested --gres=gpu:a100:4 but the node doesn’t have that GPU type or count. Check available types with sinfo -o "%20N %10G". Use --gres=gpu:4 to request any GPU type.
“NCCL error: unhandled system error” or hangs at init_process_group
Network issue between nodes. Verify all nodes can reach each other:
```bash
srun --nodes=2 --ntasks=2 hostname
```
Check that firewalls aren't blocking the MASTER_PORT. If InfiniBand is available, make sure NCCL is actually using it: NCCL_IB_DISABLE=0 is the default (IB enabled), and setting export NCCL_IB_DISABLE=1 forces TCP, which is a useful way to isolate InfiniBand problems.
Job pending with reason “ReqNodeNotAvail, Reserved for maintenance”
Nodes are draining for admin work. Check with sinfo -R. Either wait or exclude those nodes: #SBATCH --exclude=gpu03,gpu04.
“slurmstepd: error: task launch failure” on compute nodes
The compute node can’t find your Python environment. Use absolute paths in SBATCH scripts:
```bash
source /opt/conda/bin/activate pytorch-env  # Good: absolute path
source activate pytorch-env                 # Bad: relies on shell init
```
Out of memory despite requesting --mem=256G
You requested total node memory, but the OS and other jobs use some. Request 90% of physical RAM: --mem=460G on a 512GB node. Or use --mem-per-cpu=8G to scale with --cpus-per-task.
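To see how --mem-per-cpu scales, plug in the quick-start numbers:

```python
# --mem is per node; --mem-per-cpu scales with the CPU request:
# per-node memory = mem_per_cpu * cpus_per_task * ntasks_per_node
mem_per_cpu_gb = 8    # --mem-per-cpu=8G
cpus_per_task = 8     # --cpus-per-task=8
ntasks_per_node = 4   # --ntasks-per-node=4

per_node_gb = mem_per_cpu_gb * cpus_per_task * ntasks_per_node
assert per_node_gb == 256  # same as --mem=256G in the quick-start header
```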
Array job fills up the queue, blocking other users
Set MaxArraySize and MaxSubmitJobs in slurm.conf:
```
MaxArraySize=1000
MaxSubmitJobs=5000
```
Or limit concurrent runs: #SBATCH --array=0-999%50 runs 50 at a time.