Quick Start: Your First GPU Job
Here’s what you came for - a working SBATCH script for multi-GPU PyTorch training. Save this as train.sbatch:
```bash
#!/bin/bash
#SBATCH --job-name=pytorch-ddp
#SBATCH --partition=gpu
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --gres=gpu:a100:4
#SBATCH --cpus-per-task=8
#SBATCH --mem=256G
#SBATCH --time=24:00:00
#SBATCH --output=logs/%j.out
#SBATCH --error=logs/%j.err

# Load environment
module load cuda/12.1
source /opt/conda/bin/activate pytorch-env

# Master node configuration: first hostname in the allocated node list
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=29500
export NCCL_DEBUG=INFO

# Launch distributed training (srun starts one task per GPU)
srun python train.py \
    --distributed \
    --num-nodes=$SLURM_NNODES \
    --node-rank=$SLURM_NODEID \
    --master-addr=$MASTER_ADDR \
    --master-port=$MASTER_PORT \
    --batch-size=64 \
    --epochs=100
```
Create the log directory first (mkdir -p logs - Slurm won't create it for you, and a missing directory means no output files), then submit with sbatch train.sbatch and you're running distributed training across 8 A100s. Now let's build the infrastructure to make this work.
Installing Slurm on Your Cluster
You need one controller node and N compute nodes. I recommend Ubuntu 22.04 for all nodes because the package management is straightforward.
On the controller node:
```bash
# Install Slurm controller and database
apt update && apt install -y slurm-wlm slurm-wlm-torque mariadb-server

# Start MariaDB and create the Slurm accounting database
systemctl enable --now mariadb
mysql -u root <<EOF
CREATE DATABASE slurm_acct_db;
CREATE USER 'slurm'@'localhost' IDENTIFIED BY 'slurmdbpass';
GRANT ALL ON slurm_acct_db.* TO 'slurm'@'localhost';
FLUSH PRIVILEGES;
EOF

# Install Slurm accounting daemon
apt install -y slurmdbd
systemctl enable slurmdbd
```
On each compute node:
```bash
# Install Slurm compute daemon and client tools
apt update && apt install -y slurmd slurm-client

# Install NVIDIA drivers and utilities (if not already done)
apt install -y nvidia-driver-535 nvidia-utils-535
```
Make sure all nodes can resolve each other by hostname. Edit /etc/hosts or use DNS - Slurm relies on hostnames, not IPs.
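For a small cluster, a shared /etc/hosts on every node is enough. A sketch using the node names from the config below - the addresses are placeholders to adjust for your network:

```
10.0.0.1    controller01
10.0.0.11   gpu01
10.0.0.12   gpu02
10.0.0.13   gpu03
10.0.0.14   gpu04
```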
Configuring GPU Partitions with GRES
The magic happens in /etc/slurm/slurm.conf. This file lives on the controller and gets synced to all compute nodes. Here’s a production-ready config:
```
# slurm.conf - GPU cluster configuration
ClusterName=ml-cluster
SlurmctldHost=controller01

# Accounting
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=controller01
AccountingStorageTRES=gres/gpu

# Scheduling - optimize for ML workloads
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory

# GPU resource tracking (CRITICAL!)
GresTypes=gpu

# Partitions
PartitionName=gpu Nodes=gpu[01-04] Default=YES MaxTime=7-00:00:00 State=UP

# Node definitions with GPU GRES
NodeName=gpu01 Gres=gpu:a100:4 CPUs=64 RealMemory=512000 State=UNKNOWN
NodeName=gpu02 Gres=gpu:a100:4 CPUs=64 RealMemory=512000 State=UNKNOWN
NodeName=gpu03 Gres=gpu:v100:8 CPUs=128 RealMemory=1024000 State=UNKNOWN
NodeName=gpu04 Gres=gpu:v100:8 CPUs=128 RealMemory=1024000 State=UNKNOWN
```
The Gres=gpu:a100:4 syntax tells Slurm this node has 4 A100 GPUs. Users request GPUs with --gres=gpu:a100:2 to get 2 A100s specifically, or --gres=gpu:2 for any 2 GPUs.
Now create /etc/slurm/gres.conf on each compute node to map GPUs to devices:
```
# On gpu01 (4x A100)
NodeName=gpu01 Name=gpu Type=a100 File=/dev/nvidia0
NodeName=gpu01 Name=gpu Type=a100 File=/dev/nvidia1
NodeName=gpu01 Name=gpu Type=a100 File=/dev/nvidia2
NodeName=gpu01 Name=gpu Type=a100 File=/dev/nvidia3

# On gpu03 (8x V100) - ranges are allowed
NodeName=gpu03 Name=gpu Type=v100 File=/dev/nvidia[0-7]
```
Restart Slurm everywhere: systemctl restart slurmctld slurmdbd on controller, systemctl restart slurmd on compute nodes. Check nodes with sinfo -Nel - they should show GPU counts.
Writing SBATCH Scripts for Distributed Training
The header matters more than you think. Each directive controls how Slurm allocates resources. Here’s what each line does in the quick-start script:
- --nodes=2 - Request 2 physical machines
- --ntasks-per-node=4 - Launch 4 processes per node (one per GPU)
- --gres=gpu:a100:4 - Allocate 4 A100 GPUs per node
- --cpus-per-task=8 - Each GPU process gets 8 CPU cores for data loading
- --mem=256G - Total memory per node, not per task
For PyTorch DistributedDataParallel, you want ntasks-per-node = GPUs per node. Slurm’s srun will launch your script once per task with SLURM_PROCID set correctly.
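To see how the Slurm variables relate, here's a small sketch (plain Python, no Slurm needed) of the rank layout for the quick-start job's 2 nodes x 4 tasks, assuming Slurm's default block task distribution:

```python
# With block distribution, the global rank (SLURM_PROCID) is
# node_id * ntasks_per_node + local_id (SLURM_LOCALID), and
# SLURM_NTASKS is the world size.
nodes, ntasks_per_node = 2, 4
world_size = nodes * ntasks_per_node  # SLURM_NTASKS = 8

ranks = [(node, local, node * ntasks_per_node + local)
         for node in range(nodes)
         for local in range(ntasks_per_node)]

# Every global rank is unique across the job...
assert len({r for _, _, r in ranks}) == world_size
# ...while local ranks repeat on each node, one per GPU
assert [local for _, local, _ in ranks] == [0, 1, 2, 3] * 2
```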
Here’s the training script (train.py) that works with the SBATCH header:
```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler


def setup_distributed():
    """Initialize PyTorch distributed from Slurm env vars."""
    # Slurm sets these automatically
    world_size = int(os.environ['SLURM_NTASKS'])
    rank = int(os.environ['SLURM_PROCID'])
    local_rank = int(os.environ['SLURM_LOCALID'])

    # Master address comes from the SBATCH script
    master_addr = os.environ['MASTER_ADDR']
    master_port = os.environ['MASTER_PORT']

    dist.init_process_group(
        backend='nccl',
        init_method=f'tcp://{master_addr}:{master_port}',
        world_size=world_size,
        rank=rank,
    )
    torch.cuda.set_device(local_rank)
    return local_rank, rank, world_size


def main():
    local_rank, rank, world_size = setup_distributed()

    # Load model and wrap with DDP
    model = YourModel().cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    # Use DistributedSampler to split data across ranks
    dataset = YourDataset()
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = DataLoader(dataset, batch_size=64, sampler=sampler, num_workers=8)

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for epoch in range(100):
        sampler.set_epoch(epoch)  # Shuffle differently each epoch
        for batch in loader:
            loss = model(batch)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
        if rank == 0:
            print(f"Epoch {epoch} complete")


if __name__ == '__main__':
    main()
```
The srun in the SBATCH script handles process launching - no need for torchrun or manual mpirun.
Managing Job Queues and Priorities
Check what’s running with squeue:
```bash
squeue -u $USER  # Your jobs
squeue -p gpu -o "%.18i %.9P %.50j %.8u %.2t %.10M %.6D %.6C %.8m %R"  # Detailed GPU queue view
```
Cancel jobs with scancel:
```bash
scancel 12345          # Cancel job 12345
scancel -u $USER       # Cancel all your jobs
scancel -n pytorch-ddp # Cancel by job name
```
Set job dependencies to chain experiments:
```bash
# Run evaluation after training completes successfully
TRAIN_JOB=$(sbatch --parsable train.sbatch)
sbatch --dependency=afterok:$TRAIN_JOB eval.sbatch
```
Priority matters when the cluster is full. Configure fair-share in slurm.conf:
```
PriorityType=priority/multifactor
PriorityWeightAge=1000
PriorityWeightFairshare=10000
PriorityWeightJobSize=500
```
This prioritizes users who haven’t run jobs recently (fair-share) while giving small jobs a boost to fill idle GPUs.
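As a rough sketch of how those weights combine: Slurm normalizes each factor to [0, 1] and takes a weighted sum (the real formula includes more factors, such as partition and QOS, omitted here):

```python
def job_priority(age, fairshare, jobsize,
                 w_age=1000, w_fairshare=10000, w_jobsize=500):
    """Simplified multifactor priority: weighted sum of normalized factors."""
    return int(w_age * age + w_fairshare * fairshare + w_jobsize * jobsize)

# With fair-share weighted 10x age, a light user's fresh job
# outranks a heavy user's week-old job
heavy_user_old_job = job_priority(age=1.0, fairshare=0.1, jobsize=0.2)
light_user_new_job = job_priority(age=0.0, fairshare=0.9, jobsize=0.2)
assert light_user_new_job > heavy_user_old_job
```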
Hyperparameter Sweeps with Array Jobs
Array jobs are perfect for hyperparameter tuning - one SBATCH launches hundreds of independent runs:
```bash
#!/bin/bash
#SBATCH --job-name=hp-sweep
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --array=0-99
#SBATCH --time=4:00:00
#SBATCH --output=logs/sweep_%A_%a.out

# SLURM_ARRAY_TASK_ID ranges from 0 to 99
module load cuda/12.1
source /opt/conda/bin/activate pytorch-env

# Map array index to hyperparameters (config files are 1-indexed by line)
LR=$(awk "NR==$((SLURM_ARRAY_TASK_ID + 1))" configs/learning_rates.txt)
BS=$(awk "NR==$((SLURM_ARRAY_TASK_ID + 1))" configs/batch_sizes.txt)
WD=$(awk "NR==$((SLURM_ARRAY_TASK_ID + 1))" configs/weight_decays.txt)

python train.py \
    --learning-rate=$LR \
    --batch-size=$BS \
    --weight-decay=$WD \
    --run-id=$SLURM_ARRAY_TASK_ID \
    --output-dir=results/$SLURM_ARRAY_JOB_ID/$SLURM_ARRAY_TASK_ID
```
Generate the config files with Python:
```python
import numpy as np

# Grid search
lrs = np.logspace(-5, -2, 10)
batch_sizes = [16, 32, 64, 128]
weight_decays = [0, 1e-4, 1e-3, 1e-2]

# 10 * 4 * 4 = 160 combinations, so use --array=0-159
# Write all three files in lockstep: line i+1 holds combination i
with open('configs/learning_rates.txt', 'w') as f_lr, \
     open('configs/batch_sizes.txt', 'w') as f_bs, \
     open('configs/weight_decays.txt', 'w') as f_wd:
    for lr in lrs:
        for bs in batch_sizes:
            for wd in weight_decays:
                f_lr.write(f"{lr}\n")
                f_bs.write(f"{bs}\n")
                f_wd.write(f"{wd}\n")
```
Submit with sbatch sweep.sbatch and Slurm schedules one task per array index (use --array=0-159 to cover the full 160-combination grid), running as many in parallel as GPUs allow.
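If you'd rather skip the config files entirely, the array index can be decoded into a grid position with divmod. A sketch matching the nested-loop order of the generator above (pure Python, with the logspace grid written out by hand):

```python
# Same grid as the generator: index -> (lr, bs, wd), nested-loop order
lrs = [10 ** (-5 + 3 * i / 9) for i in range(10)]  # np.logspace(-5, -2, 10)
batch_sizes = [16, 32, 64, 128]
weight_decays = [0, 1e-4, 1e-3, 1e-2]

def combo(task_id):
    """Decode SLURM_ARRAY_TASK_ID into a hyperparameter combination."""
    lr_idx, rest = divmod(task_id, len(batch_sizes) * len(weight_decays))
    bs_idx, wd_idx = divmod(rest, len(weight_decays))
    return lrs[lr_idx], batch_sizes[bs_idx], weight_decays[wd_idx]

# Task 0 is the first combination, task 159 the last
assert combo(0) == (lrs[0], 16, 0)
assert combo(159) == (lrs[-1], 128, 1e-2)
```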
Monitoring GPU Utilization
Install nvidia-dcgm-exporter on compute nodes and Prometheus on the controller for real-time metrics. But for quick checks, use these commands:
On any compute node:
```bash
nvidia-smi dmon -s pucvmet  # Live GPU usage: power, utilization, clocks, memory, temp
```
From the controller, check all nodes:
```bash
srun --nodes=4 --ntasks=4 nvidia-smi --query-gpu=index,name,utilization.gpu,memory.used,memory.total --format=csv
```
Check job GPU allocation:
```bash
squeue -u $USER -o "%.18i %.9P %.8T %b"  # Shows GRES allocation in last column
```
See historical usage with Slurm accounting:
```bash
sacct -j 12345 --format=JobID,Elapsed,ReqTRES,AllocTRES,MaxRSS,MaxVMSize
```
For deep debugging, enable GPU frequency and power monitoring in jobs:
```bash
#!/bin/bash
#SBATCH --gres=gpu:2

# Log GPU stats every second in the background
nvidia-smi dmon -s pucvmet -o TD -f gpu_stats_$SLURM_JOB_ID.csv &
MONITOR_PID=$!

# Run training
python train.py

# Stop monitoring
kill $MONITOR_PID
```
Common Errors and Fixes
“Unable to allocate resources: Invalid generic resource (gres) specification”
You requested --gres=gpu:a100:4 but the node doesn’t have that GPU type or count. Check available types with sinfo -o "%20N %10G". Use --gres=gpu:4 to request any GPU type.
“NCCL error: unhandled system error” or hangs at init_process_group
Network issue between nodes. Verify all nodes can reach each other:
```bash
srun --nodes=2 --ntasks=2 hostname
```
Check that firewalls aren't blocking the MASTER_PORT. If InfiniBand is available, make sure NCCL is actually using it: NCCL_IB_DISABLE=0 is the default (IB enabled), and setting export NCCL_IB_DISABLE=1 forces TCP, which is a useful way to isolate InfiniBand problems.
Job pending with reason “ReqNodeNotAvail, Reserved for maintenance”
Nodes are draining for admin work. Check with sinfo -R. Either wait or exclude those nodes: #SBATCH --exclude=gpu03,gpu04.
“slurmstepd: error: task launch failure” on compute nodes
The compute node can’t find your Python environment. Use absolute paths in SBATCH scripts:
```bash
source /opt/conda/bin/activate pytorch-env  # Good: absolute path
source activate pytorch-env                 # Bad: relies on shell init
```
Out of memory despite requesting --mem=256G
You requested total node memory, but the OS and other jobs use some. Request 90% of physical RAM: --mem=460G on a 512GB node. Or use --mem-per-cpu=8G to scale with --cpus-per-task.
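To see how --mem-per-cpu scales, plug in the quick-start numbers:

```python
# --mem is per node; --mem-per-cpu scales with the CPU request:
# per-node memory = mem_per_cpu * cpus_per_task * ntasks_per_node
mem_per_cpu_gb = 8    # --mem-per-cpu=8G
cpus_per_task = 8     # --cpus-per-task=8
ntasks_per_node = 4   # --ntasks-per-node=4

per_node_gb = mem_per_cpu_gb * cpus_per_task * ntasks_per_node
assert per_node_gb == 256  # same as --mem=256G in the quick-start header
```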
Array job fills up the queue, blocking other users
Set MaxArraySize and MaxSubmitJobs in slurm.conf:
```
MaxArraySize=1000
MaxSubmitJobs=5000
```
Or limit concurrent runs: #SBATCH --array=0-999%50 runs 50 at a time.