The Quick Version

Install DeepSpeed, write a JSON config that picks the right ZeRO stage for your model size, and launch with the deepspeed CLI. Here’s the minimum you need to train a model across all your GPUs:

pip install deepspeed
deepspeed --num_gpus=4 train.py --deepspeed ds_config.json

That single command handles process spawning, gradient syncing, and memory partitioning. The real work is picking the right ZeRO stage and tuning the config.

Understanding ZeRO Stages

ZeRO (Zero Redundancy Optimizer) eliminates memory duplication across GPUs. Standard data parallelism replicates the entire model state on every GPU. ZeRO instead partitions that state – optimizer states, then gradients, then the parameters themselves – with each stage sharding one more component.

Stage 1: Optimizer State Partitioning

Each GPU holds only a shard of the optimizer states (momentum, variance for Adam). Model parameters and gradients are still replicated. This cuts optimizer memory by the number of GPUs. Use Stage 1 when your model fits in GPU memory but you’re tight on optimizer overhead.

Stage 2: Gradient Partitioning

On top of Stage 1, gradients are also partitioned. Each GPU only stores the gradients corresponding to its optimizer shard. Communication happens during the backward pass. Stage 2 is the sweet spot for most training runs – minimal communication overhead with significant memory savings.

Stage 3: Parameter Partitioning

Parameters themselves are sharded across GPUs. No single GPU holds the full model. Parameters are gathered on-the-fly for forward and backward passes, then discarded. Stage 3 is what you need for models that don’t fit on a single GPU at all, like 7B+ parameter models on 24GB cards.
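The memory savings per stage are easy to estimate with back-of-the-envelope arithmetic. The sketch below assumes mixed-precision Adam (2 bytes for bf16 parameters, 2 for bf16 gradients, 12 for the fp32 parameter copy plus momentum and variance – 16 bytes per parameter total) and ignores activations and buffers, so treat it as a rough guide rather than a prediction:

```python
def zero_memory_per_gpu_gb(params_billions, num_gpus, stage):
    """Rough per-GPU memory for model states under each ZeRO stage.

    Assumes mixed-precision Adam: 2 bytes (bf16 params) + 2 bytes
    (bf16 grads) + 12 bytes (fp32 copy + momentum + variance)
    = 16 bytes per parameter. Activations are not counted.
    """
    p = params_billions * 1e9
    param_bytes, grad_bytes, optim_bytes = 2 * p, 2 * p, 12 * p
    if stage >= 1:            # Stage 1: shard optimizer states
        optim_bytes /= num_gpus
    if stage >= 2:            # Stage 2: shard gradients too
        grad_bytes /= num_gpus
    if stage >= 3:            # Stage 3: shard parameters as well
        param_bytes /= num_gpus
    return (param_bytes + grad_bytes + optim_bytes) / 1e9

# 7B model on 4 GPUs: full replication vs each ZeRO stage
for stage in range(4):
    print(f"stage {stage}: {zero_memory_per_gpu_gb(7, 4, stage):.1f} GB/GPU")
```

For a 7B model on 4 GPUs this reproduces the familiar 112 GB full-replication figure and shows Stage 3 cutting the per-card share of model states to 28 GB – which is why Stage 3 (usually with offloading on top) is the answer for big models on small cards.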

DeepSpeed with Hugging Face Trainer

The fastest path if you’re already using Transformers. Create a ds_config.json:

{
  "bf16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "steps_per_print": 100,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}

The "auto" values tell DeepSpeed to pull settings from the Trainer’s arguments. This avoids the double-configuration headache where your Trainer says one batch size and your config says another.

Pass it to the Trainer:

from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./output",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    bf16=True,
    deepspeed="ds_config.json",
    gradient_checkpointing=True,
    logging_steps=10,
    save_strategy="steps",
    save_steps=500,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)

trainer.train()

Launch with:

deepspeed --num_gpus=4 train.py \
  --deepspeed ds_config.json \
  --per_device_train_batch_size 2 \
  --gradient_accumulation_steps 4

Standalone DeepSpeed Integration

When you’re not using Hugging Face, use deepspeed.initialize() directly. This gives you full control over the training loop.

import deepspeed
import torch

model = YourModel()

ds_config = {
    "train_batch_size": 32,
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 2,
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": 3e-5,
            "betas": [0.9, 0.999],
            "eps": 1e-8,
            "weight_decay": 0.01,
        },
    },
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 2},
}

model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),  # required when the optimizer is defined in the config
    config=ds_config,
)

for batch in dataloader:
    inputs = batch["input_ids"].to(model_engine.local_rank)
    labels = batch["labels"].to(model_engine.local_rank)

    outputs = model_engine(inputs, labels=labels)
    loss = outputs.loss

    model_engine.backward(loss)
    model_engine.step()

Notice you call model_engine.backward() and model_engine.step() instead of the usual PyTorch patterns. DeepSpeed wraps your model and optimizer into a single engine that handles gradient syncing, scaling, and partitioning.

ZeRO-3 Config for a 7B Model

Full fine-tuning of a 7B model (like LLaMA 2 7B) on 4x A100 80GB or 8x A6000 48GB calls for ZeRO-3, typically with CPU offloading for headroom. Here’s a production-ready config:

{
  "bf16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "gradient_accumulation_steps": 8,
  "gradient_clipping": 1.0,
  "train_batch_size": 64,
  "train_micro_batch_size_per_gpu": 1,
  "wall_clock_breakdown": false
}

Key choices here:

  • CPU offloading for both optimizer and parameters – this is what lets you fit a 7B model on smaller GPUs. Without offloading, you’d need roughly 112GB of GPU memory just for the model states in bf16. With ZeRO-3 offload across 4 GPUs, each card only holds a fraction at any given time.
  • pin_memory: true – keeps CPU tensors in pinned memory for faster GPU transfers. Costs more host RAM but significantly speeds up offloading.
  • Micro batch size of 1 – start here and increase only if GPU memory allows. With a micro batch of 1, gradient accumulation of 8, and 8 GPUs, the effective batch size is 64. On 4 GPUs, raise gradient_accumulation_steps to 16 to keep train_batch_size at 64 – DeepSpeed checks at startup that train_batch_size equals micro batch × accumulation steps × number of GPUs.
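Since DeepSpeed enforces that relationship between the three batch settings, it's worth checking the arithmetic before launching a long job. A hypothetical helper (not part of DeepSpeed) that mirrors the startup check:

```python
def check_batch_config(train_batch_size, micro_batch_per_gpu,
                       grad_accum_steps, num_gpus):
    """Verify train_batch_size == micro batch x accumulation x GPUs,
    the invariant DeepSpeed enforces when it initializes."""
    effective = micro_batch_per_gpu * grad_accum_steps * num_gpus
    if effective != train_batch_size:
        raise ValueError(
            f"train_batch_size={train_batch_size} but "
            f"{micro_batch_per_gpu} x {grad_accum_steps} x {num_gpus} "
            f"= {effective}"
        )
    return effective

check_batch_config(64, 1, 8, 8)    # the config above on 8 GPUs
check_batch_config(64, 1, 16, 4)   # same global batch on 4 GPUs
```

The same helper tells you how to recover from an OOM: if you halve the micro batch, double the accumulation steps and the global batch size – and your loss curves – stay comparable.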

If you have NVMe storage and limited CPU RAM, swap CPU offloading for NVMe:

"offload_optimizer": {
  "device": "nvme",
  "nvme_path": "/local_nvme",
  "pin_memory": true,
  "buffer_count": 5,
  "fast_init": false
}

NVMe offloading is slower than CPU but lets you train models that exceed your system RAM.

Mixed Precision Training

DeepSpeed supports both fp16 and bf16. Prefer bf16 on Ampere GPUs (A100, A6000, RTX 3090) and newer – it handles the full float32 range so you skip loss scaling entirely.

For fp16 on older hardware:

{
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  }
}

Setting "loss_scale": 0 enables dynamic loss scaling. Don’t set both fp16 and bf16 to true – pick one.
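The reason fp16 needs loss scaling at all is its narrow range, which is easy to see with NumPy (bf16 isn’t a native NumPy type, so only the fp16 side is shown):

```python
import numpy as np

# fp16 tops out at 65504 -- anything larger overflows to inf
print(np.finfo(np.float16).max)   # 65504.0
print(np.float16(1e5))            # inf

# small gradients underflow to zero...
print(np.float16(1e-8))           # 0.0

# ...unless the loss (and therefore every gradient) is scaled up
# first -- which is exactly what dynamic loss scaling automates
print(np.float16(1e-8 * 2**16))   # nonzero, survives the round-trip
```

bf16 trades precision for the full float32 exponent range, so neither the overflow nor the underflow above occurs and no scaling machinery is needed.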

Gradient Checkpointing

Gradient checkpointing trades compute for memory by recomputing activations during the backward pass instead of storing them. For large models, this is non-negotiable.

With Hugging Face:

training_args = TrainingArguments(
    ...,
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
)

With standalone DeepSpeed, enable it on your model before deepspeed.initialize():

model.gradient_checkpointing_enable()
# or for custom models (mpu_ is the required first argument;
# pass None when not using model parallelism):
deepspeed.checkpointing.configure(
    mpu_=None,
    num_checkpoints=model.config.num_hidden_layers,
)

Expect roughly 30-40% slower training in exchange for 50-60% memory reduction on activations.
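DeepSpeed also exposes its own activation checkpointing knobs in the JSON config. A sketch using the documented keys – note that partition_activations only pays off when combined with model parallelism:

```json
{
  "activation_checkpointing": {
    "partition_activations": false,
    "cpu_checkpointing": false,
    "contiguous_memory_optimization": true,
    "number_checkpoints": null,
    "synchronize_checkpoint_boundary": false,
    "profile": false
  }
}
```

Flip cpu_checkpointing to true to push checkpointed activations off the GPU entirely, at the cost of extra PCIe traffic.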

DeepSpeed with Accelerate

Hugging Face Accelerate wraps DeepSpeed with a simpler interface. Run the config wizard:

accelerate config

Select DeepSpeed when prompted and pick your ZeRO stage. This generates a config file at ~/.cache/huggingface/accelerate/default_config.yaml. Then launch:

accelerate launch --num_processes=4 train.py
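For reference, the generated default_config.yaml looks roughly like this – an illustrative sketch, since the exact keys vary by Accelerate version:

```yaml
compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
deepspeed_config:
  zero_stage: 2
  offload_optimizer_device: none
  offload_param_device: none
  gradient_accumulation_steps: 1
mixed_precision: bf16
num_machines: 1
num_processes: 4
```

You can also skip the YAML entirely and point the deepspeed_config section at a ds_config.json to reuse an existing DeepSpeed config.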

Your training script uses the Accelerate API:

from accelerate import Accelerator

accelerator = Accelerator()
model, optimizer, dataloader, scheduler = accelerator.prepare(
    model, optimizer, dataloader, scheduler
)

for batch in dataloader:
    outputs = model(**batch)
    loss = outputs.loss
    accelerator.backward(loss)
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()

Accelerate handles the DeepSpeed engine creation behind the scenes. The advantage is your training script stays portable – you can switch between DDP, FSDP, and DeepSpeed without code changes.

Common Errors

RuntimeError: CUDA out of memory – Even with ZeRO-3, you can OOM if your micro batch size is too large. Drop train_micro_batch_size_per_gpu to 1 and increase gradient_accumulation_steps to keep your effective batch size the same.

AssertionError: DeepSpeed engine was not initialized – You called model.forward() on the raw model instead of the DeepSpeed engine. Always use the engine returned by deepspeed.initialize().

Timeout waiting for all processes – One GPU is slower or crashed. Check NCCL_DEBUG=INFO output. Common cause: heterogeneous GPUs with different memory sizes. DeepSpeed partitions equally, so the smallest GPU becomes the bottleneck.

stage3_gather_16bit_weights_on_model_save errors during checkpoint saving – Make sure this flag is true in your config. Without it, ZeRO-3 saves sharded weights that are painful to consolidate later.

ValueError: fp16 and bf16 cannot both be enabled – Pick one. Check that your Trainer args and ds_config don’t conflict. Using "auto" in the config prevents this.

Slow training with CPU offload – Offloading adds PCIe transfer overhead. Make sure pin_memory is true and your CPU-GPU interconnect isn’t saturated. On multi-socket systems, pin processes to the NUMA node closest to each GPU using numactl.