The Quick Version
Install DeepSpeed with pip install deepspeed, write a JSON config that picks the right ZeRO stage for your model size, and launch with the deepspeed CLI. The minimum you need to train a model across all your GPUs is a single launcher invocation such as deepspeed train.py --deepspeed ds_config.json.
That single command handles process spawning, gradient syncing, and memory partitioning. The real work is picking the right ZeRO stage and tuning the config.
Understanding ZeRO Stages
ZeRO (Zero Redundancy Optimizer) eliminates memory duplication across GPUs. Standard data parallelism replicates the entire model on every GPU. ZeRO partitions different components instead.
Stage 1: Optimizer State Partitioning
Each GPU holds only a shard of the optimizer states (momentum, variance for Adam). Model parameters and gradients are still replicated. This cuts optimizer memory by the number of GPUs. Use Stage 1 when your model fits in GPU memory but you’re tight on optimizer overhead.
Stage 2: Gradient Partitioning
On top of Stage 1, gradients are also partitioned. Each GPU only stores the gradients corresponding to its optimizer shard. Communication happens during the backward pass. Stage 2 is the sweet spot for most training runs – minimal communication overhead with significant memory savings.
Stage 3: Parameter Partitioning
Parameters themselves are sharded across GPUs. No single GPU holds the full model. Parameters are gathered on-the-fly for forward and backward passes, then discarded. Stage 3 is what you need for models that don’t fit on a single GPU at all, like 7B+ parameter models on 24GB cards.
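The savings at each stage can be sketched with the standard bytes-per-parameter accounting for mixed-precision Adam (2 bytes for fp16/bf16 parameters, 2 for gradients, 12 for the fp32 master copy plus momentum and variance). This is back-of-envelope arithmetic, not a measured number:

```python
def per_gpu_model_state_gb(num_params, num_gpus, stage):
    """Approximate per-GPU memory for model states under a given ZeRO stage."""
    params_b = 2 * num_params   # bf16/fp16 parameters
    grads_b = 2 * num_params    # bf16/fp16 gradients
    optim_b = 12 * num_params   # fp32 master weights + Adam momentum + variance
    if stage >= 1:
        optim_b /= num_gpus     # Stage 1: shard optimizer states
    if stage >= 2:
        grads_b /= num_gpus     # Stage 2: also shard gradients
    if stage >= 3:
        params_b /= num_gpus    # Stage 3: also shard parameters
    return (params_b + grads_b + optim_b) / 1e9

# A 7B-parameter model on 8 GPUs: plain replication needs ~112 GB per GPU
# for model states alone; ZeRO-3 brings that down to ~14 GB.
for s in range(4):
    print(f"stage {s}: {per_gpu_model_state_gb(7e9, 8, s):.1f} GB per GPU")
```

Activations, temporary buffers, and fragmentation come on top of these figures, which is why real headroom requirements are higher.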
DeepSpeed with Hugging Face Trainer
The fastest path if you’re already using Transformers is to create a ds_config.json and point the Trainer at it.
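A minimal Stage 2 config in that style might look like this (all keys come from DeepSpeed’s config schema; which values you leave as "auto" is up to you):

```json
{
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true
  },
  "bf16": { "enabled": "auto" },
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto"
}
```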
The "auto" values tell DeepSpeed to pull settings from the Trainer’s arguments. This avoids the double-configuration headache where your Trainer says one batch size and your config says another.
Pass the config path to the Trainer through its deepspeed training argument.
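A sketch of the wiring – model and train_dataset are assumed to be defined elsewhere in your script; deepspeed is a real TrainingArguments parameter:

```python
from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="./checkpoints",
    deepspeed="ds_config.json",     # hands the config to DeepSpeed
    bf16=True,
    per_device_train_batch_size=4,  # fills the config's "auto" batch size
    gradient_accumulation_steps=8,  # fills the config's "auto" accumulation
)
trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()
```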
Launch with the deepspeed CLI (deepspeed train.py) rather than plain python, so the launcher can spawn one process per GPU.
Standalone DeepSpeed Integration
When you’re not using Hugging Face, call deepspeed.initialize() directly. This gives you full control over the training loop.
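A minimal loop might look like the following sketch. Here model and train_dataset are assumed to exist, the model is assumed to return a loss, and ds_config.json is the config from earlier; deepspeed.initialize() genuinely returns a four-tuple of (engine, optimizer, dataloader, lr_scheduler):

```python
import deepspeed

model_engine, optimizer, train_loader, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    training_data=train_dataset,
    config="ds_config.json",
)

for batch in train_loader:
    inputs, labels = (t.to(model_engine.device) for t in batch)
    loss = model_engine(inputs, labels)  # forward through the engine, not the raw model
    model_engine.backward(loss)          # replaces loss.backward()
    model_engine.step()                  # optimizer step, zero_grad, LR schedule
```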
Notice you call model_engine.backward() and model_engine.step() instead of the usual PyTorch patterns. DeepSpeed wraps your model and optimizer into a single engine that handles gradient syncing, scaling, and partitioning.
ZeRO-3 Config for a 7B Model
Training a 7B model (like LLaMA 2 7B) on 4x A100 80GB or 8x A6000 48GB is a job for ZeRO-3 with CPU offloading.
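One workable shape for that config – key names are DeepSpeed’s, and the numeric values mirror the choices discussed below:

```json
{
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "offload_param": { "device": "cpu", "pin_memory": true },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": 8,
  "gradient_clipping": 1.0
}
```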
Key choices here:
- CPU offloading for both optimizer and parameters – this is what lets you fit a 7B model on smaller GPUs. Without offloading, you’d need roughly 112GB of GPU memory just for the model states in bf16. With ZeRO-3 offload across 4 GPUs, each card only holds a fraction at any given time.
- pin_memory: true – keeps CPU tensors in pinned memory for faster GPU transfers. Costs more host RAM but significantly speeds up offloading.
- Micro batch size of 1 – start here and increase only if GPU memory allows. With gradient accumulation at 8 and 8 GPUs, your effective batch size is 64.
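That last bullet’s arithmetic generalizes to any combination of the three knobs:

```python
def effective_batch(micro_batch, grad_accum, num_gpus):
    # Samples seen per optimizer step across the whole data-parallel group.
    return micro_batch * grad_accum * num_gpus

print(effective_batch(1, 8, 8))  # 64
```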
If you have NVMe storage and limited CPU RAM, swap CPU offloading for NVMe offloading.
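The relevant change is the device and path inside the offload blocks; /local_nvme below is a placeholder for wherever your NVMe volume is mounted:

```json
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "nvme",
      "nvme_path": "/local_nvme",
      "pin_memory": true
    },
    "offload_param": {
      "device": "nvme",
      "nvme_path": "/local_nvme"
    }
  }
}
```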
NVMe offloading is slower than CPU but lets you train models that exceed your system RAM.
Mixed Precision Training
DeepSpeed supports both fp16 and bf16. Prefer bf16 on Ampere GPUs (A100, A6000, RTX 3090) and newer – it has the same dynamic range as float32, so you skip loss scaling entirely.
For fp16 on older hardware, enable the fp16 section of the config instead.
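A typical fp16 block – these keys are DeepSpeed’s; the values shown are common starting points, not tuned settings:

```json
{
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "initial_scale_power": 16,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  }
}
```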
Setting "loss_scale": 0 enables dynamic loss scaling. Don’t set both fp16 and bf16 to true – pick one.
Gradient Checkpointing
Gradient checkpointing trades compute for memory by recomputing activations during the backward pass instead of storing them. For large models, this is non-negotiable.
With Hugging Face, it is a single flag on the TrainingArguments or a single call on the model.
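Both of the following are real Transformers APIs; model is assumed to be a loaded PreTrainedModel:

```python
from transformers import TrainingArguments

# Either flip the Trainer flag...
args = TrainingArguments(output_dir="./checkpoints", gradient_checkpointing=True)

# ...or enable it directly on the model before training.
model.gradient_checkpointing_enable()
```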
With standalone DeepSpeed, enable it on your model before calling deepspeed.initialize().
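For a Hugging Face model that means model.gradient_checkpointing_enable(); for a plain PyTorch module you can wrap the expensive blocks with torch.utils.checkpoint yourself. A toy sketch with a hypothetical residual network:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class Block(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, x):
        return x + self.ff(x)

class Net(nn.Module):
    def __init__(self, dim=32, depth=4):
        super().__init__()
        self.blocks = nn.ModuleList(Block(dim) for _ in range(depth))

    def forward(self, x):
        for blk in self.blocks:
            # Activations inside blk are recomputed during backward, not stored.
            x = checkpoint(blk, x, use_reentrant=False)
        return x

net = Net()
loss = net(torch.randn(8, 32)).pow(2).mean()
loss.backward()  # gradients flow through the checkpointed blocks
```

Pass the checkpointing-enabled model to deepspeed.initialize() as usual; the engine is agnostic to how activations are recomputed.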
Expect roughly 30-40% slower training in exchange for 50-60% memory reduction on activations.
DeepSpeed with Accelerate
Hugging Face Accelerate wraps DeepSpeed with a simpler interface. Run the config wizard with accelerate config.
Select DeepSpeed when prompted and pick your ZeRO stage. This generates a config file at ~/.cache/huggingface/accelerate/default_config.yaml. Then launch with accelerate launch train.py.
Your training script talks only to the Accelerate API, never to DeepSpeed directly.
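A sketch of the loop – model, optimizer, and dataloader are assumed to exist; Accelerator, prepare(), and accelerator.backward() are Accelerate’s actual API:

```python
from accelerate import Accelerator

accelerator = Accelerator()  # reads DeepSpeed settings from your accelerate config
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

model.train()
for batch in dataloader:
    outputs = model(**batch)
    loss = outputs.loss
    accelerator.backward(loss)  # routes to the DeepSpeed engine under the hood
    optimizer.step()
    optimizer.zero_grad()
```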
Accelerate handles the DeepSpeed engine creation behind the scenes. The advantage is your training script stays portable – you can switch between DDP, FSDP, and DeepSpeed without code changes.
Common Errors
RuntimeError: CUDA out of memory – Even with ZeRO-3, you can OOM if your micro batch size is too large. Drop train_micro_batch_size_per_gpu to 1 and increase gradient_accumulation_steps to keep your effective batch size the same.
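A small helper (hypothetical, pure arithmetic) keeps the bookkeeping straight when you shrink the micro batch:

```python
def accum_steps_for(target_effective_batch, micro_batch, num_gpus):
    """Gradient-accumulation steps that preserve the effective batch size."""
    per_step = micro_batch * num_gpus
    steps, remainder = divmod(target_effective_batch, per_step)
    assert remainder == 0, "effective batch must be divisible by micro_batch * num_gpus"
    return steps

print(accum_steps_for(64, 4, 8))  # 2 steps at micro batch 4
print(accum_steps_for(64, 1, 8))  # dropping micro batch to 1 -> 8 steps
```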
AssertionError: DeepSpeed engine was not initialized – You called model.forward() on the raw model instead of the DeepSpeed engine. Always use the engine returned by deepspeed.initialize().
Timeout waiting for all processes – One GPU is slower or crashed. Check NCCL_DEBUG=INFO output. Common cause: heterogeneous GPUs with different memory sizes. DeepSpeed partitions equally, so the smallest GPU becomes the bottleneck.
stage3_gather_16bit_weights_on_model_save errors during checkpoint saving – Make sure this flag is true in your config. Without it, ZeRO-3 saves sharded weights that are painful to consolidate later.
ValueError: fp16 and bf16 cannot both be enabled – Pick one. Check that your Trainer args and ds_config don’t conflict. Using "auto" in the config prevents this.
Slow training with CPU offload – Offloading adds PCIe transfer overhead. Make sure pin_memory is true and your CPU-GPU interconnect isn’t saturated. On multi-socket systems, pin processes to the NUMA node closest to each GPU using numactl.
Related Guides
- How to Build a Model Training Pipeline with Composer and FSDP
- How to Set Up Multi-GPU Training with PyTorch
- How to Build a Model Training Pipeline with Lightning Fabric
- How to Profile and Optimize GPU Memory for LLM Training
- How to Scale ML Training and Inference with Ray
- How to Speed Up Training with Mixed Precision and PyTorch AMP
- How to Build a Model Training Checkpoint Pipeline with PyTorch
- How to Build a Model Training Queue with Redis and Worker Pools
- How to Build a Multi-Node Training Pipeline with Fabric and NCCL
- How to Set Up a GPU Cluster with Slurm for ML Training