Every ML project starts the same way: a train.py with hardcoded learning rates, batch sizes, and model dimensions scattered across the file. You tweak one number, forget which value worked best, and end up with a dozen train_v3_final_FINAL.py scripts. Hydra fixes this by externalizing all configuration into composable YAML files and giving you automatic output directories, config logging, and multirun sweeps out of the box.

Install Hydra and the Optuna sweeper plugin:

pip install hydra-core==1.3.2 hydra-optuna-sweeper==1.2.0

Setting Up Config Files

Hydra loads YAML configs and injects them as structured objects into your Python code. Create a conf/ directory at your project root with a main config file:

project/
├── conf/
│   ├── config.yaml
│   ├── model/
│   │   ├── resnet.yaml
│   │   └── vit.yaml
│   └── optimizer/
│       ├── adam.yaml
│       └── sgd.yaml
└── train.py

Start with conf/config.yaml:

defaults:
  - model: resnet
  - optimizer: adam
  - _self_

training:
  epochs: 20
  batch_size: 64
  seed: 42

data:
  dataset: cifar10
  num_workers: 4

The defaults list tells Hydra to compose the final config by merging in the model/resnet.yaml and optimizer/adam.yaml files. The _self_ entry controls where the current file’s values sit in the merge order – putting it last means values in config.yaml override anything from the defaults.
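To see why the order matters, here is a toy re-implementation of deep-merge semantics (plain dicts standing in for OmegaConf nodes; this is an illustration, not Hydra's actual code) showing that the config merged later wins on conflicting keys:

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge override into base; later values win on conflicts."""
    out = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(out.get(key), dict):
            out[key] = deep_merge(out[key], value)
        else:
            out[key] = value
    return out

# defaults entry (optimizer: adam) is merged first, _self_ (config.yaml) last
defaults_cfg = {"optimizer": {"name": "adam", "lr": 0.001}}
self_cfg = {"optimizer": {"lr": 0.01}}  # value set directly in config.yaml

print(deep_merge(defaults_cfg, self_cfg))
# {'optimizer': {'name': 'adam', 'lr': 0.01}} -- config.yaml's lr wins
```

Move `_self_` above the other defaults entries and the merge order flips, letting the group files override config.yaml instead.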

Now define the config groups. conf/model/resnet.yaml:

name: resnet18
num_classes: 10
pretrained: true

conf/model/vit.yaml:

name: vit_b_16
num_classes: 10
pretrained: true
patch_size: 16

conf/optimizer/adam.yaml:

name: adam
lr: 0.001
weight_decay: 0.0001

conf/optimizer/sgd.yaml:

name: sgd
lr: 0.01
momentum: 0.9
weight_decay: 0.0005

Switching from ResNet with Adam to ViT with SGD is now a single command-line override – no code changes needed.

Integrating Hydra with a PyTorch Training Loop

Here’s a real training script that uses Hydra to wire everything together:

import os
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms, models
from torchvision.models import ResNet18_Weights, ViT_B_16_Weights
import hydra
from omegaconf import DictConfig, OmegaConf


def build_model(cfg):
    if cfg.model.name == "resnet18":
        weights = ResNet18_Weights.IMAGENET1K_V1 if cfg.model.pretrained else None
        model = models.resnet18(weights=weights)
        model.fc = nn.Linear(model.fc.in_features, cfg.model.num_classes)
    elif cfg.model.name == "vit_b_16":
        weights = ViT_B_16_Weights.IMAGENET1K_V1 if cfg.model.pretrained else None
        model = models.vit_b_16(weights=weights)
        model.heads.head = nn.Linear(model.heads.head.in_features, cfg.model.num_classes)
    else:
        raise ValueError(f"Unknown model: {cfg.model.name}")
    return model


def build_optimizer(cfg, model):
    if cfg.optimizer.name == "adam":
        return torch.optim.Adam(
            model.parameters(),
            lr=cfg.optimizer.lr,
            weight_decay=cfg.optimizer.weight_decay,
        )
    elif cfg.optimizer.name == "sgd":
        return torch.optim.SGD(
            model.parameters(),
            lr=cfg.optimizer.lr,
            momentum=cfg.optimizer.momentum,
            weight_decay=cfg.optimizer.weight_decay,
        )
    else:
        raise ValueError(f"Unknown optimizer: {cfg.optimizer.name}")


@hydra.main(version_base=None, config_path="conf", config_name="config")
def train(cfg: DictConfig):
    # Note: with version_base=None (Hydra >= 1.2), the working directory is
    # left unchanged; set hydra.job.chdir=true to run inside the output folder
    print(f"Working directory: {os.getcwd()}")
    print(f"Full config:\n{OmegaConf.to_yaml(cfg)}")

    torch.manual_seed(cfg.training.seed)

    transform = transforms.Compose([
        transforms.Resize(224),
        transforms.ToTensor(),
        transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
    ])

    train_dataset = datasets.CIFAR10(
        root="./data", train=True, download=True, transform=transform
    )
    train_loader = DataLoader(
        train_dataset,
        batch_size=cfg.training.batch_size,
        shuffle=True,
        num_workers=cfg.data.num_workers,
    )

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = build_model(cfg).to(device)
    optimizer = build_optimizer(cfg, model)
    criterion = nn.CrossEntropyLoss()

    for epoch in range(cfg.training.epochs):
        model.train()
        running_loss = 0.0
        for batch_idx, (images, labels) in enumerate(train_loader):
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            outputs = model(images)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            running_loss += loss.item()

        avg_loss = running_loss / len(train_loader)
        print(f"Epoch {epoch + 1}/{cfg.training.epochs} - Loss: {avg_loss:.4f}")

    # Save model to Hydra's run output directory (works with or without chdir)
    from hydra.core.hydra_config import HydraConfig
    output_dir = HydraConfig.get().runtime.output_dir
    model_path = os.path.join(output_dir, "model.pt")
    torch.save(model.state_dict(), model_path)
    print(f"Model saved to {model_path}")


if __name__ == "__main__":
    train()

Run it with different configurations entirely from the command line:

# Default config (resnet18 + adam)
python train.py

# Switch to ViT with SGD and a different learning rate
python train.py model=vit optimizer=sgd optimizer.lr=0.005

# Override batch size and epochs
python train.py training.batch_size=128 training.epochs=50

Each run gets its own timestamped output directory under outputs/YYYY-MM-DD/HH-MM-SS/. Hydra automatically saves the full resolved config as .hydra/config.yaml in that directory, so you always know exactly which parameters produced which results.

Running Hyperparameter Sweeps

Hydra’s multirun mode lets you sweep over parameter combinations. The basic built-in sweeper uses grid search:

# Grid sweep over learning rates and batch sizes
python train.py --multirun optimizer.lr=0.001,0.01,0.1 training.batch_size=32,64,128

That fires off 9 runs (3 LRs x 3 batch sizes). For smarter search, use the Optuna sweeper. Add a sweep config to conf/config.yaml:
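The run count is simply the Cartesian product of the swept value lists, which you can sanity-check before launching an expensive sweep (a standalone sketch, not a Hydra API):

```python
import itertools

# Values mirrored from the sweep command above
lrs = [0.001, 0.01, 0.1]
batch_sizes = [32, 64, 128]

combos = list(itertools.product(lrs, batch_sizes))
print(len(combos))  # 9 runs
print(combos[0])    # (0.001, 32)
```

Adding a third swept parameter multiplies the count again, which is exactly why grid search stops scaling and a sampler like Optuna's TPE becomes attractive.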

defaults:
  - model: resnet
  - optimizer: adam
  - override hydra/sweeper: optuna
  - _self_

training:
  epochs: 20
  batch_size: 64
  seed: 42

data:
  dataset: cifar10
  num_workers: 4

hydra:
  sweeper:
    sampler:
      _target_: optuna.samplers.TPESampler
      seed: 42
    direction: minimize
    n_trials: 30
    params:
      optimizer.lr: tag(log, interval(0.0001, 0.1))
      training.batch_size: choice(32, 64, 128, 256)

Your training function needs to return a numeric value for Optuna to optimize:

@hydra.main(version_base=None, config_path="conf", config_name="config")
def train(cfg: DictConfig) -> float:
    # ... same training code as above ...

    # Return final validation loss for Optuna to minimize
    model.eval()
    val_loss = 0.0
    val_dataset = datasets.CIFAR10(
        root="./data", train=False, download=True, transform=transform
    )
    val_loader = DataLoader(val_dataset, batch_size=cfg.training.batch_size)

    with torch.no_grad():
        for images, labels in val_loader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            val_loss += criterion(outputs, labels).item()

    avg_val_loss = val_loss / len(val_loader)
    return avg_val_loss

Launch the Optuna sweep:

python train.py --multirun

Optuna’s TPE sampler will intelligently explore the search space across 30 trials, and results land in multirun/YYYY-MM-DD/HH-MM-SS/ with each trial in its own numbered subdirectory.

Reproducibility and Output Management

Hydra creates a clean output structure for every run. Here’s what you get automatically:

outputs/2026-02-15/14-30-22/
├── .hydra/
│   ├── config.yaml        # Resolved config (all overrides applied)
│   ├── hydra.yaml         # Hydra's own settings
│   └── overrides.yaml     # CLI overrides you passed
├── train.log              # Stdout/stderr captured by Hydra
└── model.pt               # Your saved artifacts
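Because the `YYYY-MM-DD/HH-MM-SS` directory names sort lexicographically in chronological order, a small helper (hypothetical, stdlib-only; not part of Hydra) can locate the most recent run:

```python
from pathlib import Path

def latest_run(root: str = "outputs"):
    """Return the most recent Hydra run directory under root, or None.

    Relies on Hydra's outputs/YYYY-MM-DD/HH-MM-SS layout, where
    lexicographic order matches chronological order.
    """
    runs = sorted(p for p in Path(root).glob("*/*") if p.is_dir())
    return runs[-1] if runs else None
```

Handy for quickly inspecting the last run's `.hydra/config.yaml` without hunting through timestamps by hand.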

You can access these paths programmatically inside your training script:

from hydra.core.hydra_config import HydraConfig

@hydra.main(version_base=None, config_path="conf", config_name="config")
def train(cfg: DictConfig):
    hydra_cfg = HydraConfig.get()

    # The output directory for this run
    output_dir = hydra_cfg.runtime.output_dir
    print(f"Saving artifacts to: {output_dir}")

    # The overrides passed on the command line
    overrides = hydra_cfg.overrides.task
    print(f"CLI overrides: {overrides}")

    # Get the original working directory (before Hydra changed it)
    original_cwd = hydra.utils.get_original_cwd()
    data_path = os.path.join(original_cwd, "data")

Since Hydra 1.2, running with version_base=None leaves the working directory unchanged by default. If you want the classic behavior of executing inside the per-run output directory, opt in through your config:

hydra:
  job:
    chdir: true

With chdir: false (the default under version_base=None), your working directory stays at the project root and you access the output dir through HydraConfig.get().runtime.output_dir.

Common Errors and Fixes

MissingMandatoryValue error: You referenced a config key that exists in the schema but has no value. Either set a default in your YAML or pass it on the command line. Check your config groups for any keys set to ??? (Hydra’s “must be provided” marker).
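For example, a config group file can mark a value as mandatory with ??? (this `custom` optimizer file and its lr key are illustrative):

```yaml
# conf/optimizer/custom.yaml -- lr has no default and must be supplied,
# e.g. python train.py optimizer=custom optimizer.lr=0.01
name: custom
lr: ???
```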

Could not load model/custom.yaml: Your config group file is missing or misnamed. Config group files must live in conf/<group_name>/ and the filename (minus .yaml) is the value you use. Double-check the directory structure matches your defaults list.

Key 'foo' is not in struct: You’re trying to set a key that doesn’t exist in the config. By default, Hydra configs are in OmegaConf struct mode, which rejects unknown keys. To add a new key from the command line, use the + prefix:

python train.py +model.dropout=0.1

If you need to add keys programmatically, relax struct mode on the config object:

from omegaconf import OmegaConf

OmegaConf.set_struct(cfg, False)
cfg.model.new_param = 42

The + prefix adds a key that must not already exist, while ++ adds the key or overrides it if it already exists.

Working directory changed unexpectedly: With hydra.job.chdir: true (and by default in Hydra versions before 1.2), Hydra changes os.getcwd() to the run’s output directory. Use hydra.utils.get_original_cwd() to reference files relative to your project root, or leave hydra.job.chdir at false.

Multirun outputs overwriting each other: Each multirun trial gets its own subdirectory, but if you’re writing to a hardcoded path, those writes will collide. Always build output paths from HydraConfig.get().runtime.output_dir (or os.getcwd() when chdir is enabled), never a fixed path shared across trials.

Optuna sweep crashes after a trial completes: The Optuna sweeper minimizes (or maximizes) whatever your @hydra.main function returns, so the function must return a numeric value on every code path. The -> float annotation documents the contract, but what matters is the actual return value; returning None, or forgetting the return statement entirely, breaks the sweep.