If you’ve run more than a handful of training jobs, you know the pain: scattered logs, forgotten hyperparameters, and no easy way to ask “which learning rate worked best with dropout 0.3?” MLflow handles the logging side well, but its built-in UI becomes clunky once you have hundreds of runs. DuckDB fixes that – you dump your run data into an in-process SQL engine and slice it however you want in milliseconds.
Here’s the shape of the pipeline: you train models, MLflow logs everything, and DuckDB gives you instant SQL access over the results. First, install the dependencies:
```shell
pip install mlflow duckdb scikit-learn pandas
```
Setting Up MLflow with a Local Backend
MLflow stores experiment data in a tracking URI. For local work, a SQLite backend keeps things simple and portable. Set the tracking URI before you log anything:
```python
import mlflow

mlflow.set_tracking_uri("sqlite:///mlflow.db")
mlflow.set_experiment("hyperparameter-search")

print(f"Tracking URI: {mlflow.get_tracking_uri()}")
```
This creates an mlflow.db file in your working directory. All parameters, metrics, and run metadata get stored there. Artifacts (model files, plots) go into a local mlruns/ directory by default.
You can also point MLflow at a remote server if your team needs shared access, but the local SQLite backend is perfect for single-machine experimentation and pairs well with DuckDB later.
Logging Experiments from Training Runs
Here’s a realistic training loop that logs everything you’d actually care about – hyperparameters, per-epoch metrics, and the final model. We’ll use scikit-learn to keep the ML part simple so the focus stays on the tracking pipeline:
```python
import time

import mlflow
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

mlflow.set_tracking_uri("sqlite:///mlflow.db")
mlflow.set_experiment("hyperparameter-search")

X, y = make_classification(
    n_samples=2000,
    n_features=20,
    n_informative=10,
    n_classes=2,
    random_state=42,
)

param_grid = [
    {"n_estimators": 50, "max_depth": 5, "min_samples_split": 2},
    {"n_estimators": 100, "max_depth": 10, "min_samples_split": 5},
    {"n_estimators": 200, "max_depth": 15, "min_samples_split": 10},
    {"n_estimators": 100, "max_depth": 5, "min_samples_split": 2},
    {"n_estimators": 200, "max_depth": 10, "min_samples_split": 5},
    {"n_estimators": 50, "max_depth": 15, "min_samples_split": 10},
    {"n_estimators": 150, "max_depth": 8, "min_samples_split": 3},
    {"n_estimators": 300, "max_depth": 20, "min_samples_split": 8},
]

for params in param_grid:
    with mlflow.start_run():
        mlflow.log_params(params)
        mlflow.set_tag("model_type", "RandomForest")
        mlflow.set_tag("dataset", "synthetic_classification")

        start_time = time.time()
        clf = RandomForestClassifier(**params, random_state=42, n_jobs=-1)
        scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
        train_duration = time.time() - start_time

        mlflow.log_metric("mean_accuracy", np.mean(scores))
        mlflow.log_metric("std_accuracy", np.std(scores))
        mlflow.log_metric("train_duration_seconds", train_duration)
        for fold_idx, score in enumerate(scores):
            mlflow.log_metric("fold_accuracy", score, step=fold_idx)

        clf.fit(X, y)
        mlflow.sklearn.log_model(clf, "model")

        print(f"Params: {params} | Accuracy: {np.mean(scores):.4f} +/- {np.std(scores):.4f}")
```
Each start_run() block creates an isolated run. Parameters, metrics, and the serialized model all get stored atomically. If the run crashes mid-way, MLflow marks it as failed so you don’t confuse partial results with complete ones.
Exporting MLflow Data to DuckDB
MLflow’s search_runs function returns a Pandas DataFrame. DuckDB can query DataFrames directly – no export step needed. But if you want persistent storage or want to join experiment data with other tables, writing to a DuckDB file makes more sense:
```python
import duckdb
import mlflow
import pandas as pd

mlflow.set_tracking_uri("sqlite:///mlflow.db")

runs_df = mlflow.search_runs(
    experiment_names=["hyperparameter-search"],
    output_format="pandas",
)
print(f"Loaded {len(runs_df)} runs with {len(runs_df.columns)} columns")
print(f"Columns: {list(runs_df.columns)}")

con = duckdb.connect("experiments.duckdb")
con.execute("DROP TABLE IF EXISTS runs")
con.execute("CREATE TABLE runs AS SELECT * FROM runs_df")

row_count = con.execute("SELECT COUNT(*) FROM runs").fetchone()[0]
print(f"Stored {row_count} runs in experiments.duckdb")
con.close()
```
MLflow prefixes parameter columns with "params." and metric columns with "metrics.", so n_estimators becomes params.n_estimators and mean_accuracy becomes metrics.mean_accuracy. DuckDB handles these dotted column names fine – you just need to quote them.
Querying Experiments with SQL
This is where DuckDB shines. Instead of clicking through the MLflow UI or writing Pandas filter chains, you write SQL. Here are the queries you’ll actually use:
```python
import duckdb

con = duckdb.connect("experiments.duckdb")

# Find the top 5 runs by accuracy
top_runs = con.execute("""
    SELECT
        run_id,
        "params.n_estimators" AS n_estimators,
        "params.max_depth" AS max_depth,
        "params.min_samples_split" AS min_samples_split,
        "metrics.mean_accuracy" AS accuracy,
        "metrics.std_accuracy" AS std_accuracy,
        "metrics.train_duration_seconds" AS duration_s
    FROM runs
    WHERE status = 'FINISHED'
    ORDER BY accuracy DESC
    LIMIT 5
""").fetchdf()
print("=== Top 5 Runs ===")
print(top_runs.to_string(index=False))

# Compare average accuracy by max_depth
depth_stats = con.execute("""
    SELECT
        "params.max_depth" AS max_depth,
        COUNT(*) AS num_runs,
        ROUND(AVG("metrics.mean_accuracy"), 4) AS avg_accuracy,
        ROUND(MIN("metrics.mean_accuracy"), 4) AS min_accuracy,
        ROUND(MAX("metrics.mean_accuracy"), 4) AS max_accuracy
    FROM runs
    WHERE status = 'FINISHED'
    GROUP BY "params.max_depth"
    ORDER BY avg_accuracy DESC
""").fetchdf()
print("\n=== Accuracy by Max Depth ===")
print(depth_stats.to_string(index=False))

# Find runs where training was fast AND accuracy was high
efficient_runs = con.execute("""
    SELECT
        run_id,
        "params.n_estimators" AS n_estimators,
        "params.max_depth" AS max_depth,
        ROUND("metrics.mean_accuracy", 4) AS accuracy,
        ROUND("metrics.train_duration_seconds", 2) AS duration_s
    FROM runs
    WHERE status = 'FINISHED'
      AND "metrics.mean_accuracy" > 0.90
      AND "metrics.train_duration_seconds" < 5.0
    ORDER BY accuracy DESC
""").fetchdf()
print("\n=== Fast and Accurate Runs ===")
print(efficient_runs.to_string(index=False))

con.close()
```
The quoted column names ("params.n_estimators") are the key detail. DuckDB treats the dots as part of the column name, not as a table qualifier, as long as you quote them.
Building a Reporting Function
Wrap the whole pipeline into a reusable function that refreshes DuckDB from MLflow and returns the top N experiments. This is the kind of thing you’d call from a notebook or a CI step after a hyperparameter sweep finishes:
```python
import duckdb
import mlflow
import pandas as pd


def get_top_experiments(
    experiment_name: str,
    metric: str = "metrics.mean_accuracy",
    top_n: int = 5,
    tracking_uri: str = "sqlite:///mlflow.db",
) -> pd.DataFrame:
    """Query MLflow experiments through DuckDB and return the top N runs."""
    mlflow.set_tracking_uri(tracking_uri)
    runs_df = mlflow.search_runs(
        experiment_names=[experiment_name],
        output_format="pandas",
    )
    if runs_df.empty:
        print(f"No runs found for experiment '{experiment_name}'")
        return pd.DataFrame()

    con = duckdb.connect(":memory:")
    con.execute("CREATE TABLE runs AS SELECT * FROM runs_df")

    query = f"""
        SELECT
            run_id,
            start_time,
            "{metric}" AS target_metric,
            status
        FROM runs
        WHERE status = 'FINISHED'
        ORDER BY target_metric DESC
        LIMIT {top_n}
    """
    result = con.execute(query).fetchdf()

    # Pull all params and metrics for just the top runs
    top_run_ids = result["run_id"].tolist()
    param_cols = [c for c in runs_df.columns if c.startswith("params.")]
    metric_cols = [c for c in runs_df.columns if c.startswith("metrics.")]
    detail_df = runs_df[runs_df["run_id"].isin(top_run_ids)][
        ["run_id"] + param_cols + metric_cols
    ].copy()

    # Clean up column names for readability
    detail_df.columns = [
        c.replace("params.", "").replace("metrics.", "")
        for c in detail_df.columns
    ]

    con.close()
    return detail_df


# Usage
results = get_top_experiments("hyperparameter-search", top_n=3)
print(results.to_string(index=False))
```
The :memory: connection means DuckDB doesn’t write anything to disk – it just uses the DataFrame as a temporary table. For repeated queries or dashboards, swap :memory: for a file path so you aren’t re-reading MLflow on every call.
Common Errors and Fixes
MlflowException: No experiment with name 'X' exists
You set the experiment name in search_runs but never created it. Either call mlflow.set_experiment("X") first, or check the exact name in the MLflow UI. Names are case-sensitive.
duckdb.BinderException: column "params.n_estimators" not found
You forgot to quote the column name. DuckDB requires double quotes around column names that contain dots:
```sql
-- Wrong
SELECT params.n_estimators FROM runs;

-- Right
SELECT "params.n_estimators" FROM runs;
```
TypeError: 'NoneType' object is not subscriptable when reading metrics
Some runs may have failed before logging metrics. Always filter with WHERE status = 'FINISHED' to exclude incomplete runs.
mlflow.exceptions.MlflowException: Run already active
You have a nested start_run() call or a previous run didn’t close cleanly. Use mlflow.end_run() to force-close it, or restructure your code so each with mlflow.start_run(): block is independent.
DuckDB returns empty results after re-running the pipeline
If you recreated the DuckDB table but MLflow’s search_runs returned an empty DataFrame (maybe you changed the experiment name), the table exists but has zero rows. Print len(runs_df) right after search_runs to verify data is flowing through.
Slow queries on large experiment sets
DuckDB handles millions of rows easily, but if you’re loading thousands of MLflow runs, the bottleneck is usually search_runs itself. Pass max_results to limit what MLflow returns, or filter with filter_string="metrics.mean_accuracy > 0.85" to narrow it down server-side.