If you’ve run more than a handful of training jobs, you know the pain: scattered logs, forgotten hyperparameters, and no easy way to ask “which learning rate worked best with dropout 0.3?” MLflow handles the logging side well, but its built-in UI becomes clunky once you have hundreds of runs. DuckDB fixes that – you dump your run data into an in-process SQL engine and slice it however you want in milliseconds.
Here’s the shape of the pipeline: you train models, MLflow logs everything, and DuckDB gives you instant SQL access over the results. First, install the dependencies:
```shell
pip install mlflow duckdb scikit-learn pandas
```
Setting Up MLflow with a Local Backend
MLflow stores experiment data in a tracking URI. For local work, a SQLite backend keeps things simple and portable. Set the tracking URI before you log anything:
```python
import mlflow

mlflow.set_tracking_uri("sqlite:///mlflow.db")
mlflow.set_experiment("hyperparameter-search")

print(f"Tracking URI: {mlflow.get_tracking_uri()}")
```
This creates an mlflow.db file in your working directory. All parameters, metrics, and run metadata get stored there. Artifacts (model files, plots) go into a local mlruns/ directory by default.
You can also point MLflow at a remote server if your team needs shared access, but the local SQLite backend is perfect for single-machine experimentation and pairs well with DuckDB later.
Logging Experiments from Training Runs
Here’s a realistic training loop that logs everything you’d actually care about – hyperparameters, per-epoch metrics, and the final model. We’ll use scikit-learn to keep the ML part simple so the focus stays on the tracking pipeline:
```python
import time

import mlflow
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

mlflow.set_tracking_uri("sqlite:///mlflow.db")
mlflow.set_experiment("hyperparameter-search")

X, y = make_classification(
    n_samples=2000,
    n_features=20,
    n_informative=10,
    n_classes=2,
    random_state=42,
)

param_grid = [
    {"n_estimators": 50, "max_depth": 5, "min_samples_split": 2},
    {"n_estimators": 100, "max_depth": 10, "min_samples_split": 5},
    {"n_estimators": 200, "max_depth": 15, "min_samples_split": 10},
    {"n_estimators": 100, "max_depth": 5, "min_samples_split": 2},
    {"n_estimators": 200, "max_depth": 10, "min_samples_split": 5},
    {"n_estimators": 50, "max_depth": 15, "min_samples_split": 10},
    {"n_estimators": 150, "max_depth": 8, "min_samples_split": 3},
    {"n_estimators": 300, "max_depth": 20, "min_samples_split": 8},
]

for params in param_grid:
    with mlflow.start_run():
        mlflow.log_params(params)
        mlflow.set_tag("model_type", "RandomForest")
        mlflow.set_tag("dataset", "synthetic_classification")

        start_time = time.time()
        clf = RandomForestClassifier(**params, random_state=42, n_jobs=-1)
        scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
        train_duration = time.time() - start_time

        mlflow.log_metric("mean_accuracy", np.mean(scores))
        mlflow.log_metric("std_accuracy", np.std(scores))
        mlflow.log_metric("train_duration_seconds", train_duration)
        for fold_idx, score in enumerate(scores):
            mlflow.log_metric("fold_accuracy", score, step=fold_idx)

        clf.fit(X, y)
        mlflow.sklearn.log_model(clf, "model")

        print(f"Params: {params} | Accuracy: {np.mean(scores):.4f} +/- {np.std(scores):.4f}")
```
Each start_run() block creates an isolated run. Parameters, metrics, and the serialized model all get stored atomically. If the run crashes mid-way, MLflow marks it as failed so you don’t confuse partial results with complete ones.
Exporting MLflow Data to DuckDB
MLflow’s search_runs function returns a Pandas DataFrame. DuckDB can query DataFrames directly – no export step needed. But if you want persistent storage or want to join experiment data with other tables, writing to a DuckDB file makes more sense:
```python
import duckdb
import mlflow
import pandas as pd

mlflow.set_tracking_uri("sqlite:///mlflow.db")

runs_df = mlflow.search_runs(
    experiment_names=["hyperparameter-search"],
    output_format="pandas",
)
print(f"Loaded {len(runs_df)} runs with {len(runs_df.columns)} columns")
print(f"Columns: {list(runs_df.columns)}")

con = duckdb.connect("experiments.duckdb")
con.execute("DROP TABLE IF EXISTS runs")
con.execute("CREATE TABLE runs AS SELECT * FROM runs_df")

row_count = con.execute("SELECT COUNT(*) FROM runs").fetchone()[0]
print(f"Stored {row_count} runs in experiments.duckdb")
con.close()
```
MLflow prefixes parameter columns with "params." and metric columns with "metrics.", so n_estimators becomes params.n_estimators and mean_accuracy becomes metrics.mean_accuracy. DuckDB handles these dotted column names fine – you just need to quote them.
Querying Experiments with SQL
This is where DuckDB shines. Instead of clicking through the MLflow UI or writing Pandas filter chains, you write SQL. Here are the queries you’ll actually use:
```python
import duckdb

con = duckdb.connect("experiments.duckdb")

# Find the top 5 runs by accuracy
top_runs = con.execute("""
    SELECT
        run_id,
        "params.n_estimators" AS n_estimators,
        "params.max_depth" AS max_depth,
        "params.min_samples_split" AS min_samples_split,
        "metrics.mean_accuracy" AS accuracy,
        "metrics.std_accuracy" AS std_accuracy,
        "metrics.train_duration_seconds" AS duration_s
    FROM runs
    WHERE status = 'FINISHED'
    ORDER BY accuracy DESC
    LIMIT 5
""").fetchdf()
print("=== Top 5 Runs ===")
print(top_runs.to_string(index=False))

# Compare average accuracy by max_depth
depth_stats = con.execute("""
    SELECT
        "params.max_depth" AS max_depth,
        COUNT(*) AS num_runs,
        ROUND(AVG("metrics.mean_accuracy"), 4) AS avg_accuracy,
        ROUND(MIN("metrics.mean_accuracy"), 4) AS min_accuracy,
        ROUND(MAX("metrics.mean_accuracy"), 4) AS max_accuracy
    FROM runs
    WHERE status = 'FINISHED'
    GROUP BY "params.max_depth"
    ORDER BY avg_accuracy DESC
""").fetchdf()
print("\n=== Accuracy by Max Depth ===")
print(depth_stats.to_string(index=False))

# Find runs where training was fast AND accuracy was high
efficient_runs = con.execute("""
    SELECT
        run_id,
        "params.n_estimators" AS n_estimators,
        "params.max_depth" AS max_depth,
        ROUND("metrics.mean_accuracy", 4) AS accuracy,
        ROUND("metrics.train_duration_seconds", 2) AS duration_s
    FROM runs
    WHERE status = 'FINISHED'
      AND "metrics.mean_accuracy" > 0.90
      AND "metrics.train_duration_seconds" < 5.0
    ORDER BY accuracy DESC
""").fetchdf()
print("\n=== Fast and Accurate Runs ===")
print(efficient_runs.to_string(index=False))

con.close()
```
The quoted column names ("params.n_estimators") are the key detail. DuckDB treats the dots as part of the column name, not as a table qualifier, as long as you quote them.
Building a Reporting Function
Wrap the whole pipeline into a reusable function that refreshes DuckDB from MLflow and returns the top N experiments. This is the kind of thing you’d call from a notebook or a CI step after a hyperparameter sweep finishes:
```python
import duckdb
import mlflow
import pandas as pd


def get_top_experiments(
    experiment_name: str,
    metric: str = "metrics.mean_accuracy",
    top_n: int = 5,
    tracking_uri: str = "sqlite:///mlflow.db",
) -> pd.DataFrame:
    """Query MLflow experiments through DuckDB and return the top N runs."""
    mlflow.set_tracking_uri(tracking_uri)
    runs_df = mlflow.search_runs(
        experiment_names=[experiment_name],
        output_format="pandas",
    )
    if runs_df.empty:
        print(f"No runs found for experiment '{experiment_name}'")
        return pd.DataFrame()

    con = duckdb.connect(":memory:")
    con.execute("CREATE TABLE runs AS SELECT * FROM runs_df")

    query = f"""
        SELECT
            run_id,
            start_time,
            "{metric}" AS target_metric,
            status
        FROM runs
        WHERE status = 'FINISHED'
        ORDER BY target_metric DESC
        LIMIT {top_n}
    """
    result = con.execute(query).fetchdf()

    # Pull all params and metrics for just the top runs
    top_run_ids = result["run_id"].tolist()
    param_cols = [c for c in runs_df.columns if c.startswith("params.")]
    metric_cols = [c for c in runs_df.columns if c.startswith("metrics.")]
    detail_df = runs_df[runs_df["run_id"].isin(top_run_ids)][
        ["run_id"] + param_cols + metric_cols
    ].copy()

    # Clean up column names for readability
    detail_df.columns = [
        c.replace("params.", "").replace("metrics.", "")
        for c in detail_df.columns
    ]

    con.close()
    return detail_df


# Usage
results = get_top_experiments("hyperparameter-search", top_n=3)
print(results.to_string(index=False))
```
The :memory: connection means DuckDB doesn’t write anything to disk – it just uses the DataFrame as a temporary table. For repeated queries or dashboards, swap :memory: for a file path so you aren’t re-reading MLflow on every call.
Common Errors and Fixes
MlflowException: No experiment with name 'X' exists
You set the experiment name in search_runs but never created it. Either call mlflow.set_experiment("X") first, or check the exact name in the MLflow UI. Names are case-sensitive.
duckdb.BinderException: column "params.n_estimators" not found
You forgot to quote the column name. DuckDB requires double quotes around column names that contain dots:
```sql
-- Wrong
SELECT params.n_estimators FROM runs;

-- Right
SELECT "params.n_estimators" FROM runs;
```
TypeError: 'NoneType' object is not subscriptable when reading metrics
Some runs may have failed before logging metrics. Always filter with WHERE status = 'FINISHED' to exclude incomplete runs.
mlflow.exceptions.MlflowException: Run already active
You have a nested start_run() call or a previous run didn’t close cleanly. Use mlflow.end_run() to force-close it, or restructure your code so each with mlflow.start_run(): block is independent.
DuckDB returns empty results after re-running the pipeline
If you recreated the DuckDB table but MLflow’s search_runs returned an empty DataFrame (maybe you changed the experiment name), the table exists but has zero rows. Print len(runs_df) right after search_runs to verify data is flowing through.
Slow queries on large experiment sets
DuckDB handles millions of rows easily, but if you’re loading thousands of MLflow runs, the bottleneck is usually search_runs itself. Pass max_results to limit what MLflow returns, or filter with filter_string="metrics.mean_accuracy > 0.85" to narrow it down server-side.