MLflow’s default SQLite backend works fine when you’re the only one running experiments on your laptop. The moment a second person needs access – or you want to run the tracking server behind a load balancer – SQLite falls apart. No concurrent writes, no network access, no replication. PostgreSQL fixes all of that and gives you proper backup tooling, MVCC for concurrent access, and battle-tested reliability.

This guide walks through standing up MLflow with PostgreSQL using Docker Compose, registering and versioning models through the Python SDK, managing stage transitions, and handling the errors you’ll actually hit in production.

Setting Up MLflow with PostgreSQL

The cleanest way to run this stack is Docker Compose. Two services: PostgreSQL for the backend store, and the MLflow tracking server that connects to it. Artifacts go to a local volume here, but you can swap in S3 by changing --default-artifact-root.

# docker-compose.yml
version: "3.8"

services:
  postgres:
    image: postgres:16
    environment:
      POSTGRES_USER: mlflow
      POSTGRES_PASSWORD: mlflow_secret
      POSTGRES_DB: mlflow_db
    ports:
      - "5432:5432"
    volumes:
      - pgdata:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U mlflow -d mlflow_db"]
      interval: 5s
      timeout: 3s
      retries: 5

  mlflow:
    image: python:3.11-slim
    depends_on:
      postgres:
        condition: service_healthy
    ports:
      - "5000:5000"
    volumes:
      - ./mlartifacts:/mlartifacts
    command: >
      bash -c "
        pip install mlflow psycopg2-binary --quiet &&
        mlflow server
          --backend-store-uri postgresql://mlflow:mlflow_secret@postgres:5432/mlflow_db
          --default-artifact-root /mlartifacts
          --host 0.0.0.0
          --port 5000
      "

volumes:
  pgdata:

Start it up:

docker compose up -d

MLflow automatically creates its schema tables in PostgreSQL on first launch. You don’t need to run any migrations manually for a fresh database. The --backend-store-uri flag is what tells MLflow to use PostgreSQL instead of the local filesystem. The connection string format is standard: postgresql://user:password@host:port/database.
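When a connection string is rejected, the problem is usually a malformed URI rather than MLflow itself. A quick stdlib sanity check of the components (using the same credentials as the compose file above):

```python
from urllib.parse import urlparse

uri = "postgresql://mlflow:mlflow_secret@postgres:5432/mlflow_db"
parts = urlparse(uri)

# Each component maps onto postgresql://user:password@host:port/database
print(parts.scheme)            # postgresql
print(parts.username)          # mlflow
print(parts.hostname)          # postgres
print(parts.port)              # 5432
print(parts.path.lstrip("/"))  # mlflow_db
```

Special characters in the password (like @ or /) must be percent-encoded, or the parse splits in the wrong place.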

If you want S3 for artifacts instead of a local volume, swap the artifact root:

mlflow server \
  --backend-store-uri postgresql://mlflow:mlflow_secret@postgres:5432/mlflow_db \
  --default-artifact-root s3://my-mlflow-artifacts/ \
  --host 0.0.0.0 \
  --port 5000

You’ll need boto3 installed and valid AWS credentials available in the MLflow container for S3 to work.
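One way to get those credentials into the container is through environment variables in the compose file. A sketch of the mlflow service with the extra keys, assuming the two AWS variables are exported on the host (the region value is a placeholder):

```yaml
  mlflow:
    # ...image, ports, volumes, and command as above...
    environment:
      AWS_ACCESS_KEY_ID: ${AWS_ACCESS_KEY_ID}
      AWS_SECRET_ACCESS_KEY: ${AWS_SECRET_ACCESS_KEY}
      AWS_DEFAULT_REGION: us-east-1
```

Remember to add boto3 to the pip install line in the service command as well.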

Registering and Versioning Models

With the server running, point your client at it and start logging models. Here’s a full example that trains a model, logs it to an experiment, and registers it in one shot.

import mlflow
import mlflow.sklearn
from mlflow import MlflowClient
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score

mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("breast-cancer-classifier")

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

with mlflow.start_run():
    model = GradientBoostingClassifier(
        n_estimators=200, max_depth=4, learning_rate=0.1, random_state=42
    )
    model.fit(X_train, y_train)

    preds = model.predict(X_test)
    acc = accuracy_score(y_test, preds)
    f1 = f1_score(y_test, preds)

    mlflow.log_param("n_estimators", 200)
    mlflow.log_param("max_depth", 4)
    mlflow.log_param("learning_rate", 0.1)
    mlflow.log_metric("accuracy", acc)
    mlflow.log_metric("f1_score", f1)

    # Register the model -- creates version 1 if it doesn't exist yet
    mlflow.sklearn.log_model(
        sk_model=model,
        artifact_path="model",
        registered_model_name="cancer-gbm",
    )
    print(f"Accuracy: {acc:.4f}, F1: {f1:.4f}")

Every time you run this with the same registered_model_name, MLflow increments the version number automatically. Version 1, version 2, version 3 – each linked back to its training run, parameters, and metrics.

To inspect what you’ve registered:

client = MlflowClient(tracking_uri="http://localhost:5000")

# List all versions of the model
versions = client.search_model_versions("name='cancer-gbm'")
for v in versions:
    print(f"Version {v.version} | Status: {v.status} | Run ID: {v.run_id}")

# Get details on a specific version
version_info = client.get_model_version("cancer-gbm", "1")
print(f"Run ID: {version_info.run_id}")
print(f"Source: {version_info.source}")
print(f"Created: {version_info.creation_timestamp}")

Stage Transitions and Deployment

MLflow 2.9+ deprecates the old Staging/Production/Archived stages in favor of aliases. Aliases are strictly better – you can name them whatever you want, assign multiple to a version, and swap them without any state machine nonsense. But the stage-based API still works if you’re on an older MLflow version or prefer that workflow.

Here’s the alias-based workflow, which is what you should use:

from mlflow import MlflowClient

client = MlflowClient(tracking_uri="http://localhost:5000")

# Tag version 1 as the production champion
client.set_registered_model_alias("cancer-gbm", "champion", 1)

# Register a new version from a different run, then set it as challenger
client.set_registered_model_alias("cancer-gbm", "challenger", 2)

# Check what the champion alias points to
champion = client.get_model_version_by_alias("cancer-gbm", "champion")
print(f"Champion is version {champion.version}")

# After validation, promote challenger to champion
client.set_registered_model_alias("cancer-gbm", "champion", 2)
client.delete_registered_model_alias("cancer-gbm", "challenger")
print("Version 2 is now the champion")

The real power is in loading models by alias. Your serving code never hardcodes a version number:

import mlflow.pyfunc

# Always loads whatever version has the "champion" alias
model = mlflow.pyfunc.load_model("models:/cancer-gbm@champion")

# Run inference
predictions = model.predict(X_test)
print(f"Predictions shape: {predictions.shape}")

When you promote a new version to champion, the next call to load_model picks it up. No code changes, no redeployment. Your serving infrastructure just points at models:/cancer-gbm@champion and you manage promotions entirely through the registry.
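One caveat: the alias is resolved at load time, so a long-running server has to re-load to see a promotion. A minimal TTL-cache sketch of that pattern – the loader is injectable here for testing; in real serving you’d pass mlflow.pyfunc.load_model:

```python
import time


class AliasModelCache:
    """Cache a model loaded by alias URI and re-resolve it after `ttl` seconds."""

    def __init__(self, model_uri, loader, ttl=300.0):
        self.model_uri = model_uri
        self.loader = loader      # e.g. mlflow.pyfunc.load_model
        self.ttl = ttl
        self._model = None
        self._loaded_at = 0.0

    def get(self):
        now = time.monotonic()
        if self._model is None or now - self._loaded_at > self.ttl:
            # Re-resolves the @champion alias on each reload
            self._model = self.loader(self.model_uri)
            self._loaded_at = now
        return self._model


# Usage, assuming a running tracking server:
# cache = AliasModelCache("models:/cancer-gbm@champion", mlflow.pyfunc.load_model)
# predictions = cache.get().predict(X_test)
```

Pick a TTL that matches how quickly you need promotions to take effect versus how often you can afford a full model reload.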

You can also add metadata tags to track approval status, training data versions, or anything else:

client.set_model_version_tag("cancer-gbm", "1", "validation_status", "approved")
client.set_model_version_tag("cancer-gbm", "1", "trained_by", "qasim")
client.set_model_version_tag("cancer-gbm", "2", "dataset_version", "breast-cancer-v2")

client.update_model_version(
    name="cancer-gbm",
    version="2",
    description="GBM with 200 trees, trained on v2 dataset. F1: 0.972",
)

Common Errors and Fixes

PostgreSQL connection refused:

sqlalchemy.exc.OperationalError: (psycopg2.OperationalError)
could not connect to server: Connection refused

This usually means PostgreSQL isn’t ready when MLflow tries to connect. In the Docker Compose file above, the healthcheck and depends_on with condition: service_healthy handle this. If you’re running outside Docker, check that PostgreSQL is listening on the right interface. Edit postgresql.conf to set listen_addresses = '*' and add a line to pg_hba.conf:

host    mlflow_db    mlflow    0.0.0.0/0    scram-sha-256

Then restart PostgreSQL. Also verify the port isn’t blocked by a firewall.
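Before touching config files, it’s worth confirming the port is reachable at all. A small stdlib check – substitute whatever host and port your connection string uses:

```python
import socket


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    print("postgres reachable:", port_open("localhost", 5432))
```

If this prints False, the problem is networking or PostgreSQL configuration, not MLflow.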

Schema migration errors after MLflow upgrade:

When you upgrade MLflow, the database schema might need updating. Run:

mlflow db upgrade postgresql://mlflow:mlflow_secret@localhost:5432/mlflow_db

Always back up the database before running migrations. A simple pg_dump mlflow_db > backup.sql saves you if something goes wrong.

S3 artifact store permissions:

If you use S3 for artifacts, the MLflow server process needs s3:PutObject, s3:GetObject, and s3:ListBucket permissions. When running in Docker, pass AWS credentials as environment variables or mount ~/.aws/credentials. The error looks like botocore.exceptions.ClientError: AccessDenied – check your IAM policy first, then verify the bucket name and region match.
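A minimal IAM policy sketch covering those three actions – the bucket name is a placeholder, and note that ListBucket applies to the bucket ARN while the object actions apply to its contents:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::my-mlflow-artifacts"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": "arn:aws:s3:::my-mlflow-artifacts/*"
    }
  ]
}
```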

Model name collisions:

create_registered_model() throws RESOURCE_ALREADY_EXISTS if the name is taken. Use mlflow.sklearn.log_model() with registered_model_name instead – it creates the model if missing and adds a version if it exists. Or check first:

from mlflow.exceptions import MlflowException

try:
    client.get_registered_model("cancer-gbm")
except MlflowException:
    client.create_registered_model("cancer-gbm")

Large model artifacts slowing down registration:

Big models (multi-GB) can make the log_model call painfully slow, especially over a network artifact store. Two things help: use S3 with multipart uploads (MLflow handles this automatically through boto3), and set MLFLOW_ARTIFACT_UPLOAD_DOWNLOAD_TIMEOUT to a higher value in the server environment. For truly large models, consider logging them as artifacts separately and linking the URI manually with mlflow.register_model("s3://bucket/path/to/model", "model-name").