MLflow’s default SQLite backend works fine when you’re the only one running experiments on your laptop. The moment a second person needs access – or you want to run the tracking server behind a load balancer – SQLite falls apart. No concurrent writes, no network access, no replication. PostgreSQL fixes all of that and gives you proper backup tooling, MVCC for concurrent access, and battle-tested reliability.
This guide walks through standing up MLflow with PostgreSQL using Docker Compose, registering and versioning models through the Python SDK, managing stage transitions, and handling the errors you’ll actually hit in production.
Setting Up MLflow with PostgreSQL
The cleanest way to run this stack is Docker Compose. Two services: PostgreSQL for the backend store, and the MLflow tracking server that connects to it. Artifacts go to a local volume here, but you can swap in S3 by changing --default-artifact-root.
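A compose file along these lines works. The service names, credentials, and `mlflow_db` database name are illustrative — and note the stock `ghcr.io/mlflow/mlflow` image does not ship with a PostgreSQL driver, so many teams build a thin custom image that adds `psycopg2-binary`:

```yaml
services:
  postgres:
    image: postgres:16
    environment:
      POSTGRES_USER: mlflow
      POSTGRES_PASSWORD: mlflow
      POSTGRES_DB: mlflow_db
    volumes:
      - pgdata:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U mlflow -d mlflow_db"]
      interval: 5s
      timeout: 5s
      retries: 5

  mlflow:
    image: ghcr.io/mlflow/mlflow:latest  # assumes psycopg2 is available in the image
    depends_on:
      postgres:
        condition: service_healthy
    ports:
      - "5000:5000"
    volumes:
      - mlflow-artifacts:/mlflow/artifacts
    command: >
      mlflow server
      --host 0.0.0.0
      --port 5000
      --backend-store-uri postgresql://mlflow:mlflow@postgres:5432/mlflow_db
      --default-artifact-root /mlflow/artifacts

volumes:
  pgdata:
  mlflow-artifacts:
```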
Start it up:
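From the directory containing the compose file:

```bash
docker compose up -d
docker compose logs -f mlflow   # watch until the server reports it is listening on port 5000
```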
MLflow automatically creates its schema tables in PostgreSQL on first launch. You don’t need to run any migrations manually for a fresh database. The --backend-store-uri flag is what tells MLflow to use PostgreSQL instead of the local filesystem. The connection string format is standard: postgresql://user:password@host:port/database.
If you want S3 for artifacts instead of a local volume, swap the artifact root:
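Only the artifact root changes; the backend store stays on PostgreSQL. Bucket name and credentials here are placeholders:

```bash
mlflow server \
  --host 0.0.0.0 \
  --port 5000 \
  --backend-store-uri postgresql://mlflow:mlflow@postgres:5432/mlflow_db \
  --default-artifact-root s3://my-mlflow-artifacts/prod
```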
You’ll need boto3 installed and valid AWS credentials available in the MLflow container for S3 to work.
Registering and Versioning Models
With the server running, point your client at it and start logging models. Here’s a full example that trains a model, logs it to an experiment, and registers it in one shot.
Every time you run this with the same registered_model_name, MLflow increments the version number automatically. Version 1, version 2, version 3 – each linked back to its training run, parameters, and metrics.
To inspect what you’ve registered:
Stage Transitions and Deployment
MLflow 2.9+ deprecates the old Staging/Production/Archived stages in favor of aliases. Aliases are strictly better – you can name them whatever you want, assign multiple to a version, and swap them without any state machine nonsense. But the stage-based API still works if you’re on an older MLflow version or prefer that workflow.
Here’s the alias-based workflow, which is what you should use:
The real power is in loading models by alias. Your serving code never hardcodes a version number:
When you promote a new version to champion, the next call to load_model picks it up. No code changes, no redeployment. Your serving infrastructure just points at models:/cancer-gbm@champion and you manage promotions entirely through the registry.
You can also add metadata tags to track approval status, training data versions, or anything else:
Common Errors and Fixes
PostgreSQL connection refused:
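The traceback from the MLflow server typically ends with something like this (exact wording varies by psycopg2 and libpq version):

```
sqlalchemy.exc.OperationalError: (psycopg2.OperationalError)
connection to server at "postgres" (...), port 5432 failed: Connection refused
```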
This usually means PostgreSQL isn’t ready when MLflow tries to connect. In the Docker Compose file above, the healthcheck and depends_on with condition: service_healthy handle this. If you’re running outside Docker, check that PostgreSQL is listening on the right interface. Edit postgresql.conf to set listen_addresses = '*' and add a line to pg_hba.conf:
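The entry might look like this — database, user, and address range are examples; use `scram-sha-256` on modern PostgreSQL (or `md5` on older installs):

```
# TYPE  DATABASE    USER    ADDRESS       METHOD
host    mlflow_db   mlflow  10.0.0.0/8    scram-sha-256
```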
Then restart PostgreSQL. Also verify the port isn’t blocked by a firewall.
Schema migration errors after MLflow upgrade:
When you upgrade MLflow, the database schema might need updating. Run:
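Point `mlflow db upgrade` at the same URI the server uses (credentials here are example values):

```bash
mlflow db upgrade postgresql://mlflow:mlflow@localhost:5432/mlflow_db
```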
Always back up the database before running migrations. A simple pg_dump mlflow_db > backup.sql saves you if something goes wrong.
S3 artifact store permissions:
If you use S3 for artifacts, the MLflow server process needs s3:PutObject, s3:GetObject, and s3:ListBucket permissions. When running in Docker, pass AWS credentials as environment variables or mount ~/.aws/credentials. The error looks like botocore.exceptions.ClientError: AccessDenied – check your IAM policy first, then verify the bucket name and region match.
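A minimal IAM policy covering those three actions might look like this — the bucket name is a placeholder, and note that `s3:ListBucket` applies to the bucket ARN while the object actions apply to keys inside it:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:GetObject"],
      "Resource": "arn:aws:s3:::my-mlflow-artifacts/*"
    },
    {
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::my-mlflow-artifacts"
    }
  ]
}
```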
Model name collisions:
create_registered_model() raises an error with code RESOURCE_ALREADY_EXISTS if the name is taken. Use mlflow.sklearn.log_model() with registered_model_name instead – it creates the model if missing and adds a version if it exists. Or check first:
Large model artifacts slowing down registration:
Big models (multi-GB) can make the log_model call painfully slow, especially over a network artifact store. Two things help: use S3 with multipart uploads (MLflow handles this automatically through boto3), and set MLFLOW_ARTIFACT_UPLOAD_DOWNLOAD_TIMEOUT to a higher value in the server environment. For truly large models, consider logging them as artifacts separately and linking the URI manually with mlflow.register_model("s3://bucket/path/to/model", "model-name").
Related Guides
- How to Build a Model Artifact Cache with S3 and Local Fallback
- How to Build a Model Artifact Signing and Verification Pipeline
- How to Build a Model Serving Pipeline with Docker Compose and Traefik
- How to Build a Model Artifact CDN with CloudFront and S3
- How to Build a Model Training Pipeline with Lightning Fabric
- How to Build a Model Serving Cost Dashboard with Prometheus and Grafana
- How to Build a Model Artifact Pipeline with ORAS and Container Registries
- How to Build a Model Inference Cost Tracking Pipeline with OpenTelemetry
- How to Build a Model Training Pipeline with AWS SageMaker and Python
- How to Build a Model Training Cost Calculator with Cloud Pricing APIs