The fastest way to tank your ML system is deploying a new model to 100% of users in one shot. Feature flags let you roll out models gradually, compare versions in shadow mode, and kill bad deployments instantly without redeploying code.
Here’s how to wire up feature flags with FastAPI to control which model version serves each request.
## Fast Track: LaunchDarkly with FastAPI
LaunchDarkly is the simplest option if you don’t want to host infrastructure. You get percentage rollouts, user targeting, and kill switches out of the box.
```python
from fastapi import FastAPI, Request
import ldclient
from ldclient import Context
from ldclient.config import Config
import pickle
import os

# Initialize the LaunchDarkly client once at startup
ldclient.set_config(Config(os.getenv("LAUNCHDARKLY_SDK_KEY")))
ld_client = ldclient.get()

app = FastAPI()

# Load your model versions
with open("model_v1.pkl", "rb") as f:
    model_v1 = pickle.load(f)
with open("model_v2.pkl", "rb") as f:
    model_v2 = pickle.load(f)

@app.post("/predict")
async def predict(request: Request, features: dict):
    # Create a user context (use a real user ID in production)
    user_id = request.headers.get("X-User-ID", "anonymous")
    context = Context.builder(user_id).build()

    # Check which model version to use
    model_version = ld_client.variation("ml-model-version", context, default="v1")

    # Route to the right model
    if model_version == "v2":
        prediction = model_v2.predict([list(features.values())])[0]
    else:
        prediction = model_v1.predict([list(features.values())])[0]

    return {
        "prediction": float(prediction),
        "model_version": model_version,
        "user_id": user_id,
    }

@app.on_event("shutdown")
def shutdown():
    ld_client.close()
```
In the LaunchDarkly dashboard, create a flag called ml-model-version with variations v1 and v2. Start with a 5% rollout to v2, monitor your metrics, then ramp up. If v2 tanks your latency or accuracy, flip it back to 0% instantly.
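A kill switch only works if flag evaluation itself can't take down the request. A small defensive sketch of that idea (the `resolve_model_version` helper and the stub client below are hypothetical, not part of the LaunchDarkly SDK):

```python
def resolve_model_version(client, flag_key, context, fallback="v1"):
    """Evaluate the flag, but fail safe to the fallback version on any error."""
    try:
        return client.variation(flag_key, context, default=fallback)
    except Exception:
        # Client not initialized, network down, etc. -- serve the known-good model
        return fallback

# Demo with a stub client that simulates an outage
class BrokenClient:
    def variation(self, flag_key, context, default):
        raise ConnectionError("flag service unreachable")

print(resolve_model_version(BrokenClient(), "ml-model-version", None))  # v1
```

Whatever you return from the fallback path, make it the model you'd trust at 3 AM.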
## Self-Hosted: Unleash for Full Control
If you need to own the infrastructure, Unleash gives you the same features without vendor lock-in. Run it with Docker Compose and you’re done.
```python
from fastapi import FastAPI, Request
from UnleashClient import UnleashClient
import pickle
import os

app = FastAPI()

# Initialize the Unleash client
unleash_client = UnleashClient(
    url="http://unleash:4242/api",
    app_name="ml-service",
    custom_headers={"Authorization": os.getenv("UNLEASH_API_TOKEN")},
)
unleash_client.initialize_client()

with open("model_v1.pkl", "rb") as f:
    model_v1 = pickle.load(f)
with open("model_v2.pkl", "rb") as f:
    model_v2 = pickle.load(f)

@app.post("/predict")
async def predict(request: Request, features: dict):
    user_id = request.headers.get("X-User-ID", "anonymous")

    # Unleash context for targeting
    context = {
        "userId": user_id,
        "properties": {
            "region": request.headers.get("X-Region", "us-east-1"),
        },
    }

    # Check the feature flag
    use_v2 = unleash_client.is_enabled("ml-model-v2", context)

    if use_v2:
        prediction = model_v2.predict([list(features.values())])[0]
        version = "v2"
    else:
        prediction = model_v1.predict([list(features.values())])[0]
        version = "v1"

    return {
        "prediction": float(prediction),
        "model_version": version,
    }

@app.on_event("shutdown")
def shutdown():
    unleash_client.destroy()
```
Deploy Unleash with this `docker-compose.yml`:

```yaml
version: '3'
services:
  unleash:
    image: unleashorg/unleash-server:latest
    ports:
      - "4242:4242"
    environment:
      DATABASE_URL: postgres://unleash:password@postgres/unleash
      DATABASE_SSL: "false"
    depends_on:
      - postgres
  postgres:
    image: postgres:14
    environment:
      POSTGRES_DB: unleash
      POSTGRES_USER: unleash
      POSTGRES_PASSWORD: password
    volumes:
      - postgres_data:/var/lib/postgresql/data
volumes:
  postgres_data:
```
Run `docker-compose up -d`, hit `http://localhost:4242`, and create your first toggle. Use gradual rollout or custom targeting rules (e.g., "only enterprise customers get v2").
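Those targeting rules work by matching constraints against the custom context properties your service passes along. A toy illustration of the matching logic (the `constraint_matches` helper is mine, not part of the Unleash SDK; real constraints are configured in the Unleash UI):

```python
def constraint_matches(context: dict, field: str, allowed: list) -> bool:
    """Mimic an Unleash-style IN constraint on a custom context property."""
    return context.get("properties", {}).get(field) in allowed

# Pass a "plan" property so a rule like `plan IN [enterprise]` can target it
enterprise_ctx = {"userId": "user-42", "properties": {"plan": "enterprise"}}
free_ctx = {"userId": "user-99", "properties": {"plan": "free"}}

print(constraint_matches(enterprise_ctx, "plan", ["enterprise"]))  # True
print(constraint_matches(free_ctx, "plan", ["enterprise"]))        # False
```

The key point: the server-side rule can only see properties you actually put in the context, so wire them in from day one.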
## Shadow Mode: Compare Models Without Risk
Shadow mode runs both models on every request but only returns one prediction to the user. You log both outputs to compare accuracy, latency, or drift before committing to the new model.
```python
# Builds on the LaunchDarkly setup above: ld_client, Context, app,
# and both models are already defined
from fastapi import Request
import logging
import time

logger = logging.getLogger("shadow_mode")

@app.post("/predict")
async def predict(request: Request, features: dict):
    user_id = request.headers.get("X-User-ID", "anonymous")
    context = Context.builder(user_id).build()

    # Check the shadow mode flag
    shadow_enabled = ld_client.variation("shadow-mode-v2", context, default=False)

    # Always get the v1 prediction (production)
    start = time.time()
    pred_v1 = model_v1.predict([list(features.values())])[0]
    latency_v1 = time.time() - start

    # If shadow mode is on, also get the v2 prediction
    if shadow_enabled:
        start = time.time()
        pred_v2 = model_v2.predict([list(features.values())])[0]
        latency_v2 = time.time() - start

        # Log the comparison (ship to Datadog, Prometheus, etc.)
        logger.info({
            "user_id": user_id,
            "features": features,
            "prediction_v1": float(pred_v1),
            "prediction_v2": float(pred_v2),
            "latency_v1_ms": latency_v1 * 1000,
            "latency_v2_ms": latency_v2 * 1000,
            "difference": abs(float(pred_v1) - float(pred_v2)),
        })

    # Always return v1 to the user
    return {
        "prediction": float(pred_v1),
        "model_version": "v1",
    }
```
Enable shadow mode for 10% of traffic, let it run for a week, then analyze your logs. If v2 shows 5% better accuracy and similar latency, you’re good to switch. If it’s slower or less accurate, you never impacted users.
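The analysis pass over those logs can be a few lines of plain Python. A sketch, assuming each record was captured as a dict with the same field names as the `logger.info` payload (the `summarize_shadow_logs` helper is made up for illustration):

```python
def summarize_shadow_logs(records: list) -> dict:
    """Aggregate shadow-mode records into the numbers that decide the rollout."""
    n = len(records)
    return {
        "mean_abs_difference": sum(r["difference"] for r in records) / n,
        "mean_latency_v1_ms": sum(r["latency_v1_ms"] for r in records) / n,
        "mean_latency_v2_ms": sum(r["latency_v2_ms"] for r in records) / n,
    }

# Two toy records with the same fields as the shadow-mode log payload
records = [
    {"difference": 1.0, "latency_v1_ms": 12.0, "latency_v2_ms": 15.0},
    {"difference": 3.0, "latency_v1_ms": 14.0, "latency_v2_ms": 17.0},
]
print(summarize_shadow_logs(records))
# {'mean_abs_difference': 2.0, 'mean_latency_v1_ms': 13.0, 'mean_latency_v2_ms': 16.0}
```

In practice you'd run this over a week of logs in your warehouse, but the decision inputs are the same three numbers.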
## Percentage Rollouts: The Right Way to Ship Models
Never go from 0% to 100%. Use this rollout schedule:
- 5% for 24 hours - Catch obvious bugs, check error rates
- 25% for 48 hours - Monitor business metrics (CTR, conversion, etc.)
- 50% for 48 hours - Statistical significance kicks in
- 100% - Full rollout if everything looks good
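If you'd rather not rely on someone remembering to click the dashboard at the right time, the schedule can live in code and drive the flag through the provider's API. A minimal sketch of the stage table above (the `target_percentage` helper is hypothetical):

```python
# (start_hour, percentage) -- mirrors the 5/25/50/100 schedule above
ROLLOUT_STAGES = [
    (0, 5),     # hours 0-24: 5%
    (24, 25),   # hours 24-72: 25%
    (72, 50),   # hours 72-120: 50%
    (120, 100), # hour 120 onward: 100%
]

def target_percentage(hours_since_start: float) -> int:
    """Return the rollout percentage the schedule calls for at this point."""
    current = 0
    for start_hour, pct in ROLLOUT_STAGES:
        if hours_since_start >= start_hour:
            current = pct
    return current

print(target_percentage(12))   # 5
print(target_percentage(30))   # 25
print(target_percentage(200))  # 100
```

A cron job can call this and push the result to the flag, but keep a human in the loop for the jump to 100%.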
Both LaunchDarkly and Unleash support percentage-based targeting. In LaunchDarkly, use “Percentage rollout” in the targeting rules. In Unleash, use the “Gradual Rollout” strategy.
```text
# Unleash percentage rollout config (in the UI)
# Strategy: Gradual Rollout by User ID
# Percentage: 25
# Stickiness: userId (ensures the same user always gets the same variant)
```
Stickiness matters. Without it, a user might see predictions from v1 on one request and v2 on the next, which breaks anything stateful or conversational.
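Under the hood, stickiness is just deterministic hashing: the bucket is derived from the user ID and flag name, so it never changes between requests. A rough sketch of the idea (real SDKs use their own hash functions, typically MurmurHash rather than MD5, so this is illustrative only):

```python
import hashlib

def in_rollout(user_id: str, flag_name: str, percentage: int) -> bool:
    """Sticky bucketing: the same (user, flag) pair always lands in the same bucket."""
    digest = hashlib.md5(f"{flag_name}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percentage

# Same user, same answer, on every request
assert in_rollout("user-42", "ml-model-v2", 25) == in_rollout("user-42", "ml-model-v2", 25)
# Ramping up never kicks a user back out: anyone in at 25% is still in at 50%
assert not in_rollout("user-42", "ml-model-v2", 0)
assert in_rollout("user-42", "ml-model-v2", 100)
```

Note the flag name is part of the hash input, so different flags slice the user base differently instead of always selecting the same 25% of users.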
## Common Errors and Fixes
“Feature flag client returns default value every time”
Check your SDK key and network connectivity. LaunchDarkly and Unleash both need to reach their servers at startup to fetch flag state. Until that initial sync succeeds, every evaluation returns the default value.
```python
# Fail fast at startup if the client never synced
if not ld_client.is_initialized():
    raise RuntimeError("LaunchDarkly client failed to initialize")
```
“Predictions are inconsistent for the same user”
You’re not using sticky targeting. In percentage rollouts, make sure the hash key is the user ID, not a random value. Both LaunchDarkly and Unleash default to sticky, but custom strategies might break this.
“Shadow mode doubles my latency”
You’re running predictions sequentially. Use asyncio.gather() to run both models in parallel:
```python
import asyncio

async def get_predictions(features: dict):
    loop = asyncio.get_event_loop()
    # Both predict() calls are blocking, so push them onto the thread pool
    pred_v1, pred_v2 = await asyncio.gather(
        loop.run_in_executor(None, model_v1.predict, [list(features.values())]),
        loop.run_in_executor(None, model_v2.predict, [list(features.values())]),
    )
    return pred_v1[0], pred_v2[0]
```
Now both models run concurrently and you only pay the cost of the slower one.
“Flag state is stale by 30+ seconds”
The Unleash Python client polls on an interval (15 seconds by default), and LaunchDarkly falls back to polling when streaming is disabled. For near-instant updates, make sure LaunchDarkly streaming is on and shorten the Unleash refresh interval:
```python
# LaunchDarkly streaming (on by default, shown explicitly here)
ldclient.set_config(Config(os.getenv("LAUNCHDARKLY_SDK_KEY"), stream=True))
ld_client = ldclient.get()

# Unleash has no streaming mode; shorten the polling interval instead
unleash_client = UnleashClient(
    url="http://unleash:4242/api",
    app_name="ml-service",
    refresh_interval=1,  # poll every second instead of the 15-second default
)
```
## Flagsmith: The Open-Source Middle Ground
If you want self-hosted but don’t want to run Postgres, try Flagsmith. It’s lighter than Unleash and has a better UI than rolling your own.
```python
from fastapi import FastAPI, Request
from flagsmith import Flagsmith
from flagsmith.models import Flags
import os
import pickle

app = FastAPI()
flagsmith = Flagsmith(environment_key=os.getenv("FLAGSMITH_ENV_KEY"))

with open("model_v1.pkl", "rb") as f:
    model_v1 = pickle.load(f)
with open("model_v2.pkl", "rb") as f:
    model_v2 = pickle.load(f)

@app.post("/predict")
async def predict(request: Request, features: dict):
    user_id = request.headers.get("X-User-ID", "anonymous")

    # Get flags evaluated for this specific user
    identity_flags: Flags = flagsmith.get_identity_flags(identifier=user_id)
    model_version = identity_flags.get_feature_value("ml-model-version") or "v1"

    if model_version == "v2":
        prediction = model_v2.predict([list(features.values())])[0]
    else:
        prediction = model_v1.predict([list(features.values())])[0]

    return {"prediction": float(prediction), "model_version": model_version}
```
Flagsmith has a hosted option too, but the self-hosted version is dead simple: one Docker container, no Postgres required (uses SQLite by default).
## My Recommendation
For production ML systems, go with Unleash if you have DevOps support to run it, or LaunchDarkly if you don’t. Unleash gives you more control and costs nothing but server time. LaunchDarkly is faster to set up and has better analytics out of the box.
Avoid building your own feature flag system. You’ll spend weeks reinventing percentage rollouts, user targeting, and flag state synchronization. These tools are battle-tested on millions of requests per second.
Always start with shadow mode, then do a slow percentage rollout. If you skip straight to 50%, you’ll get paged at 3 AM when your new model crashes half your traffic.