The fastest way to tank your ML system is deploying a new model to 100% of users in one shot. Feature flags let you roll out models gradually, compare versions in shadow mode, and kill bad deployments instantly without redeploying code.
Here’s how to wire up feature flags with FastAPI to control which model version serves each request.
## Fast Track: LaunchDarkly with FastAPI
LaunchDarkly is the simplest option if you don’t want to host infrastructure. You get percentage rollouts, user targeting, and kill switches out of the box.
```python
from fastapi import FastAPI, Request
import ldclient
from ldclient import Context
from ldclient.config import Config
import pickle
import os

# Initialize the LaunchDarkly client once at startup
ldclient.set_config(Config(os.getenv("LAUNCHDARKLY_SDK_KEY")))
ld_client = ldclient.get()

app = FastAPI()

# Load your model versions
with open("model_v1.pkl", "rb") as f:
    model_v1 = pickle.load(f)
with open("model_v2.pkl", "rb") as f:
    model_v2 = pickle.load(f)

@app.post("/predict")
async def predict(request: Request, features: dict):
    # Create a user context (use a real user ID in production)
    user_id = request.headers.get("X-User-ID", "anonymous")
    context = Context.builder(user_id).build()

    # Check which model version to use
    model_version = ld_client.variation("ml-model-version", context, default="v1")

    # Route to the right model
    if model_version == "v2":
        prediction = model_v2.predict([list(features.values())])[0]
    else:
        prediction = model_v1.predict([list(features.values())])[0]

    return {
        "prediction": float(prediction),
        "model_version": model_version,
        "user_id": user_id,
    }

@app.on_event("shutdown")
def shutdown():
    ld_client.close()
```
In the LaunchDarkly dashboard, create a flag called ml-model-version with variations v1 and v2. Start with a 5% rollout to v2, monitor your metrics, then ramp up. If v2 tanks your latency or accuracy, flip it back to 0% instantly.
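A kill switch only works if flag evaluation itself can't take down the request. A small defensive sketch of that idea (the `resolve_model_version` helper and the stub client below are hypothetical, not part of the LaunchDarkly SDK):

```python
def resolve_model_version(client, flag_key, context, fallback="v1"):
    """Evaluate the flag, but fail safe to the fallback version on any error."""
    try:
        return client.variation(flag_key, context, default=fallback)
    except Exception:
        # Client not initialized, network down, etc. -- serve the known-good model
        return fallback

# Demo with a stub client that simulates an outage
class BrokenClient:
    def variation(self, flag_key, context, default):
        raise ConnectionError("flag service unreachable")

print(resolve_model_version(BrokenClient(), "ml-model-version", None))  # v1
```

Whatever you return from the fallback path, make it the model you'd trust at 3 AM.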
## Self-Hosted: Unleash for Full Control
If you need to own the infrastructure, Unleash gives you the same features without vendor lock-in. Run it with Docker Compose and you’re done.
```python
from fastapi import FastAPI, Request
from UnleashClient import UnleashClient
import pickle
import os

app = FastAPI()

# Initialize the Unleash client
unleash_client = UnleashClient(
    url="http://unleash:4242/api",
    app_name="ml-service",
    custom_headers={"Authorization": os.getenv("UNLEASH_API_TOKEN")},
)
unleash_client.initialize_client()

with open("model_v1.pkl", "rb") as f:
    model_v1 = pickle.load(f)
with open("model_v2.pkl", "rb") as f:
    model_v2 = pickle.load(f)

@app.post("/predict")
async def predict(request: Request, features: dict):
    user_id = request.headers.get("X-User-ID", "anonymous")

    # Unleash context for targeting
    context = {
        "userId": user_id,
        "properties": {
            "region": request.headers.get("X-Region", "us-east-1"),
        },
    }

    # Check the feature flag
    use_v2 = unleash_client.is_enabled("ml-model-v2", context)

    if use_v2:
        prediction = model_v2.predict([list(features.values())])[0]
        version = "v2"
    else:
        prediction = model_v1.predict([list(features.values())])[0]
        version = "v1"

    return {
        "prediction": float(prediction),
        "model_version": version,
    }

@app.on_event("shutdown")
def shutdown():
    unleash_client.destroy()
```
Deploy Unleash with this `docker-compose.yml`:

```yaml
version: '3'
services:
  unleash:
    image: unleashorg/unleash-server:latest
    ports:
      - "4242:4242"
    environment:
      DATABASE_URL: postgres://unleash:password@postgres/unleash
      DATABASE_SSL: "false"
    depends_on:
      - postgres
  postgres:
    image: postgres:14
    environment:
      POSTGRES_DB: unleash
      POSTGRES_USER: unleash
      POSTGRES_PASSWORD: password
    volumes:
      - postgres_data:/var/lib/postgresql/data
volumes:
  postgres_data:
```
Run `docker-compose up -d`, hit `http://localhost:4242`, and create your first toggle. Use gradual rollout or custom targeting rules (e.g., "only enterprise customers get v2").
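Those targeting rules work by matching constraints against the custom context properties your service passes along. A toy illustration of the matching logic (the `constraint_matches` helper is mine, not part of the Unleash SDK; real constraints are configured in the Unleash UI):

```python
def constraint_matches(context: dict, field: str, allowed: list) -> bool:
    """Mimic an Unleash-style IN constraint on a custom context property."""
    return context.get("properties", {}).get(field) in allowed

# Pass a "plan" property so a rule like `plan IN [enterprise]` can target it
enterprise_ctx = {"userId": "user-42", "properties": {"plan": "enterprise"}}
free_ctx = {"userId": "user-99", "properties": {"plan": "free"}}

print(constraint_matches(enterprise_ctx, "plan", ["enterprise"]))  # True
print(constraint_matches(free_ctx, "plan", ["enterprise"]))        # False
```

The key point: the server-side rule can only see properties you actually put in the context, so wire them in from day one.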
## Shadow Mode: Compare Models Without Risk
Shadow mode runs both models on every request but only returns one prediction to the user. You log both outputs to compare accuracy, latency, or drift before committing to the new model.
```python
# Builds on the LaunchDarkly setup above: ld_client, Context, app,
# and both models are already defined
from fastapi import Request
import logging
import time

logger = logging.getLogger("shadow_mode")

@app.post("/predict")
async def predict(request: Request, features: dict):
    user_id = request.headers.get("X-User-ID", "anonymous")
    context = Context.builder(user_id).build()

    # Check the shadow mode flag
    shadow_enabled = ld_client.variation("shadow-mode-v2", context, default=False)

    # Always get the v1 prediction (production)
    start = time.time()
    pred_v1 = model_v1.predict([list(features.values())])[0]
    latency_v1 = time.time() - start

    # If shadow mode is on, also get the v2 prediction
    if shadow_enabled:
        start = time.time()
        pred_v2 = model_v2.predict([list(features.values())])[0]
        latency_v2 = time.time() - start

        # Log the comparison (ship to Datadog, Prometheus, etc.)
        logger.info({
            "user_id": user_id,
            "features": features,
            "prediction_v1": float(pred_v1),
            "prediction_v2": float(pred_v2),
            "latency_v1_ms": latency_v1 * 1000,
            "latency_v2_ms": latency_v2 * 1000,
            "difference": abs(float(pred_v1) - float(pred_v2)),
        })

    # Always return v1 to the user
    return {
        "prediction": float(pred_v1),
        "model_version": "v1",
    }
```
Enable shadow mode for 10% of traffic, let it run for a week, then analyze your logs. If v2 shows 5% better accuracy and similar latency, you’re good to switch. If it’s slower or less accurate, you never impacted users.
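The analysis pass over those logs can be a few lines of plain Python. A sketch, assuming each record was captured as a dict with the same field names as the `logger.info` payload (the `summarize_shadow_logs` helper is made up for illustration):

```python
def summarize_shadow_logs(records: list) -> dict:
    """Aggregate shadow-mode records into the numbers that decide the rollout."""
    n = len(records)
    return {
        "mean_abs_difference": sum(r["difference"] for r in records) / n,
        "mean_latency_v1_ms": sum(r["latency_v1_ms"] for r in records) / n,
        "mean_latency_v2_ms": sum(r["latency_v2_ms"] for r in records) / n,
    }

# Two toy records with the same fields as the shadow-mode log payload
records = [
    {"difference": 1.0, "latency_v1_ms": 12.0, "latency_v2_ms": 15.0},
    {"difference": 3.0, "latency_v1_ms": 14.0, "latency_v2_ms": 17.0},
]
print(summarize_shadow_logs(records))
# {'mean_abs_difference': 2.0, 'mean_latency_v1_ms': 13.0, 'mean_latency_v2_ms': 16.0}
```

In practice you'd run this over a week of logs in your warehouse, but the decision inputs are the same three numbers.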
## Percentage Rollouts: The Right Way to Ship Models
Never go from 0% to 100%. Use this rollout schedule:
- 5% for 24 hours - Catch obvious bugs, check error rates
- 25% for 48 hours - Monitor business metrics (CTR, conversion, etc.)
- 50% for 48 hours - Statistical significance kicks in
- 100% - Full rollout if everything looks good
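If you'd rather not rely on someone remembering to click the dashboard at the right time, the schedule can live in code and drive the flag through the provider's API. A minimal sketch of the stage table above (the `target_percentage` helper is hypothetical):

```python
# (start_hour, percentage) -- mirrors the 5/25/50/100 schedule above
ROLLOUT_STAGES = [
    (0, 5),     # hours 0-24: 5%
    (24, 25),   # hours 24-72: 25%
    (72, 50),   # hours 72-120: 50%
    (120, 100), # hour 120 onward: 100%
]

def target_percentage(hours_since_start: float) -> int:
    """Return the rollout percentage the schedule calls for at this point."""
    current = 0
    for start_hour, pct in ROLLOUT_STAGES:
        if hours_since_start >= start_hour:
            current = pct
    return current

print(target_percentage(12))   # 5
print(target_percentage(30))   # 25
print(target_percentage(200))  # 100
```

A cron job can call this and push the result to the flag, but keep a human in the loop for the jump to 100%.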
Both LaunchDarkly and Unleash support percentage-based targeting. In LaunchDarkly, use “Percentage rollout” in the targeting rules. In Unleash, use the “Gradual Rollout” strategy.
```text
# Unleash percentage rollout config (in the UI)
# Strategy: Gradual Rollout by User ID
# Percentage: 25
# Stickiness: userId (ensures the same user always gets the same variant)
```
Stickiness matters. Without it, a user might see predictions from v1 on one request and v2 on the next, which breaks anything stateful or conversational.
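Under the hood, stickiness is just deterministic hashing: the bucket is derived from the user ID and flag name, so it never changes between requests. A rough sketch of the idea (real SDKs use their own hash functions, typically MurmurHash rather than MD5, so this is illustrative only):

```python
import hashlib

def in_rollout(user_id: str, flag_name: str, percentage: int) -> bool:
    """Sticky bucketing: the same (user, flag) pair always lands in the same bucket."""
    digest = hashlib.md5(f"{flag_name}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percentage

# Same user, same answer, on every request
assert in_rollout("user-42", "ml-model-v2", 25) == in_rollout("user-42", "ml-model-v2", 25)
# Ramping up never kicks a user back out: anyone in at 25% is still in at 50%
assert not in_rollout("user-42", "ml-model-v2", 0)
assert in_rollout("user-42", "ml-model-v2", 100)
```

Note the flag name is part of the hash input, so different flags slice the user base differently instead of always selecting the same 25% of users.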
## Common Errors and Fixes
“Feature flag client returns default value every time”
Check your SDK key and network connectivity. LaunchDarkly and Unleash both need to reach their servers at startup to fetch flag state. Until that initial sync succeeds, every evaluation returns the default value.
```python
# Fail fast at startup if the client never synced
if not ld_client.is_initialized():
    raise RuntimeError("LaunchDarkly client failed to initialize")
```
“Predictions are inconsistent for the same user”
You’re not using sticky targeting. In percentage rollouts, make sure the hash key is the user ID, not a random value. Both LaunchDarkly and Unleash default to sticky, but custom strategies might break this.
“Shadow mode doubles my latency”
You’re running predictions sequentially. Use asyncio.gather() to run both models in parallel:
```python
import asyncio

async def get_predictions(features: dict):
    loop = asyncio.get_event_loop()
    # Both predict() calls are blocking, so push them onto the thread pool
    pred_v1, pred_v2 = await asyncio.gather(
        loop.run_in_executor(None, model_v1.predict, [list(features.values())]),
        loop.run_in_executor(None, model_v2.predict, [list(features.values())]),
    )
    return pred_v1[0], pred_v2[0]
```
Now both models run concurrently and you only pay the cost of the slower one.
“Flag state is stale by 30+ seconds”
The Unleash Python client polls on an interval (15 seconds by default), and LaunchDarkly falls back to polling when streaming is disabled. For near-instant updates, make sure LaunchDarkly streaming is on and shorten the Unleash refresh interval:
```python
# LaunchDarkly streaming (on by default, shown explicitly here)
ldclient.set_config(Config(os.getenv("LAUNCHDARKLY_SDK_KEY"), stream=True))
ld_client = ldclient.get()

# Unleash has no streaming mode; shorten the polling interval instead
unleash_client = UnleashClient(
    url="http://unleash:4242/api",
    app_name="ml-service",
    refresh_interval=1,  # poll every second instead of the 15-second default
)
```
## Flagsmith: The Open-Source Middle Ground
If you want self-hosted but don’t want to run Postgres, try Flagsmith. It’s lighter than Unleash and has a better UI than rolling your own.
```python
from fastapi import FastAPI, Request
from flagsmith import Flagsmith
from flagsmith.models import Flags
import os
import pickle

app = FastAPI()
flagsmith = Flagsmith(environment_key=os.getenv("FLAGSMITH_ENV_KEY"))

with open("model_v1.pkl", "rb") as f:
    model_v1 = pickle.load(f)
with open("model_v2.pkl", "rb") as f:
    model_v2 = pickle.load(f)

@app.post("/predict")
async def predict(request: Request, features: dict):
    user_id = request.headers.get("X-User-ID", "anonymous")

    # Get flags evaluated for this specific user
    identity_flags: Flags = flagsmith.get_identity_flags(identifier=user_id)
    model_version = identity_flags.get_feature_value("ml-model-version") or "v1"

    if model_version == "v2":
        prediction = model_v2.predict([list(features.values())])[0]
    else:
        prediction = model_v1.predict([list(features.values())])[0]

    return {"prediction": float(prediction), "model_version": model_version}
```
Flagsmith has a hosted option too, but the self-hosted version is dead simple: one Docker container, no Postgres required (uses SQLite by default).
## My Recommendation
For production ML systems, go with Unleash if you have DevOps support to run it, or LaunchDarkly if you don’t. Unleash gives you more control and costs nothing but server time. LaunchDarkly is faster to set up and has better analytics out of the box.
Avoid building your own feature flag system. You’ll spend weeks reinventing percentage rollouts, user targeting, and flag state synchronization. These tools are battle-tested on millions of requests per second.
Always start with shadow mode, then do a slow percentage rollout. If you skip straight to 50%, you’ll get paged at 3 AM when your new model crashes half your traffic.