You’ve got three replicas of your model server running. Requests come in, and you need something in front to distribute them, check health, and kill bad connections before they eat GPU time. That something is Envoy.

Envoy is the best proxy for gRPC ML traffic. It has native gRPC support — not bolted on as an afterthought. It handles HTTP/2 properly, which is a requirement for gRPC. It does health checking, circuit breaking, retries, and observability out of the box. NGINX can do gRPC, but Envoy was built for it.

Here’s what the architecture looks like:

Client (grpcurl / your app)
        │
        ▼
   Envoy Proxy (port 8080)
    ┌────┼────┐
    ▼    ▼    ▼
 Model  Model  Model
 Server Server Server
 :50051 :50052 :50053

Clients hit Envoy on a single port. Envoy routes to healthy model servers using round-robin. If a server goes down, Envoy stops sending traffic to it. Simple, effective, production-ready.
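As a mental model, that routing behavior can be sketched in a few lines of Python — a toy stand-in for Envoy's ROUND_ROBIN policy plus health-based exclusion, not how Envoy actually implements it:

```python
class RoundRobinPool:
    """Toy model of round-robin load balancing over healthy endpoints."""

    def __init__(self, endpoints):
        self.endpoints = list(endpoints)
        self.healthy = set(self.endpoints)
        self._i = 0

    def mark_unhealthy(self, ep):
        self.healthy.discard(ep)

    def mark_healthy(self, ep):
        self.healthy.add(ep)

    def pick(self):
        if not self.healthy:
            raise RuntimeError("no healthy upstreams")
        # Advance the pointer until it lands on a healthy endpoint
        for _ in range(len(self.endpoints)):
            ep = self.endpoints[self._i % len(self.endpoints)]
            self._i += 1
            if ep in self.healthy:
                return ep
```

Mark an endpoint unhealthy and it simply stops appearing in the rotation — which is exactly what you'll observe from Envoy when a replica's health checks fail.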

Defining the gRPC Service

Start with the protobuf definition. This describes the prediction API contract between clients and your model servers.

Create prediction.proto:

syntax = "proto3";

package prediction;

service PredictionService {
  rpc Predict (PredictRequest) returns (PredictResponse);
  rpc HealthCheck (HealthRequest) returns (HealthResponse);
}

message PredictRequest {
  repeated float features = 1;
  string model_name = 2;
}

message PredictResponse {
  repeated float predictions = 1;
  string model_name = 2;
  float latency_ms = 3;
}

message HealthRequest {}

message HealthResponse {
  bool healthy = 1;
  string model_name = 2;
}

Compile the proto to Python:

pip install grpcio grpcio-tools scikit-learn
python -m grpc_tools.protoc \
  --python_out=. \
  --grpc_python_out=. \
  --proto_path=. \
  prediction.proto

This generates prediction_pb2.py and prediction_pb2_grpc.py. Now implement the server.

Create model_server.py:

import time
import os
from concurrent import futures

import grpc
import numpy as np
from sklearn.ensemble import RandomForestClassifier

import prediction_pb2
import prediction_pb2_grpc


class PredictionServicer(prediction_pb2_grpc.PredictionServiceServicer):
    def __init__(self):
        self.model_name = os.getenv("MODEL_NAME", "rf-classifier")
        self.model = RandomForestClassifier(n_estimators=100, random_state=42)
        # Train on dummy data at startup
        X_train = np.random.rand(1000, 4)
        y_train = (X_train[:, 0] + X_train[:, 1] > 1.0).astype(int)
        self.model.fit(X_train, y_train)
        print(f"Model '{self.model_name}' loaded and ready")

    def Predict(self, request, context):
        start = time.time()
        features = np.array(request.features).reshape(1, -1)

        if features.shape[1] != 4:
            context.set_code(grpc.StatusCode.INVALID_ARGUMENT)
            context.set_details(f"Expected 4 features, got {features.shape[1]}")
            return prediction_pb2.PredictResponse()

        preds = self.model.predict_proba(features)[0].tolist()
        latency = (time.time() - start) * 1000

        return prediction_pb2.PredictResponse(
            predictions=preds,
            model_name=self.model_name,
            latency_ms=latency,
        )

    def HealthCheck(self, request, context):
        return prediction_pb2.HealthResponse(
            healthy=True,
            model_name=self.model_name,
        )


from grpc_health.v1 import health, health_pb2, health_pb2_grpc


def serve():
    port = os.getenv("GRPC_PORT", "50051")
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=10))
    prediction_pb2_grpc.add_PredictionServiceServicer_to_server(
        PredictionServicer(), server
    )

    # Register the standard gRPC health service. Envoy's grpc_health_check
    # probes grpc.health.v1.Health/Check (with an empty service name by
    # default) and will never route to this host without it.
    health_servicer = health.HealthServicer()
    health_pb2_grpc.add_HealthServicer_to_server(health_servicer, server)
    health_servicer.set("", health_pb2.HealthCheckResponse.SERVING)

    server.add_insecure_port(f"0.0.0.0:{port}")
    server.start()
    print(f"gRPC server listening on port {port}")
    server.wait_for_termination()


if __name__ == "__main__":
    serve()

Nothing fancy here. A RandomForest model trained on dummy data at startup. In production, you’d load a real model from disk or a model registry. The key point is the gRPC interface — Envoy doesn’t care what’s behind it as long as it speaks gRPC.
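As a sketch of that "load a real model from disk" path — MODEL_PATH and the pickle format here are illustrative assumptions, and a real setup would pull from a registry and verify the artifact before trusting it:

```python
import os
import pickle


def load_model(path=None):
    """Return a model unpickled from MODEL_PATH if one exists, else None
    so the caller can fall back to training the dummy model at startup."""
    path = path or os.getenv("MODEL_PATH")
    if path and os.path.exists(path):
        with open(path, "rb") as f:
            # In production: verify a checksum / schema version first
            return pickle.load(f)
    return None
```

The servicer's `__init__` would call this first and only train the dummy RandomForest when it returns None.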

Configuring Envoy as a Gateway

This is where Envoy earns its keep. Create envoy.yaml:

static_resources:
  listeners:
    - name: grpc_listener
      address:
        socket_address:
          address: 0.0.0.0
          port_value: 8080
      filter_chains:
        - filters:
            - name: envoy.filters.network.http_connection_manager
              typed_config:
                "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                stat_prefix: grpc_gateway
                codec_type: AUTO
                route_config:
                  name: local_route
                  virtual_hosts:
                    - name: prediction_service
                      domains: ["*"]
                      routes:
                        - match:
                            prefix: "/prediction.PredictionService/"
                          route:
                            cluster: model_servers
                            timeout: 30s
                        - match:
                            prefix: "/"
                          route:
                            cluster: model_servers
                            timeout: 10s
                http_filters:
                  - name: envoy.filters.http.router
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router

  clusters:
    - name: model_servers
      connect_timeout: 5s
      type: STRICT_DNS
      lb_policy: ROUND_ROBIN
      typed_extension_protocol_options:
        envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
          "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
          explicit_http_config:
            http2_protocol_options: {}
      health_checks:
        - timeout: 2s
          interval: 10s
          unhealthy_threshold: 3
          healthy_threshold: 2
          grpc_health_check: {}
      load_assignment:
        cluster_name: model_servers
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address:
                      address: model-server-1
                      port_value: 50051
              - endpoint:
                  address:
                    socket_address:
                      address: model-server-2
                      port_value: 50051
              - endpoint:
                  address:
                    socket_address:
                      address: model-server-3
                      port_value: 50051

admin:
  address:
    socket_address:
      address: 0.0.0.0
      port_value: 9901

A few things to call out:

  • http2_protocol_options — gRPC runs on HTTP/2. Without this, Envoy speaks HTTP/1.1 to the upstream and your gRPC calls fail with confusing protocol errors. This is the most common misconfiguration.
  • ROUND_ROBIN lb_policy — distributes requests evenly. For ML workloads where some requests take longer (batch vs. single inference), consider LEAST_REQUEST instead.
  • grpc_health_check — Envoy probes the standard gRPC health checking protocol (grpc.health.v1.Health/Check), not the custom HealthCheck RPC in our proto. Your server must register the standard health service, or Envoy marks every host unhealthy.
  • STRICT_DNS — resolves DNS names to addresses. Works perfectly with Docker Compose service names.
  • admin on port 9901 — gives you stats, config dumps, and cluster health at http://localhost:9901/clusters.

Running with Docker Compose

Create a Dockerfile for the model server:

FROM python:3.11-slim

WORKDIR /app

RUN pip install --no-cache-dir grpcio grpcio-tools grpcio-health-checking scikit-learn numpy

COPY prediction.proto .
RUN python -m grpc_tools.protoc --python_out=. --grpc_python_out=. --proto_path=. prediction.proto

COPY model_server.py .

CMD ["python", "model_server.py"]

Now the docker-compose.yaml:

services:
  model-server-1:
    build: .
    environment:
      - GRPC_PORT=50051
      - MODEL_NAME=rf-replica-1
    ports:
      - "50051:50051"

  model-server-2:
    build: .
    environment:
      - GRPC_PORT=50051
      - MODEL_NAME=rf-replica-2
    ports:
      - "50052:50051"

  model-server-3:
    build: .
    environment:
      - GRPC_PORT=50051
      - MODEL_NAME=rf-replica-3
    ports:
      - "50053:50051"

  envoy:
    image: envoyproxy/envoy:v1.31-latest
    volumes:
      - ./envoy.yaml:/etc/envoy/envoy.yaml:ro
    ports:
      - "8080:8080"
      - "9901:9901"
    depends_on:
      - model-server-1
      - model-server-2
      - model-server-3
    command: envoy -c /etc/envoy/envoy.yaml --log-level info

Start everything:

docker compose up --build -d

Test with grpcurl:

# Install grpcurl if you don't have it
# brew install grpcurl   (macOS)
# go install github.com/fullstorydev/grpcurl/cmd/grpcurl@latest   (Go)

# The server doesn't register gRPC reflection, so point grpcurl
# at the proto file with -proto

# List available services
grpcurl -proto prediction.proto -plaintext localhost:8080 list

# Send a prediction request
grpcurl -proto prediction.proto -plaintext \
  -d '{"features": [0.5, 0.8, 0.3, 0.1], "model_name": "rf-classifier"}' \
  localhost:8080 prediction.PredictionService/Predict

You should see the response come back with predictions and the model name. Run the predict call multiple times and watch the model_name field rotate through rf-replica-1, rf-replica-2, rf-replica-3. That’s round-robin in action.

Check Envoy’s admin dashboard at http://localhost:9901/clusters to see health status and request counts per upstream.
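If you'd rather check cluster health programmatically, here's a small parser for the admin endpoint's text output. The `cluster::host:port::health_flags::value` line shape is an assumption based on recent Envoy versions — verify it against your own /clusters output:

```python
import urllib.request


def parse_cluster_health(text):
    """Extract health_flags lines from Envoy's /clusters text output.
    Assumes lines shaped like `cluster::host:port::health_flags::healthy`."""
    health = {}
    for line in text.splitlines():
        parts = line.strip().split("::")
        if len(parts) == 4 and parts[2] == "health_flags":
            health[(parts[0], parts[1])] = parts[3]
    return health


def fetch_cluster_health(admin="http://localhost:9901"):
    """Fetch and parse /clusters from a running Envoy admin endpoint."""
    with urllib.request.urlopen(f"{admin}/clusters") as resp:
        return parse_cluster_health(resp.read().decode())
```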

Adding Circuit Breaking and Rate Limiting

GPU memory is precious. One misbehaving client sending a flood of requests can starve your other users. Circuit breaking and rate limiting protect your model servers.

Add these settings to the model_servers cluster in envoy.yaml:

  clusters:
    - name: model_servers
      connect_timeout: 5s
      type: STRICT_DNS
      lb_policy: ROUND_ROBIN
      circuit_breakers:
        thresholds:
          - priority: DEFAULT
            max_connections: 100
            max_pending_requests: 50
            max_requests: 200
            max_retries: 3
            retry_budget:
              budget_percent:
                value: 20.0
              min_retry_concurrency: 5
      typed_extension_protocol_options:
        envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
          "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
          explicit_http_config:
            http2_protocol_options:
              max_concurrent_streams: 50
      health_checks:
        - timeout: 2s
          interval: 10s
          unhealthy_threshold: 3
          healthy_threshold: 2
          grpc_health_check: {}
      load_assignment:
        cluster_name: model_servers
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address:
                      address: model-server-1
                      port_value: 50051
              - endpoint:
                  address:
                    socket_address:
                      address: model-server-2
                      port_value: 50051
              - endpoint:
                  address:
                    socket_address:
                      address: model-server-3
                      port_value: 50051

Here’s what each setting does:

  • max_connections: 100 — caps total TCP connections to the cluster. Once hit, new connections get queued or rejected.
  • max_pending_requests: 50 — limits how many requests wait in the queue. This prevents unbounded memory growth when model servers are slow.
  • max_requests: 200 — hard limit on active requests across all connections.
  • max_concurrent_streams: 50 — limits HTTP/2 streams per connection. This is critical for gRPC because a single HTTP/2 connection can multiplex many requests. Without this, one client can open thousands of streams.
  • retry_budget — caps retries at 20% of active requests. Prevents retry storms from amplifying load during outages.

When a circuit breaker trips, Envoy returns an immediate gRPC UNAVAILABLE status. Your client sees a fast failure instead of a slow timeout. That’s what you want — fail fast, retry with backoff, don’t pile up on a struggling server.
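Client-side, "retry with backoff" usually means exponential backoff with jitter. A sketch of the schedule computation (parameter names are illustrative):

```python
import random


def backoff_delays(base=0.1, cap=5.0, attempts=5, jitter=random.random):
    """Exponential backoff with full jitter: delay n is a random fraction
    of min(cap, base * 2**n), which spreads retries from many clients
    instead of synchronizing them into a thundering herd."""
    return [min(cap, base * 2 ** n) * jitter() for n in range(attempts)]
```

Sleep for each delay between attempts, and stop retrying entirely on non-retryable statuses like INVALID_ARGUMENT.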

For per-client rate limiting, you’d add a rate limit filter pointing to an external rate limit service (like envoy-ratelimit). But for most ML serving setups, circuit breakers alone handle the common failure modes.
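For intuition, per-client rate limiting typically boils down to a token bucket like the one rate limit services implement. A toy version (not the actual service's code — the real thing runs the counters in Redis):

```python
class TokenBucket:
    """Toy token bucket: refills `rate` tokens/second, stores up to `burst`.
    A request is allowed if a token is available at its arrival time."""

    def __init__(self, rate, burst, now=0.0):
        self.rate, self.burst = rate, burst
        self.tokens = float(burst)
        self.last = now

    def allow(self, now):
        # Refill proportionally to elapsed time, capped at burst
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A burst of requests drains the bucket immediately; sustained traffic is then held to `rate` requests per second.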

Common Errors and Fixes

“upstream connect error or disconnect/reset before headers. reset reason: connection failure”

This means Envoy can’t reach the upstream server. Check that your model servers are actually running and listening on the right port. Verify the hostnames in envoy.yaml match the Docker Compose service names exactly. Run docker compose logs model-server-1 to confirm the server started.

# Verify servers are reachable from the Envoy container
docker compose exec envoy sh -c "nc -z model-server-1 50051 && echo OK"

“Error validating config: … Didn’t find a registered implementation for 'envoy.filters.http.router'”

You’re running a newer Envoy image that requires the typed config format. Make sure your http_filters block uses the full typed config, not the shorthand:

# Wrong (old format)
http_filters:
  - name: envoy.filters.http.router

# Right (typed config format)
http_filters:
  - name: envoy.filters.http.router
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router

Validate your config before deploying:

docker run --rm \
  -v ./envoy.yaml:/etc/envoy/envoy.yaml:ro \
  envoyproxy/envoy:v1.31-latest \
  envoy --mode validate -c /etc/envoy/envoy.yaml

“Health check failures — all upstream hosts unhealthy”

Envoy’s grpc_health_check expects the standard gRPC health checking protocol (grpc.health.v1.Health/Check). If your server doesn’t implement it, every health check fails and Envoy marks all upstreams as unhealthy. Either implement the standard health protocol:

# Add to model_server.py (grpcio-health-checking is already in the Dockerfile)
from grpc_health.v1 import health
from grpc_health.v1 import health_pb2
from grpc_health.v1 import health_pb2_grpc

def serve():
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=10))

    # Register your prediction service
    prediction_pb2_grpc.add_PredictionServiceServicer_to_server(
        PredictionServicer(), server
    )

    # Register the standard gRPC health service
    health_servicer = health.HealthServicer()
    health_pb2_grpc.add_HealthServicer_to_server(health_servicer, server)
    # Envoy's grpc_health_check queries the empty service name by default,
    # so mark it SERVING along with the named service
    health_servicer.set("", health_pb2.HealthCheckResponse.SERVING)
    health_servicer.set(
        "prediction.PredictionService",
        health_pb2.HealthCheckResponse.SERVING,
    )

    server.add_insecure_port("0.0.0.0:50051")
    server.start()
    server.wait_for_termination()

Or, if you can't modify the server, fall back to a plain TCP health check (tcp_health_check: {}) in Envoy. It only verifies that the port accepts connections, but unlike an HTTP health check it won't false-fail against a gRPC-only server that has no HTTP paths to probe.