Every time you deploy a model serving container, it downloads the same 4 GB weights file from S3. Cold starts take minutes. Your autoscaler spins up a new pod and it sits there pulling bytes while requests queue up. The fix is a two-tier cache: check local disk first, fall back to S3, and only hit the origin (HuggingFace Hub, your model registry) as a last resort.
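The original configuration block was not preserved here; a minimal sketch of environment-variable configuration might look like the following. MODEL_CACHE_DIR and MODEL_CACHE_MAX_BYTES are the names used in the troubleshooting section below; the bucket variable name and defaults are assumptions.

```python
import os

# Where artifacts land on local disk; must be a writable path (see the
# troubleshooting note on read-only container filesystems).
CACHE_DIR = os.environ.get("MODEL_CACHE_DIR", "/tmp/model_cache")

# The S3 bucket for the shared tier. The variable name is an assumption;
# use whatever your deployment already defines.
S3_BUCKET = os.environ.get("MODEL_CACHE_S3_BUCKET", "my-model-artifacts")

# Local cache size cap before LRU eviction kicks in (default: 50 GB).
MAX_BYTES = int(os.environ.get("MODEL_CACHE_MAX_BYTES", str(50 * 1024**3)))
```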
Configuration comes down to three knobs: where to store files locally, which S3 bucket to use, and how big the local cache can grow before eviction kicks in.
The ModelCache Class
The core idea is simple. get() checks local disk, then S3. put() writes to both. evict() removes from both. Every artifact is keyed by model_name/version and verified with a SHA256 checksum stored alongside it.
A few design choices worth calling out: the LRU eviction runs before downloading, so you never blow past your disk limit. Checksum verification is optional; if someone stored an artifact without a checksum file, the download still works but logs a warning. And the in-memory index gets rebuilt from disk on startup, so you survive process restarts without losing cache state.
Integrating with FastAPI
Here’s how to wire the cache into a model serving endpoint. This uses the lifespan context manager pattern — the correct way to handle startup/shutdown in modern FastAPI.
The /cache/stats endpoint is useful for debugging in production. Hit it to see what’s cached, how much disk you’re using, and when each artifact was last accessed.
Cache Hit/Miss Metrics
Logging alone won’t cut it at scale. You want counters you can scrape with Prometheus or push to your metrics backend. Here’s a lightweight wrapper that tracks hit rates.
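The wrapper itself was elided above; a minimal sketch, assuming a record_hit() that takes the tier name and a separate record_miss() for origin fetches, could look like this. The counter names and rate calculation are assumptions.

```python
import threading


class CacheMetrics:
    """Thread-safe hit/miss counters for the two cache tiers."""

    def __init__(self):
        self._lock = threading.Lock()
        self.local_hits = 0   # served straight from disk
        self.s3_hits = 0      # local miss, found in S3
        self.misses = 0       # not in either tier (went to origin)

    def record_hit(self, tier):
        with self._lock:
            if tier == "local":
                self.local_hits += 1
            elif tier == "s3":
                self.s3_hits += 1

    def record_miss(self):
        with self._lock:
            self.misses += 1

    def to_dict(self):
        with self._lock:
            total = self.local_hits + self.s3_hits + self.misses
            return {
                "local_hits": self.local_hits,
                "s3_hits": self.s3_hits,
                "misses": self.misses,
                "local_hit_rate": self.local_hits / total if total else 0.0,
            }
```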
Drop a CacheMetrics instance into your ModelCache.__init__ and call record_hit() in the appropriate branches of get(). Then expose metrics.to_dict() on a /cache/metrics endpoint. In a real deployment, you’d also export these as Prometheus gauges, but the pattern is the same.
Common Errors and Fixes
botocore.exceptions.NoCredentialsError — Your container doesn’t have AWS credentials. If running on ECS or EKS, attach an IAM role to the task/pod. For local development, set AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY or use aws configure.
FileNotFoundError: /tmp/model_cache/... — The local cache directory doesn’t exist or the container’s filesystem is read-only. Make sure MODEL_CACHE_DIR points to a writable volume. On Kubernetes, mount an emptyDir volume at that path.
Checksum mismatch after download — Usually means a partial download or a corrupted file in S3. Re-upload the artifact with put() to regenerate the checksum. You can also add retry logic around the download_file call — boto3 retries on transient errors by default, but network interruptions mid-transfer won’t be caught.
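One way to add that retry logic is a generic backoff wrapper (the helper name is illustrative, not from the original code) around the whole download-plus-verify step, so a partial transfer that fails the checksum gets re-pulled rather than surfaced:

```python
import time


def with_retries(fn, attempts=3, base_delay=1.0):
    """Call fn(), retrying with exponential backoff on any exception."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries, surface the original error
            time.sleep(base_delay * 2 ** attempt)


# Usage: retry the full get(), not just the raw transfer.
# path = with_retries(lambda: cache.get("llama", "v1"))
```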
Cache fills up instantly with large models — Set MODEL_CACHE_MAX_BYTES to something reasonable for your disk. If you’re serving a single 7B model (~14 GB in fp16), you need at least 15-16 GB of cache space. The LRU eviction only helps when you have multiple model versions rotating through.
S3 SlowDown errors under heavy autoscaling — When 50 pods all start at once, they all hit S3 simultaneously. Add jitter to your startup: time.sleep(random.uniform(0, 5)) before the first cache.get() call. Or better yet, pre-warm your local cache by baking the most common model into your container image.
tarfile.ReadError: not a gzip file — The artifact in your cache isn’t actually a gzipped tarball. This happens when someone uploads a raw model file instead of a tarball. Either enforce the tar.gz format in put() with a check, or adapt load_model_from_artifact to handle both formats.
Related Guides
- How to Build a Model Registry with MLflow and PostgreSQL
- How to Build a Model Artifact CDN with CloudFront and S3
- How to Build a Model Artifact Signing and Verification Pipeline
- How to Build a Model Serving Pipeline with Docker Compose and Traefik
- How to Build a Model Registry with S3 and DynamoDB
- How to Build a Model Training Pipeline with Lightning Fabric
- How to Build a Model Serving Cost Dashboard with Prometheus and Grafana
- How to Build a Model Artifact Pipeline with ORAS and Container Registries
- How to Build a Model Inference Cost Tracking Pipeline with OpenTelemetry
- How to Build a Model Artifact Garbage Collection Pipeline with S3 Lifecycle Rules