Every training run dumps checkpoints. Every experiment fork leaves behind artifacts nobody will ever load again. After six months your S3 bill looks like a landfill invoice. The fix isn’t “be more disciplined about deleting things.” The fix is automated garbage collection.
Here’s a basic lifecycle rule that expires objects older than 90 days in your artifacts prefix:
```python
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3", region_name="us-east-1")
BUCKET = "ml-model-artifacts"

lifecycle_config = {
    "Rules": [
        {
            "ID": "expire-old-checkpoints",
            "Filter": {"Prefix": "checkpoints/"},
            "Status": "Enabled",
            "Expiration": {"Days": 90},
        },
        {
            "ID": "expire-failed-experiments",
            "Filter": {"Prefix": "experiments/failed/"},
            "Status": "Enabled",
            "Expiration": {"Days": 14},
        },
    ]
}

try:
    s3.put_bucket_lifecycle_configuration(
        Bucket=BUCKET,
        LifecycleConfiguration=lifecycle_config,
    )
    print(f"Lifecycle rules applied to {BUCKET}")
except ClientError as e:
    print(f"Failed to set lifecycle config: {e.response['Error']['Message']}")
```
That handles the obvious cases: stale checkpoints and dead experiments. But lifecycle rules are blunt instruments. They don’t know which artifacts are still referenced by your model registry. For that, you need something smarter.
## Registry-Aware Garbage Collection
Lifecycle rules delete by age. A proper GC deletes by reachability. The idea is simple: scan your model registry for every artifact path that’s still in use, then mark everything else in the bucket as garbage.
This assumes you have a registry (a DynamoDB table, a database, even a JSON manifest) that maps model versions to S3 paths. The GC script walks the bucket, checks each key against the registry, and flags unreferenced objects.
```python
import boto3
from datetime import datetime, timezone, timedelta

s3 = boto3.client("s3", region_name="us-east-1")
dynamodb = boto3.resource("dynamodb", region_name="us-east-1")

BUCKET = "ml-model-artifacts"
REGISTRY_TABLE = "model-registry"
ARTIFACT_PREFIX = "models/"
MIN_AGE_DAYS = 7  # never GC anything younger than 7 days


def get_registered_paths() -> set:
    """Pull all active artifact paths from the model registry."""
    table = dynamodb.Table(REGISTRY_TABLE)
    registered = set()
    scan_kwargs = {"ProjectionExpression": "s3_path"}
    while True:
        response = table.scan(**scan_kwargs)
        for item in response.get("Items", []):
            path = item.get("s3_path", "")
            # Normalize: strip the s3://bucket/ prefix to get the key
            if path.startswith(f"s3://{BUCKET}/"):
                registered.add(path[len(f"s3://{BUCKET}/"):])
        if "LastEvaluatedKey" not in response:
            break
        scan_kwargs["ExclusiveStartKey"] = response["LastEvaluatedKey"]
    return registered


def find_garbage(dry_run: bool = True) -> list:
    """Scan the bucket and find unreferenced artifacts."""
    registered = get_registered_paths()
    print(f"Found {len(registered)} registered artifact paths in registry")
    cutoff = datetime.now(timezone.utc) - timedelta(days=MIN_AGE_DAYS)
    garbage = []
    total_size = 0
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=ARTIFACT_PREFIX):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            last_modified = obj["LastModified"]
            size = obj["Size"]
            # Skip recent objects -- give people time to register them
            if last_modified > cutoff:
                continue
            # Check if this key (or its parent directory) is registered
            is_referenced = any(
                key == reg_path or key.startswith(reg_path)
                for reg_path in registered
            )
            if not is_referenced:
                garbage.append({"Key": key, "Size": size, "LastModified": last_modified})
                total_size += size

    size_gb = total_size / (1024 ** 3)
    print(f"Found {len(garbage)} unreferenced objects ({size_gb:.2f} GB)")
    if dry_run:
        print("DRY RUN -- no objects deleted")
        for item in garbage[:20]:  # show first 20
            print(f"  WOULD DELETE: {item['Key']} ({item['Size'] / 1024 / 1024:.1f} MB)")
        if len(garbage) > 20:
            print(f"  ... and {len(garbage) - 20} more")
    else:
        delete_garbage(garbage)
    return garbage


def delete_garbage(garbage: list):
    """Batch-delete garbage objects. S3 allows up to 1,000 keys per delete request."""
    deleted = 0
    for i in range(0, len(garbage), 1000):
        batch = garbage[i : i + 1000]
        delete_request = {"Objects": [{"Key": obj["Key"]} for obj in batch]}
        response = s3.delete_objects(Bucket=BUCKET, Delete=delete_request)
        errors = response.get("Errors", [])
        for err in errors:
            print(f"  ERROR deleting {err['Key']}: {err['Message']}")
        deleted += len(batch) - len(errors)
    print(f"Deleted {deleted} objects")


# Run in dry-run mode first -- always
find_garbage(dry_run=True)
```
Always run with `dry_run=True` first. Always. The first time I skipped the dry run, I deleted a staging model that was about to go to production. The registry entry had a trailing-slash mismatch. Lesson learned.
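One cheap guard against that class of bug is normalizing both sides before comparing. A minimal sketch — the helper name and normalization rules are mine, not part of the GC script above:

```python
def normalize_key(path: str, bucket: str) -> str:
    """Normalize a registry path for comparison against S3 keys:
    strip the s3://bucket/ scheme prefix and any trailing slash."""
    prefix = f"s3://{bucket}/"
    if path.startswith(prefix):
        path = path[len(prefix):]
    return path.rstrip("/")

# A trailing-slash mismatch no longer breaks the comparison:
print(normalize_key("s3://ml-model-artifacts/models/v3/", "ml-model-artifacts"))
# -> models/v3
```

Apply it to both the registry paths and the bucket keys before the `startswith` check, so `models/v3` and `models/v3/` compare equal.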
## Multi-Tier Storage Strategy
Deleting old artifacts is one strategy. A better one moves them through storage tiers first. S3 pricing drops dramatically as you go from Standard to Intelligent-Tiering to Glacier.
The pattern works like this:
- Hot (0-30 days): S3 Standard. Active experiments, recent checkpoints. Fast access.
- Warm (30-90 days): S3 Standard-IA. Old experiments you might revisit. Cheaper storage, plus a per-GB retrieval fee.
- Cold (90-365 days): S3 Glacier Instant Retrieval. Archived models. Very cheap, millisecond retrieval when needed.
- Delete (365+ days): Gone. If nobody’s touched it in a year, it’s dead weight.
Here’s how to set that up as lifecycle rules. One caveat: `put_bucket_lifecycle_configuration` replaces the bucket’s entire existing configuration, so include every rule you want to keep:
```python
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3", region_name="us-east-1")
BUCKET = "ml-model-artifacts"

tiered_lifecycle = {
    "Rules": [
        {
            "ID": "tiered-storage-models",
            "Filter": {"Prefix": "models/"},
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER_IR"},
            ],
            "Expiration": {"Days": 365},
        },
        {
            "ID": "tiered-storage-checkpoints",
            "Filter": {"Prefix": "checkpoints/"},
            "Status": "Enabled",
            "Transitions": [
                {"Days": 14, "StorageClass": "STANDARD_IA"},
                {"Days": 60, "StorageClass": "GLACIER_IR"},
            ],
            "Expiration": {"Days": 180},
        },
    ]
}

try:
    s3.put_bucket_lifecycle_configuration(
        Bucket=BUCKET,
        LifecycleConfiguration=tiered_lifecycle,
    )
    print("Tiered storage lifecycle rules applied")

    # Verify what we just set
    result = s3.get_bucket_lifecycle_configuration(Bucket=BUCKET)
    for rule in result["Rules"]:
        print(f"\nRule: {rule['ID']} (Status: {rule['Status']})")
        for t in rule.get("Transitions", []):
            print(f"  After {t['Days']} days -> {t['StorageClass']}")
        if "Expiration" in rule:
            print(f"  Expire after {rule['Expiration']['Days']} days")
except ClientError as e:
    print(f"Error: {e.response['Error']['Message']}")
```
Checkpoints get shorter timelines because they’re inherently disposable: you only need the last few during training, and the final model artifact is what actually matters. Production model files get a longer runway since someone might need to roll back.
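If age-based rules are too coarse, you can also prune during training and keep only the last N checkpoints per run. A sketch, assuming checkpoint keys sort lexicographically by step (i.e. zero-padded step numbers in the filename — an assumption about your naming scheme):

```python
def checkpoints_to_prune(keys: list, keep_last: int = 3) -> list:
    """Return checkpoint keys eligible for deletion, keeping the most
    recent keep_last. Assumes zero-padded, lexicographically sortable
    key names like checkpoints/run1/step-000100.pt."""
    ordered = sorted(keys)
    return ordered[:-keep_last] if len(ordered) > keep_last else []

keys = [
    "checkpoints/run1/step-000300.pt",
    "checkpoints/run1/step-000100.pt",
    "checkpoints/run1/step-000200.pt",
    "checkpoints/run1/step-000400.pt",
]
print(checkpoints_to_prune(keys, keep_last=3))
# -> ['checkpoints/run1/step-000100.pt']
```

The returned keys can be fed straight into the batch-delete helper from the GC script; the lifecycle rules then act as a backstop for anything the in-training pruning misses.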
## Tracking Storage Costs and Savings
Running GC without tracking savings is like dieting without a scale. You need numbers to justify the engineering time and to know if your rules are aggressive enough.
```python
import boto3
from collections import defaultdict

s3 = boto3.client("s3", region_name="us-east-1")
BUCKET = "ml-model-artifacts"

# S3 pricing per GB/month (us-east-1, approximate)
PRICE_PER_GB = {
    "STANDARD": 0.023,
    "STANDARD_IA": 0.0125,
    "GLACIER_IR": 0.004,
    "GLACIER": 0.004,
    "DEEP_ARCHIVE": 0.00099,
}


def analyze_storage_breakdown():
    """Break down bucket usage by top-level prefix and estimate monthly cost."""
    paginator = s3.get_paginator("list_objects_v2")
    prefix_stats = defaultdict(lambda: {"count": 0, "size": 0})
    for page in paginator.paginate(Bucket=BUCKET):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            size = obj["Size"]
            top_prefix = key.split("/")[0] if "/" in key else "root"
            prefix_stats[top_prefix]["count"] += 1
            prefix_stats[top_prefix]["size"] += size

    total_cost = 0.0
    print(f"{'Prefix':<25} {'Objects':>10} {'Size (GB)':>12} {'Est. $/mo':>12}")
    print("-" * 62)
    for prefix, stats in sorted(prefix_stats.items(), key=lambda x: -x[1]["size"]):
        size_gb = stats["size"] / (1024 ** 3)
        # Assume Standard pricing for this scan (lifecycle handles actual tiers)
        cost = size_gb * PRICE_PER_GB["STANDARD"]
        total_cost += cost
        print(f"{prefix:<25} {stats['count']:>10,} {size_gb:>12.2f} {cost:>12.2f}")
    print("-" * 62)
    total_gb = sum(s["size"] for s in prefix_stats.values()) / (1024 ** 3)
    print(f"{'TOTAL':<25} {'':>10} {total_gb:>12.2f} {total_cost:>12.2f}")
    return prefix_stats


def estimate_savings_from_tiering(prefix_stats: dict) -> float:
    """Estimate how much tiered storage saves vs. all-Standard."""
    # Rough assumption: 20% hot, 30% warm, 40% cold, 10% deletable
    total_bytes = sum(s["size"] for s in prefix_stats.values())
    total_gb = total_bytes / (1024 ** 3)
    current_cost = total_gb * PRICE_PER_GB["STANDARD"]
    tiered_cost = (
        total_gb * 0.2 * PRICE_PER_GB["STANDARD"]
        + total_gb * 0.3 * PRICE_PER_GB["STANDARD_IA"]
        + total_gb * 0.4 * PRICE_PER_GB["GLACIER_IR"]
        # 10% deleted, cost = 0
    )
    savings = current_cost - tiered_cost
    print(f"\nAll-Standard cost: ${current_cost:.2f}/month")
    print(f"With tiering:      ${tiered_cost:.2f}/month")
    print(f"Estimated savings: ${savings:.2f}/month ({savings / current_cost * 100:.0f}%)")
    return savings


stats = analyze_storage_breakdown()
estimate_savings_from_tiering(stats)
```
For a team generating 500 GB of artifacts per month, tiered storage with GC typically saves 60-70% on storage costs. That’s real money: on the order of $50-80/month at moderate scale, and much more at larger volumes.
## Common Errors and Fixes
`NoSuchLifecycleConfiguration` when reading rules: The bucket has no lifecycle config yet. This is normal on a fresh bucket. Just call `put_bucket_lifecycle_configuration` first.
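If your tooling reads rules before writing them, it helps to distinguish "no config yet" from a real failure. A minimal sketch that inspects the `ClientError` response dict — the helper is mine; the error-code string is what botocore surfaces in `e.response`:

```python
def is_missing_lifecycle(error_response: dict) -> bool:
    """True when an S3 error response just means the bucket has no
    lifecycle configuration yet -- safe to treat as zero rules."""
    return error_response.get("Error", {}).get("Code") == "NoSuchLifecycleConfiguration"

# Shape of e.response from a botocore ClientError on a fresh bucket:
fresh_bucket_error = {"Error": {"Code": "NoSuchLifecycleConfiguration", "Message": "..."}}
print(is_missing_lifecycle(fresh_bucket_error))
# -> True
```

In the `except ClientError as e:` branch, treat `is_missing_lifecycle(e.response)` as "start with an empty rule list" and re-raise anything else.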
`MalformedXML` on lifecycle put: Your rule is missing a required field. Every rule needs an `ID`, a `Filter` (even if it’s `{"Prefix": ""}`), and a `Status`. Transitions must be in ascending order of `Days`.
Lifecycle rule not firing: Rules run once per day, not in real-time. Objects won’t transition or expire until S3’s background process picks them up. Expect up to 48 hours for transitions to complete on large buckets.
`AccessDenied` on `delete_objects`: Your IAM role needs `s3:DeleteObject` permission on the bucket. Lifecycle-based expiration uses S3’s internal permissions, but programmatic deletes go through your IAM policy.
Objects reappearing after deletion: If versioning is enabled, `delete_objects` adds a delete marker but doesn’t remove previous versions. Add `NoncurrentVersionExpiration` to your lifecycle rules:
```python
{
    "ID": "clean-old-versions",
    "Filter": {"Prefix": ""},
    "Status": "Enabled",
    "NoncurrentVersionExpiration": {"NoncurrentDays": 7},
}
```
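Cleaning up existing noncurrent versions programmatically means paging through `list_object_versions` and deleting each (Key, VersionId) pair — passing the `VersionId` is what removes the version itself rather than adding another delete marker. A sketch of the batching half, which is pure bookkeeping (the helper name is mine; the 1,000-entry cap per `delete_objects` request is S3’s documented limit):

```python
def build_version_delete_batches(versions: list, batch_size: int = 1000) -> list:
    """Turn list_object_versions records into delete_objects batches.
    Each entry carries the VersionId so the version is actually removed,
    not just hidden behind a delete marker."""
    entries = [{"Key": v["Key"], "VersionId": v["VersionId"]} for v in versions]
    return [entries[i : i + batch_size] for i in range(0, len(entries), batch_size)]

# 2,500 versions split into S3-sized delete batches:
versions = [{"Key": f"models/v1/part-{i}", "VersionId": f"ver{i}"} for i in range(2500)]
batches = build_version_delete_batches(versions)
print([len(b) for b in batches])
# -> [1000, 1000, 500]
```

Each batch then goes to `s3.delete_objects(Bucket=BUCKET, Delete={"Objects": batch})`, the same call the GC script already uses.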
DynamoDB scan timeout on large registries: The `get_registered_paths` function paginates, but very large tables can be slow. Add a `FilterExpression` to skip archived or deleted entries, or maintain a separate index of active S3 paths.
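A sketch of scan kwargs with such a server-side filter, assuming the registry items carry a status attribute — that field name and its "active" value are assumptions about your table, not part of the script above. Since `status` is a DynamoDB reserved word, it has to go through an `ExpressionAttributeNames` alias:

```python
def build_registry_scan_kwargs(status_field: str = "status",
                               active_value: str = "active") -> dict:
    """Scan kwargs that project only s3_path and drop archived/deleted
    entries server-side. Note: a FilterExpression still consumes read
    capacity for every item scanned; it only trims what gets returned."""
    return {
        "ProjectionExpression": "s3_path",
        "FilterExpression": "#st = :active",
        "ExpressionAttributeNames": {"#st": status_field},
        "ExpressionAttributeValues": {":active": active_value},
    }

print(build_registry_scan_kwargs()["FilterExpression"])
# -> #st = :active
```

Drop this in as the initial `scan_kwargs` in `get_registered_paths`; the pagination loop stays the same. If the table is large enough that even a filtered scan hurts, a GSI keyed on status is the real fix.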