Every training run dumps checkpoints. Every experiment fork leaves behind artifacts nobody will ever load again. After six months your S3 bill looks like a landfill invoice. The fix isn’t “be more disciplined about deleting things.” The fix is automated garbage collection.

Here’s a basic lifecycle configuration with two rules: expire checkpoints after 90 days, and expire failed-experiment artifacts after 14:

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3", region_name="us-east-1")

BUCKET = "ml-model-artifacts"

lifecycle_config = {
    "Rules": [
        {
            "ID": "expire-old-checkpoints",
            "Filter": {"Prefix": "checkpoints/"},
            "Status": "Enabled",
            "Expiration": {"Days": 90},
        },
        {
            "ID": "expire-failed-experiments",
            "Filter": {"Prefix": "experiments/failed/"},
            "Status": "Enabled",
            "Expiration": {"Days": 14},
        },
    ]
}

try:
    s3.put_bucket_lifecycle_configuration(
        Bucket=BUCKET,
        LifecycleConfiguration=lifecycle_config,
    )
    print(f"Lifecycle rules applied to {BUCKET}")
except ClientError as e:
    print(f"Failed to set lifecycle config: {e.response['Error']['Message']}")

That handles the obvious stuff – stale checkpoints and dead experiments. One caveat: put_bucket_lifecycle_configuration replaces the bucket’s entire lifecycle configuration, so always include every rule in a single call. And lifecycle rules are blunt instruments. They don’t know which artifacts are still referenced by your model registry. For that, you need something smarter.

Registry-Aware Garbage Collection

Lifecycle rules delete by age. A proper GC deletes by reachability. The idea is simple: scan your model registry for every artifact path that’s still in use, then mark everything else in the bucket as garbage.

This assumes you have a registry (a DynamoDB table, a database, even a JSON manifest) that maps model versions to S3 paths. The GC script walks the bucket, checks each key against the registry, and flags unreferenced objects.
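
For concreteness, a registry item only needs to expose an s3_path the GC can read. Everything else below (table keys, field names like status) is illustrative, not something the script requires:

```python
# Illustrative DynamoDB registry item -- only `s3_path` is actually read
# by the GC script. The key schema (model_id/version) and the extra
# fields are assumptions, not prescribed by anything in this post.
item = {
    "model_id": "sentiment-classifier",
    "version": "v3",
    "s3_path": "s3://ml-model-artifacts/models/sentiment-classifier/v3/",
    "status": "active",
    "registered_at": "2024-05-01T12:00:00Z",
}

# Registering a new version would then be a single write, e.g.:
#   dynamodb.Table("model-registry").put_item(Item=item)
```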

import boto3
from datetime import datetime, timezone, timedelta

s3 = boto3.client("s3", region_name="us-east-1")
dynamodb = boto3.resource("dynamodb", region_name="us-east-1")

BUCKET = "ml-model-artifacts"
REGISTRY_TABLE = "model-registry"
ARTIFACT_PREFIX = "models/"
MIN_AGE_DAYS = 7  # never GC anything younger than 7 days


def get_registered_paths() -> set:
    """Pull all active artifact paths from the model registry."""
    table = dynamodb.Table(REGISTRY_TABLE)
    registered = set()
    scan_kwargs = {"ProjectionExpression": "s3_path"}
    while True:
        response = table.scan(**scan_kwargs)
        for item in response.get("Items", []):
            path = item.get("s3_path", "")
            # Normalize: strip the s3://bucket/ prefix to get the key
            if path.startswith(f"s3://{BUCKET}/"):
                registered.add(path[len(f"s3://{BUCKET}/"):])
        if "LastEvaluatedKey" not in response:
            break
        scan_kwargs["ExclusiveStartKey"] = response["LastEvaluatedKey"]
    return registered


def find_garbage(dry_run: bool = True) -> list:
    """Scan the bucket and find unreferenced artifacts."""
    registered = get_registered_paths()
    print(f"Found {len(registered)} registered artifact paths in registry")

    cutoff = datetime.now(timezone.utc) - timedelta(days=MIN_AGE_DAYS)
    garbage = []
    total_size = 0
    paginator = s3.get_paginator("list_objects_v2")

    for page in paginator.paginate(Bucket=BUCKET, Prefix=ARTIFACT_PREFIX):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            last_modified = obj["LastModified"]
            size = obj["Size"]

            # Skip recent objects -- give people time to register them
            if last_modified > cutoff:
                continue

            # Check if this key matches a registered path exactly, or lives
            # under a registered prefix. The boundary check avoids
            # "models/v1" accidentally matching "models/v10/...".
            is_referenced = any(
                key == reg_path or key.startswith(reg_path.rstrip("/") + "/")
                for reg_path in registered
            )

            if not is_referenced:
                garbage.append({"Key": key, "Size": size, "LastModified": last_modified})
                total_size += size

    size_gb = total_size / (1024 ** 3)
    print(f"Found {len(garbage)} unreferenced objects ({size_gb:.2f} GB)")

    if dry_run:
        print("DRY RUN -- no objects deleted")
        for item in garbage[:20]:  # show first 20
            print(f"  WOULD DELETE: {item['Key']} ({item['Size'] / 1024 / 1024:.1f} MB)")
        if len(garbage) > 20:
            print(f"  ... and {len(garbage) - 20} more")
    else:
        delete_garbage(garbage)

    return garbage


def delete_garbage(garbage: list):
    """Batch-delete garbage objects. S3 delete supports 1000 keys per request."""
    deleted = 0
    for i in range(0, len(garbage), 1000):
        batch = garbage[i : i + 1000]
        delete_request = {"Objects": [{"Key": obj["Key"]} for obj in batch]}
        response = s3.delete_objects(Bucket=BUCKET, Delete=delete_request)
        errors = response.get("Errors", [])
        if errors:
            for err in errors:
                print(f"  ERROR deleting {err['Key']}: {err['Message']}")
        deleted += len(batch) - len(errors)
    print(f"Deleted {deleted} objects")


# Run in dry-run mode first -- always
find_garbage(dry_run=True)

Always run dry_run=True first. Always. The first time I skipped dry-run I deleted a staging model that was about to go to production. The registry entry had a trailing slash mismatch. Lesson learned.
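
One cheap guard against that class of bug is normalizing every path the same way on both sides of the comparison. A minimal sketch:

```python
def normalize_key(path: str, bucket: str = "ml-model-artifacts") -> str:
    """Strip the s3://bucket/ scheme and any trailing slash so that
    registry entries and listed S3 keys compare consistently."""
    prefix = f"s3://{bucket}/"
    if path.startswith(prefix):
        path = path[len(prefix):]
    return path.rstrip("/")

# "s3://ml-model-artifacts/models/v3/" and "models/v3" now compare equal
```

Run it over both the registry paths and the listed keys before membership checks, and the trailing-slash mismatch disappears as a failure mode.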

Multi-Tier Storage Strategy

Deleting old artifacts is one strategy. A better one moves them through storage tiers first. S3 pricing drops dramatically as you go from Standard to Standard-IA to Glacier.

The pattern works like this:

  • Hot (0-30 days): S3 Standard. Active experiments, recent checkpoints. Fast access.
  • Warm (30-90 days): S3 Standard-IA. Old experiments you might revisit. Cheaper storage, per-request retrieval fee.
  • Cold (90-365 days): S3 Glacier Instant Retrieval. Archived models. Very cheap, millisecond retrieval when needed.
  • Delete (365+ days): Gone. If nobody’s touched it in a year, it’s dead weight.
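
Back-of-envelope, using approximate us-east-1 storage prices (check the current pricing page), the per-tier difference for a terabyte looks like this:

```python
# Approximate us-east-1 storage prices per GB-month; retrieval and
# request fees are ignored in this rough comparison.
prices = {
    "STANDARD": 0.023,
    "STANDARD_IA": 0.0125,
    "GLACIER_IR": 0.004,
}

tb = 1024  # GB in a TB
for tier, price in prices.items():
    print(f"1 TB in {tier:<12}: ${tb * price:.2f}/month")
```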

Here’s how to set that up as lifecycle rules:

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3", region_name="us-east-1")

BUCKET = "ml-model-artifacts"

tiered_lifecycle = {
    "Rules": [
        {
            "ID": "tiered-storage-models",
            "Filter": {"Prefix": "models/"},
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER_IR"},
            ],
            "Expiration": {"Days": 365},
        },
        {
            "ID": "tiered-storage-checkpoints",
            "Filter": {"Prefix": "checkpoints/"},
            "Status": "Enabled",
            "Transitions": [
                {"Days": 14, "StorageClass": "STANDARD_IA"},
                {"Days": 60, "StorageClass": "GLACIER_IR"},
            ],
            "Expiration": {"Days": 180},
        },
    ]
}

try:
    s3.put_bucket_lifecycle_configuration(
        Bucket=BUCKET,
        LifecycleConfiguration=tiered_lifecycle,
    )
    print("Tiered storage lifecycle rules applied")

    # Verify what we just set
    result = s3.get_bucket_lifecycle_configuration(Bucket=BUCKET)
    for rule in result["Rules"]:
        print(f"\nRule: {rule['ID']} (Status: {rule['Status']})")
        for t in rule.get("Transitions", []):
            print(f"  After {t['Days']} days -> {t['StorageClass']}")
        if "Expiration" in rule:
            print(f"  Expire after {rule['Expiration']['Days']} days")
except ClientError as e:
    print(f"Error: {e.response['Error']['Message']}")

Checkpoints get shorter timelines because they’re inherently disposable – you only need the last few during training, and the final model artifact is what actually matters. Production model files get a longer runway since someone might need to roll back.
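
The “last few checkpoints” idea can also be enforced directly, independent of age. A sketch, assuming checkpoint filenames carry a zero-padded step number so lexicographic order matches training order:

```python
def keep_last_n(checkpoint_keys: list[str], n: int = 3) -> list[str]:
    """Given checkpoint keys like 'checkpoints/run-7/step-0100.pt',
    return the keys to delete, keeping only the n most recent.
    Relies on zero-padded step numbers sorting in training order."""
    ordered = sorted(checkpoint_keys)
    return ordered[:-n] if len(ordered) > n else []
```

Feed the result into the same batch-delete helper used by the GC; just run it per training run, not across the whole prefix.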

Tracking Storage Costs and Savings

Running GC without tracking savings is like dieting without a scale. You need numbers to justify the engineering time and to know if your rules are aggressive enough.

import boto3
from collections import defaultdict

s3 = boto3.client("s3", region_name="us-east-1")

BUCKET = "ml-model-artifacts"

# S3 pricing per GB/month (us-east-1, approximate)
PRICE_PER_GB = {
    "STANDARD": 0.023,
    "STANDARD_IA": 0.0125,
    "GLACIER_IR": 0.004,
    "GLACIER": 0.004,
    "DEEP_ARCHIVE": 0.00099,
}


def analyze_storage_breakdown():
    """Break down bucket usage by prefix and estimate monthly cost."""
    paginator = s3.get_paginator("list_objects_v2")
    prefix_stats = defaultdict(lambda: {"count": 0, "size": 0})

    for page in paginator.paginate(Bucket=BUCKET):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            size = obj["Size"]
            top_prefix = key.split("/")[0] if "/" in key else "root"
            prefix_stats[top_prefix]["count"] += 1
            prefix_stats[top_prefix]["size"] += size

    total_cost = 0.0
    print(f"{'Prefix':<25} {'Objects':>10} {'Size (GB)':>12} {'Est. $/mo':>12}")
    print("-" * 62)
    for prefix, stats in sorted(prefix_stats.items(), key=lambda x: -x[1]["size"]):
        size_gb = stats["size"] / (1024 ** 3)
        # Assume Standard pricing for this scan (lifecycle handles actual tiers)
        cost = size_gb * PRICE_PER_GB["STANDARD"]
        total_cost += cost
        print(f"{prefix:<25} {stats['count']:>10,} {size_gb:>12.2f} {cost:>12.2f}")

    print("-" * 62)
    total_gb = sum(s["size"] for s in prefix_stats.values()) / (1024 ** 3)
    print(f"{'TOTAL':<25} {'':>10} {total_gb:>12.2f} {total_cost:>12.2f}")
    return prefix_stats


def estimate_savings_from_tiering(prefix_stats: dict) -> float:
    """Estimate how much tiered storage saves vs all-Standard."""
    # Rough assumption: 20% hot, 30% warm, 40% cold, 10% deletable
    total_bytes = sum(s["size"] for s in prefix_stats.values())
    total_gb = total_bytes / (1024 ** 3)

    current_cost = total_gb * PRICE_PER_GB["STANDARD"]
    tiered_cost = (
        total_gb * 0.2 * PRICE_PER_GB["STANDARD"]
        + total_gb * 0.3 * PRICE_PER_GB["STANDARD_IA"]
        + total_gb * 0.4 * PRICE_PER_GB["GLACIER_IR"]
        # 10% deleted, cost = 0
    )
    savings = current_cost - tiered_cost
    print(f"\nAll-Standard cost:  ${current_cost:.2f}/month")
    print(f"With tiering:       ${tiered_cost:.2f}/month")
    print(f"Estimated savings:  ${savings:.2f}/month ({savings / current_cost * 100:.0f}%)")
    return savings


stats = analyze_storage_breakdown()
estimate_savings_from_tiering(stats)

For a team generating 500 GB of new artifacts per month – a few terabytes accumulated over a year – tiered storage with GC typically saves 60-70% on storage costs. That’s real money: on the order of $50-80/month at that scale, and much more at larger volumes.

Common Errors and Fixes

NoSuchLifecycleConfiguration when reading rules: The bucket has no lifecycle config yet. This is normal on a fresh bucket. Just call put_bucket_lifecycle_configuration first.

MalformedXML on lifecycle put: Your rule is missing a required field. Every rule needs an ID, a Filter (even if it’s {"Prefix": ""}), and a Status. Transitions must be in ascending order of Days.
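
A small local sanity check can catch most of these before the API round-trip. This validator is a sketch, not an exhaustive schema check:

```python
def validate_rule(rule: dict) -> list[str]:
    """Return a list of problems with a single lifecycle rule dict.
    Checks only the common MalformedXML causes described above."""
    problems = []
    for field in ("ID", "Filter", "Status"):
        if field not in rule:
            problems.append(f"missing required field: {field}")
    days = [t["Days"] for t in rule.get("Transitions", [])]
    if days != sorted(days):
        problems.append("Transitions must be in ascending order of Days")
    return problems
```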

Lifecycle rule not firing: Rules run once per day, not in real-time. Objects won’t transition or expire until S3’s background process picks them up. Expect up to 48 hours for transitions to complete on large buckets.

AccessDenied on delete_objects: Your IAM role needs s3:DeleteObject permission on the bucket. Lifecycle-based expiration uses S3’s internal permissions, but programmatic deletes go through your IAM policy.
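
A minimal policy covering what the GC script does – listing plus deletes, with the bucket name assumed – might look like this:

```python
import json

# Sketch of an IAM policy for the GC script: ListBucket on the bucket
# itself (for list_objects_v2), DeleteObject on its contents.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": "arn:aws:s3:::ml-model-artifacts",
        },
        {
            "Effect": "Allow",
            "Action": ["s3:DeleteObject"],
            "Resource": "arn:aws:s3:::ml-model-artifacts/*",
        },
    ],
}
print(json.dumps(policy, indent=2))
```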

Objects reappearing after deletion: If versioning is enabled, delete_objects adds a delete marker but doesn’t remove previous versions. Add NoncurrentVersionExpiration to your lifecycle rules:

{
    "ID": "clean-old-versions",
    "Filter": {"Prefix": ""},
    "Status": "Enabled",
    "NoncurrentVersionExpiration": {"NoncurrentDays": 7},
}

DynamoDB scan timeout on large registries: The get_registered_paths function paginates, but very large tables can be slow. Add a FilterExpression to skip archived or deleted entries, or maintain a separate index of active S3 paths.