Every training run dumps checkpoints. Every experiment fork leaves behind artifacts nobody will ever load again. After six months your S3 bill looks like a landfill invoice. The fix isn’t “be more disciplined about deleting things.” The fix is automated garbage collection.
Here’s a basic lifecycle rule that expires objects older than 90 days in your artifacts prefix:
```python
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3", region_name="us-east-1")
BUCKET = "ml-model-artifacts"

lifecycle_config = {
    "Rules": [
        {
            "ID": "expire-old-checkpoints",
            "Filter": {"Prefix": "checkpoints/"},
            "Status": "Enabled",
            "Expiration": {"Days": 90},
        },
        {
            "ID": "expire-failed-experiments",
            "Filter": {"Prefix": "experiments/failed/"},
            "Status": "Enabled",
            "Expiration": {"Days": 14},
        },
    ]
}

try:
    s3.put_bucket_lifecycle_configuration(
        Bucket=BUCKET,
        LifecycleConfiguration=lifecycle_config,
    )
    print(f"Lifecycle rules applied to {BUCKET}")
except ClientError as e:
    print(f"Failed to set lifecycle config: {e.response['Error']['Message']}")
```
That handles the obvious cases: stale checkpoints and dead experiments. But lifecycle rules are blunt instruments. They don’t know which artifacts are still referenced by your model registry. For that, you need something smarter.
## Registry-Aware Garbage Collection
Lifecycle rules delete by age. A proper GC deletes by reachability. The idea is simple: scan your model registry for every artifact path that’s still in use, then mark everything else in the bucket as garbage.
This assumes you have a registry (a DynamoDB table, a database, even a JSON manifest) that maps model versions to S3 paths. The GC script walks the bucket, checks each key against the registry, and flags unreferenced objects.
```python
import boto3
from datetime import datetime, timezone, timedelta

s3 = boto3.client("s3", region_name="us-east-1")
dynamodb = boto3.resource("dynamodb", region_name="us-east-1")

BUCKET = "ml-model-artifacts"
REGISTRY_TABLE = "model-registry"
ARTIFACT_PREFIX = "models/"
MIN_AGE_DAYS = 7  # never GC anything younger than 7 days


def get_registered_paths() -> set:
    """Pull all active artifact paths from the model registry."""
    table = dynamodb.Table(REGISTRY_TABLE)
    registered = set()
    scan_kwargs = {"ProjectionExpression": "s3_path"}
    while True:
        response = table.scan(**scan_kwargs)
        for item in response.get("Items", []):
            path = item.get("s3_path", "")
            # Normalize: strip the s3://bucket/ prefix to get the key
            if path.startswith(f"s3://{BUCKET}/"):
                registered.add(path[len(f"s3://{BUCKET}/"):])
        if "LastEvaluatedKey" not in response:
            break
        scan_kwargs["ExclusiveStartKey"] = response["LastEvaluatedKey"]
    return registered


def find_garbage(dry_run: bool = True) -> list:
    """Scan the bucket and find unreferenced artifacts."""
    registered = get_registered_paths()
    print(f"Found {len(registered)} registered artifact paths in registry")
    cutoff = datetime.now(timezone.utc) - timedelta(days=MIN_AGE_DAYS)
    garbage = []
    total_size = 0
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=ARTIFACT_PREFIX):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            last_modified = obj["LastModified"]
            size = obj["Size"]
            # Skip recent objects -- give people time to register them
            if last_modified > cutoff:
                continue
            # Check if this key (or its parent directory) is registered
            is_referenced = any(
                key == reg_path or key.startswith(reg_path)
                for reg_path in registered
            )
            if not is_referenced:
                garbage.append({"Key": key, "Size": size, "LastModified": last_modified})
                total_size += size

    size_gb = total_size / (1024 ** 3)
    print(f"Found {len(garbage)} unreferenced objects ({size_gb:.2f} GB)")
    if dry_run:
        print("DRY RUN -- no objects deleted")
        for item in garbage[:20]:  # show first 20
            print(f"  WOULD DELETE: {item['Key']} ({item['Size'] / 1024 / 1024:.1f} MB)")
        if len(garbage) > 20:
            print(f"  ... and {len(garbage) - 20} more")
    else:
        delete_garbage(garbage)
    return garbage


def delete_garbage(garbage: list):
    """Batch-delete garbage objects. S3 allows up to 1,000 keys per delete request."""
    deleted = 0
    for i in range(0, len(garbage), 1000):
        batch = garbage[i : i + 1000]
        delete_request = {"Objects": [{"Key": obj["Key"]} for obj in batch]}
        response = s3.delete_objects(Bucket=BUCKET, Delete=delete_request)
        errors = response.get("Errors", [])
        for err in errors:
            print(f"  ERROR deleting {err['Key']}: {err['Message']}")
        deleted += len(batch) - len(errors)
    print(f"Deleted {deleted} objects")


# Run in dry-run mode first -- always
find_garbage(dry_run=True)
```
Always run with `dry_run=True` first. Always. The first time I skipped the dry run, I deleted a staging model that was about to go to production. The registry entry had a trailing-slash mismatch. Lesson learned.
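One cheap guard against that class of bug is normalizing both sides before comparing. A minimal sketch — the helper name and normalization rules are mine, not part of the GC script above:

```python
def normalize_key(path: str, bucket: str) -> str:
    """Normalize a registry path for comparison against S3 keys:
    strip the s3://bucket/ scheme prefix and any trailing slash."""
    prefix = f"s3://{bucket}/"
    if path.startswith(prefix):
        path = path[len(prefix):]
    return path.rstrip("/")

# A trailing-slash mismatch no longer breaks the comparison:
print(normalize_key("s3://ml-model-artifacts/models/v3/", "ml-model-artifacts"))
# -> models/v3
```

Apply it to both the registry paths and the bucket keys before the `startswith` check, so `models/v3` and `models/v3/` compare equal.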
## Multi-Tier Storage Strategy
Deleting old artifacts is one strategy. A better one moves them through storage tiers first. S3 pricing drops dramatically as you go from Standard to Intelligent-Tiering to Glacier.
The pattern works like this:
- Hot (0-30 days): S3 Standard. Active experiments, recent checkpoints. Fast access.
- Warm (30-90 days): S3 Standard-IA. Old experiments you might revisit. Cheaper storage, plus a per-GB retrieval fee.
- Cold (90-365 days): S3 Glacier Instant Retrieval. Archived models. Very cheap, millisecond retrieval when needed.
- Delete (365+ days): Gone. If nobody’s touched it in a year, it’s dead weight.
Here’s how to set that up as lifecycle rules. One caveat: `put_bucket_lifecycle_configuration` replaces the bucket’s entire existing configuration, so include every rule you want to keep:
```python
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3", region_name="us-east-1")
BUCKET = "ml-model-artifacts"

tiered_lifecycle = {
    "Rules": [
        {
            "ID": "tiered-storage-models",
            "Filter": {"Prefix": "models/"},
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER_IR"},
            ],
            "Expiration": {"Days": 365},
        },
        {
            "ID": "tiered-storage-checkpoints",
            "Filter": {"Prefix": "checkpoints/"},
            "Status": "Enabled",
            "Transitions": [
                {"Days": 14, "StorageClass": "STANDARD_IA"},
                {"Days": 60, "StorageClass": "GLACIER_IR"},
            ],
            "Expiration": {"Days": 180},
        },
    ]
}

try:
    s3.put_bucket_lifecycle_configuration(
        Bucket=BUCKET,
        LifecycleConfiguration=tiered_lifecycle,
    )
    print("Tiered storage lifecycle rules applied")

    # Verify what we just set
    result = s3.get_bucket_lifecycle_configuration(Bucket=BUCKET)
    for rule in result["Rules"]:
        print(f"\nRule: {rule['ID']} (Status: {rule['Status']})")
        for t in rule.get("Transitions", []):
            print(f"  After {t['Days']} days -> {t['StorageClass']}")
        if "Expiration" in rule:
            print(f"  Expire after {rule['Expiration']['Days']} days")
except ClientError as e:
    print(f"Error: {e.response['Error']['Message']}")
```
Checkpoints get shorter timelines because they’re inherently disposable: you only need the last few during training, and the final model artifact is what actually matters. Production model files get a longer runway since someone might need to roll back.
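If age-based rules are too coarse, you can also prune during training and keep only the last N checkpoints per run. A sketch, assuming checkpoint keys sort lexicographically by step (i.e. zero-padded step numbers in the filename — an assumption about your naming scheme):

```python
def checkpoints_to_prune(keys: list, keep_last: int = 3) -> list:
    """Return checkpoint keys eligible for deletion, keeping the most
    recent keep_last. Assumes zero-padded, lexicographically sortable
    key names like checkpoints/run1/step-000100.pt."""
    ordered = sorted(keys)
    return ordered[:-keep_last] if len(ordered) > keep_last else []

keys = [
    "checkpoints/run1/step-000300.pt",
    "checkpoints/run1/step-000100.pt",
    "checkpoints/run1/step-000200.pt",
    "checkpoints/run1/step-000400.pt",
]
print(checkpoints_to_prune(keys, keep_last=3))
# -> ['checkpoints/run1/step-000100.pt']
```

The returned keys can be fed straight into the batch-delete helper from the GC script; the lifecycle rules then act as a backstop for anything the in-training pruning misses.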
## Tracking Storage Costs and Savings
Running GC without tracking savings is like dieting without a scale. You need numbers to justify the engineering time and to know if your rules are aggressive enough.
```python
import boto3
from collections import defaultdict

s3 = boto3.client("s3", region_name="us-east-1")
BUCKET = "ml-model-artifacts"

# S3 pricing per GB/month (us-east-1, approximate)
PRICE_PER_GB = {
    "STANDARD": 0.023,
    "STANDARD_IA": 0.0125,
    "GLACIER_IR": 0.004,
    "GLACIER": 0.004,
    "DEEP_ARCHIVE": 0.00099,
}


def analyze_storage_breakdown():
    """Break down bucket usage by top-level prefix and estimate monthly cost."""
    paginator = s3.get_paginator("list_objects_v2")
    prefix_stats = defaultdict(lambda: {"count": 0, "size": 0})
    for page in paginator.paginate(Bucket=BUCKET):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            size = obj["Size"]
            top_prefix = key.split("/")[0] if "/" in key else "root"
            prefix_stats[top_prefix]["count"] += 1
            prefix_stats[top_prefix]["size"] += size

    total_cost = 0.0
    print(f"{'Prefix':<25} {'Objects':>10} {'Size (GB)':>12} {'Est. $/mo':>12}")
    print("-" * 62)
    for prefix, stats in sorted(prefix_stats.items(), key=lambda x: -x[1]["size"]):
        size_gb = stats["size"] / (1024 ** 3)
        # Assume Standard pricing for this scan (lifecycle handles actual tiers)
        cost = size_gb * PRICE_PER_GB["STANDARD"]
        total_cost += cost
        print(f"{prefix:<25} {stats['count']:>10,} {size_gb:>12.2f} {cost:>12.2f}")
    print("-" * 62)
    total_gb = sum(s["size"] for s in prefix_stats.values()) / (1024 ** 3)
    print(f"{'TOTAL':<25} {'':>10} {total_gb:>12.2f} {total_cost:>12.2f}")
    return prefix_stats


def estimate_savings_from_tiering(prefix_stats: dict) -> float:
    """Estimate how much tiered storage saves vs. all-Standard."""
    # Rough assumption: 20% hot, 30% warm, 40% cold, 10% deletable
    total_bytes = sum(s["size"] for s in prefix_stats.values())
    total_gb = total_bytes / (1024 ** 3)
    current_cost = total_gb * PRICE_PER_GB["STANDARD"]
    tiered_cost = (
        total_gb * 0.2 * PRICE_PER_GB["STANDARD"]
        + total_gb * 0.3 * PRICE_PER_GB["STANDARD_IA"]
        + total_gb * 0.4 * PRICE_PER_GB["GLACIER_IR"]
        # 10% deleted, cost = 0
    )
    savings = current_cost - tiered_cost
    print(f"\nAll-Standard cost: ${current_cost:.2f}/month")
    print(f"With tiering:      ${tiered_cost:.2f}/month")
    print(f"Estimated savings: ${savings:.2f}/month ({savings / current_cost * 100:.0f}%)")
    return savings


stats = analyze_storage_breakdown()
estimate_savings_from_tiering(stats)
```
For a team generating 500 GB of artifacts per month, tiered storage with GC typically saves 60-70% on storage costs. That’s real money: on the order of $50-80/month at moderate scale, and much more at larger volumes.
## Common Errors and Fixes
`NoSuchLifecycleConfiguration` when reading rules: The bucket has no lifecycle config yet. This is normal on a fresh bucket. Just call `put_bucket_lifecycle_configuration` first.
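If your tooling reads rules before writing them, it helps to distinguish "no config yet" from a real failure. A minimal sketch that inspects the `ClientError` response dict — the helper is mine; the error-code string is what botocore surfaces in `e.response`:

```python
def is_missing_lifecycle(error_response: dict) -> bool:
    """True when an S3 error response just means the bucket has no
    lifecycle configuration yet -- safe to treat as zero rules."""
    return error_response.get("Error", {}).get("Code") == "NoSuchLifecycleConfiguration"

# Shape of e.response from a botocore ClientError on a fresh bucket:
fresh_bucket_error = {"Error": {"Code": "NoSuchLifecycleConfiguration", "Message": "..."}}
print(is_missing_lifecycle(fresh_bucket_error))
# -> True
```

In the `except ClientError as e:` branch, treat `is_missing_lifecycle(e.response)` as "start with an empty rule list" and re-raise anything else.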
`MalformedXML` on lifecycle put: Your rule is missing a required field. Every rule needs an `ID`, a `Filter` (even if it’s `{"Prefix": ""}`), and a `Status`. Transitions must be in ascending order of `Days`.
Lifecycle rule not firing: Rules run once per day, not in real-time. Objects won’t transition or expire until S3’s background process picks them up. Expect up to 48 hours for transitions to complete on large buckets.
`AccessDenied` on `delete_objects`: Your IAM role needs `s3:DeleteObject` permission on the bucket. Lifecycle-based expiration uses S3’s internal permissions, but programmatic deletes go through your IAM policy.
Objects reappearing after deletion: If versioning is enabled, `delete_objects` adds a delete marker but doesn’t remove previous versions. Add `NoncurrentVersionExpiration` to your lifecycle rules:
```python
{
    "ID": "clean-old-versions",
    "Filter": {"Prefix": ""},
    "Status": "Enabled",
    "NoncurrentVersionExpiration": {"NoncurrentDays": 7},
}
```
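Cleaning up existing noncurrent versions programmatically means paging through `list_object_versions` and deleting each (Key, VersionId) pair — passing the `VersionId` is what removes the version itself rather than adding another delete marker. A sketch of the batching half, which is pure bookkeeping (the helper name is mine; the 1,000-entry cap per `delete_objects` request is S3’s documented limit):

```python
def build_version_delete_batches(versions: list, batch_size: int = 1000) -> list:
    """Turn list_object_versions records into delete_objects batches.
    Each entry carries the VersionId so the version is actually removed,
    not just hidden behind a delete marker."""
    entries = [{"Key": v["Key"], "VersionId": v["VersionId"]} for v in versions]
    return [entries[i : i + batch_size] for i in range(0, len(entries), batch_size)]

# 2,500 versions split into S3-sized delete batches:
versions = [{"Key": f"models/v1/part-{i}", "VersionId": f"ver{i}"} for i in range(2500)]
batches = build_version_delete_batches(versions)
print([len(b) for b in batches])
# -> [1000, 1000, 500]
```

Each batch then goes to `s3.delete_objects(Bucket=BUCKET, Delete={"Objects": batch})`, the same call the GC script already uses.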
DynamoDB scan timeout on large registries: The `get_registered_paths` function paginates, but very large tables can be slow. Add a `FilterExpression` to skip archived or deleted entries, or maintain a separate index of active S3 paths.
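A sketch of scan kwargs with such a server-side filter, assuming the registry items carry a status attribute — that field name and its "active" value are assumptions about your table, not part of the script above. Since `status` is a DynamoDB reserved word, it has to go through an `ExpressionAttributeNames` alias:

```python
def build_registry_scan_kwargs(status_field: str = "status",
                               active_value: str = "active") -> dict:
    """Scan kwargs that project only s3_path and drop archived/deleted
    entries server-side. Note: a FilterExpression still consumes read
    capacity for every item scanned; it only trims what gets returned."""
    return {
        "ProjectionExpression": "s3_path",
        "FilterExpression": "#st = :active",
        "ExpressionAttributeNames": {"#st": status_field},
        "ExpressionAttributeValues": {":active": active_value},
    }

print(build_registry_scan_kwargs()["FilterExpression"])
# -> #st = :active
```

Drop this in as the initial `scan_kwargs` in `get_registered_paths`; the pagination loop stays the same. If the table is large enough that even a filtered scan hurts, a GSI keyed on status is the real fix.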