Spinning up 8xH100 instances for a week without checking the price first is how teams blow their entire ML budget in a single training run. A single AWS p5.48xlarge (8x H100) costs roughly $2,400/day on-demand. Multiply that by a two-week training run, and you're looking at over $33,000 before you've even tuned a hyperparameter. A quick cost estimate before you hit launch saves real money and prevents uncomfortable conversations with finance.

Here’s how to build a Python tool that estimates training costs across AWS, GCP, and Azure using real GPU pricing and standard training time approximations.

Define GPU Instance Pricing

Start with a data structure that holds instance specs and pricing for the most common GPU training instances. These are approximate on-demand hourly rates as of early 2026 — they shift regularly, but the ballpark is what matters for planning.

# gpu_pricing.py

from dataclasses import dataclass


@dataclass
class GPUInstance:
    provider: str
    instance_type: str
    gpu_model: str
    num_gpus: int
    gpu_memory_gb: int
    gpu_tflops_fp16: float  # per GPU, FP16/BF16 peak TFLOPS
    hourly_price_ondemand: float
    hourly_price_spot: float  # spot/preemptible estimate


GPU_INSTANCES = {
    # AWS
    "p5.48xlarge": GPUInstance(
        provider="AWS",
        instance_type="p5.48xlarge",
        gpu_model="H100 SXM",
        num_gpus=8,
        gpu_memory_gb=80,
        gpu_tflops_fp16=989.0,
        hourly_price_ondemand=98.32,
        hourly_price_spot=65.00,
    ),
    "p4d.24xlarge": GPUInstance(
        provider="AWS",
        instance_type="p4d.24xlarge",
        gpu_model="A100 40GB",
        num_gpus=8,
        gpu_memory_gb=40,
        gpu_tflops_fp16=312.0,
        hourly_price_ondemand=32.77,
        hourly_price_spot=14.50,
    ),
    # GCP
    "a3-highgpu-8g": GPUInstance(
        provider="GCP",
        instance_type="a3-highgpu-8g",
        gpu_model="H100 SXM",
        num_gpus=8,
        gpu_memory_gb=80,
        gpu_tflops_fp16=989.0,
        hourly_price_ondemand=101.22,
        hourly_price_spot=35.43,
    ),
    "a2-highgpu-8g": GPUInstance(
        provider="GCP",
        instance_type="a2-highgpu-8g",
        gpu_model="A100 40GB",
        num_gpus=8,
        gpu_memory_gb=40,
        gpu_tflops_fp16=312.0,
        hourly_price_ondemand=29.39,
        hourly_price_spot=8.82,
    ),
    # Azure
    "Standard_ND96isr_H100_v5": GPUInstance(
        provider="Azure",
        instance_type="Standard_ND96isr_H100_v5",
        gpu_model="H100 SXM",
        num_gpus=8,
        gpu_memory_gb=80,
        gpu_tflops_fp16=989.0,
        hourly_price_ondemand=96.36,
        hourly_price_spot=38.54,
    ),
    "Standard_ND96asr_A100_v4": GPUInstance(
        provider="Azure",
        instance_type="Standard_ND96asr_A100_v4",
        gpu_model="A100 80GB",
        num_gpus=8,
        gpu_memory_gb=80,
        gpu_tflops_fp16=312.0,
        hourly_price_ondemand=27.20,
        hourly_price_spot=10.88,
    ),
}

The gpu_tflops_fp16 field is the per-GPU peak FP16/BF16 throughput. H100 SXM peaks at about 989 TFLOPS for dense FP16/BF16 (the frequently quoted 1,979 TFLOPS figure assumes structured sparsity), and the A100 40GB hits around 312 TFLOPS. Real-world training throughput is always lower; that's where the efficiency factor comes in.
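That efficiency factor is essentially what the training community calls MFU (model FLOPs utilization). If you have a profiled run, you can back it out from observed token throughput. A minimal sketch; the helper name here is my own, not part of the tool:

```python
def observed_efficiency(
    tokens_per_second: float,   # measured training throughput
    params_billions: float,
    num_gpus: int,
    gpu_tflops_fp16: float,     # per-GPU peak, e.g. 989.0 for H100 SXM
) -> float:
    """Achieved FLOPs divided by peak FLOPs, via the 6N-FLOPs-per-token rule."""
    achieved_flops = 6 * params_billions * 1e9 * tokens_per_second
    peak_flops = num_gpus * gpu_tflops_fp16 * 1e12
    return achieved_flops / peak_flops


# A 7B model sustaining 75,000 tokens/s on 8x H100 is running at ~40% of peak
print(f"MFU: {observed_efficiency(75_000, 7, 8, 989.0):.2f}")
# MFU: 0.40
```

Whatever this gives you on a short pilot run is the efficiency value worth plugging into the estimator.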

Estimate Training Time

The standard approximation for transformer training compute is C ≈ 6ND, from the transformer scaling-laws literature (Kaplan et al.) and used throughout the Chinchilla paper. The formula estimates total floating-point operations, then divides by your hardware throughput:

training_time_hours = (6 * params_B * tokens_B * 1e18) / (num_gpus * gpu_tflops * 1e12 * 3600 * efficiency)

The 6 multiplier accounts for the forward and backward passes: roughly 2 FLOPs per parameter per token for the forward pass and another 4 for the backward pass. efficiency captures how much of peak TFLOPS you actually achieve, typically 0.3 to 0.5 for large training runs, depending on your parallelism strategy and interconnect.

# training_estimator.py

def estimate_training_hours(
    params_billions: float,
    tokens_billions: float,
    num_gpus: int,
    gpu_tflops_fp16: float,
    efficiency: float = 0.4,
) -> float:
    """
    Estimate training time using the Chinchilla-style approximation.

    6 * N * D gives total FLOPs for a transformer training run where:
      N = parameter count
      D = number of training tokens
    """
    total_flops = 6 * params_billions * 1e9 * tokens_billions * 1e9
    effective_throughput = num_gpus * gpu_tflops_fp16 * 1e12 * efficiency
    training_seconds = total_flops / effective_throughput
    return training_seconds / 3600


# Quick sanity check: 7B model, 2T tokens, 8x H100
hours = estimate_training_hours(
    params_billions=7,
    tokens_billions=2000,  # 2T tokens = 2,000 billion
    num_gpus=8,
    gpu_tflops_fp16=989.0,
    efficiency=0.4,
)
print(f"Estimated training time: {hours:.1f} hours ({hours / 24:.1f} days)")
# Estimated training time: 7372.8 hours (307.2 days)
# That's a single 8-GPU node — you'd need multi-node for a run this size

That output makes sense. Training a 7B parameter model on 2 trillion tokens with a single 8-GPU node would take a very long time. In practice, teams use 32 to 256+ GPUs for runs like this. The calculator helps you figure out exactly how many nodes to rent and what it’ll cost.
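You can also run the estimate in reverse to answer the planning question directly: how many nodes do I need to finish by a deadline? A sketch along the same lines; nodes_needed is my own helper, and it optimistically assumes efficiency stays flat as you add nodes:

```python
import math


def estimate_training_hours(params_b, tokens_b, num_gpus, tflops, eff=0.4):
    # Same 6 * N * D approximation as training_estimator.py
    total_flops = 6 * params_b * 1e9 * tokens_b * 1e9
    return total_flops / (num_gpus * tflops * 1e12 * eff) / 3600


def nodes_needed(params_b, tokens_b, deadline_hours, gpus_per_node=8, tflops=989.0, eff=0.4):
    """Smallest node count whose estimated runtime fits inside the deadline."""
    single_node_hours = estimate_training_hours(params_b, tokens_b, gpus_per_node, tflops, eff)
    return math.ceil(single_node_hours / deadline_hours)


# 7B model, 2T tokens, two-week (336-hour) deadline on 8x H100 nodes
print(f"Nodes needed: {nodes_needed(7, 2000, deadline_hours=336)}")
# Nodes needed: 22
```

In reality, multi-node runs lose some efficiency to inter-node communication, so treat the result as a floor and budget a node or two of headroom.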

Calculate Total Cost

Now combine the estimator with pricing data to produce a comparison across providers and pricing tiers.

# cost_calculator.py

from gpu_pricing import GPU_INSTANCES, GPUInstance
from training_estimator import estimate_training_hours


def calculate_cost(
    params_billions: float,
    tokens_billions: float,
    instance_key: str,
    num_nodes: int = 1,
    efficiency: float = 0.4,
) -> dict:
    instance = GPU_INSTANCES[instance_key]
    total_gpus = instance.num_gpus * num_nodes

    hours = estimate_training_hours(
        params_billions=params_billions,
        tokens_billions=tokens_billions,
        num_gpus=total_gpus,
        gpu_tflops_fp16=instance.gpu_tflops_fp16,
        efficiency=efficiency,
    )

    cost_ondemand = hours * instance.hourly_price_ondemand * num_nodes
    cost_spot = hours * instance.hourly_price_spot * num_nodes

    return {
        "provider": instance.provider,
        "instance_type": instance.instance_type,
        "gpu_model": instance.gpu_model,
        "total_gpus": total_gpus,
        "estimated_hours": round(hours, 1),
        "estimated_days": round(hours / 24, 1),
        "cost_ondemand": round(cost_ondemand, 2),
        "cost_spot": round(cost_spot, 2),
    }


def compare_all_providers(
    params_billions: float,
    tokens_billions: float,
    gpu_model: str = "H100 SXM",
    num_nodes: int = 1,
    efficiency: float = 0.4,
) -> list[dict]:
    results = []
    for key, instance in GPU_INSTANCES.items():
        if instance.gpu_model == gpu_model:
            result = calculate_cost(
                params_billions, tokens_billions, key, num_nodes, efficiency
            )
            results.append(result)
    results.sort(key=lambda x: x["cost_ondemand"])
    return results


# Compare H100 pricing for a 13B model on 300B tokens across 4 nodes
results = compare_all_providers(
    params_billions=13,
    tokens_billions=300,  # 300B tokens
    gpu_model="H100 SXM",
    num_nodes=4,
    efficiency=0.4,
)

print(f"{'Provider':<10} {'Instance':<30} {'GPUs':>5} {'Hours':>8} {'On-Demand':>12} {'Spot':>12}")
print("-" * 80)
for r in results:
    print(
        f"{r['provider']:<10} {r['instance_type']:<30} {r['total_gpus']:>5} "
        f"{r['estimated_hours']:>8.1f} ${r['cost_ondemand']:>11,.2f} ${r['cost_spot']:>11,.2f}"
    )

Sample output for a 13B model trained on 300B tokens using 4 nodes (32 H100s):

Provider   Instance                        GPUs    Hours    On-Demand         Spot
--------------------------------------------------------------------------------
Azure      Standard_ND96isr_H100_v5          32    513.5 $ 197,908.24 $  79,155.08
AWS        p5.48xlarge                       32    513.5 $ 201,933.77 $ 133,499.75
GCP        a3-highgpu-8g                     32    513.5 $ 207,889.91 $  72,767.63

GCP spot pricing is the cheapest in this scenario, but spot instances get preempted. For uninterruptible runs, Azure edges out AWS on on-demand pricing. These numbers shift constantly, so treat them as planning estimates, not invoices.

Build a CLI Cost Estimator

Wrap everything in a clean command-line tool using argparse so you can run quick estimates from the terminal.

#!/usr/bin/env python3
# train_cost_cli.py

import argparse
from dataclasses import dataclass


@dataclass
class GPUInstance:
    provider: str
    instance_type: str
    gpu_model: str
    num_gpus: int
    gpu_tflops_fp16: float
    hourly_price_ondemand: float
    hourly_price_spot: float


GPU_INSTANCES = {
    "p5.48xlarge": GPUInstance("AWS", "p5.48xlarge", "H100", 8, 989.0, 98.32, 65.00),
    "p4d.24xlarge": GPUInstance("AWS", "p4d.24xlarge", "A100", 8, 312.0, 32.77, 14.50),
    "a3-highgpu-8g": GPUInstance("GCP", "a3-highgpu-8g", "H100", 8, 989.0, 101.22, 35.43),
    "a2-highgpu-8g": GPUInstance("GCP", "a2-highgpu-8g", "A100", 8, 312.0, 29.39, 8.82),
    "nd-h100-v5": GPUInstance("Azure", "ND96isr_H100_v5", "H100", 8, 989.0, 96.36, 38.54),
    "nd-a100-v4": GPUInstance("Azure", "ND96asr_A100_v4", "A100", 8, 312.0, 27.20, 10.88),
}


def estimate_training_hours(
    params_b: float, tokens_b: float, num_gpus: int, tflops: float, eff: float
) -> float:
    total_flops = 6 * params_b * 1e9 * tokens_b * 1e9
    throughput = num_gpus * tflops * 1e12 * eff
    return total_flops / throughput / 3600


def main():
    parser = argparse.ArgumentParser(
        description="Estimate ML training costs across cloud providers"
    )
    parser.add_argument(
        "--params", type=float, required=True, help="Model size in billions of parameters"
    )
    parser.add_argument(
        "--tokens", type=float, required=True, help="Training tokens in billions"
    )
    parser.add_argument(
        "--nodes", type=int, default=1, help="Number of GPU nodes (default: 1)"
    )
    parser.add_argument(
        "--gpu", choices=["H100", "A100"], default="H100", help="GPU model (default: H100)"
    )
    parser.add_argument(
        "--efficiency", type=float, default=0.4,
        help="Hardware efficiency factor 0.0-1.0 (default: 0.4)",
    )
    args = parser.parse_args()

    print(f"\nTraining cost estimate: {args.params}B params, {args.tokens}B tokens")
    print(f"GPU: {args.gpu} | Nodes: {args.nodes} | Efficiency: {args.efficiency}")
    print("=" * 85)
    print(
        f"{'Provider':<9} {'Instance':<25} {'GPUs':>5} {'Hours':>9} "
        f"{'Days':>7} {'On-Demand':>12} {'Spot':>12}"
    )
    print("-" * 85)

    for key, inst in GPU_INSTANCES.items():
        if inst.gpu_model != args.gpu:
            continue

        total_gpus = inst.num_gpus * args.nodes
        hours = estimate_training_hours(
            args.params, args.tokens, total_gpus, inst.gpu_tflops_fp16, args.efficiency
        )
        cost_od = hours * inst.hourly_price_ondemand * args.nodes
        cost_sp = hours * inst.hourly_price_spot * args.nodes

        print(
            f"{inst.provider:<9} {inst.instance_type:<25} {total_gpus:>5} "
            f"{hours:>9.1f} {hours / 24:>7.1f} ${cost_od:>11,.2f} ${cost_sp:>11,.2f}"
        )

    print()


if __name__ == "__main__":
    main()

Run it from the terminal:

python3 train_cost_cli.py --params 7 --tokens 140 --nodes 4 --gpu H100

# Training cost estimate: 7.0B params, 140.0B tokens
# GPU: H100 | Nodes: 4 | Efficiency: 0.4
# =====================================================================================
# Provider  Instance                   GPUs     Hours    Days    On-Demand         Spot
# -------------------------------------------------------------------------------------
# AWS       p5.48xlarge                  32     129.0     5.4 $  50,742.33 $  33,546.09
# GCP       a3-highgpu-8g                32     129.0     5.4 $  52,239.00 $  18,285.20
# Azure     ND96isr_H100_v5              32     129.0     5.4 $  49,730.79 $  19,890.25

You can also estimate smaller runs. Fine-tuning a 7B model on 10B tokens takes a fraction of the time:

python3 train_cost_cli.py --params 7 --tokens 10 --nodes 1 --gpu H100 --efficiency 0.45

Common Errors and Fixes

KeyError when looking up instance types

This happens when instance keys don’t match your pricing dictionary exactly. Cloud providers rename instances periodically. Double-check the key you’re passing matches the dictionary. Print GPU_INSTANCES.keys() to see available options.

if instance_key not in GPU_INSTANCES:
    available = ", ".join(GPU_INSTANCES.keys())
    raise ValueError(f"Unknown instance '{instance_key}'. Available: {available}")

Wildly inaccurate cost estimates (off by 10x or more)

The efficiency factor is almost always the culprit. If you set efficiency=1.0, you’re assuming perfect hardware utilization, which never happens. Real training efficiency on large runs is 0.3-0.5 for well-optimized pipelines. Data loading bottlenecks, gradient synchronization overhead, and pipeline bubble time all eat into throughput. Start with 0.35 for a conservative estimate and adjust once you have real profiling data.
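A quick sweep shows how hard that one knob moves the bottom line. The sketch below assumes a hypothetical 7B-parameter model trained on 100B tokens, one 8x H100 node at the p5.48xlarge on-demand rate:

```python
# Sensitivity of time and cost to the efficiency assumption.
HOURLY_RATE = 98.32  # AWS p5.48xlarge on-demand, $/hr


def run_hours(efficiency: float) -> float:
    total_flops = 6 * 7e9 * 100e9          # 7B params, 100B tokens
    throughput = 8 * 989e12 * efficiency   # 8x H100 at the given efficiency
    return total_flops / throughput / 3600


for eff in (0.2, 0.3, 0.4, 0.5):
    h = run_hours(eff)
    print(f"efficiency={eff:.1f}: {h:6.1f} hours  ${h * HOURLY_RATE:>10,.2f}")
```

Halving the efficiency doubles both the runtime and the bill, which is why profiling a short pilot run pays for itself before a long one.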

ZeroDivisionError in training time calculation

This happens when num_gpus, gpu_tflops, or efficiency is zero. Add input validation before the calculation:

def estimate_training_hours(params_b, tokens_b, num_gpus, tflops, eff):
    if num_gpus <= 0:
        raise ValueError(f"num_gpus must be positive, got {num_gpus}")
    if tflops <= 0:
        raise ValueError(f"gpu_tflops must be positive, got {tflops}")
    if not 0 < eff <= 1.0:
        raise ValueError(f"efficiency must be between 0 and 1, got {eff}")
    total_flops = 6 * params_b * 1e9 * tokens_b * 1e9
    throughput = num_gpus * tflops * 1e12 * eff
    return total_flops / throughput / 3600

Spot instance cost estimates are misleading

Spot prices shown here are approximations. Real spot pricing fluctuates by region and time of day. AWS spot can spike to near on-demand during high demand. GCP preemptible VMs have a 24-hour maximum runtime, so you need checkpointing. Azure spot VMs can be evicted with 30 seconds notice. Always add 15-20% buffer to spot estimates for checkpoint/restart overhead.