Spinning up 8xH100 instances for a week without checking the price first is how teams blow their entire ML budget in a single training run. A single AWS p5.48xlarge (8x H100) runs roughly $98/hour on-demand, about $2,360/day. Multiply that by a two-week training run and you're looking at over $33,000 per node before you've even tuned a hyperparameter. A quick cost estimate before you hit launch saves real money and prevents uncomfortable conversations with finance.
Here’s how to build a Python tool that estimates training costs across AWS, GCP, and Azure using real GPU pricing and standard training time approximations.
Define GPU Instance Pricing
Start with a data structure that holds instance specs and pricing for the most common GPU training instances. These are approximate on-demand hourly rates as of early 2026 — they shift regularly, but the ballpark is what matters for planning.
```python
# gpu_pricing.py
from dataclasses import dataclass


@dataclass
class GPUInstance:
    provider: str
    instance_type: str
    gpu_model: str
    num_gpus: int
    gpu_memory_gb: int
    gpu_tflops_fp16: float  # per GPU, FP16/BF16 peak TFLOPS
    hourly_price_ondemand: float
    hourly_price_spot: float  # spot/preemptible estimate


GPU_INSTANCES = {
    # AWS
    "p5.48xlarge": GPUInstance(
        provider="AWS",
        instance_type="p5.48xlarge",
        gpu_model="H100 SXM",
        num_gpus=8,
        gpu_memory_gb=80,
        gpu_tflops_fp16=989.0,
        hourly_price_ondemand=98.32,
        hourly_price_spot=65.00,
    ),
    "p4d.24xlarge": GPUInstance(
        provider="AWS",
        instance_type="p4d.24xlarge",
        gpu_model="A100 40GB",
        num_gpus=8,
        gpu_memory_gb=40,
        gpu_tflops_fp16=312.0,
        hourly_price_ondemand=32.77,
        hourly_price_spot=14.50,
    ),
    # GCP
    "a3-highgpu-8g": GPUInstance(
        provider="GCP",
        instance_type="a3-highgpu-8g",
        gpu_model="H100 SXM",
        num_gpus=8,
        gpu_memory_gb=80,
        gpu_tflops_fp16=989.0,
        hourly_price_ondemand=101.22,
        hourly_price_spot=35.43,
    ),
    "a2-highgpu-8g": GPUInstance(
        provider="GCP",
        instance_type="a2-highgpu-8g",
        gpu_model="A100 40GB",
        num_gpus=8,
        gpu_memory_gb=40,
        gpu_tflops_fp16=312.0,
        hourly_price_ondemand=29.39,
        hourly_price_spot=8.82,
    ),
    # Azure
    "Standard_ND96isr_H100_v5": GPUInstance(
        provider="Azure",
        instance_type="Standard_ND96isr_H100_v5",
        gpu_model="H100 SXM",
        num_gpus=8,
        gpu_memory_gb=80,
        gpu_tflops_fp16=989.0,
        hourly_price_ondemand=96.36,
        hourly_price_spot=38.54,
    ),
    "Standard_ND96asr_A100_v4": GPUInstance(
        provider="Azure",
        instance_type="Standard_ND96asr_A100_v4",
        gpu_model="A100 80GB",
        num_gpus=8,
        gpu_memory_gb=80,
        gpu_tflops_fp16=312.0,
        hourly_price_ondemand=27.20,
        hourly_price_spot=10.88,
    ),
}
```
The gpu_tflops_fp16 field is the per-GPU peak FP16/BF16 throughput. H100 SXM peaks at about 989 TFLOPS dense FP16/BF16 (the often-quoted 1,979 TFLOPS figure assumes structured sparsity), and the A100 40GB hits around 312 TFLOPS. Real-world training throughput is always lower; that's where the efficiency factor comes in.
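A useful number to derive from this table is the price per effective TFLOP-hour, which normalizes sticker price against what the hardware actually delivers. Here is a minimal sketch with a few specs copied inline from the table above; the 0.4 efficiency is an assumed utilization, not a measured one:

```python
# Price per effective TFLOP-hour, derived from the pricing table above.
# EFFICIENCY = 0.4 is an assumption, matching the default used later.
EFFICIENCY = 0.4

INSTANCES = {
    # key: (num_gpus, peak FP16 TFLOPS per GPU, on-demand $/hr)
    "p5.48xlarge": (8, 989.0, 98.32),
    "p4d.24xlarge": (8, 312.0, 32.77),
    "a2-highgpu-8g": (8, 312.0, 29.39),
}

def price_per_effective_tflop_hour(num_gpus: int, tflops: float, hourly: float) -> float:
    # dollars per hour for each TFLOPS you actually get after the efficiency haircut
    return hourly / (num_gpus * tflops * EFFICIENCY)

for key, (n, t, p) in INSTANCES.items():
    print(f"{key:<16} ${price_per_effective_tflop_hour(n, t, p):.4f}")
```

Running this shows p5.48xlarge at roughly $0.031 per effective TFLOP-hour versus about $0.033 for p4d.24xlarge, so the H100 node is slightly cheaper per unit of compute despite a 3x sticker price.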
Estimate Training Time
The standard approximation for transformer training compute is the 6ND rule used throughout the scaling-law literature (including the Chinchilla paper): total FLOPs is roughly 6 times the parameter count times the number of training tokens. Estimate total floating-point operations, then divide by your hardware throughput:
```
training_time_hours = (6 * params_B * tokens_B * 1e18) / (num_gpus * gpu_tflops * 1e12 * 3600 * efficiency)
```
The 6 multiplier covers the forward and backward passes: roughly 2 FLOPs per parameter per token for the forward pass and another 4 for the backward pass. efficiency captures how much of peak TFLOPS you actually achieve, typically 0.3 to 0.5 for large training runs, depending on your parallelism strategy and interconnect.
```python
# training_estimator.py
def estimate_training_hours(
    params_billions: float,
    tokens_billions: float,
    num_gpus: int,
    gpu_tflops_fp16: float,
    efficiency: float = 0.4,
) -> float:
    """
    Estimate training time using the Chinchilla-style approximation.

    6 * N * D gives total FLOPs for a transformer training run where:
      N = parameter count
      D = number of training tokens
    """
    total_flops = 6 * params_billions * 1e9 * tokens_billions * 1e9
    effective_throughput = num_gpus * gpu_tflops_fp16 * 1e12 * efficiency
    training_seconds = total_flops / effective_throughput
    return training_seconds / 3600


# Quick sanity check: 7B model, 2T tokens, 8x H100
hours = estimate_training_hours(
    params_billions=7,
    tokens_billions=2000,  # 2T tokens = 2,000 billion
    num_gpus=8,
    gpu_tflops_fp16=989.0,
    efficiency=0.4,
)
print(f"Estimated training time: {hours:.1f} hours ({hours / 24:.1f} days)")
# Estimated training time: 7372.8 hours (307.2 days)
# That's a single 8-GPU node — you'd need multi-node for a run this size
```
That output makes sense. Training a 7B parameter model on 2 trillion tokens with a single 8-GPU node would take a very long time. In practice, teams use 32 to 256+ GPUs for runs like this. The calculator helps you figure out exactly how many nodes to rent and what it’ll cost.
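The same formula can be inverted to answer that planning question directly: given a deadline, how many nodes do you need? A small sketch under the same assumptions (8-GPU nodes, H100 peak TFLOPS, 0.4 efficiency); the nodes_needed helper is illustrative, not part of the files above:

```python
import math

def nodes_needed(
    params_billions: float,
    tokens_billions: float,
    target_days: float,
    gpus_per_node: int = 8,
    gpu_tflops_fp16: float = 989.0,
    efficiency: float = 0.4,
) -> int:
    """Invert 6*N*D = gpus * tflops * efficiency * seconds, round up to whole nodes."""
    total_flops = 6 * params_billions * 1e9 * tokens_billions * 1e9
    target_seconds = target_days * 24 * 3600
    required_gpus = total_flops / (gpu_tflops_fp16 * 1e12 * efficiency * target_seconds)
    return math.ceil(required_gpus / gpus_per_node)

# 7B model on 2T tokens, finished within 30 days
print(nodes_needed(7, 2000, 30))  # 11 (88 H100s)
```

At a 30-day deadline the 7B/2T run from the sanity check above needs 11 nodes; relax the deadline to roughly the single-node training time and the answer drops back to 1.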
Calculate Total Cost
Now combine the estimator with pricing data to produce a comparison across providers and pricing tiers.
```python
# cost_calculator.py
from gpu_pricing import GPU_INSTANCES, GPUInstance
from training_estimator import estimate_training_hours


def calculate_cost(
    params_billions: float,
    tokens_billions: float,
    instance_key: str,
    num_nodes: int = 1,
    efficiency: float = 0.4,
) -> dict:
    instance = GPU_INSTANCES[instance_key]
    total_gpus = instance.num_gpus * num_nodes
    hours = estimate_training_hours(
        params_billions=params_billions,
        tokens_billions=tokens_billions,
        num_gpus=total_gpus,
        gpu_tflops_fp16=instance.gpu_tflops_fp16,
        efficiency=efficiency,
    )
    cost_ondemand = hours * instance.hourly_price_ondemand * num_nodes
    cost_spot = hours * instance.hourly_price_spot * num_nodes
    return {
        "provider": instance.provider,
        "instance_type": instance.instance_type,
        "gpu_model": instance.gpu_model,
        "total_gpus": total_gpus,
        "estimated_hours": round(hours, 1),
        "estimated_days": round(hours / 24, 1),
        "cost_ondemand": round(cost_ondemand, 2),
        "cost_spot": round(cost_spot, 2),
    }


def compare_all_providers(
    params_billions: float,
    tokens_billions: float,
    gpu_model: str = "H100 SXM",
    num_nodes: int = 1,
    efficiency: float = 0.4,
) -> list[dict]:
    results = []
    for key, instance in GPU_INSTANCES.items():
        if instance.gpu_model == gpu_model:
            result = calculate_cost(
                params_billions, tokens_billions, key, num_nodes, efficiency
            )
            results.append(result)
    results.sort(key=lambda x: x["cost_ondemand"])
    return results


# Compare H100 pricing for a 13B model on 300B tokens across 4 nodes
results = compare_all_providers(
    params_billions=13,
    tokens_billions=300,  # 300B tokens
    gpu_model="H100 SXM",
    num_nodes=4,
    efficiency=0.4,
)

print(f"{'Provider':<10} {'Instance':<30} {'GPUs':>5} {'Hours':>8} {'On-Demand':>12} {'Spot':>12}")
print("-" * 80)
for r in results:
    print(
        f"{r['provider']:<10} {r['instance_type']:<30} {r['total_gpus']:>5} "
        f"{r['estimated_hours']:>8.1f} ${r['cost_ondemand']:>11,.2f} ${r['cost_spot']:>11,.2f}"
    )
```
Sample output for a 13B model trained on 300B tokens using 4 nodes (32 H100s):
```
Provider   Instance                        GPUs    Hours    On-Demand         Spot
--------------------------------------------------------------------------------
Azure      Standard_ND96isr_H100_v5          32    513.5 $ 197,908.24 $  79,155.08
AWS        p5.48xlarge                       32    513.5 $ 201,933.78 $ 133,499.75
GCP        a3-highgpu-8g                     32    513.5 $ 207,889.92 $  72,767.63
```
GCP spot pricing is the cheapest in this scenario, but spot instances get preempted. For uninterruptible runs, Azure edges out AWS on on-demand pricing. These numbers shift constantly, so treat them as planning estimates, not invoices.
Build a CLI Cost Estimator
Wrap everything in a clean command-line tool using argparse so you can run quick estimates from the terminal.
```python
#!/usr/bin/env python3
# train_cost_cli.py
import argparse
from dataclasses import dataclass


@dataclass
class GPUInstance:
    provider: str
    instance_type: str
    gpu_model: str
    num_gpus: int
    gpu_tflops_fp16: float
    hourly_price_ondemand: float
    hourly_price_spot: float


GPU_INSTANCES = {
    "p5.48xlarge": GPUInstance("AWS", "p5.48xlarge", "H100", 8, 989.0, 98.32, 65.00),
    "p4d.24xlarge": GPUInstance("AWS", "p4d.24xlarge", "A100", 8, 312.0, 32.77, 14.50),
    "a3-highgpu-8g": GPUInstance("GCP", "a3-highgpu-8g", "H100", 8, 989.0, 101.22, 35.43),
    "a2-highgpu-8g": GPUInstance("GCP", "a2-highgpu-8g", "A100", 8, 312.0, 29.39, 8.82),
    "nd-h100-v5": GPUInstance("Azure", "ND96isr_H100_v5", "H100", 8, 989.0, 96.36, 38.54),
    "nd-a100-v4": GPUInstance("Azure", "ND96asr_A100_v4", "A100", 8, 312.0, 27.20, 10.88),
}


def estimate_training_hours(
    params_b: float, tokens_b: float, num_gpus: int, tflops: float, eff: float
) -> float:
    total_flops = 6 * params_b * 1e9 * tokens_b * 1e9
    throughput = num_gpus * tflops * 1e12 * eff
    return total_flops / throughput / 3600


def main():
    parser = argparse.ArgumentParser(
        description="Estimate ML training costs across cloud providers"
    )
    parser.add_argument(
        "--params", type=float, required=True, help="Model size in billions of parameters"
    )
    parser.add_argument(
        "--tokens", type=float, required=True, help="Training tokens in billions"
    )
    parser.add_argument(
        "--nodes", type=int, default=1, help="Number of GPU nodes (default: 1)"
    )
    parser.add_argument(
        "--gpu", choices=["H100", "A100"], default="H100", help="GPU model (default: H100)"
    )
    parser.add_argument(
        "--efficiency", type=float, default=0.4,
        help="Hardware efficiency factor 0.0-1.0 (default: 0.4)",
    )
    args = parser.parse_args()

    print(f"\nTraining cost estimate: {args.params}B params, {args.tokens}B tokens")
    print(f"GPU: {args.gpu} | Nodes: {args.nodes} | Efficiency: {args.efficiency}")
    print("=" * 85)
    print(
        f"{'Provider':<9} {'Instance':<25} {'GPUs':>5} {'Hours':>9} "
        f"{'Days':>7} {'On-Demand':>12} {'Spot':>12}"
    )
    print("-" * 85)
    for key, inst in GPU_INSTANCES.items():
        if inst.gpu_model != args.gpu:
            continue
        total_gpus = inst.num_gpus * args.nodes
        hours = estimate_training_hours(
            args.params, args.tokens, total_gpus, inst.gpu_tflops_fp16, args.efficiency
        )
        cost_od = hours * inst.hourly_price_ondemand * args.nodes
        cost_sp = hours * inst.hourly_price_spot * args.nodes
        print(
            f"{inst.provider:<9} {inst.instance_type:<25} {total_gpus:>5} "
            f"{hours:>9.1f} {hours / 24:>7.1f} ${cost_od:>11,.2f} ${cost_sp:>11,.2f}"
        )
    print()


if __name__ == "__main__":
    main()
```
Run it from the terminal:
```shell
python3 train_cost_cli.py --params 7 --tokens 1000 --nodes 4 --gpu H100

# Training cost estimate: 7.0B params, 1000.0B tokens
# GPU: H100 | Nodes: 4 | Efficiency: 0.4
# =====================================================================================
# Provider  Instance                   GPUs     Hours    Days    On-Demand         Spot
# -------------------------------------------------------------------------------------
# AWS       p5.48xlarge                  32     921.6    38.4 $ 362,445.23 $ 239,614.93
# GCP       a3-highgpu-8g                32     921.6    38.4 $ 373,135.74 $ 130,608.57
# Azure     ND96isr_H100_v5              32     921.6    38.4 $ 355,219.92 $ 142,073.22
```
You can also estimate smaller runs — fine-tuning a 7B model on 10B tokens takes a fraction of the time:
```shell
python3 train_cost_cli.py --params 7 --tokens 10 --nodes 1 --gpu H100 --efficiency 0.45
```
Common Errors and Fixes
KeyError: 'p5.48xlarge' when looking up instance types
This happens when instance keys don’t match your pricing dictionary exactly. Cloud providers rename instances periodically. Double-check the key you’re passing matches the dictionary. Print GPU_INSTANCES.keys() to see available options.
```python
if instance_key not in GPU_INSTANCES:
    available = ", ".join(GPU_INSTANCES.keys())
    raise ValueError(f"Unknown instance '{instance_key}'. Available: {available}")
```
Wildly inaccurate cost estimates (off by 10x or more)
The efficiency factor is almost always the culprit. If you set efficiency=1.0, you’re assuming perfect hardware utilization, which never happens. Real training efficiency on large runs is 0.3-0.5 for well-optimized pipelines. Data loading bottlenecks, gradient synchronization overhead, and pipeline bubble time all eat into throughput. Start with 0.35 for a conservative estimate and adjust once you have real profiling data.
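Since the estimate scales with 1/efficiency, a quick way to sanity-check a number is to bracket it across the plausible efficiency range instead of trusting a single value. A small sketch; estimate_hours inlines the same 6ND formula used above so it runs standalone:

```python
def estimate_hours(params_b: float, tokens_b: float, num_gpus: int,
                   tflops: float, eff: float) -> float:
    # Same 6*N*D approximation as the estimator above, inlined for a standalone run
    return 6 * params_b * 1e9 * tokens_b * 1e9 / (num_gpus * tflops * 1e12 * eff) / 3600

# Bracket a 7B / 1T-token run on 32 H100s across the plausible efficiency range
for eff in (0.3, 0.4, 0.5):
    print(f"efficiency={eff}: {estimate_hours(7, 1000, 32, 989.0, eff):.0f} hours")
```

Across 0.3 to 0.5 the estimate swings from about 1,229 hours down to 737, a roughly 1.7x spread, which is why quoting a single unexamined number is risky.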
ZeroDivisionError in training time calculation
This happens when num_gpus, gpu_tflops, or efficiency is zero. Add input validation before the calculation:
```python
def estimate_training_hours(params_b, tokens_b, num_gpus, tflops, eff):
    if num_gpus <= 0:
        raise ValueError(f"num_gpus must be positive, got {num_gpus}")
    if tflops <= 0:
        raise ValueError(f"gpu_tflops must be positive, got {tflops}")
    if not 0 < eff <= 1.0:
        raise ValueError(f"efficiency must be between 0 and 1, got {eff}")
    total_flops = 6 * params_b * 1e9 * tokens_b * 1e9
    throughput = num_gpus * tflops * 1e12 * eff
    return total_flops / throughput / 3600
```
Spot instance cost estimates are misleading
Spot prices shown here are approximations. Real spot pricing fluctuates by region and time of day. AWS spot can spike to near on-demand during high demand. GCP preemptible VMs have a 24-hour maximum runtime, so you need checkpointing. Azure spot VMs can be evicted with 30 seconds' notice. Always add a 15-20% buffer to spot estimates for checkpoint/restart overhead.
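That buffer is easy to bake into the spot numbers rather than applying it by hand. A sketch; the buffered_spot_cost helper is illustrative, and the 18% default is an assumption sitting in the middle of the 15-20% range above, not a measured overhead:

```python
def buffered_spot_cost(hours: float, hourly_spot: float, num_nodes: int,
                       overhead: float = 0.18) -> float:
    # Inflate runtime by the checkpoint/restart overhead, then price at spot rates
    return hours * (1 + overhead) * hourly_spot * num_nodes

# 100-hour run on 4 GCP a3-highgpu-8g nodes at the $35.43/hr spot estimate
print(f"${buffered_spot_cost(100, 35.43, 4):,.2f}")  # $16,722.96
```

Compare that against the unbuffered on-demand figure before committing to spot: if the buffered spot cost is within 20-30% of on-demand, the preemption risk may not be worth it.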