Running a training job blind is a recipe for wasted GPU hours. You need two kinds of visibility: ML metrics (loss, accuracy, learning rate) and system metrics (GPU utilization, memory, throughput). TensorBoard handles the first part well. Prometheus and Grafana handle the second. This guide wires them together so you get a single monitoring stack for everything.

The Architecture

TensorBoard reads event files written by PyTorch’s SummaryWriter. Prometheus scrapes a metrics endpoint your training script exposes via prometheus_client. Grafana queries Prometheus and displays system metrics alongside your ML metrics. All three services run in Docker containers.

Your training script does double duty: it writes TensorBoard logs to disk and serves Prometheus metrics on an HTTP port.

Install Dependencies

pip install torch torchvision tensorboard prometheus_client

You also need Docker and Docker Compose for the monitoring stack. The training script itself runs on your host (or in its own container if you prefer).

Log ML Metrics with TensorBoard

PyTorch’s SummaryWriter writes event files that TensorBoard reads. Here is the pattern for logging training metrics from a real CIFAR-10 training loop.

import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
from torch.utils.tensorboard import SummaryWriter

# Simple CNN for CIFAR-10
class SimpleCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 32, 3, padding=1)
        self.conv2 = nn.Conv2d(32, 64, 3, padding=1)
        self.pool = nn.MaxPool2d(2, 2)
        self.fc1 = nn.Linear(64 * 8 * 8, 256)
        self.fc2 = nn.Linear(256, 10)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.pool(self.relu(self.conv1(x)))
        x = self.pool(self.relu(self.conv2(x)))
        x = x.view(-1, 64 * 8 * 8)
        x = self.relu(self.fc1(x))
        x = self.fc2(x)
        return x

writer = SummaryWriter(log_dir="runs/cifar10_experiment")

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])

trainset = torchvision.datasets.CIFAR10(root="./data", train=True, download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=128, shuffle=True, num_workers=2)

valset = torchvision.datasets.CIFAR10(root="./data", train=False, download=True, transform=transform)
valloader = torch.utils.data.DataLoader(valset, batch_size=256, shuffle=False, num_workers=2)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = SimpleCNN().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

# Log the model graph
sample_input = torch.randn(1, 3, 32, 32).to(device)
writer.add_graph(model, sample_input)

global_step = 0
for epoch in range(30):
    model.train()
    running_loss = 0.0

    for batch_idx, (inputs, targets) in enumerate(trainloader):
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()

        running_loss += loss.item()
        global_step += 1

        if batch_idx % 50 == 49:
            avg_loss = running_loss / 50
            writer.add_scalar("train/loss", avg_loss, global_step)
            writer.add_scalar("train/learning_rate", scheduler.get_last_lr()[0], global_step)
            running_loss = 0.0

    # Validation after each epoch
    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for inputs, targets in valloader:
            inputs, targets = inputs.to(device), targets.to(device)
            outputs = model(inputs)
            _, predicted = outputs.max(1)
            total += targets.size(0)
            correct += predicted.eq(targets).sum().item()

    val_accuracy = correct / total
    writer.add_scalar("val/accuracy", val_accuracy, epoch)
    scheduler.step()

writer.close()

Key details: add_scalar takes a tag name, the value, and a step counter. Use global_step (batch-level) for training loss and epoch for validation accuracy – this gives you fine-grained loss curves and per-epoch accuracy. The add_graph call logs your model architecture so you can inspect it in TensorBoard’s Graphs tab.

Add Prometheus Metrics

The prometheus_client library exposes metrics on an HTTP endpoint that Prometheus scrapes at a configurable interval. You define three metric types: gauges for values that go up and down (loss, GPU memory), counters for monotonically increasing values (samples processed), and histograms for distributions (epoch duration).

import time
from prometheus_client import start_http_server, Gauge, Counter, Histogram

# Start metrics server on port 8000
start_http_server(8000)

# Define custom metrics
training_loss_gauge = Gauge("training_loss", "Current training loss")
val_accuracy_gauge = Gauge("val_accuracy", "Current validation accuracy")
gpu_memory_gauge = Gauge("gpu_memory_used_bytes", "GPU memory used in bytes")
gpu_utilization_gauge = Gauge("gpu_utilization_percent", "GPU utilization percentage")
samples_processed = Counter("samples_processed_total", "Total training samples processed")
epoch_duration = Histogram(
    "epoch_duration_seconds",
    "Time to complete one epoch",
    buckets=[30, 60, 120, 300, 600, 1200],
)

The start_http_server(8000) call launches a background thread serving metrics at http://localhost:8000/metrics. Prometheus will scrape this endpoint.
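You can sanity-check what Prometheus will see without even starting the server: prometheus_client can render the current registry in the same text exposition format the /metrics endpoint serves. A quick sketch (the demo_ metric name is hypothetical, chosen to avoid colliding with the training script's own metrics):

```python
from prometheus_client import Gauge, generate_latest

# Hypothetical gauge, registered in the default registry just for this demo
demo_loss = Gauge("demo_training_loss", "Demo training loss")
demo_loss.set(0.42)

# generate_latest() returns the exposition-format bytes that the
# /metrics endpoint would serve for the default registry
text = generate_latest().decode()
# text contains lines like:
#   # HELP demo_training_loss Demo training loss
#   # TYPE demo_training_loss gauge
#   demo_training_loss 0.42
```

This is also handy in unit tests for your instrumentation, since it needs no HTTP round trip.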

Combine Both in a Training Loop

Here is the full training loop that logs to TensorBoard and exposes Prometheus metrics simultaneously. This builds on the CIFAR-10 setup above.

import time
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
from torch.utils.tensorboard import SummaryWriter
from prometheus_client import start_http_server, Gauge, Counter, Histogram

# ---- Prometheus setup ----
start_http_server(8000)
training_loss_gauge = Gauge("training_loss", "Current training loss")
val_accuracy_gauge = Gauge("val_accuracy", "Current validation accuracy")
gpu_memory_gauge = Gauge("gpu_memory_used_bytes", "GPU memory used in bytes")
gpu_utilization_gauge = Gauge("gpu_utilization_percent", "GPU utilization percent")
samples_processed = Counter("samples_processed_total", "Total samples processed")
epoch_duration = Histogram(
    "epoch_duration_seconds",
    "Time per epoch",
    buckets=[30, 60, 120, 300, 600, 1200],
)

# ---- TensorBoard setup ----
writer = SummaryWriter(log_dir="runs/cifar10_monitored")

# ---- Data and model (same as before) ----
class SimpleCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 32, 3, padding=1)
        self.conv2 = nn.Conv2d(32, 64, 3, padding=1)
        self.pool = nn.MaxPool2d(2, 2)
        self.fc1 = nn.Linear(64 * 8 * 8, 256)
        self.fc2 = nn.Linear(256, 10)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.pool(self.relu(self.conv1(x)))
        x = self.pool(self.relu(self.conv2(x)))
        x = x.view(-1, 64 * 8 * 8)
        x = self.relu(self.fc1(x))
        x = self.fc2(x)
        return x

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])
trainset = torchvision.datasets.CIFAR10(root="./data", train=True, download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=128, shuffle=True, num_workers=2)
valset = torchvision.datasets.CIFAR10(root="./data", train=False, download=True, transform=transform)
valloader = torch.utils.data.DataLoader(valset, batch_size=256, shuffle=False, num_workers=2)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = SimpleCNN().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)


def update_gpu_metrics():
    """Push GPU stats to Prometheus gauges."""
    if torch.cuda.is_available():
        gpu_memory_gauge.set(torch.cuda.memory_allocated(0))
        # torch doesn't expose utilization directly; use pynvml for real values
        # Here we report memory utilization as a proxy
        total = torch.cuda.get_device_properties(0).total_memory
        used = torch.cuda.memory_allocated(0)
        gpu_utilization_gauge.set((used / total) * 100)


global_step = 0
for ep in range(30):
    epoch_start = time.time()
    model.train()
    running_loss = 0.0

    for batch_idx, (inputs, targets) in enumerate(trainloader):
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()

        running_loss += loss.item()
        global_step += 1
        samples_processed.inc(inputs.size(0))

        if batch_idx % 50 == 49:
            avg_loss = running_loss / 50
            # TensorBoard
            writer.add_scalar("train/loss", avg_loss, global_step)
            writer.add_scalar("train/lr", scheduler.get_last_lr()[0], global_step)
            # Prometheus
            training_loss_gauge.set(avg_loss)
            update_gpu_metrics()
            running_loss = 0.0

    # Validation
    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for inputs, targets in valloader:
            inputs, targets = inputs.to(device), targets.to(device)
            outputs = model(inputs)
            _, predicted = outputs.max(1)
            total += targets.size(0)
            correct += predicted.eq(targets).sum().item()

    val_acc = correct / total
    writer.add_scalar("val/accuracy", val_acc, ep)
    val_accuracy_gauge.set(val_acc)

    elapsed = time.time() - epoch_start
    epoch_duration.observe(elapsed)
    scheduler.step()
    print(f"Epoch {ep+1}/30 | Loss: {avg_loss:.4f} | Val Acc: {val_acc:.4f} | Time: {elapsed:.1f}s")

writer.close()

Every 50 batches the script updates both TensorBoard event files and Prometheus gauges. At the end of each epoch it records validation accuracy and epoch duration. The samples_processed counter increments by batch size on every step, giving you throughput data in Prometheus.

Docker Compose for the Monitoring Stack

This docker-compose.yaml runs TensorBoard, Prometheus, and Grafana together. Mount your TensorBoard log directory and Prometheus config as volumes.

version: "3.8"

services:
  tensorboard:
    image: tensorflow/tensorflow:latest
    command: tensorboard --logdir=/logs --host=0.0.0.0 --port=6006
    ports:
      - "6006:6006"
    volumes:
      - ./runs:/logs
    restart: unless-stopped

  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    restart: unless-stopped
    depends_on:
      - prometheus

Create the Prometheus config file prometheus.yml in your project root:

global:
  scrape_interval: 5s

scrape_configs:
  - job_name: "training_metrics"
    static_configs:
      - targets: ["host.docker.internal:8000"]

The host.docker.internal target tells Prometheus (running in Docker) to scrape your training script (running on the host) at port 8000. On Linux you may need to add extra_hosts: ["host.docker.internal:host-gateway"] to the Prometheus service, or use your machine’s LAN IP instead.

Start everything with:

docker compose up -d

Then open TensorBoard at http://localhost:6006, Prometheus at http://localhost:9090, and Grafana at http://localhost:3000 (login: admin/admin).

Set Up Grafana Dashboards

In Grafana, add Prometheus as a data source (URL: http://prometheus:9090 since they share a Docker network). Then create panels for:

  • Training loss: query training_loss – use a time series panel
  • Throughput: query rate(samples_processed_total[1m]) – gives you samples/second
  • Epoch duration: query histogram_quantile(0.95, rate(epoch_duration_seconds_bucket[5m])) – 95th percentile epoch time
  • GPU memory: query gpu_memory_used_bytes / 1073741824 – converts bytes to GiB (1024³)

The rate() function is critical for counters. Raw counter values only go up, so rate() computes the per-second increase, which is what you actually want for throughput.
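If you'd rather not click through the data source form on every fresh Grafana container, Grafana can provision it from a file at startup. A sketch assuming the compose setup above (the filename grafana-datasource.yml is arbitrary); mount it into the grafana service with a volume like ./grafana-datasource.yml:/etc/grafana/provisioning/datasources/datasource.yml:

```yaml
# grafana-datasource.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
```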

Better GPU Metrics with pynvml

The training loop above uses torch.cuda.memory_allocated() as a proxy for GPU utilization. For real GPU utilization (SM activity, actual memory, temperature, power), use pynvml directly:

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

def update_gpu_metrics():
    mem_info = pynvml.nvmlDeviceGetMemoryInfo(handle)
    util_info = pynvml.nvmlDeviceGetUtilizationRates(handle)
    gpu_memory_gauge.set(mem_info.used)
    gpu_utilization_gauge.set(util_info.gpu)

Install with pip install pynvml. This gives you the same metrics nvidia-smi reports, but programmatically and at whatever frequency Prometheus scrapes.
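One wrinkle: the training loop only refreshes the gauges every 50 batches, so Prometheus may scrape stale values in between. One option (a sketch, not part of the original script) is a daemon thread that calls your update function on its own interval, decoupled from the training loop:

```python
import threading
import time

def start_metric_sampler(update_fn, interval_s=5.0):
    """Call update_fn (e.g. update_gpu_metrics) every interval_s seconds
    in a daemon thread, so gauges stay fresh for every Prometheus scrape."""
    def loop():
        while True:
            update_fn()
            time.sleep(interval_s)
    t = threading.Thread(target=loop, daemon=True)
    t.start()
    return t

# Usage: start_metric_sampler(update_gpu_metrics, interval_s=5.0)
```

Because the thread is a daemon, it dies with the training process and needs no explicit shutdown.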

Common Errors and Fixes

Port 8000 already in use: Another process is using port 8000. Either kill it or change the port in start_http_server():

# Find what's using the port
lsof -i :8000
# Or use a different port in your script
start_http_server(8001)

TensorBoard shows no data: The log_dir path in Docker must match where your event files actually are. If you wrote logs to runs/cifar10_monitored but mounted ./logs:/logs, TensorBoard sees nothing. Make sure the volume mount matches your SummaryWriter log directory.

Prometheus target is down: Prometheus can’t reach host.docker.internal:8000. On Linux, add this to the Prometheus service in docker-compose.yaml:

  prometheus:
    extra_hosts:
      - "host.docker.internal:host-gateway"

Grafana “No data” on panels: Check that the Prometheus data source URL uses the Docker service name (http://prometheus:9090), not localhost. Containers resolve each other by service name, but localhost inside the Grafana container points to Grafana itself.

prometheus_client conflicts with multiprocessing: If you use PyTorch’s DataLoader with num_workers > 0, the forked workers might collide with the metrics server. The metrics server should only run in the main process. Since start_http_server is called before the training loop, this is fine by default – just don’t call it inside a worker subprocess.
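Under the spawn start method (the default on Windows and macOS), DataLoader workers re-import your main module, so a module-level start_http_server() would try to bind the port again in each worker. The usual fix is the standard main guard; a minimal sketch:

```python
from prometheus_client import start_http_server

def main(metrics_port=8000):
    # Start the metrics server exactly once, in the main process
    start_http_server(metrics_port)
    # ... build the DataLoader (num_workers > 0) and run the training loop ...

if __name__ == "__main__":
    # Spawned workers re-import this module but skip this branch,
    # so they never try to bind the metrics port a second time
    main()
```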

TensorBoard event files are huge: Writing histograms or images every step balloons disk usage fast. Log scalars frequently (every 50 steps) but log histograms and images sparingly (every epoch or less). Use writer.add_scalar often; use writer.add_histogram and writer.add_image rarely.