The Pipeline at a Glance

Multi-object tracking (MOT) in video boils down to two steps: detect objects in each frame, then link those detections across frames so each object keeps a consistent ID. ByteTrack handles the second part. It takes per-frame bounding boxes from any detector and associates them using a combination of Kalman filtering and IoU matching.

The best way to use ByteTrack in Python right now is through the supervision library by Roboflow. It wraps ByteTrack in a clean API, pairs it with annotation tools, and works directly with YOLO detections. Here is a complete pipeline that reads a video, runs YOLOv8 detection, tracks objects with ByteTrack, and writes the output:

import supervision as sv
from ultralytics import YOLO

# Load detector and tracker
model = YOLO("yolov8n.pt")
tracker = sv.ByteTrack()

# Annotators for drawing boxes and labels
box_annotator = sv.BoxAnnotator()
label_annotator = sv.LabelAnnotator()

def process_frame(frame):
    results = model(frame, verbose=False)[0]
    detections = sv.Detections.from_ultralytics(results)

    # Update tracker with new detections
    detections = tracker.update_with_detections(detections)

    # Build labels with tracker IDs
    labels = [
        f"#{tracker_id} {model.names[class_id]} {conf:.2f}"
        for tracker_id, class_id, conf
        in zip(detections.tracker_id, detections.class_id, detections.confidence)
    ]

    # Draw annotations
    frame = box_annotator.annotate(frame.copy(), detections)
    frame = label_annotator.annotate(frame, detections, labels=labels)
    return frame

# Process video end-to-end
sv.process_video(
    source_path="input.mp4",
    target_path="output_tracked.mp4",
    callback=process_frame,
)

That is the whole thing. Around 30 lines from input video to tracked output with persistent IDs drawn on each object.

Install Dependencies

You need two packages: ultralytics for YOLO detection and supervision for ByteTrack and video utilities.

pip install ultralytics supervision

supervision version 0.25+ includes ByteTrack built in – no separate install required. The tracker ships as sv.ByteTrack() and has zero external dependencies beyond numpy.

If you want GPU-accelerated YOLO inference (and you do for video), make sure you have a working CUDA setup. Check with:

python -c "import torch; print(torch.cuda.is_available())"

If that prints False, reinstall PyTorch with CUDA support from pytorch.org.

How ByteTrack Works

Most trackers only match high-confidence detections. ByteTrack’s key insight is that low-confidence detections still carry useful information. A person partially occluded behind a car might get a detection score of 0.3 – too low for most trackers to consider, but ByteTrack uses it.

The algorithm runs in two stages:

  1. First association – Match high-confidence detections (above track_activation_threshold) to existing tracks using IoU with Kalman filter predictions.
  2. Second association – Take unmatched tracks and try to match them against the remaining low-confidence detections. This recovers objects that are briefly occluded or partially visible.

This two-stage approach is why ByteTrack consistently outperforms older trackers like SORT and DeepSORT on benchmarks, especially in crowded scenes.
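The two-stage logic can be sketched in a few lines. This is a simplified illustration, not supervision's implementation: real ByteTrack predicts each track's position with a Kalman filter and solves the assignment with the Hungarian algorithm, while this hypothetical `byte_associate` matches greedily on raw IoU.

```python
def iou(box_a, box_b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def byte_associate(tracks, detections, scores, high_thresh=0.25, iou_thresh=0.3):
    """Simplified two-stage BYTE association (greedy IoU, no Kalman filter).

    Returns {track_index: detection_index} matches.
    """
    high = [i for i, s in enumerate(scores) if s >= high_thresh]
    low = [i for i, s in enumerate(scores) if s < high_thresh]
    matches, unmatched_tracks = {}, list(range(len(tracks)))

    # Stage 1 uses high-confidence detections; stage 2 retries the
    # still-unmatched tracks against the low-confidence leftovers.
    for det_pool in (high, low):
        remaining = list(det_pool)
        for t in list(unmatched_tracks):
            if not remaining:
                break
            best = max(remaining, key=lambda d: iou(tracks[t], detections[d]))
            if iou(tracks[t], detections[best]) >= iou_thresh:
                matches[t] = best
                remaining.remove(best)
                unmatched_tracks.remove(t)
    return matches
```

The second loop iteration is the part that sets ByteTrack apart: an occluded object whose score fell below `high_thresh` can still keep its track alive.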

Tuning the Tracker

ByteTrack exposes three main parameters. The defaults work well for most cases, but tuning them matters for specific scenarios.

tracker = sv.ByteTrack(
    track_activation_threshold=0.25,   # min confidence to start a new track
    lost_track_buffer=30,              # frames to keep a lost track alive
    minimum_matching_threshold=0.8,    # IoU threshold for matching
    frame_rate=30,                     # video frame rate (affects buffer timing)
)

What Each Parameter Does

  • track_activation_threshold (default 0.25) – Detections below this score are used for second-stage matching but won’t create new tracks. Lower it in crowded scenes where objects frequently occlude each other. Raise it if you get too many false tracks.

  • lost_track_buffer (default 30) – How many frames a track survives without a matching detection. Set this to about 1 second of your video’s frame rate. A 30fps video with lost_track_buffer=30 keeps tracks alive for 1 second after they disappear.

  • minimum_matching_threshold (default 0.8) – The IoU cutoff for the second-stage association. Lower values make matching more aggressive, which helps with fast-moving objects but increases ID switches.

  • frame_rate (default 30) – Used internally to scale the lost track buffer. Set this to your actual video frame rate for correct timing.

For a surveillance camera watching a parking lot at 15fps, you might use:

tracker = sv.ByteTrack(
    track_activation_threshold=0.3,
    lost_track_buffer=45,     # 3 seconds at 15fps
    minimum_matching_threshold=0.7,
    frame_rate=15,
)

Filtering Tracked Classes

You rarely want to track every single COCO class. Filter detections before feeding them to the tracker to keep things clean.

import numpy as np
import supervision as sv
from ultralytics import YOLO

model = YOLO("yolov8m.pt")
tracker = sv.ByteTrack()

# Create annotators once, outside the per-frame callback
box_annotator = sv.BoxAnnotator()
label_annotator = sv.LabelAnnotator()

# Only track people (class 0) and cars (class 2)
TRACK_CLASSES = [0, 2]

def process_frame(frame):
    results = model(frame, verbose=False)[0]
    detections = sv.Detections.from_ultralytics(results)

    # Filter to only classes we care about
    mask = np.isin(detections.class_id, TRACK_CLASSES)
    detections = detections[mask]

    detections = tracker.update_with_detections(detections)

    labels = [
        f"#{tid} {model.names[cid]}"
        for tid, cid in zip(detections.tracker_id, detections.class_id)
    ]

    annotated = box_annotator.annotate(frame.copy(), detections)
    annotated = label_annotator.annotate(annotated, detections, labels=labels)
    return annotated

Counting Objects Crossing a Line

A practical use case: count how many people or vehicles cross a boundary in the frame. The supervision library has LineZone built in for exactly this.

import supervision as sv
from ultralytics import YOLO

model = YOLO("yolov8m.pt")
tracker = sv.ByteTrack()

# Define a counting line across the frame
# Start point (x1, y1) to end point (x2, y2)
LINE_START = sv.Point(0, 500)
LINE_END = sv.Point(1280, 500)
line_zone = sv.LineZone(start=LINE_START, end=LINE_END)
line_annotator = sv.LineZoneAnnotator(thickness=2)

box_annotator = sv.BoxAnnotator()

def process_frame(frame):
    results = model(frame, verbose=False)[0]
    detections = sv.Detections.from_ultralytics(results)
    detections = tracker.update_with_detections(detections)

    # Update line counter
    line_zone.trigger(detections)

    # Annotate
    frame = box_annotator.annotate(frame.copy(), detections)
    frame = line_annotator.annotate(frame, line_zone)
    return frame

sv.process_video(
    source_path="traffic.mp4",
    target_path="counted.mp4",
    callback=process_frame,
)

print(f"In: {line_zone.in_count}, Out: {line_zone.out_count}")

The counter distinguishes direction – objects crossing from one side count as “in” and the other direction as “out.” The direction is determined by the line orientation.

Evaluation Metrics

If you are benchmarking your tracker, two metrics matter:

  • MOTA (Multiple Object Tracking Accuracy) – Combines false positives, missed detections, and ID switches into a single score. Higher is better. A MOTA above 70 on MOT17 is competitive.
  • IDF1 (ID F1 Score) – Measures how well the tracker maintains consistent IDs over time. This penalizes ID switches more heavily than MOTA. If your tracks keep swapping IDs, IDF1 drops fast.

ByteTrack achieves a MOTA of 80.3 and an IDF1 of 77.3 on the MOT17 benchmark, which is strong. Real-world accuracy, though, depends mostly on detector quality: a better YOLO model (yolov8l instead of yolov8n) usually improves tracking metrics more than any tracker tuning.
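MOTA is simple to compute once you have per-video error counts. A minimal sketch (the function name and example counts are illustrative; the metric is usually reported as a percentage):

```python
def mota(false_positives, misses, id_switches, total_ground_truth):
    """MOTA = 1 - (FP + FN + IDSW) / GT, with counts summed over all frames.

    Can go negative when errors outnumber ground-truth objects;
    multiply by 100 to report it as a percentage.
    """
    errors = false_positives + misses + id_switches
    return 1.0 - errors / total_ground_truth

# e.g. 50 false positives, 120 misses, 27 ID switches over 1000 GT boxes
score = mota(50, 120, 27, 1000)  # 0.803, i.e. a MOTA of 80.3
```

Because ID switches are just one term among three, a tracker can post a high MOTA while constantly swapping IDs, which is exactly the gap IDF1 exists to expose.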

Common Errors and Fixes

AttributeError: 'Detections' object has no attribute 'tracker_id'

You called tracker.update_with_detections() but are trying to access tracker_id on the original detections object instead of the returned one. The tracker returns a new Detections object with the tracker_id field populated.

# Wrong
detections = sv.Detections.from_ultralytics(results)
tracker.update_with_detections(detections)
print(detections.tracker_id)  # None!

# Right
detections = sv.Detections.from_ultralytics(results)
detections = tracker.update_with_detections(detections)
print(detections.tracker_id)  # [1, 2, 3, ...]

TypeError: update_with_detections() got an unexpected keyword argument

You are on an older version of supervision that used a different API. Versions before 0.18 used tracker.update_with_tensors() with raw numpy arrays. Upgrade to get the current API:

pip install --upgrade supervision

Tracker IDs keep resetting to 1

You are creating a new ByteTrack() instance inside your frame callback. The tracker maintains state between frames, so it must be instantiated once and reused. Create it outside the callback function.
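The failure mode is easy to see with a toy stand-in. `ToyTracker` below is a hypothetical stub, not sv.ByteTrack; it just hands out incrementing IDs to show why tracker state must persist across frames:

```python
import itertools

class ToyTracker:
    """Hypothetical stand-in for a tracker: assigns incrementing IDs
    and remembers which objects it has already seen."""
    def __init__(self):
        self._ids = itertools.count(1)
        self._known = {}

    def update(self, detection_key):
        # Reuse the ID if we have seen this object, else assign a new one
        if detection_key not in self._known:
            self._known[detection_key] = next(self._ids)
        return self._known[detection_key]

# Wrong: a fresh tracker per frame forgets everything,
# so every new object restarts at ID 1
ids_wrong = [ToyTracker().update(k) for k in ["person_a", "person_b"]]

# Right: one tracker instance shared across frames keeps IDs unique
tracker = ToyTracker()
ids_right = [tracker.update(k) for k in ["person_a", "person_b"]]
```

The same principle applies to LineZone counters: they accumulate crossings over time, so they too must be created once, outside the callback.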

Objects get different IDs after brief occlusion

Increase lost_track_buffer. The default of 30 frames is only 1 second at 30fps. If objects disappear behind obstacles for longer, bump this up:

tracker = sv.ByteTrack(lost_track_buffer=90)  # 3 seconds at 30fps

VideoSink or process_video produces 0-byte output

This usually means OpenCV cannot find the right video codec. Install the system codec libraries:

# Ubuntu/Debian
sudo apt install libx264-dev ffmpeg

# Then reinstall opencv
pip install --force-reinstall opencv-python-headless

YOLO inference is slow (< 5 FPS on GPU)

Make sure YOLO is actually using the GPU. Check with:

from ultralytics import YOLO
model = YOLO("yolov8n.pt")
results = model("test.jpg")
print(results[0].speed)  # Shows preprocessing, inference, postprocessing times in ms

If inference time is over 100ms, PyTorch is likely using CPU. Reinstall with CUDA support. Also consider using yolov8n instead of larger models – the nano model runs at 80+ FPS on most GPUs and the tracking quality difference is smaller than you might expect.

When to Use ByteTrack vs. Alternatives

ByteTrack is the go-to for most video tracking tasks. It is fast, accurate, and does not need a separate re-identification model. Use it as your default.

DeepSORT adds an appearance embedding network for re-identification. This helps recover identities after long occlusions, provided objects are visually distinct (e.g., tracking specific people in a crowd). But it is slower, and the extra model adds complexity.

BoT-SORT extends ByteTrack with camera motion compensation and a re-identification module. Better for moving cameras (dashcams, drones) but heavier.

For a fixed-position camera tracking vehicles or pedestrians, ByteTrack is the right choice. It gives you the best speed-accuracy tradeoff without needing extra models or GPU memory for feature extraction.