The Pipeline at a Glance
Multi-object tracking (MOT) in video boils down to two steps: detect objects in each frame, then link those detections across frames so each object keeps a consistent ID. ByteTrack handles the second part. It takes per-frame bounding boxes from any detector and associates them using a combination of Kalman filtering and IoU matching.
The best way to use ByteTrack in Python right now is through the supervision library by Roboflow. It wraps ByteTrack in a clean API, pairs it with annotation tools, and works directly with YOLO detections. Here is a complete pipeline that reads a video, runs YOLOv8 detection, tracks objects with ByteTrack, and writes the output:
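A minimal sketch of such a pipeline, assuming supervision 0.25+ and ultralytics; the file names `input.mp4` / `output.mp4` and the `yolov8n.pt` checkpoint are placeholders:

```python
import supervision as sv
from ultralytics import YOLO

model = YOLO("yolov8n.pt")          # any YOLOv8 checkpoint works
tracker = sv.ByteTrack()
box_annotator = sv.BoxAnnotator()
label_annotator = sv.LabelAnnotator()

def callback(frame, index):
    # detect, then hand the per-frame boxes to ByteTrack
    results = model(frame, verbose=False)[0]
    detections = sv.Detections.from_ultralytics(results)
    detections = tracker.update_with_detections(detections)

    # draw boxes plus the persistent tracker IDs
    labels = [f"#{tracker_id}" for tracker_id in detections.tracker_id]
    annotated = box_annotator.annotate(scene=frame.copy(), detections=detections)
    return label_annotator.annotate(scene=annotated, detections=detections, labels=labels)

sv.process_video(
    source_path="input.mp4",
    target_path="output.mp4",
    callback=callback,
)
```

The tracker and annotators are created once, outside the callback; `process_video` then invokes the callback once per frame and writes whatever it returns.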
That is the whole thing. Around 30 lines from input video to tracked output with persistent IDs drawn on each object.
Install Dependencies
You need two packages: ultralytics for YOLO detection and supervision for ByteTrack and video utilities.
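Both install from PyPI:

```shell
pip install ultralytics supervision
```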
supervision version 0.25+ includes ByteTrack built in – no separate install required. The tracker ships as sv.ByteTrack() and has zero external dependencies beyond numpy.
If you want GPU-accelerated YOLO inference (and you do for video), make sure you have a working CUDA setup. Check with:
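A one-liner to check whether PyTorch can see a CUDA device:

```python
import torch

print(torch.cuda.is_available())  # True means a CUDA GPU is visible to PyTorch
```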
If that prints False, reinstall PyTorch with CUDA support from pytorch.org.
How ByteTrack Works
Most trackers only match high-confidence detections. ByteTrack’s key insight is that low-confidence detections still carry useful information. A person partially occluded behind a car might get a detection score of 0.3 – too low for most trackers to consider, but ByteTrack uses it.
The algorithm runs in two stages:
- First association – Match high-confidence detections (above track_activation_threshold) to existing tracks using IoU with Kalman filter predictions.
- Second association – Take unmatched tracks and try to match them against the remaining low-confidence detections. This recovers objects that are briefly occluded or partially visible.
This two-stage approach is why ByteTrack consistently outperforms older trackers like SORT and DeepSORT on benchmarks, especially in crowded scenes.
Tuning the Tracker
ByteTrack exposes three main parameters. The defaults work well for most cases, but tuning them matters for specific scenarios.
What Each Parameter Does
- track_activation_threshold (default 0.25) – Detections below this score are used for second-stage matching but won't create new tracks. Lower it in crowded scenes where objects frequently occlude each other. Raise it if you get too many false tracks.
- lost_track_buffer (default 30) – How many frames a track survives without a matching detection. Set this to about 1 second of your video's frame rate: a 30fps video with lost_track_buffer=30 keeps tracks alive for 1 second after they disappear.
- minimum_matching_threshold (default 0.8) – The IoU cutoff for the second-stage association. Lower values make matching more aggressive, which helps with fast-moving objects but increases ID switches.
- frame_rate (default 30) – Used internally to scale the lost track buffer. Set this to your actual video frame rate for correct timing.
For a surveillance camera watching a parking lot at 15fps, you might use:
Filtering Tracked Classes
You rarely want to track every single COCO class. Filter detections before feeding them to the tracker to keep things clean.
Counting Objects Crossing a Line
A practical use case: count how many people or vehicles cross a boundary in the frame. The supervision library has LineZone built in for exactly this.
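A sketch of a counting pipeline, assuming a 1280x720 frame with a horizontal counting line at mid-height; the line endpoints and file names are placeholders to adjust for your footage:

```python
import supervision as sv
from ultralytics import YOLO

model = YOLO("yolov8n.pt")
tracker = sv.ByteTrack()
box_annotator = sv.BoxAnnotator()

# counting line across the middle of a 1280x720 frame
line_zone = sv.LineZone(start=sv.Point(0, 360), end=sv.Point(1280, 360))
line_annotator = sv.LineZoneAnnotator()

def callback(frame, index):
    results = model(frame, verbose=False)[0]
    detections = sv.Detections.from_ultralytics(results)
    detections = tracker.update_with_detections(detections)
    line_zone.trigger(detections)   # updates in_count / out_count
    annotated = box_annotator.annotate(scene=frame.copy(), detections=detections)
    return line_annotator.annotate(frame=annotated, line_counter=line_zone)

sv.process_video(source_path="input.mp4", target_path="output.mp4", callback=callback)
print(line_zone.in_count, line_zone.out_count)
```

`LineZone.trigger` needs tracked detections (with `tracker_id` set), so it must run after `update_with_detections`.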
The counter distinguishes direction – objects crossing from one side count as “in” and the other direction as “out.” The direction is determined by the line orientation.
Evaluation Metrics
If you are benchmarking your tracker, two metrics matter:
- MOTA (Multiple Object Tracking Accuracy) – Combines false positives, missed detections, and ID switches into a single score. Higher is better. A MOTA above 70 on MOT17 is competitive.
- IDF1 (ID F1 Score) – Measures how well the tracker maintains consistent IDs over time. This penalizes ID switches more heavily than MOTA. If your tracks keep swapping IDs, IDF1 drops fast.
ByteTrack achieves a MOTA of 80.3 and an IDF1 of 77.3 on the MOT17 benchmark, which is strong. Most of your real-world accuracy, though, will depend on the quality of your detector. A better YOLO model (yolov8l instead of yolov8n, say) usually improves tracking metrics more than any tracker tuning.
Common Errors and Fixes
AttributeError: 'Detections' object has no attribute 'tracker_id'
You called tracker.update_with_detections() but are trying to access tracker_id on the original detections object instead of the returned one. The tracker returns a new Detections object with the tracker_id field populated.
TypeError: update_with_detections() got an unexpected keyword argument
You are on an older version of supervision that used a different API. Versions before 0.18 used tracker.update_with_tensors() with raw numpy arrays. Upgrade to get the current API:
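```shell
pip install --upgrade supervision
```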
Tracker IDs keep resetting to 1
You are creating a new ByteTrack() instance inside your frame callback. The tracker maintains state between frames, so it must be instantiated once and reused. Create it outside the callback function.
Objects get different IDs after brief occlusion
Increase lost_track_buffer. The default of 30 frames is only 1 second at 30fps. If objects disappear behind obstacles for longer, bump this up:
VideoSink or process_video produces 0-byte output
This usually means OpenCV cannot find the right video codec. Install the system codec libraries:
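On Debian/Ubuntu systems (package names vary by distribution), something like:

```shell
# system codecs for OpenCV video writing, then a fresh OpenCV build
sudo apt-get update
sudo apt-get install -y ffmpeg
pip install --upgrade opencv-python
```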
YOLO inference is slow (< 5 FPS on GPU)
Make sure YOLO is actually using the GPU. Check with:
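A quick check, assuming a sample image on disk (`frame.jpg` is a placeholder path):

```python
import torch
from ultralytics import YOLO

print(torch.cuda.is_available())        # must be True for GPU inference

model = YOLO("yolov8n.pt")
results = model("frame.jpg", device=0)  # device=0 forces the first GPU
print(results[0].speed)                 # per-stage timings in milliseconds
```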
If inference time is over 100ms, PyTorch is likely using CPU. Reinstall with CUDA support. Also consider using yolov8n instead of larger models – the nano model runs at 80+ FPS on most GPUs and the tracking quality difference is smaller than you might expect.
When to Use ByteTrack vs. Alternatives
ByteTrack is the go-to for most video tracking tasks. It is fast, accurate, and does not need a separate re-identification model. Use it as your default.
DeepSORT adds an appearance embedding network for re-identification. This helps when objects look different from each other (tracking specific people in a crowd). But it is slower and the extra model adds complexity.
BoT-SORT extends ByteTrack with camera motion compensation and a re-identification module. Better for moving cameras (dashcams, drones) but heavier.
For a fixed-position camera tracking vehicles or pedestrians, ByteTrack is the right choice. It gives you the best speed-accuracy tradeoff without needing extra models or GPU memory for feature extraction.
Related Guides
- How to Build a Video Surveillance Analytics Pipeline with YOLOv8
- How to Build Multi-Object Tracking with DeepSORT and YOLOv8
- How to Build a Vehicle Counting Pipeline with YOLOv8 and OpenCV
- How to Build Video Analytics Pipelines with OpenCV and Deep Learning
- How to Build a Video Shot Boundary Detection Pipeline with PySceneDetect
- How to Detect Objects in Images with YOLOv8
- How to Build Video Action Recognition with SlowFast and PyTorch
- How to Build a Document Comparison Pipeline with Vision Models
- How to Build a Lane Detection Pipeline with OpenCV and YOLO
- How to Build a Real-Time Pose Estimation Pipeline with MediaPipe