Multi-object tracking needs two things: a detector to find objects in each frame and a tracker to assign persistent IDs across frames. YOLOv8 handles detection. DeepSORT handles identity – it uses a combination of Kalman filtering, IoU matching, and a deep appearance embedding to keep track of who is who, even through brief occlusions.
Here is the full pipeline. First install the dependencies: ultralytics for YOLOv8, deep-sort-realtime for the tracker, and opencv-python for reading and writing video.
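A typical install from PyPI (note the hyphenated package name; pin versions as needed for your project):

```shell
pip install ultralytics deep-sort-realtime opencv-python
```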
The tracker itself reads a video, runs YOLOv8 detection on every frame, feeds the detections to DeepSORT, and writes an output video annotated with bounding boxes and track IDs.
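A sketch of that pipeline. The file names and the yolov8n weights are placeholders; swap in your own video and model:

```python
import cv2
from ultralytics import YOLO
from deep_sort_realtime.deepsort_tracker import DeepSort

model = YOLO("yolov8n.pt")          # smallest YOLOv8 model; any variant works
tracker = DeepSort(max_age=30)      # instantiate ONCE, before the loop

cap = cv2.VideoCapture("input.mp4")
fps = cap.get(cv2.CAP_PROP_FPS)
w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
out = cv2.VideoWriter("output.mp4", cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))

while True:
    ok, frame = cap.read()
    if not ok:
        break

    # Run detection and convert each box to the ([l, t, w, h], conf, class)
    # tuple format that update_tracks expects
    results = model(frame, verbose=False)[0]
    detections = []
    for box in results.boxes:
        x1, y1, x2, y2 = box.xyxy[0].tolist()
        conf = float(box.conf[0])
        cls = model.names[int(box.cls[0])]
        detections.append(([x1, y1, x2 - x1, y2 - y1], conf, cls))

    # DeepSORT assigns persistent track IDs across frames
    tracks = tracker.update_tracks(detections, frame=frame)
    for track in tracks:
        if not track.is_confirmed():
            continue
        l, t, r, b = map(int, track.to_ltrb())
        cv2.rectangle(frame, (l, t), (r, b), (0, 255, 0), 2)
        cv2.putText(frame, f"ID {track.track_id}", (l, t - 8),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)

    out.write(frame)

cap.release()
out.release()
```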
That is a complete, runnable pipeline. About 50 lines of code. Each object gets a persistent integer ID that follows it across frames, even when it briefly disappears and comes back.
How DeepSORT Works
DeepSORT extends the original SORT tracker with a deep appearance descriptor. SORT uses only Kalman filtering and IoU matching – it predicts where an object will be in the next frame, then matches predictions to new detections based on bounding box overlap. This works well when objects move predictably and don’t cross paths.
DeepSORT adds a third signal: appearance similarity. A small neural network (MobileNetV2 by default in deep_sort_realtime) extracts a feature vector from the image crop inside each bounding box. When matching detections to existing tracks, DeepSORT computes cosine distance between the new detection’s feature vector and the track’s stored features. This means two people walking close together get different IDs because they look different, even if their bounding boxes overlap.
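The cosine-distance test is simple to state. A minimal illustration with hand-made vectors (the real tracker compares the embedder's feature vectors, not toy lists):

```python
import math

def cosine_distance(a, b):
    # 1 - cosine similarity: 0 for identical directions, 1 for orthogonal ones
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / norm

same = cosine_distance([0.6, 0.8], [0.6, 0.8])       # 0.0: same appearance
different = cosine_distance([1.0, 0.0], [0.0, 1.0])  # 1.0: nothing in common
```

A detection is only allowed to match a track on appearance if this distance falls below max_cosine_distance (0.2 by default).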
The matching cascade works in three stages:
- Kalman prediction – Predict each track’s position in the current frame.
- Appearance + motion matching – Match detections to tracks using a weighted combination of cosine distance (appearance) and Mahalanobis distance (motion).
- IoU fallback – Unmatched tracks get a second chance through pure IoU matching against remaining detections.
Tracks that go unmatched for max_age frames (default 30) get deleted. New detections that don’t match any track start in a tentative state and only become confirmed after n_init consecutive matches (default 3).
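The lifecycle rules can be sketched as a toy state machine (a deliberate simplification with hypothetical names; the real tracker folds this logic into the matching cascade):

```python
class TrackState:
    def __init__(self, n_init=3, max_age=30):
        self.n_init = n_init      # consecutive hits needed to confirm
        self.max_age = max_age    # misses tolerated before deletion
        self.hits = 0             # consecutive matched frames
        self.misses = 0           # frames since last match
        self.state = "tentative"

    def mark_matched(self):
        self.misses = 0
        self.hits += 1
        if self.state == "tentative" and self.hits >= self.n_init:
            self.state = "confirmed"

    def mark_missed(self):
        self.misses += 1
        self.hits = 0  # tentative tracks need *consecutive* matches
        if self.misses > self.max_age:
            self.state = "deleted"

track = TrackState()
for _ in range(3):
    track.mark_matched()  # confirmed after n_init = 3 consecutive hits
```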
Tuning DeepSORT Parameters
The DeepSort constructor takes several parameters that control tracking behavior. The defaults are reasonable, but tuning them for your specific scenario makes a real difference.
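The constructor's main knobs, shown with their defaults (matching the deep_sort_realtime version assumed throughout this guide):

```python
from deep_sort_realtime.deepsort_tracker import DeepSort

tracker = DeepSort(
    max_age=30,               # frames a track survives without a match
    n_init=3,                 # consecutive detections before a track is confirmed
    max_cosine_distance=0.2,  # appearance-matching threshold
    max_iou_distance=0.7,     # IoU threshold for the fallback stage
    nn_budget=None,           # cap on stored appearance descriptors per track
    embedder="mobilenet",     # appearance model
)
```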
What Each Parameter Does
- max_age (default 30) – How many frames a track survives without a matching detection. Raise this for scenarios where objects get fully occluded for a second or two (set to frame_rate * seconds). Setting it too high keeps ghost tracks around.
- n_init (default 3) – Number of consecutive detections required before a track is confirmed. Raise to 5 if you’re getting false tracks from spurious detections. Lower to 1 if you need tracks immediately (at the cost of more false positives).
- max_cosine_distance (default 0.2) – Threshold for the appearance matching. Lower values require more visual similarity, which reduces ID switches but can cause tracks to break if objects change appearance (lighting shifts, rotation). A value around 0.2-0.4 works for most cases.
- max_iou_distance (default 0.7) – IoU threshold for the fallback matching stage. Lower values require more overlap. Leave this alone unless you know objects move very fast between frames.
- nn_budget (default None) – Caps how many appearance descriptors are stored per track. Setting this to 100 prevents memory from growing indefinitely on long videos. Without a budget, the tracker stores every embedding it has ever seen for each track.
- embedder – The appearance model. "mobilenet" is fast and good enough for most cases. "clip_RN50" or "clip_ViT-B/32" use OpenAI’s CLIP for richer features but are slower. "torchreid" uses a re-identification model trained specifically for person tracking.
For a traffic camera at 30fps where cars occasionally stop behind each other, give tracks a grace period long enough to cover the occlusion and cap the descriptor memory.
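One reasonable configuration under those assumptions (max_age = frame rate × seconds of tolerated occlusion; the exact numbers are judgment calls, not prescriptions):

```python
from deep_sort_realtime.deepsort_tracker import DeepSort

tracker = DeepSort(
    max_age=60,               # 30 fps * 2 s of tolerated occlusion
    n_init=3,
    max_cosine_distance=0.3,  # a little slack for lighting changes
    nn_budget=100,            # bound per-track memory on long recordings
)
```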
Filtering and Tracking Specific Classes
You probably don’t want to track every COCO class. Filter detections before passing them to the tracker to avoid wasting compute on irrelevant objects.
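A sketch that keeps only persons and cars before the tuples reach update_tracks (the helper name is mine, not part of any library):

```python
KEEP = {"person", "car"}  # COCO class names worth tracking here

def filter_detections(detections, keep=KEEP):
    # detections are ([left, top, width, height], confidence, class_name)
    # tuples, the format update_tracks expects
    return [d for d in detections if d[2] in keep]

detections = [
    ([10, 20, 50, 100], 0.91, "person"),
    ([200, 40, 80, 60], 0.85, "car"),
    ([300, 10, 30, 30], 0.70, "bird"),
]
kept = filter_detections(detections)  # the bird never reaches the tracker
```

The filtered list then goes straight into tracker.update_tracks(kept, frame=frame).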
This filters at the detection level, so the tracker never sees irrelevant objects. You can also pass classes=[0, 2] to the model call itself (model(frame, classes=[0, 2])) so YOLO only detects those classes, which saves even more compute.
When to Use DeepSORT vs. ByteTrack
DeepSORT and ByteTrack solve the same problem differently. Pick based on your scenario:
Use DeepSORT when:
- Objects look visually distinct from each other (people in different clothes, vehicles of different types)
- You need tracking through extended occlusions (2+ seconds)
- Re-identification matters – the same person leaving and re-entering the frame should get the same ID
- You can afford the extra GPU cost of running the appearance model
Use ByteTrack when:
- Speed is the priority and you need maximum FPS
- Objects are uniform in appearance (identical boxes on a conveyor belt)
- The camera is fixed and objects rarely occlude each other for long
- You don’t want to manage an extra model dependency
ByteTrack is faster because it skips the appearance embedding entirely. DeepSORT is more accurate at identity preservation when appearances differ. For a surveillance system tracking specific individuals, DeepSORT is the better choice. For counting vehicles at an intersection, ByteTrack is sufficient and faster.
Common Errors and Fixes
ModuleNotFoundError: No module named 'deep_sort_realtime'
The package name has hyphens on PyPI (deep-sort-realtime) but underscores in Python (deep_sort_realtime). Install it under the PyPI name.
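The install command uses the hyphenated name:

```shell
pip install deep-sort-realtime
```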
RuntimeError: CUDA error: out of memory
The appearance embedder and YOLO both consume GPU memory. Two things to try: switch to a smaller YOLO model (yolov8n instead of yolov8m), or move the embedder to the CPU.
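Moving the embedder to the CPU is a constructor flag (embedder_gpu in deep_sort_realtime):

```python
from deep_sort_realtime.deepsort_tracker import DeepSort

# Embedder on CPU; YOLO keeps the GPU to itself
tracker = DeepSort(max_age=30, embedder_gpu=False)
```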
This runs the MobileNetV2 embedder on CPU, freeing GPU memory for YOLO.
Track IDs keep resetting or jumping to high numbers
You are probably creating a new DeepSort() instance every frame. The tracker maintains state between frames, so you must instantiate it once before the loop and reuse it. This is the single most common mistake.
ValueError: not enough values to unpack
The update_tracks method expects detections as a list of tuples in the format ([left, top, width, height], confidence, class_name). If you are passing (x1, y1, x2, y2) corner format, convert to left-top-width-height first.
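The conversion is a two-subtraction fix (helper name is mine):

```python
def xyxy_to_ltwh(x1, y1, x2, y2):
    # corner format -> [left, top, width, height], as update_tracks expects
    return [x1, y1, x2 - x1, y2 - y1]

detection = (xyxy_to_ltwh(100, 50, 180, 210), 0.9, "person")
# detection[0] is [100, 50, 80, 160]
```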
Objects briefly get wrong IDs when crossing paths
Increase max_cosine_distance slightly (try 0.3 or 0.4) to give the appearance matcher more room. If the objects look very similar, this is a fundamental limitation – DeepSORT’s appearance model can only distinguish objects that actually look different. Consider using embedder="torchreid" for person-tracking scenarios, as it is trained specifically for person re-identification.
Output video is 0 bytes or unplayable
OpenCV’s VideoWriter depends on system codecs. The mp4v codec is the most portable. If it still fails, try the XVID codec with an .avi extension.
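The XVID fallback looks like this (frame size and fps are placeholders; match them to your capture):

```python
import cv2

fps, w, h = 30.0, 1280, 720  # match your input video
fourcc = cv2.VideoWriter_fourcc(*"XVID")
out = cv2.VideoWriter("output.avi", fourcc, fps, (w, h))  # note the .avi extension
```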
Or install ffmpeg and use the avc1 (H.264) codec.
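With ffmpeg available on the system, H.264 output looks like this (again, frame size and fps are placeholders):

```python
import cv2

fps, w, h = 30.0, 1280, 720  # match your input video
out = cv2.VideoWriter("output.mp4", cv2.VideoWriter_fourcc(*"avc1"), fps, (w, h))
```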
Related Guides
- How to Build a Vehicle Counting Pipeline with YOLOv8 and OpenCV
- How to Build a Video Surveillance Analytics Pipeline with YOLOv8
- How to Build a Wildlife Camera Trap Classifier with YOLOv8 and FastAPI
- How to Build a Lane Detection Pipeline with OpenCV and YOLO
- How to Build Video Analytics Pipelines with OpenCV and Deep Learning
- How to Detect Objects in Images with YOLOv8
- How to Train a Custom Object Detection Model with Ultralytics
- How to Build a Product Defect Detector with YOLOv8 and OpenCV
- How to Build Real-Time Object Segmentation with SAM 2 and WebSocket
- How to Track Objects in Video with ByteTrack