Multi-object tracking needs two things: a detector to find objects in each frame and a tracker to assign persistent IDs across frames. YOLOv8 handles detection. DeepSORT handles identity – it uses a combination of Kalman filtering, IoU matching, and a deep appearance embedding to keep track of who is who, even through brief occlusions.

Here is the full pipeline. Install the dependencies first:

pip install ultralytics deep-sort-realtime opencv-python numpy

And here is a working tracker that reads a video, runs YOLOv8 detection on every frame, feeds the detections to DeepSORT, and writes an output video with bounding boxes and track IDs:

import cv2
from ultralytics import YOLO
from deep_sort_realtime.deepsort_tracker import DeepSort

# Load YOLOv8 detector and DeepSORT tracker
model = YOLO("yolov8n.pt")
tracker = DeepSort(max_age=30, n_init=3, max_cosine_distance=0.2)

cap = cv2.VideoCapture("input.mp4")
width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
fps = int(cap.get(cv2.CAP_PROP_FPS))

writer = cv2.VideoWriter(
    "output_tracked.mp4",
    cv2.VideoWriter_fourcc(*"mp4v"),
    fps,
    (width, height),
)

while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break

    # Run YOLOv8 detection
    results = model(frame, verbose=False)[0]

    # Convert detections to DeepSORT format: ([left, top, w, h], confidence, class_name)
    detections = []
    for box in results.boxes:
        x1, y1, x2, y2 = box.xyxy[0].tolist()
        conf = float(box.conf[0])
        cls = int(box.cls[0])
        class_name = model.names[cls]
        w = x2 - x1
        h = y2 - y1
        detections.append(([x1, y1, w, h], conf, class_name))

    # Update tracker -- pass the frame so DeepSORT can extract appearance features
    tracks = tracker.update_tracks(detections, frame=frame)

    # Draw tracked objects
    for track in tracks:
        if not track.is_confirmed():
            continue
        track_id = track.track_id
        ltrb = track.to_ltrb()  # [left, top, right, bottom]
        x1, y1, x2, y2 = [int(v) for v in ltrb]

        cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
        cv2.putText(
            frame,
            f"ID {track_id}",
            (x1, y1 - 10),
            cv2.FONT_HERSHEY_SIMPLEX,
            0.6,
            (0, 255, 0),
            2,
        )

    writer.write(frame)

cap.release()
writer.release()
print("Done. Output saved to output_tracked.mp4")

That is a complete, runnable pipeline. About 50 lines of code. Each object gets a persistent integer ID that follows it across frames, even when it briefly disappears and comes back.

How DeepSORT Works

DeepSORT extends the original SORT tracker with a deep appearance descriptor. SORT uses only Kalman filtering and IoU matching – it predicts where an object will be in the next frame, then matches predictions to new detections based on bounding box overlap. This works well when objects move predictably and don’t cross paths.

DeepSORT adds a third signal: appearance similarity. A small neural network (MobileNetV2 by default in deep_sort_realtime) extracts a feature vector from the image crop inside each bounding box. When matching detections to existing tracks, DeepSORT computes cosine distance between the new detection’s feature vector and the track’s stored features. This means two people walking close together get different IDs because they look different, even if their bounding boxes overlap.

The matching cascade works in three stages:

  1. Kalman prediction – Predict each track’s position in the current frame.
  2. Appearance + motion matching – Match detections to tracks using a weighted combination of cosine distance (appearance) and Mahalanobis distance (motion).
  3. IoU fallback – Unmatched tracks get a second chance through pure IoU matching against remaining detections.

Tracks that go unmatched for max_age frames (default 30) get deleted. New detections that don’t match any track start in a tentative state and only become confirmed after n_init consecutive matches (default 3).
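The appearance half of that cascade can be sketched in a few lines of NumPy. This is a simplified illustration of the idea, not deep_sort_realtime's actual implementation: build a cosine-distance cost matrix between stored track embeddings and new detection embeddings, solve the assignment, and gate matches by max_cosine_distance (the motion/Mahalanobis term and the IoU fallback are omitted).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def appearance_cost(track_feats, det_feats):
    # 1 - cosine similarity for every (track, detection) pair
    t = track_feats / np.linalg.norm(track_feats, axis=1, keepdims=True)
    d = det_feats / np.linalg.norm(det_feats, axis=1, keepdims=True)
    return 1.0 - t @ d.T

def associate(track_feats, det_feats, max_cosine_distance=0.2):
    # Hungarian assignment on the cost matrix, then gate by the threshold
    cost = appearance_cost(track_feats, det_feats)
    rows, cols = linear_sum_assignment(cost)
    return [(int(r), int(c)) for r, c in zip(rows, cols)
            if cost[r, c] <= max_cosine_distance]

# Two tracks whose embeddings point in different directions; the detections
# arrive in swapped order, and appearance similarity sorts out who is who.
tracks = np.array([[1.0, 0.0], [0.0, 1.0]])
dets = np.array([[0.1, 1.0], [1.0, 0.1]])
print(associate(tracks, dets))  # [(0, 1), (1, 0)]
```

This is why crossing paths don't swap IDs: even when boxes overlap, the cost of matching a track to the wrong-looking detection is high, so the assignment prefers the visually consistent pairing.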

Tuning DeepSORT Parameters

The DeepSort constructor takes several parameters that control tracking behavior. The defaults are reasonable, but tuning them for your specific scenario makes a real difference.

from deep_sort_realtime.deepsort_tracker import DeepSort

tracker = DeepSort(
    max_age=50,               # frames to keep a lost track alive
    n_init=3,                 # hits needed to confirm a new track
    max_cosine_distance=0.3,  # appearance matching threshold
    max_iou_distance=0.7,     # IoU gating threshold
    nn_budget=100,            # max stored appearance descriptors per track
    embedder="mobilenet",     # appearance model: 'mobilenet', 'clip_RN50', 'torchreid'
    embedder_gpu=True,        # run embedder on GPU
)

What Each Parameter Does

  • max_age (default 30) – How many frames a track survives without a matching detection. Raise this for scenarios where objects get fully occluded for a second or two (set to frame_rate * seconds). Setting it too high keeps ghost tracks around.

  • n_init (default 3) – Number of consecutive detections required before a track is confirmed. Raise to 5 if you’re getting false tracks from spurious detections. Lower to 1 if you need tracks immediately (at the cost of more false positives).

  • max_cosine_distance (default 0.2) – Threshold for the appearance matching. Lower values require more visual similarity, which reduces ID switches but can cause tracks to break if objects change appearance (lighting shifts, rotation). A value around 0.2-0.4 works for most cases.

  • max_iou_distance (default 0.7) – IoU threshold for the fallback matching stage. Lower values require more overlap. Leave this alone unless you know objects move very fast between frames.

  • nn_budget (default None) – Caps how many appearance descriptors are stored per track. Setting this to 100 prevents memory from growing indefinitely on long videos. Without a budget, the tracker stores every embedding it has ever seen for each track.

  • embedder – The appearance model. "mobilenet" is fast and good enough for most cases. "clip_RN50" or "clip_ViT-B/32" use OpenAI’s CLIP for richer features but are slower. "torchreid" uses a re-identification model trained specifically for person tracking.
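The nn_budget behavior is essentially a bounded per-track gallery: once the cap is hit, the oldest embedding is evicted. A collections.deque with maxlen models the effect (an illustration of the policy, not deep_sort_realtime's internal code):

```python
from collections import deque

nn_budget = 3  # tiny budget for illustration; 100 is a sensible real value
gallery = deque(maxlen=nn_budget)  # stored appearance descriptors for one track

for frame_idx in range(5):
    gallery.append(f"embedding_{frame_idx}")

# Only the 3 most recent embeddings survive; the oldest two were evicted
print(list(gallery))  # ['embedding_2', 'embedding_3', 'embedding_4']
```

With nn_budget=None the gallery grows without bound, which is why long recordings eventually eat memory.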

For a traffic camera at 30fps where cars occasionally stop behind each other:

tracker = DeepSort(
    max_age=90,               # keep tracks for 3 seconds during occlusion
    n_init=2,                 # confirm tracks fast since detections are reliable
    max_cosine_distance=0.4,  # cars look similar, so be lenient on appearance
    nn_budget=50,             # limit memory on long recordings
)

Filtering and Tracking Specific Classes

You probably don’t want to track every COCO class. Filter detections before passing them to the tracker to avoid wasting compute on irrelevant objects.

import cv2
from ultralytics import YOLO
from deep_sort_realtime.deepsort_tracker import DeepSort

model = YOLO("yolov8m.pt")
tracker = DeepSort(max_age=30)

# Only track people (class 0) and cars (class 2)
TRACK_CLASSES = {0, 2}

cap = cv2.VideoCapture("street.mp4")

while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break

    results = model(frame, verbose=False)[0]

    detections = []
    for box in results.boxes:
        cls = int(box.cls[0])
        if cls not in TRACK_CLASSES:
            continue
        x1, y1, x2, y2 = box.xyxy[0].tolist()
        conf = float(box.conf[0])
        class_name = model.names[cls]
        detections.append(([x1, y1, x2 - x1, y2 - y1], conf, class_name))

    tracks = tracker.update_tracks(detections, frame=frame)

    for track in tracks:
        if not track.is_confirmed():
            continue
        track_id = track.track_id
        det_class = track.get_det_class()
        ltrb = track.to_ltrb()
        x1, y1, x2, y2 = [int(v) for v in ltrb]

        color = (0, 255, 0) if det_class == "person" else (255, 0, 0)
        cv2.rectangle(frame, (x1, y1), (x2, y2), color, 2)
        cv2.putText(
            frame,
            f"{det_class} #{track_id}",
            (x1, y1 - 10),
            cv2.FONT_HERSHEY_SIMPLEX,
            0.5,
            color,
            2,
        )

    cv2.imshow("Tracking", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()

This filters at the detection level, so the tracker never sees irrelevant objects. You can also pass classes=[0, 2] to the model call itself – model(frame, classes=[0, 2]) – so YOLO discards the other classes during its own post-processing and the Python-side check becomes unnecessary.

When to Use DeepSORT vs. ByteTrack

DeepSORT and ByteTrack solve the same problem differently. Pick based on your scenario:

Use DeepSORT when:

  • Objects look visually distinct from each other (people in different clothes, vehicles of different types)
  • You need tracking through extended occlusions (2+ seconds)
  • Re-identification matters – the same person leaving and re-entering the frame should get the same ID
  • You can afford the extra GPU cost of running the appearance model

Use ByteTrack when:

  • Speed is the priority and you need maximum FPS
  • Objects are uniform in appearance (identical boxes on a conveyor belt)
  • The camera is fixed and objects rarely occlude each other for long
  • You don’t want to manage an extra model dependency

ByteTrack is faster because it skips the appearance embedding entirely. DeepSORT is more accurate at identity preservation when appearances differ. For a surveillance system tracking specific individuals, DeepSORT is the better choice. For counting vehicles at an intersection, ByteTrack is sufficient and faster.
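If you want to try ByteTrack without adding a dependency, Ultralytics ships it as a built-in tracker. A minimal sketch (assumes an input.mp4 exists; result handling is abbreviated):

```python
from ultralytics import YOLO

model = YOLO("yolov8n.pt")

# tracker="bytetrack.yaml" selects Ultralytics' bundled ByteTrack config;
# no appearance model runs, so IDs come purely from motion and box overlap
for result in model.track("input.mp4", tracker="bytetrack.yaml", stream=True):
    for box in result.boxes:
        if box.id is not None:  # id is None until a track is confirmed
            print(int(box.id[0]), box.xyxy[0].tolist())
```

This makes it cheap to benchmark both trackers on your own footage before committing to one.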

Common Errors and Fixes

ModuleNotFoundError: No module named 'deep_sort_realtime'

The package name has hyphens on PyPI but underscores in Python. Install it correctly:

pip install deep-sort-realtime

RuntimeError: CUDA error: out of memory

The appearance embedder and YOLO both consume GPU memory. Two things to try: switch to a smaller YOLO model (yolov8n instead of yolov8m), or disable GPU for the embedder:

tracker = DeepSort(embedder_gpu=False)

This runs the MobileNetV2 embedder on CPU, freeing GPU memory for YOLO.

Track IDs keep resetting or jumping to high numbers

You are probably creating a new DeepSort() instance every frame. The tracker maintains state between frames, so you must instantiate it once before the loop and reuse it. This is the single most common mistake.
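To see why, here is a toy stand-in for the tracker (not DeepSort itself, just the stateful pattern): the only state is a running ID counter, and re-instantiating resets it.

```python
class ToyTracker:
    """Stands in for DeepSort: the only state is a running ID counter."""
    def __init__(self):
        self.next_id = 1

    def update(self, n_detections):
        ids = list(range(self.next_id, self.next_id + n_detections))
        self.next_id += n_detections
        return ids

# Wrong: a fresh tracker per frame restarts the counter, so IDs repeat
wrong = [ToyTracker().update(2) for _ in range(3)]
print(wrong)   # [[1, 2], [1, 2], [1, 2]]

# Right: one instance carries its state across all frames
tracker = ToyTracker()
right = [tracker.update(2) for _ in range(3)]
print(right)   # [[1, 2], [3, 4], [5, 6]]
```

The real tracker holds far more state (Kalman filters, embedding galleries, tentative tracks), which is all equally lost when you rebuild it inside the loop.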

ValueError: not enough values to unpack

The update_tracks method expects detections as a list of tuples in the format ([left, top, width, height], confidence, class_name). If you are passing (x1, y1, x2, y2) format, you need to convert:

# Wrong -- xyxy format
detections.append(([x1, y1, x2, y2], conf, class_name))

# Right -- ltwh format
detections.append(([x1, y1, x2 - x1, y2 - y1], conf, class_name))

Objects briefly get wrong IDs when crossing paths

Increase max_cosine_distance slightly (try 0.3 or 0.4) to give the appearance matcher more room. If the objects look very similar, this is a fundamental limitation – DeepSORT’s appearance model can only distinguish objects that actually look different. Consider using embedder="torchreid" for person-tracking scenarios, as it is trained specifically for person re-identification.

Output video is 0 bytes or unplayable

OpenCV’s VideoWriter depends on system codecs. The mp4v codec is the most portable. If it still fails, try XVID with an .avi extension:

writer = cv2.VideoWriter(
    "output.avi",
    cv2.VideoWriter_fourcc(*"XVID"),
    fps,
    (width, height),
)

Or install ffmpeg and use the avc1 codec:

sudo apt install ffmpeg