SlowFast networks process video at two temporal resolutions simultaneously — a Slow pathway captures spatial detail at low frame rate, while a Fast pathway captures motion at high frame rate with fewer channels. This dual-pathway design hits 76.94% top-1 accuracy on Kinetics-400 and runs efficiently enough for real-time use cases.
Here’s the fastest path from zero to working action recognition.
## Load a Pretrained SlowFast Model
Install the dependencies first:
```bash
pip install torch torchvision pytorchvideo opencv-python
```
Load the pretrained SlowFast R50 model from Torch Hub and grab the Kinetics-400 label mapping:
```python
import torch
import json
import urllib.request
from typing import List

# Load pretrained SlowFast R50 (Kinetics-400, 8x8 setting)
model = torch.hub.load(
    "facebookresearch/pytorchvideo", "slowfast_r50", pretrained=True
)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.eval().to(device)

# Download Kinetics-400 class labels
labels_url = "https://dl.fbaipublicfiles.com/pyslowfast/dataset/class_names/kinetics_classnames.json"
urllib.request.urlretrieve(labels_url, "kinetics_classnames.json")
with open("kinetics_classnames.json", "r") as f:
    kinetics_classnames = json.load(f)

# Map integer IDs to human-readable labels
id_to_label = {v: k.replace('"', "") for k, v in kinetics_classnames.items()}
```
The model outputs logits for 400 action classes — everything from “archery” and “bowling” to “playing guitar” and “riding a bike.”
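The label JSON maps quoted class names to integer IDs, which is why the inversion above strips stray quote characters. The same logic in isolation, with a toy mapping standing in for the real 400-entry file (entries here are hypothetical):

```python
# Toy stand-in for kinetics_classnames.json: name -> id, names carry stray quotes
toy_classnames = {'"archery"': 5, '"bowling"': 33, "playing guitar": 217}

# Same inversion as above: integer ID -> label, quotes stripped
toy_id_to_label = {v: k.replace('"', "") for k, v in toy_classnames.items()}
print(toy_id_to_label[5])    # archery
print(toy_id_to_label[217])  # playing guitar
```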
## Build the Preprocessing Pipeline
SlowFast’s preprocessing is more involved than a typical image classifier. You need to sample 32 frames uniformly, normalize them, then split the tensor into two pathways with different temporal resolutions.
```python
from torchvision.transforms import Compose, Lambda
from torchvision.transforms._transforms_video import CenterCropVideo, NormalizeVideo
from pytorchvideo.data.encoded_video import EncodedVideo
from pytorchvideo.transforms import (
    ApplyTransformToKey,
    ShortSideScale,
    UniformTemporalSubsample,
)

# SlowFast R50 preprocessing constants
NUM_FRAMES = 32
SAMPLING_RATE = 2
FPS = 30
SIDE_SIZE = 256
CROP_SIZE = 256
MEAN = [0.45, 0.45, 0.45]
STD = [0.225, 0.225, 0.225]
SLOWFAST_ALPHA = 4  # temporal stride ratio between slow and fast pathways

class PackPathway(torch.nn.Module):
    """Split video tensor into slow and fast pathway inputs.

    The fast pathway gets all 32 frames.
    The slow pathway gets every 4th frame (32 / alpha = 8 frames).
    """

    def forward(self, frames: torch.Tensor) -> List[torch.Tensor]:
        fast_pathway = frames
        slow_pathway = torch.index_select(
            frames,
            1,  # temporal dimension
            torch.linspace(
                0, frames.shape[1] - 1, frames.shape[1] // SLOWFAST_ALPHA
            ).long(),
        )
        return [slow_pathway, fast_pathway]

transform = ApplyTransformToKey(
    key="video",
    transform=Compose(
        [
            UniformTemporalSubsample(NUM_FRAMES),
            Lambda(lambda x: x / 255.0),
            NormalizeVideo(MEAN, STD),
            ShortSideScale(size=SIDE_SIZE),
            CenterCropVideo(CROP_SIZE),
            PackPathway(),
        ]
    ),
)

# Clip duration in seconds: (32 frames * 2 sampling_rate) / 30 fps ≈ 2.13 s
clip_duration = (NUM_FRAMES * SAMPLING_RATE) / FPS
```
The `PackPathway` transform is the critical piece. SlowFast expects a list of two tensors, not a single tensor. Once the batch dimension is added at inference time, the slow pathway has shape `[1, 3, 8, 256, 256]` and the fast pathway `[1, 3, 32, 256, 256]`.
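To see exactly which frames the slow pathway keeps, the index math can be reproduced in plain Python, no tensors needed. This helper mirrors `torch.linspace(0, T-1, T // alpha).long()` up to floating-point detail (it is illustrative only, not part of the pipeline):

```python
def slow_pathway_indices(num_frames: int, alpha: int) -> list:
    """Plain-Python equivalent of torch.linspace(0, T-1, T//alpha).long():
    T // alpha evenly spaced points from 0 to T-1, truncated to ints."""
    n = num_frames // alpha
    if n <= 1:
        return [0]
    return [int(i * (num_frames - 1) / (n - 1)) for i in range(n)]

# Frames the slow pathway keeps out of a 32-frame clip with alpha=4
print(slow_pathway_indices(32, 4))  # [0, 4, 8, 13, 17, 22, 26, 31]
```

Note the spacing is uniform-ish rather than a strict every-4th-frame stride: linspace anchors both endpoints, so the first and last frames are always included.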
## Run Inference on a Video File
With the pipeline in place, classify an action from any video file:
```python
def classify_video(video_path: str, start_sec: float = 0.0, top_k: int = 5):
    """Classify the action in a video clip starting at start_sec."""
    video = EncodedVideo.from_path(video_path)
    video_data = video.get_clip(
        start_sec=start_sec, end_sec=start_sec + clip_duration
    )
    video_data = transform(video_data)

    # Move both pathway tensors to device and add batch dimension
    inputs = [pathway.to(device)[None, ...] for pathway in video_data["video"]]

    with torch.no_grad():
        preds = model(inputs)
    probs = torch.nn.functional.softmax(preds, dim=1)
    top_probs, top_indices = probs.topk(top_k)

    results = []
    for i in range(top_k):
        label = id_to_label[int(top_indices[0][i])]
        confidence = float(top_probs[0][i])
        results.append((label, confidence))
        print(f"  {label}: {confidence:.3f}")
    return results

# Example usage
classify_video("archery.mp4")
# Output:
#   archery: 0.932
#   throwing axe: 0.012
#   playing cricket: 0.008
#   ...
```
The `EncodedVideo` class from PyTorchVideo handles video decoding internally — it supports mp4, avi, and most common formats via PyAV.
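The softmax/top-k decoding step above is easy to sanity-check in isolation — it reduces to a few lines of plain Python. This sketch uses dummy logits and hypothetical labels, not real model output:

```python
import math

def softmax_topk(logits, labels, k=3):
    """Convert raw logits to probabilities and return the k best (label, prob) pairs."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(zip(labels, probs), key=lambda p: p[1], reverse=True)
    return ranked[:k]

# Dummy 4-class example (labels are placeholders, not real Kinetics IDs)
top = softmax_topk([5.0, 1.0, 0.5, -2.0],
                   ["archery", "bowling", "yoga", "surfing"], k=2)
print(top)  # "archery" dominates because its logit is far above the rest
```

A logit gap of 4 already translates to a ~97% probability mass on the top class, which is why confident predictions like the `archery: 0.932` output above are common on clean clips.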
## Process a Live Webcam or Video Stream
For real-time inference, grab frames from OpenCV, buffer them, and run the model on rolling windows:
```python
import cv2
import numpy as np

def process_video_stream(source=0, interval_sec: float = 2.0):
    """Run action recognition on a webcam or video file.

    Args:
        source: 0 for webcam, or a path like "video.mp4"
        interval_sec: seconds between predictions
    """
    cap = cv2.VideoCapture(source)
    if not cap.isOpened():
        raise RuntimeError(f"Cannot open video source: {source}")

    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    frames_needed = NUM_FRAMES * SAMPLING_RATE  # 64 frames at native fps
    frame_buffer = []
    frame_count = 0
    prediction_interval = int(fps * interval_sec)
    print(f"Stream opened — FPS: {fps:.0f}, predicting every {prediction_interval} frames")

    while True:
        ret, frame = cap.read()
        if not ret:
            break

        # Convert BGR -> RGB and store
        frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        frame_buffer.append(frame_rgb)

        # Keep only the most recent frames we need
        if len(frame_buffer) > frames_needed:
            frame_buffer = frame_buffer[-frames_needed:]

        frame_count += 1

        # Run prediction at fixed intervals once we have enough frames
        if frame_count % prediction_interval == 0 and len(frame_buffer) >= frames_needed:
            label, confidence = _predict_from_frames(frame_buffer)

            # Overlay prediction on the display frame
            text = f"{label} ({confidence:.2f})"
            cv2.putText(
                frame, text, (20, 40),
                cv2.FONT_HERSHEY_SIMPLEX, 1.0, (0, 255, 0), 2,
            )
            print(f" -> {text}")

        cv2.imshow("Action Recognition", frame)
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break

    cap.release()
    cv2.destroyAllWindows()

def _predict_from_frames(frame_buffer: list) -> tuple:
    """Run SlowFast inference on a list of RGB numpy frames."""
    # Stack frames into tensor: (T, H, W, C) -> (C, T, H, W)
    video_tensor = torch.from_numpy(np.array(frame_buffer)).float()
    video_tensor = video_tensor.permute(3, 0, 1, 2)  # C, T, H, W

    # Subsample to NUM_FRAMES uniformly
    indices = torch.linspace(0, video_tensor.shape[1] - 1, NUM_FRAMES).long()
    video_tensor = video_tensor[:, indices, :, :]

    # Normalize
    video_tensor = video_tensor / 255.0
    for c in range(3):
        video_tensor[c] = (video_tensor[c] - MEAN[c]) / STD[c]

    # Resize short side to 256 and center crop
    _, t, h, w = video_tensor.shape
    scale = SIDE_SIZE / min(h, w)
    new_h, new_w = int(h * scale), int(w * scale)
    video_tensor = torch.nn.functional.interpolate(
        video_tensor.permute(1, 0, 2, 3),  # T, C, H, W for interpolate
        size=(new_h, new_w),
        mode="bilinear",
        align_corners=False,
    ).permute(1, 0, 2, 3)  # back to C, T, H, W

    # Center crop
    start_h = (new_h - CROP_SIZE) // 2
    start_w = (new_w - CROP_SIZE) // 2
    video_tensor = video_tensor[:, :, start_h:start_h + CROP_SIZE, start_w:start_w + CROP_SIZE]

    # Pack into slow/fast pathways
    fast_pathway = video_tensor
    slow_indices = torch.linspace(0, NUM_FRAMES - 1, NUM_FRAMES // SLOWFAST_ALPHA).long()
    slow_pathway = video_tensor[:, slow_indices, :, :]

    inputs = [
        slow_pathway.unsqueeze(0).to(device),   # [1, 3, 8, 256, 256]
        fast_pathway.unsqueeze(0).to(device),   # [1, 3, 32, 256, 256]
    ]
    with torch.no_grad():
        preds = model(inputs)
    probs = torch.nn.functional.softmax(preds, dim=1)
    top_prob, top_idx = probs.topk(1)

    label = id_to_label[int(top_idx[0][0])]
    return label, float(top_prob[0][0])

# Run on webcam
# process_video_stream(0)

# Or on a video file
# process_video_stream("surveillance_clip.mp4")
```
The key insight: you need to buffer enough raw frames (64 at native FPS with `SAMPLING_RATE = 2`) before running inference. The model expects a ~2-second window of video, so predictions naturally lag by that amount.
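The rolling-window bookkeeping in `process_video_stream` is easy to get wrong, so here is its core isolated with `collections.deque`, which drops the oldest frame automatically once full (plain Python, no OpenCV; `FrameWindow` is a name invented for this sketch):

```python
from collections import deque

class FrameWindow:
    """Keep the most recent `size` frames; report when a full window is ready."""

    def __init__(self, size: int):
        self.frames = deque(maxlen=size)  # deque evicts the oldest frame for us

    def push(self, frame) -> bool:
        self.frames.append(frame)
        return len(self.frames) == self.frames.maxlen

    def window(self) -> list:
        return list(self.frames)

# Simulate 100 incoming frames with a 64-frame window (NUM_FRAMES * SAMPLING_RATE)
win = FrameWindow(size=64)
ready_at = None
for i in range(100):
    if win.push(i) and ready_at is None:
        ready_at = i

print(ready_at)          # 63: the first index at which 64 frames are buffered
print(win.window()[:3])  # [36, 37, 38]: oldest surviving frames after 100 pushes
```

Compared to the list-slicing approach in the stream loop, `deque(maxlen=...)` avoids re-allocating the buffer on every frame.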
## Batch Inference on Multiple Clips
When you need to process many video files — say a directory of security footage clips — batch them to max out GPU utilization:
```python
import os

def batch_classify(video_dir: str, top_k: int = 3) -> dict:
    """Classify actions in all video files in a directory."""
    results = {}
    video_extensions = {".mp4", ".avi", ".mov", ".mkv"}
    video_files = [
        f for f in os.listdir(video_dir)
        if os.path.splitext(f)[1].lower() in video_extensions
    ]

    # Collect pathway tensors for batching
    slow_batch, fast_batch = [], []
    valid_files = []
    for filename in video_files:
        filepath = os.path.join(video_dir, filename)
        try:
            video = EncodedVideo.from_path(filepath)
            video_data = video.get_clip(start_sec=0, end_sec=clip_duration)
            video_data = transform(video_data)
            pathways = video_data["video"]
            slow_batch.append(pathways[0])  # [3, 8, 256, 256]
            fast_batch.append(pathways[1])  # [3, 32, 256, 256]
            valid_files.append(filename)
        except Exception as e:
            print(f"Skipping {filename}: {e}")

    if not valid_files:
        return results

    # Stack into batch tensors
    slow_input = torch.stack(slow_batch).to(device)  # [B, 3, 8, 256, 256]
    fast_input = torch.stack(fast_batch).to(device)  # [B, 3, 32, 256, 256]

    with torch.no_grad():
        preds = model([slow_input, fast_input])
    probs = torch.nn.functional.softmax(preds, dim=1)
    top_probs, top_indices = probs.topk(top_k, dim=1)

    for i, filename in enumerate(valid_files):
        results[filename] = [
            (id_to_label[int(top_indices[i][j])], float(top_probs[i][j]))
            for j in range(top_k)
        ]
    return results

# Example
# results = batch_classify("/data/clips/")
# for name, preds in results.items():
#     print(f"{name}: {preds}")
Watch your GPU memory with large batches. A batch of 8 clips uses roughly 4 GB of VRAM on an RTX 3080. Scale the batch size to fit your hardware.
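If a directory holds more clips than fit in one batch, chunk the file list before stacking. A minimal sketch of the chunking step (the `chunked` helper is an illustration, not part of PyTorchVideo):

```python
def chunked(items: list, batch_size: int):
    """Yield successive batches of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# 19 hypothetical clips, batches of 8 to stay within VRAM limits
files = [f"clip_{i}.mp4" for i in range(19)]
batches = list(chunked(files, 8))
print([len(b) for b in batches])  # [8, 8, 3]
```

Each batch would then go through the same stack-and-forward path as in `batch_classify`, merging the per-batch result dicts at the end.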
## Half Precision (FP16)
Cut memory usage in half and speed up inference on modern GPUs:
```python
model_fp16 = model.half()

# Inputs must also be half precision
inputs_fp16 = [pathway.half() for pathway in inputs]
with torch.no_grad():
    preds = model_fp16(inputs_fp16)
```
On an A100, FP16 inference runs about 1.8x faster than FP32 with negligible accuracy loss.
## TorchScript Export
Compile the model for production deployment:
```python
# SlowFast needs example inputs for tracing
example_slow = torch.randn(1, 3, 8, 256, 256).to(device)
example_fast = torch.randn(1, 3, 32, 256, 256).to(device)

# The model takes ONE argument (a list of pathways), so wrap the list in a
# tuple — a bare list would be treated as two separate positional arguments
scripted_model = torch.jit.trace(model, ([example_slow, example_fast],))
scripted_model.save("slowfast_r50_scripted.pt")

# Load and use later without the pytorchvideo dependency
loaded = torch.jit.load("slowfast_r50_scripted.pt")
preds = loaded([example_slow, example_fast])
```
TorchScript removes the Python overhead and enables deployment in C++ applications or behind a TorchServe endpoint.
## Other Tricks

- Use `torch.compile` (PyTorch 2.0+) for an easy 20-30% speedup: `model = torch.compile(model)`
- Reduce input resolution to 224x224 if you can tolerate a small accuracy drop — it cuts compute by ~25%
- Skip frames more aggressively for real-time use — sampling every 4th frame instead of every 2nd still works reasonably well for most actions
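The frame-skipping trick from the last bullet is just a wider stride over the raw frame buffer — same 32 frames fed to the model, but covering twice the time span:

```python
frames = list(range(128))     # stand-in for a raw frame buffer (frame indices)

every_2nd = frames[::2][:32]  # default SAMPLING_RATE = 2
every_4th = frames[::4][:32]  # more aggressive: same frame count, wider window

print(every_2nd[-1])  # 62:  32 frames spanning ~2.1 s at 30 fps
print(every_4th[-1])  # 124: 32 frames spanning ~4.2 s at 30 fps
```

The trade-off: a wider window captures slower actions better, but fast motions get temporally blurred across fewer samples.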
## Common Errors and Fixes
**`RuntimeError: Expected input[0] to have 5 dimensions, got 4`**
You forgot to add the batch dimension. Each pathway tensor needs shape `[B, C, T, H, W]`. Fix it with `tensor.unsqueeze(0)`.

**`TypeError: forward() takes 2 positional arguments but 3 were given`**
You passed the slow and fast pathways as separate arguments instead of a list. The model expects `model([slow, fast])`, not `model(slow, fast)`.

**`IndexError: index 32 is out of bounds for dimension 1 with size 32`**
Your `torch.linspace` call in `PackPathway` is generating an index equal to `frames.shape[1]`. Make sure the end value is `frames.shape[1] - 1`, not `frames.shape[1]`.

**`ImportError: No module named 'pytorchvideo'`**
Install it with `pip install pytorchvideo`. If you hit dependency conflicts with your PyTorch version, install from source: `pip install "git+https://github.com/facebookresearch/pytorchvideo.git"`.

**`av.error.InvalidDataError` when loading video**
PyTorchVideo's `EncodedVideo` uses PyAV under the hood. Install it: `pip install av`. If the video file itself is corrupt or uses an unsupported codec, convert it first with ffmpeg: `ffmpeg -i input.mp4 -c:v libx264 output.mp4`.

**Predictions are wrong or random on custom video**
Check three things: (1) your video is long enough to provide the full clip duration (~2.13 seconds), (2) frames are in RGB order, not BGR, if you loaded with OpenCV, and (3) normalization uses the Kinetics-400 values (`mean=[0.45, 0.45, 0.45]`, `std=[0.225, 0.225, 0.225]`), not ImageNet values.