SlowFast networks process video at two temporal resolutions simultaneously — a Slow pathway captures spatial detail at low frame rate, while a Fast pathway captures motion at high frame rate with fewer channels. This dual-pathway design hits 76.94% top-1 accuracy on Kinetics-400 and runs efficiently enough for real-time use cases.
Here’s the fastest path from zero to working action recognition.
## Load a Pretrained SlowFast Model
Install the dependencies first:
```bash
pip install torch torchvision pytorchvideo opencv-python
```
Load the pretrained SlowFast R50 model from Torch Hub and grab the Kinetics-400 label mapping:
```python
import torch
import json
import urllib.request
from typing import List

# Load pretrained SlowFast R50 (Kinetics-400, 8x8 setting)
model = torch.hub.load(
    "facebookresearch/pytorchvideo", "slowfast_r50", pretrained=True
)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.eval().to(device)

# Download Kinetics-400 class labels
labels_url = "https://dl.fbaipublicfiles.com/pyslowfast/dataset/class_names/kinetics_classnames.json"
urllib.request.urlretrieve(labels_url, "kinetics_classnames.json")
with open("kinetics_classnames.json", "r") as f:
    kinetics_classnames = json.load(f)

# Map integer IDs to human-readable labels
id_to_label = {v: k.replace('"', "") for k, v in kinetics_classnames.items()}
```
The model outputs logits for 400 action classes — everything from “archery” and “bowling” to “playing guitar” and “riding a bike.”
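The label JSON maps quoted class names to integer IDs, which is why the inversion above strips stray quote characters. The same logic in isolation, with a toy mapping standing in for the real 400-entry file (entries here are hypothetical):

```python
# Toy stand-in for kinetics_classnames.json: name -> id, names carry stray quotes
toy_classnames = {'"archery"': 5, '"bowling"': 33, "playing guitar": 217}

# Same inversion as above: integer ID -> label, quotes stripped
toy_id_to_label = {v: k.replace('"', "") for k, v in toy_classnames.items()}
print(toy_id_to_label[5])    # archery
print(toy_id_to_label[217])  # playing guitar
```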
## Build the Preprocessing Pipeline
SlowFast’s preprocessing is more involved than a typical image classifier. You need to sample 32 frames uniformly, normalize them, then split the tensor into two pathways with different temporal resolutions.
```python
from torchvision.transforms import Compose, Lambda
from torchvision.transforms._transforms_video import CenterCropVideo, NormalizeVideo
from pytorchvideo.data.encoded_video import EncodedVideo
from pytorchvideo.transforms import (
    ApplyTransformToKey,
    ShortSideScale,
    UniformTemporalSubsample,
)

# SlowFast R50 preprocessing constants
NUM_FRAMES = 32
SAMPLING_RATE = 2
FPS = 30
SIDE_SIZE = 256
CROP_SIZE = 256
MEAN = [0.45, 0.45, 0.45]
STD = [0.225, 0.225, 0.225]
SLOWFAST_ALPHA = 4  # temporal stride ratio between slow and fast pathways

class PackPathway(torch.nn.Module):
    """Split video tensor into slow and fast pathway inputs.

    The fast pathway gets all 32 frames.
    The slow pathway gets every 4th frame (32 / alpha = 8 frames).
    """

    def forward(self, frames: torch.Tensor) -> List[torch.Tensor]:
        fast_pathway = frames
        slow_pathway = torch.index_select(
            frames,
            1,  # temporal dimension
            torch.linspace(
                0, frames.shape[1] - 1, frames.shape[1] // SLOWFAST_ALPHA
            ).long(),
        )
        return [slow_pathway, fast_pathway]

transform = ApplyTransformToKey(
    key="video",
    transform=Compose(
        [
            UniformTemporalSubsample(NUM_FRAMES),
            Lambda(lambda x: x / 255.0),
            NormalizeVideo(MEAN, STD),
            ShortSideScale(size=SIDE_SIZE),
            CenterCropVideo(CROP_SIZE),
            PackPathway(),
        ]
    ),
)

# Clip duration in seconds: (32 frames * 2 sampling_rate) / 30 fps ≈ 2.13 s
clip_duration = (NUM_FRAMES * SAMPLING_RATE) / FPS
```
The `PackPathway` transform is the critical piece. SlowFast expects a list of two tensors, not a single tensor. Once the batch dimension is added at inference time, the slow pathway has shape `[1, 3, 8, 256, 256]` and the fast pathway `[1, 3, 32, 256, 256]`.
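To see exactly which frames the slow pathway keeps, the index math can be reproduced in plain Python, no tensors needed. This helper mirrors `torch.linspace(0, T-1, T // alpha).long()` up to floating-point detail (it is illustrative only, not part of the pipeline):

```python
def slow_pathway_indices(num_frames: int, alpha: int) -> list:
    """Plain-Python equivalent of torch.linspace(0, T-1, T//alpha).long():
    T // alpha evenly spaced points from 0 to T-1, truncated to ints."""
    n = num_frames // alpha
    if n <= 1:
        return [0]
    return [int(i * (num_frames - 1) / (n - 1)) for i in range(n)]

# Frames the slow pathway keeps out of a 32-frame clip with alpha=4
print(slow_pathway_indices(32, 4))  # [0, 4, 8, 13, 17, 22, 26, 31]
```

Note the spacing is uniform-ish rather than a strict every-4th-frame stride: linspace anchors both endpoints, so the first and last frames are always included.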
## Run Inference on a Video File
With the pipeline in place, classify an action from any video file:
```python
def classify_video(video_path: str, start_sec: float = 0.0, top_k: int = 5):
    """Classify the action in a video clip starting at start_sec."""
    video = EncodedVideo.from_path(video_path)
    video_data = video.get_clip(
        start_sec=start_sec, end_sec=start_sec + clip_duration
    )
    video_data = transform(video_data)

    # Move both pathway tensors to device and add batch dimension
    inputs = [pathway.to(device)[None, ...] for pathway in video_data["video"]]

    with torch.no_grad():
        preds = model(inputs)
    probs = torch.nn.functional.softmax(preds, dim=1)
    top_probs, top_indices = probs.topk(top_k)

    results = []
    for i in range(top_k):
        label = id_to_label[int(top_indices[0][i])]
        confidence = float(top_probs[0][i])
        results.append((label, confidence))
        print(f"  {label}: {confidence:.3f}")
    return results

# Example usage
classify_video("archery.mp4")
# Output:
#   archery: 0.932
#   throwing axe: 0.012
#   playing cricket: 0.008
#   ...
```
The `EncodedVideo` class from PyTorchVideo handles video decoding internally — it supports mp4, avi, and most common formats via PyAV.
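The softmax/top-k decoding step above is easy to sanity-check in isolation — it reduces to a few lines of plain Python. This sketch uses dummy logits and hypothetical labels, not real model output:

```python
import math

def softmax_topk(logits, labels, k=3):
    """Convert raw logits to probabilities and return the k best (label, prob) pairs."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(zip(labels, probs), key=lambda p: p[1], reverse=True)
    return ranked[:k]

# Dummy 4-class example (labels are placeholders, not real Kinetics IDs)
top = softmax_topk([5.0, 1.0, 0.5, -2.0],
                   ["archery", "bowling", "yoga", "surfing"], k=2)
print(top)  # "archery" dominates because its logit is far above the rest
```

A logit gap of 4 already translates to a ~97% probability mass on the top class, which is why confident predictions like the `archery: 0.932` output above are common on clean clips.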
## Process a Live Webcam or Video Stream
For real-time inference, grab frames from OpenCV, buffer them, and run the model on rolling windows:
```python
import cv2
import numpy as np

def process_video_stream(source=0, interval_sec: float = 2.0):
    """Run action recognition on a webcam or video file.

    Args:
        source: 0 for webcam, or a path like "video.mp4"
        interval_sec: seconds between predictions
    """
    cap = cv2.VideoCapture(source)
    if not cap.isOpened():
        raise RuntimeError(f"Cannot open video source: {source}")

    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    frames_needed = NUM_FRAMES * SAMPLING_RATE  # 64 frames at native fps
    frame_buffer = []
    frame_count = 0
    prediction_interval = int(fps * interval_sec)
    print(f"Stream opened — FPS: {fps:.0f}, predicting every {prediction_interval} frames")

    while True:
        ret, frame = cap.read()
        if not ret:
            break

        # Convert BGR -> RGB and store
        frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        frame_buffer.append(frame_rgb)

        # Keep only the most recent frames we need
        if len(frame_buffer) > frames_needed:
            frame_buffer = frame_buffer[-frames_needed:]

        frame_count += 1

        # Run prediction at fixed intervals once we have enough frames
        if frame_count % prediction_interval == 0 and len(frame_buffer) >= frames_needed:
            label, confidence = _predict_from_frames(frame_buffer)

            # Overlay prediction on the display frame
            text = f"{label} ({confidence:.2f})"
            cv2.putText(
                frame, text, (20, 40),
                cv2.FONT_HERSHEY_SIMPLEX, 1.0, (0, 255, 0), 2,
            )
            print(f" -> {text}")

        cv2.imshow("Action Recognition", frame)
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break

    cap.release()
    cv2.destroyAllWindows()

def _predict_from_frames(frame_buffer: list) -> tuple:
    """Run SlowFast inference on a list of RGB numpy frames."""
    # Stack frames into tensor: (T, H, W, C) -> (C, T, H, W)
    video_tensor = torch.from_numpy(np.array(frame_buffer)).float()
    video_tensor = video_tensor.permute(3, 0, 1, 2)  # C, T, H, W

    # Subsample to NUM_FRAMES uniformly
    indices = torch.linspace(0, video_tensor.shape[1] - 1, NUM_FRAMES).long()
    video_tensor = video_tensor[:, indices, :, :]

    # Normalize
    video_tensor = video_tensor / 255.0
    for c in range(3):
        video_tensor[c] = (video_tensor[c] - MEAN[c]) / STD[c]

    # Resize short side to 256 and center crop
    _, t, h, w = video_tensor.shape
    scale = SIDE_SIZE / min(h, w)
    new_h, new_w = int(h * scale), int(w * scale)
    video_tensor = torch.nn.functional.interpolate(
        video_tensor.permute(1, 0, 2, 3),  # T, C, H, W for interpolate
        size=(new_h, new_w),
        mode="bilinear",
        align_corners=False,
    ).permute(1, 0, 2, 3)  # back to C, T, H, W

    # Center crop
    start_h = (new_h - CROP_SIZE) // 2
    start_w = (new_w - CROP_SIZE) // 2
    video_tensor = video_tensor[:, :, start_h:start_h + CROP_SIZE, start_w:start_w + CROP_SIZE]

    # Pack into slow/fast pathways
    fast_pathway = video_tensor
    slow_indices = torch.linspace(0, NUM_FRAMES - 1, NUM_FRAMES // SLOWFAST_ALPHA).long()
    slow_pathway = video_tensor[:, slow_indices, :, :]

    inputs = [
        slow_pathway.unsqueeze(0).to(device),   # [1, 3, 8, 256, 256]
        fast_pathway.unsqueeze(0).to(device),   # [1, 3, 32, 256, 256]
    ]
    with torch.no_grad():
        preds = model(inputs)
    probs = torch.nn.functional.softmax(preds, dim=1)
    top_prob, top_idx = probs.topk(1)

    label = id_to_label[int(top_idx[0][0])]
    return label, float(top_prob[0][0])

# Run on webcam
# process_video_stream(0)

# Or on a video file
# process_video_stream("surveillance_clip.mp4")
```
The key insight: you need to buffer enough raw frames (64 at native FPS with `SAMPLING_RATE = 2`) before running inference. The model expects a ~2-second window of video, so predictions naturally lag by that amount.
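The rolling-window bookkeeping in `process_video_stream` is easy to get wrong, so here is its core isolated with `collections.deque`, which drops the oldest frame automatically once full (plain Python, no OpenCV; `FrameWindow` is a name invented for this sketch):

```python
from collections import deque

class FrameWindow:
    """Keep the most recent `size` frames; report when a full window is ready."""

    def __init__(self, size: int):
        self.frames = deque(maxlen=size)  # deque evicts the oldest frame for us

    def push(self, frame) -> bool:
        self.frames.append(frame)
        return len(self.frames) == self.frames.maxlen

    def window(self) -> list:
        return list(self.frames)

# Simulate 100 incoming frames with a 64-frame window (NUM_FRAMES * SAMPLING_RATE)
win = FrameWindow(size=64)
ready_at = None
for i in range(100):
    if win.push(i) and ready_at is None:
        ready_at = i

print(ready_at)          # 63: the first index at which 64 frames are buffered
print(win.window()[:3])  # [36, 37, 38]: oldest surviving frames after 100 pushes
```

Compared to the list-slicing approach in the stream loop, `deque(maxlen=...)` avoids re-allocating the buffer on every frame.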
## Batch Inference on Multiple Clips
When you need to process many video files — say a directory of security footage clips — batch them to max out GPU utilization:
```python
import os

def batch_classify(video_dir: str, top_k: int = 3) -> dict:
    """Classify actions in all video files in a directory."""
    results = {}
    video_extensions = {".mp4", ".avi", ".mov", ".mkv"}
    video_files = [
        f for f in os.listdir(video_dir)
        if os.path.splitext(f)[1].lower() in video_extensions
    ]

    # Collect pathway tensors for batching
    slow_batch, fast_batch = [], []
    valid_files = []
    for filename in video_files:
        filepath = os.path.join(video_dir, filename)
        try:
            video = EncodedVideo.from_path(filepath)
            video_data = video.get_clip(start_sec=0, end_sec=clip_duration)
            video_data = transform(video_data)
            pathways = video_data["video"]
            slow_batch.append(pathways[0])  # [3, 8, 256, 256]
            fast_batch.append(pathways[1])  # [3, 32, 256, 256]
            valid_files.append(filename)
        except Exception as e:
            print(f"Skipping {filename}: {e}")

    if not valid_files:
        return results

    # Stack into batch tensors
    slow_input = torch.stack(slow_batch).to(device)  # [B, 3, 8, 256, 256]
    fast_input = torch.stack(fast_batch).to(device)  # [B, 3, 32, 256, 256]

    with torch.no_grad():
        preds = model([slow_input, fast_input])
    probs = torch.nn.functional.softmax(preds, dim=1)
    top_probs, top_indices = probs.topk(top_k, dim=1)

    for i, filename in enumerate(valid_files):
        results[filename] = [
            (id_to_label[int(top_indices[i][j])], float(top_probs[i][j]))
            for j in range(top_k)
        ]
    return results

# Example
# results = batch_classify("/data/clips/")
# for name, preds in results.items():
#     print(f"{name}: {preds}")
Watch your GPU memory with large batches. A batch of 8 clips uses roughly 4 GB of VRAM on an RTX 3080. Scale the batch size to fit your hardware.
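If a directory holds more clips than fit in one batch, chunk the file list before stacking. A minimal sketch of the chunking step (the `chunked` helper is an illustration, not part of PyTorchVideo):

```python
def chunked(items: list, batch_size: int):
    """Yield successive batches of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# 19 hypothetical clips, batches of 8 to stay within VRAM limits
files = [f"clip_{i}.mp4" for i in range(19)]
batches = list(chunked(files, 8))
print([len(b) for b in batches])  # [8, 8, 3]
```

Each batch would then go through the same stack-and-forward path as in `batch_classify`, merging the per-batch result dicts at the end.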
## Half Precision (FP16)
Cut memory usage in half and speed up inference on modern GPUs:
```python
model_fp16 = model.half()

# Inputs must also be half precision
inputs_fp16 = [pathway.half() for pathway in inputs]
with torch.no_grad():
    preds = model_fp16(inputs_fp16)
```
On an A100, FP16 inference runs about 1.8x faster than FP32 with negligible accuracy loss.
## TorchScript Export
Compile the model for production deployment:
```python
# SlowFast needs example inputs for tracing
example_slow = torch.randn(1, 3, 8, 256, 256).to(device)
example_fast = torch.randn(1, 3, 32, 256, 256).to(device)

# The model takes ONE argument (a list of pathways), so wrap the list in a
# tuple — a bare list would be treated as two separate positional arguments
scripted_model = torch.jit.trace(model, ([example_slow, example_fast],))
scripted_model.save("slowfast_r50_scripted.pt")

# Load and use later without the pytorchvideo dependency
loaded = torch.jit.load("slowfast_r50_scripted.pt")
preds = loaded([example_slow, example_fast])
```
TorchScript removes the Python overhead and enables deployment in C++ applications or behind a TorchServe endpoint.
## Other Tricks

- Use `torch.compile` (PyTorch 2.0+) for an easy 20-30% speedup: `model = torch.compile(model)`
- Reduce input resolution to 224x224 if you can tolerate a small accuracy drop — it cuts compute by ~25%
- Skip frames more aggressively for real-time use — sampling every 4th frame instead of every 2nd still works reasonably well for most actions
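The frame-skipping trick from the last bullet is just a wider stride over the raw frame buffer — same 32 frames fed to the model, but covering twice the time span:

```python
frames = list(range(128))     # stand-in for a raw frame buffer (frame indices)

every_2nd = frames[::2][:32]  # default SAMPLING_RATE = 2
every_4th = frames[::4][:32]  # more aggressive: same frame count, wider window

print(every_2nd[-1])  # 62:  32 frames spanning ~2.1 s at 30 fps
print(every_4th[-1])  # 124: 32 frames spanning ~4.2 s at 30 fps
```

The trade-off: a wider window captures slower actions better, but fast motions get temporally blurred across fewer samples.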
## Common Errors and Fixes
**`RuntimeError: Expected input[0] to have 5 dimensions, got 4`**
You forgot to add the batch dimension. Each pathway tensor needs shape `[B, C, T, H, W]`. Fix it with `tensor.unsqueeze(0)`.

**`TypeError: forward() takes 2 positional arguments but 3 were given`**
You passed the slow and fast pathways as separate arguments instead of a list. The model expects `model([slow, fast])`, not `model(slow, fast)`.

**`IndexError: index 32 is out of bounds for dimension 1 with size 32`**
Your `torch.linspace` call in `PackPathway` is generating an index equal to `frames.shape[1]`. Make sure the end value is `frames.shape[1] - 1`, not `frames.shape[1]`.

**`ImportError: No module named 'pytorchvideo'`**
Install it with `pip install pytorchvideo`. If you hit dependency conflicts with your PyTorch version, install from source: `pip install "git+https://github.com/facebookresearch/pytorchvideo.git"`.

**`av.error.InvalidDataError` when loading video**
PyTorchVideo's `EncodedVideo` uses PyAV under the hood. Install it: `pip install av`. If the video file itself is corrupt or uses an unsupported codec, convert it first with ffmpeg: `ffmpeg -i input.mp4 -c:v libx264 output.mp4`.

**Predictions are wrong or random on custom video**
Check three things: (1) your video is long enough to provide the full clip duration (~2.13 seconds), (2) frames are in RGB order, not BGR, if you loaded with OpenCV, and (3) normalization uses the Kinetics-400 values (`mean=[0.45, 0.45, 0.45]`, `std=[0.225, 0.225, 0.225]`), not ImageNet values.