MediaPipe Hands gives you 21 3D landmarks per hand, tracked in real time, straight out of the box. Pair that with a simple rule-based classifier and you can recognize gestures like thumbs up, peace sign, fist, and pointing without training a single model. Here’s the quickest way to get it running.
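Start by installing MediaPipe and OpenCV from PyPI:

```shell
pip install mediapipe opencv-python
```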
That pulls in everything you need. MediaPipe bundles its own hand detection and landmark models, so there’s no separate model download step.
Capturing Frames and Detecting Hands
The first step is opening your webcam with OpenCV and feeding each frame to MediaPipe Hands. MediaPipe expects RGB input, but OpenCV captures in BGR, so you need a color conversion.
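A minimal capture-and-draw loop looks like this (the confidence thresholds are reasonable defaults you can tune):

```python
import cv2
import mediapipe as mp

mp_hands = mp.solutions.hands
mp_draw = mp.solutions.drawing_utils

hands = mp_hands.Hands(
    static_image_mode=False,       # track between frames instead of re-detecting
    max_num_hands=1,
    min_detection_confidence=0.5,
    min_tracking_confidence=0.5,
)

cap = cv2.VideoCapture(0)
while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break
    frame = cv2.flip(frame, 1)  # mirror view, so your movements feel natural
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)  # MediaPipe expects RGB
    results = hands.process(rgb)
    if results.multi_hand_landmarks:
        for hand_landmarks in results.multi_hand_landmarks:
            mp_draw.draw_landmarks(frame, hand_landmarks, mp_hands.HAND_CONNECTIONS)
    cv2.imshow("Hands", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
cv2.destroyAllWindows()
```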
Run this and you’ll see colored dots and lines overlaid on your hands. The static_image_mode=False flag tells MediaPipe to use lightweight tracking between frames instead of running full detection every time. That’s what keeps it fast.
Each landmark has x, y, and z coordinates. The x and y values are normalized to [0, 1] by the image width and height; z is a relative depth with the wrist as the origin (smaller values are closer to the camera). The 21 points cover the wrist plus four joints per finger: CMC, MCP, IP, and TIP for the thumb, and MCP, PIP, DIP, and TIP for the other four fingers. Landmark 0 is the wrist; landmarks 4, 8, 12, 16, and 20 are the fingertips from thumb through pinky.
Building a Rule-Based Gesture Classifier
You don’t need a neural network for basic gestures. Finger state (extended or curled) is enough. A finger is extended when its tip landmark is above (lower y value) its PIP joint. The thumb is a special case since it moves laterally, so you compare the tip’s x position against the IP joint instead.
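Here is one way to turn those rules into code. The landmark indices follow the layout above; the gesture names and rule combinations are choices for this tutorial, not anything MediaPipe defines:

```python
def is_thumb_extended(landmarks, handedness):
    # Thumb moves laterally: compare tip (4) x against IP joint (3) x.
    tip, ip = landmarks[4], landmarks[3]
    if handedness == "Right":
        return tip.x < ip.x  # on a right hand (mirrored view), tip moves left
    return tip.x > ip.x

def get_finger_states(landmarks):
    # A finger is extended when its tip is above its PIP joint.
    # Image y grows downward, so "above" means a smaller y value.
    tips = [8, 12, 16, 20]   # index, middle, ring, pinky tips
    pips = [6, 10, 14, 18]   # matching PIP joints
    return [landmarks[t].y < landmarks[p].y for t, p in zip(tips, pips)]

def classify_gesture(landmarks, handedness):
    thumb = is_thumb_extended(landmarks, handedness)
    fingers = get_finger_states(landmarks)  # [index, middle, ring, pinky]
    if not thumb and not any(fingers):
        return "Fist"
    if thumb and not any(fingers):
        return "Thumbs Up"
    if fingers == [True, True, False, False] and not thumb:
        return "Peace"
    if fingers == [True, False, False, False]:
        return "Pointing"
    if thumb and all(fingers):
        return "Open Hand"
    return "Unknown"
```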
The handedness parameter matters for the thumb check. MediaPipe reports whether each detected hand is left or right, but the label assumes a mirrored (selfie-view) image. Since we flip the frame for a natural mirror view, a “Right” label really is your right hand. With the palm facing the camera, the thumb extends away from the palm: on a right hand the tip moves left (lower x), and on a left hand it moves right (higher x).
Putting It All Together
Now wire the classifier into the main loop. You need to pull the handedness label from the results and pass the landmark list to classify_gesture.
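Inside the capture loop from earlier, the detection block becomes something like this (it assumes the hands, mp_hands, mp_draw, and classify_gesture names defined above):

```python
if results.multi_hand_landmarks and results.multi_handedness:
    for hand_landmarks, handed in zip(results.multi_hand_landmarks,
                                      results.multi_handedness):
        mp_draw.draw_landmarks(frame, hand_landmarks, mp_hands.HAND_CONNECTIONS)
        label = handed.classification[0].label  # "Left" or "Right"
        gesture = classify_gesture(hand_landmarks.landmark, label)
        # Draw the label just above the wrist (landmark 0), scaling the
        # normalized coordinates to pixel positions.
        h, w = frame.shape[:2]
        wrist = hand_landmarks.landmark[0]
        cv2.putText(frame, gesture,
                    (int(wrist.x * w), int(wrist.y * h) - 20),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.9, (0, 255, 0), 2)
```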
The gesture label appears just above the wrist landmark. You’ll see it update in real time as you switch between fist, thumbs up, peace sign, pointing, and open hand. The label position uses the wrist’s normalized coordinates scaled to the actual frame size.
Improving Accuracy
The basic classifier works well for distinct poses, but you’ll hit edge cases. A few practical improvements:
Smooth noisy predictions. Landmark jitter causes the gesture to flicker between frames. Use a short history buffer and pick the most common gesture over the last N frames.
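A minimal smoother, using a deque and a majority vote over the window:

```python
from collections import Counter, deque

class GestureSmoother:
    """Report the most common gesture over the last N frames."""

    def __init__(self, window=10):
        self.history = deque(maxlen=window)

    def update(self, gesture):
        self.history.append(gesture)
        return Counter(self.history).most_common(1)[0][0]
```

Call smoother.update(gesture) once per frame and display the returned value instead of the raw prediction.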
Adjust thresholds per gesture. Some gestures are ambiguous when fingers are partially curled. You can add margin to the extended/curled checks by requiring a minimum y-distance between tip and PIP instead of a simple comparison.
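For example, with a tunable margin in normalized coordinates (0.04 here is a starting guess, not a calibrated value):

```python
def is_finger_extended(landmarks, tip_idx, pip_idx, margin=0.04):
    # Count the finger as extended only if the tip clears the PIP joint
    # by a margin in normalized y (y grows downward), which rejects
    # half-curled fingers that a bare comparison would accept.
    return landmarks[pip_idx].y - landmarks[tip_idx].y > margin
```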
Use z-depth for the thumb. The thumb’s lateral movement is harder to detect when the hand faces the camera edge-on. Checking the z-coordinate difference between the thumb tip and MCP can help distinguish extended from tucked.
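A sketch of that z-check (z is MediaPipe's relative depth, smaller meaning closer to the camera; the margin is a rough guess to tune for your setup):

```python
def is_thumb_extended_z(landmarks, z_margin=0.05):
    # Thumb tip (landmark 4) vs thumb MCP (landmark 2): an extended thumb
    # usually sits noticeably closer to the camera (more negative z) than
    # its MCP when the hand is edge-on to the camera.
    return landmarks[2].z - landmarks[4].z > z_margin
```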
For more complex gestures like “rock on” or ASL letters, the rule-based approach gets unwieldy fast. At that point, train a small classifier (a random forest or single-layer MLP) on the 21 landmark coordinates. MediaPipe gives you 63 features (x, y, z per landmark), which is plenty for a lightweight model.
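A sketch of that route with scikit-learn. The feature layout matches the 63 numbers above, but the training set (X, y) is data you would record and label yourself:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def landmarks_to_features(landmarks):
    # Flatten 21 landmarks into a 63-dim vector: x, y, z per point.
    return np.array([[p.x, p.y, p.z] for p in landmarks]).ravel()

def train_gesture_classifier(X, y):
    # X: (n_samples, 63) landmark features; y: integer gesture labels.
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X, y)
    return clf
```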
Common Errors and Fixes
cv2.error: ... camera not found – OpenCV can’t open your webcam. Check that no other application is using it. On Linux, verify /dev/video0 exists. Try cv2.VideoCapture(1) if you have multiple cameras.
AttributeError: 'NoneType' object has no attribute 'multi_hand_landmarks' – This happens when hands.process() receives a bad frame. Always check ret from cap.read() before processing, and guard the drawing loop with if results.multi_hand_landmarks:, since that attribute is None whenever no hand is detected.
Thumb detection is wrong for left hand – The handedness label from MediaPipe is relative to the camera image, not the user. When you flip the frame with cv2.flip(frame, 1), the labels align with natural hand orientation. If you skip the flip, swap the x-comparison logic in is_thumb_extended.
Gesture flickers rapidly between two labels – Landmark positions have sub-pixel noise. Add the smoothing buffer shown above. Increasing min_tracking_confidence to 0.7 also helps, though it may cause more re-detections.
ImportError: cannot import name 'solutions' from 'mediapipe' – You likely have an outdated or broken mediapipe install. Run pip install --upgrade mediapipe and make sure you’re not in a virtual environment that has a conflicting version.
Performance is slow on CPU – MediaPipe Hands runs well on most CPUs, but if you’re processing high-res frames, resize before passing to hands.process(). A 640x480 input is more than sufficient for hand tracking and cuts processing time significantly compared to 1080p.
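For example, inside the capture loop (640x480 is an assumption that works for most webcams):

```python
# Downscale only the copy fed to MediaPipe; keep the full-res frame for display.
small = cv2.resize(frame, (640, 480))
results = hands.process(cv2.cvtColor(small, cv2.COLOR_BGR2RGB))

# Landmarks are normalized to [0, 1], so scale with the ORIGINAL frame size
# when drawing on the full-resolution image:
h, w = frame.shape[:2]
```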
Note that if you resize for processing, the normalized landmark coordinates still map correctly since they’re relative to the input dimensions. You’ll need to scale them back to the original frame size for drawing.
Related Guides
- How to Build a Real-Time Pose Estimation Pipeline with MediaPipe
- How to Build a Scene Text Recognition Pipeline with PaddleOCR
- How to Build a Document Comparison Pipeline with Vision Models
- How to Build a Lane Detection Pipeline with OpenCV and YOLO
- How to Build a Vehicle Counting Pipeline with YOLOv8 and OpenCV
- How to Build Video Analytics Pipelines with OpenCV and Deep Learning
- How to Build a Receipt Scanner with OCR and Structured Extraction
- How to Build a Video Shot Boundary Detection Pipeline with PySceneDetect
- How to Build a Video Surveillance Analytics Pipeline with YOLOv8
- How to Build a Face Recognition System with InsightFace and Python