MediaPipe’s face detection runs on CPU at 30+ FPS and works out of the box with two lines of setup. No GPU required, no model downloads, no ONNX runtime headaches. You get bounding boxes, confidence scores, and six facial keypoints per face. Here’s how to wire it up with OpenCV for real-time webcam detection.

pip install mediapipe opencv-python

That pulls in everything you need. MediaPipe bundles its own TFLite models internally, so there’s no separate weight file to manage.

How MediaPipe Face Detection Works

MediaPipe ships two face detection models under the hood, and you pick one with a single integer:

  • Model 0 (short-range) — optimized for faces within 2 meters of the camera. Best for selfie-style webcam setups. This is the default.
  • Model 1 (full-range) — handles faces up to 5 meters away. Use this for surveillance-style cameras or when people are farther from the lens.

Both models return a bounding box and six keypoints per detected face: right eye, left eye, nose tip, mouth center, right ear tragion, and left ear tragion. The detection pipeline runs a single-shot detector (SSD) with a custom encoder, which is why it’s so fast on CPU.

Here’s a minimal script that detects faces in a single image:

import cv2
import mediapipe as mp

mp_face_detection = mp.solutions.face_detection
mp_drawing = mp.solutions.drawing_utils

image = cv2.imread("photo.jpg")
rgb_image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

with mp_face_detection.FaceDetection(
    model_selection=0, min_detection_confidence=0.5
) as face_detection:
    results = face_detection.process(rgb_image)

    if results.detections:
        for detection in results.detections:
            mp_drawing.draw_detection(image, detection)
        print(f"Found {len(results.detections)} face(s)")
    else:
        print("No faces detected")

cv2.imwrite("output.jpg", image)

Feed it any image with a face and you’ll get bounding boxes drawn on the output. The model_selection=0 picks the short-range model. Swap to 1 if your subjects are far from the camera.
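If you want the actual face pixels rather than just a drawn box, convert the detection's normalized bounding box to pixel coordinates and crop. Here's a sketch — rel_bbox_to_pixels is our own helper name, not part of MediaPipe, but the xmin/ymin/width/height fields are the real attributes of relative_bounding_box:

```python
# Our own helper: convert MediaPipe's normalized bbox fields into a clamped
# pixel rectangle so you can slice the face out of the original image.
def rel_bbox_to_pixels(xmin, ymin, width, height, img_w, img_h):
    """Convert a normalized (0.0-1.0) bounding box to pixel (x, y, w, h)."""
    x = max(0, int(xmin * img_w))
    y = max(0, int(ymin * img_h))
    w = min(img_w - x, int(width * img_w))
    h = min(img_h - y, int(height * img_h))
    return x, y, w, h

# Inside the detection loop you could then do:
# bbox = detection.location_data.relative_bounding_box
# x, y, w, h = rel_bbox_to_pixels(
#     bbox.xmin, bbox.ymin, bbox.width, bbox.height,
#     image.shape[1], image.shape[0],
# )
# face_crop = image[y:y + h, x:x + w]
```

The clamping matters because faces near the frame edge can produce boxes that poke slightly outside the image, and a negative slice index would crop the wrong region.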

Real-Time Webcam Face Detection

The single-image version is useful for testing, but the real payoff is live detection. Here’s a complete webcam loop that draws bounding boxes and confidence scores on every frame:

import cv2
import mediapipe as mp

mp_face_detection = mp.solutions.face_detection
mp_drawing = mp.solutions.drawing_utils

cap = cv2.VideoCapture(0)
cap.set(cv2.CAP_PROP_FRAME_WIDTH, 640)
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 480)

with mp_face_detection.FaceDetection(
    model_selection=0, min_detection_confidence=0.5
) as face_detection:

    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break

        rgb_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        results = face_detection.process(rgb_frame)

        if results.detections:
            for detection in results.detections:
                mp_drawing.draw_detection(frame, detection)

                # Extract and display confidence score
                score = detection.score[0]
                bbox = detection.location_data.relative_bounding_box
                h, w, _ = frame.shape
                x = int(bbox.xmin * w)
                y = int(bbox.ymin * h)
                cv2.putText(
                    frame,
                    f"{score:.2f}",
                    (x, y - 10),
                    cv2.FONT_HERSHEY_SIMPLEX,
                    0.8,
                    (0, 255, 0),
                    2,
                )

        cv2.imshow("Face Detection", frame)
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break

cap.release()
cv2.destroyAllWindows()

Press q to quit. The confidence score floats above each bounding box so you can see how sure the model is about each detection.

Extracting Face Keypoints

Beyond bounding boxes, each detection gives you six facial landmarks. These are normalized coordinates (0.0 to 1.0), so you multiply by frame dimensions to get pixel positions. This is useful for head pose estimation, gaze direction, or feeding crops into a face recognition model.

import cv2
import mediapipe as mp

mp_face_detection = mp.solutions.face_detection

KEYPOINT_NAMES = [
    "right_eye",
    "left_eye",
    "nose_tip",
    "mouth_center",
    "right_ear_tragion",
    "left_ear_tragion",
]

cap = cv2.VideoCapture(0)

with mp_face_detection.FaceDetection(
    model_selection=0, min_detection_confidence=0.6
) as face_detection:

    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break

        h, w, _ = frame.shape
        rgb_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        results = face_detection.process(rgb_frame)

        if results.detections:
            for detection in results.detections:
                for idx, keypoint in enumerate(
                    detection.location_data.relative_keypoints
                ):
                    cx = int(keypoint.x * w)
                    cy = int(keypoint.y * h)
                    cv2.circle(frame, (cx, cy), 4, (0, 255, 255), -1)
                    cv2.putText(
                        frame,
                        KEYPOINT_NAMES[idx],
                        (cx + 5, cy - 5),
                        cv2.FONT_HERSHEY_SIMPLEX,
                        0.4,
                        (0, 255, 255),
                        1,
                    )

        cv2.imshow("Keypoints", frame)
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break

cap.release()
cv2.destroyAllWindows()

Each keypoint is a RelativeKeypoint with .x and .y fields. The six points give you enough geometry to estimate rough head orientation or to align face crops before passing them to a recognition model.
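As a quick example of that geometry, the two eye keypoints alone give you the head's roll (in-plane tilt). A minimal sketch — roll_degrees is our own helper, and it assumes the keypoint order listed above (index 0 is the right eye, index 1 the left eye):

```python
import math

def roll_degrees(right_eye, left_eye):
    """Angle of the eye line relative to horizontal, in degrees.

    Both points are (x, y) in pixel coordinates. Convert normalized
    keypoints to pixels first (multiply by frame width and height) so a
    non-square frame doesn't skew the angle.
    """
    dx = left_eye[0] - right_eye[0]
    dy = left_eye[1] - right_eye[1]
    # Image y grows downward, so a lower left eye gives a positive angle.
    return math.degrees(math.atan2(dy, dx))

# Level eyes -> 0 degrees; left eye 100px lower than right -> 45 degrees.
print(roll_degrees((100, 100), (200, 100)))  # 0.0
print(roll_degrees((100, 100), (200, 200)))  # 45.0
```

The same angle is what you'd rotate a face crop by (in the opposite direction) to level the eyes before handing it to a recognition model.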

Performance Tips

MediaPipe face detection is already fast, but you can squeeze more out of it for edge deployments or older hardware.

Reduce resolution. The model internally resizes input anyway. Feeding 640x480 instead of 1920x1080 cuts preprocessing time significantly:

cap.set(cv2.CAP_PROP_FRAME_WIDTH, 640)
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 480)

Skip frames. If you don’t need detection on every single frame, process every Nth frame and reuse the last result for the skipped ones:

frame_count = 0
last_detections = None

while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break

    frame_count += 1
    if frame_count % 3 == 0:
        rgb_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        results = face_detection.process(rgb_frame)
        last_detections = results.detections

    if last_detections:
        for detection in last_detections:
            mp_drawing.draw_detection(frame, detection)

    cv2.imshow("Face Detection", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

This processes every 3rd frame while still displaying all frames smoothly. You trade a bit of detection latency for a ~3x reduction in compute.

Raise the confidence threshold. Setting min_detection_confidence=0.7 or higher reduces false positives and slightly speeds up post-processing since fewer detections get returned. The default 0.5 is a good balance, but bump it up if you’re getting phantom faces in complex backgrounds.

Use model 0 unless you need range. The short-range model is faster than full-range. Only switch to model_selection=1 when your camera is mounted far from people.
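To actually compare these tips on your machine, time the detector over a batch of synthetic frames. This harness is our own sketch (measure_fps is not a MediaPipe utility); it accepts any per-frame function, so you can benchmark different resolutions and models side by side:

```python
import time

import numpy as np

# Our own benchmark helper: generate random frames up front, then time only
# the per-frame processing so frame creation doesn't pollute the measurement.
def measure_fps(process_frame, width=640, height=480, n_frames=100):
    """Return frames-per-second achieved by `process_frame` on random frames."""
    frames = [
        np.random.randint(0, 255, (height, width, 3), dtype=np.uint8)
        for _ in range(n_frames)
    ]
    start = time.perf_counter()
    for frame in frames:
        process_frame(frame)
    elapsed = time.perf_counter() - start
    return n_frames / elapsed

# Usage with the detector (assumes `face_detection` is an open
# FaceDetection instance, as in the scripts above):
# fps = measure_fps(lambda f: face_detection.process(f))
# print(f"{fps:.1f} FPS at 640x480")
```

Random noise won't contain faces, so this measures raw inference throughput rather than detection quality — which is exactly what you want when comparing resolution or model settings.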

Common Errors and Fixes

cv2.error: ... Can't open camera by index — Your webcam isn’t available at index 0. Try cv2.VideoCapture(1) or cv2.VideoCapture(2). On Linux, check that /dev/video0 exists. If you’re in a VM or WSL, webcam passthrough may need extra configuration.

AttributeError: 'NoneType' object has no attribute 'detections' — This means results itself is None, which usually traces back to a None image further up: cv2.imread returns None (with no exception) for a missing or unreadable path, and cap.read() can hand back None frames, so a failure there can leave results unset or None. Note that passing a BGR image won't trigger this — MediaPipe will process a BGR array without complaint, it just detects less reliably — so the cv2.cvtColor(frame, cv2.COLOR_BGR2RGB) conversion is about accuracy, not about avoiding this error.

Detections flicker or disappear between frames — Lower the min_detection_confidence to 0.3 or 0.4. The default 0.5 can miss faces at oblique angles or in poor lighting. You can also try switching to model_selection=1 for better detection at distance.

ImportError: cannot import name 'solutions' from 'mediapipe' — You likely have a stale or partial install. Run pip install --upgrade mediapipe to get the latest version. If you’re on Apple Silicon, make sure you’re using Python 3.9+ since older builds had compatibility issues.

Bounding boxes are offset or misaligned — The relative bounding box coordinates assume the image hasn’t been flipped. If you’re using cv2.flip(frame, 1) for a mirror effect, apply the flip before passing the frame to MediaPipe, not after. Otherwise the coordinates won’t match the displayed frame.
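If you can't reorder the flip for some reason, you can mirror the coordinates yourself instead. Because MediaPipe's boxes are normalized to 0.0–1.0, a horizontal mirror is simple arithmetic — mirror_bbox_x is our own helper, not a MediaPipe API:

```python
# Our own helper: mirror a normalized bounding box horizontally, matching
# what cv2.flip(frame, 1) does to the pixels. The box's left edge moves to
# 1.0 minus its old right edge (xmin + width).
def mirror_bbox_x(xmin, width):
    """Return the new xmin after a horizontal flip; width is unchanged."""
    return 1.0 - xmin - width

print(mirror_bbox_x(0.25, 0.25))  # 0.5
```

The same idea works for keypoints: a keypoint at normalized x becomes 1.0 - x after the flip, while y is untouched. Still, flipping before detection is the simpler fix.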