The Quick Version

Talking head generation takes a still portrait and an audio clip and produces a video in which the person appears to speak the audio. SadTalker is one of the most accessible open-source tools for this — it handles head motion, facial expressions, and lip sync from a single image.

# Clone and install SadTalker
git clone https://github.com/OpenTalker/SadTalker.git
cd SadTalker
pip install -r requirements.txt

# Download pretrained models
bash scripts/download_models.sh
import os
import subprocess

def generate_talking_head(
    image_path: str,
    audio_path: str,
    result_dir: str = "./results",
    enhancer: str = "gfpgan",
) -> str:
    """Generate a talking head video from a portrait and audio."""
    # Use absolute paths: the command runs with cwd="SadTalker",
    # so relative paths would resolve inside that repo instead.
    cmd = [
        "python", "inference.py",
        "--driven_audio", os.path.abspath(audio_path),
        "--source_image", os.path.abspath(image_path),
        "--result_dir", result_dir,
        "--enhancer", enhancer,
        "--still",              # minimal head movement (more stable)
        "--preprocess", "crop",
    ]

    result = subprocess.run(cmd, capture_output=True, text=True, cwd="SadTalker")
    if result.returncode != 0:
        raise RuntimeError(f"SadTalker failed: {result.stderr}")

    # SadTalker writes a timestamped .mp4 into result_dir
    print(f"Results written to: {result_dir}")
    return result_dir

# Generate a talking head video
generate_talking_head(
    image_path="portrait.jpg",
    audio_path="speech.wav",
)

The --still flag produces subtle head motion that looks natural without the exaggerated movements that can look uncanny. The --enhancer gfpgan flag upscales and sharpens the face for better quality output.

Generating Audio from Text First

If you don’t have audio, generate it with a TTS model and then feed it to the talking head generator:

from openai import OpenAI
import os
import subprocess

client = OpenAI()

def text_to_talking_head(
    text: str,
    image_path: str,
    voice: str = "alloy",
    result_dir: str = "./results",
) -> str:
    """Generate a talking head video from text and a portrait."""

    # Step 1: Generate speech audio
    audio_response = client.audio.speech.create(
        model="tts-1-hd",
        voice=voice,
        input=text,
    )
    audio_path = os.path.abspath("temp_speech.mp3")
    audio_response.stream_to_file(audio_path)
    print(f"Generated audio: {audio_path}")

    # Step 2: Generate talking head video.  Absolute paths, because
    # the command runs with cwd="SadTalker".
    cmd = [
        "python", "inference.py",
        "--driven_audio", audio_path,
        "--source_image", os.path.abspath(image_path),
        "--result_dir", result_dir,
        "--enhancer", "gfpgan",
        "--still",
        "--preprocess", "crop",
    ]

    result = subprocess.run(cmd, capture_output=True, text=True, cwd="SadTalker")
    if result.returncode != 0:
        raise RuntimeError(f"Generation failed: {result.stderr}")

    # SadTalker writes a timestamped .mp4 into result_dir
    return result_dir

# Create a video of someone explaining a concept
text_to_talking_head(
    text="Neural networks learn by adjusting weights through backpropagation. "
         "Each layer transforms the input data, and the network gradually learns "
         "to make better predictions through thousands of training iterations.",
    image_path="professor.jpg",
    voice="onyx",
)

LivePortrait for Expression Control

LivePortrait gives you finer control over facial expressions. Instead of audio-driven animation, you drive the portrait with a reference video of facial movements.

import subprocess
import os

def liveportrait_animate(
    source_image: str,
    driving_video: str,
    output_dir: str = "./liveportrait_output",
) -> str:
    """Animate a portrait using another video's facial movements."""
    # Use absolute paths: the command runs with cwd="LivePortrait",
    # so relative paths would resolve inside that repo instead.
    output_dir = os.path.abspath(output_dir)
    os.makedirs(output_dir, exist_ok=True)

    cmd = [
        "python", "inference.py",
        "--source", os.path.abspath(source_image),
        "--driving", os.path.abspath(driving_video),
        "--output-dir", output_dir,
        "--flag_relative_motion",    # relative motion transfer
        "--flag_do_crop",
        "--flag_pasteback",          # paste animated face back onto original
    ]

    result = subprocess.run(
        cmd, capture_output=True, text=True, cwd="LivePortrait"
    )

    if result.returncode != 0:
        raise RuntimeError(f"LivePortrait failed: {result.stderr}")

    # Find the output video
    output_files = [f for f in os.listdir(output_dir) if f.endswith(".mp4")]
    return os.path.join(output_dir, output_files[0]) if output_files else ""

# Transfer expressions from a webcam recording to a portrait
result = liveportrait_animate(
    source_image="professional_headshot.jpg",
    driving_video="webcam_recording.mp4",
)
print(f"Output: {result}")

LivePortrait excels at transferring subtle expressions — eyebrow raises, smiles, head tilts — that make the output look natural. It’s better for expression transfer, while SadTalker is better for audio-driven lip sync.

Batch Processing Multiple Videos

Generate multiple talking head videos from a set of scripts:

import json
import subprocess
from pathlib import Path
from openai import OpenAI

client = OpenAI()

def batch_generate(
    scripts: list[dict],
    output_dir: str = "./batch_output",
) -> list[dict]:
    """Generate multiple talking head videos.

    scripts: [{"id": "intro", "text": "...", "image": "speaker.jpg", "voice": "alloy"}]
    """
    # Resolve to an absolute path so it survives the cwd="SadTalker" run
    output_path = Path(output_dir).resolve()
    output_path.mkdir(parents=True, exist_ok=True)
    results = []

    for script in scripts:
        try:
            # Generate TTS audio
            audio_response = client.audio.speech.create(
                model="tts-1-hd",
                voice=script.get("voice", "alloy"),
                input=script["text"],
            )
            audio_file = output_path / f"{script['id']}_audio.mp3"
            audio_response.stream_to_file(str(audio_file))

            # Generate video
            cmd = [
                "python", "inference.py",
                "--driven_audio", str(audio_file),
                "--source_image", str(Path(script["image"]).resolve()),
                "--result_dir", str(output_path / script["id"]),
                "--enhancer", "gfpgan",
                "--still",
                "--preprocess", "crop",
            ]

            subprocess.run(cmd, capture_output=True, text=True, cwd="SadTalker", check=True)

            results.append({"id": script["id"], "status": "success"})
            print(f"Generated {script['id']}")

        except Exception as e:
            results.append({"id": script["id"], "status": "failed", "error": str(e)})
            print(f"Failed {script['id']}: {e}")

    return results

scripts = [
    {"id": "welcome", "text": "Welcome to our platform. Let me show you around.", "image": "host.jpg"},
    {"id": "feature1", "text": "The dashboard gives you real-time analytics.", "image": "host.jpg"},
    {"id": "closing", "text": "Thanks for watching. Sign up to get started.", "image": "host.jpg"},
]

results = batch_generate(scripts)
print(json.dumps(results, indent=2))

Quality Optimization

Several settings dramatically affect output quality:

def high_quality_generation(image_path: str, audio_path: str) -> str:
    """Generate the highest quality talking head video."""
    cmd = [
        "python", "inference.py",
        "--driven_audio", os.path.abspath(audio_path),   # absolute: runs with cwd="SadTalker"
        "--source_image", os.path.abspath(image_path),
        "--result_dir", "./results_hq",
        "--enhancer", "gfpgan",       # face enhancement
        "--still",                     # stable head motion
        "--preprocess", "full",        # process full image, not just crop
        "--size", "512",               # higher resolution
        "--expression_scale", "1.0",   # natural expression intensity
        "--input_yaw", "0",            # keep face centered
        "--input_pitch", "0",
        "--input_roll", "0",
    ]

    subprocess.run(cmd, capture_output=True, text=True, cwd="SadTalker", check=True)
    return "./results_hq"

Tips for Best Results

Source image quality matters most. Use a high-resolution, well-lit, front-facing portrait. The face should be clearly visible with no occlusion. Professional headshots work best.

Audio quality affects lip sync. Clean, clear speech without background noise produces the most accurate lip movements. Pre-process noisy audio with a denoiser before feeding it in.
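
One way to do that pre-processing is with ffmpeg's built-in filters. This is a minimal sketch: the helper name `denoise_cmd` and the filter settings (an 80 Hz highpass plus the `afftdn` FFT denoiser with a -25 dB noise floor) are illustrative starting points, not tuned values.

```python
import subprocess

def denoise_cmd(src: str, dst: str = "clean_speech.wav") -> list[str]:
    """Build an ffmpeg command that denoises speech and converts it
    to 16 kHz mono, a format lip-sync models handle well."""
    return [
        "ffmpeg", "-y", "-i", src,
        # highpass trims low-frequency rumble; afftdn is ffmpeg's
        # FFT-based denoiser (nf = noise floor in dB)
        "-af", "highpass=f=80,afftdn=nf=-25",
        "-ar", "16000", "-ac", "1", dst,
    ]

# subprocess.run(denoise_cmd("noisy_speech.mp3"), check=True)
```

For heavily degraded audio, a dedicated speech-enhancement model will do better than generic spectral denoising, but this is often enough to clean up room noise.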

Match the portrait style to the use case. Photorealistic portraits work for business videos. Illustrated or stylized portraits work for creative content and avoid uncanny valley issues.

Common Errors and Fixes

Lips don’t sync with audio

The audio might have long silences or background music that confuses the lip-sync model. Strip the audio to speech only, and make sure it is 16 kHz mono — some models expect a specific format:

ffmpeg -i input.mp3 -ar 16000 -ac 1 output.wav

Face looks distorted or warped

The source image face is at too extreme an angle. Use a near-frontal photo (within 15 degrees of center). Also check that the face detection finds the correct face — if there are multiple faces in the image, crop to just the subject.
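
Cropping to the subject can be scripted. The sketch below assumes a face detector (e.g., OpenCV's Haar cascade) has already returned an (x, y, w, h) box; the helper `expand_face_box` is a hypothetical name that just pads the box so the crop keeps some hair and shoulders for context.

```python
def expand_face_box(
    x: int, y: int, w: int, h: int,
    img_w: int, img_h: int, margin: float = 0.4,
) -> tuple[int, int, int, int]:
    """Expand a detected face box by a margin and clamp it to the image."""
    dx, dy = int(w * margin), int(h * margin)
    x0, y0 = max(0, x - dx), max(0, y - dy)
    x1, y1 = min(img_w, x + w + dx), min(img_h, y + h + dy)
    return x0, y0, x1, y1

# With a detector such as OpenCV's Haar cascade:
# faces = cv2.CascadeClassifier(cascade_path).detectMultiScale(gray)
# x0, y0, x1, y1 = expand_face_box(*faces[0], img_w, img_h)
# cropped = image[y0:y1, x0:x1]
```

A margin around 40% usually keeps enough context without pulling in a second face.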

Output video is choppy or low framerate

Default output is 25 FPS. For smoother results, interpolate frames with RIFE:

pip install rife-ncnn-vulkan-python
# Or use ffmpeg: ffmpeg -i output.mp4 -filter:v "minterpolate=fps=60" smooth.mp4

CUDA out of memory

Reduce the --size parameter from 512 to 256. Or use --preprocess crop instead of full to process only the face region. The GFPGAN enhancer also uses significant VRAM — disable it with --enhancer none if needed.
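
Those trade-offs can be wrapped in a small heuristic. The VRAM thresholds below are rough assumptions, not measured requirements — adjust them for your GPU and SadTalker version.

```python
def low_vram_flags(vram_gb: float) -> list[str]:
    """Pick SadTalker flags that should fit the available VRAM (heuristic)."""
    flags = ["--still", "--preprocess", "crop"]  # crop = face region only
    if vram_gb >= 8:
        flags += ["--size", "512", "--enhancer", "gfpgan"]
    elif vram_gb >= 4:
        flags += ["--size", "256", "--enhancer", "gfpgan"]
    else:
        flags += ["--size", "256", "--enhancer", "none"]  # skip GFPGAN
    return flags
```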

Uncanny valley — output looks creepy

This usually happens with extreme expressions or mismatched head motion. Use --still for minimal head movement and reduce --expression_scale to 0.8. For critical use cases, review generated videos before publishing.

Ethical Considerations

Talking head generation is a dual-use technology. Always disclose when content is AI-generated. Never create content that impersonates real people without their explicit consent. Many jurisdictions have laws against deepfakes used for deception — know the regulations in your area before deploying.
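
One concrete way to disclose is to burn a visible label into the output with ffmpeg's drawtext filter. A minimal sketch — the helper name and styling are illustrative:

```python
import subprocess

def disclosure_cmd(src: str, dst: str = "labeled.mp4",
                   label: str = "AI-generated video") -> list[str]:
    """Build an ffmpeg command that burns an AI-disclosure label
    into the bottom-left corner of the video."""
    drawtext = (
        f"drawtext=text='{label}':x=10:y=h-th-10:"
        "fontsize=24:fontcolor=white:box=1:boxcolor=black@0.5"
    )
    # -c:a copy keeps the audio stream untouched
    return ["ffmpeg", "-y", "-i", src, "-vf", drawtext, "-c:a", "copy", dst]

# subprocess.run(disclosure_cmd("output.mp4"), check=True)
```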

For legitimate use cases (educational content, product demos, accessibility), talking head generation is a powerful tool that makes video content creation accessible to teams without video production capability.