ByteDance released Seedance 2.0 on February 8, 2026, and it immediately broke from the pack in one critical way: it generates audio and video simultaneously. Every other major video model — Sora 2, Kling, Runway Gen-3 — produces silent video and bolts audio on afterward. Seedance 2.0 uses a Dual-Branch Diffusion Transformer that synthesizes them in the same forward pass. The result is phoneme-accurate lip sync across 8+ languages without any post-processing step.
This guide covers how to call the API, what parameters actually matter, and how to get clean results across all four input modes.
## What Seedance 2.0 Can Do
The model outputs up to 2K resolution at 4–15 seconds per clip. Its multimodal input supports up to 9 reference images, 3 video clips, and 3 audio clips mixed together — 15 reference assets in a single request. The four primary input modes are:
- Text-to-video — prompt only, audio generated from scene context
- Image-to-video — one or more reference images with a driving prompt
- Audio-conditioned video — lip sync driven by a supplied audio clip
- Omni-reference — mixed images + video + audio with asset tags in the prompt
The model also supports first-frame and last-frame anchoring, letting you control entry and exit frames for multi-shot continuity.
Aspect ratios available: 16:9, 9:16, 4:3, 3:4, 21:9, 1:1.
Two speed tiers are available: `seedance_2.0` (standard quality, roughly 90–120 seconds per generation) and `seedance_2.0_fast` (roughly 30–60 seconds, with slightly reduced quality).
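The per-request asset caps above (9 reference images, 3 video clips, 3 audio clips) can be enforced client-side before spending an API call. A minimal sketch; the helper name is my own, not part of any SDK:

```python
def validate_assets(images=(), videos=(), audios=()) -> None:
    """Fail fast if a request would exceed Seedance 2.0's per-request caps
    (9 reference images, 3 video clips, 3 audio clips)."""
    caps = {"images": (images, 9), "videos": (videos, 3), "audios": (audios, 3)}
    for name, (assets, cap) in caps.items():
        if len(assets) > cap:
            raise ValueError(f"{name}: {len(assets)} supplied, limit is {cap}")
```

Call it just before submitting so an over-limit request fails locally instead of coming back as a provider-side error.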
## API Authentication and Setup
The official Volcengine/Volcano Ark API launches February 24, 2026. In the meantime, fal.ai and several OpenAI-compatible proxy platforms expose the same model. All examples below use the async job pattern these providers share.
Install the dependencies:
```shell
pip install requests python-dotenv
```
Set your credentials in a `.env` file:
```
SEEDANCE_API_KEY=sk-your-api-key-here
SEEDANCE_API_BASE=https://api.fal.ai/v1
# Or use: https://ark.cn-beijing.volces.com/api/v3 (Volcengine, when live)
```
The core client setup:
```python
import os
import time

import requests
from dotenv import load_dotenv

load_dotenv()

API_KEY = os.environ["SEEDANCE_API_KEY"]
API_BASE = os.environ.get("SEEDANCE_API_BASE", "https://api.fal.ai/v1")

HEADERS = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json",
}


def submit_job(payload: dict) -> str:
    """Submit a video generation request and return the task ID."""
    resp = requests.post(
        f"{API_BASE}/video/generations",
        headers=HEADERS,
        json=payload,
        timeout=30,
    )
    resp.raise_for_status()
    data = resp.json()
    return data["id"]


def poll_job(task_id: str, interval: int = 5, max_wait: int = 300) -> dict:
    """Poll until the job completes or times out. Returns the completed job data."""
    deadline = time.time() + max_wait
    while time.time() < deadline:
        resp = requests.get(
            f"{API_BASE}/video/generations/{task_id}",
            headers=HEADERS,
            timeout=10,
        )
        resp.raise_for_status()
        data = resp.json()
        status = data.get("status")
        if status == "succeeded":
            return data
        if status == "failed":
            raise RuntimeError(f"Generation failed: {data.get('error', 'unknown error')}")
        time.sleep(interval)
    raise TimeoutError(f"Job {task_id} did not complete within {max_wait}s")


def download_video(url: str, output_path: str) -> None:
    """Download the generated video to disk."""
    resp = requests.get(url, stream=True, timeout=60)
    resp.raise_for_status()
    with open(output_path, "wb") as f:
        for chunk in resp.iter_content(chunk_size=8192):
            f.write(chunk)
    print(f"Saved: {output_path}")
```
## Text-to-Video with Native Audio
The simplest call — provide a prompt, let the model generate both video and synchronized audio from the scene context:
```python
def text_to_video(
    prompt: str,
    aspect_ratio: str = "16:9",
    duration: int = 5,
    resolution: str = "1080p",
    speed: str = "seedance_2.0",
) -> str:
    """Generate a video from a text prompt. Returns local file path."""
    payload = {
        "model": speed,              # "seedance_2.0" or "seedance_2.0_fast"
        "prompt": prompt,
        "aspect_ratio": aspect_ratio,
        "duration": duration,        # 4–15 seconds
        "resolution": resolution,    # "480p", "720p", "1080p", "2k"
        "audio": True,               # Enable native audio generation
    }
    print(f"Submitting: {prompt[:60]}...")
    task_id = submit_job(payload)
    print(f"Task ID: {task_id} — polling...")
    result = poll_job(task_id)
    video_url = result["output"]["url"]
    output_path = f"output_{task_id[:8]}.mp4"
    download_video(video_url, output_path)
    return output_path


if __name__ == "__main__":
    path = text_to_video(
        prompt=(
            "A jazz pianist performs at a smoky club in 1950s New York. "
            "Close-up on hands dancing over ivory keys. "
            "Ambient crowd murmur, clinking glasses, and live piano music."
        ),
        aspect_ratio="16:9",
        duration=8,
        resolution="2k",
    )
    print(f"Video saved to: {path}")
```
The `audio: True` flag is what unlocks native generation. Set it to `False` and you get a silent video clip, which is useful if you're supplying your own audio track in a later step.
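If you do generate silent clips and add your own track afterward, the mux step can be done with ffmpeg. A sketch, assuming `ffmpeg` is on your PATH; the helper names are my own:

```python
import subprocess


def build_mux_command(video_path: str, audio_path: str, output_path: str) -> list[str]:
    """ffmpeg argument list: copy the video stream untouched, encode the
    audio as AAC, and stop at the shorter input so the clip ends cleanly."""
    return [
        "ffmpeg", "-y",
        "-i", video_path,
        "-i", audio_path,
        "-c:v", "copy", "-c:a", "aac", "-shortest",
        output_path,
    ]


def mux_audio(video_path: str, audio_path: str, output_path: str) -> None:
    """Combine a silent video with a separate audio file."""
    subprocess.run(build_mux_command(video_path, audio_path, output_path), check=True)
```

Because `-c:v copy` avoids re-encoding the video stream, the mux is fast and lossless on the visual side.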
## Image-to-Video with Lip Sync
This is where Seedance 2.0 really pulls ahead. Provide a portrait image and an audio clip, and the model generates a talking-head video with phoneme-accurate lip sync — no separate lip-sync step required.
To reference a local file, encode it as a base64 data URI or upload it to a URL first. Most providers accept a public URL directly:
```python
import base64
from pathlib import Path


def encode_image(image_path: str) -> str:
    """Encode a local image as a base64 data URI."""
    suffix = Path(image_path).suffix.lstrip(".")
    mime = f"image/{suffix}" if suffix != "jpg" else "image/jpeg"
    with open(image_path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime};base64,{encoded}"


def image_to_video_with_lipsync(
    image_path: str,
    audio_url: str,
    prompt: str,
    duration: int = 8,
    language: str = "en",
) -> str:
    """
    Generate a lip-synced talking-head video.

    Args:
        image_path: Path to portrait image (JPEG or PNG)
        audio_url: Public URL to the driving audio clip (WAV or MP3)
        prompt: Scene description — keeps the background and camera consistent
        duration: 4–15 seconds (should match audio clip length)
        language: ISO 639-1 code: "en", "zh", "es", "fr", "de", "ja", "ko", "pt"
    """
    image_data = encode_image(image_path)
    payload = {
        "model": "seedance_2.0",
        "prompt": prompt,
        "images": [image_data],   # Up to 9 images accepted
        "audios": [audio_url],    # Up to 3 audio clips accepted
        "duration": duration,
        "resolution": "1080p",
        "audio": True,            # Keep True — drives native lip sync
        "lipsync_language": language,
    }
    task_id = submit_job(payload)
    print(f"Lip-sync job: {task_id}")
    result = poll_job(task_id, max_wait=180)
    output_path = f"lipsync_{task_id[:8]}.mp4"
    download_video(result["output"]["url"], output_path)
    return output_path


# Example usage
path = image_to_video_with_lipsync(
    image_path="portrait.jpg",
    audio_url="https://example.com/speech_clip.wav",
    prompt=(
        "A professional speaker at a conference podium, clean white "
        "background, direct eye contact with camera."
    ),
    duration=10,
    language="en",
)
```
The `lipsync_language` parameter selects the phoneme table for the target language. Without it, the model defaults to English phoneme detection, which produces obvious desync on Chinese, Spanish, or Japanese speech.
## Omni-Reference Mode

Omni-reference mode lets you tag assets directly in the prompt string using `@image_file_1`, `@video_file_1`, and `@audio_file_1` syntax. This gives you director-level control over which visual reference drives which part of the scene.
```python
def omni_reference_video(
    prompt: str,
    images: list[str],   # List of public image URLs
    videos: list[str],   # List of public video URLs
    audios: list[str],   # List of public audio URLs
    aspect_ratio: str = "16:9",
    duration: int = 10,
) -> str:
    """
    Generate video using Omni-reference: mixed image, video, and audio inputs.

    In the prompt, reference assets as @image_file_1, @image_file_2,
    @video_file_1, @audio_file_1 etc. The model uses these to anchor
    the corresponding visual and audio elements.

    Example prompt:
        "The person in @image_file_1 walks into the scene from @video_file_1.
        Background music from @audio_file_1."
    """
    payload = {
        "model": "seedance_2.0",
        "prompt": prompt,
        "images": images,   # Up to 9
        "videos": videos,   # Up to 3
        "audios": audios,   # Up to 3
        "aspect_ratio": aspect_ratio,
        "duration": duration,
        "resolution": "1080p",
        "audio": True,
    }
    task_id = submit_job(payload)
    print(f"Omni-reference job: {task_id}")
    result = poll_job(task_id, max_wait=240)
    output_path = f"omni_{task_id[:8]}.mp4"
    download_video(result["output"]["url"], output_path)
    return output_path


# Example: product demo combining brand image, reference motion, and voiceover
path = omni_reference_video(
    prompt=(
        "The product from @image_file_1 rotates on a sleek surface, "
        "mimicking the motion style from @video_file_1. "
        "Voiceover from @audio_file_1 plays synchronized with the visuals."
    ),
    images=["https://cdn.example.com/product_hero.jpg"],
    videos=["https://cdn.example.com/motion_ref.mp4"],
    audios=["https://cdn.example.com/voiceover.mp3"],
    duration=12,
)
```
## Director-Level Parameters
Beyond the basics, these parameters give you fine-grained control over the cinematic output:
| Parameter | Values | Effect |
|---|---|---|
| `model` | `seedance_2.0`, `seedance_2.0_fast` | Quality vs. speed tradeoff |
| `resolution` | `480p`, `720p`, `1080p`, `2k` | Output resolution |
| `duration` | 4–15 (int) | Clip length in seconds |
| `aspect_ratio` | `16:9`, `9:16`, `4:3`, `3:4`, `21:9`, `1:1` | Frame dimensions |
| `audio` | `true`, `false` | Enable native audio generation |
| `lipsync_language` | `en`, `zh`, `es`, `fr`, `de`, `ja`, `ko`, `pt` | Phoneme table for lip sync |
| `first_frame_image` | image URL or data URI | Anchor the opening frame |
| `last_frame_image` | image URL or data URI | Anchor the closing frame |
| `negative_prompt` | text string | What to avoid in generation |
The `first_frame_image` and `last_frame_image` parameters are particularly useful for multi-shot productions: you can chain clips by reusing the last frame of clip N as the first frame of clip N+1, maintaining visual continuity without stitching artifacts.
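One way to wire that chain is a driver loop that carries each clip's closing frame forward. This is a sketch with the provider-specific parts injected: `generate` wraps your submit/poll pair, and `extract_last_frame` must return a frame URL or data URI for the finished clip (some providers may not return one, in which case you would grab the final frame from the downloaded video yourself, e.g. with ffmpeg):

```python
def chain_clips(prompts, generate, extract_last_frame):
    """Generate clips in sequence, anchoring each clip's opening frame on
    the previous clip's closing frame for multi-shot continuity.

    generate(payload) -> result        : submit + poll one job
    extract_last_frame(result) -> str  : frame URL/data URI for that clip
    """
    results = []
    prev_frame = None
    for prompt in prompts:
        payload = {"model": "seedance_2.0", "prompt": prompt, "audio": True}
        if prev_frame is not None:
            payload["first_frame_image"] = prev_frame  # anchor on prior clip
        result = generate(payload)
        results.append(result)
        prev_frame = extract_last_frame(result)
    return results
```

The first clip runs unanchored; every later clip opens exactly where its predecessor ended.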
## Batch Generation with Error Handling
Production workflows need retry logic and concurrent submissions. Here’s a pattern for batch generation:
```python
import concurrent.futures
from dataclasses import dataclass


@dataclass
class VideoJob:
    prompt: str
    output_name: str
    aspect_ratio: str = "16:9"
    duration: int = 5
    resolution: str = "1080p"


def run_job(job: VideoJob) -> tuple[str, str]:
    """Submit and await a single video job. Returns (output_name, file_path)."""
    payload = {
        "model": "seedance_2.0_fast",
        "prompt": job.prompt,
        "aspect_ratio": job.aspect_ratio,
        "duration": job.duration,
        "resolution": job.resolution,
        "audio": True,
    }
    try:
        task_id = submit_job(payload)
        result = poll_job(task_id, max_wait=180)
        path = f"{job.output_name}.mp4"
        download_video(result["output"]["url"], path)
        return job.output_name, path
    except Exception as e:
        print(f"Failed job '{job.output_name}': {e}")
        return job.output_name, ""


def batch_generate(jobs: list[VideoJob], max_workers: int = 3) -> dict[str, str]:
    """
    Run multiple video jobs concurrently.
    Returns dict of {output_name: file_path}.
    """
    results = {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(run_job, job): job for job in jobs}
        for future in concurrent.futures.as_completed(futures):
            name, path = future.result()
            results[name] = path
            status = "OK" if path else "FAILED"
            print(f"[{status}] {name}")
    return results


# Example batch run
jobs = [
    VideoJob("A chef plating a dish in a Michelin-star kitchen, ambient kitchen sounds.", "scene_01"),
    VideoJob("A street musician plays violin in Paris rain, realistic ambient sound.", "scene_02", duration=10),
    VideoJob("Time-lapse of city skyline at sunset, traffic hum below.", "scene_03", aspect_ratio="21:9"),
]
results = batch_generate(jobs, max_workers=2)
print(results)
```
Keep `max_workers` at 2–3 unless you have confirmed higher concurrency limits on your API tier.
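When a batch does hit rate limits, exponential backoff usually recovers it. A generic retry wrapper you could put around `submit_job` or `poll_job` calls (the helper is my own sketch, retrying on any exception; a stricter version would inspect the HTTP status code):

```python
import random
import time


def with_backoff(fn, retries: int = 4, base_delay: float = 2.0):
    """Call fn(), retrying on exceptions with exponentially growing, jittered
    delays (roughly base_delay * 2**attempt, scaled by a random factor in
    [0.5, 1.5)). Re-raises the last exception after the final attempt."""
    for attempt in range(retries):
        try:
            return fn()
        except Exception:
            if attempt == retries - 1:
                raise
            delay = base_delay * (2 ** attempt) * (0.5 + random.random())
            time.sleep(delay)
```

Usage inside `run_job` would look like `result = with_backoff(lambda: poll_job(task_id, max_wait=180))`. The jitter keeps parallel workers from retrying in lockstep and re-triggering the same 429.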
## Seedance 2.0 vs. Sora 2 vs. Kling vs. Runway Gen-3
| Feature | Seedance 2.0 | Sora 2 | Kling 1.6 | Runway Gen-3 Alpha |
|---|---|---|---|---|
| Max resolution | 2K | 1080p | 1080p | 1080p |
| Native audio | Yes (simultaneous) | No (post-process) | No | No |
| Lip sync | Phoneme-level, 8+ langs | No | Separate step | No |
| Max duration | 15s | 20s | 30s | 10s |
| Multimodal input | 9 img + 3 vid + 3 audio | Image + text | Image + text | Image + text |
| Pricing (1080p/min) | ~$0.10–$0.40 | ~$0.15 (est.) | ~$0.08 | ~$0.10 |
| API availability | Feb 24, 2026 (fal.ai now) | Limited access | GA | GA |
The native audio advantage is significant for any workflow involving spoken content. Running Sora or Kling + a separate lip-sync step (SadTalker, Wav2Lip, or a commercial service) typically adds latency, introduces alignment drift, and requires two API budgets. Seedance 2.0 collapses that to a single request.
The 2K ceiling also matters for anything destined for large-format display. Most competitors cap at 1080p.
## Common Errors
**429 Too Many Requests** — You've exceeded your tier's concurrency limit. Add exponential backoff around your polling calls and reduce `max_workers` in batch runs.

**Job stuck in `processing` state** — Omni-reference jobs with multiple large assets can take 3–5 minutes. Increase `max_wait` to 360 seconds for these jobs.

**Lip sync desync on non-English audio** — Always set `lipsync_language` explicitly. The model defaults to English phoneme detection; without this flag, accented speech or non-Latin scripts will show visible mismatch.

**`resolution: 2k` not available** — 2K is a paid-tier feature on most proxy platforms. Fall back to `1080p` or check your plan limits.

**Malformed base64 image** — The data URI prefix must match the actual format: `data:image/jpeg;base64,...` for JPEGs, `data:image/png;base64,...` for PNGs. Mismatched MIME types cause silent failures.
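One way to rule out MIME mismatches entirely is to sniff the file's magic bytes instead of trusting its extension. A sketch covering only JPEG and PNG; the helper names are my own:

```python
import base64


def sniff_mime(data: bytes) -> str:
    """Pick the MIME type from the file's magic bytes, not its name."""
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    if data.startswith(b"\xff\xd8\xff"):
        return "image/jpeg"
    raise ValueError("Unsupported image format: expected JPEG or PNG")


def encode_image_safe(image_path: str) -> str:
    """Build a data URI whose MIME prefix is guaranteed to match the bytes."""
    with open(image_path, "rb") as f:
        data = f.read()
    encoded = base64.b64encode(data).decode("utf-8")
    return f"data:{sniff_mime(data)};base64,{encoded}"
```

This catches the common failure case of a PNG that was renamed to `.jpg`, which the extension-based `encode_image` above would mislabel.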