ByteDance released Seedance 2.0 on February 8, 2026, and it immediately broke from the pack in one critical way: it generates audio and video simultaneously. Every other major video model — Sora 2, Kling, Runway Gen-3 — produces silent video and bolts audio on afterward. Seedance 2.0 uses a Dual-Branch Diffusion Transformer that synthesizes them in the same forward pass. The result is phoneme-accurate lip sync across 8+ languages without any post-processing step.
This guide covers how to call the API, what parameters actually matter, and how to get clean results across all four input modes.
## What Seedance 2.0 Can Do
The model outputs up to 2K resolution at 4–15 seconds per clip. Its multimodal input supports up to 9 reference images, 3 video clips, and 3 audio clips mixed together — 15 reference assets in a single request. The four primary input modes are:
- Text-to-video — prompt only, audio generated from scene context
- Image-to-video — one or more reference images with a driving prompt
- Audio-conditioned video — lip sync driven by a supplied audio clip
- Omni-reference — mixed images + video + audio with asset tags in the prompt
The model also supports first-frame and last-frame anchoring, letting you control entry and exit frames for multi-shot continuity.
Aspect ratios available: 16:9, 9:16, 4:3, 3:4, 21:9, 1:1.
Two speed tiers are available: `seedance_2.0` (standard quality, roughly 90–120 seconds per generation) and `seedance_2.0_fast` (roughly 30–60 seconds, with slightly reduced quality).
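The per-request asset caps above (9 reference images, 3 video clips, 3 audio clips) can be enforced client-side before spending an API call. A minimal sketch; the helper name is my own, not part of any SDK:

```python
def validate_assets(images=(), videos=(), audios=()) -> None:
    """Fail fast if a request would exceed Seedance 2.0's per-request caps
    (9 reference images, 3 video clips, 3 audio clips)."""
    caps = {"images": (images, 9), "videos": (videos, 3), "audios": (audios, 3)}
    for name, (assets, cap) in caps.items():
        if len(assets) > cap:
            raise ValueError(f"{name}: {len(assets)} supplied, limit is {cap}")
```

Call it just before submitting so an over-limit request fails locally instead of coming back as a provider-side error.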
## API Authentication and Setup
The official Volcengine/Volcano Ark API launches February 24, 2026. In the meantime, fal.ai and several OpenAI-compatible proxy platforms expose the same model. All examples below use the async job pattern these providers share.
Install the dependencies:
```shell
pip install requests python-dotenv
```
Set your credentials in a `.env` file:
```
SEEDANCE_API_KEY=sk-your-api-key-here
SEEDANCE_API_BASE=https://api.fal.ai/v1
# Or use: https://ark.cn-beijing.volces.com/api/v3 (Volcengine, when live)
```
The core client setup:
```python
import os
import time

import requests
from dotenv import load_dotenv

load_dotenv()

API_KEY = os.environ["SEEDANCE_API_KEY"]
API_BASE = os.environ.get("SEEDANCE_API_BASE", "https://api.fal.ai/v1")

HEADERS = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json",
}


def submit_job(payload: dict) -> str:
    """Submit a video generation request and return the task ID."""
    resp = requests.post(
        f"{API_BASE}/video/generations",
        headers=HEADERS,
        json=payload,
        timeout=30,
    )
    resp.raise_for_status()
    data = resp.json()
    return data["id"]


def poll_job(task_id: str, interval: int = 5, max_wait: int = 300) -> dict:
    """Poll until the job completes or times out. Returns the completed job data."""
    deadline = time.time() + max_wait
    while time.time() < deadline:
        resp = requests.get(
            f"{API_BASE}/video/generations/{task_id}",
            headers=HEADERS,
            timeout=10,
        )
        resp.raise_for_status()
        data = resp.json()
        status = data.get("status")
        if status == "succeeded":
            return data
        if status == "failed":
            raise RuntimeError(f"Generation failed: {data.get('error', 'unknown error')}")
        time.sleep(interval)
    raise TimeoutError(f"Job {task_id} did not complete within {max_wait}s")


def download_video(url: str, output_path: str) -> None:
    """Download the generated video to disk."""
    resp = requests.get(url, stream=True, timeout=60)
    resp.raise_for_status()
    with open(output_path, "wb") as f:
        for chunk in resp.iter_content(chunk_size=8192):
            f.write(chunk)
    print(f"Saved: {output_path}")
```
## Text-to-Video with Native Audio
The simplest call — provide a prompt, let the model generate both video and synchronized audio from the scene context:
```python
def text_to_video(
    prompt: str,
    aspect_ratio: str = "16:9",
    duration: int = 5,
    resolution: str = "1080p",
    speed: str = "seedance_2.0",
) -> str:
    """Generate a video from a text prompt. Returns local file path."""
    payload = {
        "model": speed,              # "seedance_2.0" or "seedance_2.0_fast"
        "prompt": prompt,
        "aspect_ratio": aspect_ratio,
        "duration": duration,        # 4–15 seconds
        "resolution": resolution,    # "480p", "720p", "1080p", "2k"
        "audio": True,               # Enable native audio generation
    }
    print(f"Submitting: {prompt[:60]}...")
    task_id = submit_job(payload)
    print(f"Task ID: {task_id} — polling...")
    result = poll_job(task_id)
    video_url = result["output"]["url"]
    output_path = f"output_{task_id[:8]}.mp4"
    download_video(video_url, output_path)
    return output_path


if __name__ == "__main__":
    path = text_to_video(
        prompt=(
            "A jazz pianist performs at a smoky club in 1950s New York. "
            "Close-up on hands dancing over ivory keys. "
            "Ambient crowd murmur, clinking glasses, and live piano music."
        ),
        aspect_ratio="16:9",
        duration=8,
        resolution="2k",
    )
    print(f"Video saved to: {path}")
```
The `audio: True` flag is what unlocks native generation. Set it to `False` and you get a silent video clip, which is useful if you're supplying your own audio track in a later step.
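If you do generate silent clips and add your own track afterward, the mux step can be done with ffmpeg. A sketch, assuming `ffmpeg` is on your PATH; the helper names are my own:

```python
import subprocess


def build_mux_command(video_path: str, audio_path: str, output_path: str) -> list[str]:
    """ffmpeg argument list: copy the video stream untouched, encode the
    audio as AAC, and stop at the shorter input so the clip ends cleanly."""
    return [
        "ffmpeg", "-y",
        "-i", video_path,
        "-i", audio_path,
        "-c:v", "copy", "-c:a", "aac", "-shortest",
        output_path,
    ]


def mux_audio(video_path: str, audio_path: str, output_path: str) -> None:
    """Combine a silent video with a separate audio file."""
    subprocess.run(build_mux_command(video_path, audio_path, output_path), check=True)
```

Because `-c:v copy` avoids re-encoding the video stream, the mux is fast and lossless on the visual side.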
## Image-to-Video with Lip Sync
This is where Seedance 2.0 really pulls ahead. Provide a portrait image and an audio clip, and the model generates a talking-head video with phoneme-accurate lip sync — no separate lip-sync step required.
To reference a local file, encode it as a base64 data URI or upload it to a URL first. Most providers accept a public URL directly:
```python
import base64
from pathlib import Path


def encode_image(image_path: str) -> str:
    """Encode a local image as a base64 data URI."""
    suffix = Path(image_path).suffix.lstrip(".")
    mime = f"image/{suffix}" if suffix != "jpg" else "image/jpeg"
    with open(image_path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime};base64,{encoded}"


def image_to_video_with_lipsync(
    image_path: str,
    audio_url: str,
    prompt: str,
    duration: int = 8,
    language: str = "en",
) -> str:
    """
    Generate a lip-synced talking-head video.

    Args:
        image_path: Path to portrait image (JPEG or PNG)
        audio_url: Public URL to the driving audio clip (WAV or MP3)
        prompt: Scene description — keeps the background and camera consistent
        duration: 4–15 seconds (should match audio clip length)
        language: ISO 639-1 code: "en", "zh", "es", "fr", "de", "ja", "ko", "pt"
    """
    image_data = encode_image(image_path)
    payload = {
        "model": "seedance_2.0",
        "prompt": prompt,
        "images": [image_data],   # Up to 9 images accepted
        "audios": [audio_url],    # Up to 3 audio clips accepted
        "duration": duration,
        "resolution": "1080p",
        "audio": True,            # Keep True — drives native lip sync
        "lipsync_language": language,
    }
    task_id = submit_job(payload)
    print(f"Lip-sync job: {task_id}")
    result = poll_job(task_id, max_wait=180)
    output_path = f"lipsync_{task_id[:8]}.mp4"
    download_video(result["output"]["url"], output_path)
    return output_path


# Example usage
path = image_to_video_with_lipsync(
    image_path="portrait.jpg",
    audio_url="https://example.com/speech_clip.wav",
    prompt=(
        "A professional speaker at a conference podium, clean white "
        "background, direct eye contact with camera."
    ),
    duration=10,
    language="en",
)
```
The `lipsync_language` parameter selects the phoneme table for the target language. Without it, the model defaults to English phoneme detection, which produces obvious desync on Chinese, Spanish, or Japanese speech.
## Omni-Reference Mode

Omni-reference mode lets you tag assets directly in the prompt string using `@image_file_1`, `@video_file_1`, and `@audio_file_1` syntax. This gives you director-level control over which visual reference drives which part of the scene.
```python
def omni_reference_video(
    prompt: str,
    images: list[str],   # List of public image URLs
    videos: list[str],   # List of public video URLs
    audios: list[str],   # List of public audio URLs
    aspect_ratio: str = "16:9",
    duration: int = 10,
) -> str:
    """
    Generate video using Omni-reference: mixed image, video, and audio inputs.

    In the prompt, reference assets as @image_file_1, @image_file_2,
    @video_file_1, @audio_file_1 etc. The model uses these to anchor
    the corresponding visual and audio elements.

    Example prompt:
        "The person in @image_file_1 walks into the scene from @video_file_1.
        Background music from @audio_file_1."
    """
    payload = {
        "model": "seedance_2.0",
        "prompt": prompt,
        "images": images,   # Up to 9
        "videos": videos,   # Up to 3
        "audios": audios,   # Up to 3
        "aspect_ratio": aspect_ratio,
        "duration": duration,
        "resolution": "1080p",
        "audio": True,
    }
    task_id = submit_job(payload)
    print(f"Omni-reference job: {task_id}")
    result = poll_job(task_id, max_wait=240)
    output_path = f"omni_{task_id[:8]}.mp4"
    download_video(result["output"]["url"], output_path)
    return output_path


# Example: product demo combining brand image, reference motion, and voiceover
path = omni_reference_video(
    prompt=(
        "The product from @image_file_1 rotates on a sleek surface, "
        "mimicking the motion style from @video_file_1. "
        "Voiceover from @audio_file_1 plays synchronized with the visuals."
    ),
    images=["https://cdn.example.com/product_hero.jpg"],
    videos=["https://cdn.example.com/motion_ref.mp4"],
    audios=["https://cdn.example.com/voiceover.mp3"],
    duration=12,
)
```
## Director-Level Parameters
Beyond the basics, these parameters give you fine-grained control over the cinematic output:
| Parameter | Values | Effect |
|---|---|---|
| `model` | `seedance_2.0`, `seedance_2.0_fast` | Quality vs. speed tradeoff |
| `resolution` | `480p`, `720p`, `1080p`, `2k` | Output resolution |
| `duration` | 4–15 (int) | Clip length in seconds |
| `aspect_ratio` | `16:9`, `9:16`, `4:3`, `3:4`, `21:9`, `1:1` | Frame dimensions |
| `audio` | `true`, `false` | Enable native audio generation |
| `lipsync_language` | `en`, `zh`, `es`, `fr`, `de`, `ja`, `ko`, `pt` | Phoneme table for lip sync |
| `first_frame_image` | image URL or data URI | Anchor the opening frame |
| `last_frame_image` | image URL or data URI | Anchor the closing frame |
| `negative_prompt` | text string | What to avoid in generation |
The `first_frame_image` and `last_frame_image` parameters are particularly useful for multi-shot productions: you can chain clips by reusing the last frame of clip N as the first frame of clip N+1, maintaining visual continuity without stitching artifacts.
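One way to wire that chain is a driver loop that carries each clip's closing frame forward. This is a sketch with the provider-specific parts injected: `generate` wraps your submit/poll pair, and `extract_last_frame` must return a frame URL or data URI for the finished clip (some providers may not return one, in which case you would grab the final frame from the downloaded video yourself, e.g. with ffmpeg):

```python
def chain_clips(prompts, generate, extract_last_frame):
    """Generate clips in sequence, anchoring each clip's opening frame on
    the previous clip's closing frame for multi-shot continuity.

    generate(payload) -> result        : submit + poll one job
    extract_last_frame(result) -> str  : frame URL/data URI for that clip
    """
    results = []
    prev_frame = None
    for prompt in prompts:
        payload = {"model": "seedance_2.0", "prompt": prompt, "audio": True}
        if prev_frame is not None:
            payload["first_frame_image"] = prev_frame  # anchor on prior clip
        result = generate(payload)
        results.append(result)
        prev_frame = extract_last_frame(result)
    return results
```

The first clip runs unanchored; every later clip opens exactly where its predecessor ended.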
## Batch Generation with Error Handling
Production workflows need retry logic and concurrent submissions. Here’s a pattern for batch generation:
```python
import concurrent.futures
from dataclasses import dataclass


@dataclass
class VideoJob:
    prompt: str
    output_name: str
    aspect_ratio: str = "16:9"
    duration: int = 5
    resolution: str = "1080p"


def run_job(job: VideoJob) -> tuple[str, str]:
    """Submit and await a single video job. Returns (output_name, file_path)."""
    payload = {
        "model": "seedance_2.0_fast",
        "prompt": job.prompt,
        "aspect_ratio": job.aspect_ratio,
        "duration": job.duration,
        "resolution": job.resolution,
        "audio": True,
    }
    try:
        task_id = submit_job(payload)
        result = poll_job(task_id, max_wait=180)
        path = f"{job.output_name}.mp4"
        download_video(result["output"]["url"], path)
        return job.output_name, path
    except Exception as e:
        print(f"Failed job '{job.output_name}': {e}")
        return job.output_name, ""


def batch_generate(jobs: list[VideoJob], max_workers: int = 3) -> dict[str, str]:
    """
    Run multiple video jobs concurrently.
    Returns dict of {output_name: file_path}.
    """
    results = {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(run_job, job): job for job in jobs}
        for future in concurrent.futures.as_completed(futures):
            name, path = future.result()
            results[name] = path
            status = "OK" if path else "FAILED"
            print(f"[{status}] {name}")
    return results


# Example batch run
jobs = [
    VideoJob("A chef plating a dish in a Michelin-star kitchen, ambient kitchen sounds.", "scene_01"),
    VideoJob("A street musician plays violin in Paris rain, realistic ambient sound.", "scene_02", duration=10),
    VideoJob("Time-lapse of city skyline at sunset, traffic hum below.", "scene_03", aspect_ratio="21:9"),
]
results = batch_generate(jobs, max_workers=2)
print(results)
```
Keep `max_workers` at 2–3 unless you have confirmed higher concurrency limits on your API tier.
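When a batch does hit rate limits, exponential backoff usually recovers it. A generic retry wrapper you could put around `submit_job` or `poll_job` calls (the helper is my own sketch, retrying on any exception; a stricter version would inspect the HTTP status code):

```python
import random
import time


def with_backoff(fn, retries: int = 4, base_delay: float = 2.0):
    """Call fn(), retrying on exceptions with exponentially growing, jittered
    delays (roughly base_delay * 2**attempt, scaled by a random factor in
    [0.5, 1.5)). Re-raises the last exception after the final attempt."""
    for attempt in range(retries):
        try:
            return fn()
        except Exception:
            if attempt == retries - 1:
                raise
            delay = base_delay * (2 ** attempt) * (0.5 + random.random())
            time.sleep(delay)
```

Usage inside `run_job` would look like `result = with_backoff(lambda: poll_job(task_id, max_wait=180))`. The jitter keeps parallel workers from retrying in lockstep and re-triggering the same 429.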
## Seedance 2.0 vs. Sora 2 vs. Kling vs. Runway Gen-3
| Feature | Seedance 2.0 | Sora 2 | Kling 1.6 | Runway Gen-3 Alpha |
|---|---|---|---|---|
| Max resolution | 2K | 1080p | 1080p | 1080p |
| Native audio | Yes (simultaneous) | No (post-process) | No | No |
| Lip sync | Phoneme-level, 8+ langs | No | Separate step | No |
| Max duration | 15s | 20s | 30s | 10s |
| Multimodal input | 9 img + 3 vid + 3 audio | Image + text | Image + text | Image + text |
| Pricing (1080p/min) | ~$0.10–$0.40 | ~$0.15 (est.) | ~$0.08 | ~$0.10 |
| API availability | Feb 24, 2026 (fal.ai now) | Limited access | GA | GA |
The native audio advantage is significant for any workflow involving spoken content. Running Sora or Kling + a separate lip-sync step (SadTalker, Wav2Lip, or a commercial service) typically adds latency, introduces alignment drift, and requires two API budgets. Seedance 2.0 collapses that to a single request.
The 2K ceiling also matters for anything destined for large-format display. Most competitors cap at 1080p.
## Common Errors
**429 Too Many Requests** — You've exceeded your tier's concurrency limit. Add exponential backoff around your polling calls and reduce `max_workers` in batch runs.

**Job stuck in `processing` state** — Omni-reference jobs with multiple large assets can take 3–5 minutes. Increase `max_wait` to 360 seconds for these jobs.

**Lip sync desync on non-English audio** — Always set `lipsync_language` explicitly. The model defaults to English phoneme detection; without this flag, accented speech or non-Latin scripts will show visible mismatch.

**`resolution: 2k` not available** — 2K is a paid-tier feature on most proxy platforms. Fall back to `1080p` or check your plan limits.

**Malformed base64 image** — The data URI prefix must match the actual format: `data:image/jpeg;base64,...` for JPEGs, `data:image/png;base64,...` for PNGs. Mismatched MIME types cause silent failures.
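One way to rule out MIME mismatches entirely is to sniff the file's magic bytes instead of trusting its extension. A sketch covering only JPEG and PNG; the helper names are my own:

```python
import base64


def sniff_mime(data: bytes) -> str:
    """Pick the MIME type from the file's magic bytes, not its name."""
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    if data.startswith(b"\xff\xd8\xff"):
        return "image/jpeg"
    raise ValueError("Unsupported image format: expected JPEG or PNG")


def encode_image_safe(image_path: str) -> str:
    """Build a data URI whose MIME prefix is guaranteed to match the bytes."""
    with open(image_path, "rb") as f:
        data = f.read()
    encoded = base64.b64encode(data).decode("utf-8")
    return f"data:{sniff_mime(data)};base64,{encoded}"
```

This catches the common failure case of a PNG that was renamed to `.jpg`, which the extension-based `encode_image` above would mislabel.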