## Generate Audio from a Text Prompt
Stable Audio Open and AudioLDM 2 are diffusion-based models that turn text descriptions into audio. Both run through Hugging Face’s diffusers library, so setup is straightforward. Here’s the fastest path to generating your first clip.
```shell
pip install diffusers transformers accelerate torch scipy soundfile
```
```python
import torch
import soundfile as sf
from diffusers import AudioLDM2Pipeline

# Load AudioLDM 2 (~3GB download on first run)
pipe = AudioLDM2Pipeline.from_pretrained(
    "cvssp/audioldm2",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

# Generate audio from a text prompt
audio = pipe(
    prompt="rain falling on a tin roof with distant thunder",
    num_inference_steps=100,
    audio_length_in_s=10.0,
).audios[0]

# Save as WAV
sf.write("rain_thunder.wav", audio, samplerate=16000)
print("Saved rain_thunder.wav")
```
AudioLDM 2 outputs a numpy array at 16kHz sample rate. The soundfile library handles WAV writing without any extra dependencies. Ten seconds of audio generates in about 15 seconds on an RTX 3060.
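Because the output is a plain numpy array, you can sanity-check a clip before saving it. A quick sketch — the array here is synthetic, standing in for `pipe(...).audios[0]`:

```python
import numpy as np

SAMPLE_RATE = 16_000  # AudioLDM 2's fixed output rate

# Stand-in for a generated clip: 10 seconds of zeros
audio = np.zeros(10 * SAMPLE_RATE, dtype=np.float32)

# Duration follows directly from array length and sample rate
duration_s = len(audio) / SAMPLE_RATE
# Peak amplitude tells you whether the clip will clip on export (should be <= 1.0)
peak = float(np.max(np.abs(audio)))

print(f"duration: {duration_s:.1f}s, peak amplitude: {peak:.3f}")
```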
## Pick the Right Model
You have two strong options for text-to-audio generation. They serve different use cases.
| Model | Strengths | VRAM | Sample Rate | Best For |
|---|---|---|---|---|
| `cvssp/audioldm2` | General audio, music, speech | ~6GB | 16kHz | Sound effects, ambient audio |
| `cvssp/audioldm2-music` | Music-focused variant | ~6GB | 16kHz | Music generation |
| `stabilityai/stable-audio-open-1.0` | Longer, higher-quality audio | ~8GB | 44.1kHz | High-fidelity music and SFX |
My recommendation: start with AudioLDM 2 for sound effects and environmental audio. If you need music or higher sample rates, use Stable Audio Open. The 44.1kHz output from Stable Audio Open sounds noticeably better when you’re producing content that will be listened to on speakers or headphones.
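If you switch between the two models often, a small lookup table keeps the trade-offs in one place. An illustrative sketch — the helper and its use-case names are my own, not part of either library:

```python
# Map use cases to (model id, output sample rate); ids are from the table above
MODELS = {
    "sfx":   ("cvssp/audioldm2", 16_000),
    "music": ("stabilityai/stable-audio-open-1.0", 44_100),
}

def pick_model(use_case: str) -> tuple[str, int]:
    """Return (model id, sample rate) for a use case, or raise on unknown input."""
    try:
        return MODELS[use_case]
    except KeyError:
        raise ValueError(f"unknown use case: {use_case!r}") from None

model_id, sr = pick_model("music")
```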
## Set Up Stable Audio Open
Stable Audio Open produces longer, higher-fidelity clips than AudioLDM 2. It outputs at 44.1kHz and handles durations up to 47 seconds.
```python
import torch
import soundfile as sf
from diffusers import StableAudioPipeline

pipe = StableAudioPipeline.from_pretrained(
    "stabilityai/stable-audio-open-1.0",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

# Generate a 30-second music clip
audio = pipe(
    prompt="cinematic orchestral piece, sweeping strings, brass swells, epic mood",
    negative_prompt="low quality, distorted, noise, static",
    num_inference_steps=100,
    audio_end_in_s=30.0,  # StableAudioPipeline's duration parameter
    guidance_scale=7.0,
).audios[0]

# The pipeline returns a (channels, samples) torch tensor;
# transpose and convert to numpy for soundfile
output = audio.T.float().cpu().numpy()

# Stable Audio Open outputs at 44.1kHz
sf.write("orchestral.wav", output, samplerate=44100)
print("Saved orchestral.wav")
```
Stable Audio Open requires you to accept the model license on Hugging Face before downloading. Log in with `huggingface-cli login` if you get a 403 error.
## Tune the Generation Parameters
Both models respond to the same core parameters, but the sweet spots are different.
```python
import torch
import soundfile as sf
from diffusers import AudioLDM2Pipeline

pipe = AudioLDM2Pipeline.from_pretrained(
    "cvssp/audioldm2",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

audio = pipe(
    prompt="acoustic guitar fingerpicking, warm tone, folk style, gentle rhythm",
    negative_prompt="drums, bass, electric, distortion, noise",
    num_inference_steps=150,     # More steps = cleaner output
    audio_length_in_s=15.0,      # Duration in seconds
    guidance_scale=3.5,          # How strictly to follow the prompt
    num_waveforms_per_prompt=3,  # Generate 3 variations at once
).audios

# Save all variations
for i, wav in enumerate(audio):
    sf.write(f"guitar_{i}.wav", wav, samplerate=16000)
    print(f"Saved guitar_{i}.wav")
```
Here’s what each parameter actually does:
- `num_inference_steps`: 100 is a solid default for AudioLDM 2. Going to 150-200 adds subtle improvements but doubles generation time. Below 50 the audio gets muddy. For Stable Audio Open, 100 steps works well.
- `guidance_scale`: This is the most impactful parameter. For AudioLDM 2, keep it between 2.0 and 5.0 – the default of 3.5 is usually right. For Stable Audio Open, 6.0-8.0 works better. Push it past 10 and you'll hear artifacts.
- `audio_length_in_s`: AudioLDM 2 handles up to about 30 seconds before quality drops off. Stable Audio Open goes up to 47 seconds with consistent quality (its pipeline names the parameter `audio_end_in_s`).
- `negative_prompt`: Works just like in image diffusion. Tell the model what to avoid. This matters more than you'd think – always include "noise, static, distortion, low quality" at minimum.
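One way to keep the per-model sweet spots straight is a preset dict you splat into the pipeline call. A sketch, with values taken from the ranges above — the preset names are mine, not anything the library defines:

```python
# Per-model defaults for the parameters discussed above
PRESETS = {
    "audioldm2": {
        "num_inference_steps": 100,
        "guidance_scale": 3.5,
        "negative_prompt": "noise, static, distortion, low quality",
    },
    "stable-audio": {
        "num_inference_steps": 100,
        "guidance_scale": 7.0,
        "negative_prompt": "low quality, distorted, noise, static",
    },
}

# Usage, assuming `pipe` is an AudioLDM2Pipeline:
#   pipe(prompt="...", audio_length_in_s=10.0, **PRESETS["audioldm2"])
```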
## Edit Audio with Inpainting
The AudioLDM 2 pipeline doesn't take a mask directly, but you can approximate audio inpainting with a mask-and-blend approach: generate replacement audio for a section of an existing clip, then blend the two so the original is kept everywhere outside the masked region. This is useful for fixing sections, extending clips, or replacing specific sounds.
```python
import torch
import numpy as np
import soundfile as sf
from diffusers import AudioLDM2Pipeline

pipe = AudioLDM2Pipeline.from_pretrained(
    "cvssp/audioldm2",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

# Load existing audio
original_audio, sr = sf.read("rain_thunder.wav")

# Create a mask: 1.0 = keep original, 0.0 = regenerate
# This masks out seconds 3-6, so the generated audio replaces that section
mask = np.ones_like(original_audio)
mask_start = int(3.0 * sr)
mask_end = int(6.0 * sr)
mask[mask_start:mask_end] = 0.0

# Generate replacement audio matching the original's duration
result = pipe(
    prompt="a loud thunderclap with crackling lightning",
    audio_length_in_s=len(original_audio) / sr,
    num_inference_steps=100,
).audios[0]

# Match the generated length to the original (pad with silence or trim)
if len(result) < len(original_audio):
    result = np.pad(result, (0, len(original_audio) - len(result)))
result = result[:len(original_audio)]

# Blend: keep original where mask is 1, use generated where mask is 0
blended = original_audio * mask + result * (1.0 - mask)

sf.write("edited_thunder.wav", blended, samplerate=sr)
print("Saved edited_thunder.wav")
```
The approach above is a manual blend. For seamless results, apply a short crossfade at the mask boundaries:
```python
import numpy as np

def crossfade_blend(original, generated, mask, crossfade_samples=1600):
    """Blend original and generated audio with smooth crossfades at mask edges."""
    gen_trimmed = generated[:len(original)]

    # Start from a hard blend: generated inside the mask, original outside.
    # Doing this first means the crossfades below aren't overwritten later.
    result = np.where(mask == 0.0, gen_trimmed, original)

    # Find mask transition points (mask holds floats, so compare against ±1.0)
    diff = np.diff(mask)
    starts = np.where(diff == -1.0)[0]  # Where mask goes from 1 to 0
    ends = np.where(diff == 1.0)[0]     # Where mask goes from 0 to 1

    # Fade original -> generated around each start of a masked region
    for s in starts:
        fade_start = max(0, s - crossfade_samples // 2)
        fade_end = min(len(result), s + crossfade_samples // 2)
        fade = np.linspace(1.0, 0.0, fade_end - fade_start)
        result[fade_start:fade_end] = (
            original[fade_start:fade_end] * fade
            + gen_trimmed[fade_start:fade_end] * (1.0 - fade)
        )

    # Fade generated -> original around each end of a masked region
    for e in ends:
        fade_start = max(0, e - crossfade_samples // 2)
        fade_end = min(len(result), e + crossfade_samples // 2)
        fade = np.linspace(0.0, 1.0, fade_end - fade_start)
        result[fade_start:fade_end] = (
            original[fade_start:fade_end] * fade
            + gen_trimmed[fade_start:fade_end] * (1.0 - fade)
        )

    return result
```
The models output raw numpy arrays. You’ll often need to convert between formats for different use cases.
```python
import subprocess

import soundfile as sf
from scipy.signal import resample

# Save as WAV (lossless, large files)
sf.write("output.wav", audio, samplerate=16000)

# Save as FLAC (lossless, roughly 50% smaller)
sf.write("output.flac", audio, samplerate=16000)

# Save as OGG Vorbis (lossy, small files)
sf.write("output.ogg", audio, samplerate=16000)

# Convert to MP3 with ffmpeg (best for sharing)
subprocess.run([
    "ffmpeg", "-i", "output.wav",
    "-codec:a", "libmp3lame", "-qscale:a", "2",
    "output.mp3",
], check=True)

# Resample from 16kHz to 44.1kHz for better playback compatibility
def resample_audio(audio, orig_sr, target_sr):
    """Simple resampling using scipy."""
    num_samples = int(len(audio) * target_sr / orig_sr)
    return resample(audio, num_samples)

audio_44k = resample_audio(audio, 16000, 44100)
sf.write("output_44k.wav", audio_44k, samplerate=44100)
```
Use WAV during development – it’s lossless and every tool reads it. Convert to MP3 or OGG for distribution. If you’re feeding audio into another model downstream, stay in WAV to avoid compression artifacts.
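Before converting, it can also help to peak-normalize the array so quiet generations don't get lost and hot ones don't clip on export. A minimal sketch — the 1 dB headroom value is my own choice, not a model requirement:

```python
import numpy as np

def peak_normalize(audio: np.ndarray, headroom_db: float = 1.0) -> np.ndarray:
    """Scale audio so its peak sits `headroom_db` below full scale (1.0)."""
    peak = np.max(np.abs(audio))
    if peak == 0.0:
        return audio  # silence: nothing to scale
    target = 10 ** (-headroom_db / 20.0)  # 1 dB headroom ~= 0.891 linear
    return (audio * (target / peak)).astype(np.float32)

normalized = peak_normalize(np.array([0.1, -0.2, 0.05], dtype=np.float32))
```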
## Batch Generation Script
Here’s a complete script for generating multiple audio clips from a prompt file.
```python
import argparse

import torch
import soundfile as sf
from diffusers import AudioLDM2Pipeline

def main():
    parser = argparse.ArgumentParser(description="Batch audio generation with AudioLDM 2")
    parser.add_argument("--prompts-file", required=True, help="Text file with one prompt per line")
    parser.add_argument("--output-dir", default=".", help="Output directory")
    parser.add_argument("--duration", type=float, default=10.0, help="Audio duration in seconds")
    parser.add_argument("--steps", type=int, default=100, help="Inference steps")
    parser.add_argument("--guidance", type=float, default=3.5, help="Guidance scale")
    args = parser.parse_args()

    pipe = AudioLDM2Pipeline.from_pretrained(
        "cvssp/audioldm2",
        torch_dtype=torch.float16,
    )
    pipe = pipe.to("cuda")

    with open(args.prompts_file) as f:
        prompts = [line.strip() for line in f if line.strip()]

    for i, prompt in enumerate(prompts):
        print(f"[{i+1}/{len(prompts)}] Generating: {prompt[:60]}...")
        audio = pipe(
            prompt=prompt,
            negative_prompt="noise, static, distortion, low quality",
            num_inference_steps=args.steps,
            audio_length_in_s=args.duration,
            guidance_scale=args.guidance,
        ).audios[0]
        output_path = f"{args.output_dir}/audio_{i:03d}.wav"
        sf.write(output_path, audio, samplerate=16000)
        print(f"  Saved {output_path}")
        # Free VRAM between generations
        torch.cuda.empty_cache()

if __name__ == "__main__":
    main()
```
Run it with a text file of prompts:
```shell
python batch_generate.py --prompts-file prompts.txt --output-dir ./generated --duration 15
```
## Common Errors
**`RuntimeError: CUDA out of memory`**
AudioLDM 2 needs about 6GB VRAM in float16. If you’re hitting limits, enable CPU offloading:
```python
pipe.enable_model_cpu_offload()
```
This moves model components to CPU when they’re not needed. It’s slower but cuts VRAM usage roughly in half.
**`OSError: stabilityai/stable-audio-open-1.0 is not accessible`**
You need to accept the license on the Hugging Face model page first, then authenticate with `huggingface-cli login`. Paste your access token when prompted, and make sure the token has read access.
**Generated audio sounds metallic or robotic**
Your guidance_scale is probably too high. Drop it to 3.0-4.0 for AudioLDM 2 or 6.0-7.0 for Stable Audio Open. Also check that you’re using enough inference steps – below 50 the denoising process doesn’t converge properly.
**Audio has clicks or pops at segment boundaries**
This happens when you concatenate clips or do inpainting without crossfades. Always apply a crossfade of at least 50ms (800 samples at 16kHz) at edit points. See the crossfade function in the inpainting section above.
**`ValueError: Audio length is too long`**
AudioLDM 2 has an upper limit around 30 seconds. Stable Audio Open maxes out at 47 seconds. For longer audio, generate segments with overlapping regions and crossfade them:
```python
import numpy as np

def generate_long_audio(pipe, prompt, total_duration, segment_duration=10.0, overlap=2.0):
    """Generate audio longer than the model's limit by stitching segments."""
    sample_rate = 16000  # AudioLDM 2's output rate
    segments = []
    offset = 0.0
    while offset < total_duration:
        # Keep every segment at least as long as the overlap so crossfading works
        dur = max(min(segment_duration, total_duration - offset), 2 * overlap)
        audio = pipe(
            prompt=prompt,
            audio_length_in_s=dur,
            num_inference_steps=100,
        ).audios[0]
        segments.append(audio)
        offset += segment_duration - overlap

    # Crossfade overlapping regions
    overlap_samples = int(overlap * sample_rate)
    result = segments[0]
    for seg in segments[1:]:
        fade_out = np.linspace(1.0, 0.0, overlap_samples)
        fade_in = np.linspace(0.0, 1.0, overlap_samples)
        result[-overlap_samples:] = (
            result[-overlap_samples:] * fade_out + seg[:overlap_samples] * fade_in
        )
        result = np.concatenate([result, seg[overlap_samples:]])
    return result
```
**`ImportError: No module named 'soundfile'`**
On Linux, soundfile depends on libsndfile:
```shell
sudo apt install libsndfile1
pip install soundfile
```
On macOS, run `brew install libsndfile` before installing the Python package.