Generate Audio from a Text Prompt

Stable Audio Open and AudioLDM 2 are diffusion-based models that turn text descriptions into audio. Both run through Hugging Face’s diffusers library, so setup is straightforward. Here’s the fastest path to generating your first clip.

pip install diffusers transformers accelerate torch scipy soundfile
import torch
import soundfile as sf
from diffusers import AudioLDM2Pipeline

# Load AudioLDM 2 (~3GB download on first run)
pipe = AudioLDM2Pipeline.from_pretrained(
    "cvssp/audioldm2",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

# Generate audio from a text prompt
audio = pipe(
    prompt="rain falling on a tin roof with distant thunder",
    num_inference_steps=100,
    audio_length_in_s=10.0,
).audios[0]

# Save as WAV
sf.write("rain_thunder.wav", audio, samplerate=16000)
print("Saved rain_thunder.wav")

AudioLDM 2 outputs a numpy array at 16kHz sample rate. The soundfile library handles WAV writing without any extra dependencies. Ten seconds of audio generates in about 15 seconds on an RTX 3060.
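One caveat: generated waveforms can occasionally peak slightly outside the [-1.0, 1.0] range, which clips when written to disk. A small safeguard helps (peak_normalize is my own helper, not part of diffusers):

```python
import numpy as np

def peak_normalize(audio, ceiling=0.99):
    """Scale the waveform down only if its peak exceeds the ceiling."""
    peak = np.max(np.abs(audio))
    if peak > ceiling:
        audio = audio * (ceiling / peak)
    return audio
```

Run it on the array before sf.write; in-range audio passes through unchanged.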

Pick the Right Model

You have two strong options for text-to-audio generation. They serve different use cases.

| Model | Strengths | VRAM | Sample Rate | Best For |
|---|---|---|---|---|
| cvssp/audioldm2 | General audio, music, speech | ~6GB | 16kHz | Sound effects, ambient audio |
| cvssp/audioldm2-music | Music-focused variant | ~6GB | 16kHz | Music generation |
| stabilityai/stable-audio-open-1.0 | Longer, higher quality audio | ~8GB | 44.1kHz | High-fidelity music and SFX |

My recommendation: start with AudioLDM 2 for sound effects and environmental audio. If you need music or higher sample rates, use Stable Audio Open. The 44.1kHz output from Stable Audio Open sounds noticeably better when you’re producing content that will be listened to on speakers or headphones.

Set Up Stable Audio Open

Stable Audio Open produces longer, higher-fidelity clips than AudioLDM 2. It outputs at 44.1kHz and handles durations up to 47 seconds.

import torch
import soundfile as sf
from diffusers import StableAudioPipeline

pipe = StableAudioPipeline.from_pretrained(
    "stabilityai/stable-audio-open-1.0",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

# Generate a 30-second music clip
audio = pipe(
    prompt="cinematic orchestral piece, sweeping strings, brass swells, epic mood",
    negative_prompt="low quality, distorted, noise, static",
    num_inference_steps=100,
    audio_end_in_s=30.0,  # StableAudioPipeline names the duration parameter differently
    guidance_scale=7.0,
).audios[0]

# Stable Audio Open returns a stereo torch tensor (channels, samples) at 44.1kHz;
# transpose to (samples, channels) and convert for soundfile
audio = audio.T.float().cpu().numpy()
sf.write("orchestral.wav", audio, samplerate=44100)
print("Saved orchestral.wav")

Stable Audio Open requires you to accept the model license on Hugging Face before downloading. Log in with huggingface-cli login if you get a 403 error.

Tune the Generation Parameters

Both models respond to the same core parameters, but the sweet spots are different.

import torch
import soundfile as sf
from diffusers import AudioLDM2Pipeline

pipe = AudioLDM2Pipeline.from_pretrained(
    "cvssp/audioldm2",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

audio = pipe(
    prompt="acoustic guitar fingerpicking, warm tone, folk style, gentle rhythm",
    negative_prompt="drums, bass, electric, distortion, noise",
    num_inference_steps=150,       # More steps = cleaner output
    audio_length_in_s=15.0,        # Duration in seconds
    guidance_scale=3.5,            # How strictly to follow the prompt
    num_waveforms_per_prompt=3,    # Generate 3 variations at once
).audios

# Save all variations
for i, wav in enumerate(audio):
    sf.write(f"guitar_{i}.wav", wav, samplerate=16000)
    print(f"Saved guitar_{i}.wav")

Here’s what each parameter actually does:

  • num_inference_steps: 100 is a solid default for AudioLDM 2. Going to 150-200 adds subtle improvements but doubles generation time. Below 50 the audio gets muddy. For Stable Audio Open, 100 steps works well.
  • guidance_scale: This is the most impactful parameter. For AudioLDM 2, keep it between 2.0 and 5.0 – the default of 3.5 is usually right. For Stable Audio Open, 6.0-8.0 works better. Push it past 10 and you’ll hear artifacts.
  • audio_length_in_s: AudioLDM 2 handles up to about 30 seconds before quality drops off. Stable Audio Open (where this parameter is named audio_end_in_s) goes up to 47 seconds with consistent quality.
  • negative_prompt: Works just like in image diffusion. Tell the model what to avoid. This matters more than you’d think – always include “noise, static, distortion, low quality” at minimum.
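To find the sweet spot for a particular prompt, it helps to sweep guidance_scale while holding the seed fixed, so guidance is the only variable. A sketch under that assumption (sweep_guidance is my own helper; it expects a loaded diffusers pipeline, which accepts a generator argument for seeding):

```python
import torch

def sweep_guidance(pipe, prompt, scales, seed=0, **kwargs):
    """Generate one clip per guidance scale, reusing the same seed each time."""
    results = {}
    for scale in scales:
        # A CPU generator keeps results reproducible across runs
        generator = torch.Generator("cpu").manual_seed(seed)
        out = pipe(
            prompt=prompt,
            guidance_scale=scale,
            generator=generator,
            **kwargs,
        )
        results[scale] = out.audios[0]
    return results
```

Call it as sweep_guidance(pipe, "rain on glass", [2.0, 3.5, 5.0], num_inference_steps=100, audio_length_in_s=5.0), then save each waveform and pick the scale that sounds best.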

Edit Audio with Inpainting

AudioLDM 2 works well for a form of audio inpainting – you mask a section of existing audio and regenerate just that part. The diffusers pipeline doesn't accept a mask directly, so the recipe below generates a replacement clip and splices it in where the mask is zero. This is useful for fixing sections, extending clips, or replacing specific sounds.

import torch
import numpy as np
import soundfile as sf
from diffusers import AudioLDM2Pipeline

pipe = AudioLDM2Pipeline.from_pretrained(
    "cvssp/audioldm2",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

# Load existing audio
original_audio, sr = sf.read("rain_thunder.wav")

# Create a mask: 1.0 = keep original, 0.0 = regenerate
# This masks out seconds 3-6, so the model will regenerate that section
mask = np.ones_like(original_audio)
mask_start = int(3.0 * sr)
mask_end = int(6.0 * sr)
mask[mask_start:mask_end] = 0.0

# Generate a full-length replacement clip with the new description;
# only the masked span will survive the blend below
result = pipe(
    prompt="a loud thunderclap with crackling lightning",
    audio_length_in_s=len(original_audio) / sr,
    num_inference_steps=100,
).audios[0]

# Blend: keep the original where mask is 1, use generated audio where mask is 0
# (pad with zeros in case the model returns slightly fewer samples)
generated = np.zeros_like(original_audio)
n = min(len(result), len(original_audio))
generated[:n] = result[:n]
blended = original_audio * mask + generated * (1.0 - mask)
sf.write("edited_thunder.wav", blended, samplerate=sr)
print("Saved edited_thunder.wav")

The approach above is a manual blend. For seamless results, apply a short crossfade at the mask boundaries:

import numpy as np

def crossfade_blend(original, generated, mask, crossfade_samples=1600):
    """Blend original and generated audio with smooth crossfades at mask edges."""
    result = original.copy()
    gen_trimmed = generated[:len(original)]

    # Fill the masked region (mask == 0) with generated audio first,
    # so the crossfades applied below are not overwritten
    result[mask == 0.0] = gen_trimmed[mask == 0.0]

    # Find mask transition points
    diff = np.diff(mask)
    starts = np.where(diff == -1.0)[0]  # Where mask goes from 1 to 0
    ends = np.where(diff == 1.0)[0]     # Where mask goes from 0 to 1

    for s in starts:
        fade_start = max(0, s - crossfade_samples // 2)
        fade_end = min(len(result), s + crossfade_samples // 2)
        fade = np.linspace(1.0, 0.0, fade_end - fade_start)
        result[fade_start:fade_end] = (
            original[fade_start:fade_end] * fade
            + gen_trimmed[fade_start:fade_end] * (1.0 - fade)
        )

    for e in ends:
        fade_start = max(0, e - crossfade_samples // 2)
        fade_end = min(len(result), e + crossfade_samples // 2)
        fade = np.linspace(0.0, 1.0, fade_end - fade_start)
        result[fade_start:fade_end] = (
            original[fade_start:fade_end] * fade
            + gen_trimmed[fade_start:fade_end] * (1.0 - fade)
        )

    return result

Save and Convert Audio Formats

The models output raw numpy arrays. You’ll often need to convert between formats for different use cases.

import soundfile as sf
import subprocess

# Save as WAV (lossless, large files)
sf.write("output.wav", audio, samplerate=16000)

# Save as FLAC (lossless, compressed ~50%)
sf.write("output.flac", audio, samplerate=16000)

# Save as OGG Vorbis (lossy, small files)
sf.write("output.ogg", audio, samplerate=16000)

# Convert to MP3 with ffmpeg (best for sharing)
subprocess.run([
    "ffmpeg", "-i", "output.wav",
    "-codec:a", "libmp3lame", "-qscale:a", "2",
    "output.mp3",
], check=True)

# Resample from 16kHz to 44.1kHz for better playback compatibility
from scipy.signal import resample

def resample_audio(audio, orig_sr, target_sr):
    """Simple FFT-based resampling using scipy."""
    num_samples = int(len(audio) * target_sr / orig_sr)
    return resample(audio, num_samples)

audio_44k = resample_audio(audio, 16000, 44100)
sf.write("output_44k.wav", audio_44k, samplerate=44100)

Use WAV during development – it’s lossless and every tool reads it. Convert to MP3 or OGG for distribution. If you’re feeding audio into another model downstream, stay in WAV to avoid compression artifacts.
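One related detail: the pipelines return float samples in [-1.0, 1.0], while some downstream tools expect 16-bit integer PCM. soundfile can write 16-bit directly (sf.write(..., subtype="PCM_16")), or you can convert manually – float_to_pcm16 below is my own helper:

```python
import numpy as np

def float_to_pcm16(audio):
    """Convert float audio in [-1, 1] to 16-bit integer PCM, clipping out-of-range samples."""
    clipped = np.clip(audio, -1.0, 1.0)
    return (clipped * 32767.0).astype(np.int16)
```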

Batch Generation Script

Here’s a complete script for generating multiple audio clips from a prompt file.

import argparse
import os

import torch
import soundfile as sf
from diffusers import AudioLDM2Pipeline

def main():
    parser = argparse.ArgumentParser(description="Batch audio generation with AudioLDM 2")
    parser.add_argument("--prompts-file", required=True, help="Text file with one prompt per line")
    parser.add_argument("--output-dir", default=".", help="Output directory")
    parser.add_argument("--duration", type=float, default=10.0, help="Audio duration in seconds")
    parser.add_argument("--steps", type=int, default=100, help="Inference steps")
    parser.add_argument("--guidance", type=float, default=3.5, help="Guidance scale")
    args = parser.parse_args()

    # Create the output directory up front so sf.write doesn't fail later
    os.makedirs(args.output_dir, exist_ok=True)

    pipe = AudioLDM2Pipeline.from_pretrained(
        "cvssp/audioldm2",
        torch_dtype=torch.float16,
    )
    pipe = pipe.to("cuda")

    with open(args.prompts_file) as f:
        prompts = [line.strip() for line in f if line.strip()]

    for i, prompt in enumerate(prompts):
        print(f"[{i+1}/{len(prompts)}] Generating: {prompt[:60]}...")
        audio = pipe(
            prompt=prompt,
            negative_prompt="noise, static, distortion, low quality",
            num_inference_steps=args.steps,
            audio_length_in_s=args.duration,
            guidance_scale=args.guidance,
        ).audios[0]

        output_path = f"{args.output_dir}/audio_{i:03d}.wav"
        sf.write(output_path, audio, samplerate=16000)
        print(f"  Saved {output_path}")

        # Free VRAM between generations
        torch.cuda.empty_cache()

if __name__ == "__main__":
    main()

Run it with a text file of prompts:

python batch_generate.py --prompts-file prompts.txt --output-dir ./generated --duration 15
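The prompts file is plain text with one description per line (blank lines are skipped). For example, a prompts.txt might look like:

```
rain falling on a tin roof with distant thunder
acoustic guitar fingerpicking, warm tone, folk style
city traffic at night, occasional distant sirens
```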

Common Errors

RuntimeError: CUDA out of memory

AudioLDM 2 needs about 6GB VRAM in float16. If you’re hitting limits, enable CPU offloading:

pipe.enable_model_cpu_offload()

This keeps model components on the CPU and moves each to the GPU only while it runs. Call it instead of pipe.to("cuda") – the offload hook manages device placement itself. Generation is slower, but peak VRAM usage drops roughly in half.

OSError: stabilityai/stable-audio-open-1.0 is not accessible

You need to accept the license on the Hugging Face model page first, then authenticate:

huggingface-cli login

Paste your access token when prompted. Make sure the token has read access.

Generated audio sounds metallic or robotic.

Your guidance_scale is probably too high. Drop it to 3.0-4.0 for AudioLDM 2 or 6.0-7.0 for Stable Audio Open. Also check that you’re using enough inference steps – below 50 the denoising process doesn’t converge properly.

Audio has clicks or pops at segment boundaries.

This happens when you concatenate clips or do inpainting without crossfades. Always apply a crossfade of at least 50ms (800 samples at 16kHz) at edit points. See the crossfade function in the inpainting section above.
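For plain clip-to-clip joins, the fix can be this small (crossfade_concat is my own helper, assuming mono numpy audio at 16kHz):

```python
import numpy as np

def crossfade_concat(a, b, fade_samples=800):
    """Join two clips with a linear crossfade; 800 samples = 50ms at 16kHz."""
    fade_out = np.linspace(1.0, 0.0, fade_samples)
    fade_in = np.linspace(0.0, 1.0, fade_samples)
    blended = a[-fade_samples:] * fade_out + b[:fade_samples] * fade_in
    return np.concatenate([a[:-fade_samples], blended, b[fade_samples:]])
```

The result is len(a) + len(b) - fade_samples samples long, with no hard edge at the joint.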

ValueError: Audio length is too long

AudioLDM 2 has an upper limit around 30 seconds. Stable Audio Open maxes out at 47 seconds. For longer audio, generate segments with overlapping regions and crossfade them:

import numpy as np

def generate_long_audio(pipe, prompt, total_duration, segment_duration=10.0, overlap=2.0):
    """Generate audio longer than the model's limit by stitching segments."""
    segments = []
    offset = 0.0
    while offset < total_duration:
        dur = min(segment_duration, total_duration - offset)
        audio = pipe(
            prompt=prompt,
            audio_length_in_s=dur,
            num_inference_steps=100,
        ).audios[0]
        segments.append(audio)
        offset += segment_duration - overlap

    # Crossfade overlapping regions (AudioLDM 2 outputs at 16kHz)
    overlap_samples = int(overlap * 16000)
    result = segments[0]
    for seg in segments[1:]:
        fade_out = np.linspace(1.0, 0.0, overlap_samples)
        fade_in = np.linspace(0.0, 1.0, overlap_samples)
        result[-overlap_samples:] = (
            result[-overlap_samples:] * fade_out + seg[:overlap_samples] * fade_in
        )
        result = np.concatenate([result, seg[overlap_samples:]])
    return result

ImportError: No module named 'soundfile'

On Linux, soundfile depends on libsndfile:

sudo apt install libsndfile1
pip install soundfile

On macOS, brew install libsndfile before installing the Python package.