AudioGen is Meta’s text-to-sound-effects model. You give it a text description like “dog barking in the rain” or “car engine starting on a cold morning,” and it generates the corresponding audio. It’s not MusicGen — AudioGen is trained specifically on environmental sounds, foley, and sound effects. That makes it the right tool for game audio, film post-production, and content creation where you need quick, royalty-free SFX.

Here’s the fastest path to a generated sound effect:

from audiocraft.models import AudioGen
import torchaudio

model = AudioGen.get_pretrained("facebook/audiogen-medium")
model.set_generation_params(duration=5.0)

wav = model.generate(["thunder and heavy rain on a metal roof"])
torchaudio.save("thunder_rain.wav", wav[0].cpu(), sample_rate=16000)

That produces a 5-second WAV file of thunder and rain. The model runs on GPU by default, and the medium checkpoint is about 1.5GB. Everything below builds on this foundation.

Installing AudioCraft

AudioGen lives inside Meta’s audiocraft library alongside MusicGen. You need PyTorch with CUDA support for any reasonable generation speed.

pip install audiocraft torch torchaudio

If you’re on a machine without a GPU, it still works — just slowly. For a T4 or better, generation is near real-time.

Load the model once and reuse it across generations:

from audiocraft.models import AudioGen

model = AudioGen.get_pretrained("facebook/audiogen-medium")
model.set_generation_params(duration=5.0)

The facebook/audiogen-medium checkpoint is the only publicly available AudioGen model. It downloads automatically from Hugging Face on first use. The set_generation_params call controls how audio is produced — duration is in seconds, and you can set it anywhere from 0.5 to 30 seconds, though quality degrades past 10–15 seconds.

Generating Sound Effects

The generate method accepts a list of text prompts and returns a tensor of shape (batch, channels, samples). Each prompt produces one audio clip.

from audiocraft.models import AudioGen
import torchaudio

model = AudioGen.get_pretrained("facebook/audiogen-medium")
model.set_generation_params(duration=5.0)

prompts = [
    "dog barking in rain",
    "car engine starting then idling",
    "crowd cheering in a large stadium",
    "glass shattering on a tile floor",
    "footsteps on gravel path at night",
]

wav = model.generate(prompts)

for i, prompt in enumerate(prompts):
    filename = prompt.replace(" ", "_")[:50] + ".wav"
    torchaudio.save(filename, wav[i].cpu(), sample_rate=16000)
    print(f"Saved {filename}")

AudioGen outputs audio at 16kHz. That’s its native sample rate — pass 16000 to torchaudio.save, not 44100 or 48000. If you need a higher sample rate for your project, resample after generation with torchaudio.transforms.Resample(16000, 48000).

The prompts work best when they’re descriptive but not overly long. “Dog barking” works. “A large golden retriever barking excitedly at a mailman while it rains heavily in a suburban neighborhood” is too specific — the model won’t capture all those details. Stick to the core sound and one or two modifiers.
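One practical wrinkle with the loop above: deriving filenames with `prompt.replace(" ", "_")[:50]` keeps punctuation like commas in the filename. Here is a slightly safer slug helper as a sketch; the `slug` name is mine, not part of audiocraft:

```python
import re

def slug(prompt: str, max_len: int = 50) -> str:
    """Turn a free-text prompt into a safe filename stem."""
    # Keep letters, digits, and spaces; drop punctuation.
    cleaned = re.sub(r"[^A-Za-z0-9 ]+", "", prompt)
    # Collapse whitespace runs into single underscores, then truncate.
    return re.sub(r"\s+", "_", cleaned.strip())[:max_len]

print(slug("glass shattering, on a tile floor!"))  # glass_shattering_on_a_tile_floor
```

Append `.wav` (or run it through your engine's asset-naming convention) when saving.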

Batch Processing for Game Assets

For a real project, you don’t want to hand-type prompts in a script. Define your sound effects in a JSON file and process them in batch.

Create a file called sfx_manifest.json:

{
  "effects": [
    {"name": "sword_clash", "prompt": "metal swords clashing together", "duration": 2.0},
    {"name": "arrow_fly", "prompt": "arrow flying through the air and hitting wood", "duration": 3.0},
    {"name": "fire_crackle", "prompt": "campfire crackling and popping", "duration": 8.0},
    {"name": "door_creak", "prompt": "old wooden door creaking open slowly", "duration": 4.0},
    {"name": "horse_gallop", "prompt": "horse galloping on dirt road", "duration": 5.0},
    {"name": "rain_forest", "prompt": "rain falling in a dense forest with birds", "duration": 10.0},
    {"name": "explosion_distant", "prompt": "distant explosion with debris falling", "duration": 4.0},
    {"name": "water_splash", "prompt": "large splash in water", "duration": 3.0},
    {"name": "wind_howling", "prompt": "strong wind howling through mountains", "duration": 7.0},
    {"name": "chains_rattle", "prompt": "heavy metal chains rattling and dragging", "duration": 3.0},
    {"name": "wolf_howl", "prompt": "wolf howling at night in the distance", "duration": 5.0},
    {"name": "footsteps_stone", "prompt": "slow footsteps echoing in a stone hallway", "duration": 6.0}
  ]
}

Now process the entire manifest:

import json
import os
from audiocraft.models import AudioGen
import torchaudio

model = AudioGen.get_pretrained("facebook/audiogen-medium")

output_dir = "generated_sfx"
os.makedirs(output_dir, exist_ok=True)

with open("sfx_manifest.json") as f:
    manifest = json.load(f)

batch_size = 4
effects = manifest["effects"]

for i in range(0, len(effects), batch_size):
    batch = effects[i : i + batch_size]
    prompts = [e["prompt"] for e in batch]
    durations = [e["duration"] for e in batch]

    # AudioGen applies duration globally, so group by duration
    # or set to the max duration in the batch and trim later
    max_duration = max(durations)
    model.set_generation_params(duration=max_duration)

    wav = model.generate(prompts)

    for j, effect in enumerate(batch):
        target_samples = int(effect["duration"] * 16000)
        trimmed = wav[j, :, :target_samples]
        filepath = os.path.join(output_dir, f"{effect['name']}.wav")
        torchaudio.save(filepath, trimmed.cpu(), sample_rate=16000)
        print(f"Generated {filepath} ({effect['duration']}s)")

print(f"Done. {len(effects)} sound effects saved to {output_dir}/")

Batching matters for throughput. Generating 4 prompts at once is significantly faster than running them one at a time because the model parallelizes across the batch dimension on GPU. Adjust batch_size based on your VRAM — 4 is safe for an 8GB card at 5-second duration.
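The comment in the script above about grouping by duration can be made concrete. Below is a pure-Python sketch (the `group_by_duration` helper is a name I made up) that buckets effects so every batch shares a single duration, which avoids the generate-at-max-then-trim step entirely:

```python
from collections import defaultdict

def group_by_duration(effects, batch_size=4):
    """Yield (duration, batch) pairs where every effect in a batch shares one duration."""
    buckets = defaultdict(list)
    for e in effects:
        buckets[e["duration"]].append(e)
    # Process shortest durations first; split oversized buckets into batches.
    for duration, items in sorted(buckets.items()):
        for i in range(0, len(items), batch_size):
            yield duration, items[i : i + batch_size]

effects = [
    {"name": "sword_clash", "prompt": "metal swords clashing together", "duration": 2.0},
    {"name": "water_splash", "prompt": "large splash in water", "duration": 3.0},
    {"name": "arrow_fly", "prompt": "arrow flying through the air", "duration": 3.0},
    {"name": "door_creak", "prompt": "old wooden door creaking open", "duration": 4.0},
]

for duration, batch in group_by_duration(effects):
    # In the real pipeline you would call:
    #   model.set_generation_params(duration=duration)
    #   wav = model.generate([e["prompt"] for e in batch])
    print(duration, [e["name"] for e in batch])
```

The trade-off: grouping can produce smaller (less GPU-efficient) batches when durations are scattered, so for large manifests it helps to standardize on a few duration tiers.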

Controlling Generation Quality

AudioGen’s set_generation_params accepts several arguments beyond duration:

from audiocraft.models import AudioGen
import torchaudio

model = AudioGen.get_pretrained("facebook/audiogen-medium")

# Conservative settings — more predictable, less creative
model.set_generation_params(
    duration=5.0,
    temperature=0.8,
    top_k=250,
    top_p=0.0,  # disabled when set to 0
)
wav_conservative = model.generate(["ocean waves crashing on rocks"])

# Creative settings — more variation, sometimes surprising
model.set_generation_params(
    duration=5.0,
    temperature=1.3,
    top_k=0,  # disabled when set to 0
    top_p=0.95,
)
wav_creative = model.generate(["ocean waves crashing on rocks"])

torchaudio.save("waves_conservative.wav", wav_conservative[0].cpu(), sample_rate=16000)
torchaudio.save("waves_creative.wav", wav_creative[0].cpu(), sample_rate=16000)

Here’s what each parameter does:

  • temperature — Controls randomness. Lower values (0.5–0.8) produce more predictable, “clean” sounds. Higher values (1.2–1.5) introduce variation but can sound noisy. Default is 1.0.
  • top_k — Limits the token pool to the top K most likely tokens at each step. 250 is a good starting point. Set to 0 to disable.
  • top_p — Nucleus sampling. Keeps the smallest set of tokens whose cumulative probability exceeds p. 0.9–0.95 works well. Set to 0.0 to disable and use top_k instead.

For production SFX, I’d recommend generating 3–5 variations of each sound and picking the best:

from audiocraft.models import AudioGen
import torchaudio

model = AudioGen.get_pretrained("facebook/audiogen-medium")
model.set_generation_params(duration=5.0, temperature=1.0, top_k=250)

prompt = "heavy rain on a car windshield"
variations = 5

# Generate all variations in one batch
prompts = [prompt] * variations
wav = model.generate(prompts)

for i in range(variations):
    torchaudio.save(f"rain_windshield_v{i + 1}.wav", wav[i].cpu(), sample_rate=16000)
    print(f"Saved variation {i + 1}")

Listen to all five, pick the one that sounds right, and discard the rest. This is faster than tweaking parameters endlessly.
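When you generate many variations, a rough automatic pre-filter can save listening time. One heuristic (my assumption, not an AudioGen feature) is to drop near-silent clips by RMS energy before auditioning the rest; the threshold here is arbitrary and worth tuning:

```python
import math

def rms(samples):
    """Root-mean-square energy of a list of float samples in [-1, 1]."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def drop_near_silent(clips, threshold=0.01):
    """Keep only clips whose RMS suggests they contain actual sound."""
    return [c for c in clips if rms(c) >= threshold]

clips = [[0.0] * 100, [0.2, -0.3] * 50]  # one silent clip, one with signal
survivors = drop_near_silent(clips)
print(len(survivors))  # 1
```

With real output, convert each `wav[i]` to a flat float list (e.g. `wav[i].flatten().tolist()`) before passing it in, or port the same check to tensor ops.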

Common Errors and Fixes

CUDA out of memory when generating long audio:

torch.cuda.OutOfMemoryError: CUDA out of memory.

Long durations eat VRAM fast. A 10-second clip with batch size 4 can exceed 8GB. Fix it by reducing batch size, shortening duration, or generating on CPU as a fallback:

import torch
from audiocraft.models import AudioGen

if torch.cuda.is_available():
    model = AudioGen.get_pretrained("facebook/audiogen-medium")
else:
    model = AudioGen.get_pretrained("facebook/audiogen-medium", device="cpu")

You can also move a loaded model to CPU with model.to("cpu") if you need to free GPU memory mid-pipeline.
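If OOM errors are intermittent rather than guaranteed, another option is a retry loop that halves the batch size until generation fits. This is a generic sketch, not audiocraft API: `generate_fn` stands in for a call like `lambda batch: model.generate(batch)`:

```python
def generate_with_fallback(prompts, generate_fn, min_batch=1):
    """Retry generation with progressively smaller batches on OOM errors."""
    batch = len(prompts)
    while batch >= min_batch:
        try:
            results = []
            for i in range(0, len(prompts), batch):
                results.extend(generate_fn(prompts[i : i + batch]))
            return results
        except RuntimeError as e:
            # Only treat CUDA OOM as retryable; re-raise anything else.
            if "out of memory" not in str(e).lower():
                raise
            batch //= 2  # halve the batch and start over from scratch
    raise RuntimeError("could not generate even at the minimum batch size")

# Demo with a fake backend that OOMs on batches larger than 2:
def fake_generate(batch):
    if len(batch) > 2:
        raise RuntimeError("CUDA out of memory.")
    return [f"wav:{p}" for p in batch]

out = generate_with_fallback(["a", "b", "c", "d", "e"], fake_generate)
print(out)  # ['wav:a', 'wav:b', 'wav:c', 'wav:d', 'wav:e']
```

Note the sketch restarts from the beginning after an OOM; in a real pipeline you may also want to call `torch.cuda.empty_cache()` between retries.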

Wrong sample rate produces distorted audio:

If your output sounds like chipmunks or plays in slow motion, you passed the wrong sample rate to torchaudio.save. AudioGen always outputs at 16kHz. Use sample_rate=16000. If you need 48kHz for your project:

import torchaudio
from torchaudio.transforms import Resample

resampler = Resample(orig_freq=16000, new_freq=48000)
wav_48k = resampler(wav[0].cpu())
torchaudio.save("output_48k.wav", wav_48k, sample_rate=48000)

Model download fails or hangs:

The first call to get_pretrained downloads the checkpoint from Hugging Face Hub. If it fails, check your network connection and try setting the cache directory explicitly:

export TORCH_HOME=/path/to/your/cache

Or pre-download the model files with huggingface-cli:

pip install huggingface_hub
huggingface-cli download facebook/audiogen-medium

Tensor shape mismatch when saving:

If you index the batch tensor wrong, torchaudio.save will complain about unexpected dimensions. The output of model.generate() has shape (batch, channels, samples). For a single clip, use wav[0] to get shape (channels, samples), which is what torchaudio.save expects.
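The indexing rule can be made explicit with a tiny shape check on plain tuples (no torch needed); the helper name is mine, for illustration only:

```python
def saveable_shape(shape):
    """Reduce a model.generate() output shape to what torchaudio.save expects.

    (batch, channels, samples) -> (channels, samples), i.e. index with wav[i].
    """
    if len(shape) == 3:
        _batch, channels, samples = shape
        return (channels, samples)
    if len(shape) == 2:
        return shape  # already (channels, samples)
    raise ValueError(f"unexpected shape: {shape}")

# 5 seconds of mono audio at 16 kHz is 80,000 samples:
print(saveable_shape((4, 1, 80000)))  # (1, 80000)
```

If you see a 1-D tensor instead, you indexed one level too deep (e.g. `wav[0][0]`); add the channel dimension back with `.unsqueeze(0)` before saving.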