AudioGen is Meta’s text-to-sound-effects model. You give it a text description like “dog barking in the rain” or “car engine starting on a cold morning,” and it generates the corresponding audio. It’s not MusicGen — AudioGen is trained specifically on environmental sounds, foley, and sound effects. That makes it the right tool for game audio, film post-production, and content creation where you need quick, royalty-free SFX.
Here’s the fastest path to a generated sound effect:
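A minimal sketch, assuming `audiocraft` is installed (covered in the next section) and using its bundled `audio_write` helper:

```python
from audiocraft.models import AudioGen
from audiocraft.data.audio import audio_write

# Downloads the checkpoint from Hugging Face on first use (~1.5GB)
model = AudioGen.get_pretrained("facebook/audiogen-medium")
model.set_generation_params(duration=5)  # seconds of audio to generate

# One prompt in, one clip out: tensor of shape (batch, channels, samples)
wav = model.generate(["thunder rumbling with heavy rain"])

# audio_write appends the .wav extension and handles loudness normalization
audio_write("thunder_rain", wav[0].cpu(), model.sample_rate, strategy="loudness")
```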
That produces a 5-second WAV file of thunder and rain. The model runs on GPU by default, and the medium checkpoint is about 1.5GB. Everything below builds on this foundation.
Installing AudioCraft
AudioGen lives inside Meta’s audiocraft library alongside MusicGen. You need PyTorch with CUDA support for any reasonable generation speed.
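One way to set this up; pick the PyTorch index URL that matches your CUDA version:

```shell
# PyTorch with CUDA support first
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu121
# Then AudioCraft itself
pip install audiocraft
```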
If you’re on a machine without a GPU, it still works, just slowly. On a T4 or better, generation runs at close to real time.
Load the model once and reuse it across generations:
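Something like:

```python
from audiocraft.models import AudioGen

# Load once at startup and reuse for every generation call
model = AudioGen.get_pretrained("facebook/audiogen-medium")
model.set_generation_params(duration=5)
```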
The facebook/audiogen-medium checkpoint is the only publicly available AudioGen model. It downloads automatically from Hugging Face on first use. The set_generation_params call controls how audio is produced — duration is in seconds, and you can set it anywhere from 0.5 to 30 seconds, though quality degrades past 10–15 seconds.
Generating Sound Effects
The generate method accepts a list of text prompts and returns a tensor of shape (batch, channels, samples). Each prompt produces one audio clip.
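A sketch, assuming `model` is the loaded checkpoint from above; the prompts and filenames are just examples:

```python
import torchaudio

prompts = [
    "dog barking",
    "car engine starting on a cold morning",
    "glass shattering on a tile floor",
]
wav = model.generate(prompts)  # shape: (3, 1, samples)

for i, clip in enumerate(wav):
    # AudioGen's native sample rate is 16kHz
    torchaudio.save(f"sfx_{i}.wav", clip.cpu(), sample_rate=16000)
```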
AudioGen outputs audio at 16kHz. That’s its native sample rate — pass 16000 to torchaudio.save, not 44100 or 48000. If you need a higher sample rate for your project, resample after generation with torchaudio.transforms.Resample(16000, 48000).
The prompts work best when they’re descriptive but not overly long. “Dog barking” works. “A large golden retriever barking excitedly at a mailman while it rains heavily in a suburban neighborhood” is too specific — the model won’t capture all those details. Stick to the core sound and one or two modifiers.
Batch Processing for Game Assets
For a real project, you don’t want to hand-type prompts in a script. Define your sound effects in a JSON file and process them in batch.
Create a file called sfx_manifest.json:
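The field names here (`name`, `prompt`, `duration`) are one possible schema; your processing script just needs to agree with whatever you choose:

```json
{
  "sounds": [
    {"name": "door_creak", "prompt": "wooden door creaking open slowly", "duration": 4},
    {"name": "rain_loop", "prompt": "steady rain on a metal roof", "duration": 10},
    {"name": "footsteps_gravel", "prompt": "footsteps crunching on gravel", "duration": 5},
    {"name": "sword_clash", "prompt": "metal swords clashing", "duration": 2}
  ]
}
```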
Now process the entire manifest:
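A sketch of a batch runner, assuming the manifest schema above with `name`, `prompt`, and `duration` fields:

```python
import json
import torchaudio
from audiocraft.models import AudioGen

batch_size = 4  # safe for an 8GB card at ~5-second durations

model = AudioGen.get_pretrained("facebook/audiogen-medium")

with open("sfx_manifest.json") as f:
    sounds = json.load(f)["sounds"]

for start in range(0, len(sounds), batch_size):
    batch = sounds[start:start + batch_size]
    # One duration per generate() call, so use the longest in the batch
    model.set_generation_params(duration=max(s["duration"] for s in batch))
    wavs = model.generate([s["prompt"] for s in batch])
    for sound, wav in zip(batch, wavs):
        torchaudio.save(f"{sound['name']}.wav", wav.cpu(), sample_rate=16000)
```

Because all prompts in one `generate()` call share a single duration, shorter clips come back padded to the longest; trim with `wav[..., : duration * 16000]` if that matters for your assets.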
Batching matters for throughput. Generating 4 prompts at once is significantly faster than running them one at a time because the model parallelizes across the batch dimension on GPU. Adjust batch_size based on your VRAM — 4 is safe for an 8GB card at 5-second duration.
Controlling Generation Quality
AudioGen’s set_generation_params accepts several arguments beyond duration:
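For example, with the library's defaults spelled out:

```python
model.set_generation_params(
    duration=5,
    temperature=1.0,  # sampling randomness
    top_k=250,        # restrict sampling to the 250 most likely tokens
    top_p=0.0,        # 0.0 disables nucleus sampling in favor of top_k
)
```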
Here’s what each parameter does:
- temperature — Controls randomness. Lower values (0.5–0.8) produce more predictable, “clean” sounds. Higher values (1.2–1.5) introduce variation but can sound noisy. Default is 1.0.
- top_k — Limits the token pool to the top K most likely tokens at each step. 250 is a good starting point. Set to 0 to disable.
- top_p — Nucleus sampling. Keeps the smallest set of tokens whose cumulative probability exceeds p. 0.9–0.95 works well. Set to 0.0 to disable and use top_k instead.
For production SFX, I’d recommend generating 3–5 variations of each sound and picking the best:
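The simplest way is to repeat the prompt in one batch; sampling noise gives each copy a different take. The prompt and filenames here are placeholders:

```python
import torchaudio

prompt = "heavy wooden door slamming shut"
wavs = model.generate([prompt] * 5)  # five takes of the same prompt

for i, wav in enumerate(wavs):
    torchaudio.save(f"door_slam_take{i}.wav", wav.cpu(), sample_rate=16000)
```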
Listen to all five, pick the one that sounds right, and discard the rest. This is faster than tweaking parameters endlessly.
Common Errors and Fixes
CUDA out of memory when generating long audio:
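This surfaces as PyTorch's standard OOM error; the exact sizes vary by hardware and settings:

```
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate ... MiB
(GPU 0; ... GiB total capacity; ...)
```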
Long durations eat VRAM fast. A 10-second clip with batch size 4 can exceed 8GB. Fix it by reducing batch size, shortening duration, or generating on CPU as a fallback:
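A simple fallback sketch, selecting the device at load time:

```python
import torch
from audiocraft.models import AudioGen

# Use the GPU when available, otherwise fall back to (slow) CPU generation
device = "cuda" if torch.cuda.is_available() else "cpu"
model = AudioGen.get_pretrained("facebook/audiogen-medium", device=device)
```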
If you need to free GPU memory mid-pipeline, delete the model reference (del model) and call torch.cuda.empty_cache().
Wrong sample rate produces distorted audio:
If your output sounds like chipmunks or plays in slow motion, you passed the wrong sample rate to torchaudio.save. AudioGen always outputs at 16kHz. Use sample_rate=16000. If you need 48kHz for your project:
Model download fails or hangs:
The first call to get_pretrained downloads the checkpoint from Hugging Face Hub. If it fails, check your network connection and try setting the cache directory explicitly:
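In recent audiocraft releases the checkpoint cache location can be overridden with the AUDIOCRAFT_CACHE_DIR environment variable (it falls back to the standard Hugging Face cache otherwise); the path below is just an example:

```python
import os

# Must be set before the model is loaded
os.environ["AUDIOCRAFT_CACHE_DIR"] = "/data/model_cache"

from audiocraft.models import AudioGen
model = AudioGen.get_pretrained("facebook/audiogen-medium")
```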
Or pre-download the model files with huggingface-cli:
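For example:

```shell
huggingface-cli download facebook/audiogen-medium
```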
Tensor shape mismatch when saving:
If you index the batch tensor wrong, torchaudio.save will complain about unexpected dimensions. The output of model.generate() has shape (batch, channels, samples). For a single clip, use wav[0] to get shape (channels, samples), which is what torchaudio.save expects.
Related Guides
- How to Build AI Sticker and Emoji Generation with Stable Diffusion
- How to Build AI Motion Graphics Generation with Deforum Stable Diffusion
- How to Build AI Scene Generation with Layered Diffusion
- How to Build AI Wallpaper Generation with Stable Diffusion and Tiling
- How to Build AI Seamless Pattern Generation with Stable Diffusion
- How to Generate and Edit Audio with Stable Audio and AudioLDM
- How to Build AI Sketch-to-Image Generation with ControlNet Scribble
- How to Build AI Pixel Art Generation with Stable Diffusion
- How to Build Real-Time Image Generation with StreamDiffusion
- How to Build AI Clothing Try-On with Virtual Diffusion Models