Generate Music from a Text Prompt
AudioCraft is Meta’s open-source library for audio generation. MusicGen, the music model, takes a text description and produces a WAV file. Here’s the fastest path from install to audio.
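The original code block didn't survive formatting; here is a minimal sketch of that flow, assuming the standard `audiocraft` API (`MusicGen.get_pretrained`, `audio_write`) and a working PyTorch install:

```python
# pip install audiocraft
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

model = MusicGen.get_pretrained("facebook/musicgen-small")
model.set_generation_params(duration=8)  # seconds of audio to generate

# generate() takes a list of prompts and returns one waveform per prompt
wav = model.generate(["lo-fi hip hop beat with mellow piano"])

for idx, one_wav in enumerate(wav):
    # audio_write appends the .wav extension and normalizes loudness
    audio_write(f"output_{idx}", one_wav.cpu(), model.sample_rate, strategy="loudness")
```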
The generate method returns a tensor of shape (batch, channels, samples). The audio_write helper normalizes volume and saves it. That’s all you need for basic generation.
Pick the Right Model Size
MusicGen ships in four sizes. Bigger models produce more coherent compositions but need more VRAM and time.
| Model | Parameters | VRAM | Quality | Speed (30s clip) |
|---|---|---|---|---|
| facebook/musicgen-small | 300M | ~4GB | Good for prototyping | ~10s on RTX 3060 |
| facebook/musicgen-medium | 1.5B | ~8GB | Solid for most uses | ~25s on RTX 3060 |
| facebook/musicgen-large | 3.3B | ~16GB | Best text-to-music | ~60s on RTX 4090 |
| facebook/musicgen-melody | 1.5B | ~8GB | Melody conditioning | ~25s on RTX 3060 |
Start with musicgen-small to iterate on prompts quickly. Switch to musicgen-large when you need production-quality output. The musicgen-melody variant is special – it accepts an audio reference and follows the melody while applying your text description as a style guide.
Control Generation Parameters
MusicGen exposes several parameters that affect the output quality and creativity. Here’s a complete example with everything tuned.
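A sketch with the main knobs set explicitly (the values shown are reasonable starting points, not the only correct settings):

```python
from audiocraft.models import MusicGen

model = MusicGen.get_pretrained("facebook/musicgen-medium")
model.set_generation_params(
    duration=30,      # seconds; the model was trained on 30-second segments
    temperature=1.0,  # sampling temperature; higher means more random
    top_k=250,        # sample from the 250 most likely tokens at each step
    top_p=0.0,        # 0.0 disables nucleus sampling in favor of top_k
    cfg_coef=3.0,     # classifier-free guidance; higher follows the text more closely
)
wav = model.generate(["energetic drum and bass with heavy sub bass and fast breakbeats"])
```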
A few things worth knowing about these parameters:
- duration: MusicGen handles up to 30 seconds well. Beyond that, quality degrades because the model was trained on 30-second segments. For longer tracks, generate segments and crossfade them.
- temperature: Keep it between 0.8 and 1.2. Below 0.8 the output gets repetitive. Above 1.3 it starts to fall apart.
- top_k: 250 is the sweet spot. Lower values (50-100) make the output safer but less interesting. Higher values (500+) introduce more randomness.
- cfg_coef: Controls how closely the output follows your text description. 3.0 is the default. Push it to 5.0-8.0 if the output doesn't match your prompt well enough, but going past 8 tends to introduce artifacts.
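The segment-and-crossfade approach for longer tracks can be sketched with a plain PyTorch linear crossfade (`crossfade` is a hypothetical helper, not part of AudioCraft):

```python
import torch

def crossfade(a: torch.Tensor, b: torch.Tensor, sample_rate: int,
              overlap_seconds: float = 2.0) -> torch.Tensor:
    """Join two (channels, samples) clips with a linear crossfade."""
    n = int(sample_rate * overlap_seconds)
    fade_out = torch.linspace(1.0, 0.0, n)
    fade_in = torch.linspace(0.0, 1.0, n)
    # overlap region: a ramps down while b ramps up
    mixed = a[..., -n:] * fade_out + b[..., :n] * fade_in
    return torch.cat([a[..., :-n], mixed, b[..., n:]], dim=-1)
```

Two 30-second clips joined with a 2-second overlap yield 58 seconds of audio.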
Use Melody Conditioning
This is one of MusicGen’s best features. You feed it a reference audio file and a text description, and it generates new music that follows the melody of the reference but in the style you describe.
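A sketch using the melody checkpoint and its `generate_with_chroma` method; `reference.wav` stands in for your own reference recording:

```python
import torchaudio
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

model = MusicGen.get_pretrained("facebook/musicgen-melody")
model.set_generation_params(duration=15)

# load the reference audio whose melody you want to follow
melody, sr = torchaudio.load("reference.wav")

wav = model.generate_with_chroma(
    descriptions=["epic orchestral arrangement with strings and brass"],
    melody_wavs=melody[None],  # add a batch dimension
    melody_sample_rate=sr,
)
audio_write("orchestral_version", wav[0].cpu(), model.sample_rate, strategy="loudness")
```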
The melody conditioning extracts chroma features (pitch content) from your reference and uses them as a guide. The rhythm and timbre come from the text prompt. This means you can take a pop melody and render it as orchestral, or take a guitar riff and turn it into an electronic synth line.
Generate Sound Effects with AudioGen
AudioGen is AudioCraft’s model for non-music audio: sound effects, ambient noise, environmental sounds. The API mirrors MusicGen almost exactly.
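A minimal sketch with the `facebook/audiogen-medium` checkpoint; note how closely the call pattern matches MusicGen:

```python
from audiocraft.models import AudioGen
from audiocraft.data.audio import audio_write

model = AudioGen.get_pretrained("facebook/audiogen-medium")
model.set_generation_params(duration=5)

wav = model.generate([
    "dog barking in the distance with light rain",
    "footsteps on gravel",
])
for idx, one_wav in enumerate(wav):
    audio_write(f"sfx_{idx}", one_wav.cpu(), model.sample_rate, strategy="loudness")
```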
AudioGen is trained on environmental audio, not music. If you ask it for music you’ll get something that sounds like music playing in another room. Use MusicGen for music and AudioGen for everything else.
Build a Simple Music Generation Script
Here’s a complete, production-ready script that handles command-line arguments and error cases.
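One way such a script might look; the flag names (`--model`, `--duration`, `--output`) are illustrative choices, not a fixed convention:

```python
#!/usr/bin/env python3
"""Generate music from a text prompt with MusicGen."""
import argparse
import sys

import torch
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write


def main() -> int:
    parser = argparse.ArgumentParser(description="Text-to-music with MusicGen")
    parser.add_argument("prompt", help="text description of the music")
    parser.add_argument("--model", default="facebook/musicgen-small",
                        help="model checkpoint to load")
    parser.add_argument("--duration", type=float, default=10.0,
                        help="clip length in seconds (30 max recommended)")
    parser.add_argument("--output", default="output",
                        help="output file name, without extension")
    args = parser.parse_args()

    if args.duration > 30:
        print("warning: quality degrades past 30 seconds", file=sys.stderr)

    try:
        model = MusicGen.get_pretrained(args.model)
    except Exception as exc:  # checkpoint download or load failure
        print(f"failed to load {args.model}: {exc}", file=sys.stderr)
        return 1

    model.set_generation_params(duration=args.duration)
    try:
        wav = model.generate([args.prompt])
    except torch.cuda.OutOfMemoryError:
        print("CUDA out of memory: try --model facebook/musicgen-small",
              file=sys.stderr)
        return 1

    audio_write(args.output, wav[0].cpu(), model.sample_rate, strategy="loudness")
    print(f"wrote {args.output}.wav")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```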
Save that as musicgen_cli.py and run it:
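An invocation might look like this (the flags here are illustrative):

```shell
python musicgen_cli.py "warm acoustic folk with fingerpicked guitar" --duration 15 --output folk_demo
```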
Common Errors and Fixes
RuntimeError: CUDA out of memory
MusicGen medium needs about 8GB VRAM. If you’re tight on memory, drop to the small model or move to CPU (slow but works):
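A sketch of both fallbacks; `get_pretrained` accepts a `device` argument:

```python
from audiocraft.models import MusicGen

# smallest checkpoint, forced onto the CPU
model = MusicGen.get_pretrained("facebook/musicgen-small", device="cpu")
model.set_generation_params(duration=8)
wav = model.generate(["simple piano melody"])
```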
You can also free VRAM by clearing the cache between generations:
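A small helper for this, using standard PyTorch calls (`free_vram` is a hypothetical name):

```python
import gc
import torch

def free_vram() -> None:
    """Release cached GPU memory between generations."""
    gc.collect()  # drop unreachable Python objects holding tensors first
    if torch.cuda.is_available():
        torch.cuda.empty_cache()  # return cached blocks to the driver
```

Delete references to waveforms you no longer need (`del wav`) before calling it, or the cache release has nothing to reclaim.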
ImportError: No module named 'audiocraft'
The package name on PyPI is audiocraft. If pip can’t find it, install directly from the GitHub repo:
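Installing from the repository looks like:

```shell
pip install -U git+https://github.com/facebookresearch/audiocraft.git
```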
RuntimeError: Expected all tensors to be on the same device
This happens when the melody waveform is on CPU but the model is on GPU. Always move your input tensor to the model’s device:
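A sketch of the fix; the model exposes the device it was loaded on, and `reference.wav` is a placeholder path:

```python
import torchaudio
from audiocraft.models import MusicGen

model = MusicGen.get_pretrained("facebook/musicgen-melody")
melody, sr = torchaudio.load("reference.wav")

# move the input tensor to wherever the model lives before generating
melody = melody.to(model.device)
wav = model.generate_with_chroma(["synthwave version"], melody[None], sr)
```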
ValueError: audio_write got an unexpected keyword argument
Older versions of AudioCraft had a different audio_write signature. Update to the latest version:
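```shell
pip install -U audiocraft
```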
Generated audio sounds like noise or static.
Check your temperature – anything above 1.5 produces garbage. Also make sure your prompt is descriptive enough. “music” is too vague. “upbeat jazz quartet with walking bass line, tenor sax lead, medium swing tempo” gives the model much more to work with.
Audio cuts off abruptly at the end.
MusicGen doesn’t add fadeouts automatically. Apply one yourself:
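A linear fadeout in plain PyTorch (`apply_fadeout` is a hypothetical helper, not part of AudioCraft):

```python
import torch

def apply_fadeout(wav: torch.Tensor, sample_rate: int,
                  fade_seconds: float = 1.5) -> torch.Tensor:
    """Linearly fade the last fade_seconds of a (channels, samples) clip to silence."""
    n = min(int(sample_rate * fade_seconds), wav.shape[-1])
    faded = wav.clone()  # leave the original tensor untouched
    faded[..., -n:] *= torch.linspace(1.0, 0.0, n)
    return faded
```

Apply it to each generated waveform before `audio_write`.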
GPU Requirements
MusicGen runs on CPU but it’s painfully slow – a 30-second clip takes 10+ minutes on a modern CPU versus 30 seconds on a mid-range GPU. You want a GPU for anything beyond quick tests.
- musicgen-small (300M): Any GPU with 4GB+ VRAM. An RTX 3060 handles it easily.
- musicgen-medium (1.5B): 8GB VRAM minimum. RTX 3070 or better.
- musicgen-large (3.3B): 16GB VRAM. RTX 4090 or an A100.
- musicgen-melody (1.5B): Same as medium, about 8GB VRAM.
All models use float32 by default. Unlike image diffusion models, MusicGen doesn’t currently support float16 inference out of the box without modifications. The VRAM numbers above reflect float32 usage.
Related Guides
- How to Build Real-Time Voice Cloning with OpenVoice and Python
- How to Generate Videos with Stable Video Diffusion
- How to Generate Images with FLUX.2 in Python
- How to Generate and Edit Audio with Stable Audio and AudioLDM
- How to Build AI Background Music Generation with MusicGen
- How to Generate Images with Stable Diffusion in Python
- How to Generate 3D Models from Text and Images with AI
- How to Edit Images with AI Inpainting Using Stable Diffusion
- How to Build AI Clothing Try-On with Virtual Diffusion Models
- How to Control Image Generation with ControlNet and IP-Adapter