Generate Your First Background Track
MusicGen is Meta’s text-to-music model available through Hugging Face Transformers. You describe the music you want, and it generates a waveform you can save as a WAV file. No external API keys, no rate limits – everything runs locally.
Install the dependencies first:
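A minimal install, assuming a standard pip environment (installing torch explicitly is safer than relying on it being pulled in as a dependency):

```shell
pip install transformers torch scipy
```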
Now generate a 10-second background track:
The max_new_tokens parameter maps roughly to audio duration. MusicGen’s codec emits about 50 token frames per second, so 256 tokens gives you about 5 seconds, 512 tokens about 10 seconds, and 1024 tokens about 20 seconds. These are approximate – the exact mapping depends on the model’s codec configuration.
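The mapping is easy to sanity-check. Assuming the approximate rate of 50 token frames per second:

```python
FRAME_RATE = 50  # approximate token frames per second for MusicGen's codec

for tokens in (256, 512, 1024):
    print(f"{tokens} tokens ~ {tokens / FRAME_RATE:.1f} s")
# 256 ~ 5.1 s, 512 ~ 10.2 s, 1024 ~ 20.5 s
```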
Control Generation Parameters
Background music needs to be consistent and non-distracting. You can shape the output by tuning several generation parameters.
Here’s what each parameter does in practice:
- guidance_scale: Controls prompt adherence. 3.0 is the default and works well for most prompts. Push it to 5.0-7.0 if the output doesn’t match your description. Above 8.0 you’ll start hearing artifacts.
- temperature: Stay between 0.8 and 1.2 for background music. Lower values produce more repetitive, predictable loops. Higher values introduce variety but risk incoherence past 1.3.
- do_sample: Must be True for temperature and top_k to have any effect. Without it, the model uses greedy decoding.
- top_k: 250 is the default sweet spot. Drop to 50-100 for safer, more conservative output. Raise to 500+ for more experimental results.
- max_new_tokens: The primary way to control track length. MusicGen quality degrades on very long sequences, so keep it under 1500 tokens (~30 seconds). For longer tracks, generate segments and crossfade.
Model Size Options
| Model | Parameters | VRAM | Best For |
|---|---|---|---|
| facebook/musicgen-small | 300M | ~4GB | Fast iteration, prototyping |
| facebook/musicgen-medium | 1.5B | ~8GB | Production background music |
| facebook/musicgen-large | 3.3B | ~16GB | Highest quality output |
| facebook/musicgen-melody | 1.5B | ~8GB | Melody-conditioned generation |
Start with musicgen-small for prompt experimentation. Move to musicgen-medium once you’ve nailed the prompt – it produces noticeably better harmonic structure and instrument separation.
Generate Multiple Variations
When you’re building background music for an app or video, you usually want several variations to pick from. MusicGen supports batch generation – pass multiple prompts in a single call.
You can also generate multiple variations of the same prompt by running generation multiple times with do_sample=True. Each run produces a different result because of the stochastic sampling.
Melody-Conditioned Generation
The musicgen-melody variant takes an existing audio file as a style reference. It extracts the melodic contour from your reference and generates new music that follows it while applying the style from your text prompt. This is powerful for creating background music that matches an existing video’s rhythm.
The melody model extracts chroma features (pitch information) from your reference. The rhythm, instrumentation, and timbre come from the text prompt. You could feed it an electronic track and get back an acoustic guitar arrangement that follows the same harmonic progression.
Common Errors and Fixes
RuntimeError: CUDA out of memory
The medium model needs about 8GB VRAM. Drop to the small model or run on CPU:
You can also clear GPU cache between generations:
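A minimal cleanup sketch; drop your references to large tensors first, then release PyTorch's cached blocks:

```python
import gc
import torch

# del audio_values  # release your own references to large tensors first
gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()
```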
scipy.io.wavfile.write produces a silent or distorted file
MusicGen outputs float32 audio in roughly the [-1, 1] range. scipy.io.wavfile.write handles float32 directly, but if your audio values are outside that range, the file sounds distorted. Clamp before saving:
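A sketch of the clamp; here random out-of-range data stands in for `audio_values[0, 0].numpy()` from a generation call:

```python
import numpy as np
import scipy.io.wavfile

# Simulated audio with values outside [-1, 1]
audio = np.random.uniform(-1.5, 1.5, 32000).astype(np.float32)

# Clamp into range before writing the WAV
clipped = np.clip(audio, -1.0, 1.0)
scipy.io.wavfile.write("clamped.wav", rate=32000, data=clipped)
```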
ValueError: Attention mask should be of size (batch, 1, tgt_len, src_len)
This usually means your processor inputs aren’t padded correctly for batch generation. Make sure you pass padding=True:
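A sketch of a correctly padded batch; padding=True pads the shorter prompt so the batch has a uniform attention mask:

```python
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("facebook/musicgen-small")

inputs = processor(
    text=["short prompt", "a much longer, more descriptive prompt about ambient synth pads"],
    padding=True,  # without this, mismatched prompt lengths break batch generation
    return_tensors="pt",
)
```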
Generated audio sounds like noise or static
Your temperature is probably too high. Keep it at or below 1.2. Also write more descriptive prompts – “music” won’t give you much. Be specific about instruments, tempo, mood, and genre: “mellow ambient background with warm synth pads, 80 bpm, dreamy reverb” works much better than “background music.”
Audio cuts off abruptly
MusicGen doesn’t add fadeouts. Apply one manually:
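A sketch of a linear fadeout; the synthetic tone stands in for `audio_values[0, 0].numpy()` from a generation call:

```python
import numpy as np

def apply_fadeout(audio: np.ndarray, rate: int, duration: float = 2.0) -> np.ndarray:
    """Linearly fade the last `duration` seconds down to silence."""
    n = min(int(rate * duration), len(audio))
    out = audio.copy()
    out[-n:] *= np.linspace(1.0, 0.0, n, dtype=audio.dtype)
    return out

# Demo on a 5-second 440 Hz tone at MusicGen's 32 kHz rate
rate = 32000
tone = np.sin(2 * np.pi * 440 * np.arange(rate * 5) / rate).astype(np.float32)
faded = apply_fadeout(tone, rate)
```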
OSError: facebook/musicgen-small is not a local folder and is not a valid model identifier
You’re probably running an older version of Transformers that doesn’t include MusicGen support. Update to 4.31.0 or later:
Related Guides
- How to Build Real-Time Image Generation with StreamDiffusion
- How to Build AI Sprite Sheet Generation with Stable Diffusion
- How to Build AI Logo Generation with Stable Diffusion and SDXL
- How to Build AI Coloring Book Generation with Line Art Diffusion
- How to Build AI Pixel Art Generation with Stable Diffusion
- How to Build AI Font Generation with Diffusion Models
- How to Build AI Wireframe to UI Generation with Diffusion Models
- How to Generate Music with Meta AudioCraft
- How to Build AI Sticker and Emoji Generation with Stable Diffusion
- How to Build AI Sketch-to-Image Generation with ControlNet Scribble