Generate Your First Background Track

MusicGen is Meta’s text-to-music model available through Hugging Face Transformers. You describe the music you want, and it generates a waveform you can save as a WAV file. No external API keys, no rate limits – everything runs locally.

Install the dependencies first:

pip install transformers torch scipy

Now generate a 10-second background track:

from transformers import AutoProcessor, MusicgenForConditionalGeneration
import scipy.io.wavfile

processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")

inputs = processor(
    text=["calm lo-fi beats with soft piano and vinyl crackle, suitable for studying"],
    padding=True,
    return_tensors="pt",
)

# max_new_tokens controls duration: ~50 tokens per second at 32kHz
audio_values = model.generate(**inputs, max_new_tokens=512)

# Output shape: (batch, 1, samples) — squeeze to 1D
audio_data = audio_values[0, 0].cpu().numpy()
sample_rate = model.config.audio_encoder.sampling_rate  # 32000

scipy.io.wavfile.write("lofi_background.wav", sample_rate, audio_data)
print(f"Saved lofi_background.wav ({len(audio_data) / sample_rate:.1f}s)")

The max_new_tokens parameter maps roughly to audio duration. At MusicGen's frame rate of about 50 tokens per second, 256 tokens gives you about 5 seconds, 512 tokens about 10 seconds, and 1024 tokens about 20 seconds. These are approximate: the exact mapping depends on the model's codec configuration.
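If you pick durations often, a small helper saves the mental arithmetic. This is my own sketch, not part of the library; it assumes the ~50 tokens-per-second rate described above, so treat the results as approximate:

```python
# Approximate conversion between seconds and MusicGen tokens.
# Assumes ~50 audio tokens per second; the exact rate depends on the
# checkpoint's codec configuration.
TOKENS_PER_SECOND = 50

def seconds_to_tokens(seconds: float) -> int:
    """Rough max_new_tokens value for a target duration."""
    return int(seconds * TOKENS_PER_SECOND)

def tokens_to_seconds(tokens: int) -> float:
    """Rough duration produced by a given max_new_tokens value."""
    return tokens / TOKENS_PER_SECOND

# e.g. model.generate(**inputs, max_new_tokens=seconds_to_tokens(10))
```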

Control Generation Parameters

Background music needs to be consistent and non-distracting. You can shape the output by tuning several generation parameters.

from transformers import AutoProcessor, MusicgenForConditionalGeneration
import scipy.io.wavfile

processor = AutoProcessor.from_pretrained("facebook/musicgen-medium")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-medium")

inputs = processor(
    text=["upbeat electronic background music with soft synth pads and a steady four-on-the-floor beat"],
    padding=True,
    return_tensors="pt",
)

audio_values = model.generate(
    **inputs,
    max_new_tokens=1024,       # ~20 seconds of audio
    guidance_scale=3.0,        # How closely to follow the text prompt (default 3.0)
    temperature=1.0,           # Creativity: lower = safer, higher = wilder
    do_sample=True,            # Enable sampling (required for temperature to take effect)
    top_k=250,                 # Limit sampling to top 250 tokens
    top_p=1.0,                 # Nucleus sampling (1.0 = disabled in Transformers)
)

audio_data = audio_values[0, 0].cpu().numpy()
sample_rate = model.config.audio_encoder.sampling_rate

scipy.io.wavfile.write("electronic_bg.wav", sample_rate, audio_data)

Here’s what each parameter does in practice:

  • guidance_scale: Controls prompt adherence. 3.0 is the default and works well for most prompts. Push it to 5.0-7.0 if the output doesn’t match your description. Above 8.0 you’ll start hearing artifacts.
  • temperature: Stay between 0.8 and 1.2 for background music. Lower values produce more repetitive, predictable loops. Higher values introduce variety but risk incoherence past 1.3.
  • do_sample: Must be True for temperature and top_k to have any effect. Without it, the model uses greedy decoding.
  • top_k: 250 is the default sweet spot. Drop to 50-100 for safer, more conservative output. Raise to 500+ for more experimental results.
  • max_new_tokens: The primary way to control track length. MusicGen quality degrades on very long sequences, so keep it under 1500 tokens (~30 seconds). For longer tracks, generate segments and crossfade.
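The segment-and-crossfade approach can be sketched in NumPy. This is my own helper, not part of Transformers; it assumes both segments are 1-D float arrays at the same sample rate:

```python
import numpy as np

def crossfade(a: np.ndarray, b: np.ndarray, sample_rate: int,
              fade_seconds: float = 1.0) -> np.ndarray:
    """Join two audio segments, overlapping the tail of `a` with the
    head of `b` under a linear crossfade."""
    n = int(fade_seconds * sample_rate)
    n = min(n, len(a), len(b))
    fade_out = np.linspace(1.0, 0.0, n)
    fade_in = np.linspace(0.0, 1.0, n)
    overlap = a[-n:] * fade_out + b[:n] * fade_in
    return np.concatenate([a[:-n], overlap, b[n:]])

# track = crossfade(segment_1, segment_2, sample_rate, fade_seconds=1.5)
```

Because the generated segments are independent, a crossfade hides the seam but won't align beats; keeping segments in the same key and tempo via the prompt helps.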

Model Size Options

Model                       Parameters   VRAM    Best For
facebook/musicgen-small     300M         ~4GB    Fast iteration, prototyping
facebook/musicgen-medium    1.5B         ~8GB    Production background music
facebook/musicgen-large     3.3B         ~16GB   Highest quality output
facebook/musicgen-melody    1.5B         ~8GB    Melody-conditioned generation

Start with musicgen-small for prompt experimentation. Move to musicgen-medium once you’ve nailed the prompt – it produces noticeably better harmonic structure and instrument separation.

Generate Multiple Variations

When you’re building background music for an app or video, you usually want several variations to pick from. MusicGen supports batch generation – pass multiple prompts in a single call.

from transformers import AutoProcessor, MusicgenForConditionalGeneration
import scipy.io.wavfile

processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")

prompts = [
    "gentle ambient background music with warm pads and subtle chimes, slow tempo",
    "light acoustic guitar background music, fingerpicking style, relaxed and airy",
    "minimal electronic background music with deep bass and sparse hi-hats, chill",
]

inputs = processor(text=prompts, padding=True, return_tensors="pt")

audio_values = model.generate(**inputs, max_new_tokens=512, do_sample=True)

sample_rate = model.config.audio_encoder.sampling_rate

for i in range(audio_values.shape[0]):
    audio_data = audio_values[i, 0].cpu().numpy()
    filename = f"variation_{i}.wav"
    scipy.io.wavfile.write(filename, sample_rate, audio_data)
    print(f"Saved {filename}")

You can also generate multiple variations of the same prompt by running generation multiple times with do_sample=True. Each run produces a different result because of the stochastic sampling.

prompt = "soft jazz background music with brushed drums and muted trumpet, lounge feel"

for variation in range(4):
    inputs = processor(text=[prompt], padding=True, return_tensors="pt")
    audio_values = model.generate(**inputs, max_new_tokens=512, do_sample=True)
    audio_data = audio_values[0, 0].cpu().numpy()
    scipy.io.wavfile.write(f"jazz_v{variation}.wav", sample_rate, audio_data)
    print(f"Saved jazz_v{variation}.wav")

Melody-Conditioned Generation

The musicgen-melody variant takes an existing audio file as a style reference. It extracts the melodic contour from your reference and generates new music that follows it while applying the style from your text prompt. This is powerful for creating background music that matches an existing video’s rhythm.

from transformers import AutoProcessor, MusicgenMelodyForConditionalGeneration
import torch
import numpy as np
import scipy.io.wavfile

processor = AutoProcessor.from_pretrained("facebook/musicgen-melody")
# The melody checkpoints use their own model class in Transformers
model = MusicgenMelodyForConditionalGeneration.from_pretrained("facebook/musicgen-melody")

sample_rate_model = model.config.audio_encoder.sampling_rate  # 32000

# Load a reference melody WAV file
ref_rate, ref_audio = scipy.io.wavfile.read("reference_melody.wav")

# Convert to float32 and normalize to [-1, 1]
if ref_audio.dtype == np.int16:
    ref_audio = ref_audio.astype(np.float32) / 32768.0
elif ref_audio.dtype == np.int32:
    ref_audio = ref_audio.astype(np.float32) / 2147483648.0

# If stereo, take the first channel
if ref_audio.ndim == 2:
    ref_audio = ref_audio[:, 0]

# Convert to tensor: shape (1, num_samples) for the processor
ref_tensor = torch.tensor(ref_audio).unsqueeze(0)

inputs = processor(
    audio=ref_tensor,
    sampling_rate=ref_rate,
    text=["orchestral background music with strings and woodwinds, cinematic and warm"],
    padding=True,
    return_tensors="pt",
)

audio_values = model.generate(**inputs, max_new_tokens=512, guidance_scale=3.0)

audio_data = audio_values[0, 0].cpu().numpy()
scipy.io.wavfile.write("orchestral_reinterpretation.wav", sample_rate_model, audio_data)
print("Saved orchestral_reinterpretation.wav")

The melody model extracts chroma features (pitch information) from your reference. The rhythm, instrumentation, and timbre come from the text prompt. You could feed it an electronic track and get back an acoustic guitar arrangement that follows the same harmonic progression.
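One practical wrinkle: the model operates at 32 kHz, while most reference WAVs are 44.1 or 48 kHz. If your reference isn't at the model's rate, you can resample it before passing it to the processor. A sketch using SciPy (my own helper, assuming a 1-D float array):

```python
import math
import numpy as np
from scipy.signal import resample_poly

def resample_audio(audio: np.ndarray, orig_rate: int,
                   target_rate: int = 32000) -> np.ndarray:
    """Polyphase resampling from orig_rate to target_rate."""
    if orig_rate == target_rate:
        return audio
    # resample_poly wants an integer up/down ratio
    g = math.gcd(orig_rate, target_rate)
    return resample_poly(audio, target_rate // g, orig_rate // g)

# ref_audio = resample_audio(ref_audio, ref_rate)  # then pass sampling_rate=32000
```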

Common Errors and Fixes

RuntimeError: CUDA out of memory

The medium model needs about 8GB VRAM. Drop to the small model or run on CPU:

model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")
model = model.to("cpu")

You can also clear GPU cache between generations:

import torch
torch.cuda.empty_cache()

scipy.io.wavfile.write produces a silent or distorted file

MusicGen outputs float32 audio in roughly the [-1, 1] range. scipy.io.wavfile.write handles float32 directly, but if your audio values are outside that range, the file sounds distorted. Clamp before saving:

import numpy as np
audio_data = np.clip(audio_data, -1.0, 1.0)
scipy.io.wavfile.write("output.wav", sample_rate, audio_data)

ValueError: Attention mask should be of size (batch, 1, tgt_len, src_len)

This usually means your processor inputs aren’t padded correctly for batch generation. Make sure you pass padding=True:

inputs = processor(text=prompts, padding=True, return_tensors="pt")

Generated audio sounds like noise or static

Your temperature is probably too high. Keep it at or below 1.2. Also write more descriptive prompts – “music” won’t give you much. Be specific about instruments, tempo, mood, and genre: “mellow ambient background with warm synth pads, 80 bpm, dreamy reverb” works much better than “background music.”

Audio cuts off abruptly

MusicGen doesn’t add fadeouts. Apply one manually:

import numpy as np

def apply_fadeout(audio, sample_rate, fade_seconds=2.0):
    fade_length = int(fade_seconds * sample_rate)
    fade_curve = np.linspace(1.0, 0.0, fade_length)
    audio[-fade_length:] *= fade_curve
    return audio

audio_data = apply_fadeout(audio_data, sample_rate, fade_seconds=3.0)
scipy.io.wavfile.write("with_fadeout.wav", sample_rate, audio_data)

OSError: facebook/musicgen-small is not a local folder and is not a valid model identifier

You’re probably running an older version of Transformers that doesn’t include MusicGen support. Update to 4.31.0 or later:

pip install --upgrade transformers