Generate Music from a Text Prompt

AudioCraft is Meta’s open-source library for audio generation. MusicGen, its music model, turns a text description into an audio waveform you can save as a WAV file. Here’s the fastest path from install to audio.

pip install audiocraft torch torchaudio
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

# Load the small model (~1GB download on first run)
model = MusicGen.get_pretrained("facebook/musicgen-small")
model.set_generation_params(duration=10)  # 10 seconds of audio

# Generate from a text prompt
wav = model.generate(["upbeat electronic track with synth pads and a driving beat"])

# Save as WAV file (adds .wav extension automatically)
audio_write("output", wav[0].cpu(), model.sample_rate, strategy="loudness")
print("Saved output.wav")

The generate method returns a tensor of shape (batch, channels, samples). The audio_write helper normalizes volume and saves it. That’s all you need for basic generation.
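As a quick sanity check on that shape, here’s a small sketch that turns a generated tensor’s shape into human-readable facts. The describe_batch helper is hypothetical (not part of AudioCraft); the constants are real: MusicGen outputs mono audio at 32 kHz.

```python
# Hypothetical helper: summarize a MusicGen output tensor of shape
# (batch, channels, samples). MusicGen generates mono audio at 32 kHz.
def describe_batch(shape, sample_rate):
    batch, channels, samples = shape
    return {
        "tracks": batch,
        "channels": channels,
        "duration_s": samples / sample_rate,
    }

# A 10-second mono clip from a single-prompt batch:
print(describe_batch((1, 1, 320000), 32000))
# {'tracks': 1, 'channels': 1, 'duration_s': 10.0}
```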

Pick the Right Model Size

MusicGen ships in four sizes. Bigger models produce more coherent compositions but need more VRAM and time.

Model                     Parameters   VRAM    Quality                Speed (30s clip)
facebook/musicgen-small   300M         ~4GB    Good for prototyping   ~10s on RTX 3060
facebook/musicgen-medium  1.5B         ~8GB    Solid for most uses    ~25s on RTX 3060
facebook/musicgen-large   3.3B         ~16GB   Best text-to-music     ~60s on RTX 4090
facebook/musicgen-melody  1.5B         ~8GB    Melody conditioning    ~25s on RTX 3060

Start with musicgen-small to iterate on prompts quickly. Switch to musicgen-large when you need production-quality output. The musicgen-melody variant is special – it accepts an audio reference and follows the melody while applying your text description as a style guide.

Control Generation Parameters

MusicGen exposes several parameters that affect the output quality and creativity. Here’s a complete example with everything tuned.

from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

model = MusicGen.get_pretrained("facebook/musicgen-medium")

model.set_generation_params(
    duration=30,          # Length in seconds (max 30 for good quality)
    top_k=250,            # Sample from top 250 tokens (default 250)
    top_p=0.0,            # Nucleus sampling disabled by default
    temperature=1.0,      # Higher = more creative, lower = more predictable
    cfg_coef=3.0,         # Classifier-free guidance strength (default 3.0)
)

descriptions = [
    "ambient piano with soft strings, melancholic mood, slow tempo, cinematic",
    "aggressive metal guitar riff, double bass drums, fast tempo, 160 bpm",
]

# Generate a batch of 2 tracks at once
wav = model.generate(descriptions)

for i, one_wav in enumerate(wav):
    audio_write(f"track_{i}", one_wav.cpu(), model.sample_rate, strategy="loudness")
    print(f"Saved track_{i}.wav")

A few things worth knowing about these parameters:

  • duration: MusicGen handles up to 30 seconds well. Beyond that, quality degrades because the model was trained on 30-second segments. For longer tracks, generate segments and crossfade them.
  • temperature: Keep it between 0.8 and 1.2. Below 0.8 the output gets repetitive. Above 1.3 it starts to fall apart.
  • top_k: 250 is the sweet spot. Lower values (50-100) make the output safer but less interesting. Higher values (500+) introduce more randomness.
  • cfg_coef: Controls how closely the output follows your text description. 3.0 is the default. Push it to 5.0-8.0 if the output doesn’t match your prompt well enough, but going past 8 tends to introduce artifacts.
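The crossfade trick from the duration note can be sketched in a few lines of torch. This is an illustrative helper, not an AudioCraft API; it linearly fades the tail of one clip into the head of the next:

```python
import torch

def crossfade(a, b, sample_rate, fade_duration=1.0):
    """Join two clips shaped (channels, samples) with a linear crossfade."""
    n = int(fade_duration * sample_rate)
    fade_out = torch.linspace(1.0, 0.0, n, device=a.device)
    fade_in = torch.linspace(0.0, 1.0, n, device=a.device)
    # Sum the faded tail of `a` and the faded head of `b` over the overlap
    overlap = a[..., -n:] * fade_out + b[..., :n] * fade_in
    return torch.cat([a[..., :-n], overlap, b[..., n:]], dim=-1)

# Two 30s segments overlapped by 1s yield one 59s track.
```

Generate each segment from a prompt describing the same style so the seam is less audible.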

Use Melody Conditioning

This is one of MusicGen’s best features. You feed it a reference audio file and a text description, and it generates new music that follows the melody of the reference but in the style you describe.

import torchaudio
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

model = MusicGen.get_pretrained("facebook/musicgen-melody")
model.set_generation_params(duration=15)

# Load your reference melody
melody_waveform, sr = torchaudio.load("reference_melody.wav")

# If your file isn't at the model's sample rate, resample it
if sr != model.sample_rate:
    resampler = torchaudio.transforms.Resample(sr, model.sample_rate)
    melody_waveform = resampler(melody_waveform)

# Generate with melody conditioning
descriptions = ["jazz piano trio with upright bass and brushed drums, warm tone"]
wav = model.generate_with_chroma(
    descriptions=descriptions,
    melody_wavs=melody_waveform[None].to(model.device),  # Add batch dim, match model device
    melody_sample_rate=model.sample_rate,
    progress=True,
)

audio_write("jazz_reinterpretation", wav[0].cpu(), model.sample_rate, strategy="loudness")

The melody conditioning extracts chroma features (pitch content) from your reference and uses them as a guide. The rhythm and timbre come from the text prompt. This means you can take a pop melody and render it as orchestral, or take a guitar riff and turn it into an electronic synth line.
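To build intuition for what “pitch content” means here, the toy sketch below recovers the pitch class of a pure tone from its spectrum. This is far cruder than MusicGen’s actual chroma extraction, but it shows the idea: the conditioning keys on which notes are sounding, not on timbre.

```python
import math
import torch

# One second of A4 (440 Hz) at MusicGen's 32 kHz sample rate
sr = 32000
t = torch.arange(sr) / sr
wave = torch.sin(2 * math.pi * 440.0 * t)

# Find the dominant frequency bin, then fold it into a pitch class (0-11)
spectrum = torch.fft.rfft(wave).abs()
peak_hz = torch.argmax(spectrum).item() * sr / wave.shape[-1]
midi = round(69 + 12 * math.log2(peak_hz / 440.0))
print(midi % 12)  # 9, i.e. pitch class A
```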

Generate Sound Effects with AudioGen

AudioGen is AudioCraft’s model for non-music audio: sound effects, ambient noise, environmental sounds. The API mirrors MusicGen almost exactly.

from audiocraft.models import AudioGen
from audiocraft.data.audio import audio_write

model = AudioGen.get_pretrained("facebook/audiogen-medium")
model.set_generation_params(duration=5)

descriptions = [
    "thunderstorm with heavy rain and distant thunder",
    "footsteps walking on gravel path",
    "busy coffee shop with people talking and espresso machine",
]

wav = model.generate(descriptions)

for i, one_wav in enumerate(wav):
    audio_write(f"sfx_{i}", one_wav.cpu(), model.sample_rate, strategy="loudness")
    print(f"Saved sfx_{i}.wav")

AudioGen is trained on environmental audio, not music. If you ask it for music you’ll get something that sounds like music playing in another room. Use MusicGen for music and AudioGen for everything else.

Build a Simple Music Generation Script

Here’s a complete script that handles command-line arguments and warns you when no GPU is available.

import argparse
import torch
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

def generate_music(prompt, output, model_name="facebook/musicgen-medium",
                   duration=15, temperature=1.0, top_k=250):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    if device == "cpu":
        print("WARNING: Running on CPU. Generation will be very slow.")

    print(f"Loading {model_name}...")
    model = MusicGen.get_pretrained(model_name, device=device)
    model.set_generation_params(
        duration=duration,
        temperature=temperature,
        top_k=top_k,
    )

    print(f"Generating {duration}s of audio...")
    wav = model.generate([prompt], progress=True)

    audio_write(output, wav[0].cpu(), model.sample_rate, strategy="loudness")
    print(f"Saved {output}.wav")

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Generate music with MusicGen")
    parser.add_argument("prompt", help="Text description of the music")
    parser.add_argument("-o", "--output", default="generated", help="Output filename (without extension)")
    parser.add_argument("-m", "--model", default="facebook/musicgen-medium")
    parser.add_argument("-d", "--duration", type=int, default=15)
    parser.add_argument("-t", "--temperature", type=float, default=1.0)
    parser.add_argument("--top-k", type=int, default=250)
    args = parser.parse_args()

    generate_music(args.prompt, args.output, args.model,
                   args.duration, args.temperature, args.top_k)

Save that as musicgen_cli.py and run it:

python musicgen_cli.py "lo-fi hip hop beat with vinyl crackle and mellow piano" -o lofi_beat -d 20

Common Errors and Fixes

RuntimeError: CUDA out of memory

MusicGen medium needs about 8GB VRAM. If you’re tight on memory, drop to the small model or move to CPU (slow but works):

model = MusicGen.get_pretrained("facebook/musicgen-small", device="cpu")

You can also release cached allocations between generations (this frees unused cached blocks, not the loaded model weights):

import torch
torch.cuda.empty_cache()

ImportError: No module named 'audiocraft'

The package is published on PyPI as audiocraft, so first confirm you installed it into the same environment you’re running Python from. If the PyPI release still won’t install, install directly from the GitHub repo:

pip install git+https://github.com/facebookresearch/audiocraft.git

RuntimeError: Expected all tensors to be on the same device

This happens when the melody waveform is on CPU but the model is on GPU. Always move your input tensor to the model’s device:

melody_waveform = melody_waveform.cuda()
# Or more reliably:
melody_waveform = melody_waveform.to(model.device)

ValueError: audio_write got an unexpected keyword argument

Older versions of AudioCraft had a different audio_write signature. Update to the latest version:

pip install --upgrade audiocraft

Generated audio sounds like noise or static.

Check your temperature – anything above 1.5 produces garbage. Also make sure your prompt is descriptive enough. “music” is too vague. “upbeat jazz quartet with walking bass line, tenor sax lead, medium swing tempo” gives the model much more to work with.

Audio cuts off abruptly at the end.

MusicGen doesn’t add fadeouts automatically. Apply one yourself:

import torch

def apply_fadeout(wav, sample_rate, fade_duration=2.0):
    fade_samples = int(fade_duration * sample_rate)
    fade = torch.linspace(1.0, 0.0, fade_samples).to(wav.device)
    wav[..., -fade_samples:] *= fade
    return wav

wav = model.generate(["your prompt here"])
wav = apply_fadeout(wav[0], model.sample_rate)
audio_write("with_fadeout", wav.cpu(), model.sample_rate, strategy="loudness")

GPU Requirements

MusicGen runs on CPU but it’s painfully slow – a 30-second clip takes 10+ minutes on a modern CPU versus 30 seconds on a mid-range GPU. You want a GPU for anything beyond quick tests.

  • musicgen-small (300M): Any GPU with 4GB+ VRAM. An RTX 3060 handles it easily.
  • musicgen-medium (1.5B): 8GB VRAM minimum. RTX 3070 or better.
  • musicgen-large (3.3B): 16GB VRAM. RTX 4090 or an A100.
  • musicgen-melody (1.5B): Same as medium, about 8GB VRAM.
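Those thresholds can be checked programmatically. Here’s a hypothetical helper (pick_model is my name, not an AudioCraft API) that chooses a checkpoint based on detected VRAM:

```python
import torch

def pick_model():
    """Choose a MusicGen checkpoint for the detected hardware (sketch)."""
    if not torch.cuda.is_available():
        return "facebook/musicgen-small"  # CPU-only: keep it small
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
    if vram_gb >= 16:
        return "facebook/musicgen-large"
    if vram_gb >= 8:
        return "facebook/musicgen-medium"
    return "facebook/musicgen-small"

print(pick_model())
```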

All models use float32 by default. Unlike image diffusion models, MusicGen doesn’t currently support float16 inference out of the box without modifications. The VRAM numbers above reflect float32 usage.