OpenVoice V2 from MyShell.ai is one of the best open-source voice cloning models available right now. Give it a 10-second audio clip of any speaker and it produces a remarkably accurate clone. The key insight: it separates tone color (the identity of a voice) from style (speed, emotion, and accent), so you can mix and match the two freely.
Here is the minimal end-to-end pipeline. You generate base speech with MeloTTS, extract the speaker embedding from a reference clip, then apply tone color conversion to stamp the target voice onto the output.
That produces cloned_output.wav with the target speaker’s voice saying whatever text you passed in. The whole process takes about 2 seconds on a decent GPU.
Installation and Setup
OpenVoice is not on PyPI. You need to clone the repo and install from source. MeloTTS ships as a separate package.
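Something like the following works; the Hugging Face repo id for the checkpoints (`myshell-ai/OpenVoiceV2`) is an assumption, so check the OpenVoice README if the download fails:

```shell
git clone https://github.com/myshell-ai/OpenVoice.git
cd OpenVoice
pip install -e .

# MeloTTS ships separately
pip install git+https://github.com/myshell-ai/MeloTTS.git
python -m unidic download   # dictionary MeloTTS needs for Japanese text

# Fetch the V2 checkpoints from Hugging Face (repo id assumed)
huggingface-cli download myshell-ai/OpenVoiceV2 --local-dir checkpoints_v2
```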
You also need ffmpeg installed system-wide for audio processing. On Ubuntu:
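```shell
sudo apt-get update
sudo apt-get install -y ffmpeg
```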
On macOS:
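```shell
brew install ffmpeg
```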
The V2 checkpoints are about 200MB. They land in checkpoints_v2/converter/ with the config and model files that ToneColorConverter expects.
GPU vs CPU
OpenVoice runs on CPU but it is painfully slow. Stick with CUDA if you can. Pass device="cuda:0" when initializing both the TTS engine and the converter. For Apple Silicon, device="mps" works but expect 3-4x slower inference than CUDA.
Understanding the Pipeline
The architecture has two distinct stages, and understanding them helps you debug problems.
Stage 1: Base speech generation. MeloTTS generates a high-quality speech waveform from text. This is a standard neural TTS model. The output sounds like the built-in English speaker, not your target voice.
Stage 2: Tone color conversion. The converter takes the base waveform and transforms its speaker identity to match the reference clip. It does this by swapping speaker embeddings (SE vectors) while preserving the linguistic content.
This two-stage design is what makes OpenVoice flexible. You can swap in different base TTS languages, adjust speaking speed and emotion at stage 1, then apply any target voice at stage 2.
Supported Languages
MeloTTS supports English, Spanish, French, Chinese, Japanese, and Korean out of the box. Each has its own speaker IDs:
For English, the built-in speakers are EN-US, EN-BR, EN_INDIA, EN-AU, and EN-Default. Pick the one closest to the accent you want before tone color conversion. The converter handles identity transfer, but starting from a closer accent baseline gives cleaner results.
Processing Your Own Reference Audio
The reference clip quality matters a lot. Aim for 5-30 seconds of clean speech with minimal background noise. Longer is not always better since the SE extractor averages across the clip, so noisy sections drag down quality.
The vad parameter enables voice activity detection to strip silence. Turn it on if your reference clip has long pauses or leading/trailing silence. Turn it off for clean clips since VAD can occasionally clip the beginning of words.
Saving and Reusing Speaker Embeddings
Extracting the SE vector every time is wasteful. Save it once and reload:
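The SE is a plain PyTorch tensor, so `torch.save`/`torch.load` is all you need. The random tensor below is a stand-in for a real embedding from `se_extractor.get_se`, and the file name is illustrative:

```python
import torch

# Stand-in for an extracted speaker embedding
target_se = torch.randn(1, 256, 1)

# Save once (e.g. when the user uploads their voice sample)
torch.save(target_se, "alice_se.pt")

# Reload on later requests, skipping extraction entirely
reloaded = torch.load("alice_se.pt", map_location="cpu")
assert torch.equal(target_se, reloaded)
```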
This is particularly useful when building an application where users upload a voice sample once and then generate many clips. Extract on upload, store the tensor, and skip extraction on subsequent requests.
Batch Processing Multiple Texts
For generating multiple clips in the same voice, extract the SE vectors once and loop over your texts:
Adjust speed between 0.5 and 2.0 to control speaking rate. Values around 0.9-1.1 sound the most natural. Going below 0.7 introduces audible artifacts.
Common Errors and Fixes
FileNotFoundError: checkpoints_v2/converter/config.json
You either skipped the checkpoint download or cloned into a different directory. Verify the checkpoint structure:
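Assuming the standard V2 layout, a quick listing shows whether everything is in place:

```shell
ls checkpoints_v2/converter/
# expect: checkpoint.pth  config.json
ls checkpoints_v2/base_speakers/ses/
# expect one .pth file per base speaker (en-us.pth, es.pth, ...)
```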
Re-download with the Hugging Face snippet from the installation section if the files are missing.
RuntimeError: CUDA out of memory
OpenVoice is not memory-hungry but MeloTTS can spike on long texts. Split your input into chunks of 2-3 sentences. Generating in shorter segments and concatenating the final audio works well. You can concatenate with pydub:
ValueError: Audio is too short for voice activity detection
Your reference clip is under 1 second or nearly silent. Use a longer sample. If the audio is valid but quiet, normalize it first with ffmpeg:
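ffmpeg's `loudnorm` filter handles this; the file names are illustrative:

```shell
ffmpeg -i quiet_reference.wav -af loudnorm normalized_reference.wav
```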
ModuleNotFoundError: No module named 'melo'
MeloTTS was not installed. It is a separate package from OpenVoice:
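```shell
pip install git+https://github.com/myshell-ai/MeloTTS.git
python -m unidic download
```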
Cloned voice sounds robotic or distorted
This almost always means the reference audio has too much background noise or reverb. Record in a quiet room, or clean the audio with a noise reduction tool before extracting the SE. Even light denoising with noisereduce makes a meaningful difference:
Related Guides
- How to Build AI Image Upscaling with Real-ESRGAN and SwinIR
- How to Generate Music with Meta AudioCraft
- How to Clone Voices with OpenAI TTS and ElevenLabs API
- How to Build AI Clothing Try-On with Virtual Diffusion Models
- How to Generate Images with FLUX.2 in Python
- How to Build AI Architectural Rendering with ControlNet and Stable Diffusion
- How to Build AI Sketch-to-Image Generation with ControlNet Scribble
- How to Build AI Wallpaper Generation with Stable Diffusion and Tiling
- How to Generate Images with Stable Diffusion in Python
- How to Build AI Motion Graphics Generation with Deforum Stable Diffusion