OpenVoice V2 from MyShell.ai is one of the best open-source voice cloning models available right now. Give it a 10-second audio clip of any speaker and it produces a remarkably accurate clone. The key insight: it separates tone color (the identity of a voice) from style (speed, emotion, and accent), so you can mix and match the two freely.
Here is the minimal end-to-end pipeline. You generate base speech with MeloTTS, extract the speaker embedding from a reference clip, then apply tone color conversion to stamp the target voice onto the output.
That produces cloned_output.wav with the target speaker’s voice saying whatever text you passed in. The whole process takes about 2 seconds on a decent GPU.
Installation and Setup
OpenVoice is not on PyPI. You need to clone the repo and install from source. MeloTTS ships as a separate package.
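Something like the following works; the Hugging Face repo id for the checkpoints (`myshell-ai/OpenVoiceV2`) is an assumption, so check the OpenVoice README if the download fails:

```shell
git clone https://github.com/myshell-ai/OpenVoice.git
cd OpenVoice
pip install -e .

# MeloTTS ships separately
pip install git+https://github.com/myshell-ai/MeloTTS.git
python -m unidic download   # dictionary MeloTTS needs for Japanese text

# Fetch the V2 checkpoints from Hugging Face (repo id assumed)
huggingface-cli download myshell-ai/OpenVoiceV2 --local-dir checkpoints_v2
```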
You also need ffmpeg installed system-wide for audio processing. On Ubuntu:
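```shell
sudo apt-get update
sudo apt-get install -y ffmpeg
```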
On macOS:
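```shell
brew install ffmpeg
```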
The V2 checkpoints are about 200MB. They land in checkpoints_v2/converter/ with the config and model files that ToneColorConverter expects.
GPU vs CPU
OpenVoice runs on CPU but it is painfully slow. Stick with CUDA if you can. Pass device="cuda:0" when initializing both the TTS engine and the converter. For Apple Silicon, device="mps" works but expect 3-4x slower inference than CUDA.
Understanding the Pipeline
The architecture has two distinct stages, and understanding them helps you debug problems.
Stage 1: Base speech generation. MeloTTS generates a high-quality speech waveform from text. This is a standard neural TTS model. The output sounds like the built-in English speaker, not your target voice.
Stage 2: Tone color conversion. The converter takes the base waveform and transforms its speaker identity to match the reference clip. It does this by swapping speaker embeddings (SE vectors) while preserving the linguistic content.
This two-stage design is what makes OpenVoice flexible. You can swap in different base TTS languages, adjust speaking speed and emotion at stage 1, then apply any target voice at stage 2.
Supported Languages
MeloTTS supports English, Spanish, French, Chinese, Japanese, and Korean out of the box. Each has its own speaker IDs:
For English, the built-in speakers are EN-US, EN-BR, EN_INDIA, EN-AU, and EN-Default. Pick the one closest to the accent you want before tone color conversion. The converter handles identity transfer, but starting from a closer accent baseline gives cleaner results.
Processing Your Own Reference Audio
The reference clip quality matters a lot. Aim for 5-30 seconds of clean speech with minimal background noise. Longer is not always better since the SE extractor averages across the clip, so noisy sections drag down quality.
The vad parameter enables voice activity detection to strip silence. Turn it on if your reference clip has long pauses or leading/trailing silence. Turn it off for clean clips since VAD can occasionally clip the beginning of words.
Saving and Reusing Speaker Embeddings
Extracting the SE vector every time is wasteful. Save it once and reload:
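The SE is a plain PyTorch tensor, so `torch.save`/`torch.load` is all you need. The random tensor below is a stand-in for a real embedding from `se_extractor.get_se`, and the file name is illustrative:

```python
import torch

# Stand-in for an extracted speaker embedding
target_se = torch.randn(1, 256, 1)

# Save once (e.g. when the user uploads their voice sample)
torch.save(target_se, "alice_se.pt")

# Reload on later requests, skipping extraction entirely
reloaded = torch.load("alice_se.pt", map_location="cpu")
assert torch.equal(target_se, reloaded)
```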
This is particularly useful when building an application where users upload a voice sample once and then generate many clips. Extract on upload, store the tensor, and skip extraction on subsequent requests.
Batch Processing Multiple Texts
For generating multiple clips in the same voice, extract the SE vectors once and loop over your texts:
Adjust speed between 0.5 and 2.0 to control speaking rate. Values around 0.9-1.1 sound the most natural. Going below 0.7 introduces audible artifacts.
Common Errors and Fixes
FileNotFoundError: checkpoints_v2/converter/config.json
You either skipped the checkpoint download or cloned into a different directory. Verify the checkpoint structure:
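Assuming the standard V2 layout, a quick listing shows whether everything is in place:

```shell
ls checkpoints_v2/converter/
# expect: checkpoint.pth  config.json
ls checkpoints_v2/base_speakers/ses/
# expect one .pth file per base speaker (en-us.pth, es.pth, ...)
```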
Re-download with the Hugging Face snippet from the installation section if the files are missing.
RuntimeError: CUDA out of memory
OpenVoice is not memory-hungry but MeloTTS can spike on long texts. Split your input into chunks of 2-3 sentences. Generating in shorter segments and concatenating the final audio works well. You can concatenate with pydub:
ValueError: Audio is too short for voice activity detection
Your reference clip is under 1 second or nearly silent. Use a longer sample. If the audio is valid but quiet, normalize it first with ffmpeg:
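ffmpeg's `loudnorm` filter handles this; the file names are illustrative:

```shell
ffmpeg -i quiet_reference.wav -af loudnorm normalized_reference.wav
```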
ModuleNotFoundError: No module named 'melo'
MeloTTS was not installed. It is a separate package from OpenVoice:
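```shell
pip install git+https://github.com/myshell-ai/MeloTTS.git
python -m unidic download
```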
Cloned voice sounds robotic or distorted
This almost always means the reference audio has too much background noise or reverb. Record in a quiet room, or clean the audio with a noise reduction tool before extracting the SE. Even light denoising with noisereduce makes a meaningful difference:
Related Guides
- How to Build AI Image Upscaling with Real-ESRGAN and SwinIR
- How to Generate Music with Meta AudioCraft
- How to Clone Voices with OpenAI TTS and ElevenLabs API
- How to Build AI Clothing Try-On with Virtual Diffusion Models
- How to Generate Images with FLUX.2 in Python
- How to Build AI Architectural Rendering with ControlNet and Stable Diffusion
- How to Build AI Sketch-to-Image Generation with ControlNet Scribble
- How to Build AI Wallpaper Generation with Stable Diffusion and Tiling
- How to Generate Images with Stable Diffusion in Python
- How to Build AI Motion Graphics Generation with Deforum Stable Diffusion