OpenAI TTS: Steerable Speech in Five Lines
OpenAI’s gpt-4o-mini-tts model doesn’t do voice cloning in the traditional sense – you can’t upload a sample and replicate someone’s voice. What it does offer is steerable speech: you pick one of 13 built-in voices and control tone, pacing, and emotion through natural language instructions. That makes it the fastest way to get high-quality, customizable speech from an API.
The instructions parameter is what separates gpt-4o-mini-tts from the older tts-1 and tts-1-hd models. You can specify affect, accent hints, emotional delivery, and pacing – all in plain English. The older models don’t accept this parameter at all.
Available Voices and Models
OpenAI offers three TTS models:
| Model | Cost | Instructions | Best For |
|---|---|---|---|
| tts-1 | $15/1M chars | No | Fast previews, low latency |
| tts-1-hd | $30/1M chars | No | High-fidelity final output |
| gpt-4o-mini-tts | $12/1M output tokens | Yes | Steerable, expressive speech |
The 13 voices – alloy, ash, ballad, coral, echo, fable, nova, onyx, sage, shimmer, verse, marin, and cedar – each have a distinct character. coral and nova work well for conversational content. onyx sounds deeper and more authoritative. Try a few with the same text before committing.
ElevenLabs: Actual Voice Cloning
ElevenLabs is where you go for real voice cloning. Their Instant Voice Cloning (IVC) feature takes audio samples of a specific voice and creates a reusable voice ID you can pass to their TTS endpoint. You don’t need hours of training data – a few minutes of clean audio is enough.
Voice Settings Explained
The VoiceSettings object controls how the cloned voice behaves:
- stability (0.0-1.0): Higher values produce more consistent, predictable output. Lower values add expressiveness but can introduce artifacts. Start at 0.7.
- similarity_boost (0.0-1.0): How closely the output matches the original voice. Push this higher (0.75+) for cloned voices – that’s the whole point.
- style (0.0-1.0): Amplifies the voice’s stylistic characteristics. Keep this under 0.5 to avoid distortion.
- use_speaker_boost: Sharpens the voice profile. Leave this on for cloned voices.
Audio Sample Requirements
The quality of your clone depends almost entirely on the input. Bad samples produce bad clones.
- Record at 192 kbps or higher
- Aim for at least 60 seconds of continuous speech per file
- No background music, echo, or ambient noise
- One speaker only – no conversations
- Consistent microphone distance and volume
- MP3, WAV, or M4A formats work
Two to three high-quality samples will outperform ten noisy ones every time.
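Noise and reverb can only be judged by ear, but duration and bitrate are easy to verify before uploading. A stdlib-only helper of our own for uncompressed WAV files (MP3 bitrates would need a different check):

```python
import wave

def check_wav_sample(path: str) -> list[str]:
    """Return a list of problems with a WAV sample; empty means it looks usable."""
    problems = []
    with wave.open(path, "rb") as w:
        duration = w.getnframes() / w.getframerate()
        # PCM bitrate in kbps: sample rate * bytes per sample * 8 bits * channels
        bitrate_kbps = w.getframerate() * w.getsampwidth() * 8 * w.getnchannels() / 1000
        if duration < 60:
            problems.append(f"only {duration:.1f}s of audio; aim for 60s or more")
        if bitrate_kbps < 192:
            problems.append(f"bitrate {bitrate_kbps:.0f} kbps is below 192 kbps")
    return problems
```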
Common Errors and Fixes
OpenAI: InvalidRequestError: Instructions are not supported
This happens when you pass instructions to tts-1 or tts-1-hd. Only gpt-4o-mini-tts supports the instructions parameter.
ElevenLabs: voice_not_found or 401 Unauthorized
A 401 means your API key is wrong or expired. A voice_not_found error usually means the voice ID was deleted or belongs to a different account.
ElevenLabs: Poor Clone Quality
If your clone sounds robotic or nothing like the source, the issue is almost always the audio samples. Re-record with a condenser mic in a quiet room. Remove silence gaps and normalize volume before uploading. Even slight room reverb degrades the clone significantly.
Pricing Comparison
Both APIs charge per usage, but the pricing models differ:
| Provider | Model | Cost | Cloning |
|---|---|---|---|
| OpenAI | tts-1 | $15/1M chars | No |
| OpenAI | gpt-4o-mini-tts | $12/1M output tokens (≈$0.015/min) | No (steerable) |
| ElevenLabs | Starter plan | From $5/mo (30K chars) | Instant cloning |
| ElevenLabs | Pro plan | $99/mo (500K chars) | Instant + Professional |
ElevenLabs charges against a monthly character quota. Once you exceed it, overage rates kick in and they’re steep. Track your usage with client.user.get_subscription() to avoid surprises.
Voice cloning is not available on ElevenLabs’ free tier. You need at least the Starter plan for instant cloning.
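A sketch of quota tracking with the Python SDK (assumes ELEVENLABS_API_KEY is set; the 90% warning threshold is an arbitrary choice):

```python
def usage_fraction(used: int, limit: int) -> float:
    """Fraction of the monthly character quota already consumed."""
    return used / limit if limit else 1.0

if __name__ == "__main__":
    from elevenlabs.client import ElevenLabs

    client = ElevenLabs()  # reads ELEVENLABS_API_KEY from the environment
    sub = client.user.get_subscription()
    frac = usage_fraction(sub.character_count, sub.character_limit)
    print(f"{sub.character_count:,}/{sub.character_limit:,} characters ({frac:.0%})")
    if frac > 0.9:
        print("Warning: approaching quota; overage rates apply past the limit")
```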
Ethical and Legal Guardrails
Voice cloning carries real legal risk. Several US states now have laws specifically covering synthetic voice use. Tennessee’s ELVIS Act criminalizes unauthorized digital replication of a person’s voice, and proposed federal legislation such as the NO FAKES Act would require explicit consent for commercial use of a person’s cloned voice. Expect the rules to keep tightening.
Practical rules to follow:
- Only clone your own voice or voices you have documented, written consent to use
- Never clone a public figure’s voice for content that could be mistaken as real
- Disclose when audio is AI-generated, especially in commercial or public contexts
- Both OpenAI and ElevenLabs have usage policies that prohibit deceptive impersonation
- Keep consent documentation – scope, duration, and permitted uses should be explicit
Both platforms watermark or fingerprint their audio output to some degree, so synthetic speech can often be traced back to the generating account.
Related Guides
- How to Build Real-Time Voice Cloning with OpenVoice and Python
- How to Edit Images with AI Inpainting Using Stable Diffusion
- How to Generate Videos with Stable Video Diffusion
- How to Build AI Clothing Try-On with Virtual Diffusion Models
- How to Control Image Generation with ControlNet and IP-Adapter
- How to Generate Music with Meta AudioCraft
- How to Generate Images with FLUX.2 in Python
- How to Build AI Architectural Rendering with ControlNet and Stable Diffusion
- How to Generate and Edit Audio with Stable Audio and AudioLDM
- How to Build AI Sketch-to-Image Generation with ControlNet Scribble