OpenAI TTS: Steerable Speech in Five Lines

OpenAI’s gpt-4o-mini-tts model doesn’t do voice cloning in the traditional sense – you can’t upload a sample and replicate someone’s voice. What it does offer is steerable speech: you pick one of 13 built-in voices and control tone, pacing, and emotion through natural language instructions. That makes it the fastest way to get high-quality, customizable speech from an API.

from openai import OpenAI
from pathlib import Path

client = OpenAI()  # uses OPENAI_API_KEY env var

response = client.audio.speech.create(
    model="gpt-4o-mini-tts",
    voice="coral",
    input="The quarterly results exceeded expectations across every metric.",
    instructions="Speak in a calm, measured tone like a seasoned news anchor. "
                 "Pause briefly after key numbers. Keep the pace steady.",
)

output_path = Path("quarterly_report.mp3")
response.stream_to_file(output_path)
print(f"Saved to {output_path}")

The instructions parameter is what separates gpt-4o-mini-tts from the older tts-1 and tts-1-hd models. You can specify affect, accent hints, emotional delivery, and pacing – all in plain English. The older models don't support the parameter at all, and passing it to them fails with an error (see Common Errors below).

Available Voices and Models

OpenAI offers three TTS models:

| Model           | Cost                 | Instructions | Best For                     |
| --------------- | -------------------- | ------------ | ---------------------------- |
| tts-1           | $15/1M chars         | No           | Fast previews, low latency   |
| tts-1-hd        | $30/1M chars         | No           | High-fidelity final output   |
| gpt-4o-mini-tts | $12/1M output tokens | Yes          | Steerable, expressive speech |

The 13 voices – alloy, ash, ballad, coral, echo, fable, nova, onyx, sage, shimmer, verse, marin, and cedar – each have a distinct character. coral and nova work well for conversational content. onyx sounds deeper and more authoritative. Try a few with the same text before committing.

ElevenLabs: Actual Voice Cloning

ElevenLabs is where you go for real voice cloning. Their Instant Voice Cloning (IVC) feature takes audio samples of a specific voice and creates a reusable voice ID you can pass to their TTS endpoint. You don’t need hours of training data – a few minutes of clean audio is enough.

pip install elevenlabs python-dotenv
import os
from dotenv import load_dotenv
from elevenlabs.client import ElevenLabs
from elevenlabs import VoiceSettings

load_dotenv()
client = ElevenLabs(api_key=os.getenv("ELEVENLABS_API_KEY"))

# Step 1: Create the cloned voice from audio samples
# (pass open file handles -- the SDK expects file-like objects, not path strings)
with open("./samples/sample_01.mp3", "rb") as s1, open("./samples/sample_02.mp3", "rb") as s2:
    voice = client.voices.ivc.create(
        name="Alex Narrator",
        description="Male, mid-30s, American English, calm and authoritative",
        files=[s1, s2],
    )
print(f"Created voice: {voice.voice_id}")

# Step 2: Generate speech with the cloned voice
audio = client.text_to_speech.convert(
    text="This audio was generated using a cloned voice via the ElevenLabs API.",
    voice_id=voice.voice_id,
    model_id="eleven_multilingual_v2",
    output_format="mp3_44100_128",
    voice_settings=VoiceSettings(
        stability=0.7,
        similarity_boost=0.75,
        style=0.3,
        use_speaker_boost=True,
    ),
)

with open("cloned_output.mp3", "wb") as f:
    for chunk in audio:
        if chunk:
            f.write(chunk)

print("Saved cloned_output.mp3")

Voice Settings Explained

The VoiceSettings object controls how the cloned voice behaves:

  • stability (0.0-1.0): Higher values produce more consistent, predictable output. Lower values add expressiveness but can introduce artifacts. Start at 0.7.
  • similarity_boost (0.0-1.0): How closely the output matches the original voice. Push this higher (0.75+) for cloned voices – that’s the whole point.
  • style (0.0-1.0): Amplifies the voice’s stylistic characteristics. Keep this under 0.5 to avoid distortion.
  • use_speaker_boost: Sharpens the voice profile. Leave this on for cloned voices.
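The guidance above can be captured as reusable presets. The dicts below are illustrative starting points, not official defaults – unpack one into `VoiceSettings(**NARRATION)` when calling `convert`:

```python
# Illustrative presets based on the guidance above -- not official defaults.
NARRATION = {
    "stability": 0.7,          # consistent, predictable delivery
    "similarity_boost": 0.75,  # stay close to the source voice
    "style": 0.3,              # modest stylistic coloring
    "use_speaker_boost": True,
}

EXPRESSIVE = {
    "stability": 0.45,         # looser, livelier -- more artifact risk
    "similarity_boost": 0.8,
    "style": 0.45,             # stays under the 0.5 distortion threshold
    "use_speaker_boost": True,
}

# Usage: voice_settings=VoiceSettings(**NARRATION)
```

Keeping presets as plain dicts makes them easy to log, diff, and A/B test without touching the call site.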

Audio Sample Requirements

The quality of your clone depends almost entirely on the input. Bad samples produce bad clones.

  • Record at 192 kbps or higher
  • Aim for at least 60 seconds of continuous speech per file
  • No background music, echo, or ambient noise
  • One speaker only – no conversations
  • Consistent microphone distance and volume
  • MP3, WAV, or M4A formats work

Two to three high-quality samples will outperform ten noisy ones every time.
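For WAV samples, the duration rule is easy to enforce before uploading. A stdlib-only sanity check – it validates length but obviously can't detect noise or reverb:

```python
import wave

MIN_SECONDS = 60  # per the guideline above

def check_wav_sample(path: str) -> list[str]:
    """Return a list of problems with a WAV sample (empty = passes the duration check)."""
    problems = []
    with wave.open(path, "rb") as w:
        seconds = w.getnframes() / w.getframerate()
        if seconds < MIN_SECONDS:
            problems.append(f"only {seconds:.1f}s of audio (want >= {MIN_SECONDS}s)")
    return problems
```

Run it over every file in your samples directory before creating the voice; a rejected upload after the fact wastes a round trip.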

Common Errors and Fixes

OpenAI: BadRequestError: Instructions are not supported

This happens when you pass instructions to tts-1 or tts-1-hd. Only gpt-4o-mini-tts supports the instructions parameter. (Pre-1.0 versions of the openai SDK surfaced this as InvalidRequestError.)

# Wrong -- tts-1 does not accept instructions
response = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input="Hello world",
    instructions="Speak softly",  # rejected with a 400 error
)

# Fix: switch to gpt-4o-mini-tts
response = client.audio.speech.create(
    model="gpt-4o-mini-tts",
    voice="alloy",
    input="Hello world",
    instructions="Speak softly",
)

ElevenLabs: voice_not_found or 401 Unauthorized

A 401 means your API key is wrong or expired. A voice_not_found error usually means the voice ID was deleted or belongs to a different account.

# Verify your key works
try:
    voices = client.voices.get_all()
    print(f"Found {len(voices.voices)} voices")
except Exception as e:
    print(f"Auth failed: {e}")
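To rule out the deleted-voice case specifically, check whether the ID is still visible to your key. A small helper, assuming the `voices.get_all()` response shape used above:

```python
def voice_exists(client, voice_id: str) -> bool:
    """True if the voice ID is visible to this API key."""
    voices = client.voices.get_all()
    return any(v.voice_id == voice_id for v in voices.voices)
```

If this returns False for an ID you created, the voice was deleted or lives under a different account – regenerate it rather than retrying the TTS call.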

ElevenLabs: Poor Clone Quality

If your clone sounds robotic or nothing like the source, the issue is almost always the audio samples. Re-record with a condenser mic in a quiet room. Remove silence gaps and normalize volume before uploading. Even slight room reverb degrades the clone significantly.
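For the volume-normalization step, here is a stdlib-only sketch that peak-normalizes 16-bit PCM WAV files; for MP3 or M4A sources you would need a third-party library such as pydub instead:

```python
import array
import wave

def normalize_wav(src: str, dst: str, target_peak: float = 0.95) -> None:
    """Peak-normalize a 16-bit PCM WAV so its loudest sample hits target_peak of full scale."""
    with wave.open(src, "rb") as w:
        params = w.getparams()
        if params.sampwidth != 2:
            raise ValueError("expects 16-bit PCM")
        samples = array.array("h", w.readframes(params.nframes))
    peak = max(abs(s) for s in samples) or 1  # avoid dividing by zero on silence
    scale = (target_peak * 32767) / peak
    scaled = array.array(
        "h", (max(-32768, min(32767, int(s * scale))) for s in samples)
    )
    with wave.open(dst, "wb") as w:
        w.setparams(params)
        w.writeframes(scaled.tobytes())
```

Peak normalization only equalizes levels across files; it won't remove reverb or noise, which is why the recording environment still matters most.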

Pricing Comparison

Both APIs charge per usage, but the models differ:

| Provider   | Model / Plan    | Cost                   | Cloning                |
| ---------- | --------------- | ---------------------- | ---------------------- |
| OpenAI     | tts-1           | $15/1M chars           | No                     |
| OpenAI     | gpt-4o-mini-tts | ~$0.015/min output     | No (steerable)         |
| ElevenLabs | Starter plan    | From $5/mo (30K chars) | Instant cloning        |
| ElevenLabs | Pro plan        | $99/mo (500K chars)    | Instant + Professional |

ElevenLabs bills against a monthly character quota. Once you exceed it, overage rates kick in, and they're steep. Track your usage with client.user.get_subscription() to avoid surprises.
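The quota check can be a one-liner helper. This assumes the `character_count` and `character_limit` fields on the subscription response:

```python
def remaining_characters(client) -> int:
    """Characters left in the current ElevenLabs billing cycle."""
    sub = client.user.get_subscription()
    return sub.character_limit - sub.character_count
```

Call it before large batch jobs and bail out (or downgrade to a shorter script) when the remainder is lower than the job's character count.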

Voice cloning is not available on ElevenLabs’ free tier. You need at least the Starter plan for instant cloning.

Legal and Ethical Considerations

Voice cloning carries real legal risk. Several US states now have laws specifically covering synthetic voice use – Tennessee's ELVIS Act, for example, creates liability for unauthorized digital replication of a person's voice – and federal proposals such as the NO FAKES Act would extend similar protections nationwide. Wherever legislation lands, treat explicit written consent as a hard requirement for any commercial use of a cloned voice.

Practical rules to follow:

  • Only clone your own voice or voices you have documented, written consent to use
  • Never clone a public figure’s voice for content that could be mistaken as real
  • Disclose when audio is AI-generated, especially in commercial or public contexts
  • Both OpenAI and ElevenLabs have usage policies that prohibit deceptive impersonation
  • Keep consent documentation – scope, duration, and permitted uses should be explicit

ElevenLabs provides an AI speech classifier that can identify audio generated by its own models, and provider-side generation logs mean synthetic speech can often be traced back to the account that produced it. Assume anything you generate is attributable.