OpenAI TTS: Steerable Speech in Five Lines
OpenAI’s gpt-4o-mini-tts model doesn’t do voice cloning in the traditional sense – you can’t upload a sample and replicate someone’s voice. What it does offer is steerable speech: you pick one of 13 built-in voices and control tone, pacing, and emotion through natural language instructions. That makes it the fastest way to get high-quality, customizable speech from an API.
The instructions parameter is what separates gpt-4o-mini-tts from the older tts-1 and tts-1-hd models. You can specify affect, accent hints, emotional delivery, and pacing – all in plain English. The older models don’t accept this parameter at all.
Available Voices and Models
OpenAI offers three TTS models:
| Model | Cost | Instructions | Best For |
|---|---|---|---|
| tts-1 | $15/1M chars | No | Fast previews, low latency |
| tts-1-hd | $30/1M chars | No | High-fidelity final output |
| gpt-4o-mini-tts | $12/1M output tokens | Yes | Steerable, expressive speech |
The 13 voices – alloy, ash, ballad, coral, echo, fable, nova, onyx, sage, shimmer, verse, marin, and cedar – each have a distinct character. coral and nova work well for conversational content. onyx sounds deeper and more authoritative. Try a few with the same text before committing.
ElevenLabs: Actual Voice Cloning
ElevenLabs is where you go for real voice cloning. Their Instant Voice Cloning (IVC) feature takes audio samples of a specific voice and creates a reusable voice ID you can pass to their TTS endpoint. You don’t need hours of training data – a few minutes of clean audio is enough.
Voice Settings Explained
The VoiceSettings object controls how the cloned voice behaves:
- stability (0.0-1.0): Higher values produce more consistent, predictable output. Lower values add expressiveness but can introduce artifacts. Start at 0.7.
- similarity_boost (0.0-1.0): How closely the output matches the original voice. Push this higher (0.75+) for cloned voices – that’s the whole point.
- style (0.0-1.0): Amplifies the voice’s stylistic characteristics. Keep this under 0.5 to avoid distortion.
- use_speaker_boost: Sharpens the voice profile. Leave this on for cloned voices.
Audio Sample Requirements
The quality of your clone depends almost entirely on the input. Bad samples produce bad clones.
- Record at 192 kbps or higher
- Aim for at least 60 seconds of continuous speech per file
- No background music, echo, or ambient noise
- One speaker only – no conversations
- Consistent microphone distance and volume
- MP3, WAV, or M4A formats work
Two to three high-quality samples will outperform ten noisy ones every time.
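Noise and reverb can only be judged by ear, but duration and bitrate are easy to verify before uploading. A stdlib-only helper of our own for uncompressed WAV files (MP3 bitrates would need a different check):

```python
import wave

def check_wav_sample(path: str) -> list[str]:
    """Return a list of problems with a WAV sample; empty means it looks usable."""
    problems = []
    with wave.open(path, "rb") as w:
        duration = w.getnframes() / w.getframerate()
        # PCM bitrate in kbps: sample rate * bytes per sample * 8 bits * channels
        bitrate_kbps = w.getframerate() * w.getsampwidth() * 8 * w.getnchannels() / 1000
        if duration < 60:
            problems.append(f"only {duration:.1f}s of audio; aim for 60s or more")
        if bitrate_kbps < 192:
            problems.append(f"bitrate {bitrate_kbps:.0f} kbps is below 192 kbps")
    return problems
```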
Common Errors and Fixes
OpenAI: InvalidRequestError: Instructions are not supported
This happens when you pass instructions to tts-1 or tts-1-hd. Only gpt-4o-mini-tts supports the instructions parameter.
ElevenLabs: voice_not_found or 401 Unauthorized
A 401 means your API key is wrong or expired. A voice_not_found error usually means the voice ID was deleted or belongs to a different account.
ElevenLabs: Poor Clone Quality
If your clone sounds robotic or nothing like the source, the issue is almost always the audio samples. Re-record with a condenser mic in a quiet room. Remove silence gaps and normalize volume before uploading. Even slight room reverb degrades the clone significantly.
Pricing Comparison
Both APIs charge per usage, but the pricing models differ:
| Provider | Model | Cost | Cloning |
|---|---|---|---|
| OpenAI | tts-1 | $15/1M chars | No |
| OpenAI | gpt-4o-mini-tts | $12/1M output tokens (≈$0.015/min) | No (steerable) |
| ElevenLabs | Starter plan | From $5/mo (30K chars) | Instant cloning |
| ElevenLabs | Pro plan | $99/mo (500K chars) | Instant + Professional |
ElevenLabs charges against a monthly character quota. Once you exceed it, overage rates kick in and they’re steep. Track your usage with client.user.get_subscription() to avoid surprises.
Voice cloning is not available on ElevenLabs’ free tier. You need at least the Starter plan for instant cloning.
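A sketch of quota tracking with the Python SDK (assumes ELEVENLABS_API_KEY is set; the 90% warning threshold is an arbitrary choice):

```python
def usage_fraction(used: int, limit: int) -> float:
    """Fraction of the monthly character quota already consumed."""
    return used / limit if limit else 1.0

if __name__ == "__main__":
    from elevenlabs.client import ElevenLabs

    client = ElevenLabs()  # reads ELEVENLABS_API_KEY from the environment
    sub = client.user.get_subscription()
    frac = usage_fraction(sub.character_count, sub.character_limit)
    print(f"{sub.character_count:,}/{sub.character_limit:,} characters ({frac:.0%})")
    if frac > 0.9:
        print("Warning: approaching quota; overage rates apply past the limit")
```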
Ethical and Legal Guardrails
Voice cloning carries real legal risk. Several US states now have laws specifically covering synthetic voice use. Tennessee’s ELVIS Act criminalizes unauthorized digital replication of a person’s voice, and proposed federal legislation such as the NO FAKES Act would require explicit consent for commercial use of a person’s cloned voice. Expect the rules to keep tightening.
Practical rules to follow:
- Only clone your own voice or voices you have documented, written consent to use
- Never clone a public figure’s voice for content that could be mistaken as real
- Disclose when audio is AI-generated, especially in commercial or public contexts
- Both OpenAI and ElevenLabs have usage policies that prohibit deceptive impersonation
- Keep consent documentation – scope, duration, and permitted uses should be explicit
Both platforms watermark or fingerprint their audio output to some degree, so synthetic speech can often be traced back to the generating account.
Related Guides
- How to Build Real-Time Voice Cloning with OpenVoice and Python
- How to Edit Images with AI Inpainting Using Stable Diffusion
- How to Generate Videos with Stable Video Diffusion
- How to Build AI Clothing Try-On with Virtual Diffusion Models
- How to Control Image Generation with ControlNet and IP-Adapter
- How to Generate Music with Meta AudioCraft
- How to Generate Images with FLUX.2 in Python
- How to Build AI Architectural Rendering with ControlNet and Stable Diffusion
- How to Generate and Edit Audio with Stable Audio and AudioLDM
- How to Build AI Sketch-to-Image Generation with ControlNet Scribble