The Full Pipeline

You give it a topic. It writes a two-person conversation. It speaks both parts with different voices. You get an MP3 file that sounds like a real podcast episode.

The stack is straightforward: OpenAI’s GPT-4o generates the dialogue script, OpenAI’s TTS API renders each line with a distinct voice, and pydub stitches the audio segments together with intro/outro music.

Install the dependencies first:

pip install openai pydub

You also need ffmpeg on your system. On Ubuntu: sudo apt install ffmpeg. On macOS: brew install ffmpeg.
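
Before generating any audio, it is worth confirming that pydub will actually find those binaries. A quick check (this helper is my own addition, not part of pydub):

```python
import shutil

def ffmpeg_available() -> bool:
    """pydub shells out to ffmpeg and ffprobe; both must be on the PATH."""
    return shutil.which("ffmpeg") is not None and shutil.which("ffprobe") is not None

print("ffmpeg found:", ffmpeg_available())
```

Running this up front turns the cryptic `ffprobe` FileNotFoundError (covered in the errors section below) into an obvious missing-dependency message.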

Generating the Dialogue Script

The trick to a good AI podcast is the system prompt. You need to tell the model exactly what kind of conversation you want – casual, back-and-forth, with real personality.

from openai import OpenAI
import json

client = OpenAI()

def generate_script(topic: str, num_exchanges: int = 8) -> list[dict]:
    """Generate a two-person podcast script on a given topic."""
    system_prompt = """You are a podcast script writer. Write a natural conversation
between two hosts: Alex and Sam.

Rules:
- Alex is enthusiastic and asks good follow-up questions.
- Sam is the expert who explains things clearly with analogies.
- Keep each line under 3 sentences. Podcast listeners lose focus on long monologues.
- Include natural filler like "Right", "Exactly", "That's a great point" sparingly.
- Make it sound like two friends talking, not a lecture.

Return a JSON array of objects with "speaker" and "text" fields.
Example: [{"speaker": "Alex", "text": "So tell me about..."}, {"speaker": "Sam", "text": "Sure, so basically..."}]
Return ONLY valid JSON. No markdown, no explanation."""

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {
                "role": "user",
                "content": f"Write a podcast conversation with {num_exchanges} exchanges per speaker about: {topic}",
            },
        ],
        temperature=0.9,
        max_tokens=4000,
    )

    script = json.loads(response.choices[0].message.content)
    return script


topic = "Why retrieval-augmented generation is replacing fine-tuning for most use cases"
script = generate_script(topic)

for line in script:
    print(f"[{line['speaker']}]: {line['text']}")

Setting temperature=0.9 keeps the dialogue varied and natural; at lower values the two hosts converge on the same phrasing and cadence and start to sound interchangeable.
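
Since the script comes back as untyped JSON, it pays to validate the shape before spending money on TTS calls. A small validation pass (my own helper, not part of the OpenAI SDK):

```python
def validate_script(script: object,
                    speakers: frozenset = frozenset({"Alex", "Sam"})) -> list[dict]:
    """Raise ValueError unless script is a non-empty list of {speaker, text} dicts."""
    if not isinstance(script, list) or not script:
        raise ValueError("script must be a non-empty list")
    for i, line in enumerate(script):
        if not isinstance(line, dict) or "speaker" not in line or "text" not in line:
            raise ValueError(f"line {i} is missing 'speaker' or 'text'")
        if line["speaker"] not in speakers:
            raise ValueError(f"line {i} has unexpected speaker {line['speaker']!r}")
        if not line["text"].strip():
            raise ValueError(f"line {i} has empty text")
    return script
```

Call `validate_script(script)` right after `json.loads` so a malformed response fails fast instead of surfacing as a KeyError halfway through audio generation.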

Converting Script Lines to Audio

OpenAI’s TTS API gives you six voices: alloy, echo, fable, nova, onyx, and shimmer. Assign a different voice to each speaker so listeners can tell them apart.

from pathlib import Path

VOICE_MAP = {
    "Alex": "nova",   # higher, energetic
    "Sam": "onyx",    # deeper, authoritative
}

OUTPUT_DIR = Path("segments")
OUTPUT_DIR.mkdir(exist_ok=True)


def generate_audio_segment(text: str, voice: str, output_path: Path) -> Path:
    """Convert a single line of dialogue to an MP3 file."""
    response = client.audio.speech.create(
        model="tts-1-hd",
        voice=voice,
        input=text,
        response_format="mp3",
    )
    # Note: newer openai SDK versions deprecate stream_to_file in favor of
    # client.audio.speech.with_streaming_response.create(...)
    response.stream_to_file(str(output_path))
    return output_path


# Generate all segments
segment_files = []
for i, line in enumerate(script):
    voice = VOICE_MAP[line["speaker"]]
    path = OUTPUT_DIR / f"segment_{i:03d}.mp3"
    generate_audio_segment(line["text"], voice, path)
    segment_files.append(path)
    print(f"Generated segment {i}: [{line['speaker']}] {line['text'][:50]}...")

Use tts-1-hd for podcast-quality audio. The standard tts-1 model is faster but noticeably lower quality – fine for previews, not for a finished episode.
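
One practical constraint: the TTS endpoint caps input at 4,096 characters per request. Dialogue lines rarely get near that, but if you ever feed it longer text (show notes, a closing monologue), split on sentence boundaries first. A rough splitter, assuming simple `.`/`?`/`!` sentence endings:

```python
import re

def split_for_tts(text: str, max_chars: int = 4096) -> list[str]:
    """Split text into chunks under max_chars, breaking at sentence boundaries."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    # A single sentence longer than max_chars is left as its own (oversized) chunk.
    return chunks
```

Generate one audio segment per chunk and they concatenate cleanly in the stitching step below.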

Stitching Audio with Pydub

Now combine every segment into one continuous file. Add short pauses between lines so it doesn’t sound like two robots speed-reading at each other.

from pydub import AudioSegment

def assemble_podcast(
    segment_files: list[Path],
    intro_music_path: str | None = None,
    outro_music_path: str | None = None,
    pause_ms: int = 600,
    music_volume_db: float = -15.0,
) -> AudioSegment:
    """Combine audio segments into a full podcast episode."""
    pause = AudioSegment.silent(duration=pause_ms)
    podcast = AudioSegment.empty()

    # Add intro music if provided
    if intro_music_path:
        intro = AudioSegment.from_file(intro_music_path)
        intro = intro + music_volume_db  # lower the music volume
        intro = intro[:8000]  # first 8 seconds only
        intro = intro.fade_out(2000)
        podcast += intro + pause

    # Concatenate all dialogue segments with pauses
    for seg_path in segment_files:
        segment = AudioSegment.from_mp3(str(seg_path))
        podcast += segment + pause

    # Add outro music if provided
    if outro_music_path:
        outro = AudioSegment.from_file(outro_music_path)
        outro = outro + music_volume_db
        outro = outro[:10000].fade_in(2000).fade_out(3000)
        podcast += outro

    return podcast


final_audio = assemble_podcast(segment_files)
final_audio.export("podcast_episode.mp3", format="mp3", bitrate="192k")
print(f"Exported podcast_episode.mp3 ({len(final_audio) / 1000:.1f}s)")

The pause_ms=600 gap between lines mimics natural conversational pacing. Increase it to 800-1000ms if the topic is dense and you want the listener to absorb each point.
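
A refinement worth trying: use a slightly longer pause when the speaker changes than between consecutive lines from the same speaker, which mimics real turn-taking. This helper only computes the pause schedule; the values and defaults are my own sketch, and you would wire them into assemble_podcast in place of the fixed pause:

```python
def pause_schedule(script: list[dict],
                   same_ms: int = 400, switch_ms: int = 750) -> list[int]:
    """Pause (ms) to insert after each line: longer when the next speaker differs."""
    pauses = []
    for current, nxt in zip(script, script[1:]):
        pauses.append(switch_ms if nxt["speaker"] != current["speaker"] else same_ms)
    pauses.append(0)  # no trailing pause after the final line
    return pauses
```

In assemble_podcast you would then build `AudioSegment.silent(duration=ms)` per line instead of a single shared pause.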

Adding Background Music

If you have intro/outro music files (royalty-free, obviously), pass them in:

final_audio = assemble_podcast(
    segment_files,
    intro_music_path="assets/intro_jingle.mp3",
    outro_music_path="assets/outro_jingle.mp3",
    pause_ms=700,
    music_volume_db=-18.0,
)
final_audio.export("podcast_episode.mp3", format="mp3", bitrate="192k")

The music_volume_db=-18.0 keeps music quiet enough that it doesn’t compete with speech. Adjust to taste.
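
For intuition, decibel gain maps to amplitude multiplicatively: a change of g dB scales amplitude by 10^(g/20), so -18 dB leaves the music at roughly 13% of its original amplitude:

```python
def db_to_amplitude_ratio(db: float) -> float:
    """Convert a decibel gain change to a linear amplitude ratio."""
    return 10 ** (db / 20)

print(f"{db_to_amplitude_ratio(-18.0):.3f}")  # ~0.126
```

This is why small dB tweaks matter: going from -15 to -18 dB cuts the music's amplitude by another ~30%.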

Putting It All Together

Here is the complete pipeline as a single callable function:

def create_podcast(topic: str, output_file: str = "podcast_episode.mp3") -> str:
    """Full pipeline: topic string in, MP3 file out."""
    print(f"Generating script for: {topic}")
    script = generate_script(topic, num_exchanges=8)
    print(f"Script has {len(script)} lines")

    segment_files = []
    OUTPUT_DIR.mkdir(exist_ok=True)
    for i, line in enumerate(script):
        voice = VOICE_MAP.get(line["speaker"], "alloy")
        path = OUTPUT_DIR / f"segment_{i:03d}.mp3"
        generate_audio_segment(line["text"], voice, path)
        segment_files.append(path)

    final_audio = assemble_podcast(segment_files, pause_ms=650)
    final_audio.export(output_file, format="mp3", bitrate="192k")

    # Clean up individual segments
    for f in segment_files:
        f.unlink()

    duration = len(final_audio) / 1000
    print(f"Done! {output_file} ({duration:.1f}s)")
    return output_file


create_podcast("The future of open-source AI models in 2026")

A typical 8-exchange script produces a 3-5 minute episode. Bump num_exchanges to 15-20 for a full 10-minute episode.
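
You can sanity-check episode length before paying for TTS: spoken English averages roughly 150 words per minute, so word count over 150 gives speech minutes, plus the pauses. A rough estimator (the 150 wpm figure is a common average, not something the TTS API guarantees):

```python
def estimate_minutes(script: list[dict],
                     words_per_minute: int = 150, pause_ms: int = 650) -> float:
    """Rough episode length: speech time plus inter-line pauses."""
    words = sum(len(line["text"].split()) for line in script)
    speech_min = words / words_per_minute
    pause_min = (len(script) - 1) * pause_ms / 60_000
    return speech_min + pause_min
```

If the estimate comes in short, bump num_exchanges and regenerate the script before touching the audio step.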

Tuning Voice Quality

A few things that make the output sound significantly better:

  • Use tts-1-hd over tts-1. The HD model has less static and more natural prosody. It costs twice as much but the difference is obvious.
  • Keep lines short. The TTS model handles 1-3 sentences much better than long paragraphs. Long inputs get monotone.
  • Vary sentence structure in the prompt. If every line starts with “Well,” the TTS will produce the same intonation pattern on repeat. The system prompt should encourage variety.
  • Match voice to personality. nova and shimmer sound younger and more upbeat. onyx and echo sound more authoritative. Pick voices that match the character’s role in the conversation.

Common Errors and Fixes

FileNotFoundError: [Errno 2] No such file or directory: 'ffprobe'

Pydub needs ffmpeg installed system-wide. Install it:

# Ubuntu/Debian
sudo apt install ffmpeg

# macOS
brew install ffmpeg

# Windows (with chocolatey)
choco install ffmpeg

json.JSONDecodeError when parsing the script

The LLM sometimes wraps JSON in markdown code fences. Strip them before parsing:

raw = response.choices[0].message.content
raw = raw.strip()
if raw.startswith("```"):
    raw = raw.split("\n", 1)[1]  # remove first line
    raw = raw.rsplit("```", 1)[0]  # remove last fence
script = json.loads(raw)

openai.RateLimitError during segment generation

You are making one API call per dialogue line. For long scripts, add a delay:

import time

for i, line in enumerate(script):
    voice = VOICE_MAP[line["speaker"]]
    path = OUTPUT_DIR / f"segment_{i:03d}.mp3"
    generate_audio_segment(line["text"], voice, path)
    segment_files.append(path)
    time.sleep(0.5)  # avoid rate limits
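
A fixed sleep helps, but retrying with exponential backoff is more robust because it only waits when the API actually pushes back. A generic sketch (the wrapper is my own; in real use pass `retry_on=openai.RateLimitError`):

```python
import time

def with_backoff(fn, *args, retries: int = 5, base_delay: float = 1.0,
                 retry_on: type[Exception] = Exception, **kwargs):
    """Call fn, retrying on retry_on with exponentially growing delays."""
    for attempt in range(retries):
        try:
            return fn(*args, **kwargs)
        except retry_on:
            if attempt == retries - 1:
                raise  # out of retries: surface the original error
            time.sleep(base_delay * (2 ** attempt))
```

Then wrap each segment call: `with_backoff(generate_audio_segment, line["text"], voice, path, retry_on=openai.RateLimitError)`.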

Audio segments have different volume levels

Normalize each segment before stitching:

from pydub.effects import normalize

segment = AudioSegment.from_mp3(str(seg_path))
segment = normalize(segment)
podcast += segment + pause

KeyError on speaker name not in VOICE_MAP

The LLM might use different names than expected. Use .get() with a fallback:

voice = VOICE_MAP.get(line["speaker"], "alloy")

This is already handled in the create_podcast function above, but make sure your standalone code does the same.

Cost Estimates

For reference, generating a 5-minute episode costs roughly:

  • Script generation: ~$0.01-0.03 (GPT-4o, ~2K tokens)
  • TTS-1-HD audio: ~$0.24-0.30 per episode (16 segments of ~100 words ≈ 9,600 characters at $30/1M characters)

That is under $0.35 per episode. Cheap enough to generate daily content.
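
Those numbers are easy to recompute for your own scripts: at the list price of $30 per 1M characters for tts-1-hd, cost is just total characters times the rate (prices change, so treat the default rate as an assumption):

```python
def estimate_tts_cost(script: list[dict],
                      usd_per_million_chars: float = 30.0) -> float:
    """Estimated tts-1-hd cost in USD for all dialogue lines in a script."""
    total_chars = sum(len(line["text"]) for line in script)
    return total_chars * usd_per_million_chars / 1_000_000

# Hypothetical 16-line script, ~600 characters per line
sample = [{"speaker": "Alex", "text": "x" * 600}] * 16
print(f"${estimate_tts_cost(sample):.2f}")  # ~$0.29
```

Run it on the actual parsed script before the audio loop if you want a cost readout per episode.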