Nobody reads meeting notes. But everybody needs them. The fix is an agent that watches your recordings, transcribes them with Whisper, and hands you a structured summary – action items, decisions, who said what – without you lifting a finger.
Here’s the full flow: audio file goes in, Whisper transcribes it, an LLM agent with tool calling extracts structured data, and you get a clean JSON summary out the other end. The whole thing runs in about 30 lines of core logic.
## Transcribing Audio with Whisper
Start with the transcription step. OpenAI’s Whisper API handles most audio formats (mp3, mp4, wav, m4a) up to 25 MB per request.
```python
from openai import OpenAI
from pathlib import Path

client = OpenAI()

def transcribe_audio(file_path: str) -> str:
    """Transcribe an audio file using OpenAI Whisper API."""
    audio_path = Path(file_path)
    if not audio_path.exists():
        raise FileNotFoundError(f"Audio file not found: {file_path}")
    with open(audio_path, "rb") as audio_file:
        transcript = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
            response_format="verbose_json",
            timestamp_granularities=["segment"],
        )
    return transcript.text

# Quick test
text = transcribe_audio("weekly-standup.mp3")
print(text[:500])
```
The verbose_json response format gives you segment-level timestamps, which is useful if you want to attribute quotes to specific parts of the meeting. For plain text output, swap to response_format="text".
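To actually use those segment timestamps, return the full verbose response (not just transcript.text) and iterate its segments. Here's a small sketch of a formatting helper — the function name is my own, and it assumes each segment exposes a "start" time in seconds and a "text" field, which matches the raw verbose_json payload:

```python
def format_segments(segments: list[dict]) -> str:
    """Render Whisper verbose_json segments as '[MM:SS] text' lines."""
    lines = []
    for seg in segments:
        minutes, seconds = divmod(int(seg["start"]), 60)
        lines.append(f"[{minutes:02d}:{seconds:02d}] {seg['text'].strip()}")
    return "\n".join(lines)
```

Feeding the timestamped version to the summarizer lets the model cite when something was said, not just what.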
For files larger than 25 MB, split them first with pydub:
```python
from pydub import AudioSegment

def split_audio(file_path: str, chunk_minutes: int = 10) -> list[str]:
    """Split audio into chunks under the 25 MB API limit."""
    audio = AudioSegment.from_file(file_path)
    chunk_ms = chunk_minutes * 60 * 1000
    chunks = []
    for i, start in enumerate(range(0, len(audio), chunk_ms)):
        chunk = audio[start : start + chunk_ms]
        chunk_path = f"/tmp/chunk_{i}.mp3"
        chunk.export(chunk_path, format="mp3", bitrate="128k")
        chunks.append(chunk_path)
    return chunks
```
The agent uses two tools: one for transcription and one for structured summarization. The LLM decides which tools to call and in what order. This matters because it lets you extend the agent later – add a tool to email the summary, file a Jira ticket, or update a Notion page – without rewriting the orchestration logic.
Define the tools:
```python
import json

tools = [
    {
        "type": "function",
        "function": {
            "name": "transcribe_meeting",
            "description": "Transcribe a meeting audio file to text using Whisper",
            "parameters": {
                "type": "object",
                "properties": {
                    "file_path": {
                        "type": "string",
                        "description": "Path to the audio file (mp3, wav, m4a, mp4)",
                    }
                },
                "required": ["file_path"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "extract_meeting_summary",
            "description": "Extract structured summary from meeting transcript including action items, decisions, and key topics",
            "parameters": {
                "type": "object",
                "properties": {
                    "transcript": {
                        "type": "string",
                        "description": "The full meeting transcript text",
                    }
                },
                "required": ["transcript"],
            },
        },
    },
]
```
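The extensibility claim is concrete: adding a capability is just another entry in this list plus a handler. As an illustration, here's what a hypothetical email tool definition could look like — the name email_summary and its fields are mine, not part of the pipeline above, and you'd still need to wire a handler to a real email backend:

```python
# Hypothetical extra tool entry -- illustrative only, not part of the
# article's pipeline. A matching handler would call your email backend.
email_tool = {
    "type": "function",
    "function": {
        "name": "email_summary",
        "description": "Email the meeting summary to a recipient",
        "parameters": {
            "type": "object",
            "properties": {
                "to": {
                    "type": "string",
                    "description": "Recipient email address",
                },
                "summary_json": {
                    "type": "string",
                    "description": "The JSON summary to send",
                },
            },
            "required": ["to", "summary_json"],
        },
    },
}
```

Append it to the tools list, register a handler for "email_summary", and the agent can decide on its own when to send the summary — the orchestration loop doesn't change.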
Now wire up the agent loop. This handles multiple rounds of tool calls so the agent can transcribe first, then summarize:
```python
def run_agent(audio_file: str) -> str:
    """Run the meeting summarization agent."""
    messages = [
        {
            "role": "system",
            "content": (
                "You are a meeting summarization agent. When given an audio file, "
                "first transcribe it, then extract a structured summary with action items, "
                "decisions, and key discussion topics. Always call the tools in order: "
                "transcribe first, then extract the summary."
            ),
        },
        {
            "role": "user",
            "content": f"Summarize the meeting recorded in: {audio_file}",
        },
    ]
    while True:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            tools=tools,
            tool_choice="auto",
        )
        msg = response.choices[0].message
        messages.append(msg)
        if msg.tool_calls is None:
            return msg.content
        for tool_call in msg.tool_calls:
            args = json.loads(tool_call.function.arguments)
            name = tool_call.function.name
            if name == "transcribe_meeting":
                result = transcribe_audio(args["file_path"])
            elif name == "extract_meeting_summary":
                result = extract_structured_summary(args["transcript"])
            else:
                result = json.dumps({"error": f"Unknown tool: {name}"})
            messages.append(
                {
                    "role": "tool",
                    "tool_call_id": tool_call.id,
                    "content": result if isinstance(result, str) else json.dumps(result),
                }
            )
```
Raw summaries are fine for reading, but you probably want machine-readable output for downstream automation. Use Pydantic models to define the schema and parse the LLM’s response:
```python
from pydantic import BaseModel

class ActionItem(BaseModel):
    description: str
    assignee: str
    deadline: str | None = None

class Decision(BaseModel):
    description: str
    made_by: str

class MeetingSummary(BaseModel):
    title: str
    date: str
    attendees: list[str]
    key_topics: list[str]
    action_items: list[ActionItem]
    decisions: list[Decision]
    follow_up_date: str | None = None
```
The extract_structured_summary function sends the transcript to the LLM with strict instructions to return JSON matching the schema:
```python
def extract_structured_summary(transcript: str) -> str:
    """Extract structured meeting data from a transcript."""
    schema_json = MeetingSummary.model_json_schema()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "Extract a structured summary from this meeting transcript. "
                    "Return ONLY valid JSON matching this schema:\n"
                    f"{json.dumps(schema_json, indent=2)}\n\n"
                    "Rules:\n"
                    "- List every action item with a clear assignee\n"
                    "- Capture all decisions that were made\n"
                    "- Identify attendees from the conversation\n"
                    "- Keep topic descriptions concise"
                ),
            },
            {"role": "user", "content": transcript},
        ],
        response_format={"type": "json_object"},
        temperature=0.1,
    )
    raw_json = response.choices[0].message.content
    summary = MeetingSummary.model_validate_json(raw_json)
    return summary.model_dump_json(indent=2)
```
Setting temperature=0.1 keeps the output close to deterministic – LLM sampling is never fully repeatable, but a low temperature minimizes variation between runs. The response_format={"type": "json_object"} flag forces the model to output valid JSON – no markdown fences, no preamble. Note that JSON mode requires the word "JSON" to appear somewhere in your messages, which the system prompt above satisfies.
## Complete Pipeline
Here’s everything wired together as a single script you can run from the command line:
```python
#!/usr/bin/env python3
"""Meeting summarization agent: audio -> transcript -> structured summary."""
import json
import sys
from pathlib import Path

from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

class ActionItem(BaseModel):
    description: str
    assignee: str
    deadline: str | None = None

class Decision(BaseModel):
    description: str
    made_by: str

class MeetingSummary(BaseModel):
    title: str
    date: str
    attendees: list[str]
    key_topics: list[str]
    action_items: list[ActionItem]
    decisions: list[Decision]
    follow_up_date: str | None = None

def transcribe_audio(file_path: str) -> str:
    audio_path = Path(file_path)
    if not audio_path.exists():
        raise FileNotFoundError(f"Audio file not found: {file_path}")
    with open(audio_path, "rb") as f:
        transcript = client.audio.transcriptions.create(
            model="whisper-1", file=f, response_format="text"
        )
    return transcript

def extract_structured_summary(transcript: str) -> str:
    schema_json = MeetingSummary.model_json_schema()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "Extract a structured summary from this meeting transcript. "
                    "Return ONLY valid JSON matching this schema:\n"
                    f"{json.dumps(schema_json, indent=2)}"
                ),
            },
            {"role": "user", "content": transcript},
        ],
        response_format={"type": "json_object"},
        temperature=0.1,
    )
    raw_json = response.choices[0].message.content
    summary = MeetingSummary.model_validate_json(raw_json)
    return summary.model_dump_json(indent=2)

tools = [
    {
        "type": "function",
        "function": {
            "name": "transcribe_meeting",
            "description": "Transcribe a meeting audio file to text",
            "parameters": {
                "type": "object",
                "properties": {
                    "file_path": {"type": "string", "description": "Path to audio file"}
                },
                "required": ["file_path"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "extract_meeting_summary",
            "description": "Extract structured summary from transcript",
            "parameters": {
                "type": "object",
                "properties": {
                    "transcript": {"type": "string", "description": "Meeting transcript"}
                },
                "required": ["transcript"],
            },
        },
    },
]

TOOL_MAP = {
    "transcribe_meeting": lambda args: transcribe_audio(args["file_path"]),
    "extract_meeting_summary": lambda args: extract_structured_summary(args["transcript"]),
}

def run_agent(audio_file: str) -> str:
    messages = [
        {
            "role": "system",
            "content": (
                "You are a meeting summarization agent. Transcribe the audio first, "
                "then extract a structured summary with action items and decisions."
            ),
        },
        {"role": "user", "content": f"Summarize the meeting: {audio_file}"},
    ]
    for _ in range(5):  # max iterations to prevent infinite loops
        response = client.chat.completions.create(
            model="gpt-4o", messages=messages, tools=tools, tool_choice="auto"
        )
        msg = response.choices[0].message
        messages.append(msg)
        if not msg.tool_calls:
            return msg.content
        for tool_call in msg.tool_calls:
            args = json.loads(tool_call.function.arguments)
            handler = TOOL_MAP.get(tool_call.function.name)
            if handler is None:
                result = f"Unknown tool: {tool_call.function.name}"
            else:
                result = handler(args)
            messages.append(
                {
                    "role": "tool",
                    "tool_call_id": tool_call.id,
                    "content": result if isinstance(result, str) else json.dumps(result),
                }
            )
    return "Agent did not produce a final response within iteration limit."

if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python meeting_agent.py <audio_file>")
        sys.exit(1)
    result = run_agent(sys.argv[1])
    print(result)
```
Save this as meeting_agent.py, set your OPENAI_API_KEY environment variable, and run it:
```bash
pip install openai pydantic pydub
export OPENAI_API_KEY="sk-..."
python meeting_agent.py weekly-standup.mp3
```
The agent calls transcribe_meeting first, gets the text back, then calls extract_meeting_summary with the transcript. You get a JSON summary with attendees, action items, decisions, and topics.
## Improving Accuracy for Long Meetings
Meetings over 30 minutes produce long transcripts that can cause the LLM to miss details. Two techniques that help:
Chunked summarization – split the transcript into segments, summarize each one, then do a final merge pass:
```python
def chunked_summary(transcript: str, chunk_size: int = 4000) -> str:
    """Summarize long transcripts in chunks, then merge."""
    words = transcript.split()
    chunks = [
        " ".join(words[i : i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]
    partial_summaries = []
    for i, chunk in enumerate(chunks):
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {
                    "role": "system",
                    "content": (
                        f"Summarize this section ({i+1}/{len(chunks)}) of a meeting transcript. "
                        "List action items, decisions, and key topics discussed."
                    ),
                },
                {"role": "user", "content": chunk},
            ],
            temperature=0.1,
        )
        partial_summaries.append(response.choices[0].message.content)
    # Merge pass
    combined = "\n\n---\n\n".join(partial_summaries)
    return extract_structured_summary(combined)
```
Speaker context prompt – Whisper doesn't perform true speaker diarization, but if your audio has multiple speakers, a prompt hint helps it spell names correctly and keep context:
```python
transcript = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file,
    response_format="verbose_json",
    prompt="Meeting between Alice, Bob, and Carol. They discuss project timelines.",
)
```
This biases the model toward the right names and vocabulary, which improves the downstream summary quality.
## Common Errors and Fixes
openai.BadRequestError: Invalid file format
Whisper supports mp3, mp4, mpeg, mpga, m4a, wav, and webm. If you have an unsupported format, convert it first:
```bash
ffmpeg -i meeting.ogg -ar 16000 -ac 1 meeting.mp3
```
openai.BadRequestError: Maximum content size limit (26214400) exceeded
Your file is over 25 MB. Use the split_audio function from above, transcribe each chunk, and concatenate the results.
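A minimal sketch of that stitch-together step — the split and transcribe functions are passed in as callables, so it composes with the split_audio and transcribe_audio defined earlier (the function name transcribe_long is mine):

```python
def transcribe_long(file_path: str, split_fn, transcribe_fn) -> str:
    """Split an oversized file, transcribe each chunk, join the pieces."""
    chunk_paths = split_fn(file_path)
    parts = [transcribe_fn(path) for path in chunk_paths]
    return " ".join(part.strip() for part in parts)
```

One caveat: words at chunk boundaries can get clipped or duplicated, so for critical recordings consider splitting on silence (pydub's silence detection) rather than at fixed intervals.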
pydantic.ValidationError: field required
The LLM returned JSON that doesn’t match your schema. This happens with shorter transcripts where the model can’t find certain fields. Add default values to your Pydantic models (like deadline: str | None = None) and make your system prompt more explicit about returning empty lists instead of omitting fields.
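A sketch of what those defaults look like — this is a trimmed-down model for illustration, not the full schema from earlier:

```python
from pydantic import BaseModel, Field

class MeetingSummary(BaseModel):
    # Defaults mean missing fields validate instead of raising ValidationError
    title: str = "Untitled meeting"
    attendees: list[str] = Field(default_factory=list)
    action_items: list[str] = Field(default_factory=list)
```

With defaults in place, a sparse response like {"title": "Standup"} still parses, and downstream code can check for empty lists instead of catching exceptions.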
json.decoder.JSONDecodeError
You’re not using response_format={"type": "json_object"}. Without it, the model might wrap JSON in markdown fences or add explanatory text. Always set the response format when you expect JSON output.
Tool call loop / agent never finishes
The agent keeps calling tools without producing a final response. The iteration cap (for _ in range(5)) prevents infinite loops. If the agent legitimately needs more steps, increase the limit. Also check that your tool results are being appended with the correct tool_call_id – mismatched IDs cause the model to retry.
openai.RateLimitError on Whisper API
Whisper has lower rate limits than chat completions. Add retry logic with exponential backoff, or batch your transcription requests with a short delay between them.
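A minimal hand-rolled backoff wrapper as a sketch — in practice you'd narrow the except clause to openai.RateLimitError; it's broad here only to keep the example self-contained:

```python
import random
import time

def with_backoff(fn, max_retries: int = 5, base_delay: float = 1.0):
    """Call fn, retrying on failure with exponential backoff plus jitter."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:  # narrow to openai.RateLimitError in real code
            if attempt == max_retries - 1:
                raise  # out of retries: surface the original error
            # 1s, 2s, 4s, ... plus jitter to avoid synchronized retries
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))
```

Usage would look like `with_backoff(lambda: transcribe_audio("standup.mp3"))`. For production use, a library like tenacity gives you the same pattern with less code to maintain.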