Every LLM API call carries overhead: system prompt tokens, connection latency, rate limit budget. If you’re making 50 individual calls to classify 50 items, you’re paying for that system prompt 50 times. Batching multiple tasks into a single call eliminates redundant tokens and can cut costs by 40-60% on prompt-heavy workloads.

Here’s the simplest version. Instead of one call per task, pack them together with clear delimiters:

from openai import OpenAI

client = OpenAI()

# 10 product reviews to classify
reviews = [
    "Battery dies after 2 hours. Terrible.",
    "Best purchase I've made this year!",
    "It works fine. Nothing special.",
    "Arrived broken, customer service ghosted me.",
    "Exceeded expectations, great build quality.",
    "Okay for the price, but feels cheap.",
    "Absolute garbage. Returned immediately.",
    "Solid product, fast shipping.",
    "Not worth the hype but decent.",
    "Five stars, would recommend to anyone.",
]

# Batch all 10 into one call
numbered_reviews = "\n".join(
    f"[{i+1}] {review}" for i, review in enumerate(reviews)
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "system",
            "content": (
                "Classify each numbered review as POSITIVE, NEGATIVE, or NEUTRAL. "
                "Return one classification per line in the format: [N] LABEL"
            ),
        },
        {"role": "user", "content": numbered_reviews},
    ],
    temperature=0,
)

print(response.choices[0].message.content)

Output:

[1] NEGATIVE
[2] POSITIVE
[3] NEUTRAL
[4] NEGATIVE
[5] POSITIVE
[6] NEUTRAL
[7] NEGATIVE
[8] POSITIVE
[9] NEUTRAL
[10] POSITIVE

That’s 10 classifications for the cost of one API call. The system prompt tokens get shared across all items, and the per-item overhead is just the review text plus a line number.
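To use those labels downstream, you still need to parse them back into Python. A minimal sketch (the `parse_batch_labels` helper is illustrative, and a real pipeline may want stricter validation of the model's output):

```python
def parse_batch_labels(text: str, n_items: int) -> dict[int, str]:
    """Map item numbers to labels from '[N] LABEL' lines, skipping malformed lines."""
    labels = {}
    for line in text.strip().splitlines():
        line = line.strip()
        if line.startswith("[") and "]" in line:
            num_part, _, label = line.partition("]")
            try:
                n = int(num_part.lstrip("["))
            except ValueError:
                continue  # not a '[N]' line; ignore it
            if 1 <= n <= n_items:
                labels[n] = label.strip()
    return labels

sample = "[1] NEGATIVE\n[2] POSITIVE\n[3] NEUTRAL"
print(parse_batch_labels(sample, 10))  # {1: 'NEGATIVE', 2: 'POSITIVE', 3: 'NEUTRAL'}
```

Returning a dict keyed by item number (rather than a plain list) means a skipped or out-of-order answer doesn't silently shift every later label onto the wrong review.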

Structured Outputs for Reliable Batch Parsing

Plain text works for simple labels, but it gets fragile fast. If you’re extracting entities, scores, or multi-field results, use function calling with the tools parameter to force structured JSON back.

import json
from openai import OpenAI

client = OpenAI()

emails = [
    "Hi, I'd like to cancel my subscription for account #4821. Thanks, Maria",
    "URGENT: Server is down in us-east-1, need immediate escalation. - James",
    "Can you update my billing address to 123 Oak St, Denver CO? - Alex",
]

numbered_emails = "\n---\n".join(
    f"[Email {i+1}]\n{email}" for i, email in enumerate(emails)
)

batch_tool = {
    "type": "function",
    "function": {
        "name": "classify_emails",
        "description": "Classify a batch of customer emails",
        "parameters": {
            "type": "object",
            "properties": {
                "classifications": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "email_number": {"type": "integer"},
                            "category": {
                                "type": "string",
                                "enum": [
                                    "cancellation",
                                    "technical_issue",
                                    "billing",
                                    "general_inquiry",
                                ],
                            },
                            "priority": {
                                "type": "string",
                                "enum": ["low", "medium", "high"],
                            },
                            "summary": {"type": "string"},
                        },
                        "required": [
                            "email_number",
                            "category",
                            "priority",
                            "summary",
                        ],
                    },
                }
            },
            "required": ["classifications"],
        },
    },
}

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "system",
            "content": "Classify each email in the batch. Extract category, priority, and a one-line summary.",
        },
        {"role": "user", "content": numbered_emails},
    ],
    tools=[batch_tool],
    tool_choice={"type": "function", "function": {"name": "classify_emails"}},
    temperature=0,
)

results = json.loads(response.choices[0].message.tool_calls[0].function.arguments)

for item in results["classifications"]:
    print(
        f"Email {item['email_number']}: "
        f"{item['category']} ({item['priority']}) - {item['summary']}"
    )

Forcing the tool call with tool_choice means the response comes back as JSON matching your schema: no regex parsing, no hoping the model follows your format instructions. Each item in the array maps cleanly back to a source email.

Cost and Latency: Batched vs. Individual Calls

The math is straightforward. Say your system prompt is 500 tokens and each task adds 50 tokens of input. For 20 tasks:

Approach                 Input tokens                API calls   Approx cost (GPT-4o)
Individual (20 calls)    20 x (500 + 50) = 11,000    20          ~$0.028
Batched (1 call)         500 + (20 x 50) = 1,500     1           ~$0.004

That’s roughly a 7x cost reduction. The savings scale linearly with system prompt size and batch count. Output tokens stay about the same either way, but input tokens drop dramatically.
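The arithmetic is easy to sanity-check in code. A quick sketch (the per-million-token price below is an illustrative assumption, not an official figure; check current pricing before relying on it):

```python
# Input-token cost comparison: batched vs. individual calls.
# PRICE_PER_INPUT_TOKEN is an assumed, approximate GPT-4o input price.
PRICE_PER_INPUT_TOKEN = 2.50 / 1_000_000

def input_cost(system_tokens: int, item_tokens: int, n_items: int, batched: bool) -> float:
    """Input-token cost for n_items tasks, sent as one batched call or individually."""
    if batched:
        total = system_tokens + n_items * item_tokens        # prompt paid once
    else:
        total = n_items * (system_tokens + item_tokens)      # prompt paid per task
    return total * PRICE_PER_INPUT_TOKEN

individual = input_cost(500, 50, 20, batched=False)  # 11,000 input tokens
batched = input_cost(500, 50, 20, batched=True)      # 1,500 input tokens
print(f"individual=${individual:.4f} batched=${batched:.4f} ratio={individual / batched:.1f}x")
```

Plugging in your own system prompt and item sizes shows quickly whether batching is worth the extra parsing complexity for a given workload.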

Latency is more nuanced. A single batched call takes longer than any individual call – the model has to process everything at once. But total wall-clock time is almost always lower because you’re not making 20 sequential round trips. If you’re already parallelizing individual calls with asyncio, the latency advantage shrinks, but the cost advantage stays.

The sweet spot is 5-20 items per batch for classification and extraction tasks. Beyond 20, you start hitting context window pressure and quality degradation.
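Capping batch size is just list slicing. A trivial sketch (the size of 10 is an arbitrary midpoint of that 5-20 range):

```python
def batches(items: list, size: int = 10):
    """Yield consecutive fixed-size slices of items; the last slice may be shorter."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

chunks = list(batches(list(range(47)), size=10))
print([len(c) for c in chunks])  # [10, 10, 10, 10, 7]
```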

Async Batching with the OpenAI Batch API

For truly large workloads – thousands of requests – the OpenAI Batch API processes them asynchronously at a 50% discount. You submit a JSONL file, and results come back within 24 hours.

import json
import time
from openai import OpenAI

client = OpenAI()

# Build a batch of 100 classification requests
tasks = [
    f"Review {i}: This product is {'great' if i % 2 == 0 else 'terrible'}."
    for i in range(100)
]

# Write JSONL input file
requests = []
for i, task in enumerate(tasks):
    requests.append({
        "custom_id": f"task-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o",
            "messages": [
                {"role": "system", "content": "Classify as POSITIVE or NEGATIVE. Reply with just the label."},
                {"role": "user", "content": task},
            ],
            "temperature": 0,
        },
    })

with open("batch_input.jsonl", "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")

# Upload and submit
input_file = client.files.create(file=open("batch_input.jsonl", "rb"), purpose="batch")

batch = client.batches.create(
    input_file_id=input_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)

print(f"Batch ID: {batch.id}")
print(f"Status: {batch.status}")

# Poll for completion
while batch.status not in ("completed", "failed", "expired"):
    time.sleep(30)
    batch = client.batches.retrieve(batch.id)
    print(f"Status: {batch.status} ({batch.request_counts.completed}/{batch.request_counts.total})")

# Download results
if batch.status == "completed":
    output = client.files.content(batch.output_file_id)
    results = [json.loads(line) for line in output.text.strip().split("\n")]

    for result in results[:5]:
        content = result["response"]["body"]["choices"][0]["message"]["content"]
        print(f"{result['custom_id']}: {content}")

The Batch API is not for interactive use cases. Response times range from minutes to hours. But at half price with no rate limit pressure, it’s the right choice for bulk classification, embedding generation, or data enrichment jobs.

You can combine both strategies: use in-prompt batching (multiple items per request) in your JSONL entries to get the 50% Batch API discount plus shared system prompt savings. That stacks.
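A sketch of that combination: each JSONL entry carries a mini-batch of numbered items instead of a single one (the group size of 10 and the prompt wording are illustrative choices):

```python
import json

items = [f"Review {i}: sample text" for i in range(100)]
GROUP_SIZE = 10  # items packed into each Batch API request (illustrative)

requests = []
for g, start in enumerate(range(0, len(items), GROUP_SIZE)):
    group = items[start:start + GROUP_SIZE]
    numbered = "\n".join(f"[{i+1}] {item}" for i, item in enumerate(group))
    requests.append({
        "custom_id": f"group-{g}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o",
            "messages": [
                {"role": "system", "content": "Classify each numbered review as POSITIVE or NEGATIVE. One per line: [N] LABEL"},
                {"role": "user", "content": numbered},
            ],
            "temperature": 0,
        },
    })

with open("batch_grouped.jsonl", "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")

print(len(requests))  # 10 entries instead of 100
```

One hundred items become ten Batch API requests, each sharing its system prompt across ten reviews; map results back via the `custom_id` plus the `[N]` numbering inside each group.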

When Batching Hurts Quality

Batching is not free. The model has a finite attention budget, and cramming 50 tasks into one prompt can degrade results. Here’s when to watch out:

  • Complex reasoning tasks. If each item requires multi-step thinking, the model tends to take shortcuts on later items. Classification is fine. Chain-of-thought analysis is not.
  • Order effects. Models give better answers to items near the beginning and end of a batch. Items in the middle get less attention. Shuffle your batch order across runs if you need consistent quality.
  • Cross-contamination. The model might let context from one item leak into its answer for another. If item 3 is about dogs and item 4 is about cats, the cat answer might reference dogs.
  • Long batches. Past 15-20 items, accuracy drops measurably on most tasks. Test this with your specific workload.

To detect quality degradation, run a simple A/B comparison:

from openai import OpenAI

client = OpenAI()

test_items = [
    "The restaurant was noisy but the food was excellent.",
    "Worst meal I've ever had. Cold pasta, rude waiter.",
    "Average experience. Nothing memorable.",
    "Incredible atmosphere and perfect steak.",
    "Food poisoning the next day. Never again.",
]

# Individual calls (ground truth)
individual_results = []
for item in test_items:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Classify sentiment as POSITIVE, NEGATIVE, or NEUTRAL. Reply with just the label."},
            {"role": "user", "content": item},
        ],
        temperature=0,
    )
    individual_results.append(resp.choices[0].message.content.strip())

# Batched call
numbered = "\n".join(f"[{i+1}] {item}" for i, item in enumerate(test_items))
batch_resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Classify each numbered review as POSITIVE, NEGATIVE, or NEUTRAL. Return one per line: [N] LABEL"},
        {"role": "user", "content": numbered},
    ],
    temperature=0,
)

batch_lines = batch_resp.choices[0].message.content.strip().split("\n")
batch_results = [line.split("] ")[-1].strip() for line in batch_lines]

# Compare
mismatches = 0
for i, (ind, bat) in enumerate(zip(individual_results, batch_results)):
    match = "OK" if ind == bat else "MISMATCH"
    if ind != bat:
        mismatches += 1
    print(f"Item {i+1}: individual={ind}, batched={bat} [{match}]")

accuracy = (len(test_items) - mismatches) / len(test_items) * 100
print(f"\nBatch agreement: {accuracy:.0f}%")

If batch agreement drops below 95% on your task, reduce batch size or switch to individual calls for that particular workload. For simple classification, you’ll typically see 98-100% agreement up to batch sizes of 15-20.

Common Errors and Fixes

Batch API returns failed status. Check batch.errors for details. The most common cause is malformed JSONL – each line must be valid JSON with the exact fields custom_id, method, url, and body. Validate your file before uploading:

import json

with open("batch_input.jsonl") as f:
    for i, line in enumerate(f):
        try:
            obj = json.loads(line)
            assert all(k in obj for k in ("custom_id", "method", "url", "body"))
        except (json.JSONDecodeError, AssertionError) as e:
            print(f"Line {i+1} is invalid: {e}")

Model returns fewer items than you sent. This happens when your batch is too large or the delimiter format is ambiguous. Always use unique, consistent delimiters like [N] numbering. If items are long, add a clear separator like --- between them.
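A defensive sketch for that failure mode (the `missing_items` helper is illustrative): collect the item numbers that actually came back and re-send only the gaps, individually or as a smaller batch, rather than retrying everything.

```python
import re

def missing_items(response_text: str, n_items: int) -> list[int]:
    """Return item numbers the model skipped, based on leading '[N]' markers."""
    seen = {int(m.group(1)) for m in re.finditer(r"^\[(\d+)\]", response_text, re.MULTILINE)}
    return [n for n in range(1, n_items + 1) if n not in seen]

# The model returned only 4 of 5 answers:
text = "[1] POSITIVE\n[2] NEGATIVE\n[3] NEUTRAL\n[5] POSITIVE"
print(missing_items(text, 5))  # [4]
```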

Tool call arguments fail JSON parsing. Occasionally the model produces slightly malformed JSON even with tools. Wrap your json.loads() in a try/except and retry with temperature=0 and a smaller batch. Setting temperature=0 makes this extremely rare.
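One way to structure that retry (a sketch: `call_model` is a placeholder for whatever function issues the request and returns the raw tool-call arguments string):

```python
import json

def parse_tool_args_with_retry(call_model, max_retries: int = 2):
    """Parse tool-call arguments as JSON, re-invoking call_model on parse failure."""
    for attempt in range(max_retries + 1):
        raw = call_model()
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            if attempt == max_retries:
                raise  # give up after the final attempt

# Example with a flaky source that returns truncated JSON once:
responses = iter(['{"oops": ', '{"classifications": []}'])
print(parse_tool_args_with_retry(lambda: next(responses)))  # {'classifications': []}
```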

Token limit exceeded on batched prompt. Sum your system prompt tokens plus all item tokens before sending. If you’re close to the model’s context window, split into multiple smaller batches. A simple approach:

import tiktoken

def chunk_items(items, max_tokens=3000, model="gpt-4o"):
    """Split items into chunks that fit within token budget."""
    enc = tiktoken.encoding_for_model(model)
    chunks, current_chunk, current_tokens = [], [], 0
    for item in items:
        item_tokens = len(enc.encode(item))
        if current_tokens + item_tokens > max_tokens and current_chunk:
            chunks.append(current_chunk)
            current_chunk, current_tokens = [], 0
        current_chunk.append(item)
        current_tokens += item_tokens
    if current_chunk:
        chunks.append(current_chunk)
    return chunks

Rate limit errors during polling. When polling the Batch API status, use 30-60 second intervals. Polling every second will get you rate-limited on the API management endpoints.