How to Use the Anthropic PDF Processing API for Document Analysis

Claude can read PDFs natively. No OCR libraries, no text extraction pipelines, no preprocessing. You send a base64-encoded PDF in a message, and Claude processes the pages directly – text, tables, charts, and layout included. This works through the standard Messages API with a document content block.

Here’s how to set it up and put it to work.

Sending a PDF to Claude

Install the SDK and set your API key:

pip install anthropic
export ANTHROPIC_API_KEY="your-api-key-here"

The core pattern is straightforward. Read the PDF, base64-encode it, and pass it as a document content block alongside your text prompt.

import anthropic
import base64

client = anthropic.Anthropic()

with open("report.pdf", "rb") as f:
    pdf_data = base64.standard_b64encode(f.read()).decode("utf-8")

message = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "document",
                    "source": {
                        "type": "base64",
                        "media_type": "application/pdf",
                        "data": pdf_data,
                    },
                },
                {
                    "type": "text",
                    "text": "Summarize this document in 3 bullet points.",
                },
            ],
        }
    ],
)

print(message.content[0].text)

For inline uploads, source.type must be base64 and media_type must be application/pdf. Base64 is the only encoding accepted for inline file data. You can also reference a publicly accessible PDF by URL, using a source of type url, but base64 is the most reliable approach for local files.
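
The two source types can be sketched as a pair of small helpers. The function names and the example URL are hypothetical, chosen only for illustration; the block shapes follow the document content block described above.

```python
import base64


def pdf_block_from_bytes(raw: bytes) -> dict:
    """Build an inline (base64) document block from raw PDF bytes."""
    return {
        "type": "document",
        "source": {
            "type": "base64",
            "media_type": "application/pdf",
            "data": base64.standard_b64encode(raw).decode("utf-8"),
        },
    }


def pdf_block_from_url(url: str) -> dict:
    """Build a document block that points at a publicly reachable PDF."""
    return {"type": "document", "source": {"type": "url", "url": url}}


# Hypothetical URL, used only to show the shape of the block.
block = pdf_block_from_url("https://example.com/report.pdf")
```

Either block can take the place of the document entry in the content list of the request shown earlier.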

Extracting Structured Data from Documents

The real power shows up when you need structured output. Invoices, receipts, tax forms – Claude can pull fields out of messy layouts and return clean JSON.

import anthropic
import base64
import json

client = anthropic.Anthropic()

with open("invoice.pdf", "rb") as f:
    pdf_data = base64.standard_b64encode(f.read()).decode("utf-8")

extraction_prompt = """Extract the following fields from this invoice as JSON:
- invoice_number
- date
- vendor_name
- vendor_address
- line_items (array of {description, quantity, unit_price, total})
- subtotal
- tax
- total_due

Return ONLY valid JSON, no markdown formatting."""

message = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=2048,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "document",
                    "source": {
                        "type": "base64",
                        "media_type": "application/pdf",
                        "data": pdf_data,
                    },
                },
                {
                    "type": "text",
                    "text": extraction_prompt,
                },
            ],
        }
    ],
)

invoice_data = json.loads(message.content[0].text)
print(json.dumps(invoice_data, indent=2))

A few tips for reliable extraction. Be explicit about the schema you want. List every field name. Specify array structures. Tell Claude to return only JSON – otherwise you’ll get markdown code fences wrapped around it, which breaks json.loads.
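
Even with an explicit instruction, a defensive parser is cheap insurance. The sketch below is one possible approach, not part of the SDK: parse_json_response is a hypothetical helper that strips a single surrounding markdown code fence, if present, before calling json.loads.

```python
import json
import re

FENCE = "`" * 3  # a markdown code fence, built this way to keep the source readable

# Matches a reply that is entirely wrapped in one fence, with an optional "json" tag.
_FENCED = re.compile(
    FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE,
    flags=re.DOTALL | re.IGNORECASE,
)


def parse_json_response(text: str):
    """Parse JSON from a model reply, tolerating a surrounding markdown code fence."""
    cleaned = text.strip()
    match = _FENCED.fullmatch(cleaned)
    if match:
        cleaned = match.group(1)
    return json.loads(cleaned)
```

In the invoice example, this would be called as parse_json_response(message.content[0].text) in place of json.loads.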

For forms with checkboxes or handwritten entries, Claude handles those too. It reads the visual layout, not just embedded text. So even scanned PDFs with no selectable text layer work well.

Multi-Page Summarization and Comparison

Long documents work the same way. Claude processes all pages in the PDF, so a 50-page contract gets the same treatment as a single-page memo.

import anthropic
import base64

client = anthropic.Anthropic()

with open("contract.pdf", "rb") as f:
    pdf_data = base64.standard_b64encode(f.read()).decode("utf-8")

message = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=4096,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "document",
                    "source": {
                        "type": "base64",
                        "media_type": "application/pdf",
                        "data": pdf_data,
                    },
                },
                {
                    "type": "text",
                    "text": "Provide a section-by-section summary of this contract. For each section, list the key obligations and any deadlines mentioned.",
                },
            ],
        }
    ],
)

print(message.content[0].text)

You can also compare two PDFs in a single request. Send both as separate document blocks in the same message:

import anthropic
import base64

client = anthropic.Anthropic()

with open("contract_v1.pdf", "rb") as f:
    pdf_v1 = base64.standard_b64encode(f.read()).decode("utf-8")

with open("contract_v2.pdf", "rb") as f:
    pdf_v2 = base64.standard_b64encode(f.read()).decode("utf-8")

message = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=4096,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "document",
                    "source": {
                        "type": "base64",
                        "media_type": "application/pdf",
                        "data": pdf_v1,
                    },
                },
                {
                    "type": "document",
                    "source": {
                        "type": "base64",
                        "media_type": "application/pdf",
                        "data": pdf_v2,
                    },
                },
                {
                    "type": "text",
                    "text": "Compare these two contract versions. List every change: added clauses, removed clauses, and modified terms. Format as a table with columns: Section, Change Type, Details.",
                },
            ],
        }
    ],
)

print(message.content[0].text)

This is useful for redline reviews, policy updates, or tracking changes between document revisions.
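
When comparing more than two files, it may help to label each one so the response can refer to documents by name. The helper below is a sketch under that assumption: build_comparison_content is a hypothetical function that places a short text block before each document block and the prompt at the end.

```python
import base64


def build_comparison_content(documents: list[tuple[str, bytes]], prompt: str) -> list[dict]:
    """Return a content list: a text label before each PDF, then the prompt."""
    content: list[dict] = []
    for label, raw in documents:
        # A plain text block naming the document that follows.
        content.append({"type": "text", "text": f"Document: {label}"})
        content.append({
            "type": "document",
            "source": {
                "type": "base64",
                "media_type": "application/pdf",
                "data": base64.standard_b64encode(raw).decode("utf-8"),
            },
        })
    content.append({"type": "text", "text": prompt})
    return content
```

The returned list is passed as the content of a single user message, exactly as in the two-document request above.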

Building a Batch PDF Processing Pipeline

When you have a folder of PDFs to process, wrap the core pattern in a loop. Here’s a pipeline that extracts key metadata from every PDF in a directory and writes the results to a CSV.

import anthropic
import base64
import json
import csv
from pathlib import Path

client = anthropic.Anthropic()

pdf_dir = Path("./documents")
output_file = "extracted_metadata.csv"

extraction_prompt = """Extract these fields from the document as JSON:
- document_type (e.g., invoice, receipt, contract, report)
- date (ISO format if found, null otherwise)
- primary_entity (company or person name)
- total_amount (if applicable, null otherwise)
- page_count_estimate (your best guess)
- one_line_summary

Return ONLY valid JSON."""

results = []

for pdf_path in sorted(pdf_dir.glob("*.pdf")):
    print(f"Processing: {pdf_path.name}")

    with open(pdf_path, "rb") as f:
        pdf_data = base64.standard_b64encode(f.read()).decode("utf-8")

    try:
        message = client.messages.create(
            model="claude-sonnet-4-5-20250929",
            max_tokens=1024,
            messages=[
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "document",
                            "source": {
                                "type": "base64",
                                "media_type": "application/pdf",
                                "data": pdf_data,
                            },
                        },
                        {
                            "type": "text",
                            "text": extraction_prompt,
                        },
                    ],
                }
            ],
        )

        data = json.loads(message.content[0].text)
        data["filename"] = pdf_path.name
        results.append(data)

    except anthropic.BadRequestError as e:
        print(f"  Skipped {pdf_path.name}: {e}")
    except json.JSONDecodeError:
        print(f"  Failed to parse JSON for {pdf_path.name}")

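if results:
    # Use a fixed column list: keys the model adds are ignored and keys it
    # omits are left blank, so one unusual response cannot break the CSV.
    fieldnames = [
        "filename", "document_type", "date", "primary_entity",
        "total_amount", "page_count_estimate", "one_line_summary",
    ]
    with open(output_file, "w", newline="", encoding="utf-8") as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(results)

    print(f"\nWrote {len(results)} records to {output_file}")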

For higher throughput, consider the Message Batches API instead of sequential calls. It lets you submit up to 100,000 requests (or 256 MB of request data) in one batch at a 50% cost discount, though results arrive asynchronously within 24 hours.
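
A batch submission could look roughly like the sketch below. build_batch_requests and submit_batch are hypothetical helper names, and the model ID is an assumption to replace with your own. Each request gets a simple generated custom_id because filenames can contain characters that a custom_id does not allow.

```python
import base64

MODEL = "claude-sonnet-4-5-20250929"  # assumption: substitute whichever model ID you use


def build_batch_requests(pdfs: dict[str, bytes], prompt: str, model: str = MODEL):
    """Turn {filename: raw_pdf_bytes} into batch requests plus a custom_id -> filename map."""
    requests, id_to_name = [], {}
    for index, (name, raw) in enumerate(sorted(pdfs.items())):
        custom_id = f"doc-{index}"
        id_to_name[custom_id] = name
        requests.append({
            "custom_id": custom_id,
            "params": {
                "model": model,
                "max_tokens": 1024,
                "messages": [{
                    "role": "user",
                    "content": [
                        {
                            "type": "document",
                            "source": {
                                "type": "base64",
                                "media_type": "application/pdf",
                                "data": base64.standard_b64encode(raw).decode("utf-8"),
                            },
                        },
                        {"type": "text", "text": prompt},
                    ],
                }],
            },
        })
    return requests, id_to_name


def submit_batch(requests):
    """Submit the requests; needs the anthropic package and ANTHROPIC_API_KEY."""
    import anthropic  # imported here so the builder above works without the SDK

    client = anthropic.Anthropic()
    return client.messages.batches.create(requests=requests)
```

After submission, the usual pattern is to poll client.messages.batches.retrieve(batch.id) until its processing_status is "ended", then iterate client.messages.batches.results(batch.id) and map each custom_id back to its filename.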

Common Errors and Fixes

“Could not process document” / 400 Bad Request

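The PDF is likely too large. Claude supports PDFs up to 100 pages, and the whole request, including the base64-encoded file, is limited to roughly 32 MB. If your file exceeds either limit, split it with a tool like PyPDF2 (now maintained under the name pypdf, with the same PdfReader and PdfWriter classes):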

from PyPDF2 import PdfReader, PdfWriter

reader = PdfReader("large_document.pdf")
for i in range(0, len(reader.pages), 50):
    writer = PdfWriter()
    for page in reader.pages[i:i + 50]:
        writer.add_page(page)
    with open(f"chunk_{i // 50}.pdf", "wb") as f:
        writer.write(f)
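
Before splitting, it can be worth checking whether a file would fit at all. Base64 turns every 3 bytes into 4 characters, so the encoded size can be computed from the file size alone. This is a minimal sketch: the helper names are hypothetical, the roughly 32 MB figure above is read as binary megabytes, and the overhead allowance for the rest of the request is an arbitrary assumption.

```python
import math

MAX_REQUEST_BYTES = 32 * 1024 * 1024  # the roughly 32 MB figure, read as binary MB


def base64_size(raw_size: int) -> int:
    """Length in bytes of the padded base64 encoding of raw_size bytes."""
    return 4 * math.ceil(raw_size / 3)


def fits_in_request(raw_size: int, overhead: int = 64 * 1024) -> bool:
    """True if the encoded PDF plus an assumed allowance for the prompt fits the limit."""
    return base64_size(raw_size) + overhead <= MAX_REQUEST_BYTES
```

A call such as fits_in_request(Path("large_document.pdf").stat().st_size) covers the size limit only; the page count still has to be checked separately, for example with len(reader.pages).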

Base64 encoding issues

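Either base64.standard_b64encode or base64.b64encode works. With default arguments they produce identical output, padding included. Avoid base64.urlsafe_b64encode, which substitutes - and _ for + and / and so does not produce standard base64. Also make sure you call .decode("utf-8") on the result – the API expects a string, not bytes.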
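
A quick standard-library check shows how the encoders relate. The sample bytes are arbitrary, picked so that the encoded form contains the two characters on which the alphabets differ.

```python
import base64

sample = b"\xfb\xff\xfe"  # bytes chosen so the encoding contains '+' and '/'

standard = base64.standard_b64encode(sample).decode("utf-8")
plain = base64.b64encode(sample).decode("utf-8")
urlsafe = base64.urlsafe_b64encode(sample).decode("utf-8")

print(standard == plain)  # the two standard encoders agree
print(standard, urlsafe)  # the URL-safe alphabet uses '-' and '_' instead
```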

“Invalid content type”

Double-check that media_type is set to application/pdf. Sending application/octet-stream or omitting it entirely will fail. The type field on the content block must be "document", not "image" or "file".

Truncated or incomplete responses

If Claude’s response cuts off mid-sentence, increase max_tokens. PDF analysis often produces long responses, especially for multi-page documents. Start with 4096 for summaries and 8192 for detailed extractions.
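
Truncation can also be detected in code instead of by eye: the response carries a stop_reason field, which is "max_tokens" when the limit was hit. The sketch below uses stand-in objects so that it runs without an API call; was_truncated is a hypothetical helper name.

```python
from types import SimpleNamespace


def was_truncated(message) -> bool:
    """True if generation stopped because it hit the max_tokens limit."""
    return getattr(message, "stop_reason", None) == "max_tokens"


# Stand-ins that mimic the one attribute used here; a real call returns a Message object.
cut_off = SimpleNamespace(stop_reason="max_tokens")
finished = SimpleNamespace(stop_reason="end_turn")

print(was_truncated(cut_off), was_truncated(finished))  # prints: True False
```

In the batch pipeline, a truncated response is a likely cause of a JSONDecodeError, so checking this before parsing gives a clearer error message.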

Rate limits on batch processing

The API has per-minute rate limits on both requests and tokens. For large batches, add a simple delay between calls:

import time
from pathlib import Path

pdf_files = sorted(Path("./documents").glob("*.pdf"))

for pdf_path in pdf_files:
    # ... process PDF ...
    time.sleep(1)  # 1 second between requests

Or switch to the Batch API for true high-volume work.
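
A fixed delay slows every request, including the ones that would have succeeded. An alternative is to retry only after a failure, with exponential backoff. The helper below is a generic sketch: call_with_backoff and its parameters are hypothetical, and the delay values are arbitrary starting points.

```python
import random
import time


def call_with_backoff(fn, retry_on=(Exception,), max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on a retryable error wait base_delay * 2**attempt plus jitter, then retry."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Jitter spreads out retries when several workers back off at once.
            sleep(base_delay * 2 ** attempt + random.uniform(0, 0.5))
```

With the SDK, this would be used as call_with_backoff(lambda: client.messages.create(...), retry_on=(anthropic.RateLimitError,)). The client also retries some failures on its own, and its max_retries option may be enough for light workloads.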

Password-protected PDFs

Claude cannot process encrypted or password-protected PDFs. You need to decrypt them first. Use pikepdf for this:

import pikepdf

with pikepdf.open("protected.pdf", password="secret") as pdf:
    pdf.save("decrypted.pdf")

Then send the decrypted file to the API.
