Contracts are dense, repetitive, and nobody reads them cover to cover. But an LLM agent with the right tools can parse a 40-page PDF, pull out the clauses that matter, and answer pointed questions about liability caps, termination windows, and payment schedules in seconds.

The architecture is straightforward: PyMuPDF extracts the raw text, a clause extractor splits it into labeled sections, and an OpenAI agent with function calling tools queries specific parts on demand. You get a contract Q&A system that actually knows where its answers come from.

Install Dependencies

pip install pymupdf openai

PyMuPDF imports as fitz (a historical name). It is fast, handles most PDF layouts well, and does not need Java or external binaries like some alternatives.

Extract Text from a PDF Contract

First, get the raw text out of the PDF and split it into pages. PyMuPDF handles this in a few lines.

import fitz  # PyMuPDF
from dataclasses import dataclass


@dataclass
class ContractPage:
    page_number: int
    text: str


def extract_contract_text(pdf_path: str) -> list[ContractPage]:
    """Extract text from every page of a PDF contract."""
    pages = []
    doc = fitz.open(pdf_path)
    for i, page in enumerate(doc):
        text = page.get_text()
        if text.strip():
            pages.append(ContractPage(page_number=i + 1, text=text.strip()))
    doc.close()
    return pages


def get_full_text(pages: list[ContractPage]) -> str:
    """Combine all pages into a single string with page markers."""
    sections = []
    for p in pages:
        sections.append(f"--- Page {p.page_number} ---\n{p.text}")
    return "\n\n".join(sections)

The ContractPage dataclass keeps page numbers attached to the text. This matters when you want to cite which page a clause came from.
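To see what those markers look like, here is the combined output on made-up data (the dataclass and a condensed version of the helper are repeated so the snippet runs standalone):

```python
from dataclasses import dataclass


@dataclass
class ContractPage:
    page_number: int
    text: str


def get_full_text(pages: list[ContractPage]) -> str:
    """Condensed version of the helper above: join pages with markers."""
    return "\n\n".join(f"--- Page {p.page_number} ---\n{p.text}" for p in pages)


# Two made-up pages standing in for real extraction output
pages = [
    ContractPage(page_number=1, text="MASTER SERVICES AGREEMENT"),
    ContractPage(page_number=2, text="4. Payment Terms. Invoices are due net 30."),
]
full = get_full_text(pages)
print(full)
```

Because the marker carries the page number, any paragraph the agent quotes can be traced back to its page.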

Extract Key Clauses

Contracts follow predictable patterns. Payment terms, termination conditions, liability limits, confidentiality, and indemnification clauses appear in almost every commercial agreement. A keyword-based section extractor catches these reliably.

import re

CLAUSE_PATTERNS = {
    "payment_terms": [
        r"payment\s+terms?", r"compensation", r"invoic(?:e|ing)",
        r"net\s+\d+", r"due\s+(?:date|within|upon)"
    ],
    "termination": [
        r"terminat(?:ion|e|ing)", r"cancellat(?:ion|e)",
        r"expir(?:ation|y|e)", r"right\s+to\s+terminate"
    ],
    "liability": [
        r"liabilit(?:y|ies)", r"limitation\s+of\s+liability",
        r"damages", r"indemnif(?:y|ication)", r"hold\s+harmless"
    ],
    "confidentiality": [
        r"confidential(?:ity)?", r"non-disclosure", r"proprietary\s+information"
    ],
    "governing_law": [
        r"governing\s+law", r"jurisdiction", r"arbitration", r"dispute\s+resolution"
    ],
}


def extract_clauses(full_text: str) -> dict[str, list[str]]:
    """Find paragraphs matching each clause category."""
    paragraphs = re.split(r"\n{2,}", full_text)
    clauses: dict[str, list[str]] = {key: [] for key in CLAUSE_PATTERNS}

    for para in paragraphs:
        para_lower = para.lower()
        for clause_type, patterns in CLAUSE_PATTERNS.items():
            for pattern in patterns:
                if re.search(pattern, para_lower):
                    clauses[clause_type].append(para.strip())
                    break  # avoid duplicating the same paragraph

    return clauses

This gives you a dictionary keyed by clause type, each containing a list of matching paragraphs. Not perfect for every contract format, but it handles the 80% case well and gives the LLM focused context instead of the entire document.
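A quick self-contained check of the matching logic, using a trimmed-down pattern dict and a made-up three-paragraph contract snippet:

```python
import re

# Trimmed-down stand-in for CLAUSE_PATTERNS, just for this demo
patterns = {
    "payment_terms": [r"payment\s+terms?", r"net\s+\d+"],
    "termination": [r"terminat(?:ion|e|ing)"],
}

sample = (
    "4. Payment Terms. Client shall pay all invoices net 30.\n\n"
    "7. Termination. Either party may terminate with 60 days notice.\n\n"
    "9. Notices. All notices shall be in writing."
)

# Same split-then-match flow as extract_clauses above
clauses = {key: [] for key in patterns}
for para in re.split(r"\n{2,}", sample):
    for clause_type, pats in patterns.items():
        if any(re.search(p, para.lower()) for p in pats):
            clauses[clause_type].append(para)

print(clauses["payment_terms"])  # the net-30 paragraph
print(clauses["termination"])    # the 60-day-notice paragraph
```

The notices paragraph matches nothing and is dropped, which is the point: only clause-relevant paragraphs reach the LLM.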

Build the Agent with Function Calling

Now wire everything into an OpenAI agent that can call tools to look up specific clauses or search the raw text. The tools parameter (not the deprecated functions parameter) defines what the agent can do.

import json
from openai import OpenAI

client = OpenAI()

# Parse the contract once at startup
contract_pages = extract_contract_text("contract.pdf")
contract_full_text = get_full_text(contract_pages)
contract_clauses = extract_clauses(contract_full_text)

# Define the tools the agent can call
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_clause",
            "description": "Retrieve extracted paragraphs for a specific clause type from the contract. Available types: payment_terms, termination, liability, confidentiality, governing_law.",
            "parameters": {
                "type": "object",
                "properties": {
                    "clause_type": {
                        "type": "string",
                        "enum": ["payment_terms", "termination", "liability", "confidentiality", "governing_law"],
                        "description": "The type of clause to retrieve"
                    }
                },
                "required": ["clause_type"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "search_contract",
            "description": "Search the full contract text for a specific keyword or phrase. Returns matching paragraphs with page numbers.",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {
                        "type": "string",
                        "description": "Keyword or phrase to search for in the contract"
                    }
                },
                "required": ["query"]
            }
        }
    }
]


def handle_get_clause(clause_type: str) -> str:
    """Return extracted clause paragraphs as a JSON string."""
    matches = contract_clauses.get(clause_type, [])
    if not matches:
        return json.dumps({"clause_type": clause_type, "found": False, "paragraphs": []})
    return json.dumps({"clause_type": clause_type, "found": True, "paragraphs": matches})


def handle_search_contract(query: str) -> str:
    """Search contract text for a keyword and return matching paragraphs with page info."""
    results = []
    for page in contract_pages:
        if query.lower() in page.text.lower():
            # Extract the paragraph containing the match
            for para in page.text.split("\n\n"):
                if query.lower() in para.lower():
                    results.append({"page": page.page_number, "text": para.strip()})
    if not results:
        return json.dumps({"query": query, "found": False, "results": []})
    return json.dumps({"query": query, "found": True, "results": results[:10]})


def dispatch_tool_call(name: str, arguments: dict) -> str:
    """Route a tool call to the correct handler."""
    if name == "get_clause":
        return handle_get_clause(arguments["clause_type"])
    elif name == "search_contract":
        return handle_search_contract(arguments["query"])
    else:
        return json.dumps({"error": f"Unknown tool: {name}"})

Run the Agent Loop

The agent loop sends the user question, checks if the model wants to call tools, executes them, and feeds results back until the model produces a final answer.

def ask_contract_agent(question: str) -> str:
    """Run the contract analysis agent with a user question."""
    messages = [
        {
            "role": "system",
            "content": (
                "You are a contract analysis assistant. You have access to a parsed contract. "
                "Use the get_clause tool to retrieve specific clause types, or search_contract "
                "to find specific terms. Always cite which section or page your answer comes from. "
                "Be precise about dates, amounts, and obligations."
            )
        },
        {"role": "user", "content": question}
    ]

    while True:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            tools=tools,
            tool_choice="auto"
        )

        message = response.choices[0].message

        # No tool calls means the model produced its final answer
        if not message.tool_calls:
            return message.content

        # Process each tool call
        messages.append(message)
        for tool_call in message.tool_calls:
            arguments = json.loads(tool_call.function.arguments)
            result = dispatch_tool_call(tool_call.function.name, arguments)
            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": result
            })


# Ask questions about the contract
answer = ask_contract_agent("What are the payment terms and when are invoices due?")
print(answer)

answer = ask_contract_agent("What happens if either party wants to terminate early?")
print(answer)

answer = ask_contract_agent("Is there a liability cap? What's the maximum exposure?")
print(answer)

The agent will call get_clause with payment_terms, termination, or liability as needed, read the extracted paragraphs, and produce a focused answer. If the clause extractor missed something, the agent falls back to search_contract for a keyword search across the full text.

Handling Multi-Page Contracts

For long contracts (50+ pages), sending the entire text to the LLM burns tokens and can degrade answer quality. A better approach is chunking by section headings and only sending relevant chunks.

def chunk_by_sections(full_text: str, max_chunk_size: int = 3000) -> list[dict]:
    """Split contract text into chunks based on section headings."""
    # Match common contract section patterns like "1.", "1.1", "ARTICLE I", "Section 2".
    # The capturing group makes re.split keep each heading, so it stays attached
    # to the chunk that follows instead of being silently dropped.
    section_pattern = r"(?:^|\n)((?:ARTICLE|Section|SECTION)\s+[IVXLCDM\d]+[.:]?\s|(?:\d+\.)+\s)"
    parts = [p for p in re.split(section_pattern, full_text) if p]

    chunks = []
    current_chunk = ""
    for part in parts:
        if len(current_chunk) + len(part) > max_chunk_size and current_chunk:
            chunks.append({"index": len(chunks), "text": current_chunk.strip()})
            current_chunk = part
        else:
            current_chunk += part
    if current_chunk.strip():
        chunks.append({"index": len(chunks), "text": current_chunk.strip()})

    return chunks

You can add a third tool that retrieves a specific chunk by index, or pre-filter chunks using embedding similarity before passing them to the agent. That keeps token usage low even on 100-page agreements.
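A sketch of that third tool, following the same schema-plus-handler pattern as get_clause (the name get_chunk and its handler are my own additions, assuming the chunks list from chunk_by_sections is in scope):

```python
import json

# Hypothetical third tool: fetch one chunk of the contract by index
get_chunk_tool = {
    "type": "function",
    "function": {
        "name": "get_chunk",
        "description": "Retrieve one chunk of the contract by index. "
                       "Chunks are numbered from 0 in document order.",
        "parameters": {
            "type": "object",
            "properties": {
                "index": {
                    "type": "integer",
                    "description": "Index of the chunk to retrieve"
                }
            },
            "required": ["index"]
        }
    }
}


def handle_get_chunk(chunks: list[dict], index: int) -> str:
    """Return one chunk as JSON, or an error for an out-of-range index."""
    if 0 <= index < len(chunks):
        return json.dumps({"index": index, "total": len(chunks),
                           "text": chunks[index]["text"]})
    return json.dumps({"error": f"index out of range (0-{len(chunks) - 1})"})
```

Append get_chunk_tool to the tools list and add a branch for it in dispatch_tool_call; reporting the total chunk count lets the model page through the document on its own.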

Common Errors and Fixes

fitz.FileDataError: cannot open broken document – The file is corrupted. Note that encrypted PDFs usually open without error but return no text until you authenticate: check doc.needs_pass and call doc.authenticate("your_password") before extracting. If the file is genuinely broken, re-export it from the source application.

Empty text extraction – Some PDFs are scanned images with no text layer; page.get_text() returns an empty string on every page. These need OCR. Install pytesseract and Pillow, render each page to an image, and run Tesseract over it:

import io

import fitz
import pytesseract
from PIL import Image

doc = fitz.open("scanned_contract.pdf")
for page in doc:
    pix = page.get_pixmap(dpi=300)  # render the page at 300 dpi
    img = Image.open(io.BytesIO(pix.tobytes("png")))
    text = pytesseract.image_to_string(img)

openai.BadRequestError: ... tool_calls ... – You forgot to append the assistant message containing tool_calls before appending the tool result. The message sequence must be: user -> assistant (with tool_calls) -> tool (with tool_call_id) -> next turn. Missing any step breaks the conversation format.
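For reference, a minimal valid sequence after one tool call looks like this (the ID and content values are placeholders):

```python
# Minimal valid message sequence after one tool call
messages = [
    {"role": "user", "content": "What are the payment terms?"},
    {
        # assistant turn that requested the tool -- append it before the result
        "role": "assistant",
        "content": None,
        "tool_calls": [{
            "id": "call_abc123",
            "type": "function",
            "function": {"name": "get_clause",
                         "arguments": '{"clause_type": "payment_terms"}'},
        }],
    },
    {
        # tool result -- tool_call_id must match the id above
        "role": "tool",
        "tool_call_id": "call_abc123",
        "content": '{"found": true, "paragraphs": ["..."]}',
    },
]
```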

Clause extractor returns empty lists – The contract uses unusual formatting or legal jargon that does not match the regex patterns. Expand CLAUSE_PATTERNS with terms specific to your contract type, or fall back to sending the full text to the LLM and asking it to identify clause boundaries.

Token limit exceeded on long contracts – If the contract text plus conversation exceeds the model context window, use the chunking approach from the previous section. gpt-4o has a 128k token context, which covers roughly 300 pages of text, but a smaller, focused context usually produces better answers than stuffing in everything at once.
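As a rough guard before sending, the common ~4-characters-per-token heuristic for English prose is enough to decide whether to chunk (tiktoken gives exact counts; the threshold here is illustrative, not part of the agent above):

```python
def rough_token_estimate(text: str) -> int:
    """Very rough heuristic: ~4 characters per token for English prose."""
    return len(text) // 4


CONTEXT_LIMIT = 128_000  # gpt-4o context window

contract_text = "lorem ipsum " * 50_000  # stand-in for a very long contract
estimate = rough_token_estimate(contract_text)
if estimate > CONTEXT_LIMIT * 0.75:  # leave headroom for the conversation
    print(f"~{estimate:,} tokens: chunk the contract instead of sending it whole")
```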