You have a stack of PDFs and someone asks a question about them. You could open each file, Ctrl+F your way through, and hope you find the right section. Or you could build an agent that does it for you — one that extracts the text, searches it semantically, and returns an answer with the exact page number where it found the information.

That’s what we’re building here. A document QA agent that parses PDFs with PyMuPDF, chunks and embeds the text with sentence-transformers, and uses OpenAI’s tool-calling API to search and answer in a loop. The whole thing runs in a single Python script with no vector database required.

Extract Text from PDFs

PyMuPDF (imported as fitz) is a fast Python binding to the MuPDF C library. It extracts text page by page, which is exactly what we need for page-level citations.

Install the dependencies first:

pip install PyMuPDF sentence-transformers openai numpy

Now extract text from a PDF, keeping track of which page each block of text came from:

import fitz  # PyMuPDF


def extract_pages(pdf_path: str) -> list[dict]:
    """Extract text from each page of a PDF."""
    pages = []
    doc = fitz.open(pdf_path)
    for page_num in range(len(doc)):
        page = doc[page_num]
        text = page.get_text("text")
        if text.strip():
            pages.append({
                "page": page_num + 1,
                "text": text.strip(),
                "source": pdf_path,
            })
    doc.close()
    return pages


# Extract from multiple PDFs
pdf_files = ["report_q4.pdf", "product_spec.pdf"]
all_pages = []
for pdf in pdf_files:
    all_pages.extend(extract_pages(pdf))

print(f"Extracted {len(all_pages)} pages from {len(pdf_files)} documents")

The get_text("text") call returns plain text in reading order. For scanned PDFs you'd need OCR, but for born-digital PDFs it handles headers and body text well (tables come out as flattened text, which is usually good enough for retrieval). Each page entry carries its source file and page number so we can cite them later.

Chunk and Embed the Documents

Full pages are often too long for a single embedding. We need to split them into smaller chunks while preserving the page metadata. A simple fixed-size chunking approach with overlap works well here.

from sentence_transformers import SentenceTransformer
import numpy as np


def chunk_pages(pages: list[dict], chunk_size: int = 500, overlap: int = 100) -> list[dict]:
    """Split page text into overlapping chunks with metadata."""
    chunks = []
    for page in pages:
        text = page["text"]
        words = text.split()
        if len(words) <= chunk_size:
            chunks.append({
                "text": text,
                "page": page["page"],
                "source": page["source"],
            })
        else:
            for i in range(0, len(words), chunk_size - overlap):
                chunk_words = words[i:i + chunk_size]
                if len(chunk_words) < 50:
                    continue  # skip tiny trailing chunks
                chunks.append({
                    "text": " ".join(chunk_words),
                    "page": page["page"],
                    "source": page["source"],
                })
    return chunks


# Chunk all pages
chunks = chunk_pages(all_pages, chunk_size=500, overlap=100)
print(f"Created {len(chunks)} chunks")

# Embed with sentence-transformers
model = SentenceTransformer("all-MiniLM-L6-v2")
texts = [c["text"] for c in chunks]
embeddings = model.encode(texts, show_progress_bar=True, normalize_embeddings=True)

print(f"Embeddings shape: {embeddings.shape}")

We use all-MiniLM-L6-v2 because it’s small (80MB), fast, and accurate enough for document retrieval. The normalize_embeddings=True flag means we can use dot product instead of cosine similarity later, which is slightly faster. Each chunk keeps its page number and source file so we can trace answers back to exact locations.
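To see why normalization lets us swap dot product for cosine similarity, here is a tiny stdlib-only illustration (the vectors are made up):

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a, b = [3.0, 4.0], [1.0, 2.0]
na, nb = normalize(a), normalize(b)

# On unit vectors, dot product and cosine similarity coincide
print(abs(dot(na, nb) - cosine(a, b)) < 1e-12)  # True
```

Cosine similarity is just the dot product divided by the vector norms, so once every norm is 1 the division is a no-op and you can skip it.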

Build the Search Tool

The search tool takes a query string, embeds it, and returns the top-k most similar chunks. This is what the agent will call when it needs to look something up.

def search_documents(query: str, top_k: int = 5) -> list[dict]:
    """Search document chunks by semantic similarity."""
    query_embedding = model.encode([query], normalize_embeddings=True)
    scores = np.dot(embeddings, query_embedding.T).squeeze()
    top_indices = np.argsort(scores)[::-1][:top_k]

    results = []
    for idx in top_indices:
        results.append({
            "text": chunks[idx]["text"],
            "page": chunks[idx]["page"],
            "source": chunks[idx]["source"],
            "score": float(scores[idx]),
        })
    return results

No vector database needed. For a few hundred PDFs, numpy dot product on normalized embeddings is fast enough. If you’re dealing with millions of chunks, swap in FAISS or a proper vector store, but for most document QA workflows this scales fine.
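The top-k selection itself is nothing exotic. If you'd rather not sort the full score array, a stdlib heapq sketch (over made-up scores) does the same job in O(n log k):

```python
import heapq

# Hypothetical similarity scores, one per chunk
scores = [0.12, 0.87, 0.45, 0.91, 0.33, 0.78]

def top_k_indices(scores, k):
    """Return indices of the k highest scores, best first."""
    return [i for _, i in heapq.nlargest(k, ((s, i) for i, s in enumerate(scores)))]

print(top_k_indices(scores, 3))  # [3, 1, 5]
```

With numpy, np.argpartition gets you the same asymptotics; the full argsort in search_documents is simply not worth optimizing at this scale.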

Test it with a quick query:

results = search_documents("What were Q4 revenue figures?")
for r in results[:3]:
    print(f"[{r['source']} p.{r['page']}] (score: {r['score']:.3f})")
    print(f"  {r['text'][:150]}...\n")

Wire Up the Agent Loop

Now the interesting part. We define the search function as a tool for the OpenAI API and run an agent loop that calls it as needed. The agent decides when to search, reads the results, and formulates an answer with citations.

import json
from openai import OpenAI

client = OpenAI()

tools = [
    {
        "type": "function",
        "function": {
            "name": "search_documents",
            "description": "Search through parsed PDF documents to find relevant passages. Returns text chunks with page numbers and source file names.",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {
                        "type": "string",
                        "description": "The search query to find relevant document passages",
                    },
                    "top_k": {
                        "type": "integer",
                        "description": "Number of results to return (default 5)",
                    },
                },
                "required": ["query"],
            },
        },
    }
]

SYSTEM_PROMPT = """You are a document QA assistant. You answer questions based on PDF documents.
Always search the documents before answering. Cite your sources with the filename and page number.
Format citations like: (source.pdf, p.12). If the documents don't contain the answer, say so."""


def run_agent(question: str, max_turns: int = 5) -> str:
    """Run the document QA agent loop."""
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": question},
    ]

    for turn in range(max_turns):
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            tools=tools,
            tool_choice="auto",
        )

        message = response.choices[0].message
        messages.append(message)

        # If no tool calls, the agent is done
        if not message.tool_calls:
            return message.content

        # Process each tool call
        for tool_call in message.tool_calls:
            if tool_call.function.name == "search_documents":
                args = json.loads(tool_call.function.arguments)
                results = search_documents(
                    query=args["query"],
                    top_k=args.get("top_k", 5),
                )
                messages.append({
                    "role": "tool",
                    "tool_call_id": tool_call.id,
                    "content": json.dumps(results, indent=2),
                })

    # If we exhaust max_turns, the last message is a tool result, not an answer
    return "Max turns reached without a final answer."


# Ask a question
answer = run_agent("What was the total revenue in Q4 and which products contributed most?")
print(answer)

A few things to note about this agent loop. The tool_choice="auto" parameter lets the model decide when to search. It might search multiple times for complex questions — once for revenue numbers, once for product breakdowns. The loop runs up to max_turns iterations, and each tool call result gets appended as a tool role message with the matching tool_call_id. That’s how OpenAI’s API links results back to specific function calls.
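That pairing constraint is worth internalizing: every tool_call_id the assistant emits must be answered by exactly one tool message before the next API call, or the request is rejected. A small stdlib sketch of that invariant, checked over a hand-built message list (not the real API objects), looks like:

```python
def tool_calls_answered(messages):
    """Check that every assistant tool call has a matching tool reply."""
    pending = set()
    for m in messages:
        if m.get("role") == "assistant":
            for tc in m.get("tool_calls") or []:
                pending.add(tc["id"])
        elif m.get("role") == "tool":
            pending.discard(m["tool_call_id"])
    return not pending

msgs = [
    {"role": "assistant", "tool_calls": [{"id": "call_1"}]},
    {"role": "tool", "tool_call_id": "call_1", "content": "[...]"},
]
print(tool_calls_answered(msgs))  # True
```

A check like this is handy when debugging BadRequestError complaints about unanswered tool calls.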

The agent will naturally cite sources because the search results include page numbers and filenames, and the system prompt tells it to use them. You’ll get answers like “Q4 revenue was $12.3M (report_q4.pdf, p.4), with the enterprise product line contributing 62% of total revenue (report_q4.pdf, p.7).”
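If you want to verify programmatically that the agent actually cited something, a quick regex over the answer, matching the (file.pdf, p.N) format the system prompt requests, is enough:

```python
import re

# Matches citations in the "(file.pdf, p.12)" format the system prompt asks for
CITATION_RE = re.compile(r"\(([\w.\- ]+\.pdf),\s*p\.(\d+)\)")

answer = ("Q4 revenue was $12.3M (report_q4.pdf, p.4), with the enterprise "
          "product line contributing 62% of total revenue (report_q4.pdf, p.7).")

citations = CITATION_RE.findall(answer)
print(citations)  # [('report_q4.pdf', '4'), ('report_q4.pdf', '7')]
```

An empty match list is a useful signal to re-prompt or flag the answer as uncited.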

Common Errors and Fixes

RuntimeError: No module named 'frontend' when importing fitz

This happens when you have both PyMuPDF and the old fitz package installed. They conflict. Uninstall both and reinstall only PyMuPDF:

pip uninstall fitz PyMuPDF
pip install PyMuPDF

openai.BadRequestError: ... 'functions' is not allowed

You’re mixing old and new API parameters. The functions and function_call parameters were deprecated. Use tools and tool_choice instead, as shown in the agent loop above. Also make sure you’re on openai>=1.0.0:

pip install --upgrade openai

numpy.AxisError: axis 1 is out of bounds for array of dimension 1

This happens when embeddings is 1-D, usually because you only embedded a single chunk. The np.dot call expects a 2-D array. Fix it by ensuring your embeddings array always has the right shape:

# Force 2D shape even with single document
if embeddings.ndim == 1:
    embeddings = embeddings.reshape(1, -1)

Empty text extraction from scanned PDFs

If page.get_text("text") returns empty strings, the PDF contains scanned images rather than selectable text, and you need OCR. Either use PyMuPDF's Tesseract integration or rasterize each page and run it through pytesseract:

pip install pytesseract Pillow
import pytesseract
from PIL import Image
import io

def extract_page_ocr(page):
    pix = page.get_pixmap(dpi=300)
    img = Image.open(io.BytesIO(pix.tobytes("png")))
    return pytesseract.image_to_string(img)
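
To decide automatically when to fall back to OCR, a simple heuristic on the extracted text works: if a page yields almost no selectable characters, treat it as scanned. A sketch (the 25-character threshold is an arbitrary assumption you should tune):

```python
def needs_ocr(extracted_text: str, min_chars: int = 25) -> bool:
    """Heuristic: a page with almost no selectable text is probably scanned."""
    return len(extracted_text.strip()) < min_chars

print(needs_ocr(""))                                            # True (image-only page)
print(needs_ocr("Quarterly revenue rose 14% year over year."))  # False
```

In extract_pages, you would then call extract_page_ocr(page) whenever needs_ocr(page.get_text("text")) is true, and keep the fast text-layer path for everything else.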