You have a stack of PDFs and someone asks a question about them. You could open each file, Ctrl+F your way through, and hope you find the right section. Or you could build an agent that does it for you — one that extracts the text, searches it semantically, and returns an answer with the exact page number where it found the information.
That’s what we’re building here. A document QA agent that parses PDFs with PyMuPDF, chunks and embeds the text with sentence-transformers, and uses OpenAI’s tool-calling API to search and answer in a loop. The whole thing runs in a single Python script with no vector database required.
Extract Text from PDFs
PyMuPDF (imported as fitz) is a fast Python binding to the MuPDF C library. It extracts text page by page, which is exactly what we need for page-level citations.
Install the dependencies first:
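A minimal install covering everything used below (PyMuPDF installs the `fitz` import; sentence-transformers pulls in PyTorch, so the download is sizeable):

```shell
pip install pymupdf sentence-transformers openai numpy
```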
Now extract text from a PDF, keeping track of which page each block of text came from:
The get_text("text") call returns plain text in reading order. For scanned PDFs you’d need OCR, but for born-digital PDFs this handles tables, headers, and body text well. Each page entry carries its source file and page number so we can cite them later.
Chunk and Embed the Documents
Full pages are often too long for a single embedding. We need to split them into smaller chunks while preserving the page metadata. A simple fixed-size chunking approach with overlap works well here.
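One way to sketch this, assuming the page dicts from the extraction step; the chunk size of 800 characters with 100 overlap is an illustrative default, not a tuned value:

```python
def chunk_pages(pages, chunk_size=800, overlap=100):
    """Split page texts into overlapping fixed-size chunks, keeping page metadata."""
    chunks = []
    for entry in pages:
        text = entry["text"]
        start = 0
        while start < len(text):
            chunks.append({
                "source": entry["source"],
                "page": entry["page"],
                "text": text[start:start + chunk_size],
            })
            start += chunk_size - overlap
    return chunks

def embed_chunks(chunks):
    """Embed chunk texts; normalized vectors make dot product equal cosine similarity."""
    # Lazy import: loading the model is slow, so keep it out of module import time.
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer("all-MiniLM-L6-v2")
    return model.encode([c["text"] for c in chunks], normalize_embeddings=True)
```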
We use all-MiniLM-L6-v2 because it’s small (80MB), fast, and accurate enough for document retrieval. The normalize_embeddings=True flag means we can use dot product instead of cosine similarity later, which is slightly faster. Each chunk keeps its page number and source file so we can trace answers back to exact locations.
Build the Search Tool
The search tool takes a query string, embeds it, and returns the top-k most similar chunks. This is what the agent will call when it needs to look something up.
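A sketch of the tool, split into a pure scoring function and a wrapper that embeds the query (the function names are ours; `model`, `chunk_embs`, and `chunks` come from the previous step):

```python
import numpy as np

def top_k(query_emb, chunk_embs, chunks, k=5):
    """Score every chunk against the query and return the best k with scores."""
    scores = np.dot(chunk_embs, query_emb)  # normalized vectors: dot == cosine
    best = np.argsort(scores)[::-1][:k]
    return [{**chunks[i], "score": float(scores[i])} for i in best]

def search(query, model, chunk_embs, chunks, k=5):
    """Embed a query string and return the top-k most similar chunks."""
    query_emb = model.encode(query, normalize_embeddings=True)
    return top_k(query_emb, chunk_embs, chunks, k)
```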
No vector database needed. For a few hundred PDFs, numpy dot product on normalized embeddings is fast enough. If you’re dealing with millions of chunks, swap in FAISS or a proper vector store, but for most document QA workflows this scales fine.
Test it with a quick query:
Wire Up the Agent Loop
Now the interesting part. We define the search function as a tool for the OpenAI API and run an agent loop that calls it as needed. The agent decides when to search, reads the results, and formulates an answer with citations.
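A sketch of the loop using the current OpenAI tools API. The tool name `search_documents`, the system prompt wording, and the `gpt-4o-mini` model choice are assumptions; `search_fn` is expected to be a closure over the `search` function and index from the previous section:

```python
import json

# JSON Schema for the search tool; the model fills in the "query" argument.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "search_documents",
        "description": "Search the indexed PDFs and return the most relevant "
                       "passages with their source file and page number.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string", "description": "Search query"}},
            "required": ["query"],
        },
    },
}]

SYSTEM = ("Answer questions using the search_documents tool. "
          "Cite every claim as (filename, p.N) using the returned metadata.")

def run_agent(question, search_fn, max_turns=5, model="gpt-4o-mini"):
    from openai import OpenAI  # lazy import so the module loads without the SDK
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    messages = [{"role": "system", "content": SYSTEM},
                {"role": "user", "content": question}]
    for _ in range(max_turns):
        resp = client.chat.completions.create(
            model=model, messages=messages, tools=TOOLS, tool_choice="auto")
        msg = resp.choices[0].message
        if not msg.tool_calls:
            return msg.content  # no more searches: this is the final answer
        messages.append(msg)  # keep the assistant turn that requested the calls
        for tc in msg.tool_calls:
            args = json.loads(tc.function.arguments)
            results = search_fn(args["query"])
            messages.append({"role": "tool", "tool_call_id": tc.id,
                             "content": json.dumps(results)})
    return "Stopped after reaching the turn limit."
```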
A few things to note about this agent loop. The tool_choice="auto" parameter lets the model decide when to search. It might search multiple times for complex questions — once for revenue numbers, once for product breakdowns. The loop runs up to max_turns iterations, and each tool call result gets appended as a tool role message with the matching tool_call_id. That’s how OpenAI’s API links results back to specific function calls.
The agent will naturally cite sources because the search results include page numbers and filenames, and the system prompt tells it to use them. You’ll get answers like “Q4 revenue was $12.3M (report_q4.pdf, p.4), with the enterprise product line contributing 62% of total revenue (report_q4.pdf, p.7).”
Common Errors and Fixes
RuntimeError: No module named 'frontend' when importing fitz
This happens when you have both PyMuPDF and the old fitz package installed. They conflict. Uninstall both and reinstall only PyMuPDF:
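```shell
pip uninstall -y fitz pymupdf
pip install pymupdf
```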
openai.BadRequestError: ... 'functions' is not allowed
You’re mixing old and new API parameters. The functions and function_call parameters were deprecated. Use tools and tool_choice instead, as shown in the agent loop above. Also make sure you’re on openai>=1.0.0:
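```shell
pip install --upgrade "openai>=1.0.0"
```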
numpy.AxisError: axis 1 is out of bounds for array of dimension 1
This happens when the embeddings array is 1-D, usually because a single string was passed to encode instead of a list: sentence-transformers returns a flat vector for a string and a 2-D array for a list. Any operation along axis=1 then fails. Fix it by ensuring your embeddings array always has a (n_chunks, dim) shape:
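A small guard works regardless of how the embeddings were produced (the helper name is ours):

```python
import numpy as np

def as_matrix(embeddings):
    """Guarantee a 2-D (n_chunks, dim) array even when only one vector came back."""
    return np.atleast_2d(np.asarray(embeddings))
```

Alternatively, always pass a list to encode, even for one chunk, and you get a 2-D array back.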
Empty text extraction from scanned PDFs
If page.get_text("text") returns empty strings, the PDF contains scanned images rather than selectable text. You need OCR. Add pymupdf OCR support or preprocess with Tesseract:
Related Guides
- How to Build a Contract Analysis Agent with LLMs and PDF Parsing
- How to Build a Tool-Calling Agent with Claude and MCP
- How to Build a Customer Support Agent with RAG and Tool Calling
- How to Build a SQL Query Agent with LLMs and Tool Calling
- How to Build a Slack Bot Agent with LLMs and Bolt
- How to Build a Log Analysis Agent with LLMs and Regex Tools
- How to Build a Financial Analysis Agent with LLMs and Market Data
- How to Build a Monitoring Agent with Prometheus Alerts and LLM Diagnosis
- How to Build an API Testing Agent with LLMs and Requests
- How to Build a Scheduling Agent with Calendar and Email Tools