The Quick Version
You feed unstructured text to an LLM, ask it to extract (subject, predicate, object) triples, parse the output into structured data with Pydantic, and load those triples into a graph. NetworkX works for prototyping. Neo4j works for production.
The client.beta.chat.completions.parse method with response_format forces the model to return JSON matching your Pydantic schema. No regex parsing, no hoping the model returns valid JSON. This is the single biggest improvement over older extraction pipelines.
Why Pydantic for Structured Output
Older approaches used prompts like “return JSON in this format…” and then crossed their fingers. The model would sometimes return markdown-wrapped JSON, sometimes add commentary, sometimes hallucinate extra fields. OpenAI’s structured output feature (and Anthropic’s equivalent) guarantees the response matches your schema.
The Triple model is intentionally simple. You could extend it with confidence scores, source spans, or entity types:
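One possible extended schema, as a sketch; the field names and the choice to carry confidence and a source sentence are assumptions you would adapt to your pipeline:

```python
from pydantic import BaseModel, Field


class TypedTriple(BaseModel):
    subject: str
    subject_type: str   # e.g. "Person", "Company", "Product"
    predicate: str
    object: str
    object_type: str
    confidence: float = Field(default=1.0, ge=0.0, le=1.0)
    source_span: str = ""  # the sentence the triple was extracted from
```

The `ge`/`le` constraints mean the model (or a downstream validator) cannot hand you a confidence outside [0, 1].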
For most use cases, the basic triple is enough. Add entity types when you need to filter or style nodes differently in visualization.
Building the Graph with NetworkX
NetworkX is the fastest way to go from triples to a queryable graph. Good for prototyping, exploratory analysis, and datasets under ~100K triples.
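A minimal sketch of loading triples into NetworkX; `MultiDiGraph` is used here so the same pair of entities can carry several different predicates in parallel:

```python
import networkx as nx


def build_graph(triples):
    """triples: iterable of (subject, predicate, object) string tuples."""
    G = nx.MultiDiGraph()
    for s, p, o in triples:
        # add_edge creates missing nodes automatically
        G.add_edge(s, o, predicate=p)
    return G


triples = [
    ("Sam Altman", "co-founded", "OpenAI"),
    ("OpenAI", "released", "GPT-4"),
    ("Microsoft", "invested in", "OpenAI"),
]
G = build_graph(triples)
print(G.number_of_nodes(), G.number_of_edges())  # 4 3
print(list(G.successors("OpenAI")))              # ['GPT-4']
```

From here, standard NetworkX calls give you traversals (`nx.descendants`), centrality, and subgraph extraction for free.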
NetworkX stores everything in memory. Once you cross ~500K edges or need concurrent access, move to Neo4j.
Storing in Neo4j for Production
Neo4j gives you persistent storage, Cypher queries, and a visual graph explorer. The Python driver handles the connection.
MERGE is critical here – it prevents duplicate nodes and edges when you process overlapping documents. If you used CREATE instead, you would get duplicate “OpenAI” nodes every time a new document mentions the company.
Querying the Graph with Cypher
Once the triples are loaded, Cypher lets you ask questions that would be painful with SQL joins.
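A sketch of a two-hop query against the schema assumed above (`Entity` nodes, `REL` relationships with a `predicate` property); the predicate strings match the earlier example triples:

```python
TWO_HOP = """
MATCH (m:Entity {name: $name})
      -[:REL {predicate: 'invested in'}]->(c:Entity)
      -[:REL {predicate: 'released'}]->(p:Entity)
RETURN c.name AS company, p.name AS product
"""


def investments_and_products(session, name="Microsoft"):
    """Companies that `name` invested in, and the products those companies released."""
    return [(r["company"], r["product"]) for r in session.run(TWO_HOP, name=name)]
```

Adding a third hop is one more pattern segment in the MATCH clause; no schema migration, no new join table.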
Two-hop and three-hop traversals are where knowledge graphs pay off. Finding “companies that Microsoft invested in, and the products those companies released” is a single Cypher query. In a relational database, that is multiple joins across tables you would have to design upfront.
Processing Documents Incrementally
Real-world graphs get built document by document. Here is a pipeline that processes a batch of texts and deduplicates entities.
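A sketch of the incremental pipeline; `extract` stands in for whatever extraction function you use, and the whitespace-plus-lowercase normalization is the simplest possible dedup key:

```python
from collections import Counter


def normalize(name: str) -> str:
    return " ".join(name.split()).lower()


def process_batch(texts, extract):
    """extract(text) -> list of (s, p, o) tuples.

    Returns a Counter mapping each normalized triple to the number of
    documents that confirm it.
    """
    sources = Counter()
    for text in texts:
        seen_in_doc = set()
        for s, p, o in extract(text):
            key = (normalize(s), normalize(p), normalize(o))
            if key not in seen_in_doc:  # count each triple once per document
                seen_in_doc.add(key)
                sources[key] += 1
    return sources


# keep only triples confirmed by at least two documents:
# confident = {t: n for t, n in sources.items() if n >= 2}
```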
The sources counter tracks how many documents confirm each relationship. Higher counts mean higher confidence. You can filter out triples that only appear once to reduce noise.
Prompting Strategies That Actually Matter
The extraction prompt is where most people get mediocre results. Three things make the biggest difference:
Be specific about predicate vocabulary. If you let the model freestyle, you will get “founded”, “was founded by”, “co-founded”, and “started” as separate predicates for the same relationship. Provide a predicate list in the system prompt, or add a normalization pass.
Set entity granularity explicitly. Should “San Francisco” and “SF” be the same node? Should “GPT-4” and “GPT-4 Turbo” merge? Tell the model your rules. Otherwise you get a fragmented graph.
Chunk long documents. Models lose extraction quality past ~2000 tokens of input. Split documents into overlapping chunks of 500-800 tokens and extract from each chunk separately. The MERGE strategy in Neo4j (or the dedup logic in NetworkX) handles the overlap.
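A sketch of overlapping chunking. It counts words rather than tokens, which is a rough approximation; swap in a tokenizer such as tiktoken if you need exact token budgets:

```python
def chunk_text(text: str, chunk_size: int = 700, overlap: int = 150) -> list[str]:
    """Split text into overlapping word-based chunks.

    Consecutive chunks share `overlap` words so relationships that
    straddle a chunk boundary are still seen whole by the extractor.
    """
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks
```

Duplicate triples extracted from the overlapping regions are exactly what the MERGE / dedup step absorbs.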
Common Errors and Fixes
openai.BadRequestError: ... response_format is not supported
Structured outputs require gpt-4o-2024-08-06 or later. If you are on an older model snapshot, either update the model string or fall back to JSON mode with manual parsing.
Neo4j ServiceUnavailable: Failed to establish connection
The Neo4j server is not running. Start it with docker run -d -p 7687:7687 -p 7474:7474 -e NEO4J_AUTH=neo4j/password neo4j:latest. Wait 10-15 seconds for it to initialize before connecting.
Duplicate nodes with slightly different names
The model extracts “Sam Altman” from one document and “Altman” from another. Add a normalization step after extraction that maps entity aliases to canonical names. A simple approach is to use the LLM itself:
Hallucinated triples
The model invents relationships not present in the text. This happens more with creative or ambiguous text. Add an instruction to the system prompt: “Only extract relationships explicitly stated in the text. Do not infer or guess.” It does not eliminate the problem entirely, but it reduces it significantly.
NetworkXError: Node X is not in the digraph
You are querying for a node name that does not match exactly. Entity names are case-sensitive in NetworkX. Normalize everything to .title() or .lower() before insertion and querying.
Related Guides
- How to Build Prompt Versioning and Regression Testing for LLMs
- How to Build Prefix Tuning for LLMs with PEFT and PyTorch
- How to Build Multi-Turn Chatbots with Conversation Memory
- How to Build Structured Output Parsers with Pydantic and LLMs
- How to Build Agentic RAG with Query Routing and Self-Reflection
- How to Build Prompt Chains with Async LLM Calls and Batching
- How to Fine-Tune LLMs with LoRA and Unsloth
- How to Fine-Tune LLMs with DPO and RLHF
- How to Build Prompt Fallback Chains with Automatic Model Switching
- How to Fine-Tune LLMs on Custom Datasets with Axolotl