The Quick Version
You feed unstructured text to an LLM, ask it to extract (subject, predicate, object) triples, parse the output into structured data with Pydantic, and load those triples into a graph. NetworkX works for prototyping. Neo4j works for production.
The client.beta.chat.completions.parse method with response_format forces the model to return JSON matching your Pydantic schema. No regex parsing, no hoping the model returns valid JSON. This is the single biggest improvement over older extraction pipelines.
Why Pydantic for Structured Output
Older approaches used prompts like “return JSON in this format…” and then crossed their fingers. The model would sometimes return markdown-wrapped JSON, sometimes add commentary, sometimes hallucinate extra fields. OpenAI’s structured output feature (and Anthropic’s equivalent) guarantees the response matches your schema.
The Triple model is intentionally simple. You could extend it with confidence scores, source spans, or entity types:
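One possible extended schema, as a sketch; the field names and the choice to carry confidence and a source sentence are assumptions you would adapt to your pipeline:

```python
from pydantic import BaseModel, Field


class TypedTriple(BaseModel):
    subject: str
    subject_type: str   # e.g. "Person", "Company", "Product"
    predicate: str
    object: str
    object_type: str
    confidence: float = Field(default=1.0, ge=0.0, le=1.0)
    source_span: str = ""  # the sentence the triple was extracted from
```

The `ge`/`le` constraints mean the model (or a downstream validator) cannot hand you a confidence outside [0, 1].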
For most use cases, the basic triple is enough. Add entity types when you need to filter or style nodes differently in visualization.
Building the Graph with NetworkX
NetworkX is the fastest way to go from triples to a queryable graph. Good for prototyping, exploratory analysis, and datasets under ~100K triples.
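A minimal sketch of loading triples into NetworkX; `MultiDiGraph` is used here so the same pair of entities can carry several different predicates in parallel:

```python
import networkx as nx


def build_graph(triples):
    """triples: iterable of (subject, predicate, object) string tuples."""
    G = nx.MultiDiGraph()
    for s, p, o in triples:
        # add_edge creates missing nodes automatically
        G.add_edge(s, o, predicate=p)
    return G


triples = [
    ("Sam Altman", "co-founded", "OpenAI"),
    ("OpenAI", "released", "GPT-4"),
    ("Microsoft", "invested in", "OpenAI"),
]
G = build_graph(triples)
print(G.number_of_nodes(), G.number_of_edges())  # 4 3
print(list(G.successors("OpenAI")))              # ['GPT-4']
```

From here, standard NetworkX calls give you traversals (`nx.descendants`), centrality, and subgraph extraction for free.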
NetworkX stores everything in memory. Once you cross ~500K edges or need concurrent access, move to Neo4j.
Storing in Neo4j for Production
Neo4j gives you persistent storage, Cypher queries, and a visual graph explorer. The Python driver handles the connection.
MERGE is critical here – it prevents duplicate nodes and edges when you process overlapping documents. If you used CREATE instead, you would get duplicate “OpenAI” nodes every time a new document mentions the company.
Querying the Graph with Cypher
Once the triples are loaded, Cypher lets you ask questions that would be painful with SQL joins.
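A sketch of a two-hop query against the schema assumed above (`Entity` nodes, `REL` relationships with a `predicate` property); the predicate strings match the earlier example triples:

```python
TWO_HOP = """
MATCH (m:Entity {name: $name})
      -[:REL {predicate: 'invested in'}]->(c:Entity)
      -[:REL {predicate: 'released'}]->(p:Entity)
RETURN c.name AS company, p.name AS product
"""


def investments_and_products(session, name="Microsoft"):
    """Companies that `name` invested in, and the products those companies released."""
    return [(r["company"], r["product"]) for r in session.run(TWO_HOP, name=name)]
```

Adding a third hop is one more pattern segment in the MATCH clause; no schema migration, no new join table.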
Two-hop and three-hop traversals are where knowledge graphs pay off. Finding “companies that Microsoft invested in, and the products those companies released” is a single Cypher query. In a relational database, that is multiple joins across tables you would have to design upfront.
Processing Documents Incrementally
Real-world graphs get built document by document. Here is a pipeline that processes a batch of texts and deduplicates entities.
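A sketch of the incremental pipeline; `extract` stands in for whatever extraction function you use, and the whitespace-plus-lowercase normalization is the simplest possible dedup key:

```python
from collections import Counter


def normalize(name: str) -> str:
    return " ".join(name.split()).lower()


def process_batch(texts, extract):
    """extract(text) -> list of (s, p, o) tuples.

    Returns a Counter mapping each normalized triple to the number of
    documents that confirm it.
    """
    sources = Counter()
    for text in texts:
        seen_in_doc = set()
        for s, p, o in extract(text):
            key = (normalize(s), normalize(p), normalize(o))
            if key not in seen_in_doc:  # count each triple once per document
                seen_in_doc.add(key)
                sources[key] += 1
    return sources


# keep only triples confirmed by at least two documents:
# confident = {t: n for t, n in sources.items() if n >= 2}
```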
The sources counter tracks how many documents confirm each relationship. Higher counts mean higher confidence. You can filter out triples that only appear once to reduce noise.
Prompting Strategies That Actually Matter
The extraction prompt is where most people get mediocre results. Three things make the biggest difference:
Be specific about predicate vocabulary. If you let the model freestyle, you will get “founded”, “was founded by”, “co-founded”, and “started” as separate predicates for the same relationship. Provide a predicate list in the system prompt, or add a normalization pass.
Set entity granularity explicitly. Should “San Francisco” and “SF” be the same node? Should “GPT-4” and “GPT-4 Turbo” merge? Tell the model your rules. Otherwise you get a fragmented graph.
Chunk long documents. Models lose extraction quality past ~2000 tokens of input. Split documents into overlapping chunks of 500-800 tokens and extract from each chunk separately. The MERGE strategy in Neo4j (or the dedup logic in NetworkX) handles the overlap.
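A sketch of overlapping chunking. It counts words rather than tokens, which is a rough approximation; swap in a tokenizer such as tiktoken if you need exact token budgets:

```python
def chunk_text(text: str, chunk_size: int = 700, overlap: int = 150) -> list[str]:
    """Split text into overlapping word-based chunks.

    Consecutive chunks share `overlap` words so relationships that
    straddle a chunk boundary are still seen whole by the extractor.
    """
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks
```

Duplicate triples extracted from the overlapping regions are exactly what the MERGE / dedup step absorbs.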
Common Errors and Fixes
openai.BadRequestError: ... response_format is not supported
Structured outputs require gpt-4o-2024-08-06 or later. If you are on an older model snapshot, either update the model string or fall back to JSON mode with manual parsing.
Neo4j ServiceUnavailable: Failed to establish connection
The Neo4j server is not running. Start it with docker run -d -p 7687:7687 -p 7474:7474 -e NEO4J_AUTH=neo4j/password neo4j:latest. Wait 10-15 seconds for it to initialize before connecting.
Duplicate nodes with slightly different names
The model extracts “Sam Altman” from one document and “Altman” from another. Add a normalization step after extraction that maps entity aliases to canonical names. A simple approach is to use the LLM itself:
Hallucinated triples
The model invents relationships not present in the text. This happens more with creative or ambiguous text. Add an instruction to the system prompt: “Only extract relationships explicitly stated in the text. Do not infer or guess.” It does not eliminate the problem entirely, but it reduces it significantly.
NetworkXError: Node X is not in the digraph
You are querying for a node name that does not match exactly. Entity names are case-sensitive in NetworkX. Normalize everything to .title() or .lower() before insertion and querying.
Related Guides
- How to Build Prompt Versioning and Regression Testing for LLMs
- How to Build Prefix Tuning for LLMs with PEFT and PyTorch
- How to Build Multi-Turn Chatbots with Conversation Memory
- How to Build Structured Output Parsers with Pydantic and LLMs
- How to Build Agentic RAG with Query Routing and Self-Reflection
- How to Build Prompt Chains with Async LLM Calls and Batching
- How to Fine-Tune LLMs with LoRA and Unsloth
- How to Fine-Tune LLMs with DPO and RLHF
- How to Build Prompt Fallback Chains with Automatic Model Switching
- How to Fine-Tune LLMs on Custom Datasets with Axolotl