Unstructured text is everywhere – news articles, research papers, internal docs – but you can’t query a paragraph. A knowledge graph turns that text into a structured network of entities and relationships you can traverse, filter, and visualize. The core loop is straightforward: extract entities with spaCy, figure out which entities relate to each other, build edges in NetworkX, and query the result.

Here’s the full pipeline in one shot so you can see where we’re headed.

import spacy
import networkx as nx

nlp = spacy.load("en_core_web_sm")

text = """
Elon Musk founded SpaceX in 2002. SpaceX launched the Falcon 9 rocket from
Cape Canaveral. NASA awarded SpaceX a $2.6 billion contract for the Artemis
program. Jim Bridenstine, the former NASA administrator, praised SpaceX for
reducing launch costs. Boeing also competes for NASA contracts through the
Starliner program. Jeff Bezos founded Blue Origin, which rivals SpaceX in
the commercial space industry.
"""

doc = nlp(text)

G = nx.DiGraph()

for ent in doc.ents:
    G.add_node(ent.text, label=ent.label_)

for sent in doc.sents:
    ents_in_sent = list(sent.ents)
    for i in range(len(ents_in_sent)):
        for j in range(i + 1, len(ents_in_sent)):
            e1 = ents_in_sent[i]
            e2 = ents_in_sent[j]
            G.add_edge(e1.text, e2.text, relation="co_occurs", sentence=sent.text.strip())

print(f"Nodes: {G.number_of_nodes()}, Edges: {G.number_of_edges()}")
for u, v, data in G.edges(data=True):
    print(f"  {u} -> {v}")

That gets you a working knowledge graph in about 30 lines. Now let’s break each step down.

Extracting Entities with spaCy NER

spaCy’s named entity recognition identifies people, organizations, locations, dates, and monetary amounts out of the box. The en_core_web_sm model is fast and good enough for prototyping. Switch to en_core_web_trf (transformer-backed) when accuracy matters more than speed.
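If you want the transformer model when it’s available but a working pipeline either way, a small loader can try it first and fall back. This is a sketch (load_model is a name made up here); it relies only on the documented behavior that spacy.load raises OSError for a model that isn’t installed:

```python
import spacy

def load_model(preferred="en_core_web_trf", fallback="en_core_web_sm"):
    """Load the preferred spaCy model, falling back if it isn't installed."""
    try:
        return spacy.load(preferred)
    except OSError:
        return spacy.load(fallback)

# nlp = load_model()  # then use nlp exactly as in the examples below
```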

pip install spacy networkx matplotlib
python -m spacy download en_core_web_sm

import spacy

nlp = spacy.load("en_core_web_sm")

text = """
Elon Musk founded SpaceX in 2002. SpaceX launched the Falcon 9 rocket from
Cape Canaveral. NASA awarded SpaceX a $2.6 billion contract for the Artemis
program. Jim Bridenstine, the former NASA administrator, praised SpaceX for
reducing launch costs. Boeing also competes for NASA contracts through the
Starliner program. Jeff Bezos founded Blue Origin, which rivals SpaceX in
the commercial space industry.
"""

doc = nlp(text)

entities = [(ent.text, ent.label_) for ent in doc.ents]
for text_span, label in entities:
    print(f"{text_span:25s} {label}")

Output:

Elon Musk                 PERSON
SpaceX                    ORG
2002                      DATE
SpaceX                    ORG
Falcon 9                  PRODUCT
Cape Canaveral            GPE
NASA                      ORG
SpaceX                    ORG
2.6 billion               MONEY
Artemis                   ORG
Jim Bridenstine           PERSON
NASA                      ORG
SpaceX                    ORG
Boeing                    ORG
NASA                      ORG
Starliner                 ORG
Jeff Bezos                PERSON
Blue Origin               ORG
SpaceX                    ORG

Notice that spaCy returns duplicate mentions of the same entity. That’s fine – when we add nodes to NetworkX, duplicates just update the existing node rather than creating a new one.
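You can verify that dedup behavior on a toy graph: NetworkX keys nodes by value, so adding the same text twice merges attributes in place instead of creating a second node.

```python
import networkx as nx

demo = nx.DiGraph()
demo.add_node("SpaceX", label="ORG")
demo.add_node("SpaceX", label="ORG")  # same key: updates in place, no duplicate

print(demo.number_of_nodes())  # -> 1
```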

Extracting Relationships Between Entities

The simplest relationship extraction strategy: if two entities appear in the same sentence, they’re related. This co-occurrence approach won’t give you labeled relationships like “founded” or “awarded,” but it captures structural connections that are surprisingly useful for downstream querying.

For each sentence, we grab all entity pairs and create directed edges. You can enrich the edge data with the source sentence so you can trace back to the original text later.

import spacy
import networkx as nx

nlp = spacy.load("en_core_web_sm")

text = """
Elon Musk founded SpaceX in 2002. SpaceX launched the Falcon 9 rocket from
Cape Canaveral. NASA awarded SpaceX a $2.6 billion contract for the Artemis
program. Jim Bridenstine, the former NASA administrator, praised SpaceX for
reducing launch costs. Boeing also competes for NASA contracts through the
Starliner program. Jeff Bezos founded Blue Origin, which rivals SpaceX in
the commercial space industry.
"""

doc = nlp(text)

G = nx.DiGraph()

# Add entity nodes with their NER labels
for ent in doc.ents:
    G.add_node(ent.text, label=ent.label_)

# Add edges for entities co-occurring in the same sentence
for sent in doc.sents:
    ents_in_sent = list(sent.ents)
    for i in range(len(ents_in_sent)):
        for j in range(i + 1, len(ents_in_sent)):
            e1 = ents_in_sent[i]
            e2 = ents_in_sent[j]
            if G.has_edge(e1.text, e2.text):
                # Append sentence to existing edge
                existing = G[e1.text][e2.text]["sentences"]
                existing.append(sent.text.strip())
            else:
                G.add_edge(
                    e1.text,
                    e2.text,
                    relation="co_occurs",
                    sentences=[sent.text.strip()],
                )

print(f"Graph: {G.number_of_nodes()} nodes, {G.number_of_edges()} edges")

If you want labeled relationships (e.g., “Elon Musk –founded–> SpaceX”), you need dependency parsing to extract the verb connecting two entities. Here’s a basic version:

def extract_relation(sent, ent1, ent2):
    """Find the root verb between two entities in a sentence.

    This basic version ignores the entity positions (ent1 and ent2 are kept
    for interface parity) and just returns the sentence's root verb.
    """
    root = sent.root
    if root.pos_ == "VERB":
        return root.lemma_
    # Fall back to the first verb anywhere in the sentence
    for token in sent:
        if token.pos_ == "VERB":
            return token.lemma_
    return "related_to"

for sent in doc.sents:
    ents_in_sent = list(sent.ents)
    for i in range(len(ents_in_sent)):
        for j in range(i + 1, len(ents_in_sent)):
            rel = extract_relation(sent, ents_in_sent[i], ents_in_sent[j])
            print(f"{ents_in_sent[i].text} --{rel}--> {ents_in_sent[j].text}")

This gives you output like Elon Musk --found--> SpaceX and NASA --award--> SpaceX. It’s not perfect – dependency-based relation extraction is a hard problem – but it gets you surprisingly far for exploratory analysis.

Querying the Knowledge Graph

Once your graph is built, NetworkX gives you a full graph query toolkit. Here are the queries you’ll use most.

# Find all neighbors of an entity
spacex_neighbors = list(G.neighbors("SpaceX"))
print(f"SpaceX connects to: {spacex_neighbors}")

# Find the shortest path between two entities
# (NodeNotFound fires when either endpoint was never extracted)
try:
    path = nx.shortest_path(G, source="Elon Musk", target="NASA")
    print(f"Path from Elon Musk to NASA: {' -> '.join(path)}")
except (nx.NetworkXNoPath, nx.NodeNotFound):
    print("No path found")

# Get all connected components (clusters of related entities)
undirected = G.to_undirected()
components = list(nx.connected_components(undirected))
for i, comp in enumerate(components):
    print(f"Cluster {i}: {comp}")

# Find the most connected entities (by degree)
degree_sorted = sorted(G.degree(), key=lambda x: x[1], reverse=True)
print("\nMost connected entities:")
for entity, degree in degree_sorted[:5]:
    print(f"  {entity}: {degree} connections")

# Get all edges for a specific entity with their data
# (on a DiGraph this yields out-edges only; use G.in_edges for incoming)
for u, v, data in G.edges("SpaceX", data=True):
    print(f"  {u} -> {v}: {data.get('relation', 'unknown')}")

The shortest_path query is especially useful for answering questions like “how is entity A connected to entity B?” in large graphs with hundreds of entities.
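Because each edge stores its source sentences, a path can be turned into a readable chain of evidence. Here’s a minimal sketch on a hand-built toy graph (explain_path and the kg variable are illustrative names, not part of the pipeline above):

```python
import networkx as nx

kg = nx.DiGraph()
kg.add_edge("Elon Musk", "SpaceX", sentences=["Elon Musk founded SpaceX in 2002."])
kg.add_edge("NASA", "SpaceX", sentences=["NASA awarded SpaceX a contract."])

def explain_path(G, source, target):
    """Return (u, v, sentences) for each hop of a shortest undirected path."""
    path = nx.shortest_path(G.to_undirected(as_view=True), source, target)
    hops = []
    for u, v in zip(path, path[1:]):
        # Edge direction may run either way in the original DiGraph
        data = G.get_edge_data(u, v) or G.get_edge_data(v, u) or {}
        hops.append((u, v, data.get("sentences", [])))
    return hops

for u, v, sents in explain_path(kg, "Elon Musk", "NASA"):
    print(f"{u} -> {v}: {sents[0] if sents else '(no sentence)'}")
```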

Visualizing the Graph

Matplotlib plus NetworkX’s drawing utilities produce a quick visualization. For production dashboards you’d use something like pyvis or Gephi, but for exploration this works.

import matplotlib.pyplot as plt
import networkx as nx

# Color nodes by entity type
color_map = {
    "PERSON": "#4ade80",
    "ORG": "#60a5fa",
    "GPE": "#f97316",
    "DATE": "#a78bfa",
    "MONEY": "#facc15",
    "PRODUCT": "#f472b6",
}

node_colors = []
for node in G.nodes():
    label = G.nodes[node].get("label", "")
    node_colors.append(color_map.get(label, "#9ca3af"))

plt.figure(figsize=(14, 10))
pos = nx.spring_layout(G, seed=42, k=2)

nx.draw_networkx_nodes(G, pos, node_color=node_colors, node_size=1500, alpha=0.9)
nx.draw_networkx_labels(G, pos, font_size=8, font_weight="bold")
nx.draw_networkx_edges(G, pos, edge_color="#555555", arrows=True, arrowsize=15, alpha=0.6)

plt.title("Knowledge Graph: Entities and Relationships")
plt.axis("off")
plt.tight_layout()
plt.savefig("knowledge_graph.png", dpi=150, bbox_inches="tight")
plt.show()

Each entity type gets its own color, making it easy to spot clusters of organizations vs. people at a glance. The spring_layout algorithm positions connected nodes closer together, so tightly related entities naturally group.
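When the full graph is too cluttered to read, one option is to draw only the well-connected nodes. This helper (prune_by_degree is a name made up here, and the threshold is arbitrary) keeps nodes above a degree cutoff; pass its result to the same drawing code:

```python
import networkx as nx

def prune_by_degree(G, min_degree=2):
    """Return a copy of the subgraph of nodes with at least min_degree edges."""
    keep = [n for n, d in G.degree() if d >= min_degree]
    return G.subgraph(keep).copy()

# H = prune_by_degree(G, min_degree=3)
# ...then draw H with the spring_layout code above
```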

Common Errors and Fixes

OSError: Can't find model 'en_core_web_sm' – You need to download the model first. Run python -m spacy download en_core_web_sm. If you’re in a Docker container or CI, add it to your build step.

NetworkXError: Node X is not in the graph – This happens when you query an entity that wasn’t extracted by spaCy. Check your entity list with list(G.nodes()) before querying. Entity text is case-sensitive – “NASA” and “Nasa” are different nodes.
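A small defensive lookup sidesteps the case-sensitivity trap (find_node is a hypothetical helper written for this post, not a NetworkX API):

```python
def find_node(G, name):
    """Return the first node matching name case-insensitively, or None."""
    target = name.casefold()
    for node in G.nodes():
        if str(node).casefold() == target:
            return node
    return None
```

Call it before querying, e.g. node = find_node(G, "nasa"), and only run shortest_path or neighbors when it returns a node.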

Duplicate entities with slightly different text – spaCy might extract “SpaceX” and “SpaceX’s” as separate entities. Normalize entity text before adding nodes:

import re

def normalize_entity(text):
    # Remove possessives – match both straight and curly apostrophes,
    # since spaCy preserves whichever form the source text used
    text = re.sub(r"[’']s$", "", text)
    return text.strip()

for ent in doc.ents:
    G.add_node(normalize_entity(ent.text), label=ent.label_)

Graph is too dense to read – With lots of entities, the co-occurrence approach creates many edges. Filter by entity type to keep only the relationships you care about:

# Only connect PERSON and ORG entities
relevant_types = {"PERSON", "ORG"}
for sent in doc.sents:
    ents_in_sent = [e for e in sent.ents if e.label_ in relevant_types]
    # ... same edge-building logic

matplotlib backend errors in headless environments – If you’re running on a server without a display, add this before importing matplotlib:

import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt