Named entity linking (NEL) bridges the gap between raw text and structured knowledge. When your NER model spots “Apple” in a sentence, is it Apple Inc., the fruit, or Apple Records? Entity linking resolves that ambiguity by mapping each mention to its corresponding Wikipedia (or Wikidata) entry. The pipeline has three stages: extract entities, generate candidates from Wikipedia, and disambiguate using context.
Here is what you need installed:
```shell
pip install spacy sentence-transformers wikipedia-api
python -m spacy download en_core_web_sm
```
A minimal taste of what the final pipeline produces:
```python
from entity_linker import EntityLinker

linker = EntityLinker()
results = linker.link("Tim Cook announced new Apple products in Cupertino.")

for entity in results:
    print(f"{entity['mention']} -> {entity['url']}")

# Tim Cook -> https://en.wikipedia.org/wiki/Tim_Cook
# Apple -> https://en.wikipedia.org/wiki/Apple_Inc.
# Cupertino -> https://en.wikipedia.org/wiki/Cupertino,_California
```
We will build that `EntityLinker` class step by step.
The first stage pulls named entities out of raw text. spaCy handles this well out of the box, and you get entity labels (PERSON, ORG, GPE) that help narrow down candidates later.
```python
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_entities(text: str) -> list[dict]:
    """Extract named entities with their labels and surrounding context."""
    doc = nlp(text)
    entities = []
    for ent in doc.ents:
        # Grab a context window around the entity for disambiguation
        start = max(0, ent.start - 10)
        end = min(len(doc), ent.end + 10)
        context = doc[start:end].text
        entities.append({
            "mention": ent.text,
            "label": ent.label_,
            "start": ent.start_char,
            "end": ent.end_char,
            "context": context,
        })
    return entities

text = "Steve Jobs co-founded Apple in a garage in Los Altos."
entities = extract_entities(text)
for e in entities:
    print(f"{e['mention']} ({e['label']})")

# Steve Jobs (PERSON)
# Apple (ORG)
# Los Altos (GPE)
```
The context window is important. You need the surrounding words to tell “Apple the company” from “apple the fruit” in the disambiguation step. Ten tokens on each side is a reasonable default.
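To see why the window matters, here is a simplified sketch of the same slicing logic using plain whitespace tokens instead of spaCy's tokenizer (the function name and example are illustrative, not part of the pipeline):

```python
def context_window(tokens: list[str], start: int, end: int, size: int = 10) -> str:
    """Return up to `size` tokens on each side of the span tokens[start:end]."""
    lo = max(0, start - size)
    hi = min(len(tokens), end + size)
    return " ".join(tokens[lo:hi])

tokens = "I sliced an apple for the fruit salad".split()
# "apple" is token 3; even a small window pulls in "fruit",
# exactly the signal the disambiguation step needs to rule out Apple Inc.
print(context_window(tokens, 3, 4, size=3))
# I sliced an apple for the fruit
```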
## Filtering Entity Types
Not every entity type is worth linking. DATE, CARDINAL, and MONEY mentions rarely map to useful Wikipedia pages. Filter them early:
```python
LINKABLE_LABELS = {"PERSON", "ORG", "GPE", "LOC", "FAC", "NORP", "EVENT", "WORK_OF_ART"}

def filter_linkable(entities: list[dict]) -> list[dict]:
    return [e for e in entities if e["label"] in LINKABLE_LABELS]
```
## Candidate Generation with Wikipedia API
For each entity mention, you need a shortlist of plausible Wikipedia articles. The Wikipedia search API returns ranked results based on title and content relevance.
```python
import requests
import wikipediaapi

wiki = wikipediaapi.Wikipedia(
    user_agent="EntityLinker/1.0 ([email protected])",
    language="en",
)

def get_candidates(mention: str, top_k: int = 5) -> list[dict]:
    """Fetch candidate Wikipedia pages for a given mention."""
    # wikipediaapi doesn't expose a search endpoint, so query the
    # MediaWiki search API directly
    params = {
        "action": "query",
        "list": "search",
        "srsearch": mention,
        "srlimit": top_k,
        "format": "json",
    }
    resp = requests.get("https://en.wikipedia.org/w/api.php", params=params, timeout=10)
    data = resp.json()
    candidates = []
    for result in data.get("query", {}).get("search", []):
        title = result["title"]
        page = wiki.page(title)
        if page.exists():
            candidates.append({
                "title": title,
                "url": page.fullurl,
                "summary": page.summary[:300],
            })
    return candidates

# Example
candidates = get_candidates("Apple")
for c in candidates:
    print(f"  {c['title']}: {c['url']}")
```
This gives you a mix of candidates: Apple Inc., Apple (fruit), Apple Records, and so on. The summary field is what you pass to the cross-encoder for ranking.
## Speeding Up Candidate Retrieval
The Wikipedia API adds latency per entity. Two tricks help:
- Batch requests with `concurrent.futures.ThreadPoolExecutor` to query multiple entities in parallel.
- Cache results with `functools.lru_cache` or a local SQLite store. Entity mentions repeat across documents far more than you would expect.
```python
from concurrent.futures import ThreadPoolExecutor

def get_candidates_batch(mentions: list[str], top_k: int = 5) -> dict[str, list[dict]]:
    """Fetch candidates for multiple mentions in parallel."""
    with ThreadPoolExecutor(max_workers=8) as executor:
        futures = {
            mention: executor.submit(get_candidates, mention, top_k)
            for mention in mentions
        }
        return {mention: future.result() for mention, future in futures.items()}
```
## Disambiguation with Cross-Encoders
Now you have entities and their candidate pages. The cross-encoder scores each (context, candidate summary) pair and picks the best match. Cross-encoders are slower than bi-encoders but much better at this kind of pairwise comparison task.
```python
from sentence_transformers import CrossEncoder

# This model is trained on MS MARCO passage ranking and works well
# for scoring query/passage relevance
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def disambiguate(entity: dict, candidates: list[dict]) -> dict | None:
    """Score each candidate against the entity context and return the best match."""
    if not candidates:
        return None
    # Build pairs of (entity context, candidate summary)
    pairs = [(entity["context"], c["summary"]) for c in candidates]
    scores = cross_encoder.predict(pairs)
    best_idx = scores.argmax()
    best_score = float(scores[best_idx])
    # Threshold to avoid linking to irrelevant pages
    if best_score < 0.5:
        return None
    winner = candidates[best_idx]
    winner["score"] = best_score
    return winner
```
The threshold of 0.5 is a starting point. Tune it on your data. If you are working with a specific domain (biomedical, legal), you may want a higher threshold since general Wikipedia candidates are more likely to be noise.
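One way to tune it: hand-check a small sample of links, record each best candidate's score and whether it was correct, then sweep thresholds and keep the one that maximizes F1. A stdlib sketch with made-up scores (the function and data are illustrative, not part of the pipeline):

```python
def sweep_threshold(scored: list[tuple[float, bool]], thresholds: list[float]) -> float:
    """Return the threshold with the best F1 on labeled (score, is_correct) pairs.
    Scores below the threshold count as abstentions (no link emitted)."""
    best_t, best_f1 = thresholds[0], -1.0
    positives = sum(1 for _, ok in scored if ok)
    for t in thresholds:
        tp = sum(1 for s, ok in scored if s >= t and ok)
        fp = sum(1 for s, ok in scored if s >= t and not ok)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / positives if positives else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t

# Made-up logit scores from a hand-checked sample
labeled = [(4.1, True), (3.2, True), (0.4, False), (-1.0, False), (1.1, True), (0.9, False)]
print(sweep_threshold(labeled, [-1.0, 0.0, 0.5, 1.0, 2.0]))  # 1.0
```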
## Why Cross-Encoders Over Bi-Encoders
Bi-encoders encode the query and candidates independently, which is fast but misses interaction signals. Cross-encoders process the pair jointly through all transformer layers. For entity linking, you need that joint reasoning because the context phrase “Apple announced a new iPhone” and the candidate summary “Apple Inc. is a technology company” share semantic connections that only show up with cross-attention.
## Building the Full Pipeline
Here is the complete pipeline class that ties everything together:
```python
import spacy
import requests
import wikipediaapi
from sentence_transformers import CrossEncoder
from concurrent.futures import ThreadPoolExecutor

LINKABLE_LABELS = {"PERSON", "ORG", "GPE", "LOC", "FAC", "NORP", "EVENT", "WORK_OF_ART"}

class EntityLinker:
    def __init__(
        self,
        spacy_model: str = "en_core_web_sm",
        cross_encoder_model: str = "cross-encoder/ms-marco-MiniLM-L-6-v2",
        top_k: int = 5,
        score_threshold: float = 0.5,
    ):
        self.nlp = spacy.load(spacy_model)
        self.cross_encoder = CrossEncoder(cross_encoder_model)
        self.wiki = wikipediaapi.Wikipedia(
            user_agent="EntityLinker/1.0 ([email protected])",
            language="en",
        )
        self.top_k = top_k
        self.score_threshold = score_threshold

    def _extract_entities(self, text: str) -> list[dict]:
        doc = self.nlp(text)
        entities = []
        for ent in doc.ents:
            if ent.label_ not in LINKABLE_LABELS:
                continue
            start = max(0, ent.start - 10)
            end = min(len(doc), ent.end + 10)
            context = doc[start:end].text
            entities.append({
                "mention": ent.text,
                "label": ent.label_,
                "start": ent.start_char,
                "end": ent.end_char,
                "context": context,
            })
        return entities

    def _get_candidates(self, mention: str) -> list[dict]:
        params = {
            "action": "query",
            "list": "search",
            "srsearch": mention,
            "srlimit": self.top_k,
            "format": "json",
        }
        resp = requests.get(
            "https://en.wikipedia.org/w/api.php", params=params, timeout=10
        )
        data = resp.json()
        candidates = []
        for result in data.get("query", {}).get("search", []):
            title = result["title"]
            page = self.wiki.page(title)
            if page.exists():
                candidates.append({
                    "title": title,
                    "url": page.fullurl,
                    "summary": page.summary[:300],
                })
        return candidates

    def _disambiguate(self, entity: dict, candidates: list[dict]) -> dict | None:
        if not candidates:
            return None
        pairs = [(entity["context"], c["summary"]) for c in candidates]
        scores = self.cross_encoder.predict(pairs)
        best_idx = scores.argmax()
        best_score = float(scores[best_idx])
        if best_score < self.score_threshold:
            return None
        winner = candidates[best_idx]
        winner["score"] = best_score
        return winner

    def link(self, text: str) -> list[dict]:
        """Run the full entity linking pipeline on input text."""
        entities = self._extract_entities(text)
        # Fetch candidates in parallel
        unique_mentions = list({e["mention"] for e in entities})
        with ThreadPoolExecutor(max_workers=8) as executor:
            futures = {
                m: executor.submit(self._get_candidates, m) for m in unique_mentions
            }
            candidate_map = {m: f.result() for m, f in futures.items()}
        # Disambiguate each entity
        results = []
        for entity in entities:
            candidates = candidate_map.get(entity["mention"], [])
            best = self._disambiguate(entity, candidates)
            if best:
                results.append({
                    "mention": entity["mention"],
                    "label": entity["label"],
                    "start": entity["start"],
                    "end": entity["end"],
                    "title": best["title"],
                    "url": best["url"],
                    "score": best["score"],
                })
        return results

# Usage
if __name__ == "__main__":
    linker = EntityLinker()
    text = "Elon Musk discussed Tesla's new factory in Berlin with Angela Merkel."
    linked = linker.link(text)
    for item in linked:
        print(f"{item['mention']} ({item['label']}) -> {item['title']} (score: {item['score']:.3f})")
        print(f"  {item['url']}")
```
This prints something like:
```text
Elon Musk (PERSON) -> Elon Musk (score: 3.241)
  https://en.wikipedia.org/wiki/Elon_Musk
Tesla (ORG) -> Tesla, Inc. (score: 2.874)
  https://en.wikipedia.org/wiki/Tesla,_Inc.
Berlin (GPE) -> Berlin (score: 4.102)
  https://en.wikipedia.org/wiki/Berlin
Angela Merkel (PERSON) -> Angela Merkel (score: 5.337)
  https://en.wikipedia.org/wiki/Angela_Merkel
```
Note that the `ms-marco-MiniLM-L-6-v2` cross-encoder outputs raw logits, not probabilities between 0 and 1. Higher is better, but the scale is not bounded. Adjust your `score_threshold` accordingly. For this model, values above 0 generally indicate a reasonable match.
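If you prefer a bounded 0-to-1 scale for thresholding, you can squash the logits through a sigmoid yourself (a logit of 0 maps to exactly 0.5, so the intuition "above 0 is a reasonable match" becomes "above 0.5"):

```python
import math

def logit_to_prob(logit: float) -> float:
    """Map a raw cross-encoder logit to a (0, 1) pseudo-probability."""
    return 1.0 / (1.0 + math.exp(-logit))

print(f"{logit_to_prob(3.241):.3f}")  # 0.962
```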
## Common Errors and Fixes

### Wikipedia API rate limiting
```text
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='en.wikipedia.org'): Max retries exceeded
```
The MediaWiki API throttles unauthenticated clients that send too many requests too quickly. Add a retry strategy with exponential backoff:
```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
retries = Retry(total=3, backoff_factor=0.5, status_forcelist=[429, 500, 502, 503])
session.mount("https://", HTTPAdapter(max_retries=retries))
```
Then use `session.get(...)` instead of `requests.get(...)` in the pipeline.
### spaCy model not found
```text
OSError: [E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a Python package or a valid path to a data directory.
```
You need to download the model separately after installing spaCy:
```shell
python -m spacy download en_core_web_sm
```
If you want better NER accuracy at the cost of speed, use `en_core_web_trf` (the transformer-based model) instead.
### Cross-encoder returns unexpected shapes
```text
ValueError: too many dimensions 'str'
```
This happens when you pass strings instead of a list of tuples to cross_encoder.predict(). The input must be a list of (str, str) pairs:
```python
# Wrong
scores = cross_encoder.predict("query", "document")

# Right
scores = cross_encoder.predict([("query", "document")])
```
Also make sure your candidate summaries are not empty strings. Filter the candidates themselves before building the pairs; if you filter only the pairs, the index of the best score no longer lines up with the candidate list:

```python
candidates = [c for c in candidates if c["summary"]]
pairs = [(entity["context"], c["summary"]) for c in candidates]
```