Named entity linking (NEL) bridges the gap between raw text and structured knowledge. When your NER model spots “Apple” in a sentence, is it Apple Inc., the fruit, or Apple Records? Entity linking resolves that ambiguity by mapping each mention to its corresponding Wikipedia (or Wikidata) entry. The pipeline has three stages: extract entities, generate candidates from Wikipedia, and disambiguate using context.
Here is what you need installed:
```shell
pip install spacy sentence-transformers wikipedia-api
python -m spacy download en_core_web_sm
```
A minimal taste of what the final pipeline produces:
```python
from entity_linker import EntityLinker

linker = EntityLinker()
results = linker.link("Tim Cook announced new Apple products in Cupertino.")

for entity in results:
    print(f"{entity['mention']} -> {entity['url']}")

# Tim Cook -> https://en.wikipedia.org/wiki/Tim_Cook
# Apple -> https://en.wikipedia.org/wiki/Apple_Inc.
# Cupertino -> https://en.wikipedia.org/wiki/Cupertino,_California
```
We will build that `EntityLinker` class step by step.
The first stage pulls named entities out of raw text. spaCy handles this well out of the box, and you get entity labels (PERSON, ORG, GPE) that help narrow down candidates later.
```python
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_entities(text: str) -> list[dict]:
    """Extract named entities with their labels and surrounding context."""
    doc = nlp(text)
    entities = []
    for ent in doc.ents:
        # Grab a context window around the entity for disambiguation
        start = max(0, ent.start - 10)
        end = min(len(doc), ent.end + 10)
        context = doc[start:end].text
        entities.append({
            "mention": ent.text,
            "label": ent.label_,
            "start": ent.start_char,
            "end": ent.end_char,
            "context": context,
        })
    return entities

text = "Steve Jobs co-founded Apple in a garage in Los Altos."
entities = extract_entities(text)
for e in entities:
    print(f"{e['mention']} ({e['label']})")

# Steve Jobs (PERSON)
# Apple (ORG)
# Los Altos (GPE)
```
The context window is important. You need the surrounding words to tell “Apple the company” from “apple the fruit” in the disambiguation step. Ten tokens on each side is a reasonable default.
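To see why the window matters, here is a simplified sketch of the same slicing logic using plain whitespace tokens instead of spaCy's tokenizer (the function name and example are illustrative, not part of the pipeline):

```python
def context_window(tokens: list[str], start: int, end: int, size: int = 10) -> str:
    """Return up to `size` tokens on each side of the span tokens[start:end]."""
    lo = max(0, start - size)
    hi = min(len(tokens), end + size)
    return " ".join(tokens[lo:hi])

tokens = "I sliced an apple for the fruit salad".split()
# "apple" is token 3; even a small window pulls in "fruit",
# exactly the signal the disambiguation step needs to rule out Apple Inc.
print(context_window(tokens, 3, 4, size=3))
# I sliced an apple for the fruit
```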
## Filtering Entity Types
Not every entity type is worth linking. DATE, CARDINAL, and MONEY mentions rarely map to useful Wikipedia pages. Filter them early:
```python
LINKABLE_LABELS = {"PERSON", "ORG", "GPE", "LOC", "FAC", "NORP", "EVENT", "WORK_OF_ART"}

def filter_linkable(entities: list[dict]) -> list[dict]:
    return [e for e in entities if e["label"] in LINKABLE_LABELS]
```
## Candidate Generation with Wikipedia API
For each entity mention, you need a shortlist of plausible Wikipedia articles. The Wikipedia search API returns ranked results based on title and content relevance.
```python
import requests
import wikipediaapi

wiki = wikipediaapi.Wikipedia(
    user_agent="EntityLinker/1.0 ([email protected])",
    language="en",
)

def get_candidates(mention: str, top_k: int = 5) -> list[dict]:
    """Fetch candidate Wikipedia pages for a given mention."""
    # wikipediaapi doesn't expose a search endpoint, so query the
    # MediaWiki search API directly
    params = {
        "action": "query",
        "list": "search",
        "srsearch": mention,
        "srlimit": top_k,
        "format": "json",
    }
    resp = requests.get("https://en.wikipedia.org/w/api.php", params=params, timeout=10)
    data = resp.json()
    candidates = []
    for result in data.get("query", {}).get("search", []):
        title = result["title"]
        page = wiki.page(title)
        if page.exists():
            candidates.append({
                "title": title,
                "url": page.fullurl,
                "summary": page.summary[:300],
            })
    return candidates

# Example
candidates = get_candidates("Apple")
for c in candidates:
    print(f"  {c['title']}: {c['url']}")
```
This gives you a mix of candidates: Apple Inc., Apple (fruit), Apple Records, and so on. The summary field is what you pass to the cross-encoder for ranking.
## Speeding Up Candidate Retrieval
The Wikipedia API adds latency per entity. Two tricks help:
- Batch requests with `concurrent.futures.ThreadPoolExecutor` to query multiple entities in parallel.
- Cache results with `functools.lru_cache` or a local SQLite store. Entity mentions repeat across documents far more than you would expect.
```python
from concurrent.futures import ThreadPoolExecutor

def get_candidates_batch(mentions: list[str], top_k: int = 5) -> dict[str, list[dict]]:
    """Fetch candidates for multiple mentions in parallel."""
    with ThreadPoolExecutor(max_workers=8) as executor:
        futures = {
            mention: executor.submit(get_candidates, mention, top_k)
            for mention in mentions
        }
        return {mention: future.result() for mention, future in futures.items()}
```
## Disambiguation with Cross-Encoders
Now you have entities and their candidate pages. The cross-encoder scores each (context, candidate summary) pair and picks the best match. Cross-encoders are slower than bi-encoders but much better at this kind of pairwise comparison task.
```python
from sentence_transformers import CrossEncoder

# This model is trained on MS MARCO passage ranking and works well
# for scoring query/passage relevance
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def disambiguate(entity: dict, candidates: list[dict]) -> dict | None:
    """Score each candidate against the entity context and return the best match."""
    if not candidates:
        return None
    # Build pairs of (entity context, candidate summary)
    pairs = [(entity["context"], c["summary"]) for c in candidates]
    scores = cross_encoder.predict(pairs)
    best_idx = scores.argmax()
    best_score = float(scores[best_idx])
    # Threshold to avoid linking to irrelevant pages
    if best_score < 0.5:
        return None
    winner = candidates[best_idx]
    winner["score"] = best_score
    return winner
```
The threshold of 0.5 is a starting point. Tune it on your data. If you are working with a specific domain (biomedical, legal), you may want a higher threshold since general Wikipedia candidates are more likely to be noise.
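One way to tune it: hand-check a small sample of links, record each best candidate's score and whether it was correct, then sweep thresholds and keep the one that maximizes F1. A stdlib sketch with made-up scores (the function and data are illustrative, not part of the pipeline):

```python
def sweep_threshold(scored: list[tuple[float, bool]], thresholds: list[float]) -> float:
    """Return the threshold with the best F1 on labeled (score, is_correct) pairs.
    Scores below the threshold count as abstentions (no link emitted)."""
    best_t, best_f1 = thresholds[0], -1.0
    positives = sum(1 for _, ok in scored if ok)
    for t in thresholds:
        tp = sum(1 for s, ok in scored if s >= t and ok)
        fp = sum(1 for s, ok in scored if s >= t and not ok)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / positives if positives else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t

# Made-up logit scores from a hand-checked sample
labeled = [(4.1, True), (3.2, True), (0.4, False), (-1.0, False), (1.1, True), (0.9, False)]
print(sweep_threshold(labeled, [-1.0, 0.0, 0.5, 1.0, 2.0]))  # 1.0
```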
## Why Cross-Encoders Over Bi-Encoders
Bi-encoders encode the query and candidates independently, which is fast but misses interaction signals. Cross-encoders process the pair jointly through all transformer layers. For entity linking, you need that joint reasoning because the context phrase “Apple announced a new iPhone” and the candidate summary “Apple Inc. is a technology company” share semantic connections that only show up with cross-attention.
## Building the Full Pipeline
Here is the complete pipeline class that ties everything together:
```python
import spacy
import requests
import wikipediaapi
from sentence_transformers import CrossEncoder
from concurrent.futures import ThreadPoolExecutor

LINKABLE_LABELS = {"PERSON", "ORG", "GPE", "LOC", "FAC", "NORP", "EVENT", "WORK_OF_ART"}

class EntityLinker:
    def __init__(
        self,
        spacy_model: str = "en_core_web_sm",
        cross_encoder_model: str = "cross-encoder/ms-marco-MiniLM-L-6-v2",
        top_k: int = 5,
        score_threshold: float = 0.5,
    ):
        self.nlp = spacy.load(spacy_model)
        self.cross_encoder = CrossEncoder(cross_encoder_model)
        self.wiki = wikipediaapi.Wikipedia(
            user_agent="EntityLinker/1.0 ([email protected])",
            language="en",
        )
        self.top_k = top_k
        self.score_threshold = score_threshold

    def _extract_entities(self, text: str) -> list[dict]:
        doc = self.nlp(text)
        entities = []
        for ent in doc.ents:
            if ent.label_ not in LINKABLE_LABELS:
                continue
            start = max(0, ent.start - 10)
            end = min(len(doc), ent.end + 10)
            context = doc[start:end].text
            entities.append({
                "mention": ent.text,
                "label": ent.label_,
                "start": ent.start_char,
                "end": ent.end_char,
                "context": context,
            })
        return entities

    def _get_candidates(self, mention: str) -> list[dict]:
        params = {
            "action": "query",
            "list": "search",
            "srsearch": mention,
            "srlimit": self.top_k,
            "format": "json",
        }
        resp = requests.get(
            "https://en.wikipedia.org/w/api.php", params=params, timeout=10
        )
        data = resp.json()
        candidates = []
        for result in data.get("query", {}).get("search", []):
            title = result["title"]
            page = self.wiki.page(title)
            if page.exists():
                candidates.append({
                    "title": title,
                    "url": page.fullurl,
                    "summary": page.summary[:300],
                })
        return candidates

    def _disambiguate(self, entity: dict, candidates: list[dict]) -> dict | None:
        if not candidates:
            return None
        pairs = [(entity["context"], c["summary"]) for c in candidates]
        scores = self.cross_encoder.predict(pairs)
        best_idx = scores.argmax()
        best_score = float(scores[best_idx])
        if best_score < self.score_threshold:
            return None
        winner = candidates[best_idx]
        winner["score"] = best_score
        return winner

    def link(self, text: str) -> list[dict]:
        """Run the full entity linking pipeline on input text."""
        entities = self._extract_entities(text)
        # Fetch candidates in parallel
        unique_mentions = list({e["mention"] for e in entities})
        with ThreadPoolExecutor(max_workers=8) as executor:
            futures = {
                m: executor.submit(self._get_candidates, m) for m in unique_mentions
            }
            candidate_map = {m: f.result() for m, f in futures.items()}
        # Disambiguate each entity
        results = []
        for entity in entities:
            candidates = candidate_map.get(entity["mention"], [])
            best = self._disambiguate(entity, candidates)
            if best:
                results.append({
                    "mention": entity["mention"],
                    "label": entity["label"],
                    "start": entity["start"],
                    "end": entity["end"],
                    "title": best["title"],
                    "url": best["url"],
                    "score": best["score"],
                })
        return results

# Usage
if __name__ == "__main__":
    linker = EntityLinker()
    text = "Elon Musk discussed Tesla's new factory in Berlin with Angela Merkel."
    linked = linker.link(text)
    for item in linked:
        print(f"{item['mention']} ({item['label']}) -> {item['title']} (score: {item['score']:.3f})")
        print(f"  {item['url']}")
```
This prints something like:
```text
Elon Musk (PERSON) -> Elon Musk (score: 3.241)
  https://en.wikipedia.org/wiki/Elon_Musk
Tesla (ORG) -> Tesla, Inc. (score: 2.874)
  https://en.wikipedia.org/wiki/Tesla,_Inc.
Berlin (GPE) -> Berlin (score: 4.102)
  https://en.wikipedia.org/wiki/Berlin
Angela Merkel (PERSON) -> Angela Merkel (score: 5.337)
  https://en.wikipedia.org/wiki/Angela_Merkel
```
Note that the `ms-marco-MiniLM-L-6-v2` cross-encoder outputs raw logits, not probabilities between 0 and 1. Higher is better, but the scale is not bounded. Adjust your `score_threshold` accordingly. For this model, values above 0 generally indicate a reasonable match.
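If you prefer a bounded 0-to-1 scale for thresholding, you can squash the logits through a sigmoid yourself (a logit of 0 maps to exactly 0.5, so the intuition "above 0 is a reasonable match" becomes "above 0.5"):

```python
import math

def logit_to_prob(logit: float) -> float:
    """Map a raw cross-encoder logit to a (0, 1) pseudo-probability."""
    return 1.0 / (1.0 + math.exp(-logit))

print(f"{logit_to_prob(3.241):.3f}")  # 0.962
```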
## Common Errors and Fixes

### Wikipedia API rate limiting
```text
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='en.wikipedia.org'): Max retries exceeded
```
The MediaWiki API throttles unauthenticated clients that send too many requests too quickly. Add a retry strategy with exponential backoff:
```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
retries = Retry(total=3, backoff_factor=0.5, status_forcelist=[429, 500, 502, 503])
session.mount("https://", HTTPAdapter(max_retries=retries))
```
Then use `session.get(...)` instead of `requests.get(...)` in the pipeline.
### spaCy model not found
```text
OSError: [E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a Python package or a valid path to a data directory.
```
You need to download the model separately after installing spaCy:
```shell
python -m spacy download en_core_web_sm
```
If you want better NER accuracy at the cost of speed, use `en_core_web_trf` (the transformer-based model) instead.
### Cross-encoder returns unexpected shapes
```text
ValueError: too many dimensions 'str'
```
This happens when you pass strings instead of a list of tuples to cross_encoder.predict(). The input must be a list of (str, str) pairs:
```python
# Wrong
scores = cross_encoder.predict("query", "document")

# Right
scores = cross_encoder.predict([("query", "document")])
```
Also make sure your candidate summaries are not empty strings. Filter the candidates themselves before building the pairs; if you filter only the pairs, the index of the best score no longer lines up with the candidate list:

```python
candidates = [c for c in candidates if c["summary"]]
pairs = [(entity["context"], c["summary"]) for c in candidates]
```