Named entity recognition (NER) pulls structured data out of unstructured text – person names, organizations, dates, monetary amounts, locations. It is one of the most practical NLP tasks because nearly every document processing pipeline needs it. You can get a working NER system in under 10 lines of Python with spaCy, scale to custom domains with Hugging Face Transformers, or skip training entirely with GLiNER’s zero-shot approach.
This guide covers all three paths and shows you when to pick each one.
Quick Start: spaCy Pretrained NER
spaCy ships with pretrained models that handle the standard entity types (PERSON, ORG, GPE, DATE, MONEY, etc.) out of the box. For general-purpose entity extraction, start here.
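A minimal setup sketch; the transformer-backed model needs the spacy-transformers extras:

```bash
pip install -U "spacy[transformers]"
python -m spacy download en_core_web_trf
```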
The en_core_web_trf model uses a RoBERTa transformer backbone and is the most accurate English model spaCy offers. If you need speed over accuracy, use en_core_web_sm (small, CPU-friendly) instead.
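A minimal sketch, using spaCy's stock example sentence:

```python
import spacy

# Load the transformer-based English pipeline (downloaded above)
nlp = spacy.load("en_core_web_trf")

doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

# Each entity carries its text, a label, and character offsets
for ent in doc.ents:
    print(ent.text, ent.label_, ent.start_char, ent.end_char)
```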
Output (with en_core_web_trf; exact spans can shift between model versions):
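```text
Apple ORG 0 5
U.K. GPE 27 31
$1 billion MONEY 44 54
```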
That is the entire pipeline. Load a model, pass text, iterate over doc.ents. Each entity has the extracted text, a label, and character offsets you can use to map back to the original document.
When spaCy Pretrained Models Fall Short
The pretrained models only recognize the entity types they were trained on (18 types for OntoNotes). If you need to extract domain-specific entities – drug names, gene symbols, legal citations, product SKUs – the pretrained model will miss them entirely. That is where custom training or zero-shot approaches come in.
Hugging Face Transformers NER Pipeline
Hugging Face gives you access to thousands of fine-tuned NER models through a single API. The token-classification pipeline handles tokenization, inference, and subword aggregation automatically.
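Install the library along with a backend (PyTorch here; TensorFlow also works):

```bash
pip install transformers torch
```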
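A minimal sketch, using the example sentence from the dslim/bert-base-NER model card:

```python
from transformers import pipeline

# "simple" merges subword tokens back into whole-word entities
ner = pipeline(
    "token-classification",
    model="dslim/bert-base-NER",
    aggregation_strategy="simple",
)

results = ner("My name is Wolfgang and I live in Berlin")
for entity in results:
    print(entity["entity_group"], entity["word"], round(entity["score"], 3))
```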
Output (scores are illustrative and vary by transformers version):
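```text
PER Wolfgang 0.999
LOC Berlin 0.999
```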
The aggregation_strategy="simple" parameter is critical. Without it, you get individual subword tokens instead of merged entities. A name like “Hawthorne” might appear as ["Haw", "##thorne"] with separate predictions for each piece.
Picking the Right Model
The model you choose matters more than any hyperparameter. Here are the most reliable options on Hugging Face:
| Model | Entity Types | Best For |
|---|---|---|
| dslim/bert-base-NER | PER, ORG, LOC, MISC | General purpose, fast |
| Jean-Baptiste/camembert-ner | PER, ORG, LOC, MISC | French text |
| d4data/biomedical-ner-all | Disease, Chemical, Gene, Species | Biomedical documents |
| StanfordAIMI/stanford-deidentifier-base | Patient, Doctor, Hospital, Date | Clinical note de-identification |
Browse models tagged with token-classification on the Hugging Face Hub to find domain-specific options.
Common Error: Slow Tokenizer Warning
If you see a deprecation warning along these lines (wording varies by transformers version):
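```text
UserWarning: `grouped_entities` is deprecated and will be removed in version v5.0.0,
defaulted to `aggregation_strategy="simple"` instead.
```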
Replace the deprecated grouped_entities=True parameter with aggregation_strategy="simple". If you also see errors about slow tokenizers not supporting aggregation, make sure you have the fast tokenizer installed:
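```bash
# Fast tokenizers are provided by the Rust-backed tokenizers package
pip install -U tokenizers
```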
Some models only ship with slow tokenizers. In that case, set aggregation_strategy="simple" (not "first" or "average", which require fast tokenizers with word-to-token mappings).
Zero-Shot NER with GLiNER
GLiNER is a game changer for NER. It extracts any entity type you define at inference time – no training data, no fine-tuning, no label mapping. You just pass a list of entity type strings and it finds them.
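Install the library first (it pulls in PyTorch):

```bash
pip install gliner
```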
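A minimal sketch; the checkpoint, text, and labels below are illustrative, and any plain-English label list works:

```python
from gliner import GLiNER

# A mid-sized general-purpose GLiNER checkpoint
model = GLiNER.from_pretrained("urchade/gliner_medium-v2.1")

text = "Dr. Sarah Chen prescribed 400mg of ibuprofen for the patient's migraine."

# Entity types are defined at inference time, in plain English
labels = ["person", "medication", "dosage", "medical condition"]

entities = model.predict_entities(text, labels, threshold=0.5)
for entity in entities:
    print(entity["text"], "=>", entity["label"])
```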
Output (illustrative):
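```text
Dr. Sarah Chen => person
400mg => dosage
ibuprofen => medication
migraine => medical condition
```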
Notice that these labels are not standard NER tags. They are plain English descriptions. GLiNER uses a bidirectional transformer to match text spans against your label descriptions, so you can define any entity type that makes sense for your domain.
Tuning the Threshold
The threshold parameter controls how confident the model needs to be before returning an entity. Lower values (0.3) catch more entities but introduce false positives. Higher values (0.7) are more precise but miss borderline cases. Start at 0.5 and adjust based on your precision/recall requirements.
Training a Custom NER Model with spaCy
When pretrained models do not cover your entity types and you have labeled data, train a custom spaCy NER model. spaCy v3 uses a config-driven training system.
Step 1: Generate a Base Config
Go to spacy.io/usage/training and use the quickstart widget, or generate one from the CLI:
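```bash
python -m spacy init config config.cfg --lang en --pipeline ner
```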
For a transformer-backed NER model (higher accuracy, slower), use:
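```bash
# --gpu switches the generated config to a transformer-based pipeline
python -m spacy init config config.cfg --lang en --pipeline ner --optimize accuracy --gpu
```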
Step 2: Prepare Training Data
spaCy v3 expects .spacy binary files, so you need to convert your annotated data from the common JSON-style format. Below is a sketch with a single hypothetical training example; TRAIN_DATA and its labels stand in for your own annotations.
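```python
import spacy
from spacy.tokens import DocBin

# Hypothetical annotations: (text, {"entities": [(start_char, end_char, label)]})
TRAIN_DATA = [
    ("Take 400mg ibuprofen twice daily",
     {"entities": [(5, 10, "DOSAGE"), (11, 20, "DRUG")]}),
]

nlp = spacy.blank("en")
doc_bin = DocBin()

for text, annotations in TRAIN_DATA:
    doc = nlp(text)
    ents = []
    for start, end, label in annotations["entities"]:
        span = doc.char_span(start, end, label=label)
        if span is None:
            # Offsets that do not line up with token boundaries come back as None
            print(f"Skipping misaligned entity in {text!r}: ({start}, {end}, {label})")
        else:
            ents.append(span)
    doc.ents = ents
    doc_bin.add(doc)

doc_bin.to_disk("train.spacy")
```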
The char_span returning None is the single most common bug in spaCy NER training. It happens when your character offsets do not align with token boundaries. Always check for it.
Step 3: Train
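A sketch of the training command, assuming the config from Step 1 and the .spacy files from Step 2:

```bash
python -m spacy train config.cfg \
  --output ./output \
  --paths.train ./train.spacy \
  --paths.dev ./dev.spacy
```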
If training fails with an error while spaCy reads your training data, your annotation format is likely wrong. Make sure each entity is a tuple of (start_char, end_char, label) and that the offsets are character positions, not token positions.
Step 4: Load and Use
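Training writes model-best and model-last checkpoints to the output directory; load the best one like any other spaCy pipeline:

```python
import spacy

# model-best is the checkpoint with the highest dev score
nlp = spacy.load("./output/model-best")

doc = nlp("Take 400mg ibuprofen twice daily")
for ent in doc.ents:
    print(ent.text, ent.label_)
```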
Choosing the Right Approach
Pick your NER strategy based on what you actually need:
Use spaCy pretrained models when you need standard entity types (person, org, location, date, money) and want a fast, production-ready pipeline with no setup. Best for general document processing.
Use Hugging Face pipelines when you need a domain-specific model that someone has already fine-tuned (biomedical, legal, financial) and want to swap models without changing code.
Use GLiNER when your entity types are custom, you have no labeled data, and you need something working today. It is the fastest path to a domain-specific NER system, though accuracy on niche entities depends on how well your labels describe what you are looking for.
Train a custom spaCy model when you have labeled data, need high accuracy on domain-specific entities, and plan to run the model at scale in production.
Troubleshooting
spaCy model download fails behind a proxy:
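One workaround is to route pip through the proxy and install the model wheel directly; the proxy URL and wheel version below are placeholders (check the explosion/spacy-models releases page for current wheels):

```bash
# Hypothetical proxy address; substitute your own
pip install --proxy http://proxy.example.com:8080 \
  https://github.com/explosion/spacy-models/releases/download/en_core_web_trf-3.7.3/en_core_web_trf-3.7.3-py3-none-any.whl
```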
GPU out of memory with transformer models:
Transformer-based models (both spaCy _trf and Hugging Face) load large weights into GPU memory. If you hit OOM errors, either use a smaller model (en_core_web_sm or dslim/bert-base-NER) or process text in smaller chunks. spaCy’s nlp.pipe() with a batch_size parameter helps:
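```python
import spacy

nlp = spacy.load("en_core_web_trf")

texts = ["First document ...", "Second document ..."]  # your corpus here

# Smaller batches cap peak GPU memory at some cost in throughput
for doc in nlp.pipe(texts, batch_size=8):
    print([(ent.text, ent.label_) for ent in doc.ents])
```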
GLiNER returns no entities:
Lower the threshold parameter. The default of 0.5 is conservative. Try 0.3 for recall-heavy use cases, and check that your label descriptions are descriptive enough – “medication” works better than “med” because the model matches against natural language.
Related Guides
- How to Build a Language Detection and Translation Pipeline
- How to Build a Spell Checking and Autocorrect Pipeline with Python
- How to Build a Sentiment Analysis API with Transformers and FastAPI
- How to Build a Text Classification Pipeline with SetFit
- How to Build a Legal NER Pipeline with Transformers and spaCy
- How to Build a Relation Extraction Pipeline
- How to Build a Text Paraphrase Pipeline with T5 and PEGASUS
- How to Build a Text Readability Scoring Pipeline with Python
- How to Build a Text Style Transfer Pipeline with Transformers
- How to Build a Text-to-SQL Pipeline with LLMs