Named entity recognition (NER) pulls structured data out of unstructured text – person names, organizations, dates, monetary amounts, locations. It is one of the most practical NLP tasks because nearly every document processing pipeline needs it. You can get a working NER system in under 10 lines of Python with spaCy, scale to custom domains with Hugging Face Transformers, or skip training entirely with GLiNER’s zero-shot approach.
This guide covers all three paths and shows you when to pick each one.
Quick Start: spaCy Pretrained NER
spaCy ships with pretrained models that handle the standard entity types (PERSON, ORG, GPE, DATE, MONEY, etc.) out of the box. For general-purpose entity extraction, start here.
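A minimal setup sketch; the transformer-backed model needs the spacy-transformers extras:

```bash
pip install -U "spacy[transformers]"
python -m spacy download en_core_web_trf
```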
The en_core_web_trf model uses a RoBERTa transformer backbone and is the most accurate English model spaCy offers. If you need speed over accuracy, use en_core_web_sm (small, CPU-friendly) instead.
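A minimal sketch, using spaCy's stock example sentence:

```python
import spacy

# Load the transformer-based English pipeline (downloaded above)
nlp = spacy.load("en_core_web_trf")

doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

# Each entity carries its text, a label, and character offsets
for ent in doc.ents:
    print(ent.text, ent.label_, ent.start_char, ent.end_char)
```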
Output (with en_core_web_trf; exact spans can shift between model versions):
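```text
Apple ORG 0 5
U.K. GPE 27 31
$1 billion MONEY 44 54
```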
That is the entire pipeline. Load a model, pass text, iterate over doc.ents. Each entity has the extracted text, a label, and character offsets you can use to map back to the original document.
When spaCy Pretrained Models Fall Short
The pretrained models only recognize the entity types they were trained on (18 types for OntoNotes). If you need to extract domain-specific entities – drug names, gene symbols, legal citations, product SKUs – the pretrained model will miss them entirely. That is where custom training or zero-shot approaches come in.
Hugging Face Transformers NER Pipeline
Hugging Face gives you access to thousands of fine-tuned NER models through a single API. The token-classification pipeline handles tokenization, inference, and subword aggregation automatically.
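Install the library along with a backend (PyTorch here; TensorFlow also works):

```bash
pip install transformers torch
```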
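A minimal sketch, using the example sentence from the dslim/bert-base-NER model card:

```python
from transformers import pipeline

# "simple" merges subword tokens back into whole-word entities
ner = pipeline(
    "token-classification",
    model="dslim/bert-base-NER",
    aggregation_strategy="simple",
)

results = ner("My name is Wolfgang and I live in Berlin")
for entity in results:
    print(entity["entity_group"], entity["word"], round(entity["score"], 3))
```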
Output (scores are illustrative and vary by transformers version):
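```text
PER Wolfgang 0.999
LOC Berlin 0.999
```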
The aggregation_strategy="simple" parameter is critical. Without it, you get individual subword tokens instead of merged entities. A name like “Hawthorne” might appear as ["Haw", "##thorne"] with separate predictions for each piece.
Picking the Right Model
The model you choose matters more than any hyperparameter. Here are the most reliable options on Hugging Face:
| Model | Entity Types | Best For |
|---|---|---|
| dslim/bert-base-NER | PER, ORG, LOC, MISC | General purpose, fast |
| Jean-Baptiste/camembert-ner | PER, ORG, LOC, MISC | French text |
| d4data/biomedical-ner-all | Disease, Chemical, Gene, Species | Biomedical documents |
| StanfordAIMI/stanford-deidentifier-base | Patient, Doctor, Hospital, Date | Clinical note de-identification |
Browse models tagged with token-classification on the Hugging Face Hub to find domain-specific options.
Common Error: Slow Tokenizer Warning
If you see a deprecation warning along these lines (wording varies by transformers version):
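```text
UserWarning: `grouped_entities` is deprecated and will be removed in version v5.0.0,
defaulted to `aggregation_strategy="simple"` instead.
```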
Replace the deprecated grouped_entities=True parameter with aggregation_strategy="simple". If you also see errors about slow tokenizers not supporting aggregation, make sure you have the fast tokenizer installed:
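```bash
# Fast tokenizers are provided by the Rust-backed tokenizers package
pip install -U tokenizers
```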
Some models only ship with slow tokenizers. In that case, set aggregation_strategy="simple" (not "first" or "average", which require fast tokenizers with word-to-token mappings).
Zero-Shot NER with GLiNER
GLiNER is a game changer for NER. It extracts any entity type you define at inference time – no training data, no fine-tuning, no label mapping. You just pass a list of entity type strings and it finds them.
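Install the library first (it pulls in PyTorch):

```bash
pip install gliner
```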
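A minimal sketch; the checkpoint, text, and labels below are illustrative, and any plain-English label list works:

```python
from gliner import GLiNER

# A mid-sized general-purpose GLiNER checkpoint
model = GLiNER.from_pretrained("urchade/gliner_medium-v2.1")

text = "Dr. Sarah Chen prescribed 400mg of ibuprofen for the patient's migraine."

# Entity types are defined at inference time, in plain English
labels = ["person", "medication", "dosage", "medical condition"]

entities = model.predict_entities(text, labels, threshold=0.5)
for entity in entities:
    print(entity["text"], "=>", entity["label"])
```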
Output (illustrative):
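```text
Dr. Sarah Chen => person
400mg => dosage
ibuprofen => medication
migraine => medical condition
```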
Notice that these labels are not standard NER tags. They are plain English descriptions. GLiNER uses a bidirectional transformer to match text spans against your label descriptions, so you can define any entity type that makes sense for your domain.
Tuning the Threshold
The threshold parameter controls how confident the model needs to be before returning an entity. Lower values (0.3) catch more entities but introduce false positives. Higher values (0.7) are more precise but miss borderline cases. Start at 0.5 and adjust based on your precision/recall requirements.
Training a Custom NER Model with spaCy
When pretrained models do not cover your entity types and you have labeled data, train a custom spaCy NER model. spaCy v3 uses a config-driven training system.
Step 1: Generate a Base Config
Go to spacy.io/usage/training and use the quickstart widget, or generate one from the CLI:
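```bash
python -m spacy init config config.cfg --lang en --pipeline ner
```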
For a transformer-backed NER model (higher accuracy, slower), use:
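```bash
# --gpu switches the generated config to a transformer-based pipeline
python -m spacy init config config.cfg --lang en --pipeline ner --optimize accuracy --gpu
```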
Step 2: Prepare Training Data
spaCy v3 expects .spacy binary files, so you need to convert your annotated data from the common JSON-style format. Below is a sketch with a single hypothetical training example; TRAIN_DATA and its labels stand in for your own annotations.
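```python
import spacy
from spacy.tokens import DocBin

# Hypothetical annotations: (text, {"entities": [(start_char, end_char, label)]})
TRAIN_DATA = [
    ("Take 400mg ibuprofen twice daily",
     {"entities": [(5, 10, "DOSAGE"), (11, 20, "DRUG")]}),
]

nlp = spacy.blank("en")
doc_bin = DocBin()

for text, annotations in TRAIN_DATA:
    doc = nlp(text)
    ents = []
    for start, end, label in annotations["entities"]:
        span = doc.char_span(start, end, label=label)
        if span is None:
            # Offsets that do not line up with token boundaries come back as None
            print(f"Skipping misaligned entity in {text!r}: ({start}, {end}, {label})")
        else:
            ents.append(span)
    doc.ents = ents
    doc_bin.add(doc)

doc_bin.to_disk("train.spacy")
```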
The char_span returning None is the single most common bug in spaCy NER training. It happens when your character offsets do not align with token boundaries. Always check for it.
Step 3: Train
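A sketch of the training command, assuming the config from Step 1 and the .spacy files from Step 2:

```bash
python -m spacy train config.cfg \
  --output ./output \
  --paths.train ./train.spacy \
  --paths.dev ./dev.spacy
```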
If training fails with an error while spaCy reads your training data, your annotation format is likely wrong. Make sure each entity is a tuple of (start_char, end_char, label) and that the offsets are character positions, not token positions.
Step 4: Load and Use
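Training writes model-best and model-last checkpoints to the output directory; load the best one like any other spaCy pipeline:

```python
import spacy

# model-best is the checkpoint with the highest dev score
nlp = spacy.load("./output/model-best")

doc = nlp("Take 400mg ibuprofen twice daily")
for ent in doc.ents:
    print(ent.text, ent.label_)
```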
Choosing the Right Approach
Pick your NER strategy based on what you actually need:
Use spaCy pretrained models when you need standard entity types (person, org, location, date, money) and want a fast, production-ready pipeline with no setup. Best for general document processing.
Use Hugging Face pipelines when you need a domain-specific model that someone has already fine-tuned (biomedical, legal, financial) and want to swap models without changing code.
Use GLiNER when your entity types are custom, you have no labeled data, and you need something working today. It is the fastest path to a domain-specific NER system, though accuracy on niche entities depends on how well your labels describe what you are looking for.
Train a custom spaCy model when you have labeled data, need high accuracy on domain-specific entities, and plan to run the model at scale in production.
Troubleshooting
spaCy model download fails behind a proxy:
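One workaround is to route pip through the proxy and install the model wheel directly; the proxy URL and wheel version below are placeholders (check the explosion/spacy-models releases page for current wheels):

```bash
# Hypothetical proxy address; substitute your own
pip install --proxy http://proxy.example.com:8080 \
  https://github.com/explosion/spacy-models/releases/download/en_core_web_trf-3.7.3/en_core_web_trf-3.7.3-py3-none-any.whl
```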
GPU out of memory with transformer models:
Transformer-based models (both spaCy _trf and Hugging Face) load large weights into GPU memory. If you hit OOM errors, either use a smaller model (en_core_web_sm or dslim/bert-base-NER) or process text in smaller chunks. spaCy’s nlp.pipe() with a batch_size parameter helps:
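```python
import spacy

nlp = spacy.load("en_core_web_trf")

texts = ["First document ...", "Second document ..."]  # your corpus here

# Smaller batches cap peak GPU memory at some cost in throughput
for doc in nlp.pipe(texts, batch_size=8):
    print([(ent.text, ent.label_) for ent in doc.ents])
```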
GLiNER returns no entities:
Lower the threshold parameter. The default of 0.5 is conservative. Try 0.3 for recall-heavy use cases, and check that your label descriptions are descriptive enough – “medication” works better than “med” because the model matches against natural language.
Related Guides
- How to Build a Language Detection and Translation Pipeline
- How to Build a Spell Checking and Autocorrect Pipeline with Python
- How to Build a Sentiment Analysis API with Transformers and FastAPI
- How to Build a Text Classification Pipeline with SetFit
- How to Build a Legal NER Pipeline with Transformers and spaCy
- How to Build a Relation Extraction Pipeline
- How to Build a Text Paraphrase Pipeline with T5 and PEGASUS
- How to Build a Text Readability Scoring Pipeline with Python
- How to Build a Text Style Transfer Pipeline with Transformers
- How to Build a Text-to-SQL Pipeline with LLMs