Coreference resolution is the task of figuring out which words in a text refer to the same real-world entity. When a paragraph says “Sarah joined the team. She immediately impressed everyone,” you need to know that “She” means “Sarah.” Without this step, downstream tasks like entity extraction, summarization, and knowledge graph construction all suffer from fragmented entity references.
The fastest way to get coreference resolution running in Python is with coreferee, a spaCy pipeline component. Here is the minimal setup:
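A minimal sketch, assuming coreferee, the en_core_web_lg spaCy model, and coreferee's English model are all installed (install commands shown as comments):

```python
# Setup (run once in a shell):
#   pip install coreferee
#   python3 -m spacy download en_core_web_lg
#   python3 -m coreferee install en
import spacy

nlp = spacy.load("en_core_web_lg")
nlp.add_pipe("coreferee")  # registers coreferee as a pipeline component

doc = nlp("Sarah joined the team. She immediately impressed everyone.")
doc._.coref_chains.print()  # e.g. 0: Sarah(0), She(5)
```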
This prints the detected coreference chains with token indices and the text they refer to. Each chain groups all mentions of the same entity together.
Setting Up the Coreferee Pipeline and Resolving Clusters
Coreferee works as a spaCy pipeline component that combines neural networks with rule-based heuristics. It supports English, French, German, and Polish. The library detects coreference chains and exposes them through spaCy’s extension attributes.
Here is how to inspect and use the chains programmatically:
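A sketch, assuming the pipeline from the setup above (`nlp` with coreferee added); the chain and mention attributes follow coreferee's extension API:

```python
doc = nlp("Sarah joined the team. She immediately impressed everyone.")

for chain in doc._.coref_chains:
    # The most specific mention is usually the full noun phrase
    specific = chain.mentions[chain.most_specific_mention_index]
    mentions = [doc[i].text for m in chain.mentions for i in m.token_indexes]
    print("Chain:", mentions)
    print("Most specific:", [doc[i].text for i in specific.token_indexes])

# resolve() maps any anaphor back to its most specific referent(s)
print(doc._.coref_chains.resolve(doc[5]))  # doc[5] is "She"
```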
The most_specific_mention_index property gives you the index of the most descriptive mention in the chain, which is typically the full noun phrase rather than a pronoun. The resolve() method traverses chains to find the most specific referent for any given token.
Replacing Pronouns with Their Referents
A common use case is rewriting text with pronouns replaced by the entities they refer to. This makes text self-contained for downstream processing:
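One way to sketch this is with coreferee's resolve() plus a small whitespace-aware join helper (the helper names here are ours, not part of the library):

```python
def rebuild(pairs):
    """Join (text, trailing_whitespace) pairs back into a string."""
    return "".join(text + ws for text, ws in pairs)

def resolve_text(doc):
    """Replace each anaphor with its most specific referent(s)."""
    pairs = []
    for tok in doc:
        referents = doc._.coref_chains.resolve(tok)  # None for non-anaphors
        text = " and ".join(r.text for r in referents) if referents else tok.text
        pairs.append((text, tok.whitespace_))
    return rebuild(pairs)

# Usage (assumes the coreferee pipeline from the setup section):
# doc = nlp("Satya Nadella announced the new product. He said it would ship soon.")
# print(resolve_text(doc))
```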
This outputs text where “He” gets replaced with “Satya Nadella” and “it” gets replaced with “the new product,” making every sentence independently understandable.
Transformer-Based Coreference with Maverick
For higher accuracy, especially on complex documents, you can use Maverick, a state-of-the-art transformer-based coreference model from SapienzaNLP. It was published at ACL 2024 and outperforms models with up to 13 billion parameters while using only around 500 million.
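A minimal sketch, assuming the maverick-coref package is installed (pip install maverick-coref); the model weights download from the Hugging Face Hub on first use:

```python
from maverick import Maverick

# The ontonotes variant is the general-purpose model
model = Maverick(hf_name_or_path="sapienzanlp/maverick-mes-ontonotes", device="cpu")

text = "Barack Obama visited Paris. He met the mayor there."
result = model.predict(text)

print(result["clusters_text_mentions"])  # mention strings per cluster
print(result["clusters_token_offsets"])  # start/end token positions per mention
```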
Maverick returns clusters as lists of mention spans. Each cluster groups all text spans that refer to the same entity. The clusters_text_mentions field gives you the actual strings, and clusters_token_offsets gives you the start/end token positions.
Three pretrained models are available depending on your domain:
| Model | Dataset | F1 | Singletons |
|---|---|---|---|
| maverick-mes-ontonotes | OntoNotes | 83.6 | No |
| maverick-mes-litbank | LitBank | 78.0 | Yes |
| maverick-mes-preco | PreCo | 87.4 | Yes |
Use ontonotes for general text, litbank for literary or narrative content, and preco for broad coverage with singleton mentions.
Improving Downstream Tasks with Coref Resolution
Coreference resolution acts as a preprocessing step that makes every other NLP task work better. Here are two concrete patterns.
Better Entity Extraction
Named entity recognition treats “Google,” “the company,” “it,” and “they” as unrelated strings. By resolving coreferences first, you get a complete picture of every mention of each entity:
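A sketch of the linking step on plain data (the function and inputs are illustrative; in practice the chains come from your coreference pass and the entity set from spaCy's NER):

```python
def link_chains_to_entities(chains, ner_entities):
    """Map each coreference chain to a canonical NER entity.

    chains: list of chains; each chain is a list of mention strings,
            most specific first (e.g. ["Google", "the company", "it"]).
    ner_entities: set of entity strings found by NER.
    """
    linked = {}
    for chain in chains:
        # Prefer a mention NER also recognized; fall back to the most specific one
        canonical = next((m for m in chain if m in ner_entities), chain[0])
        linked.setdefault(canonical, []).extend(chain)
    return linked

chains = [["Google", "the company", "it"], ["Sundar Pichai", "he"]]
print(link_chains_to_entities(chains, {"Google", "Sundar Pichai"}))
# {'Google': ['Google', 'the company', 'it'], 'Sundar Pichai': ['Sundar Pichai', 'he']}
```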
Cleaner Summarization Input
Feed resolved text to a summarizer so it does not have to guess what pronouns mean:
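A sketch of the rewrite step; the character spans would come from your coreference pass, and the helper is ours:

```python
def apply_replacements(text, spans):
    """Replace (start, end, replacement) character spans, right to left
    so earlier offsets stay valid."""
    for start, end, repl in sorted(spans, reverse=True):
        text = text[:start] + repl + text[end:]
    return text

text = "Dr. Elena Vasquez published the study. She defended her methods."
resolved = apply_replacements(text, [
    (39, 42, "Dr. Elena Vasquez"),    # "She"
    (52, 55, "Dr. Elena Vasquez's"),  # "her"
])
print(resolved)

# The resolved string can then go to any summarizer, e.g. a transformers
# summarization pipeline: summarizer(resolved)
```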
The resolved text replaces “She,” “her,” and “The researcher” with “Dr. Elena Vasquez,” giving the summarizer explicit entity references instead of ambiguous pronouns.
Handling Edge Cases
Cataphora (Forward References)
Cataphora is when a pronoun appears before its referent: “Before he left, John locked the door.” Coreferee handles this because it analyzes the full document before resolving chains, not just left-to-right context. No special configuration is needed, but be aware that cataphoric references have lower accuracy than standard anaphoric ones.
Nested References
Nested references occur when one coreference chain contains mentions that overlap with another chain. For example: “The CEO of Google said his company would invest more in AI. He confirmed it at the conference.” Here “his company” contains a possessive pronoun that itself refers to the CEO, while “his company” as a whole refers to Google.
Coreferee resolves these by tracking chains independently. Each chain handles one entity. You can traverse multiple chains to fully resolve nested references:
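A sketch on plain data showing how independently tracked chains combine (the token indices, mention strings, and the referent chosen for “it” are illustrative):

```python
tokens = ["The", "CEO", "of", "Google", "said", "his", "company",
          "would", "invest", "more", "in", "AI", ".",
          "He", "confirmed", "it", "at", "the", "conference", "."]

# One map per chain: anaphor token index -> the chain's most specific mention.
ceo_chain = {5: "the CEO of Google's", 13: "The CEO of Google"}  # "his", "He"
investment_chain = {15: "the investment"}  # hypothetical referent for "it"

def resolve_all(tokens, *chains):
    """Merge per-entity chains and substitute every mapped token."""
    merged = {}
    for chain in chains:
        merged.update(chain)
    return [merged.get(i, tok) for i, tok in enumerate(tokens)]

print(" ".join(resolve_all(tokens, ceo_chain, investment_chain)))
# Span-level mentions such as "his company" -> Google belong to a second,
# independent chain and call for span (not single-token) replacement.
```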
Long Documents
Both coreferee and Maverick have practical limits on document length. Coreferee processes the full spaCy Doc object, so it scales with your available memory. Maverick’s transformer backbone has a context window constraint.
For long documents, chunk the text at paragraph or section boundaries, process each chunk independently, and then merge the results. Make sure your chunks overlap by a sentence or two so entities that cross boundaries still get resolved:
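A sketch of the chunking step (the function is ours; the per-chunk coreference pass and the merge are indicated as comments):

```python
def chunk_sentences(sentences, chunk_size=20, overlap=2):
    """Split sentences into chunks that share `overlap` sentences with
    the previous chunk, so cross-boundary entities still resolve."""
    chunks = []
    step = max(1, chunk_size - overlap)
    for start in range(0, len(sentences), step):
        chunks.append(sentences[start:start + chunk_size])
        if start + chunk_size >= len(sentences):
            break
    return chunks

# Process each chunk independently, then merge chains whose mentions
# fall in the shared overlap sentences:
# for chunk in chunk_sentences(sents):
#     doc = nlp(" ".join(chunk))  # run the coreferee pipeline per chunk
#     ...fold doc._.coref_chains into a document-level entity map...
```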
Common Errors and Fixes
ModuleNotFoundError: No module named 'coreferee' after installing. The package was installed into a different Python environment than the one running your script. Install with the same interpreter you run, python3 -m pip install coreferee, and remember to download the language model afterward with python3 -m coreferee install en.
ValueError: [E002] Can't find factory for 'coreferee'. This happens when the spaCy model version and coreferee version are incompatible. Coreferee is tested with spaCy 3.0 through 3.5. Check your spaCy version with python3 -m spacy info and pin a compatible coreferee version.
Poor resolution accuracy with en_core_web_sm. Coreferee relies heavily on the quality of spaCy’s POS tagger, parser, and NER. The small model (sm) lacks the accuracy coreferee needs. Always use en_core_web_lg or en_core_web_trf for production-quality results. The trf model gives the best accuracy but is slower.
RuntimeError: CUDA out of memory when running Maverick. The DeBERTa-large backbone needs around 4 GB of VRAM. If you hit memory limits, set device="cpu" or use a smaller batch size. CPU inference is slower but works on any machine.
Coreferee returns empty chains. This usually means the text is too short or lacks any anaphoric references. Coreferee intentionally avoids false positives, so if it cannot confidently resolve a reference, it skips it. Also check that your text actually contains pronouns or noun phrases that need resolution.
Maverick predict() returns no clusters. Make sure you are using the right model for your data. The ontonotes model does not predict singletons by default. If you need single-mention entities, either pass singletons=True or use the preco or litbank model variants.
Related Guides
- How to Build an Abstractive Summarization Pipeline with PEGASUS
- How to Build a Resume Parser with spaCy and Transformers
- How to Build a Text Summarization Pipeline with Sumy and Transformers
- How to Build a Named Entity Linking Pipeline with Wikipedia and Transformers
- How to Build an Aspect-Based Sentiment Analysis Pipeline
- How to Build a Legal NER Pipeline with Transformers and spaCy
- How to Build a Keyphrase Generation Pipeline with KeyphraseVectorizers
- How to Build a Text Anonymization Pipeline with Presidio and spaCy
- How to Build a RAG Pipeline with Hugging Face Transformers v5
- How to Build a Text Entailment and Contradiction Detection Pipeline