Legal documents are dense with entities that generic NER models miss entirely. Case citations, statute references, provision numbers, party names – a standard dslim/bert-base-NER model trained on CoNLL-2003 will label “Respondent No. 3” as a person and fail to recognize “Section 319 Cr.P.C.” as a statute reference at all.

You need domain-specific NER. Here’s the fastest way to get legal entity extraction running with Hugging Face Transformers and spaCy.

The simplest approach uses the Hugging Face pipeline directly. For general NER on legal text, dslim/bert-base-NER gives you a baseline, but it won’t catch legal-specific entities. Still, it’s a good sanity check before adding complexity.

from transformers import pipeline

ner_pipeline = pipeline(
    "token-classification",
    model="dslim/bert-base-NER",
    aggregation_strategy="simple",
)

legal_text = """
The Supreme Court held in Brown v. Board of Education that
segregation in public schools violated the Equal Protection Clause
of the Fourteenth Amendment. Justice Warren delivered the opinion
on May 17, 1954 in Washington, D.C.
"""

entities = ner_pipeline(legal_text)
for ent in entities:
    print(f"{ent['word']:30s} {ent['entity_group']:10s} {ent['score']:.3f}")
Supreme Court                  ORG        0.998
Brown                          PER        0.943
Board of Education             ORG        0.976
Equal Protection Clause        MISC       0.712
Fourteenth Amendment           MISC       0.634
Justice Warren                 PER        0.987
May 17, 1954                   MISC       0.521
Washington, D. C.              LOC        0.993

Notice the problems. “Brown v. Board of Education” is a single case citation, not a person (“Brown”) plus an organization (“Board of Education”). “Fourteenth Amendment” and “Equal Protection Clause” are constitutional provisions, not miscellaneous entities, and their low confidence scores (0.634 and 0.712) show the model is unsure what to do with them. Generic NER doesn’t understand legal semantics.

For proper legal entity recognition, you want a model trained on legal text. The OpenNYAI project’s en_legal_ner_trf is one of the best available options – it’s a spaCy transformer pipeline trained specifically on Indian court judgments. It recognizes 14 legal entity types including STATUTE, PROVISION, PRECEDENT, JUDGE, PETITIONER, RESPONDENT, CASE_NUMBER, and more.

pip install "spacy>=3.2.2" spacy-huggingface-pipelines
pip install https://huggingface.co/opennyaiorg/en_legal_ner_trf/resolve/main/en_legal_ner_trf-any-py3-none-any.whl
import spacy

nlp = spacy.load("en_legal_ner_trf")

text = """
Section 319 Cr.P.C. contemplates a situation where the evidence
adduced by the prosecution for Respondent No. 3 - G. Sambiah
was presented before Justice K. Ramaswamy on 20th June 1984
in the matter of State of Karnataka v. Union of India.
"""

doc = nlp(text)

for ent in doc.ents:
    print(f"{ent.text:45s} {ent.label_:15s}")
Section 319                                   PROVISION
Cr.P.C.                                       STATUTE
G. Sambiah                                    RESPONDENT
Justice K. Ramaswamy                          JUDGE
20th June 1984                                DATE
State of Karnataka v. Union of India          PRECEDENT

That’s dramatically better. The model distinguishes between statutes and provisions, knows that “State of Karnataka v. Union of India” is a case precedent rather than two organizations, and correctly identifies the respondent.

Integrating Hugging Face Models with spaCy as a Custom Component

If you’re working with a Hugging Face token classification model rather than a pre-packaged spaCy model, the spacy-huggingface-pipelines library bridges the gap. This is especially useful when you’ve fine-tuned your own legal NER model on Hugging Face.

import spacy

nlp = spacy.blank("en")

# Add a Hugging Face token classifier as a spaCy component
nlp.add_pipe(
    "hf_token_pipe",
    config={
        "model": "dslim/bert-base-NER",
        "aggregation_strategy": "simple",
        "alignment_mode": "expand",  # expand to spaCy token boundaries
    },
)

doc = nlp("Justice Ruth Bader Ginsburg wrote the dissent in Ledbetter v. Goodyear.")
for ent in doc.ents:
    print(f"{ent.text:30s} {ent.label_}")

For a fully custom pipeline that wraps any Hugging Face model, you can register your own spaCy component. This gives you complete control over entity mapping and post-processing.

import spacy
from spacy.language import Language
from spacy.tokens import Doc, Span
from transformers import pipeline as hf_pipeline


@Language.factory("legal_ner_transformer")
class LegalNERTransformer:
    def __init__(self, nlp: Language, name: str, model_name: str):
        self.ner_pipeline = hf_pipeline(
            "token-classification",
            model=model_name,
            aggregation_strategy="simple",
        )
        # Map generic labels to legal-specific ones if needed
        self.label_map = {
            "PER": "PERSON",
            "ORG": "ORGANIZATION",
            "LOC": "JURISDICTION",
            "MISC": "LEGAL_REFERENCE",
        }

    def __call__(self, doc: Doc) -> Doc:
        text = doc.text
        predictions = self.ner_pipeline(text)

        spans = []
        for pred in predictions:
            start_char = pred["start"]
            end_char = pred["end"]
            label = self.label_map.get(pred["entity_group"], pred["entity_group"])

            span = doc.char_span(start_char, end_char, label=label, alignment_mode="expand")
            if span is not None:
                spans.append(span)

        # Filter overlapping spans, keeping the longest
        doc.ents = spacy.util.filter_spans(spans)
        return doc


# Usage
nlp = spacy.blank("en")
nlp.add_pipe(
    "legal_ner_transformer",
    config={"model_name": "dslim/bert-base-NER"},
)

doc = nlp("The Ninth Circuit ruled in Chevron U.S.A. v. NRDC on June 25, 1984.")
for ent in doc.ents:
    print(f"{ent.text:30s} {ent.label_}")
Ninth Circuit                  ORGANIZATION
Chevron U.S.A.                 ORGANIZATION
NRDC                           ORGANIZATION
June 25, 1984                  LEGAL_REFERENCE

The label_map dictionary is where you remap generic NER labels to legal-specific ones. It’s a blunt instrument: mapping every MISC to LEGAL_REFERENCE is exactly what mislabels the date “June 25, 1984” above. In production, you’d expand this with regex-based post-processing to catch citation patterns the model misses.
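
As a concrete sketch of that post-processing layer, a regex can catch Indian statute references like “Section 319 Cr.P.C.” that generic models ignore. The pattern and helper below are illustrative, not from any library, and cover only two statute abbreviations:

```python
import re

# Illustrative pattern: "Section 319 Cr.P.C.", "Section 482 of Cr.P.C.", etc.
STATUTE_REF = re.compile(
    r"[Ss]ection\s+(?P<section>\d+[A-Za-z]?)\s+"
    r"(?:of\s+(?:the\s+)?)?(?P<statute>Cr\.P\.C\.|I\.P\.C\.)"
)

def find_statute_refs(text: str) -> list[dict]:
    """Return section/statute pairs the NER model may have missed."""
    return [
        {"section": m.group("section"), "statute": m.group("statute"), "span": m.span()}
        for m in STATUTE_REF.finditer(text)
    ]

refs = find_statute_refs("Section 319 Cr.P.C. contemplates a situation where the evidence")
print(refs)  # [{'section': '319', 'statute': 'Cr.P.C.', 'span': (0, 19)}]
```

You would grow the statute alternation (and add jurisdiction-specific patterns) as you audit the model’s misses on your own corpus.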

Real legal pipelines process thousands of documents. Use nlp.pipe() for batch processing with proper batching and optional GPU acceleration.

import spacy
from collections import defaultdict

nlp = spacy.load("en_legal_ner_trf")

# Batch process documents
legal_documents = [
    "The petitioner filed under Section 482 of Cr.P.C. before the High Court of Delhi.",
    "In Maneka Gandhi v. Union of India, the Supreme Court expanded Article 21.",
    "Respondent argued that the Industrial Disputes Act, 1947 does not apply.",
]

entity_index = defaultdict(list)

for doc in nlp.pipe(legal_documents, batch_size=32):
    for ent in doc.ents:
        entity_index[ent.label_].append({
            "text": ent.text,
            "start": ent.start_char,
            "end": ent.end_char,
            "context": doc.text[max(0, ent.start_char - 50):ent.end_char + 50],
        })

# Print grouped entities
for label, entities in entity_index.items():
    print(f"\n--- {label} ({len(entities)} found) ---")
    for e in entities:
        print(f"  {e['text']}")

For GPU acceleration, add spacy.require_gpu() before loading the model. If you’re CPU-bound, keep the batch size smaller (8-16) to avoid memory spikes.

Post-Processing: Citation Extraction and Entity Linking

NER alone isn’t enough. Legal text has structured citation formats that benefit from regex-based post-processing layered on top of the model predictions.

import re
from spacy.tokens import Doc, Span


CITATION_PATTERNS = {
    "US_CASE": re.compile(
        r"(?P<party1>[A-Z][\w\s.]+?)\s+v\.\s+(?P<party2>[A-Z][\w\s.]+?)(?=,|\.|;|\s\d)"
    ),
    "US_CODE": re.compile(
        r"(\d+)\s+U\.?S\.?C\.?\s*§\s*(\d+[\w\-]*)"
    ),
    "FEDERAL_REPORTER": re.compile(
        r"(\d+)\s+F\.\s*(?:2d|3d|4th)?\s+(\d+)"
    ),
    "SECTION_REF": re.compile(
        r"[Ss]ection\s+(\d+[A-Za-z]?(?:\(\d+\))?)"
    ),
}


def extract_legal_citations(doc: Doc) -> list[dict]:
    """Extract structured citations that NER models often miss."""
    citations = []
    text = doc.text

    for cite_type, pattern in CITATION_PATTERNS.items():
        for match in pattern.finditer(text):
            citations.append({
                "type": cite_type,
                "text": match.group(),
                "start": match.start(),
                "end": match.end(),
                "groups": match.groupdict() if match.groupdict() else match.groups(),
            })

    return citations


# Combine NER entities with regex citations
import spacy

nlp = spacy.blank("en")
nlp.add_pipe(
    "hf_token_pipe",
    config={"model": "dslim/bert-base-NER", "aggregation_strategy": "simple"},
)

text = """
In Marbury v. Madison, 5 U.S. 137 (1803), Chief Justice Marshall
established judicial review under Article III. See also 28 U.S.C. § 1331
for federal question jurisdiction. The Court in Brown v. Board of Education,
347 U.S. 483 (1954), overruled Plessy under Section 1 of the Fourteenth Amendment.
"""

doc = nlp(text)

# NER entities
print("=== NER Entities ===")
for ent in doc.ents:
    print(f"  {ent.text:35s} {ent.label_}")

# Regex citations
print("\n=== Structured Citations ===")
citations = extract_legal_citations(doc)
for cite in citations:
    print(f"  [{cite['type']}] {cite['text']}")
=== NER Entities ===
  Marbury                             PER
  Madison                             PER
  Chief Justice Marshall              PER
  Brown                               PER
  Board of Education                  ORG
  Plessy                              PER

=== Structured Citations ===
  [US_CASE] In Marbury v. Madison
  [US_CASE] The Court in Brown v. Board of Education
  [US_CODE] 28 U.S.C. § 1331
  [SECTION_REF] Section 1

The regex layer catches citation structures that the NER model fragments into individual entities, though the party boundaries are still rough: the lazy party1 group happily absorbs preceding capitalized text like “In” or “The Court in”, so the patterns need tightening. In production, you’d then merge the two sources into a unified entity set, preferring the regex citation when it encompasses multiple NER entities that form a single legal reference.
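
One way to sketch that merge with plain dicts (the helper and the sample character offsets are hypothetical, but the dict shapes match the two outputs above): drop any NER entity whose span falls entirely inside a regex citation, then combine and sort by position.

```python
def merge_entities(ner_ents: list[dict], citations: list[dict]) -> list[dict]:
    """Prefer regex citations; keep only NER entities not swallowed by one."""
    merged = [
        {"text": c["text"], "label": c["type"], "start": c["start"], "end": c["end"]}
        for c in citations
    ]
    for ent in ner_ents:
        swallowed = any(
            c["start"] <= ent["start"] and ent["end"] <= c["end"] for c in citations
        )
        if not swallowed:
            merged.append(ent)
    return sorted(merged, key=lambda e: e["start"])

# Hypothetical offsets: "Brown" sits inside the citation span, "Warren" does not
ner = [{"text": "Brown", "label": "PER", "start": 10, "end": 15},
       {"text": "Warren", "label": "PER", "start": 60, "end": 66}]
cites = [{"type": "US_CASE", "text": "Brown v. Board of Education",
          "start": 10, "end": 37}]
print([e["label"] for e in merge_entities(ner, cites)])  # ['US_CASE', 'PER']
```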

Common Errors and Fixes

OSError: Can't find model 'en_legal_ner_trf'

You need to install the model wheel directly from Hugging Face – it isn’t on PyPI.

pip install https://huggingface.co/opennyaiorg/en_legal_ner_trf/resolve/main/en_legal_ner_trf-any-py3-none-any.whl

ValueError: Span overlaps with existing entities

spaCy doesn’t allow overlapping entities in doc.ents. Use spacy.util.filter_spans() to keep the longest span when overlaps occur, or store overlapping entities in doc.spans["legal"] instead.

from spacy.tokens import SpanGroup

doc.spans["legal"] = SpanGroup(doc, name="legal", spans=all_spans)
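
filter_spans keeps the longest span when two overlap, breaking ties in favor of the earlier start. A stdlib-only sketch of that logic on plain (start, end) tuples – an illustration of the algorithm, not spaCy’s actual implementation:

```python
def filter_overlapping(spans: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Keep the longest span when spans overlap, like spacy.util.filter_spans."""
    # Longest first; ties broken by the earlier start
    ordered = sorted(spans, key=lambda s: (s[0] - s[1], s[0]))
    kept: list[tuple[int, int]] = []
    seen: set[int] = set()
    for start, end in ordered:
        if any(i in seen for i in range(start, end)):
            continue  # overlaps a span we already kept
        kept.append((start, end))
        seen.update(range(start, end))
    return sorted(kept)

# (0, 5) and (3, 8) overlap and are the same length, so the earlier one wins
print(filter_overlapping([(0, 5), (3, 8), (10, 12)]))  # [(0, 5), (10, 12)]
```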

RuntimeError: CUDA out of memory during batch processing

Reduce the batch size or switch to CPU. Transformer models on legal documents (which tend to be long) eat GPU memory fast.

# Force CPU processing
spacy.require_cpu()
nlp = spacy.load("en_legal_ner_trf")

# Or reduce batch size
for doc in nlp.pipe(documents, batch_size=4):
    process(doc)

Tokenization misalignment between Transformers and spaCy

When doc.char_span() returns None, the transformer’s token boundaries don’t align with spaCy’s tokenization. Use alignment_mode="expand" to snap to the nearest spaCy token boundary. If you’re losing too many entities, try "contract" instead, which takes the smaller matching span.
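
To make the two modes concrete, here is a stdlib-only illustration (a hypothetical helper, not the spaCy API) that snaps a character span onto token boundaries both ways:

```python
def snap_to_tokens(start: int, end: int, tokens: list[tuple[int, int]], mode: str):
    """Snap a character span to token boundaries.

    "expand": grow to cover every token the span touches.
    "contract": shrink to tokens fully inside the span (None if none fit).
    """
    if mode == "expand":
        touched = [(s, e) for s, e in tokens if s < end and e > start]
        return (touched[0][0], touched[-1][1]) if touched else None
    if mode == "contract":
        inside = [(s, e) for s, e in tokens if s >= start and e <= end]
        return (inside[0][0], inside[-1][1]) if inside else None
    raise ValueError(mode)

# "Justice Warren" tokenized as [(0, 7), (8, 14)]; model predicted chars 2..10
tokens = [(0, 7), (8, 14)]
print(snap_to_tokens(2, 10, tokens, "expand"))    # (0, 14) - both tokens covered
print(snap_to_tokens(2, 10, tokens, "contract"))  # None - no token fits inside
```

This is why "expand" rarely loses entities (it always finds the touched tokens) while "contract" returns None whenever the prediction clips every token it overlaps.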

ModuleNotFoundError: No module named 'spacy_huggingface_pipelines'

Install the bridge library:

pip install spacy-huggingface-pipelines

This automatically registers the hf_token_pipe and hf_text_pipe components with spaCy. Make sure you import spacy after installation – the components register themselves via entry points.

Long documents getting truncated (512 token limit)

Most BERT-based models have a 512 token limit. Legal documents routinely exceed this. Use the stride parameter in hf_token_pipe to enable overlapping windows, or split documents into paragraphs before processing.

nlp.add_pipe(
    "hf_token_pipe",
    config={
        "model": "dslim/bert-base-NER",
        "aggregation_strategy": "simple",
        "stride": 128,  # overlap between windows
    },
)
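
If you split instead of striding, keep a character offset alongside each chunk so entity positions can be mapped back into document coordinates. A minimal stdlib sketch (the helper name is ours) that splits on blank lines:

```python
def split_with_offsets(text: str) -> list[tuple[int, str]]:
    """Split on blank lines, returning (char_offset, paragraph) pairs."""
    chunks = []
    offset = 0
    for para in text.split("\n\n"):
        if para.strip():
            chunks.append((offset, para))
        offset += len(para) + 2  # account for the "\n\n" separator
    return chunks

doc_text = "First paragraph.\n\nSecond paragraph."
for offset, para in split_with_offsets(doc_text):
    # Run NER on `para`, then add `offset` to each entity's start/end
    print(offset, repr(para))
```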