Redacting PII with <PERSON> tags works for logs, but it destroys the structure of your data. If you need anonymized text that still looks realistic – for training ML models, sharing datasets with vendors, or running QA on production-shaped data – you need a pipeline that replaces PII with fake but plausible values. Presidio’s anonymizer engine supports custom operators that do exactly this.
This post assumes you already have presidio-analyzer and presidio-anonymizer installed. If not:
```shell
pip install presidio-analyzer presidio-anonymizer faker
python -m spacy download en_core_web_lg
```
We also install faker here because we’ll use it to generate realistic replacement values.
Custom Anonymization Operators with Faker
Presidio’s built-in replace operator swaps PII with static text like [REDACTED]. That’s fine for redaction, but anonymization means producing output that preserves the shape of the original. A name should become another name, an email should become another email.
You build this with custom operators. A custom operator is a class that inherits from Operator (in presidio_anonymizer.operators) and implements the operate, validate, operator_name, and operator_type methods. Here’s one that uses Faker to generate entity-appropriate replacements:
```python
from typing import Dict

from faker import Faker
from presidio_anonymizer.operators import Operator, OperatorType

fake = Faker()
Faker.seed(42)  # reproducible output


class FakerOperator(Operator):
    """Replaces PII with fake data generated by Faker."""

    FAKER_GENERATORS = {
        "PERSON": lambda: fake.name(),
        "EMAIL_ADDRESS": lambda: fake.email(),
        "PHONE_NUMBER": lambda: fake.phone_number(),
        "LOCATION": lambda: fake.address().replace("\n", ", "),
        "US_SSN": lambda: fake.ssn(),
        "CREDIT_CARD": lambda: fake.credit_card_number(),
        "DATE_TIME": lambda: fake.date(),
        "IP_ADDRESS": lambda: fake.ipv4(),
        "URL": lambda: fake.url(),
    }

    def operate(self, text: str = None, params: Dict = None) -> str:
        entity_type = params.get("entity_type", "DEFAULT") if params else "DEFAULT"
        generator = self.FAKER_GENERATORS.get(entity_type)
        if generator:
            return generator()
        return f"<{entity_type}>"

    def validate(self, params: Dict = None) -> None:
        pass

    def operator_name(self) -> str:
        return "faker_replace"

    def operator_type(self) -> OperatorType:
        return OperatorType.Anonymize
```
Each entity type maps to a Faker generator. When the anonymizer hits a PERSON, it generates a fake name. When it hits an EMAIL_ADDRESS, it generates a fake email. Anything not in the map falls back to the standard tag placeholder.
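The dispatch is just a dict lookup with a fallback; in isolation, with a stub generator standing in for the Faker lambdas, it behaves like this:

```python
# Stub generator standing in for a Faker lambda
generators = {"PERSON": lambda: "Allison Hill"}

def replace(entity_type: str) -> str:
    """Mirror of FakerOperator.operate's dispatch: known entity types get a
    generated value, unknown types fall back to a tag placeholder."""
    generator = generators.get(entity_type)
    if generator:
        return generator()
    return f"<{entity_type}>"

print(replace("PERSON"))     # Allison Hill
print(replace("IBAN_CODE"))  # <IBAN_CODE>
```

The fallback matters in practice: the analyzer can emit entity types you didn't anticipate, and a placeholder tag is safer than raising mid-pipeline.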
Now wire it into the anonymizer:
```python
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

# Register the custom operator
anonymizer.add_anonymizer(FakerOperator)

text = """Patient: John Martinez, email: [email protected]
Phone: (415) 555-2938, SSN: 321-54-9876
Address: 1400 Oak Boulevard, San Francisco, CA 94102"""

results = analyzer.analyze(text=text, language="en")

anonymized = anonymizer.anonymize(
    text=text,
    analyzer_results=results,
    operators={
        "PERSON": OperatorConfig("faker_replace", {"entity_type": "PERSON"}),
        "EMAIL_ADDRESS": OperatorConfig("faker_replace", {"entity_type": "EMAIL_ADDRESS"}),
        "PHONE_NUMBER": OperatorConfig("faker_replace", {"entity_type": "PHONE_NUMBER"}),
        "US_SSN": OperatorConfig("faker_replace", {"entity_type": "US_SSN"}),
        "LOCATION": OperatorConfig("faker_replace", {"entity_type": "LOCATION"}),
        "DEFAULT": OperatorConfig("replace", {"new_value": "[ANONYMIZED]"}),
    },
)

print(anonymized.text)
```
The output looks like real data but contains no actual PII:
```
Patient: Allison Hill, email: [email protected]
Phone: 001-553-718-4407x958, SSN: 082-87-2839
Address: 809 Burns Creek Apt. 703, West Danielchester, NC 17498, San Francisco, CA 94102
```
Setting Faker.seed(42) makes the output deterministic, which matters when you need reproducible anonymization runs for testing.
Mixing Strategies: Fake Data, Hashing, and Masking
Real pipelines rarely use one strategy for everything. You might want fake names for readability, hashed emails for join keys, and masked credit cards for partial visibility. Presidio lets you assign a different operator per entity type in a single pass.
```python
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()
anonymizer.add_anonymizer(FakerOperator)

text = """Customer: Sarah Chen, [email protected]
Card: 4532-0154-7789-3321, SSN: 456-78-9012
IP: 192.168.1.42, joined 2024-03-15"""

results = analyzer.analyze(text=text, language="en")

operators = {
    # Fake name for readability
    "PERSON": OperatorConfig("faker_replace", {"entity_type": "PERSON"}),
    # Hash email so you can still join records across tables
    "EMAIL_ADDRESS": OperatorConfig("hash", {"hash_type": "sha256"}),
    # Mask credit card, show last 4 digits
    "CREDIT_CARD": OperatorConfig("mask", {
        "chars_to_mask": 12,
        "masking_char": "*",
        "from_end": False,
    }),
    # Hash SSN for consistency across documents
    "US_SSN": OperatorConfig("hash", {"hash_type": "sha256"}),
    # Replace IP with fake one
    "IP_ADDRESS": OperatorConfig("faker_replace", {"entity_type": "IP_ADDRESS"}),
    # Replace date with fake date
    "DATE_TIME": OperatorConfig("faker_replace", {"entity_type": "DATE_TIME"}),
}

anonymized = anonymizer.anonymize(
    text=text,
    analyzer_results=results,
    operators=operators,
)

print(anonymized.text)
```
Why hash emails and SSNs instead of replacing them? Hashing is deterministic – the same input always produces the same hash. If [email protected] appears in five documents, all five get the same hash value. You can still do joins and deduplication on the anonymized dataset without ever seeing the real email.
Batch Processing a Dataset
Processing one string at a time won’t cut it when you have thousands of records. Here’s a pipeline that reads a CSV, anonymizes specific columns, and writes the result back out:
```python
import csv
from io import StringIO

from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig

# Sample CSV data (in production, read from a file)
csv_data = """id,name,email,notes
1,Maria Garcia,[email protected],Called about account #4421
2,James Wilson,[email protected],Requested refund for order shipped to 42 Elm St
3,Aisha Patel,[email protected],Mentioned her SSN 111-22-3333 on the call"""

# Initialize engines once -- this loads the spaCy model
analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()
anonymizer.add_anonymizer(FakerOperator)

faker_operators = {
    "PERSON": OperatorConfig("faker_replace", {"entity_type": "PERSON"}),
    "EMAIL_ADDRESS": OperatorConfig("faker_replace", {"entity_type": "EMAIL_ADDRESS"}),
    "PHONE_NUMBER": OperatorConfig("faker_replace", {"entity_type": "PHONE_NUMBER"}),
    "US_SSN": OperatorConfig("faker_replace", {"entity_type": "US_SSN"}),
    "LOCATION": OperatorConfig("faker_replace", {"entity_type": "LOCATION"}),
    "CREDIT_CARD": OperatorConfig("mask", {
        "chars_to_mask": 12,
        "masking_char": "*",
        "from_end": False,
    }),
    "DEFAULT": OperatorConfig("replace", {"new_value": "[REDACTED]"}),
}

# Columns that may contain PII
pii_columns = {"name", "email", "notes"}

def anonymize_text(text: str) -> str:
    if not text or not text.strip():
        return text
    results = analyzer.analyze(text=text, language="en")
    anonymized = anonymizer.anonymize(
        text=text,
        analyzer_results=results,
        operators=faker_operators,
    )
    return anonymized.text

reader = csv.DictReader(StringIO(csv_data))
output_rows = []
for row in reader:
    anonymized_row = {}
    for key, value in row.items():
        if key in pii_columns:
            anonymized_row[key] = anonymize_text(value)
        else:
            anonymized_row[key] = value
    output_rows.append(anonymized_row)

# Print anonymized output
for row in output_rows:
    print(row)
```
A few things to notice. The engines are initialized once outside the loop. Creating an AnalyzerEngine loads the spaCy model, which takes a few seconds. You don’t want that happening per row. The pii_columns set controls which fields get scanned. Columns like id that are clearly non-PII skip the analyzer entirely, which speeds up the pipeline significantly on wide tables.
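The same loop generalizes to a file-to-file pipeline. A sketch — the transform argument is any str-to-str callable, such as the anonymize_text function defined above:

```python
import csv

def anonymize_csv(in_path: str, out_path: str, pii_columns: set, transform) -> None:
    """Stream rows from in_path to out_path, applying `transform` only to
    the columns listed in pii_columns."""
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            writer.writerow({
                key: transform(value) if key in pii_columns else value
                for key, value in row.items()
            })
```

Because rows are streamed one at a time, memory use stays flat regardless of file size.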
For larger datasets, swap the loop for a multiprocessing pool. Worker processes don’t share state, so each one needs its own engines and its own Faker instance; note that if every worker seeds Faker with the same value, they will all emit the same sequence of fake data:
```python
from functools import partial
from multiprocessing import Pool

# Per-worker engines, created once by the Pool initializer. Building an
# AnalyzerEngine loads the spaCy model, so doing it per row would dominate
# the runtime.
_analyzer = None
_anonymizer = None

def init_worker():
    """Runs once in each worker process."""
    global _analyzer, _anonymizer
    _analyzer = AnalyzerEngine()
    _anonymizer = AnonymizerEngine()
    _anonymizer.add_anonymizer(FakerOperator)

def anonymize_row(row, pii_cols):
    """Process a single row using the worker's engines."""
    result = {}
    for key, value in row.items():
        if key in pii_cols:
            detections = _analyzer.analyze(text=value, language="en")
            anon = _anonymizer.anonymize(
                text=value, analyzer_results=detections, operators=faker_operators,
            )
            result[key] = anon.text
        else:
            result[key] = value
    return result

# Use 4 worker processes
with Pool(4, initializer=init_worker) as pool:
    worker = partial(anonymize_row, pii_cols=pii_columns)
    anonymized_rows = pool.map(worker, output_rows)
```

The initializer runs once per worker process, so each of the four workers pays the spaCy startup cost exactly once. Creating the engines inside anonymize_row instead would rebuild them on every row, and that rebuild is far more expensive than the anonymization work itself.
Custom Recognizers for Domain-Specific Entities
The built-in recognizers cover standard PII (names, emails, SSNs, credit cards), but your data probably has domain-specific identifiers too. Medical record numbers, internal ticket IDs, policy numbers – these need custom recognizers.
Here’s a recognizer for medical record numbers (MRN) that follow the pattern MRN- followed by 8 digits, plus a context-aware recognizer for patient IDs that appear near healthcare keywords:
```python
from presidio_analyzer import AnalyzerEngine, Pattern, PatternRecognizer

# Simple regex-based recognizer for MRN codes
mrn_recognizer = PatternRecognizer(
    supported_entity="MEDICAL_RECORD_NUMBER",
    name="MRN Recognizer",
    patterns=[
        Pattern(
            name="mrn_pattern",
            regex=r"MRN-\d{8}",
            score=0.95,
        )
    ],
    context=["patient", "medical", "record", "chart"],
)

# Regex recognizer for insurance policy numbers: 2 letters + 8 digits
policy_recognizer = PatternRecognizer(
    supported_entity="INSURANCE_POLICY",
    name="Insurance Policy Recognizer",
    patterns=[
        Pattern(
            name="policy_pattern",
            regex=r"\b[A-Z]{2}\d{8}\b",
            score=0.6,
        )
    ],
    context=["policy", "insurance", "coverage", "plan"],
)

analyzer = AnalyzerEngine()
analyzer.registry.add_recognizer(mrn_recognizer)
analyzer.registry.add_recognizer(policy_recognizer)

text = """Patient record MRN-00194827 shows coverage under
insurance policy AB12345678. Contact the patient at 555-0199."""

results = analyzer.analyze(
    text=text,
    entities=["MEDICAL_RECORD_NUMBER", "INSURANCE_POLICY", "PHONE_NUMBER"],
    language="en",
)

for r in results:
    print(f"{r.entity_type}: '{text[r.start:r.end]}' (score: {r.score:.2f})")
```
Output:
```
MEDICAL_RECORD_NUMBER: 'MRN-00194827' (score: 1.00)
INSURANCE_POLICY: 'AB12345678' (score: 0.95)
PHONE_NUMBER: '555-0199' (score: 0.75)
```
The context parameter is important. The INSURANCE_POLICY pattern ([A-Z]{2}\d{8}) is generic enough to match lots of things. By providing context words like “policy” and “insurance”, Presidio boosts the confidence score when those words appear near the match. The base score of 0.6 gets bumped to 0.95 because “insurance policy” appears right before the match.
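The mechanics can be illustrated with a toy scorer. This is a simplification for intuition only, not Presidio's actual context-enhancement algorithm; the window size and boost amount here are made-up parameters:

```python
import re

def score_policy_matches(text, base_score=0.6, boost=0.35, window=20,
                         context_words=("policy", "insurance", "coverage", "plan")):
    """Toy context boosting: every regex match starts at base_score; if a
    context word appears within `window` characters before the match, the
    score is boosted (capped at 1.0)."""
    results = []
    for m in re.finditer(r"\b[A-Z]{2}\d{8}\b", text):
        before = text[max(0, m.start() - window):m.start()].lower()
        score = base_score
        if any(w in before for w in context_words):
            score = min(1.0, base_score + boost)
        results.append((m.group(), round(score, 2)))
    return results

print(score_policy_matches("insurance policy AB12345678 vs serial XY87654321"))
# [('AB12345678', 0.95), ('XY87654321', 0.6)]
```

The same pattern matches both strings, but only the one preceded by "policy" earns the boosted score — which is exactly why generic patterns need context words to be usable.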
Now add Faker generators for these custom entities and run the full pipeline:
```python
from faker import Faker
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig

fake = Faker()

# Extend the FakerOperator's generators for custom entities
FakerOperator.FAKER_GENERATORS["MEDICAL_RECORD_NUMBER"] = (
    lambda: f"MRN-{fake.random_number(digits=8, fix_len=True)}"
)
FakerOperator.FAKER_GENERATORS["INSURANCE_POLICY"] = (
    lambda: f"{fake.random_uppercase_letter()}{fake.random_uppercase_letter()}"
    f"{fake.random_number(digits=8, fix_len=True)}"
)

anonymizer = AnonymizerEngine()
anonymizer.add_anonymizer(FakerOperator)

anonymized = anonymizer.anonymize(
    text=text,
    analyzer_results=results,
    operators={
        "MEDICAL_RECORD_NUMBER": OperatorConfig(
            "faker_replace", {"entity_type": "MEDICAL_RECORD_NUMBER"}
        ),
        "INSURANCE_POLICY": OperatorConfig(
            "faker_replace", {"entity_type": "INSURANCE_POLICY"}
        ),
        "PHONE_NUMBER": OperatorConfig(
            "faker_replace", {"entity_type": "PHONE_NUMBER"}
        ),
    },
)

print(anonymized.text)
```
The anonymized output keeps the same structure but every identifier is fake. The MRN still looks like an MRN, the policy number still looks like a policy number – downstream systems and humans can work with the data normally.
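One caveat: FakerOperator as written draws a fresh fake value for every occurrence, so the same original MRN appearing twice becomes two different fake MRNs. If you need within-run consistency, memoize the generator. A stdlib sketch of the idea, with a counter standing in for a Faker call:

```python
import itertools

def consistent(generator):
    """Wrap a zero-argument generator so each distinct original value maps
    to one stable fake value for the lifetime of the cache."""
    cache = {}
    def replace(original):
        if original not in cache:
            cache[original] = generator()
        return cache[original]
    return replace

counter = itertools.count(1)
fake_mrn = consistent(lambda: f"MRN-{next(counter):08d}")

assert fake_mrn("MRN-00194827") == fake_mrn("MRN-00194827")  # stable
assert fake_mrn("MRN-00194827") != fake_mrn("MRN-55512345")  # distinct
```

The same wrapping works inside a custom operator, since operate receives the matched span as its text argument and can use it as the cache key.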
Common Errors and Fixes
KeyError: 'faker_replace' when calling anonymize() – You forgot to register the custom operator. Call anonymizer.add_anonymizer(FakerOperator) before calling anonymize(). The operator name returned by operator_name() must exactly match the string you pass in OperatorConfig.
TypeError: operate() got an unexpected keyword argument – Your custom operator’s operate method signature doesn’t match what Presidio expects. It must mirror the base class: operate(self, text: str = None, params: Dict = None) -> str. The params dict contains the parameters from OperatorConfig plus an entity_type key that Presidio injects automatically.
Faker generates the same values every run – That’s Faker.seed() doing its job. If you want different fake data each run, remove the seed call. If you want deterministic output (for tests or audits), keep it.
Custom recognizer matches too aggressively – Lower the base score in your Pattern and add context words. Without context, a pattern like \b[A-Z]{2}\d{8}\b will match random alphanumeric strings throughout your data. Context words act as a filter: the score only gets boosted when the surrounding text contains relevant keywords. If you set the base score to 0.3 and filter results at score >= 0.7, only contextually supported matches survive.
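The filtering step is a one-liner over the analyzer results (the analyzer also supports a score_threshold argument for the same purpose); the stand-in class below just mimics the score attribute of presidio's RecognizerResult:

```python
def filter_by_score(results, threshold=0.7):
    """Keep only detections at or above the confidence threshold."""
    return [r for r in results if r.score >= threshold]

class Hit:  # minimal stand-in for presidio's RecognizerResult
    def __init__(self, entity_type, score):
        self.entity_type = entity_type
        self.score = score

hits = [Hit("INSURANCE_POLICY", 0.3), Hit("INSURANCE_POLICY", 0.95)]
kept = filter_by_score(hits)
print([(h.entity_type, h.score) for h in kept])  # [('INSURANCE_POLICY', 0.95)]
```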
AnonymizerEngine does not modify the original text – Presidio returns a new EngineResult object. The original string is never mutated. Always use anonymized.text to get the result. If you see unchanged output, check that analyzer.analyze() actually found entities – print the results list first.
Anonymized text has overlapping replacements or garbled output – This happens when analyzer results overlap (two recognizers match the same span). Presidio handles overlaps by keeping the higher-scoring result, but custom recognizers can conflict with built-in ones. Debug by printing all results sorted by position and checking for overlapping start/end ranges. Remove redundant recognizers or increase the score threshold.
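A small helper for that debugging step, again using stand-in objects with the start and end attributes that RecognizerResult exposes:

```python
def find_overlaps(results):
    """Sort results by span and return adjacent pairs whose spans overlap."""
    ordered = sorted(results, key=lambda r: (r.start, r.end))
    return [
        (a, b)
        for a, b in zip(ordered, ordered[1:])
        if b.start < a.end
    ]

class Span:  # minimal stand-in for presidio's RecognizerResult
    def __init__(self, entity_type, start, end):
        self.entity_type, self.start, self.end = entity_type, start, end

hits = [Span("PHONE_NUMBER", 10, 22), Span("US_SSN", 18, 29), Span("PERSON", 40, 52)]
for a, b in find_overlaps(hits):
    print(f"overlap: {a.entity_type} [{a.start},{a.end}) vs {b.entity_type} [{b.start},{b.end})")
```

Run this on the raw analyzer output before anonymizing; any pair it prints is a candidate for removing a redundant recognizer or raising a score threshold.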