Microsoft Presidio finds and removes personally identifiable information (PII) from text. It combines spaCy’s named entity recognition with regex patterns and checksums to detect 50+ entity types – names, credit cards, SSNs, phone numbers, emails, and country-specific identifiers. Here is how to get it running.
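
The checksum part is worth understanding: for credit cards, a regex match alone is not enough, so candidates are validated with the Luhn algorithm. Here is a minimal stand-alone sketch of that check (not Presidio's actual implementation):

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    # Double every second digit from the right; subtract 9 when the result exceeds 9.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

print(luhn_valid("4111-1111-1111-1111"))  # True -- the classic Visa test number
print(luhn_valid("4111-1111-1111-1112"))  # False -- one digit off fails the checksum
```

This is why a regex-plus-checksum recognizer produces far fewer false positives than a regex alone: random 16-digit strings rarely pass the Luhn check.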

Install Presidio and Download the spaCy Model

Presidio ships as two packages: presidio-analyzer for detection and presidio-anonymizer for redaction. Both need a spaCy language model to tokenize text and run NER.


pip install presidio-analyzer presidio-anonymizer
python -m spacy download en_core_web_lg

Use en_core_web_lg for production workloads. The smaller en_core_web_sm works for quick experiments but misses more entities. If you skip the spaCy download, you will hit this error the first time you call analyze():

OSError: [E050] Can't find model 'en_core_web_lg'. It doesn't seem to be a Python package or a valid path to a data directory.

Fix it by running the spacy download command above.
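
One way to fail fast is to check for the model before constructing the engine. spaCy models install as ordinary Python packages, so a stdlib check is enough; the helper name below is ours, not part of Presidio or spaCy:

```python
import importlib.util

def model_installed(name: str) -> bool:
    """spaCy models install as regular Python packages, so find_spec can see them."""
    return importlib.util.find_spec(name) is not None

if not model_installed("en_core_web_lg"):
    print("Missing model. Run: python -m spacy download en_core_web_lg")
```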

Detect PII with AnalyzerEngine

The AnalyzerEngine scans text and returns a list of RecognizerResult objects with the entity type, start/end positions, and a confidence score between 0 and 1.

from presidio_analyzer import AnalyzerEngine

analyzer = AnalyzerEngine()

text = """
Contact Jane Smith at jane.smith@example.com or call 555-867-5309.
Her SSN is 123-45-6789 and she lives at 742 Evergreen Terrace, Springfield.
Credit card on file: 4111-1111-1111-1111.
"""

results = analyzer.analyze(
    text=text,
    entities=[
        "PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER",
        "US_SSN", "CREDIT_CARD", "LOCATION"
    ],
    language="en",
)

for r in results:
    print(f"{r.entity_type}: '{text[r.start:r.end]}' (score: {r.score:.2f})")

Output:

PERSON: 'Jane Smith' (score: 0.85)
EMAIL_ADDRESS: 'jane.smith@example.com' (score: 1.00)
PHONE_NUMBER: '555-867-5309' (score: 0.75)
US_SSN: '123-45-6789' (score: 0.85)
LOCATION: '742 Evergreen Terrace, Springfield' (score: 0.85)
CREDIT_CARD: '4111-1111-1111-1111' (score: 1.00)

Omit the entities argument (or pass None) to scan for all supported types. That is useful for an initial audit, but filtering to the entities you care about speeds things up and cuts false positives.

Redact PII with AnonymizerEngine

Once you have analyzer results, feed them into the AnonymizerEngine to strip or replace the sensitive values.

from presidio_anonymizer import AnonymizerEngine

anonymizer = AnonymizerEngine()

anonymized = anonymizer.anonymize(
    text=text,
    analyzer_results=results,
)

print(anonymized.text)

Output:

Contact <PERSON> at <EMAIL_ADDRESS> or call <PHONE_NUMBER>.
Her SSN is <US_SSN> and she lives at <LOCATION>.
Credit card on file: <CREDIT_CARD>.

By default, Presidio replaces each detected entity with a placeholder tag like <PERSON>. You can change this behavior per entity type using operators.
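
Conceptually, the anonymizer is doing span-based string surgery: replacements have to be applied from the rightmost span backward so earlier offsets stay valid. A toy sketch of that idea (this is an illustration, not Presidio's internals):

```python
def apply_tags(text: str, spans: list) -> str:
    """spans: (start, end, entity_type) tuples, like RecognizerResult fields."""
    # Replace right-to-left so earlier offsets are not shifted by edits.
    for start, end, entity_type in sorted(spans, reverse=True):
        text = text[:start] + f"<{entity_type}>" + text[end:]
    return text

print(apply_tags("Call Jane at 555-0100.", [(5, 9, "PERSON"), (13, 21, "PHONE_NUMBER")]))
# Call <PERSON> at <PHONE_NUMBER>.
```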

Choose an Anonymization Strategy

Presidio ships six built-in operators: replace, redact, hash, mask, encrypt, and custom. Pick the one that matches your compliance requirements.

from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig

anonymizer = AnonymizerEngine()

anonymized = anonymizer.anonymize(
    text=text,
    analyzer_results=results,
    operators={
        "PERSON": OperatorConfig("replace", {"new_value": "[REDACTED_NAME]"}),
        "EMAIL_ADDRESS": OperatorConfig("mask", {
            "chars_to_mask": 12,
            "masking_char": "*",
            "from_end": False,
        }),
        "US_SSN": OperatorConfig("hash", {"hash_type": "sha256"}),
        "CREDIT_CARD": OperatorConfig("redact"),
        "DEFAULT": OperatorConfig("replace", {"new_value": "[PII]"}),
    },
)

print(anonymized.text)

The DEFAULT key applies to any entity type without a specific operator. Use hash when you need to join records across datasets without exposing the raw value. Use encrypt when you need reversible anonymization (you will need to manage the encryption key).
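
To see why hash supports joins, note that SHA-256 is deterministic: the same SSN produces the same digest in both datasets. A quick stdlib illustration (the exact digest format Presidio's hash operator emits may differ):

```python
import hashlib

def pseudonymize(value: str) -> str:
    """Deterministic pseudonym: same input always yields the same digest."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

# Two datasets keyed by hashed SSN instead of the raw value.
customers = {pseudonymize("123-45-6789"): {"plan": "premium"}}
claims = {pseudonymize("123-45-6789"): {"claims": 3}}

# Same input -> same digest, so the datasets join without exposing the SSN.
for key in customers.keys() & claims.keys():
    print({**customers[key], **claims[key]})  # {'plan': 'premium', 'claims': 3}
```

Note that plain hashing of low-entropy values like SSNs is vulnerable to brute-force lookup; for real compliance work, consider a keyed hash (HMAC) with a secret you control.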

Add a Custom Recognizer

The built-in recognizers handle common PII well, but most teams need to detect domain-specific patterns – internal employee IDs, project codes, medical record numbers. Presidio makes this straightforward with PatternRecognizer.

from presidio_analyzer import AnalyzerEngine, PatternRecognizer, Pattern

# Detect employee IDs like "EMP-12345"
emp_pattern = Pattern(
    name="employee_id_pattern",
    regex=r"EMP-\d{5}",
    score=0.95,
)

emp_recognizer = PatternRecognizer(
    supported_entity="EMPLOYEE_ID",
    patterns=[emp_pattern],
)

analyzer = AnalyzerEngine()
analyzer.registry.add_recognizer(emp_recognizer)

text = "Assign ticket to EMP-48291 and CC EMP-10034."
results = analyzer.analyze(text=text, entities=["EMPLOYEE_ID"], language="en")

for r in results:
    print(f"{r.entity_type}: '{text[r.start:r.end]}' (score: {r.score})")

Output:

EMPLOYEE_ID: 'EMP-48291' (score: 0.95)
EMPLOYEE_ID: 'EMP-10034' (score: 0.95)
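
Before wiring a pattern into Presidio, it is cheap to sanity-check the regex on its own with the stdlib re module:

```python
import re

pattern = re.compile(r"EMP-\d{5}")
text = "Assign ticket to EMP-48291 and CC EMP-10034."
print(pattern.findall(text))  # ['EMP-48291', 'EMP-10034']

# If IDs can run longer than five digits, add word boundaries so a
# string like EMP-123456 is not partially matched: r"\bEMP-\d{5}\b"
```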

For simpler cases, you can use a deny list instead of regex:

dept_recognizer = PatternRecognizer(
    supported_entity="INTERNAL_DEPT",
    deny_list=["Project Chimera", "Skunkworks", "Team Alpha"],
)
analyzer.registry.add_recognizer(dept_recognizer)

Set a Confidence Threshold

Presidio assigns a confidence score to every detection. Low-confidence results are often false positives, especially for entity types like PERSON and LOCATION where NER is doing the heavy lifting. Filter results by score to control precision.

results = analyzer.analyze(text=text, language="en")

# Only keep high-confidence detections
filtered = [r for r in results if r.score >= 0.7]

Tune this threshold on a labeled sample of your actual data. Start at 0.5 and increase until false positives drop to an acceptable level. Going much above 0.85 tends to start dropping real PII, so test carefully.
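
Here is one way to run that tuning loop on a labeled sample. Everything below is hypothetical data for illustration: detected holds (span, score) pairs as you would extract from analyzer results, and truth holds the hand-labeled PII spans.

```python
def precision_recall(detected, truth, threshold):
    """Score a threshold against hand-labeled spans."""
    kept = {span for span, score in detected if score >= threshold}
    tp = len(kept & truth)  # spans that are both kept and truly PII
    precision = tp / len(kept) if kept else 1.0
    recall = tp / len(truth) if truth else 1.0
    return precision, recall

# Two real PII spans plus one low-confidence false positive.
detected = [((8, 18), 0.85), ((25, 36), 0.75), ((40, 47), 0.40)]
truth = {(8, 18), (25, 36)}

for t in (0.3, 0.8):
    p, r = precision_recall(detected, truth, t)
    print(f"threshold={t}: precision={p:.2f}, recall={r:.2f}")
```

Raising the threshold trades recall for precision: at 0.3 the false positive slips through, while at 0.8 one real span is lost. Pick the point that matches your risk tolerance.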

Process Text in Batches

For production pipelines processing thousands of documents, avoid re-initializing the engine on every call. Create the engine once and reuse it.

from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

documents = [
    "Email john.doe@example.com for details.",
    "Call Sarah at 202-555-0143.",
    "Patient SSN: 987-65-4321.",
]

for doc in documents:
    results = analyzer.analyze(text=doc, language="en")
    anonymized = anonymizer.anonymize(text=doc, analyzer_results=results)
    print(anonymized.text)

The AnalyzerEngine constructor loads the spaCy model into memory, which takes a few seconds. Calling it once and reusing the instance avoids that overhead.
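
If you want to make the create-once rule hard to violate, wrap construction in a cached factory. The sketch below uses a stand-in class so it runs without loading a model; in real code the factory body would return AnalyzerEngine():

```python
from functools import lru_cache

class ExpensiveEngine:
    """Stand-in for AnalyzerEngine; the real constructor loads a spaCy model."""
    constructions = 0

    def __init__(self):
        ExpensiveEngine.constructions += 1

@lru_cache(maxsize=None)
def get_engine() -> ExpensiveEngine:
    # lru_cache guarantees the constructor runs at most once per process.
    return ExpensiveEngine()

for _ in range(1000):
    engine = get_engine()

print(ExpensiveEngine.constructions)  # 1 -- constructed exactly once
```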

Common Errors and Fixes

ValueError: No matching recognizers were found – You passed an entity name that does not exist. Check your spelling against the supported entities list. Entity names are uppercase with underscores: EMAIL_ADDRESS, not email or Email.

OSError: [E050] Can't find model 'en_core_web_lg' – Run python -m spacy download en_core_web_lg. If you are in a Docker container, add this to your Dockerfile.

False positives on short strings – Presidio may flag common words as PERSON or LOCATION in short text. Increase the score threshold or restrict the entity list to reduce noise.

Memory usage with en_core_web_lg – The large spaCy model uses around 800 MB of RAM. If that is too much, switch to en_core_web_md (about 120 MB) and accept slightly lower NER accuracy. Configure the model in the NlpEngineProvider:

from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.nlp_engine import NlpEngineProvider

configuration = {
    "nlp_engine_name": "spacy",
    "models": [{"lang_code": "en", "model_name": "en_core_web_md"}],
}

provider = NlpEngineProvider(nlp_configuration=configuration)
nlp_engine = provider.create_engine()

analyzer = AnalyzerEngine(nlp_engine=nlp_engine)

When to Use Presidio

Presidio fits well as a preprocessing step before sending text to an LLM, storing user-submitted content, or logging application data. It runs locally, so sensitive text never leaves your infrastructure. For GDPR, HIPAA, or CCPA compliance, pair it with proper data handling policies – Presidio handles detection and redaction, but you still need to define what counts as PII for your use case and validate the results on your actual data.