Microsoft Presidio finds and removes personally identifiable information (PII) from text. It combines spaCy’s named entity recognition with regex patterns and checksums to detect 50+ entity types – names, credit cards, SSNs, phone numbers, emails, and country-specific identifiers. Here is how to get it running.
Install Presidio and Download the spaCy Model
Presidio ships as two packages: presidio-analyzer for detection and presidio-anonymizer for redaction. Both need a spaCy language model to tokenize text and run NER.
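A typical setup, assuming pip (the package names are Presidio's official ones; the model download is spaCy's standard CLI):

```shell
pip install presidio-analyzer presidio-anonymizer
python -m spacy download en_core_web_lg
```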
Use en_core_web_lg for production workloads. The smaller en_core_web_sm works for quick experiments but misses more entities. If you skip the spaCy download, you will hit this error the first time you call analyze():
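The failure is spaCy's standard missing-model error, which looks like this (wording may vary slightly by spaCy version):

```
OSError: [E050] Can't find model 'en_core_web_lg'. It doesn't seem to be a
Python package or a valid path to a data directory.
```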
Fix it by running the spacy download command above.
Detect PII with AnalyzerEngine
The AnalyzerEngine scans text and returns a list of RecognizerResult objects with the entity type, start/end positions, and a confidence score between 0 and 1.
Omit the entities argument (or pass None) to scan for all supported types. That is useful for an initial audit, but filtering to the entities you care about speeds things up and cuts false positives.
Redact PII with AnonymizerEngine
Once you have analyzer results, feed them into the AnonymizerEngine to strip or replace the sensitive values.
By default, Presidio replaces each detected entity with a placeholder tag like <PERSON>. You can change this behavior per entity type using operators.
Choose an Anonymization Strategy
Presidio ships six built-in operators: replace, redact, hash, mask, encrypt, and custom. Pick the one that matches your compliance requirements.
The DEFAULT key applies to any entity type without a specific operator. Use hash when you need to join records across datasets without exposing the raw value. Use encrypt when you need reversible anonymization (you will need to manage the encryption key).
Add a Custom Recognizer
The built-in recognizers handle common PII well, but most teams need to detect domain-specific patterns – internal employee IDs, project codes, medical record numbers. Presidio makes this straightforward with PatternRecognizer.
For simpler cases, you can use a deny list instead of regex:
Set a Confidence Threshold
Presidio assigns a confidence score to every detection. Low-confidence results are often false positives, especially for entity types like PERSON and LOCATION where NER is doing the heavy lifting. Filter results by score to control precision.
Tune this threshold on a labeled sample of your actual data. Start at 0.5 and raise it until false positives drop to an acceptable level. Thresholds much above 0.85 tend to start discarding real PII, so test carefully.
Process Text in Batches
For production pipelines processing thousands of documents, avoid re-initializing the engine on every call. Create the engine once and reuse it.
The AnalyzerEngine constructor loads the spaCy model into memory, which takes a few seconds. Calling it once and reusing the instance avoids that overhead.
Common Errors and Fixes
ValueError: No matching recognizers were found – You passed an entity name that does not exist. Check your spelling against the supported entities list. Entity names are uppercase with underscores: EMAIL_ADDRESS, not email or Email.
OSError: [E050] Can't find model 'en_core_web_lg' – Run python -m spacy download en_core_web_lg. If you are in a Docker container, add this to your Dockerfile.
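In a Dockerfile, that might look like this (base image and layout are illustrative):

```dockerfile
FROM python:3.11-slim
RUN pip install presidio-analyzer presidio-anonymizer && \
    python -m spacy download en_core_web_lg
```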
False positives on short strings – Presidio may flag common words as PERSON or LOCATION in short text. Increase the score threshold or restrict the entity list to reduce noise.
Memory usage with en_core_web_lg – The large spaCy model uses around 800 MB of RAM. If that is too much, switch to en_core_web_md (about 120 MB) and accept slightly lower NER accuracy. Configure the model in the NlpEngineProvider:
When to Use Presidio
Presidio fits well as a preprocessing step before sending text to an LLM, storing user-submitted content, or logging application data. It runs locally, so sensitive text never leaves your infrastructure. For GDPR, HIPAA, or CCPA compliance, pair it with proper data handling policies – Presidio handles detection and redaction, but you still need to define what counts as PII for your use case and validate the results on your actual data.
Related Guides
- How to Detect AI-Generated Text with Watermarking
- How to Detect and Reduce Hallucinations in LLM Applications
- How to Build Automated PII Redaction Testing for LLM Outputs
- How to Build Automated Hate Speech Detection with Guardrails
- How to Detect and Mitigate Bias in ML Models
- How to Build Differential Privacy Testing for LLM Training Data
- How to Build Fairness-Aware ML Pipelines with Fairlearn
- How to Build Automated Output Safety Classifiers for LLM Apps
- How to Build Automated Data Retention and Deletion for AI Systems
- How to Build Membership Inference Attack Detection for ML Models