Benchmarking Local PII Redaction Tools for Meeting Transcripts

This post was AI-generated. It is a report, not a polished essay. Published on the off chance it saves someone time.

TLDR: Why can’t I just install a thing that strips PII from a markdown file?

Markdown is the default container for text right now. Transcription tools emit it, Obsidian vaults are full of it, and every AI coding agent and chat tool leaves a trail of it behind. That adds up to a huge amount of text that could contain names, emails, phone numbers, addresses, account numbers, and so on. In a modern agentic workflow it almost certainly makes its way into a cloud model for summarizing, drafting, or analysis. I do privacy consulting, and transmitting these details to a cloud model unredacted is highly undesirable.

The obvious safeguard is to scrub the PII locally before any of that text leaves the machine. Not as a cloud service (the cloud is the thing you are scrubbing for), but as a small local step you can run by hand or wire into an agent hook. I assumed this was solved. I expected to pacman -S or pipx install something, point it at a file, and get a clean file back.

I could not find that tool. There is a Microsoft framework that does the hard part, and there are a few CLIs around it, but none of them does the specific, boring thing I wanted: take a markdown file, redact the PII, leave the code blocks alone, and hand me back markdown. So I benchmarked the options to make sure I wasn’t missing something, and then I wrote the small wrapper I had been looking for. This post documents the findings.


Requirements

My priorities, roughly in order:

  1. Local only. No transcript text crosses the network, ever. A tool that phones home (telemetry, cloud detection, an audit dashboard) is out.
  2. High recall on names. A missed name that leaks is far worse than an over-redaction. Transcripts and notes are dominated by person names, so a tool that cannot catch names is useless here.
  3. Fast enough to run on every file. Seconds per document on CPU is fine. A minute per document is not something you run habitually.
  4. Auditable provenance. A real license, a real repo, source I can read.

The candidates

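I evaluated three tools, plus a fourth as an accuracy ceiling. One of them, redact-cli, is benchmarked in two configurations (plain CLI and Docker image with NER), which is why the results table has five columns.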

A. redact-cli (github.com/censgate/redact, Apache-2.0). A Rust CLI with 36 regex-based entity types and a separate ONNX NER crate. Active project, nine releases between February and April 2026.

B. pii-vault (github.com/Jiansen/pii-vault, MIT). A spec-driven regex library with 47 recognizers across 15 countries, available in Rust and TypeScript.

C. Presidio (github.com/microsoft/presidio, MIT). Microsoft’s open-source PII framework. spaCy for named-entity recognition plus pattern-based recognizers. 8,500 stars, actively maintained.

D. opf (github.com/openai/privacy-filter, Apache-2.0). OpenAI’s Privacy Filter: a fine-tuned 1.5B-parameter model with a reported 96 to 97% F1 on the ai4privacy benchmark. I included it as a ceiling, not a practical recommendation.

Two tools got triaged out before benchmarking: pii-shield (an MSFT-AI-BUILD-INTERNAL org project built around Azure cloud co-detection, rejected on the local-only rule) and pii-redactor on PyPI (v0.1, abandoned since January 2025, no license).

Doesn’t a Presidio CLI already exist? Sort of. There is presidio-cli (insightsengineering/presidio-cli), and it does wrap Presidio behind a command line. But it is an analyzer, not a redactor: it reports the PII spans it finds, like a linter, and leaves you to do something with that report. It does not emit cleaned text, it is not markdown-aware, and it has not shipped a release in about a year. That still leaves a gap for “just clean my file” functionality.


Step 1: Security audit

Every tool passed the gate question (no transcript text leaves the machine), but the details are worth recording.

redact-cli. The workspace Cargo.toml lists reqwest, an HTTP client, which looks alarming until you read the dependency tree: it belongs only to the redact-api crate (an Axum HTTP server) and the examples. The CLI binary itself depends on redact-core and clap, nothing more. I confirmed with ss monitoring: zero new TCP connections during a redaction run.

pii-vault. The Rust crate depends only on regex, serde, and getrandom. Clean. But there is a catch: the Python binding (pip install pii-vault) raises NotImplementedError on every method. The Python SDK is documented as “planned (PyO3).” Only Rust and TypeScript actually work, and both are regex-only, with no NER and no name detection.

Presidio. The file azure_ai_language.py imports azure.ai.textanalytics, which would be a cloud call, but it is an optional recognizer that you have to instantiate explicitly. The default AnalyzerEngine() never loads it. Confirmed with ss: zero new connections during analysis.

opf. The model downloads from HuggingFace on first use via huggingface_hub.snapshot_download, guarded by if not target.exists(). Once it is cached at ~/.opf/privacy_filter/, the HuggingFace import never runs again, and inference is pure PyTorch. For a fully airgapped setup, HF_HUB_OFFLINE=1 blocks any accidental call, and HF_HUB_DISABLE_TELEMETRY=1 disables the library’s own anonymous telemetry.

Short version:

  • redact-cli: clean. reqwest is examples-only, not in the CLI.
  • pii-vault: clean, but the Python SDK is an unimplemented stub. Disqualified for Python use.
  • Presidio: clean. The Azure recognizer is optional and never loaded by default.
  • opf: clean for inference. One-time model download from HuggingFace; set HF_HUB_OFFLINE=1 for airgapped use.
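
The ss check used for redact-cli and Presidio can be approximated with a before/after snapshot. This is a minimal sketch, not the monitoring I actually ran: the function names are made up for illustration, it assumes Linux with the ss binary on the PATH, and it assumes the usual `ss -tn` column order (State, Recv-Q, Send-Q, local endpoint, peer endpoint).

```python
import subprocess


def parse_ss(output):
    """Return the set of (local, peer) endpoint pairs in `ss -tn` output."""
    pairs = set()
    for line in output.splitlines()[1:]:  # first line is the column header
        fields = line.split()
        # Assumed columns: State, Recv-Q, Send-Q, Local Address:Port, Peer Address:Port
        if len(fields) >= 5:
            pairs.add((fields[3], fields[4]))
    return pairs


def tcp_snapshot():
    """Snapshot current TCP connections (Linux only, needs the ss binary)."""
    done = subprocess.run(["ss", "-tn"], capture_output=True, text=True, check=True)
    return parse_ss(done.stdout)


def new_connections_during(cmd):
    """Run cmd, then return endpoint pairs present afterward but not before."""
    before = tcp_snapshot()
    subprocess.run(cmd, check=True)
    return tcp_snapshot() - before
```

A two-snapshot diff only sees connections that are still open when the command exits, and it also picks up anything unrelated that happened to connect in the meantime. Polling ss in a loop while the tool runs, or running the tool in a network namespace with no route out, is the stricter version of the same check.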

Step 2: Test corpus

I built 15 synthetic markdown transcripts covering a realistic spread: project kickoffs, engineering syncs, HR onboarding, board finance reviews, legal hold meetings, security incident debriefs, sales discovery calls, all-hands, contract negotiations, breach notifications, clinical trial reviews, partnership agreements, and code audit sessions.

They were written to stress the failure modes I care about:

  • Names with no honorific, mid-sentence and in speaker labels
  • Phone numbers spelled out across line breaks (“five five five two two four…”)
  • Partial spoken addresses
  • Names from non-Western naming systems (Chinese, African, Slavic)
  • IBANs, tax IDs, sort codes, medical record numbers, insurance IDs, passport numbers
  • Service-account emails (breachnotify@ag.state.gov) that read less like personal PII
  • Fenced code blocks with real content (YAML config, Python exploits, JSON with API keys) that must not be touched

In total: 186 gold PII spans across 19 entity types, plus 25 must-not-touch spans inside code blocks.

For pii-vault, since the Python binding is a stub, I ported its JSON recognizer spec into Python regex directly. The numbers reflect what the spec detects, not a working Python library.
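
Porting a JSON recognizer spec to Python regex amounts to loading the patterns and running each one over the text. The sketch below shows the shape of that port under a loud assumption: the spec layout here (a list of objects with `entity` and `pattern` keys) and both patterns are invented for illustration and are not pii-vault's real schema or recognizers.

```python
import json
import re

# Hypothetical spec layout and patterns, for illustration only.
SPEC_JSON = r"""
[
  {"entity": "EMAIL", "pattern": "[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}"},
  {"entity": "US_SSN", "pattern": "\\b\\d{3}-\\d{2}-\\d{4}\\b"}
]
"""


def load_recognizers(spec_json):
    """Compile every pattern in the spec into (entity, regex) pairs."""
    return [(item["entity"], re.compile(item["pattern"])) for item in json.loads(spec_json)]


def detect(text, recognizers):
    """Return sorted (start, end, entity) spans for every regex match."""
    spans = []
    for entity, regex in recognizers:
        for match in regex.finditer(text):
            spans.append((match.start(), match.end(), entity))
    return sorted(spans)
```

A port like this is regex-only by construction, so it inherits the same blind spot as the original: nothing in it can recognize a person name.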

Those 15 transcripts produce the headline table below. The repo also includes a sixteenth, built specifically for the ambiguous-name stress test described later. The full corpus, the evaluation scripts, and the numeric results live in the repo’s bench/ directory if you want to reproduce or extend the comparison. One caveat: the Presidio column reproduces from the package alone, but the other columns need their respective tools installed and running (the Rust redact binary, the Docker image on a local port, the opf model), so treat those scripts as the exact harness I ran rather than a turnkey one-command repro.


Step 3: Results

| Metric | redact-cli (CLI) | pii-vault | redact-cli (Docker+NER) | Presidio | opf (ceiling) |
|---|---|---|---|---|---|
| Overall recall | 40% | 53% | 65% | 90% | 95% |
| PERSON recall | 0% | 0% | 84% | 98% | 96% |
| PHONE recall | 47% | 76% | 50% | 92% | 100% |
| EMAIL recall | 100% | 100% | 98% | 100% | 95% |
| LOCATION recall | 0% | 0% | 31% | 38% | 100% |
| Code over-redactions (of 25) | 11 | 11 | 11 | 10 | 10 |
| Time per doc | 21 ms | 7 ms | 339 ms | 64 ms | 20,000 ms |
| RAM | <50 MB | <50 MB | ~700 MB | 763 MB | ~3 GB |
| Install size | 4 MB | <10 MB | 461 MB (image) | ~600 MB | 7.4 GB |

The first two columns are the baseline, and the headline there is that redact-cli and pii-vault both score zero on person names. Not low. Zero. The reasons differ.

pii-vault is regex-only by design. There is no NER component anywhere in the codebase. The Rust path uses a static JSON recognizer spec, and the Python binding is an unimplemented stub.

redact-cli’s installed CLI is also regex-only, even though the repo ships an ONNX NER crate. The README is explicit that NER lives only in the Docker API server, and the CLI path skips it.

The “redact-cli (Docker+NER)” column is what you get from the full image (ghcr.io/censgate/redact:full), which bakes in a bert-base-NER model. PERSON recall jumps from 0% to 84%, and LOCATION, which needs contextual inference rather than pattern matching, goes from 0% to 31%. That comes at a cost: 339 ms/doc warm versus 21 ms for the CLI alone.

The gap between that Docker path and Presidio is the interesting part. Both put an NER model in front of a pattern layer, but not the same kind of model: the Docker image uses a BERT-based transformer, while Presidio’s default is spaCy’s en_core_web_lg, a non-transformer pipeline trained on a broader dataset. Presidio also pairs it with a deep regex recognizer layer for IBANs, tax IDs, and international phone formats. redact-cli’s pattern layer misses most of those, which is a large part of why it lands at 65% overall against Presidio’s 90%.

Presidio reaches 90% recall at 64 ms per transcript. Its 10% miss rate breaks down as:

  • Addresses (38% recall). The biggest gap. It misses foreign-language addresses (“12 rue de la Paix, 75002 Paris”), multi-line addresses, and addresses with no clear geographic marker nearby. Eight of nineteen missed spans are addresses.
  • Spoken phone numbers. “five five five two two four seven eight nine zero” across a line break is not caught.
  • Bare passwords. “hunter2” with no “password:” keyword nearby is missed.
  • Unusual formats: Swiss VAT IDs, UK sort codes, bar numbers (BBO#682341).

opf reaches 95%. Its misses are non-Western person names (Chen Wei), service-account emails (breachnotify@ag.state.gov), a spoken bank routing number, and a UK sort code. It catches every address and every phone number, including the spoken ones.

On precision

Both Presidio and opf look bad on naive precision: 31% and 50%. That is mostly an artifact of a conservative gold standard. I did not annotate meeting dates, years, or domain names as PII, and both tools flag more than I labeled. For my use case, over-redaction beats under-redaction, so I am not bothered by it. The precision number I actually trust is the code-block over-redaction count, where both tools hit 10 of 25 must-not-touch spans. In other words, out of the box they both redact inside code. That is the problem the wrapper exists to solve.
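
For reference, the three numbers discussed here (recall, naive precision, code-block over-redaction) can all be computed from character spans. This is a sketch under an assumption, not the bench/ harness itself: it counts a span as matched on any character overlap, and the real scripts may use a stricter rule. Spans are `(start, end)` tuples with an exclusive end.

```python
def overlaps(a, b):
    """True when two half-open (start, end) spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def score(gold, predicted, protected):
    """Score one document.

    gold:      annotated PII spans
    predicted: spans the tool flagged
    protected: must-not-touch spans inside code blocks
    """
    gold_hit = sum(any(overlaps(g, p) for p in predicted) for g in gold)
    pred_hit = sum(any(overlaps(p, g) for g in gold) for p in predicted)
    touched = sum(any(overlaps(c, p) for p in predicted) for c in protected)
    return {
        "recall": gold_hit / len(gold) if gold else 1.0,
        "precision": pred_hit / len(predicted) if predicted else 1.0,
        "code_over_redact": touched,
    }
```

The precision figure is only as good as the gold set. Every date or domain name the tool flags and the annotator skipped counts against it, which is why a conservative gold standard drags naive precision down without saying much about real false positives.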

One caveat about Presidio’s enthusiasm: it tags every date and time it sees. In a transcript full of “let’s meet Tuesday at 3” and “the Q2 numbers,” that is a lot of noise, and dates are usually not the PII you care about. The tool I built keeps Presidio’s default entity set, so dates are redacted out of the box, but it has a flag (--keep DATE_TIME) to leave them in if you would rather.

Code blocks

Every tool I tested redacts inside fenced code blocks. A YAML block with host: alice@internal loses the email. A Python snippet with sock.connect(('10.0.0.1', 4444)) gets the IP flagged. A SHA-256 hash trips the credit-card pattern.

The fix is to pull the code out before analysis and put it back after. I do it with a small preprocessing step: fenced blocks and inline `code` spans are swapped for unique null-byte placeholders, Presidio runs on the cleaned text, and the placeholders are restored afterward. With that in place, code-block over-redaction drops to zero.
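
The swap-and-restore step can be sketched in a few lines. This is an illustration of the technique, not the redact-md source: the regexes are simplified (fences must start at column zero and close with the same marker, inline spans use single backticks only), and `redact` stands in for whatever detector you run, Presidio or otherwise.

```python
import re

# A fenced block: an opening run of 3+ backticks or tildes, then everything
# up to a closing line made of the same marker.
FENCE_RE = re.compile(r"^(`{3,}|~{3,})[^\n]*\n.*?^\1[ \t]*$", re.MULTILINE | re.DOTALL)
# An inline span: single backticks on one line.
INLINE_RE = re.compile(r"`[^`\n]+`")
PLACEHOLDER_RE = re.compile(r"\x00(\d+)\x00")


def protect_code(text):
    """Replace code with null-byte placeholders; return (text, stashed code)."""
    stash = []

    def _swap(match):
        stash.append(match.group(0))
        return f"\x00{len(stash) - 1}\x00"

    text = FENCE_RE.sub(_swap, text)   # fenced blocks first
    text = INLINE_RE.sub(_swap, text)  # then inline spans in what is left
    return text, stash


def restore_code(text, stash):
    """Put the stashed code back where its placeholder sits."""
    return PLACEHOLDER_RE.sub(lambda m: stash[int(m.group(1))], text)


def redact_markdown(text, redact):
    """Run redact (a str -> str callable) on everything except code."""
    cleaned, stash = protect_code(text)
    return restore_code(redact(cleaned), stash)
```

Null bytes work as delimiters because they essentially never appear in real markdown, so a placeholder cannot collide with document text. The scheme does rely on the detector leaving the placeholders alone; if it ever rewrote one, that code block would not be restored.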

A harder test: ambiguous names

Running redact-md on a real transcript, I watched it sail right past someone whose first name was also a well-known city. The model leans toward reading a name like that as a location, or as nothing at all. The same thing happens with names that double as ordinary words: “Jack” in “okay Jack, show me the cards,” or “Bill,” “Rose,” “Mark,” “Will.” A model trained to find names in running text leans on context, and these names actively supply the wrong context.

So I built a sixteenth transcript designed entirely around this. Every participant has a first name that is also a place (Savannah, Paris, Florence, Sydney) or a common word, verb, or modal (Jack, Rose, Mark, Bill, Will), and the dialogue is salted with non-person decoys: “revenue rose 12%,” “let’s mark that as done,” “the cloud bill came to forty grand,” “standing up the new Savannah office.” Twenty-one person mentions in all: nine where the name appears with a surname, twelve where it appears bare.

| Tool | Full names (9) | Bare ambiguous (12) | PERSON recall |
|---|---|---|---|
| opf | 9 / 9 | 6 / 12 | 71% |
| Presidio | 5 / 9 | 7 / 12 | 57% |
| redact-cli (Docker+NER) | 0 / 9 | 0 / 12 | 0% |

Both serious tools lose a large share of their usual person-name recall on this one document: opf drops from 96% to 71%, and Presidio from 98% to 57%. But they fail differently, which is the interesting part.

opf is rock solid on full names: nine for nine, including Savannah Okafor and Paris Adeyemi. It only stumbles on bare first names, and mostly the ones that read more as a city than as a person. It misses both bare “Paris,” plus bare “Florence” and “Sydney,” and the modal “Will,” while still catching bare “Savannah,” “Jack,” “Rose,” and “Bill.” A surname is enough to anchor it; without one, a strong place name wins.

Presidio is more erratic. It misses several full names that opf catches, including Savannah Okafor, Paris Adeyemi, Sydney Mbeki, and Will Castellano, apparently thrown off when a city-like first name sits in a speaker label. But it also catches bare names that opf drops, including both “Paris” and “Florence.” The two models have almost complementary blind spots, which is a concrete argument for the two-pass approach: run both, union the results, and this transcript climbs to 86% (18 of 21). The three the union still misses are all bare and contextless, one “Sydney” and two “Will”s, which is exactly the case a model cannot solve without knowing the roster.
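
Taking the union of two passes is a span-merge problem: collect the character spans from each tool, merge the ones that overlap or touch, and mask the result once. A minimal sketch, assuming each tool's output has already been reduced to `(start, end)` offsets into the same text (the function names and the mask string are illustrative):

```python
def union_spans(*passes):
    """Merge span lists from several detectors into non-overlapping spans."""
    merged = []
    for start, end in sorted(span for spans in passes for span in spans):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous span: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def apply_redaction(text, spans, mask="<REDACTED>"):
    """Replace each span (sorted, non-overlapping) with the mask."""
    pieces, cursor = [], 0
    for start, end in spans:
        pieces.append(text[cursor:start])
        pieces.append(mask)
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)
```

Merging before masking matters because both tools often flag the same name with slightly different boundaries, and masking each span separately would shift offsets or double-redact. The offsets must come from the same input text, so run both detectors on the original, not on each other's output.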

The practical takeaway is that a bare first name with no surname and no disambiguating context is the genuinely hard case for any NER model, and you should not expect a clean 100%. When you actually know the roster, and for a meeting you usually do, the robust fix is not a better model but a name deny-list: Presidio takes one directly, so feeding it the participant list guarantees those names are caught regardless of how they read in context. This is built into redact-md: pass --names "Savannah Okafor, Will" (or --names-file roster.txt) and those names are always redacted, case-insensitively and including bare first names, alongside the normal detection.
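
The roster idea is simple enough to show without any NER at all. The sketch below is a standard-library stand-in, not redact-md's code and not Presidio's deny-list recognizer: it matches each roster entry exactly as written, on word boundaries, ignoring case. Whether a full-name entry should also cover its bare first name is left out here, so list bare names as separate entries.

```python
import re


def build_roster_pattern(names):
    """Compile a case-insensitive whole-word pattern from 'Name One, Name Two'."""
    entries = {name.strip() for name in names.split(",") if name.strip()}
    if not entries:
        raise ValueError("roster is empty")
    # Longest first, so "Savannah Okafor" wins over a shorter "Savannah" entry.
    ordered = sorted(entries, key=len, reverse=True)
    alternation = "|".join(re.escape(name) for name in ordered)
    return re.compile(rf"\b(?:{alternation})\b", re.IGNORECASE)


def redact_roster(text, names, mask="<PERSON>"):
    """Mask every roster name in text, regardless of context."""
    return build_roster_pattern(names).sub(mask, text)
```

The trade-off is visible with an entry like Will: case-insensitive matching also masks the ordinary word in “we will ship Friday.” That is over-redaction, which fits the priorities stated at the top, but it is worth knowing before you put common words on the list.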


Step 4: Recommendation

Daily driver: Presidio with the markdown-aware wrapper. 90% recall at 64 ms/doc. It catches 98% of person names (the PII type that dominates transcripts and notes), 100% of emails, 92% of phones, and 100% of SSNs, IBANs, and credit cards. It installs in one command, and code blocks survive.

Add an opf second pass when addresses matter. Presidio catches only 38% of street addresses. If a document involves HR onboarding, legal correspondence, shipping, or anything where a home or office address is likely, run opf afterward. At 20 seconds per transcript on CPU it is too slow to be your default, but it is fine to run on the documents you have flagged.

The 5-point recall gap between Presidio and opf is acceptable for most material and meaningful for some. For transcripts with no addresses and standard phone formats, the real-world gap is closer to 1 or 2 points. For HR or legal documents full of addresses, it is the difference that matters.


The tool: redact-md

I packaged the Presidio wrapper as a standalone CLI: redact-md. It is small (the interesting part is about fifty lines), and that is the point. It does one thing: markdown in, redacted markdown out, code blocks untouched.

Install with pipx. The [model] extra pulls in Presidio, spaCy, and en_core_web_lg in one shot:

pipx install "redact-md[model] @ git+https://github.com/jacksenechal/redact-md"

Then:

# Print to stdout
redact-md meeting.md

# Write to a new file
redact-md meeting.md -o meeting_redacted.md

# Overwrite in place (saves a backup as meeting.md.bak)
redact-md -i meeting.md

# Read from stdin, which is what you want for an agent hook
cat meeting.md | redact-md -

# Redact a whole folder
redact-md --dir ~/meetings/2026-06/ --out-dir ~/meetings/2026-06-redacted/

# Keep dates (and list or narrow the entity set)
redact-md --keep DATE_TIME meeting.md
redact-md --entities PERSON,EMAIL_ADDRESS meeting.md
redact-md --list-entities

# Always redact a known roster (catches names NER misses)
redact-md --names "Savannah Okafor, Will" meeting.md
redact-md --names-file roster.txt meeting.md

The stdin mode is the one I reach for most. It makes redact-md a clean filter you can drop in front of anything that ships text to a cloud model, including an agent pre-tool hook that scrubs a file before it is ever read into a prompt.
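
From inside a Python-based hook, the stdin mode can be called as a subprocess. A minimal sketch, assuming redact-md is on the PATH and that `redact-md -` reads stdin and writes the redacted text to stdout as shown above; the `scrub` helper and its `cmd` parameter are names invented for this example.

```python
import subprocess


def scrub(text, cmd=("redact-md", "-")):
    """Pipe text through a redaction filter and return its stdout.

    check=True makes this fail closed: if the filter exits non-zero,
    CalledProcessError is raised and no unredacted text is returned.
    """
    done = subprocess.run(
        list(cmd),
        input=text,
        capture_output=True,
        text=True,
        check=True,
    )
    return done.stdout
```

Failing closed is the important design choice. A hook that falls back to the raw file when the redactor crashes defeats the purpose, so let the exception propagate and block the upload instead.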