OpenAI's Privacy Filter, Tested on Real Documents
OpenAI quietly shipped a small open-weight PII detection model on Hugging Face: 1.5 billion parameters total, around 50 million active per token, Apache 2.0, runs on CPU and in the browser via WebGPU. It detects eight categories — person, address, email, phone, URL, date, account number, secret — and outputs token-level spans with confidence scores. Permissively licensed local PII detection at this size is genuinely new on the open-source map.
We pay attention to detector quality because RedMatiq runs entity recognition on every document a user hands it. So we spent two days benchmarking OpenAI's Privacy Filter against our existing pipeline — multilingual BERT NER (bert-base-multilingual-cased-ner-hrl) plus Microsoft Presidio for structured types with check-digit validation. Eighteen real documents: eleven plain text, two markdown, four PDFs (Docling-parsed), two .docx. Sixteen English, two German. MacBook Pro M2 Pro, CPU only.
Both detectors received the same parsed text. Here's what we found.
What we found
Latency on CPU is the dominant signal. Median: 157 ms/doc for Presidio + BERT-multilingual, 3,194 ms for OpenAI's — 20× slower. p95: 837 ms vs 18,329 ms. The longest document in the corpus, a 5,400-token veterans benefits appeal, took OpenAI's model 18.3 seconds; Presidio + BERT-multilingual took 837 milliseconds. Steady-state RSS: 900 MB vs 3.4 GB. Cold-start RSS: +475 MB vs +1,563 MB. On GPU these numbers improve by an order of magnitude or more, but for a Mac app running on the user's machine, the CPU number is the one that matters.
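The latency figures above are order statistics over per-document timings. As a sketch of how such a summary is computed (the timings below are illustrative placeholders, not the measured corpus numbers):

```python
import math
import statistics

def latency_summary(timings_ms):
    """Median and nearest-rank p95 over a list of per-document latencies (ms)."""
    ordered = sorted(timings_ms)
    p95_rank = math.ceil(0.95 * len(ordered)) - 1  # nearest-rank method, 0-indexed
    return {"median_ms": statistics.median(ordered), "p95_ms": ordered[p95_rank]}

# Illustrative timings only; not the measured corpus numbers.
sample = [120, 140, 157, 160, 200, 300, 500, 700, 837, 900]
summary = latency_summary(sample)
```

With a corpus this small (18 documents), p95 is effectively the second-slowest document, so a single long outlier like the 5,400-token appeal dominates the tail.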
Structured formats break it. A PDF bank statement parsed into a markdown table — the kind of layout users actually paste in — was the single worst case. Presidio + BERT-multilingual caught 12 of 12 PII entities. OpenAI's model caught 2 of 12. Names in table cells, six ISO dates in a column, an account number — silently dropped. The model card documents this as "fragmented boundaries in mixed-format text"; the spike confirmed it isn't theoretical. Recall on that document: 0.20 against 1.00.
The multilingual collapse didn't happen. This was our largest pre-spike concern. OpenAI's model is documented English-primary; the BERT-multilingual NER we use was fine-tuned on ten languages including German. On a 2007 German residence registration, OpenAI identified the names, the multi-line German address, and the dates correctly. F1: 0.78 against 0.80 — within two percentage points. One document, so this isn't a multilingual claim, but the obvious failure mode didn't materialize.
OpenAI's model wins on two real things.
- Multi-line addresses come back as one cohesive span. "Hauptstraße 12, 2. Obergeschoss, links, 69117 Heidelberg" is one thing in the world; Presidio + BERT-multilingual returns it as four LOCATION fragments. Across the corpus: 19 cohesive private_address spans versus the equivalent fragmented LOCATION groups on our side.
- Natural-language multi-token dates: "June 1998 to June 2006," "February 2004 to January 2005." OpenAI's model caught 18 multi-token dates the regex layer missed entirely.
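One mitigation for the fragmentation side of this gap is post-hoc stitching: same-label spans separated only by a comma or line break can be merged into one span. A minimal sketch (merge_adjacent_spans is a hypothetical helper, and the character offsets are hand-computed for this example; it is not RedMatiq's actual pipeline code):

```python
def merge_adjacent_spans(spans, labels=("LOCATION",), max_gap=3):
    """Stitch same-label (start, end, label) spans separated by at most
    max_gap characters -- enough to bridge ', ' or a newline between
    address fragments."""
    out = []
    for start, end, label in sorted(spans):
        if out and label in labels and out[-1][2] == label \
                and start - out[-1][1] <= max_gap:
            prev_start, _, _ = out.pop()       # extend the previous span
            out.append((prev_start, end, label))
        else:
            out.append((start, end, label))
    return out

addr = "Hauptstraße 12, 2. Obergeschoss, links, 69117 Heidelberg"
frags = [(0, 14, "LOCATION"), (16, 31, "LOCATION"),
         (33, 38, "LOCATION"), (40, 56, "LOCATION")]
merged = merge_adjacent_spans(frags)  # collapses to a single span
```

The tradeoff is obvious: a small max_gap misses addresses split across formatting noise, a large one glues unrelated entities together. Learning cohesive spans, as OpenAI's model does, avoids that tuning entirely.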
Presidio + BERT-multilingual wins on the structured tier.
- Phone numbers: Presidio's pattern recognizers caught 11 unique phone numbers OpenAI missed — odd-format strings like "0253-4871-40" and "884-291-00".
- URLs: OpenAI's model returned zero private_url spans across the corpus, despite the model card listing the label. Presidio caught one.
- LOCATION and ORGANIZATION: OpenAI's taxonomy doesn't include either label. Across the corpus, BERT-multilingual found 85 LOCATION spans and 58 ORG spans, a mix of real PII (banks, hospitals, military units, employer names) and noise (document headings).
Agreement on the shared labels (PERSON, EMAIL, PHONE, DATE, URL, ACCOUNT_NUMBER): of 166 spans, 79 were exact-boundary matches (47.6%) and 54 were partial overlaps, for cumulative agreement of 80%. PERSON has the strongest agreement: 49 exact matches plus 46 partials out of roughly 100 spans. DATE has the most divergence: OpenAI's model catches 18 unique multi-token dates, the regex layer catches 8 unique short-form dates.
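For concreteness, here is the kind of matcher behind those exact/partial counts: greedy one-to-one pairing, exact boundaries first, then any character overlap. This is a sketch of the general technique, not the exact scoring script used in the spike:

```python
def classify_agreement(spans_a, spans_b):
    """Greedily pair (start, end) spans one-to-one across two detectors:
    an exact boundary match counts as 'exact'; any remaining character
    overlap counts as 'partial'; unmatched spans count as neither."""
    exact, partial = 0, 0
    remaining = list(spans_b)
    for a in spans_a:
        if a in remaining:               # identical boundaries
            remaining.remove(a)
            exact += 1
            continue
        hit = next((b for b in remaining if a[0] < b[1] and b[0] < a[1]), None)
        if hit is not None:              # overlapping but not identical
            remaining.remove(hit)
            partial += 1
    return exact, partial

result = classify_agreement([(0, 5), (10, 20), (30, 40)],
                            [(0, 5), (12, 18), (50, 60)])
```

Here (0, 5) matches exactly, (10, 20) partially overlaps (12, 18), and (30, 40) has no counterpart, so the result is one exact and one partial match.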
False positives, observed. "Heidelberg" labeled as private_person in a German signature line. "Phonak Audéo," a hearing-aid product brand, labeled as private_person in a veterans benefits appeal. Both are documented model-card failure modes — over-detection on capitalized non-name strings.
| Metric | Presidio + BERT-multilingual | OpenAI Privacy Filter |
|---|---|---|
| Median latency, CPU | 157 ms | 3,194 ms |
| p95 latency, CPU | 837 ms | 18,329 ms |
| Steady-state RSS | ~900 MB | ~3.4 GB |
| Bank statement F1 (PDF table) | 1.00 | 0.22 |
| German document F1 | 0.80 | 0.78 |
| English email F1 | 0.91 | 1.00 |
| Exact-boundary agreement (shared labels) | — | 79/166 (47.6%) |
What this means for RedMatiq
We're not adopting it. The 20× latency, the 3.4 GB resident memory cost, and the structured-format collapse on PDFs make it a regression for the Mac app's hot path. Twelve entities versus two on a real bank statement is the dealbreaker independent of everything else. Even as a parallel asynchronous detector, the overhead doesn't justify the gains — paying roughly three seconds and 2.5 GB of additional memory per document for cohesive address spans and multi-token date detection is too much for too little.
This reinforces a design choice we already made. Regex with check-digit validation and neural NER aren't substitutes; they have different failure modes. Luhn-validated credit cards, mod-97 IBANs, country-specific national IDs — the regex layer rejects pattern-matches that fail their structural test. A neural model can't run those algorithms. It catches natural-language references regex can't reach. Both layers are doing different work, and the named-type granularity — IBAN_CODE, US_SSN, AWS_SECRET_KEY — survives because the regex layer tagged it that way.
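The structural tests mentioned above are ordinary published algorithms, and both fit in a few lines. A minimal sketch of the two checks (the test values are well-known public examples, not data from the corpus):

```python
def luhn_valid(number: str) -> bool:
    """Luhn check-digit validation, as used for credit card numbers."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:          # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9          # equivalent to summing the two digits
        total += d
    return len(digits) > 1 and total % 10 == 0

def iban_valid(iban: str) -> bool:
    """ISO 13616 mod-97 IBAN validation."""
    s = iban.replace(" ", "").upper()
    if len(s) < 15 or not s[:2].isalpha() or not s[2:4].isdigit():
        return False
    rearranged = s[4:] + s[:4]                       # move country + check to the end
    numeric = "".join(str(int(c, 36)) for c in rearranged)  # A=10 ... Z=35
    return int(numeric) % 97 == 1
```

A string that looks exactly like a card number or IBAN but fails its check digit is rejected before it ever becomes a false positive; no neural model runs that arithmetic.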
We're watching quantization. If an int8 or fp16 OpenAI checkpoint ships and brings p95 latency under a second on Apple Silicon, the calculus changes. On B2B server hardware with GPUs, the latency picture is different again — different product, different tradeoff.
Methodology and limits
Eighteen documents, sixteen English and two German. Three with manual ground truth, one annotator. Default OpenAI operating point. CPU only, no MPS or CUDA, no quantization. Same parser for both detectors. Not enough for confident multilingual claims; enough to falsify the obvious failure modes and to size the latency, structured-format, and label-coverage gaps. We'll re-run with a larger and more multilingual corpus before drawing harder conclusions.
Related reading
- Why On-Device Redaction Matters — Why running everything on your machine changes the privacy calculus.
- Using AI with Confidential Documents — A practical guide to working with AI when your documents contain sensitive data.
- What Your PDF Knows About You — The metadata your documents carry that you didn't put there.
Layered detection, on your machine
RedMatiq runs the architecture this article describes — regex with check-digit validation for structured PII, multilingual neural NER for natural-language entities, both local. No upload, no API call, no third party seeing your raw documents.