On-device pseudonymizer: built and benchmarked

2026-05-22 · Caiioo Team

We wanted to solve the privacy problem of AI systems training on and retaining real personally identifiable information. "Zero Data Retention" policies and agreements reduce risk, but they are riddled with exceptions. They essentially say: "We won't store any of your prompts or outputs except for: (insert large list of exceptions here: security purposes, government surveillance, litigation holds, product development, error logs, improving the services...)"

To solve this, we built a personal data filter that runs on the user's own machine, sees the message before it leaves the device, and returns the same answer the user would have received had it been typed without a filter at all.

Caiioo's next version will include the Pseudonymizer, accessible through the shield icon in agent chat as well as in settings.

This white paper outlines the rationale, architecture, evaluation process, and design principles behind our privacy filtering system.

What we set out to do

General privacy protection by data minimization. When a user chats with a remote model, the model does not need a real name, a home address, a real email, or a customer's phone number to answer the user's question. It needs the shape of the question, and it needs to look like a real question so that the LLM does not dismiss the query as a test. The filter therefore strips real identifiers from the data sent to the AI and stitches the real values back in on the way back. The model sees synthetic names and identifiers; the user sees the real conversation.

HIPAA compliance assistance. A second mode targets the 18 identifiers in the HIPAA Safe Harbor rule (§164.514) and the looser Limited Data Set variant. A clinician, a healthcare admin, or anyone working on a covered workflow can talk to a general-purpose model about real cases without sending protected health information to the AI. We are not the covered entity's compliance officer — but we can be the layer that keeps PHI from leaving the laptop in the first place. Our evaluations provide measurable benchmarks that organizations can use to assess the filter’s suitability for their compliance and privacy standards. All privacy and security measures are judgment calls that are the responsibility of the user or user entity.

We chose both because the engineering is the same. The PHI use case is actually technically easier: it deals in distinct named entities, which are easier to redact than the more general data in the Personal Data category, and HIPAA workflows are generally in English or Spanish. In both modes we detect identifiers, substitute stable pseudonyms in the same format as the originals, restore the real values on the way back, and never log them. The map between the synthetic information and the real information stays only on the user's device, so that the user can read real information in the agent’s responses. The category list and the policy gate are what change between modes.
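As a rough illustration of "the category list and the policy gate are what change between modes", here is a minimal sketch. The category names are the ones used in the evaluation tables in this post; the mode keys, the dictionary layout, and the should_redact helper are hypothetical and are not Caiioo's actual configuration. Geography categories for Safe Harbor are left out because the tables do not name them.

```python
# Hypothetical policy gate: which detected categories get pseudonymized per mode.
# Category names mirror the evaluation tables; everything else is an assumption.
PERSONAL_DATA = frozenset({
    "biometric_id", "email_address", "government_id", "person_name",
    "phone_or_fax", "postal_address", "precise_geolocation", "birth_date",
    "ip_address", "online_handle", "vehicle_id", "authentication_secret",
    "financial_account", "institutional_id", "device_id",
})

LIMITED_DATA_SET = frozenset({
    "biometric_id", "email_address", "medical_record_number", "person_name",
    "phone_or_fax", "social_security_number", "account_number",
    "health_plan_id", "ip_address", "license_number", "vehicle_id",
    "device_id", "photo",
})

# Safe Harbor additionally redacts what the Limited Data Set is allowed to keep.
SAFE_HARBOR = LIMITED_DATA_SET | {"age_over_89", "general_date"}

POLICY = {
    "personal_data": PERSONAL_DATA,
    "hipaa_limited_data_set": LIMITED_DATA_SET,
    "hipaa_safe_harbor": SAFE_HARBOR,
}


def should_redact(mode: str, category: str) -> bool:
    """Return True if a detection of this category is masked in this mode."""
    return category in POLICY[mode]
```

With this layout the detector and the substitution step stay identical across modes; only the lookup table differs.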

Why regex plus machine learning, and not just one of them

There are two main filter technologies in our pseudonymizer: regex (regular expressions), a deterministic pattern-matching language, and trained machine learning models.

Regex is unbeatable on surface formats. An email address has a shape (local-part@domain). So does an IP, a credit card, an IBAN, a VIN, an SSN (XXX-XX-XXXX), an API key. If the format is reliable, regex catches it deterministically, every time, with no model load and no inference cost.
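To make the "shape" idea concrete, here is a minimal sketch of a regex layer for three of the formats mentioned above. The patterns are simplified illustrations written for this example, not the patterns in Caiioo's regex library, and the function name find_surface_formats is hypothetical.

```python
import re

# Simplified, illustrative patterns; production patterns would be stricter.
OCTET = r"(?:25[0-5]|2[0-4]\d|1?\d?\d)"
PATTERNS = {
    "email_address": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "social_security_number": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "ip_address": re.compile(rf"\b(?:{OCTET}\.){{3}}{OCTET}\b"),
}


def find_surface_formats(text: str) -> list[tuple[int, int, str, str]]:
    """Return (start, end, category, matched_text) for every pattern hit, in text order."""
    hits = []
    for category, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((match.start(), match.end(), category, match.group()))
    return sorted(hits)
```

The same input always yields the same hits, which is the property the regex layer is relied on for.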

Regex is, however, hopeless on context. "Sarah's chart from last Tuesday" contains a person and a date, but neither one can be distinguished from its format alone. "The patient at 14 Elm" contains an address that overlaps a thousand non-addresses. "Their MRN is 7741032" needs the words around the number to mean anything.

A small fine-tuned language model handles context. Ours is a specially trained 110M-parameter encoder distilled from a multilingual tiny language model. It reads the sentence, not the substring, and is small enough to run extremely fast on a device, even a mobile phone.

The benchmarks illustrate the complementary strengths of the two layers. We benchmarked the two layers in isolation to make sure each is actually pulling its weight. On the 150-question PrivacyBench-PD question set where the questions and answers were explicitly NOT used to train the model:

Layer	PII caught	Caught rate
Regex only	205 / 670	30.6 %
ML only	516 / 670	77.0 %
Both (production)	625 / 670	93.3 %

Regex alone misses roughly seven in ten identifiers (it catches 30.6%) because most identifiers in real prose aren't structured. ML alone misses about 16 percentage points that the regex layer would have caught: the identifiers that are defined purely by their shape (a credit card looks like a credit card; the model has no extra signal to add). Together they cover what neither covers alone.

Looking deeper at where each layer is decisive: across that same set, 16 test values were caught only by regex (emails, IPs, financial accounts, structured IDs) and 327 were caught only by ML (names, contextual identifiers, multilingual phrasings).
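A breakdown like this can be computed with plain set arithmetic once each layer has been run in isolation. The sketch below assumes each layer's output has already been reduced to the set of gold values it masked; the function name and the toy values are hypothetical.

```python
def layer_breakdown(gold: set[str], regex_hits: set[str], ml_hits: set[str]) -> dict[str, int]:
    """Count gold values caught by one layer only, by both layers, or by neither."""
    regex_caught = gold & regex_hits
    ml_caught = gold & ml_hits
    return {
        "regex_only": len(regex_caught - ml_caught),
        "ml_only": len(ml_caught - regex_caught),
        "both": len(regex_caught & ml_caught),
        "missed": len(gold - regex_caught - ml_caught),
    }


# Toy example, not benchmark data.
gold = {"a@x.io", "Sarah", "7741032", "10.0.0.7"}
breakdown = layer_breakdown(
    gold,
    regex_hits={"a@x.io", "10.0.0.7"},
    ml_hits={"Sarah", "7741032", "a@x.io"},
)
```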

Small, smart, fast — and runs everywhere

We had to make the filter run on-device, which was an engineering challenge.

It has to be small because our app runs on-device on many systems: inside a browser extension, as a macOS, iOS, or Android app, and on Windows and Linux. The bundle is about 113 MB per model. There are two models (one for general personal data, one for protected health information), and Safe Harbor mode runs both in parallel. Of these targets, a low-end Android device is the least performant, and the filter still works fine there.

It has to be smart because false negatives leak real data to a remote LLM and false positives wreck the conversation. Names must be redacted; pronouns must not be. The doctor's email in a forwarded thread should be redacted; the [email protected] footer probably shouldn't.

It has to be fast because it sits directly in the path of every message the user sends or receives. We measured the round-trip overhead at under 200 ms on a single CPU thread, mostly tokenization. On WebGPU and on Apple's Neural Engine the overhead is tiny compared with the network latency of the LLM call itself.

It has to run in multiple runtimes because Caiioo is multi-platform. The same system runs in Chrome extensions, macOS and iOS, Android, and on Windows and Linux. One detection model, one regex library, one merger, one policy — identical behavior across every surface where Caiioo runs.

The scores

After rounds of testing and training, we settled on the 16th version of our models. Below are three benchmarks, each measuring something different.

Our own test set, 150 questions the model has never seen

Before testing against public benchmarks, we ran our Personal Data filter against an internal test set we deliberately kept out of the training data — so the detector has never seen any of these questions before. 150 questions split into four groups (code snippets, document prose, intentionally unfamiliar phrasings, and 10 non-English languages), plus a "negative" group that contains no private data at all (a sanity check that we don't over-redact). Combined regex + ML pipeline:

Sub-bench	Caught	Rate
code_bench	69 / 74	93.2 %
doc_bench	233 / 247	94.3 %
generalization_bench	123 / 133	92.5 %
multilingual_bench	200 / 216	92.6 %
All 4 positive benches	625 / 670	93.3 %

(The negative group had no private data to find. The filter masked one thing it shouldn't have — consistent with the precision numbers further down.)

The grader is strict: every expected piece of private data has to fully disappear from the masked output for the question to score. No partial credit, no "close enough." That's harsher than benchmarks that ask another LLM to be the judge (LLM judges tend to be generous). For directly comparable numbers against other systems, see the head-to-head section further down.
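The all-or-nothing rule can be expressed in a few lines. This is a sketch of the idea only: the function name is hypothetical, and case-insensitive matching is an assumption made here, not something the post specifies.

```python
def strict_pass(masked_output: str, expected_private: list[str]) -> bool:
    """True only if none of the expected private values survives in the masked text.

    Assumption: matching is case-insensitive substring search.
    """
    haystack = masked_output.casefold()
    return all(value.casefold() not in haystack for value in expected_private)
```

A single surviving value fails the whole question, which is why this kind of check tends to score lower than an LLM judge on the same output.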

PrivacyBenchHIPAA — 40 healthcare questions

Each question lists protected health information that must be redacted (names, medical record numbers, etc.) AND signals that must be kept (dates, geography, age if under 90 — the HIPAA Limited Data Set rule). The grader checks both directions: did we remove what we should have removed, and did we leave alone what we should have left alone?

Mode	PHI redacted	Retained kept
Limited Data Set (preserve dates / geography / age ≤89)	79 / 79 (100 %)	34 / 34
Safe Harbor (redact everything including dates)	99 / 104 (95.2 %)	—

The Limited Data Set submode is perfect on this benchmark. Safe Harbor — which has more to redact, so more opportunities to miss — covers 95.2%.
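A two-direction check of this kind can be sketched as follows. The function name, the result keys, and the sample values are hypothetical; exact substring matching is an assumption.

```python
def grade_both_directions(masked: str, must_redact: list[str], must_keep: list[str]) -> dict:
    """Count PHI values that disappeared and retained signals that survived."""
    redacted = sum(1 for value in must_redact if value not in masked)
    kept = sum(1 for value in must_keep if value in masked)
    return {
        "phi_redacted": (redacted, len(must_redact)),
        "retained_kept": (kept, len(must_keep)),
    }


# Illustrative Limited Data Set case: name and MRN must go, date and state must stay.
result = grade_both_directions(
    "Maya Hartwell was admitted on 2024-03-02 in Ohio; MRN 0000000.",
    must_redact=["Sarah Goldberg", "7741032"],
    must_keep=["2024-03-02", "Ohio"],
)
```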

Category-by-category results on our own data, 200 samples per mode

Public benchmarks lump everything together. Our internal test data breaks results out by category (names, emails, addresses, and so on) and runs each one three ways: regex only, ML only, and both together. That tells us exactly which technology is catching which kind of identifier — and where each one needs the other. Most recent run, 2026-05-20:

Summary across all three filter modes

Mode	Combined recall	Precision	F2	Samples
Personal Data filter	97.3 %	97.8 %	97.4	200
HIPAA filter — Limited Data Set	92.3 %	92.3 %	92.3	200
HIPAA filter — Safe Harbor	91.9 %	91.5 %	91.8	200

These numbers exclude URLs, which we deliberately leave alone — clobbering a URL would break downstream actions like "open this link" or "fetch that page." More on that in the workflow section below.
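For readers unfamiliar with the F2 column: it is the F-beta score with beta = 2, which weights recall more heavily than precision. The sketch below uses the standard formula; the function name is ours. Fed the rounded precision and recall from the summary table, it reproduces the F2 values shown there.

```python
def f_beta(precision: float, recall: float, beta: float = 2.0) -> float:
    """Standard F-beta score; beta > 1 weights recall more heavily than precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


personal_data_f2 = f_beta(precision=0.978, recall=0.973)  # about 0.974
safe_harbor_f2 = f_beta(precision=0.915, recall=0.919)    # about 0.918
```

Weighting recall is a natural choice for a privacy filter, where a missed identifier costs more than an unnecessary mask.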

The big picture before the per-category tables: in every mode, the identifiers that actually identify a person are caught at or near 100% of the time. Names, emails, phone numbers, postal addresses, government IDs, biometric IDs, precise geolocation, medical record numbers, dates of birth, and social security numbers are caught on every sample, every test. The categories where we slip below 100% are mostly predictable: device IDs (huge variety of formats in real text), miscellaneous institutional IDs (loyalty numbers, employee IDs; same problem), and photos (a text-only filter can't see what's inside an image). The one Tier 1 exception is financial accounts, at 90.0% combined recall in the Personal Data run. None of the others is the "name on a chart" or "email in a draft" identifier that actually matters for leakage. With that exception, the high-stakes categories are the reliable ones.

Personal Data filter

Sorted by combined recall (best first), then by risk tier (T1 = most sensitive).

Category	Tier	Regex recall	ML recall (raw)	Combined recall	Gold n
biometric_id	T1	100.0 % (10/10)	100.0 % (10/10)	100.0 %	10
email_address	T1	100.0 % (20/20)	100.0 % (20/20)	100.0 %	20
government_id	T1	100.0 % (10/10)	100.0 % (10/10)	100.0 %	10
person_name	T1	13.3 % (4/30)	96.7 % (29/30)	100.0 %	30
phone_or_fax	T1	100.0 % (10/10)	100.0 % (10/10)	100.0 %	10
postal_address	T1	0.0 % (0/10)	100.0 % (10/10)	100.0 %	10
precise_geolocation	T1	100.0 % (10/10)	100.0 % (10/10)	100.0 %	10
birth_date	T2	0.0 % (0/10)	100.0 % (10/10)	100.0 %	10
ip_address	T2	100.0 % (10/10)	100.0 % (10/10)	100.0 %	10
online_handle	T2	40.0 % (4/10)	100.0 % (10/10)	100.0 %	10
vehicle_id	T2	50.0 % (5/10)	100.0 % (10/10)	100.0 %	10
authentication_secret	T4	40.0 % (4/10)	100.0 % (10/10)	100.0 %	10
financial_account	T1	90.0 % (18/20)	100.0 % (20/20)	90.0 %	20
institutional_id	T3	80.0 % (8/10)	90.0 % (9/10)	90.0 %	10
device_id	T3	40.0 % (4/10)	50.0 % (5/10)	80.0 %	10

Reading across, the two-layer design is paying off. Postal addresses, birth dates, and person names score 0–13% under regex alone: there's no shape to match, so only the ML model can catch them. Emails, phone numbers, IPs, government IDs, biometric IDs, and precise geolocation score 100% under regex alone; these are surface formats, so the ML model has nothing extra to do. Online handles, vehicle IDs, and authentication secrets are mixed: regex catches the standard forms, ML catches the rest. The combined recall meets or exceeds whichever single layer is stronger in every category except financial_account, where the combined pipeline (90.0%) lands below ML alone (100.0%).

Device IDs and miscellaneous institutional IDs fall below 100%, and we know why: those have the widest variety of formats in real text. Financial accounts also come in at 90.0% combined in this run. We'd rather be honest about the categories where recall slips than pretend the filter is perfect everywhere.

HIPAA filter — Limited Data Set submode

The Limited Data Set submode preserves dates, geography, and ages 89 or younger by design — those are the signals HIPAA allows an organization to keep for legitimate clinical research and operations.

Category	Tier	Regex recall	ML recall (raw)	Combined recall	Gold n
biometric_id	T1	100.0 % (12/12)	100.0 % (12/12)	100.0 %	12
email_address	T1	100.0 % (13/13)	100.0 % (13/13)	100.0 %	13
medical_record_number	T1	100.0 % (26/26)	100.0 % (26/26)	100.0 %	26
person_name	T1	15.4 % (4/26)	100.0 % (26/26)	100.0 %	26
phone_or_fax	T1	100.0 % (13/13)	100.0 % (13/13)	100.0 %	13
social_security_number	T1	100.0 % (12/12)	100.0 % (12/12)	100.0 %	12
account_number	T2	0.0 % (0/13)	100.0 % (13/13)	100.0 %	13
health_plan_id	T2	0.0 % (0/13)	100.0 % (13/13)	100.0 %	13
ip_address	T2	100.0 % (12/12)	100.0 % (12/12)	100.0 %	12
license_number	T2	0.0 % (0/12)	100.0 % (12/12)	100.0 %	12
vehicle_id	T2	25.0 % (3/12)	100.0 % (12/12)	100.0 %	12
device_id	T3	41.7 % (5/12)	100.0 % (12/12)	100.0 %	12
photo	T2	0.0 % (0/12)	0.0 % (0/12)	0.0 %	12

Photos are a known miss — a text-only filter can't see what's inside an image. Image-PHI is a separate problem we haven't shipped yet. Every other category in this mode is at 100%.

HIPAA filter — Safe Harbor submode

Safe Harbor strips everything the Limited Data Set submode would have kept — dates, ages over 89, geography. To get the strictest coverage, it runs both filter models in parallel: the HIPAA-specific one and the general personal-data one.

Category	Tier	Regex recall	ML recall (raw)	Combined recall	Gold n
age_over_89	T1	100.0 % (18/18)	0.0 % (0/18)	100.0 %	18
biometric_id	T1	100.0 % (9/9)	100.0 % (9/9)	100.0 %	9
email_address	T1	100.0 % (10/10)	100.0 % (10/10)	100.0 %	10
medical_record_number	T1	100.0 % (20/20)	100.0 % (20/20)	100.0 %	20
person_name	T1	20.0 % (4/20)	100.0 % (20/20)	100.0 %	20
phone_or_fax	T1	100.0 % (10/10)	100.0 % (10/10)	100.0 %	10
social_security_number	T1	90.0 % (9/10)	100.0 % (10/10)	100.0 %	10
account_number	T2	0.0 % (0/10)	100.0 % (10/10)	100.0 %	10
general_date	T2	100.0 % (27/27)	29.6 % (8/27)	100.0 %	27
health_plan_id	T2	0.0 % (0/10)	100.0 % (10/10)	100.0 %	10
ip_address	T2	100.0 % (9/9)	100.0 % (9/9)	100.0 %	9
license_number	T2	0.0 % (0/10)	100.0 % (10/10)	100.0 %	10
vehicle_id	T2	10.0 % (1/10)	100.0 % (10/10)	100.0 %	10
device_id	T3	22.2 % (2/9)	88.9 % (8/9)	77.8 %	9
photo	T2	0.0 % (0/9)	0.0 % (0/9)	0.0 %	9

The two interesting rows are general dates (regex 100%, ML 30%) and ages over 89 (regex 100%, ML 0%). We deliberately let regex handle both in Safe Harbor: dates have a shape regex catches every time, and we don't want a probabilistic model second-guessing a numeric threshold like ">89". A deterministic rule is more reliable than asking the ML model to learn the same rule.
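Here is a minimal sketch of what a deterministic ">89" rule could look like. The phrasings covered ("92-year-old", "92 years old", "aged 91", "age 90") are illustrative assumptions; the actual rule set is not described in this post, and the function name is hypothetical.

```python
import re

# Matches "<n>-year-old", "<n> years old", "age <n>", "aged <n>" (illustrative only).
AGE_PATTERN = re.compile(
    r"\b(?:(?P<a>\d{2,3})[- ]years?[- ]old|aged?\s+(?P<b>\d{2,3}))\b",
    re.IGNORECASE,
)


def ages_over_89(text: str) -> list[tuple[int, int, str]]:
    """Return (start, end, matched_text) for every age mention strictly above 89."""
    spans = []
    for match in AGE_PATTERN.finditer(text):
        age = int(match.group("a") or match.group("b"))
        if age > 89:  # the numeric threshold is applied in code, not learned
            spans.append((match.start(), match.end(), match.group()))
    return spans
```

Because the threshold is an integer comparison, an age of 89 is never flagged and an age of 90 always is, with no model confidence involved.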

Overall numbers across all categories

Adding it all up: how does the full pipeline (regex + ML together) compare to either layer alone?

Mode	Layers	Recall	Precision	F2
Personal Data	regex only	65.8 %	93.0 %	69.9 %
Personal Data	ML only	95.4 %	92.4 %	94.8 %
Personal Data	both (full)	96.9 %	98.0 %	97.1 %
Limited Data Set	regex only	55.9 %	95.0 %	60.9 %
Limited Data Set	ML only	92.9 %	84.5 %	91.0 %
Limited Data Set	both (full)	92.9 %	89.3 %	92.1 %
Safe Harbor	regex only	58.9 %	93.6 %	63.6 %
Safe Harbor	ML only	82.4 %	88.3 %	83.5 %
Safe Harbor	both (full)	92.4 %	88.9 %	91.7 %

The Personal Data filter's "full" row beats both single-layer versions on every metric — combining regex (for surface formats) with the ML model (for context) genuinely provides something that neither layer alone can. The 97.3% headline earlier in this post is the number that reflects what a real user gets. The number in the table above is a bit lower only because it includes the URL category, which we deliberately preserve so we don't break links and tool calls.

Head-to-head against other dedicated privacy filters

The fair comparison for an on-device, near-instant privacy filter like ours is against other on-device, near-instant privacy filters — not against giant cloud-hosted LLMs that take seconds per message and need a network round-trip. We ran every system in this class against the same test sets, using the same matching rules. Same standard for everyone.

The peer class:

openai/privacy-filter — OpenAI's open-source dedicated privacy filter. About 50 million parameters, small enough to run in any modern browser.
piiranha-v1 — a 278M-parameter detector from iiiorg. Its license restricts it to research and evaluation only (we can measure it, but it cannot be shipped commercially).
Microsoft Presidio — the most widely-deployed open-source redactor, combining traditional pattern matching with a small language model for context.
GLiNER PII family — a family of small, general-purpose entity classifiers. Knowledgator ships small (~44M), base (~86M), and large (~304M) variants; NVIDIA released a 570M variant in October 2025.
Caiioo in all three modes (Personal Data, HIPAA Limited Data Set, HIPAA Safe Harbor).

Recall across all five test sets, sorted with Caiioo first:

System	PrivacyBench PD-25	Caiioo synthetic PD-200	PrivacyBenchHIPAA-40	Caiioo synthetic PHI-200	Multilingual PD-40 (10 locales)
Caiioo — Personal Data	96.2 % (76/79)	99.0 % (198/200)	—	—	92.6 % (200/216)
Caiioo — HIPAA Limited Data Set	—	—	100.0 % (79/79)	100.0 % (200/200)	—
Caiioo — HIPAA Safe Harbor	—	—	100.0 % (79/79)	100.0 % (200/200)	—
openai/privacy-filter (50M)	96.2 % (76/79)	83.0 % (166/200)	93.7 % (74/79)	77.0 % (154/200)	94.9 % (205/216)
gliner_pii_nvidia (570M)	94.9 % (75/79)	85.5 % (171/200)	84.8 % (67/79)	85.0 % (170/200)	76.9 % (166/216)
gliner_pii_large (~304M)	72.2 % (57/79)	86.5 % (173/200)	84.8 % (67/79)	93.0 % (186/200)	50.0 % (108/216)
gliner_pii_base (~86M)	87.3 % (69/79)	66.0 % (132/200)	74.7 % (59/79)	66.0 % (132/200)	51.4 % (111/216)
gliner_pii_small (~44M)	88.6 % (70/79)	84.5 % (169/200)	91.1 % (72/79)	83.0 % (166/200)	68.5 % (148/216)
Microsoft Presidio	82.3 % (65/79)	76.5 % (153/200)	84.8 % (67/79)	76.5 % (153/200)	69.0 % (149/216)
piiranha-v1 (~278M)	60.8 % (48/79)	58.5 % (117/200)	43.0 % (34/79)	47.0 % (94/200)	82.4 % (178/216)

Caiioo leads the dedicated-filter class on the two largest tests (PD-200 and PHI-200), ties or leads on the public benchmarks, and is second on multilingual. On the smallest test (PrivacyBench PD-25, just 25 questions) Caiioo and openai/privacy-filter tie at 96.2%. On multilingual, openai/privacy-filter still leads at 94.9% with Caiioo at 92.6%; the language where we trail most is Chinese, and everywhere else we sit at or near the top. If multilingual coverage is mission-critical, openai/privacy-filter is a reasonable alternative. For most other workloads in this class, Caiioo is the stronger choice.

The HIPAA result is the headline. Both Caiioo HIPAA modes hit 100% recall on both HIPAA test sets in the table above: every patient name, every medical record number, every date of birth, every account number is caught. The second-best system is openai/privacy-filter at 93.7% on PrivacyBenchHIPAA, a 6.3-point gap on a benchmark where every miss is a real-world disclosure.

A second number worth reading: over-redaction — masking things that weren't actually private data. Over-redaction isn't a privacy harm, it's a usability cost. Mask too many things and the LLM's reasoning gets worse, and the returned answer is degraded. Caiioo masks unnecessarily 1–24 times across the test sets. Presidio: 10–51. NVIDIA's GLiNER: 31–64 on the HIPAA tests alone. Precision matters as much as recall when the goal is the best possible answer with the least possible exposure.

What about just using a frontier LLM as the filter?

You can, and on raw recall they win. Big general-purpose LLMs (Llama 3.1 8B, Gemma 4, Qwen 3.5 9B, and similar), running either in the cloud or locally, can score 95–100% across every test including multilingual. That is a real option for users who need maximum recall and are willing to pay for it.

The drawbacks are real, though:

It's slow. Seconds per message instead of milliseconds. The filter sits in front of every message the user sends.
Either the message leaves the user's machine, or the model does. To filter in the cloud, the message has to go up there, which defeats the purpose. Filtering on-device requires downloading a 1–17 GB model.
It can be tricked. A generative model can be talked, mid-message, into not redacting (a "prompt injection" attack). A small classifier like ours can't.
Same input, different output. Generative models don't always give the same answer twice. That breaks the round-trip — masking on the way out and unmasking on the way back relies on the same real value always mapping to the same fake.

Caiioo is built for the other side of that tradeoff: a tiny, predictable, sub-second filter that runs in the composer before the user presses send, and that always produces the same fake for the same real value within a conversation, so the round-trip stays coherent. The peer-class table above is the apples-to-apples comparison for that kind of use case.

The proof is in the pudding

Benchmarks are a starting point, not a finish line. The filter is wired into Caiioo's new feature, the Pseudonymizer: the component that actually sits between the composer and the model.

Here is what happens when the user presses send.

Detect. The regex layer runs first — it's deterministic and microsecond-fast. The ML model runs next on whatever's left. If the two layers overlap on the same piece of text, we use a simple rule: regex wins on surface formats, ML wins on context.
Tag self vs. other. Caiioo separates identifiers that refer to the user from identifiers that refer to other people. The user can choose to redact only one, or both. Names the user has added to a personal dictionary always count as "self."
Substitute. Each real value gets a stable, style-matched pseudonym. "Sarah Goldberg" becomes "Maya Hartwell" — and stays "Maya Hartwell" for the whole conversation, so the model's reasoning about that person doesn't fragment across turns. A real-to-fake lookup table is kept on the user's device, encrypted with a key from the platform's keychain.
Send. The model receives a fully fake message. No real identifier ever crosses the network, and our audit log records counts only — never the values themselves.
Restore. The streaming response is mapped back as it arrives. "Maya Hartwell" in the model's reply becomes "Sarah Goldberg" before it reaches the screen, rendered with a small glow pill so the user can see at a glance what was protected.
Restore tool arguments too. If the model calls a tool — send an email, file a ticket, write to a doc — and the arguments contain fakes, we substitute the real values back before the tool runs. The model reasons over fakes; the action takes the real value.
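The overlap rule in the Detect step ("regex wins on surface formats, ML wins on context") can be sketched as a priority-based merge. Everything here is an assumption made for illustration: the Span type, the SURFACE_FORMATS set, and the greedy resolution strategy are not taken from Caiioo's merger.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int
    end: int       # exclusive
    category: str
    source: str    # "regex" or "ml"


# Hypothetical list of categories treated as reliable surface formats.
SURFACE_FORMATS = {"email_address", "ip_address", "phone_or_fax", "social_security_number"}


def overlaps(a: Span, b: Span) -> bool:
    return a.start < b.end and b.start < a.end


def priority(span: Span) -> int:
    """Regex on a surface format beats ML; ML beats regex on anything else."""
    if span.source == "regex" and span.category in SURFACE_FORMATS:
        return 2
    return 1 if span.source == "ml" else 0


def merge(regex_spans: list[Span], ml_spans: list[Span]) -> list[Span]:
    """Greedily keep the highest-priority span in each overlapping group."""
    candidates = sorted(
        list(regex_spans) + list(ml_spans),
        key=lambda s: (-priority(s), s.start, -(s.end - s.start)),
    )
    kept: list[Span] = []
    for span in candidates:
        if not any(overlaps(span, k) for k in kept):
            kept.append(span)
    return sorted(kept, key=lambda s: s.start)
```

For "Email sarah@example.org about Sarah Goldberg", a regex email span covering the whole address would beat an ML name span that fires on "sarah" inside it, while the ML span for "Sarah Goldberg" would beat a shorter regex hit on "Sarah".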

The filter doesn't care which AI service is in use. It runs before the message ever reaches the model, so OpenRouter, Anthropic, Google, OpenAI, and a local Ollama all get the same masked payload. Adding a new provider doesn't reopen the privacy hole.

Who it protects

The user. A user's name, email, address, phone, IP, and biometric identifiers — the things that, taken together, identify a person to an aggregator — never leave the device when the filter is on.

The people the user talks about. Most privacy tools focus on the person typing, but what they ignore is the 'social contract' — the fact that we all owe a responsibility to others as well as to ourselves. Submitting "Please analyze Mr. Saunders's conduct for incompetence" to an LLM, where it may be recorded in system logs indefinitely, is irresponsible (and potentially defamatory). Asking an LLM for help with a Google Sheet that contains 1,000 business contacts deposits all of them into the data-retention flypaper (to varying extents, depending on the actual 'zero data retention' in effect). Caiioo's filter also covers third parties: the client a contract is being drafted for, the patient whose record is being reasoned through, the colleague whose email was pasted into context. They did not consent to a remote LLM seeing their identifiers. The filter respects that by default; the user can switch to "self only" or "others only" if a workflow requires it.

Entities — businesses, hospitals, firms. Account numbers, license numbers, medical record numbers, financial routing details, internal IDs, API keys, customer rosters. A business has the same data-minimization interest a person does. A clinician using Caiioo to draft a discharge note isn't shipping the patient's medical record number to OpenAI. A lawyer isn't shipping the client's account number to Anthropic. A support engineer isn't shipping the API key from the customer's log to Google. The filter doesn't ask whether the identifier belongs to a person or an entity — it just keeps it on the device.

Maximum utility, minimum exposure

The whole point is that the user shouldn't feel the filter.

Most privacy tools force a choice: redact aggressively and watch the model's answer get worse, or keep the prompt usable and watch the privacy promise erode. We rejected that tradeoff. The model still gets a fully-formed prompt — it sees a name, a place, a date, a medical record number, all in the right grammatical positions. It just sees fake ones. Its reasoning is identical; only the strings are scrubbed.

Stable substitution is what makes that work. Because the same real value always maps to the same fake within a conversation — across the user's message, the tool result that comes back referencing that name, the model's earlier reply that mentioned it — the model has a coherent person, place, or thing to reason about. Multi-turn conversations don't get scrambled. Tool calls don't break. Sub-agents inherit the parent conversation's map and stay consistent across the whole task.
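A minimal sketch of a stable two-way map follows, assuming a simple in-memory dictionary and a pre-supplied pool of fake values. The class name, the pool, and the plain string replacement are hypothetical; the encrypted on-device storage and style matching described in this post are not shown.

```python
class PseudonymMap:
    """Per-conversation map: the same real value always gets the same fake."""

    def __init__(self, fake_pool: dict[str, list[str]]):
        # fake_pool maps a category to unused fake values (hypothetical source).
        self._pool = {cat: list(values) for cat, values in fake_pool.items()}
        self._real_to_fake: dict[str, str] = {}
        self._fake_to_real: dict[str, str] = {}

    def fake_for(self, real: str, category: str) -> str:
        if real not in self._real_to_fake:
            fake = self._pool[category].pop(0)  # raises IndexError if the pool runs dry
            self._real_to_fake[real] = fake
            self._fake_to_real[fake] = real
        return self._real_to_fake[real]

    def mask(self, text: str, detections: list[tuple[str, str]]) -> str:
        """Replace each detected (real_value, category) pair, longest value first."""
        for real, category in sorted(detections, key=lambda d: -len(d[0])):
            text = text.replace(real, self.fake_for(real, category))
        return text

    def restore(self, value):
        """Map fakes back to real values in a string, list, or dict (e.g. tool arguments)."""
        if isinstance(value, str):
            for fake in sorted(self._fake_to_real, key=len, reverse=True):
                value = value.replace(fake, self._fake_to_real[fake])
            return value
        if isinstance(value, dict):
            return {key: self.restore(item) for key, item in value.items()}
        if isinstance(value, list):
            return [self.restore(item) for item in value]
        return value
```

Because the map is consulted before a new fake is drawn, a name mentioned in turn one and again in turn five resolves to the same pseudonym, and restoring nested tool arguments uses the same reverse lookup as restoring reply text.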

The output the user sees is the real conversation. The data the provider sees is a coherent fiction. The work happens in the gap between those two views, and the goal — the only goal — is to make that gap invisible.

A privacy filter that gets in the way will get turned off. A privacy filter that disappears into the workflow is the only kind worth shipping.

That's the bar we built to.