ADR-0011: Auto-detect confidence scoring and table context
Status
Accepted — 2026-06-09
Context
Zero-config PII auto-detection (OpenSpec §9) matches column names against
a pattern library and applies masking actions without explicit YAML entries.
Column-name matching alone produces false positives (`products.description`
is not clinical notes) and false negatives (`patients.narrative` does not
match any name pattern).
The MVP must ship a boolean “mask or passthrough” outcome for operators, but the detection layer should not hard-code a single signal. Future improvements (table-context priors, value-shape stats, semantic embeddings) need a stable finding model.
Decision
- Finding objects, not bare actions. Every detection pass returns a `DetectionFinding` with `action`, `provider`, `confidence` (`high` | `medium` | `low`), and `reasons[]` explaining the score.
- Auto-mask only on `high`. `medium` findings are surfaced in `privaci dry-run --report` as “uncertain — manual review” and are left as passthrough at run time unless explicitly configured in YAML. `low` is passthrough.
- Table context as a confidence modifier (not a primary matcher). Structured PII patterns (`email`, `ssn`, `phone`, …) remain column-name driven with `high` confidence regardless of table name. Freeform / L2 (`ner_mask`) candidates additionally consider:
  - column type (`text` or `varchar` ≥ 500),
  - `pg_stats.avg_width` (≥ 200 → eligible for `high`),
  - table-name priors (`patient`, `user`, `product`, …) that raise or lower confidence without replacing the column-name match.
- Semantic / embedding analysis is deferred. No value sampling or external model calls in the MVP. A future scorer term (`w_semantic · embedding_similarity`) will plug into the same finding model; see revisit triggers below.
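
The finding model and the freeform scoring rules above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the project's implementation. Only the field names, the three tiers, and the 500 / 200 thresholds come from this ADR. The `score_freeform` and `effective_action` helpers, the `"ner"` provider value, the direction of each table-name prior, and the rule that a prior moves the tier by exactly one step are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import Optional


class Confidence(IntEnum):
    LOW = 0
    MEDIUM = 1
    HIGH = 2


@dataclass
class DetectionFinding:
    action: str                 # proposed masking action, e.g. "ner_mask"
    provider: str               # hypothetical value; the ADR only names the field
    confidence: Confidence
    reasons: list[str] = field(default_factory=list)


# Assumed prior directions: the ADR lists the names but not which way each one pushes.
SENSITIVE_TABLE_HINTS = ("patient", "user")   # assumed to raise confidence
BENIGN_TABLE_HINTS = ("product",)             # assumed to lower confidence


def score_freeform(table: str, column_type: str,
                   avg_width: Optional[int] = None,
                   varchar_len: Optional[int] = None) -> DetectionFinding:
    """Score a column whose name already matched a freeform pattern."""
    eligible_type = column_type == "text" or (
        column_type == "varchar" and varchar_len is not None and varchar_len >= 500
    )
    if not eligible_type:
        return DetectionFinding("ner_mask", "ner", Confidence.LOW,
                                ["column type is not text or varchar >= 500"])

    reasons = ["column type is text or varchar >= 500"]
    if avg_width is None:
        rank = 0
        reasons.append("pg_stats.avg_width missing")
    elif avg_width >= 200:
        rank = 2
        reasons.append(f"avg_width {avg_width} >= 200")
    else:
        rank = 0
        reasons.append(f"avg_width {avg_width} < 200")

    # The table name only shifts the tier; it never replaces the column-name match.
    name = table.lower()
    if any(hint in name for hint in SENSITIVE_TABLE_HINTS):
        rank = min(rank + 1, 2)
        reasons.append("table-name prior raises confidence")
    elif any(hint in name for hint in BENIGN_TABLE_HINTS):
        rank = max(rank - 1, 0)
        reasons.append("table-name prior lowers confidence")

    return DetectionFinding("ner_mask", "ner", Confidence(rank), reasons)


def effective_action(finding: DetectionFinding) -> str:
    """Only high-confidence findings are applied; everything else passes through."""
    return finding.action if finding.confidence == Confidence.HIGH else "passthrough"
```

Under these assumptions, `score_freeform("patients", "text", avg_width=350)` yields a `HIGH` finding that auto-masks, while `score_freeform("products", "text", avg_width=800)` yields `MEDIUM` and stays passthrough until YAML says otherwise. A later semantic term could add another entry to `reasons` and adjust `rank` without changing the shape of `DetectionFinding`.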
Consequences
- Operators get a reviewable middle tier instead of silent wrong masks on ambiguous columns.
- `products.description` with long marketing copy may land in `medium` rather than auto-`ner_mask`, prompting explicit YAML.
- Clinical `visit_notes` with stats present stays `high` and auto-masks.
- Strict mode (`strict_autodetect`) treats any `high` or `medium` finding not explicitly addressed in YAML as a validation failure (exit `3`).
- Catalog introspection gains a read of `pg_stats.avg_width`; missing stats (never-`ANALYZE` tables) yield `medium` for freeform candidates on sensitive tables, `low` otherwise.
- Revisit when: we add config packs with table-scoped patterns, ship value-shape sampling, or integrate an on-prem embedding model for semantic column classification.
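
The strict-mode and catalog consequences might look roughly like the sketch below. The names `strict_violations` and `validate`, the `(column, confidence)` tuple shape, and the `addressed` set are hypothetical. The query uses only the standard `pg_stats` view columns and is shown as a string with psycopg-style placeholders (an assumption about the driver); it is not executed here.

```python
# A column that has never been analyzed has no row in pg_stats,
# so a missing row is what "missing stats" means in this sketch.
AVG_WIDTH_QUERY = """
SELECT attname, avg_width
FROM pg_stats
WHERE schemaname = %s AND tablename = %s
"""

STRICT_EXIT_CODE = 3  # exit code named in this ADR


def strict_violations(findings, addressed):
    """Return columns with a high or medium finding and no explicit YAML entry.

    findings:  iterable of (qualified_column, confidence) pairs, where
               confidence is "high", "medium", or "low".
    addressed: set of qualified column names configured in YAML.
    """
    return [
        column
        for column, confidence in findings
        if confidence in ("high", "medium") and column not in addressed
    ]


def validate(findings, addressed, strict_autodetect):
    """Return the process exit code for a validation pass."""
    if strict_autodetect and strict_violations(findings, addressed):
        return STRICT_EXIT_CODE
    return 0


findings = [
    ("patients.visit_notes", "high"),
    ("products.description", "medium"),
    ("orders.comment", "low"),
]
exit_code = validate(findings, {"patients.visit_notes"}, strict_autodetect=True)
```

In this example `products.description` is the only unaddressed `medium` finding, so strict mode fails validation. With `strict_autodetect` off, the same input validates and the column is left as passthrough.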