Test Fixtures — MedicalHelpDesk Corp¶
This document defines the reference source database used to test
PrivaCI end-to-end. It is the test-side counterpart to
docs/local-development.md.
The fixture is a fictional SaaS company called MedicalHelpDesk Corp — a multi-tenant support tool sold to medical practices, hospitals, and clinics. The choice is deliberate:
- PHI density — patients, providers, prescriptions, visits all carry HIPAA-relevant identifiers (MRN, NPI, DEA, diagnosis codes, insurance member IDs). This exercises the future HIPAA vertical config pack on a realistic schema.
- Free-text NER bait — support ticket messages and visit notes contain narrative prose mentioning patient names and dates, perfect for Level 2 SpaCy NER testing.
- Multi-tenancy — every meaningful table has an
org_id, which exercises FK consistency across tables. - Naturally cyclic FKs — every SaaS schema with "primary contact" and "owner" pointers has cycles. We use this organically rather than contriving one.
- Two tenants worth of data in one schema — the SaaS-product
side (
public) and the patient-data side (clinical) live in separate schemas, exercising cross-schema FK behavior.
The full schema and generator scripts live at
tests/fixtures/demo_corp/.
0. Implementation status¶
What exists today: the mini tier — all major tables from the schema below with CI-friendly row counts — generated by
tests/fixtures/demo_corp/and committed attests/fixtures/sql/demo-corp/. Regenerate withmake fixtures-generate(orpython -m tests.fixtures.demo_corp.generate --tier mini). The full demo tier (millions of rows, ~300–500 MB loaded) uses the same generator with--tier demoand is not committed.
Loading the mini dataset¶
With compose.dev.yml up, load it into the source database and inspect it.
Use the project venv (activate it first, or prefix commands with .venv/bin/):
source .venv/bin/activate # once per shell
SOURCE_DB_URL=postgresql://postgres:dev@127.0.0.1:55432/privaci_source \
python -m scripts.load_sample_data
SOURCE_DB_URL=postgresql://postgres:dev@127.0.0.1:55432/privaci_source \
privaci catalog inspect
catalog inspect should report 20+ tables, partitioned raw_events and
patient_visits, deferred FK edges breaking the org↔user cycle, a
polymorphic_fk_warning on comments, and implied-FK warnings on
patient_documents.
Heads-up: the loader runs
DROP SCHEMA ... CASCADEforpublic,clinical,auth, andaudit_internalbefore recreating them, so it replaces the Week-1 spike tables in the source database'spublicschema. Re-create those by recreating the compose volume if you need them back.
1. Three-Tier Strategy¶
┌────────────────────────────────────────────────────────────┐
│ TIER 1 — Mini fixtures (programmatic, per-test) │
│ Built in-memory by pytest fixtures. │
│ 1-100 rows. Each test picks the minimum schema subset it │
│ needs. │
│ Run-time per test: < 1 second. │
└────────────────────────────────────────────────────────────┘
│
┌────────────────────────────────────────────────────────────┐
│ TIER 2 — Demo Corp (committed) │
│ Faker-seeded generator. Cached SQL committed to repo. │
│ ~500 orgs / 10k users / 100k patients / 1M ticket msgs. │
│ ~300-500 MB loaded into Postgres. │
│ ~50-80 MB compressed on-disk (gzipped binary COPY). │
│ Loads into a fresh Postgres via docker compose in ~60s. │
│ Used by: `compose.yml` smoke, E2E integration tests, │
│ release CI smoke gate, demos. │
└────────────────────────────────────────────────────────────┘
│
┌────────────────────────────────────────────────────────────┐
│ TIER 3 — Stress (generated on demand, NOT committed) │
│ Same schema, 10x to 100x scale. │
│ Up to ~50 GB loaded. │
│ Used by: performance benchmarks, week-1 throughput spike. │
│ Constant-memory streaming is validated separately by the │
│ server-side 1 GB load test (see section 8). │
└────────────────────────────────────────────────────────────┘
Determinism rule: every tier uses Faker.seed_instance(42) and
random.seed(42). Same seed, same rows, byte-for-byte. This is what
makes "the masked email for john.doe@… is always X" testable.
2. The Schema¶
══════════════════════════════════════════════════════════════════════
public schema (the SaaS-product side)
══════════════════════════════════════════════════════════════════════
┌─────────────────────┐ ┌──────────────────────┐
│ organizations │◀────────│ subscriptions │
│ id IDENTITY │ │ id IDENTITY │
│ name │ │ org_id FK │
│ ein │ │ plan │
│ billing_email UQ │ │ started_at │
│ owner_user_id ┐ │ │ billing_address_id │
│ primary_user_id│ ──┼──┐ └──────────┬───────────┘
└─────────┬───────┘ │ │ │
│ │ │ ⤴ CYCLE ▼
▼ │ │ ┌──────────────────────┐
┌─────────────────────┐ │ │ │ invoices │
│ users │◀┘ │ │ id IDENTITY │
│ id IDENTITY │ │ │ subscription_id FK │
│ email UQ │◀──┘ │ period_start │
│ phone │ │ total_cents │
│ first_name │ └──────────┬───────────┘
│ last_name │ │
│ dob │ ▼
│ ssn │ ┌──────────────────────┐
│ password_hash │ │ invoice_line_items │
│ role │ │ invoice_id FK │
│ org_id FK ────────┘ │ line_no │
│ manager_id (self) ──┐ │ PRIMARY KEY │
│ last_login_ip │ │ (invoice_id, │
│ ←───┘ │ line_no) │
└────┬─────────────────┘ └──────────────────────┘
│
├──────────────────────┐
▼ ▼
┌───────────────────┐ ┌──────────────────────────┐
│ user_addresses │ │ user_payment_methods │
│ id IDENTITY │ │ id IDENTITY │
│ user_id FK │ │ user_id FK │
│ street/city/... │ │ card_number / cvv / │
│ │ │ expiry / cardholder_name │
└───────────────────┘ └──────────────────────────┘
┌─────────────────────┐
│ employees │ practice-side staff (separate from `users`)
│ id IDENTITY │
│ org_id FK │
│ manager_id (self) │ self-referential
│ full_name │
│ ssn │
│ dob │
│ hire_date │
│ salary │
│ npi (some only) │ some employees are also `providers`
└─────────────────────┘
┌─────────────────────┐ ┌──────────────────────┐
│ tickets │ │ ticket_messages │
│ id IDENTITY │◀────────│ id IDENTITY │
│ org_id FK │ │ ticket_id FK │
│ reporter_user_id │ │ author_user_id FK │
│ subject │ │ body (LARGE TEXT) │ ← L2 NER bait
│ status │ │ posted_at │
│ priority │ └──────────────────────┘
│ assigned_to_user │
└─────────────────────┘
┌──────────────────────────────────────────────────────┐
│ comments │
│ id IDENTITY │
│ commentable_type TEXT ← 'Ticket' | 'Patient' | │
│ commentable_id BIGINT │
│ author_user_id FK→users │
│ body │
│ │
│ POLYMORPHIC FK — no catalog edge. │
│ Engine emits a `polymorphic_fk_warning`. │
└──────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────┐
│ raw_events │
│ PARTITIONED BY RANGE (event_at) — 24 monthly parts │
│ │
│ raw_events (parent) │
│ ├─ raw_events_2024_01 ... raw_events_2025_12 │
│ │
│ 60+ columns including jsonb payload. │
│ Wide row → exercises batch auto-tune. │
└──────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────┐
│ geo_locations │
│ id, name, region_path ltree │
│ Contains UNSUPPORTED type (ltree) → exercises │
│ per-table text-mode COPY fallback. │
└──────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────┐
│ active_clinics_v (VIEW) │
│ Plain view. Engine SHALL skip + audit log. │
│ │
│ monthly_revenue_v (VIEW) │
│ Aggregating view. Engine SHALL skip + audit log. │
│ │
│ tickets_open_mv (MATERIALIZED VIEW) │
│ Engine SHALL skip + audit log. │
└──────────────────────────────────────────────────────┘
══════════════════════════════════════════════════════════════════════
clinical schema (the patient-data side — HIPAA territory)
══════════════════════════════════════════════════════════════════════
┌─────────────────────┐ ┌──────────────────────┐
│ providers │ │ patients │
│ id IDENTITY │◀────┐ │ id IDENTITY │
│ npi UQ │ │ │ mrn UQ │
│ dea_number │ │ │ first_name │
│ first_name │ │ │ last_name │
│ last_name │ │ │ full_name │ ← composite
│ email UQ │ │ │ dob │
│ specialty │ │ │ ssn │
│ org_id FK ────────────┐ │ │ phone │
│ (CROSS-SCHEMA → │ └───│ primary_provider_id │
│ public.orgs) │ │ org_id FK ──────┐ │
└───────────────────────│ │ (CROSS-SCHEMA)│ │
│ │ insurance_member_id│ │
│ │ insurance_group │ │
│ └─────────┬──────────┘ │
│ │ │
└────────────────┼─────────────┘
│
▼
┌──────────────────────────────────────────────────────┐
│ patient_visits │
│ PARTITIONED BY LIST (region_code) — │
│ visits_us_east / us_west / us_central / intl │
│ │
│ id IDENTITY (composite with region_code) │
│ patient_id FK │
│ provider_id FK │
│ visit_date │
│ visit_type │
│ region_code │
│ diagnosis_code │
│ chief_complaint TEXT │
│ visit_notes TEXT (LARGE — L2 NER bait, narrative │
│ prose mentioning patient names and dates) │
└──────────────────────────────────────────────────────┘
┌─────────────────────┐
│ prescriptions │
│ id IDENTITY │
│ patient_id FK │
│ provider_id FK │
│ drug_name │
│ dosage │
│ refills │
│ prescribed_at │
└─────────────────────┘
┌──────────────────────────────────────────────────────┐
│ patient_documents │
│ id IDENTITY │
│ patient_id FK→patients │
│ referring_provider_email TEXT ← IMPLIED FK, │
│ informally references clinical.providers.email │
│ (no FK constraint declared) │
│ document_blob bytea │
│ uploaded_by_user_email TEXT ← IMPLIED FK to │
│ public.users.email │
│ │
│ Engine emits an `implied_fk_warning` for both. │
│ User config can mitigate via `seed_alias` to make │
│ faked values consistent. │
└──────────────────────────────────────────────────────┘
══════════════════════════════════════════════════════════════════════
auth schema (cross-schema FKs)
══════════════════════════════════════════════════════════════════════
┌─────────────────────┐ ┌──────────────────────┐
│ sessions │ │ api_keys │
│ id IDENTITY │ │ id IDENTITY │
│ user_id FK ───────────┐ │ user_id FK ───────┐ │
│ (→ public.users) │ │ (→ public.users) │ │
│ token UUID UQ │ │ key_hash │ │
│ ip_address │ │ last_used_at │ │
│ user_agent │ │ last_used_ip │ │
│ created_at │ │ scopes TEXT[] │ │
│ expires_at │ └─────────────────────┘ │
└───────────────────────┘ │
│
══════════════════════════════════════════════════════════════════════
audit_internal schema (exclude-strategy test)
══════════════════════════════════════════════════════════════════════
┌──────────────────────────────────────────────────────┐
│ audit_log_events │
│ occurred_at, actor_user_id, action, target, │
│ payload jsonb │
│ │
│ NO PRIMARY KEY — exercises: │
│ • table-level checkpoint fallback │
│ • the `strategy: exclude` config path (default │
│ test config excludes this table) │
└──────────────────────────────────────────────────────┘
What every quirky table teaches¶
| Table / object | Lesson exercised |
|---|---|
organizations ↔ users (owner_user_id, primary_user_id) |
2-table FK cycle, DEFERRABLE constraint, cycle-break heuristic |
users.manager_id |
Self-referential FK |
employees.manager_id |
Self-referential FK in a non-user table |
invoice_line_items |
Composite PK — keyset-pagination + UNIQUE composite faker |
audit_internal.audit_log_events |
No PK → table-level checkpoint fallback; exclude strategy in default config |
comments (commentable_type, commentable_id) |
Polymorphic FK → warning, not failure |
auth.sessions / auth.api_keys |
Cross-schema FK |
clinical.providers.org_id |
Cross-schema FK between non-public schemas |
raw_events (24 partitions, monthly) |
Range-partitioned table — clone parent + children, per-partition checkpoint |
clinical.patient_visits (4 list partitions) |
List-partitioned table — same as above, list variant |
raw_events (60+ columns) |
Wide row → batch auto-tune |
ticket_messages.body, patient_visits.visit_notes |
Free-text → SpaCy NER (L2) |
users |
Every PII pattern: email (UNIQUE), phone, ssn, dob, password, name, address |
user_payment_methods |
PCI-DSS pattern coverage (card_number / cvv / expiry) |
clinical.patients |
HIPAA pattern coverage (mrn, dob, ssn, insurance_member_id) |
clinical.providers.npi, dea_number |
HIPAA provider identifiers |
clinical.patients.full_name + UNIQUE on (first_name, last_name, dob) |
Composite UNIQUE → faker uniqueness suffix |
clinical.patient_documents.*_email |
Implied (soft) FK → warning + seed_alias mitigation |
geo_locations.region_path ltree |
Unsupported type → text-mode COPY fallback |
active_clinics_v, monthly_revenue_v |
Plain views — engine skips + audit-log entry |
tickets_open_mv |
Materialized view — engine skips + audit-log entry |
raw_events.id (SERIAL) + most others (IDENTITY) |
Both legacy SERIAL and modern IDENTITY sequences |
Edge-case fixtures (separate small SQL files)¶
These don't belong in the main Demo Corp because they're failure-mode
tests. Each lives at tests/fixtures/sql/edge-cases/<name>.sql and
is loaded only by the tests that need it.
| File | Failure mode tested |
|---|---|
non-deferrable-cycle.sql |
Cycle with NOT DEFERRABLE constraint → pre-flight exit 2 with SQL hint |
no-primary-key.sql |
Table with no PK → table-level checkpoint fallback |
unsupported-types.sql |
PostGIS-like custom domain → text-mode fallback |
permission-denied.sql |
Role lacks CREATE SCHEMA → pre-flight exit 2 |
dangling-fk-exclude.sql |
exclude'd table is FK target → config validation exit 3 |
composite-pk-only.sql |
Only composite PKs, no monotonic single-column key |
adversarial-types.sql |
Deep FK chain + 4-byte UTF-8, large jsonb, max numeric, arrays, bytea, inet → byte-for-byte round-trip |
3. Generator Layout¶
tests/fixtures/demo_corp/
├── README.md
├── __init__.py
├── schema.py ← DDL as Python — single source of truth
├── generate.py ← CLI: produce SQL files at given tier
├── seed.py ← Faker seeding + deterministic helpers
├── providers/ ← per-domain row generators
│ ├── __init__.py
│ ├── orgs.py
│ ├── users.py
│ ├── patients.py
│ ├── visits.py
│ ├── tickets.py
│ ├── messages.py ← long-form narrative w/ PHI mentions
│ └── events.py
└── tiers.py ← scale parameters per tier
tests/fixtures/sql/
├── demo-corp/ ← Tier 2 output (committed, ~50-80 MB)
│ ├── 00-schemas.sql
│ ├── 01-tables-public.sql
│ ├── 02-tables-clinical.sql
│ ├── 03-tables-auth.sql
│ ├── 04-tables-audit-internal.sql
│ ├── 05-partitions.sql
│ ├── 06-constraints.sql
│ ├── 07-indexes.sql
│ ├── 08-views.sql
│ └── 09-data/ ← gzipped binary COPY files per table
│ ├── public.organizations.copy.gz
│ ├── public.users.copy.gz
│ └── ...
└── edge-cases/
├── non-deferrable-cycle.sql
└── ...
Generation invocation¶
# Regenerate Tier 2 (committed Demo Corp dataset)
python -m tests.fixtures.demo_corp.generate --tier demo \
--out tests/fixtures/sql/demo-corp
# Generate Tier 3 stress dataset (not committed)
python -m tests.fixtures.demo_corp.generate --tier stress \
--scale 10 \
--out /tmp/stress
# Tier 1 (mini) is built in-memory by pytest fixtures; no script
# invocation needed.
The generator is reproducible: make fixtures-verify regenerates Tier
2 into a temp dir and diffs against the committed files. CI fails if
the committed SQL drifts from what the code would produce.
Tier scale parameters (Tier 2 / Demo)¶
| Entity | Count |
|---|---|
| organizations | 500 |
| users | 10,000 |
| employees | 5,000 |
| providers | 1,000 |
| patients | 100,000 |
| patient_visits | 500,000 (across 4 list partitions) |
| prescriptions | 250,000 |
| patient_documents | 25,000 |
| tickets | 50,000 |
| ticket_messages | 1,000,000 (free-text; the bulk of disk) |
| invoices | 25,000 |
| invoice_line_items | 100,000 |
| raw_events | 5,000,000 (across 24 monthly partitions) |
| comments (polymorphic) | 50,000 |
Loaded into Postgres: ~300-500 MB. Compressed COPY files on disk:
~50-80 MB. docker compose up → masked target ready in under 60
seconds on a modern laptop.
4. Reference Configs¶
The fixtures ship with three reference configs at
tests/fixtures/configs/:
| Config | Purpose |
|---|---|
demo-corp.yaml |
The "happy path" config — every table addressed explicitly, sane defaults |
minimal.yaml |
Only required fields. Relies entirely on auto-detect. Tests the zero-config story |
strict.yaml |
strict_autodetect: true + explicit passthrough for known-safe columns. Tests the compliance-mode path |
Snippet from demo-corp.yaml (excerpt — full file in repo):
version: "1.0"
global_salt: env://PRIVACI_TEST_SALT
on_existing_data: fail
strict_autodetect: false
tables:
public.organizations:
strategy: transform
columns:
name: { action: fake, provider: company }
ein: { action: regex_mask, pattern: '^\d{2}-\d{7}$', replace: "00-0000000" }
billing_email: { action: fake, provider: email }
public.users:
strategy: transform
columns:
email: { action: fake, provider: email }
phone: { action: fake, provider: phone }
first_name: { action: fake, provider: first_name }
last_name: { action: fake, provider: last_name }
dob: { action: fake, provider: dob }
ssn: { action: fake, provider: ssn }
password_hash: { action: static, value: "privaci-test-pw" }
last_login_ip: { action: fake, provider: ip_address }
clinical.patients:
strategy: transform
columns:
mrn: { action: hash }
first_name: { action: fake, provider: first_name }
last_name: { action: fake, provider: last_name }
full_name: { action: fake, provider: full_name }
dob: { action: fake, provider: dob }
ssn: { action: fake, provider: ssn }
phone: { action: fake, provider: phone }
email: { action: fake, provider: email }
insurance_member_id: { action: hash }
clinical.patient_documents:
strategy: transform
columns:
# Aliases make implied FKs deterministic across tables
referring_provider_email:
action: fake
provider: email
seed_alias: clinical.providers.email
uploaded_by_user_email:
action: fake
provider: email
seed_alias: public.users.email
document_blob: { action: passthrough } # synthetic blob, no PII
public.ticket_messages:
strategy: transform
columns:
body: { action: ner_mask } # L2 SpaCy
clinical.patient_visits:
strategy: transform
columns:
visit_notes: { action: ner_mask }
audit_internal.audit_log_events:
strategy: exclude
5. Assertions Every Integration Test Must Make¶
When a test runs the engine against Demo Corp (full or subset), the test SHOULD verify, at minimum:
# Original PII values do not appear anywhere in target
assert "john.doe@acme-clinic.example" not in serialized_target_rows
# FK integrity preserved
assert all_fks_valid(target_conn)
# Deterministic faker consistency
masked_email_in_users = pick_one(target_conn, "public.users", "email")
masked_email_in_messages_authored_by_same_user = ...
assert they_match_via_seed()
# UNIQUE constraints satisfied after masking
assert no_unique_violations(target_conn)
# Audit log written
assert (
audit_count(target_conn, event_type="column.masked") > 0
)
# Skipped views recorded
assert "active_clinics_v" in skipped_objects(target_conn)
assert "tickets_open_mv" in skipped_objects(target_conn)
# Polymorphic + implied FKs warned
assert (
audit_count(target_conn, event_type="polymorphic_fk_warning") == 1
)
assert (
audit_count(target_conn, event_type="implied_fk_warning") >= 2
)
# Partitions all loaded
assert partition_count(target_conn, "public.raw_events") == 24
assert partition_count(target_conn, "clinical.patient_visits") == 4
# No raw PII in audit log
assert no_pii_in_audit_log(target_conn)
The helper module tests/integration/assertions.py provides these as
composable functions so every integration test reads identically.
6. What This Fixture Does NOT Cover¶
Honesty matters. Things this fixture intentionally does not cover yet — captured as future work:
- Multi-region partitioning at scale (hundreds of partitions). Tier 2 covers 24 (range) and 4 (list); v1.x test additions can expand.
- Triggers / rules — current spec already says "skip + audit log";
Demo Corp doesn't ship any. Edge-case fixture
with-triggers.sqlcan be added when needed. - Custom domains and user-defined types beyond
ltree. - Mongo / non-relational sources — out of MVP scope.
- Sharded databases / Citus — out of MVP scope.
- Logical replication slots / publications — engine doesn't use them; Demo Corp doesn't need them.
7. Operational Notes¶
- PHI must never be real. Faker generates all patient and provider
data with a fixed seed. The
seed.pymodule includes a hard assertion that prevents the generator from running if it detects any obvious real-PII patterns in the input (e.g., a name dictionary imported from a real source). - Faker locale.
en_USfor predictability. Vertical packs can later adden_GB,de_DE, etc. - Long-text quality.
ticket_messages.bodyandpatient_visits.visit_notesuse a templated narrative generator that produces realistic prose mentioning Faker names, dates, and Faker-generated MRN-shaped strings. This is what makes L2 NER tests meaningful. - Regeneration determinism. Any change to
generate.py,schema.py,seed.py, or any provider module that affects output requires a regenerated committed SQL set in the same PR. The pre-commit hookverify-fixturesenforces this. - Storage cost. ~80 MB of compressed SQL in the repo is acceptable. If it grows past ~200 MB we'll move to Git LFS or fetch-on-demand from a release artifact.
8. Large-scale load test (§18.3)¶
The beta release gate requires proof that the engine streams a multi-gigabyte table in constant memory. This is validated by a dedicated, opt-in test rather than the Tier 3 Python generator.
Why not
--tier stress? The Python generators materialise rows in process memory before writing SQL, so asking them for a 1 GB dataset balloons RAM. The load test instead builds the dataset inside PostgreSQL withINSERT ... SELECT FROM generate_series(...)(seetests/integration/loadtest_data.py): the database streams generated rows straight to its own heap/TOAST storage while Python holds nothing larger than the SQL statement.
What it asserts¶
tests/integration/test_loadtest_1gb.py runs the full masking pipeline against
the generated load_test.customers table and verifies:
- Completeness — every source row reaches the target (row parity).
- Integrity — no foreign key is left orphaned.
- No leakage — the probe email
customer1@example.testnever survives. - Bounded memory — peak process RSS growth during the run stays under a ceiling and below half the dataset size, proving the engine never buffers the whole table.
Running it¶
The test is marked slow + integration, so it is excluded from the default
suite. With compose.dev.yml up:
source .venv/bin/activate
# Default 64 MiB — a fast confidence check.
pytest tests/integration/test_loadtest_1gb.py -m "integration and slow"
# Full 1 GiB target (validated locally at 1.15 GB / 900k rows in ~40 s).
PRIVACI_LOADTEST_BYTES=1073741824 \
pytest tests/integration/test_loadtest_1gb.py -m "integration and slow"
| Environment variable | Default | Purpose |
|---|---|---|
PRIVACI_LOADTEST_BYTES |
67108864 (64 MiB) |
Minimum pg_total_relation_size of the customers table to build. |
PRIVACI_LOADTEST_MEM_CEILING_MB |
512 |
Maximum allowed peak RSS growth during the streamed run. |