Spike 2.2: spaCy en_core_web_sm throughput
Hypothesis
Level-2 NER can process representative freeform notes at ≥1,000 rows/sec on a typical developer laptop, keeping L2 in the MVP.
Setup
- Fixture text: `tests/fixtures/spikes/freeform_notes.txt` (~200+ chars/paragraph)
- Model: `en_core_web_sm`
- Default benchmark: 2,000 rows, `nlp.pipe(batch_size=64)`
Procedure
python -m spacy download en_core_web_sm
python scripts/spikes/run_week1_spikes.py
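The spike script itself is not reproduced here. As a rough sketch of the kind of measurement the setup describes, the following times one pass of `nlp.pipe` over 2,000 rows. The helper names (`cycle_rows`, `measure_throughput`, `run_spike`) are hypothetical, and the blank-line paragraph split is an assumption about the fixture format; model load time is deliberately left out of the timed region.

```python
import itertools
import time
from typing import Callable, Dict, Iterable, List

FIXTURE_PATH = "tests/fixtures/spikes/freeform_notes.txt"
TARGET_ROWS_PER_SEC = 1000.0


def cycle_rows(paragraphs: List[str], row_count: int) -> List[str]:
    """Repeat the fixture paragraphs until there are row_count rows."""
    return list(itertools.islice(itertools.cycle(paragraphs), row_count))


def measure_throughput(
    pipe: Callable[..., Iterable], texts: List[str], batch_size: int = 64
) -> Dict[str, object]:
    """Time one pass of `pipe` over `texts` and report rows/sec."""
    start = time.perf_counter()
    # nlp.pipe is a lazy generator; consume it so every doc is processed.
    rows = sum(1 for _ in pipe(texts, batch_size=batch_size))
    elapsed = time.perf_counter() - start
    rate = rows / elapsed if elapsed > 0 else float("inf")
    return {
        "rows": rows,
        "batch_size": batch_size,
        "elapsed_s": elapsed,
        "rows_per_sec": rate,
    }


def run_spike(row_count: int = 2000, batch_size: int = 64) -> Dict[str, object]:
    """Load the model and fixture, then measure throughput once."""
    import spacy  # imported lazily so the helpers above work without spaCy

    nlp = spacy.load("en_core_web_sm")
    with open(FIXTURE_PATH, encoding="utf-8") as fh:
        # Assumption: paragraphs in the fixture are separated by blank lines.
        paragraphs = [p.strip() for p in fh.read().split("\n\n") if p.strip()]
    result = measure_throughput(nlp.pipe, cycle_rows(paragraphs, row_count), batch_size)
    result["target_met"] = result["rows_per_sec"] >= TARGET_ROWS_PER_SEC
    return result
```

Calling `run_spike()` after the model download returns the values needed for the results table (row count, batch size, elapsed seconds, rows/sec, and whether the target was met).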
Results (fill after local run)
| Metric | Value |
|---|---|
| Date | YYYY-MM-DD |
| Row count | 2000 |
| Batch size | 64 |
| Elapsed (s) | … |
| Rows/sec | … |
| Target (≥1000/sec) | met / not met |
Conclusion
- Status: PASS / FAIL / BLOCKED
- Notes: If below target, try `batch_size=128` or a batched multi-column pipe, per design risk R1.
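R1 is not quoted in this page, so the following is only one possible reading of a "batched multi-column pipe": flatten the text columns of all rows into a single stream, make one `nlp.pipe` call, and regroup the results per row. The function name `pipe_columns` and the list-of-dicts row shape are assumptions for illustration.

```python
from typing import Callable, Dict, Iterable, List, Sequence


def pipe_columns(
    pipe: Callable[..., Iterable],
    rows: Sequence[Dict[str, object]],
    columns: Sequence[str],
    batch_size: int = 128,
) -> List[Dict[str, object]]:
    """Run several text columns through a single pipe call.

    `pipe` is expected to behave like spaCy's `nlp.pipe`: it takes an
    iterable of strings plus `batch_size` and yields one result per string,
    in order.
    """
    # Flatten row-major: row0.colA, row0.colB, row1.colA, ...
    # Missing or None cells become empty strings so positions stay aligned.
    texts = [str(row.get(col) or "") for row in rows for col in columns]
    docs = list(pipe(texts, batch_size=batch_size))
    width = len(columns)
    # Regroup the flat results back into one dict per input row.
    return [
        dict(zip(columns, docs[i * width:(i + 1) * width]))
        for i in range(len(rows))
    ]
```

With spaCy loaded, this would be called as `pipe_columns(nlp.pipe, rows, ["notes", "comments"])`, where the column names are placeholders. A single call avoids restarting the pipeline per column, though whether it helps throughput is exactly what the spike would need to measure.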
Impact on design
- PASS: Keep L2 in MVP; document the recommended `batch_size` for `nlp.pipe`.
- FAIL: Trigger R1 mitigations (batch columns, reduce model, or L1-only MVP).
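To pick the `batch_size` worth documenting on a PASS, a small sweep can compare a few candidates on the same rows. This is a sketch with hypothetical helper names and an arbitrary candidate list; single-run timings are noisy, so repeated runs would be advisable before recording a recommendation.

```python
import time
from typing import Callable, Dict, Iterable, List, Sequence


def sweep_batch_sizes(
    pipe: Callable[..., Iterable],
    texts: List[str],
    sizes: Sequence[int] = (32, 64, 128, 256),
) -> Dict[int, float]:
    """Return rows/sec for each candidate batch size."""
    rates: Dict[int, float] = {}
    for size in sizes:
        start = time.perf_counter()
        # Consume the lazy generator so the work is actually done.
        rows = sum(1 for _ in pipe(texts, batch_size=size))
        elapsed = time.perf_counter() - start
        rates[size] = rows / elapsed if elapsed > 0 else float("inf")
    return rates


def recommend_batch_size(rates: Dict[int, float]) -> int:
    """Pick the batch size with the highest measured rows/sec."""
    return max(rates, key=rates.get)
```

Usage with a loaded model would look like `recommend_batch_size(sweep_batch_sizes(nlp.pipe, texts))`.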