Spike 2.2: spaCy en_core_web_sm throughput

Hypothesis

Level-2 (L2) NER with en_core_web_sm can process representative freeform notes at ≥1,000 rows/sec on a typical developer laptop. If so, L2 stays in the MVP.

Setup

  • Fixture text: tests/fixtures/spikes/freeform_notes.txt (paragraphs of roughly 200 characters or more)
  • Model: en_core_web_sm
  • Default benchmark: 2,000 rows, nlp.pipe(batch_size=64)
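
The setup above can be sketched as a small timing harness. This is a minimal illustration, not the contents of scripts/spikes/run_week1_spikes.py: the names measure_throughput and run_spacy_benchmark are hypothetical, and splitting the fixture on blank lines and cycling paragraphs to reach 2,000 rows are assumptions about how rows are built.

```python
import time
from typing import Callable, Iterable, Sequence


def measure_throughput(
    pipe: Callable[..., Iterable],
    texts: Sequence[str],
    batch_size: int = 64,
) -> dict:
    """Time one full pass of `pipe` over `texts` and report rows/sec."""
    start = time.perf_counter()
    # nlp.pipe returns a lazy generator, so it must be fully consumed
    # inside the timed region for the measurement to mean anything.
    n_docs = sum(1 for _ in pipe(texts, batch_size=batch_size))
    elapsed = time.perf_counter() - start
    return {
        "rows": n_docs,
        "batch_size": batch_size,
        "elapsed_s": elapsed,
        "rows_per_sec": n_docs / elapsed if elapsed > 0 else float("inf"),
    }


def run_spacy_benchmark(n_rows: int = 2000, batch_size: int = 64) -> dict:
    """Hypothetical driver; needs spaCy and en_core_web_sm installed."""
    import spacy

    nlp = spacy.load("en_core_web_sm")
    with open("tests/fixtures/spikes/freeform_notes.txt", encoding="utf-8") as fh:
        # Assumption: paragraphs in the fixture are separated by blank lines.
        paragraphs = [p.strip() for p in fh.read().split("\n\n") if p.strip()]
    # Assumption: cycle the fixture paragraphs to reach the requested row count.
    rows = [paragraphs[i % len(paragraphs)] for i in range(n_rows)]
    return measure_throughput(nlp.pipe, rows, batch_size=batch_size)
```

Calling run_spacy_benchmark() once before the timed run may be worthwhile, since the first pass can include one-off warm-up costs. Model loading is deliberately kept outside the timed region.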

Procedure

python -m spacy download en_core_web_sm
python scripts/spikes/run_week1_spikes.py

Results (fill after local run)

Metric                 Value
Date                   YYYY-MM-DD
Row count              2,000
Batch size             64
Elapsed (s)
Rows/sec
Target (≥1,000/sec)    met / not met

Conclusion

  • Status: PASS / FAIL / BLOCKED
  • Notes: If throughput is below target, try batch_size=128 or a single batched nlp.pipe call across multiple columns, per design risk R1.
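
One possible reading of the "batched multi-column pipe" mitigation is sketched below: cells from several text columns are flattened into a single stream so the model sees one large nlp.pipe call instead of one call per column. The helper name pipe_columns and the dict-of-lists input shape are hypothetical. The only spaCy behaviour relied on is that nlp.pipe accepts as_tuples=True and then yields (doc, context) pairs.

```python
from typing import Callable, Dict, Iterable, List


def pipe_columns(
    pipe: Callable[..., Iterable],
    columns: Dict[str, List[str]],
    batch_size: int = 128,
) -> Dict[str, list]:
    """Run every cell of every column through a single pipe call."""
    # Tag each text with (column name, row index) so results can be
    # routed back to the right cell after processing.
    stream = (
        (text, (name, i))
        for name, values in columns.items()
        for i, text in enumerate(values)
    )
    results: Dict[str, list] = {
        name: [None] * len(values) for name, values in columns.items()
    }
    for doc, (name, i) in pipe(stream, as_tuples=True, batch_size=batch_size):
        results[name][i] = doc
    return results
```

With spaCy loaded, this would be called as pipe_columns(nlp.pipe, {"notes": [...], "comments": [...]}). Whether it actually improves rows/sec over per-column calls is what the spike would need to measure.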

Impact on design

  • PASS: Keep L2 in MVP; document recommended batch_size for nlp.pipe.
  • FAIL: Trigger R1 mitigations (batch columns, reduce the model, or ship an L1-only MVP).