wordchipper

wordchipper is a high-performance Rust byte-pair encoding (BPE) tokenizer for the OpenAI GPT-2 tokenizer family. Through a
combination of strict allocation discipline, factoring the implementation along the pre-tokenization and BPE merge
algorithm choices, thread-local resources, and extensive metrics, we achieved throughput speedups over tiktoken-rs in
Rust, on a 64-core machine, of ~4.3x-5.7x (4 to 64 cores) for general regex BPE vocabularies, and ~6.9x-9.2x when
using custom DFA lexers for specific OpenAI vocabularies. Under Python wrappers, we see speedups of ~2x-4x (4 to 64 cores)
over tiktoken. Because the pre-tokenizers and encoders are substitutable, we can benchmark their full cross-product. The
results show that the best encoder depends on the workload, and that the relative ranking of algorithm families can
invert from one corpus to another.
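
For readers new to the "BPE merge" step mentioned above, here is a minimal sketch of the textbook greedy merge loop applied to one pre-tokenized span. It is not wordchipper's implementation or API. The function name `bpe_merge`, the `ranks` table layout, and the convention that a merged token's id doubles as its merge rank (lower merges first) are assumptions made for illustration.

```rust
use std::collections::HashMap;

/// Hypothetical helper (not part of wordchipper): naive greedy BPE merge.
/// `ranks` maps an adjacent token pair to its merged token id; this sketch
/// assumes the merged id also serves as the merge rank (lower merges first).
fn bpe_merge(bytes: &[u8], ranks: &HashMap<(u32, u32), u32>) -> Vec<u32> {
    // Start with one token per input byte.
    let mut toks: Vec<u32> = bytes.iter().map(|&b| b as u32).collect();
    loop {
        // Scan for the lowest-ranked mergeable pair; ties go to the leftmost.
        let mut best: Option<(usize, u32)> = None; // (index, merged id)
        for i in 0..toks.len().saturating_sub(1) {
            if let Some(&id) = ranks.get(&(toks[i], toks[i + 1])) {
                if best.map_or(true, |(_, r)| id < r) {
                    best = Some((i, id));
                }
            }
        }
        match best {
            // Replace the pair with its merged token and rescan.
            Some((i, id)) => {
                toks[i] = id;
                toks.remove(i + 1);
            }
            // No mergeable pair is left: the span is fully encoded.
            None => return toks,
        }
    }
}
```

This version rescans the whole span after every merge, so its cost grows quadratically with span length. Faster merge strategies trade that simplicity for extra bookkeeping, which is one reason an encoder's relative performance can depend on the workload.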