wordchipper: Fast BPE Tokenization with Substitutable Internals
Read the full paper
Authors:
- Crutcher Dunnavant (crutcher@gmail.com)
- Dilshod Tadjibaev (dilshod@gmail.com)
wordchipper is a Rust / Python BPE Tokenizer library.
We present wordchipper [1], a Rust-based byte-pair encoding (BPE)
tokenizer for the OpenAI GPT-2 tokenizer family. Through a combination of strict
allocation discipline, factoring along the implementation boundaries of the
pre-tokenization and BPE merge algorithm choices, thread-local resources, and
extensive metrics, we achieve throughput speedups relative to
tiktoken-rs [2] in Rust on a 64-core machine of ~4.3-5.7x (4 to 64 cores)
for general regex BPE vocabularies, and ~6.9-9.2x when using custom DFA lexers
for specific OpenAI vocabularies. Under the Python wrappers, we see speedups of
~2-4x (4 to 64 cores) over tiktoken [3]. The substitutable design
yields a benchmark cross-product that reveals workload-dependent encoder
selection and corpus-modulated performance inversion between algorithm families.
[Figure: Rust throughput vs. thread count]
The central goal of wordchipper is to serve as its own experimental
apparatus. Rather than committing to a single pre-tokenization strategy or BPE
merge algorithm, the library defines stable interface boundaries at
algorithmic decision points, permitting implementations to be substituted
independently. This yields a benchmark space that is the cross-product of span
encoder implementations, spanner backends, vocabulary configurations, and
thread counts.
In wordchipper, a spanner partitions raw input text into a sequence of
labeled spans, each carrying a byte slice and a classification: a Special
token, a Word, or a Gap. The spanner’s sole responsibility is this
segmentation; it has no knowledge of the token vocabulary or merge rules. A
span encoder accepts a single word span and a vocabulary, and is responsible
for resolving it to a sequence of one or more tokens via BPE merge. The span
encoder owns whatever working memory its algorithm requires — scratch
buffers, priority queues, linked list nodes — retaining that memory across
calls to avoid repeated allocation on the fast path.
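The spanner / span-encoder division described above can be sketched at toy scale. The names (`SpanKind`, `Span`, `spanner`) and the whitespace-splitting rule below are illustrative stand-ins for this sketch, not wordchipper's actual Rust types or its pre-tokenization patterns:

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import List, Set


class SpanKind(Enum):
    SPECIAL = auto()
    WORD = auto()
    GAP = auto()


@dataclass
class Span:
    kind: SpanKind
    text: bytes


def spanner(text: bytes, specials: Set[bytes]) -> List[Span]:
    """Toy spanner: splits on ASCII whitespace and flags special tokens.
    It knows nothing about the token vocabulary or merge rules."""
    spans, i = [], 0
    while i < len(text):
        if text[i:i + 1].isspace():
            j = i
            while j < len(text) and text[j:j + 1].isspace():
                j += 1
            spans.append(Span(SpanKind.GAP, text[i:j]))
        else:
            j = i
            while j < len(text) and not text[j:j + 1].isspace():
                j += 1
            chunk = text[i:j]
            kind = SpanKind.SPECIAL if chunk in specials else SpanKind.WORD
            spans.append(Span(kind, chunk))
        i = j
    return spans
```

A span encoder would then consume only the `WORD` spans, resolving each to tokens; the spanner's output alone fully reconstructs the input.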
Python Throughput
The Python binding exposes wordchipper via PyO3 [4], a Rust crate that
generates a native CPython extension module. The extension is built and
packaged using maturin [5], which handles ABI compatibility and PyPI
publication. The binding is intentionally thin: the core Tokenizer and
TokenizerOptions types are re-exported directly from the Rust library, with
no reimplementation of encoding logic in Python. The GIL is released during
encoding via py.detach(), permitting concurrent calls from Python threads
without serialization through the interpreter lock.
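From the Python side, the call pattern looks like the sketch below. The `FakeTokenizer` here is a pure-Python stand-in, not wordchipper's `Tokenizer`; with the real extension, `encode()` releases the GIL, so these worker threads run in parallel rather than serializing on the interpreter lock:

```python
from concurrent.futures import ThreadPoolExecutor


# Pure-Python stand-in for the native Tokenizer; the real extension
# releases the GIL inside encode(), so the pool's threads can run
# concurrently instead of taking turns holding the interpreter lock.
class FakeTokenizer:
    def encode(self, text: str) -> list:
        return [ord(c) for c in text]  # placeholder, not real BPE


chunks = ["hello world"] * 8
tok = FakeTokenizer()
with ThreadPoolExecutor(max_workers=4) as pool:
    token_lists = list(pool.map(tok.encode, chunks))
```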
For the fast pattern gpt2, the logos,
regex-automata, and fancy-regex lexers all outperform tiktoken and
tokenizers. There is no r50k support for bpe_openai yet.
For low thread counts, our rayon + regex-automata implementation (the
default) outperforms tiktoken by ~2x. Above 8 threads, the performance of
the Python tiktoken implementation collapses; at 64 threads, the ratio grows
to 4.1x, with rayon + regex-automata reaching a peak throughput of
104.3 MiB/s.
Across thread counts, our rayon + logos implementation runs at 1.1x-1.2x the
performance of our rayon + regex-automata implementation, reaching a peak
throughput of 116.7 MiB/s on 64 threads.
The performance of the o200k pattern is
similar to the gpt2 pattern, with the rayon + logos implementation
outperforming the rayon + regex-automata implementation.
There are few meaningful differences outside the performance of tiktoken.
Lexer Implementations
A spanner’s segmentation is driven by a lexer — an engine that applies the vocabulary’s pre-tokenization pattern to raw input bytes. Because pre-tokenization must scan every byte of the input, lexer throughput is a primary determinant of end-to-end performance. The spanner interface permits substitution of the underlying lexer implementation independently of the span encoder and vocabulary configuration, enabling controlled measurement of lexer throughput in isolation.
fancy-regex Lexer
The first lexer implementation in wordchipper was built upon the
fancy-regex [6] crate, as used by tiktoken [2]. This regex library
supports extended features such as negative lookahead, which is used
to implement the (?!\S) whitespace fixup pattern.
regex-automata Lexer
The regex-automata [7] crate compiles regular expressions into a DFA at
runtime, avoiding the backtracking interpreter used by fancy-regex. It does
not support lookaheads, so the \s+(?!\S) branches in the OpenAI patterns
must be transformed before compilation. For each known pattern, wordchipper
maintains a parallel version with the lookahead branches collapsed to \s+
and applies a lightweight post-processing pass that truncates multi-character
whitespace matches by one trailing character, reproducing the lookahead
semantics.
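The collapse-then-truncate idea can be illustrated at toy scale. The two-branch pattern below is a simplified stand-in for the OpenAI regexes, not the actual patterns; the post-processing pass is the one-character truncation described above:

```python
import re

# Simplified stand-in pattern: a " ?\S+" word branch plus a plain \s+
# branch, i.e. the \s+(?!\S) lookahead branch collapsed to \s+.
PATTERN = re.compile(r" ?\S+|\s+")


def lex_with_fixup(text: str) -> list:
    spans, pos = [], 0
    while pos < len(text):
        m = PATTERN.match(text, pos)
        span, end = m.group(), m.end()
        # Post-processing pass: a multi-character whitespace match that
        # is followed by non-whitespace gives back one trailing
        # character (picked up by the next " ?\S+" match), reproducing
        # the \s+(?!\S) lookahead semantics without lookahead support.
        if span.isspace() and len(span) > 1 and end < len(text):
            span, end = span[:-1], end - 1
        spans.append(span)
        pos = end
    return spans
```

For `"foo   bar"` this yields `["foo", "  ", " bar"]`: the whitespace run surrenders its final space to the following word, exactly as the lookahead branch would.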
This approach was motivated by bpe-openai [8], which demonstrated that
regex-automata provides substantial speedups over fancy-regex for these
patterns. In wordchipper, the regex-automata backend serves as a middle
tier: ~4-8x faster than fancy-regex, available for all patterns without
per-pattern development, but slower than the compile-time logos DFAs. Search
state is held in an external Cache object; bpe-openai shares a single
cache behind a mutex, which introduces contention visible at 8+ threads.
wordchipper instead distributes per-slot caches across its PoolToy thread
pool, eliminating cache contention entirely.
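The contention-avoidance idea can be sketched generically. Here a plain dict stands in for regex-automata's `Cache`, and ordinary Python threads stand in for PoolToy's pool slots; the point is only that each worker gets a private cache and never takes a lock:

```python
import threading


# Per-thread caches instead of one cache behind a mutex: lookups never
# block on other threads. A plain dict stands in for the real Cache.
class PerThreadCache:
    def __init__(self):
        self._local = threading.local()

    def get(self) -> dict:
        if not hasattr(self._local, "cache"):
            self._local.cache = {}  # created lazily, once per thread
        return self._local.cache


caches = PerThreadCache()
fresh = {}


def worker(idx):
    cache = caches.get()            # no lock; private to this thread
    fresh[idx] = (len(cache) == 0)  # every thread sees its own empty cache
    cache["state"] = idx


threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```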
logos DFA Lexers
The logos [9] crate compiles regex fragments into a DFA at build time via
a derive macro, eliminating runtime compilation and backtracking. For cl100k
and o200k, this yields ~14-21x throughput over fancy-regex (250–300 MB/s
single-threaded).
Two semantic gaps separate logos from the OpenAI patterns. First, logos has no
lookahead support, so the \s+(?!\S) branches cannot be expressed directly.
Second, logos uses longest-match semantics rather than regex first-match,
which produces different span boundaries when Unicode category classes overlap
across token branches (e.g. \p{M} appearing in both word and punctuation
patterns for o200k). A shared post-processing state machine corrects both:
it buffers whitespace tokens and, based on the classification of the following
token, splits the last character into the next span (simulating the lookahead)
or absorbs a preceding ASCII space into punctuation. Each logos token variant
maps to a role enum that drives these rewrite rules, keeping the
vocabulary-specific definitions (token enum, role mapping) separate from the
shared correction logic. Three vocabulary families are implemented:
r50k_base, cl100k_base, and o200k_base.
Correctness is validated by combinatorial equivalence testing. The Unicode predicates in the OpenAI patterns partition the codepoint space into 22 equivalence cells. The test harness generates all $k$-tuples ($k = 1, \dots, 6$) of 29 representative codepoints (~594 million inputs per lexer) and asserts byte-identical span boundaries against the regex reference.
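The shape of that harness can be sketched at toy scale: enumerate all $k$-tuples of a few representative codepoints and assert that two independently built lexers agree. The pattern and representatives below are invented for the sketch; the real harness uses 29 representatives, $k$ up to 6, and the regex engines as reference:

```python
from itertools import product
import re

# A handful of representative codepoints, one per toy equivalence cell.
REPRESENTATIVES = ["a", "1", " ", "\n", "."]

REFERENCE = re.compile(r"\w+|\s+|[^\w\s]+")


def reference_lex(text: str) -> list:
    return REFERENCE.findall(text)


def char_class(c: str) -> str:
    if c.isspace():
        return "s"
    if c.isalnum() or c == "_":
        return "w"
    return "p"


def candidate_lex(text: str) -> list:
    # Hand-rolled scanner intended to be equivalent to REFERENCE:
    # group maximal runs of same-class characters.
    out, i = [], 0
    while i < len(text):
        j = i + 1
        while j < len(text) and char_class(text[j]) == char_class(text[i]):
            j += 1
        out.append(text[i:j])
        i = j
    return out


# Exhaustively compare both lexers on every k-tuple, k = 1..3.
mismatches = [
    "".join(tup)
    for k in range(1, 4)
    for tup in product(REPRESENTATIVES, repeat=k)
    if reference_lex("".join(tup)) != candidate_lex("".join(tup))
]
```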
Span Encoder Implementations
When comparing span encoder algorithms, we normalize the throughput against the fastest span encoder at a given thread count.
Naively, we would expect CPU cache bandwidth effects to have a higher relative impact on span encoder throughput when the lexer consumes a lower portion of the total time. Faster regex patterns / lexer implementations should be more sensitive to cache effects.
For the fast pattern r50k, we see a strong preference for the
simple buffer algorithms (BufferSweep, TailSweep, MergeHeap) under the
fastest logos DFA lexer; more mixed results are seen when using the default
regex-automata backend; and complex algorithms (PriorityMerge,
BpeBacktrack) do relatively well under the slowest fancy-regex lexer.
For the slow pattern o200k, we see a strong preference for
simple buffer algorithms (BufferSweep, TailSweep, MergeHeap) under the
fastest logos DFA lexer; intermediate preferences under regex-automata,
and a collapse of distinctions under the slowest fancy-regex lexer. From
this we conclude that lexer bottleneck effects can entirely mask performance
differences between span encoders.
Single-string encoding throughput varies significantly with corpus composition. We compare an English prose corpus (~70 KB after 10x repeat) against a diverse multilingual stress corpus (~90 KB) spanning CJK, Arabic, Hebrew, Thai, emoji, and mixed-script runs. The D/E ratio (diverse throughput divided by English throughput) quantifies each encoder’s sensitivity to multi-byte, vocabulary-sparse input.
We select BpeBacktrack as the default encoder in wordchipper for its
robustness under pathological conditions. The cost of this selection
is a ~10% drop in throughput in normal operation, in exchange for avoiding a
~40-60% drop under pernicious conditions.
BufferSweep Encoder
The BufferSweep encoder is the reference
implementation against which all other span encoders are benchmarked. Its
working memory is a single retained token buffer, cleared but not deallocated
between calls.
The algorithm is $O(n^2)$ in span length: each merge requires a full scan to locate the minimum pair, and up to n merges may be performed. This cost is acceptable in practice because BPE spans are short by construction. The implementation’s value is in its simplicity — the behavior is unambiguous, making it the reference against which correctness of other encoders is verified.
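A toy version of the sweep makes the $O(n^2)$ structure concrete. String tokens and ranks here are invented for illustration; the real encoder works over token ids with a retained buffer:

```python
# Toy O(n^2) sweep BPE, mirroring the BufferSweep structure: each round
# scans the whole buffer for the lowest-ranked adjacent pair, merges it,
# and repeats until no mergeable pair remains.
def buffer_sweep(tokens: list, vocab_ranks: dict) -> list:
    buf = list(tokens)  # retained working buffer in the real encoder
    while True:
        best_i, best_rank = None, None
        for i in range(len(buf) - 1):  # full scan per merge round
            r = vocab_ranks.get((buf[i], buf[i + 1]))
            if r is not None and (best_rank is None or r < best_rank):
                best_i, best_rank = i, r
        if best_i is None:
            return buf
        buf[best_i:best_i + 2] = [buf[best_i] + buf[best_i + 1]]


# Invented merge table: lower rank merges first.
ranks = {("l", "o"): 0, ("lo", "w"): 1, ("h", "e"): 2}
# buffer_sweep(list("hello"), ranks) → ["he", "l", "lo"]
```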
TailSweep Encoder
The TailSweep encoder applies the same $O(n^2)$ sweep
algorithm as BufferSweep, but eliminates the separate working buffer by
operating directly on the tail of the output buffer. The encoder carries no
per-instance state.
The tradeoff against BufferSweep is allocation pattern versus cache
behavior. TailSweep avoids the separate working buffer and the final copy,
but performs merges directly into the shared output buffer, which grows
monotonically across spans. Whether this improves or degrades cache
utilization relative to BufferSweep’s isolated working buffer is
workload-dependent and is one of the questions the benchmark cross-product is
designed to answer.
MergeHeap Encoder
The MergeHeap encoder retains the $O(n^2)$ sweep
structure of TailSweep but eliminates repeated pair vocabulary lookups by
maintaining a parallel p_ranks buffer alongside the token buffer. This
buffer caches the merge rank for every adjacent token pair, kept invariant
across merges by recomputing only the two entries neighboring each merge site.
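The p_ranks invariant can be sketched at toy scale: entry `i` caches the rank of the pair `(buf[i], buf[i+1])`, and after each merge only the two entries neighboring the merge site are recomputed. Tokens and ranks are invented for the sketch:

```python
INF = float("inf")


def merge_with_p_ranks(tokens: list, vocab_ranks: dict) -> list:
    buf = list(tokens)

    def rank(i: int) -> float:
        return vocab_ranks.get((buf[i], buf[i + 1]), INF)

    # p_ranks[i] caches the merge rank of pair (buf[i], buf[i+1]).
    p_ranks = [rank(i) for i in range(len(buf) - 1)]
    while p_ranks and min(p_ranks) != INF:
        i = p_ranks.index(min(p_ranks))       # scan cached ranks, no lookups
        buf[i:i + 2] = [buf[i] + buf[i + 1]]  # merge the pair at i
        del p_ranks[i]                        # that pair no longer exists
        if i > 0:
            p_ranks[i - 1] = rank(i - 1)      # left neighbor changed
        if i < len(p_ranks):
            p_ranks[i] = rank(i)              # right neighbor changed
    return buf


# Invented merge table, shared with the sweep sketches above in spirit.
ranks = {("l", "o"): 0, ("lo", "w"): 1, ("h", "e"): 2}
```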
PriorityMerge Encoder
The PriorityMerge encoder maintains the span as a doubly linked list of
tokens and replaces the linear scan per merge round (as in Sennrich et
al. [10]) with a standard binary min-heap that yields the lowest-ranked pair
directly. The linked list gives $O(1)$ removal and neighbor access; the heap
gives $O(\log n)$ extraction. Stale entries — invalidated by prior merges —
are detected in $O(1)$ by comparing stored token identities against current
node state, a standard lazy-deletion technique. Since each merge reduces the
node count by one and enqueues at most two new pairs, total heap operations
are linear in span length, giving $O(n \log n)$ per span. HuggingFace
tokenizers [11] uses a structurally similar heap-and-linked-list approach.
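The structure can be sketched with index-based prev/next arrays standing in for linked-list nodes, and Python's `heapq` as the min-heap. Ranks are invented; stale heap entries are detected exactly as described, by comparing the stored tokens against the current node contents:

```python
import heapq


def priority_merge(tokens: list, vocab_ranks: dict) -> list:
    toks = list(tokens)
    nxt = list(range(1, len(toks))) + [None]   # linked-list successor
    prv = [None] + list(range(len(toks) - 1))  # linked-list predecessor
    alive = [True] * len(toks)
    heap = []

    def push(i):
        # Enqueue the pair starting at node i, remembering its tokens so
        # staleness can be detected in O(1) at pop time.
        j = nxt[i]
        if j is not None:
            r = vocab_ranks.get((toks[i], toks[j]))
            if r is not None:
                heapq.heappush(heap, (r, i, toks[i], toks[j]))

    for i in range(len(toks)):
        push(i)

    while heap:
        r, i, left, right = heapq.heappop(heap)
        j = nxt[i] if alive[i] else None
        # Lazy deletion: skip entries invalidated by earlier merges.
        if j is None or not alive[j] or toks[i] != left or toks[j] != right:
            continue
        toks[i] = left + right  # merge node j into node i
        alive[j] = False
        nxt[i] = nxt[j]
        if nxt[j] is not None:
            prv[nxt[j]] = i
        if prv[i] is not None:
            push(prv[i])        # at most two new candidate pairs
        push(i)

    return [t for t, a in zip(toks, alive) if a]


ranks = {("l", "o"): 0, ("lo", "w"): 1, ("h", "e"): 2}  # invented ranks
```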
BpeBacktrack Encoder
The BpeBacktrack encoder works top-down rather than
bottom-up: instead of starting from bytes and merging upward, it greedily
matches the longest token at each position via the Aho-Corasick automaton [12]
and backtracks when boundary validation fails. The key invariant is that a
bitfield tracks candidate positions, and positions are only ever cleared,
never re-set. This monotonicity bounds total backtracking to at most n steps
across the entire span — each byte position is visited at most once going
forward and at most once during backtracking. The prefix-chain walk at each
position and the ValidPair decomposition check are bounded by vocabulary
depth, independent of input length. Total work is $O(n)$ amortized in span
length. The algorithm and its correctness argument derive from van Antwerpen
and Neubeck [13].
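Only the top-down greedy direction is sketched below; the boundary validation, bitfield of candidate positions, and monotone backtracking that make the real encoder's output exactly match BPE are omitted, and the vocabulary is invented. A linear scan over candidate lengths stands in for the Aho-Corasick walk:

```python
# Greedy longest-token matching: the top-down half of the algorithm.
def greedy_longest_match(text: str, vocab: set) -> list:
    out, i = [], 0
    max_len = max(map(len, vocab))
    while i < len(text):
        # Try candidate lengths longest-first; the real encoder drives
        # this walk with an Aho-Corasick automaton instead of slicing.
        for length in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + length] in vocab:
                out.append(text[i:i + length])
                i += length
                break
        else:
            raise ValueError(f"no token matches at position {i}")
    return out


vocab = {"h", "e", "l", "o", "he", "lo", "hell"}  # invented vocabulary
# greedy_longest_match("hello", vocab) → ["hell", "o"]
```

Note that greedy longest match alone need not agree with the merge-defined BPE segmentation; that disagreement is precisely what the ValidPair checks and backtracking in BpeBacktrack exist to repair.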