wordchipper: Fast BPE Tokenization with Substitutable Internals
Read the full paper
Authors:
- Crutcher Dunnavant (crutcher@gmail.com)
- Dilshod Tadjibaev (dilshod@gmail.com)
wordchipper is a Rust / Python BPE Tokenizer library.
We present wordchipper [1], a Rust-based byte-pair encoding (BPE)
tokenizer for the OpenAI GPT-2 tokenizer family. Through a combination of strict
allocation discipline, factoring along the implementation boundaries of the
pre-tokenization and BPE merge algorithm choices, thread-local resources, and
extensive metrics, we achieve throughput speedups relative to
tiktoken-rs [2] in Rust on a 64-core machine of ~4.3-5.7x (4 to 64 cores)
for general regex BPE vocabularies, and ~6.9-9.2x when using custom DFA lexers
for specific OpenAI vocabularies. Under the Python wrappers, we see speedups of
~2-4x (4 to 64 cores) over tiktoken [3]. The substitutable design
yields a benchmark cross-product that reveals workload-dependent encoder
selection and corpus-modulated performance inversion between algorithm families.
[Figure: Rust throughput vs. thread count]
The central goal of wordchipper is to serve as its own experimental
apparatus. Rather than committing to a single pre-tokenization strategy or BPE
merge algorithm, the library defines stable interface boundaries at
algorithmic decision points, permitting implementations to be substituted
independently. This yields a benchmark space that is the cross-product of span
encoder implementations, spanner backends, vocabulary configurations, and
thread counts.
In wordchipper, a spanner partitions raw input text into a sequence of
labeled spans, each carrying a byte slice and a classification: a Special
token, a Word, or a Gap. The spanner’s sole responsibility is this
segmentation; it has no knowledge of the token vocabulary or merge rules. A
span encoder accepts a single word span and a vocabulary, and is responsible
for resolving it to a sequence of one or more tokens via BPE merge. The span
encoder owns whatever working memory its algorithm requires — scratch
buffers, priority queues, linked list nodes — retaining that memory across
calls to avoid repeated allocation on the fast path.
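The spanner / span-encoder division described above can be sketched at toy scale. The names (`SpanKind`, `Span`, `spanner`) and the whitespace-splitting rule below are illustrative stand-ins for this sketch, not wordchipper's actual Rust types or its pre-tokenization patterns:

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import List, Set


class SpanKind(Enum):
    SPECIAL = auto()
    WORD = auto()
    GAP = auto()


@dataclass
class Span:
    kind: SpanKind
    text: bytes


def spanner(text: bytes, specials: Set[bytes]) -> List[Span]:
    """Toy spanner: splits on ASCII whitespace and flags special tokens.
    It knows nothing about the token vocabulary or merge rules."""
    spans, i = [], 0
    while i < len(text):
        if text[i:i + 1].isspace():
            j = i
            while j < len(text) and text[j:j + 1].isspace():
                j += 1
            spans.append(Span(SpanKind.GAP, text[i:j]))
        else:
            j = i
            while j < len(text) and not text[j:j + 1].isspace():
                j += 1
            chunk = text[i:j]
            kind = SpanKind.SPECIAL if chunk in specials else SpanKind.WORD
            spans.append(Span(kind, chunk))
        i = j
    return spans
```

A span encoder would then consume only the `WORD` spans, resolving each to tokens; the spanner's output alone fully reconstructs the input.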
Python Throughput
The Python binding exposes wordchipper via PyO3 [4], a Rust crate that
generates a native CPython extension module. The extension is built and
packaged using maturin [5], which handles ABI compatibility and PyPI
publication. The binding is intentionally thin: the core Tokenizer and
TokenizerOptions types are re-exported directly from the Rust library, with
no reimplementation of encoding logic in Python. The GIL is released during
encoding via py.detach(), permitting concurrent calls from Python threads
without serialization through the interpreter lock.
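From the Python side, the call pattern looks like the sketch below. The `FakeTokenizer` here is a pure-Python stand-in, not wordchipper's `Tokenizer`; with the real extension, `encode()` releases the GIL, so these worker threads run in parallel rather than serializing on the interpreter lock:

```python
from concurrent.futures import ThreadPoolExecutor


# Pure-Python stand-in for the native Tokenizer; the real extension
# releases the GIL inside encode(), so the pool's threads can run
# concurrently instead of taking turns holding the interpreter lock.
class FakeTokenizer:
    def encode(self, text: str) -> list:
        return [ord(c) for c in text]  # placeholder, not real BPE


chunks = ["hello world"] * 8
tok = FakeTokenizer()
with ThreadPoolExecutor(max_workers=4) as pool:
    token_lists = list(pool.map(tok.encode, chunks))
```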
For the fast pattern gpt2, the logos,
regex-automata, and fancy-regex lexers all outperform tiktoken and
tokenizers. There is no r50k support for bpe_openai yet.
For low thread counts, our rayon + regex-automata implementation (the
default) outperforms tiktoken by ~2x. Above 8 threads, the performance of
the Python tiktoken implementation collapses; at 64 threads, the ratio grows
to 4.1x, with rayon + regex-automata reaching a peak throughput of
104.3 MiB/s.
Across thread counts, our rayon + logos implementation runs at 1.1x-1.2x the
performance of our rayon + regex-automata implementation, reaching a peak
throughput of 116.7 MiB/s on 64 threads.
The performance of the o200k pattern is
similar to the gpt2 pattern, with the rayon + logos implementation
outperforming the rayon + regex-automata implementation.
There are few meaningful differences outside the performance of tiktoken.
Lexer Implementations
A spanner’s segmentation is driven by a lexer — an engine that applies the vocabulary’s pre-tokenization pattern to raw input bytes. Because pre-tokenization must scan every byte of the input, lexer throughput is a primary determinant of end-to-end performance. The spanner interface permits substitution of the underlying lexer implementation independently of the span encoder and vocabulary configuration, enabling controlled measurement of lexer throughput in isolation.
fancy-regex Lexer
The first lexer implementation in wordchipper was built upon the
fancy-regex [6] crate, as used by tiktoken [2]. This regex library
supports extended features such as negative lookahead, which is used
to implement the (?!\S) whitespace fixup pattern.
regex-automata Lexer
The regex-automata [7] crate compiles regular expressions into a DFA at
runtime, avoiding the backtracking interpreter used by fancy-regex. It does
not support lookaheads, so the \s+(?!\S) branches in the OpenAI patterns
must be transformed before compilation. For each known pattern, wordchipper
maintains a parallel version with the lookahead branches collapsed to \s+
and applies a lightweight post-processing pass that truncates multi-character
whitespace matches by one trailing character, reproducing the lookahead
semantics.
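The collapse-then-truncate idea can be illustrated at toy scale. The two-branch pattern below is a simplified stand-in for the OpenAI regexes, not the actual patterns; the post-processing pass is the one-character truncation described above:

```python
import re

# Simplified stand-in pattern: a " ?\S+" word branch plus a plain \s+
# branch, i.e. the \s+(?!\S) lookahead branch collapsed to \s+.
PATTERN = re.compile(r" ?\S+|\s+")


def lex_with_fixup(text: str) -> list:
    spans, pos = [], 0
    while pos < len(text):
        m = PATTERN.match(text, pos)
        span, end = m.group(), m.end()
        # Post-processing pass: a multi-character whitespace match that
        # is followed by non-whitespace gives back one trailing
        # character (picked up by the next " ?\S+" match), reproducing
        # the \s+(?!\S) lookahead semantics without lookahead support.
        if span.isspace() and len(span) > 1 and end < len(text):
            span, end = span[:-1], end - 1
        spans.append(span)
        pos = end
    return spans
```

For `"foo   bar"` this yields `["foo", "  ", " bar"]`: the whitespace run surrenders its final space to the following word, exactly as the lookahead branch would.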
This approach was motivated by bpe-openai [8], which demonstrated that
regex-automata provides substantial speedups over fancy-regex for these
patterns. In wordchipper, the regex-automata backend serves as a middle
tier: ~4-8x faster than fancy-regex, available for all patterns without
per-pattern development, but slower than the compile-time logos DFAs. Search
state is held in an external Cache object; bpe-openai shares a single
cache behind a mutex, which introduces contention visible at 8+ threads.
wordchipper instead distributes per-slot caches across its PoolToy thread
pool, eliminating cache contention entirely.
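The contention-avoidance idea can be sketched generically. Here a plain dict stands in for regex-automata's `Cache`, and ordinary Python threads stand in for PoolToy's pool slots; the point is only that each worker gets a private cache and never takes a lock:

```python
import threading


# Per-thread caches instead of one cache behind a mutex: lookups never
# block on other threads. A plain dict stands in for the real Cache.
class PerThreadCache:
    def __init__(self):
        self._local = threading.local()

    def get(self) -> dict:
        if not hasattr(self._local, "cache"):
            self._local.cache = {}  # created lazily, once per thread
        return self._local.cache


caches = PerThreadCache()
fresh = {}


def worker(idx):
    cache = caches.get()            # no lock; private to this thread
    fresh[idx] = (len(cache) == 0)  # every thread sees its own empty cache
    cache["state"] = idx


threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```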
logos DFA Lexers
The logos [9] crate compiles regex fragments into a DFA at build time via
a derive macro, eliminating runtime compilation and backtracking. For cl100k
and o200k, this yields ~14-21x throughput over fancy-regex (250–300 MB/s
single-threaded).
Two semantic gaps separate logos from the OpenAI patterns. First, logos has no
lookahead support, so the \s+(?!\S) branches cannot be expressed directly.
Second, logos uses longest-match semantics rather than regex first-match,
which produces different span boundaries when Unicode category classes overlap
across token branches (e.g. \p{M} appearing in both word and punctuation
patterns for o200k). A shared post-processing state machine corrects both:
it buffers whitespace tokens and, based on the classification of the following
token, splits the last character into the next span (simulating the lookahead)
or absorbs a preceding ASCII space into punctuation. Each logos token variant
maps to a role enum that drives these rewrite rules, keeping the
vocabulary-specific definitions (token enum, role mapping) separate from the
shared correction logic. Three vocabulary families are implemented:
r50k_base, cl100k_base, and o200k_base.
Correctness is validated by combinatorial equivalence testing. The Unicode predicates in the OpenAI patterns partition the codepoint space into 22 equivalence cells. The test harness generates all $k$-tuples ($k = 1, \dots, 6$) of 29 representative codepoints (~594 million inputs per lexer) and asserts byte-identical span boundaries against the regex reference.
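The shape of that harness can be sketched at toy scale: enumerate all $k$-tuples of a few representative codepoints and assert that two independently built lexers agree. The pattern and representatives below are invented for the sketch; the real harness uses 29 representatives, $k$ up to 6, and the regex engines as reference:

```python
from itertools import product
import re

# A handful of representative codepoints, one per toy equivalence cell.
REPRESENTATIVES = ["a", "1", " ", "\n", "."]

REFERENCE = re.compile(r"\w+|\s+|[^\w\s]+")


def reference_lex(text: str) -> list:
    return REFERENCE.findall(text)


def char_class(c: str) -> str:
    if c.isspace():
        return "s"
    if c.isalnum() or c == "_":
        return "w"
    return "p"


def candidate_lex(text: str) -> list:
    # Hand-rolled scanner intended to be equivalent to REFERENCE:
    # group maximal runs of same-class characters.
    out, i = [], 0
    while i < len(text):
        j = i + 1
        while j < len(text) and char_class(text[j]) == char_class(text[i]):
            j += 1
        out.append(text[i:j])
        i = j
    return out


# Exhaustively compare both lexers on every k-tuple, k = 1..3.
mismatches = [
    "".join(tup)
    for k in range(1, 4)
    for tup in product(REPRESENTATIVES, repeat=k)
    if reference_lex("".join(tup)) != candidate_lex("".join(tup))
]
```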
Span Encoder Implementations
When comparing span encoder algorithms, we normalize the throughput against the fastest span encoder at a given thread count.
Naively, we would expect CPU cache bandwidth effects to have a higher relative impact on span encoder throughput when the lexer consumes a lower portion of the total time. Faster regex patterns / lexer implementations should be more sensitive to cache effects.
For the fast pattern r50k, we see a strong preference for the
simple buffer algorithms (BufferSweep, TailSweep, MergeHeap) under the
fastest logos DFA lexer; more mixed results are seen when using the default
regex-automata backend; and complex algorithms (PriorityMerge,
BpeBacktrack) do relatively well under the slowest fancy-regex lexer.
For the slow pattern o200k, we see a strong preference for
simple buffer algorithms (BufferSweep, TailSweep, MergeHeap) under the
fastest logos DFA lexer; intermediate preferences under regex-automata,
and a collapse of distinctions under the slowest fancy-regex lexer. From
this we conclude that lexer bottleneck effects can entirely mask performance
differences between span encoders.
Single-string encoding throughput varies significantly with corpus composition. We compare an English prose corpus (~70 KB after 10x repeat) against a diverse multilingual stress corpus (~90 KB) spanning CJK, Arabic, Hebrew, Thai, emoji, and mixed-script runs. The D/E ratio (diverse throughput divided by English throughput) quantifies each encoder’s sensitivity to multi-byte, vocabulary-sparse input.
We select BpeBacktrack as the default encoder in wordchipper for its
robustness under pathological conditions. The cost of this selection
is a ~10% drop in throughput in normal operation, in exchange for avoiding a
~40-60% drop under pernicious conditions.
BufferSweep Encoder
The BufferSweep encoder is the reference
implementation against which all other span encoders are benchmarked. Its
working memory is a single retained token buffer, cleared but not deallocated
between calls.
The algorithm is $O(n^2)$ in span length: each merge requires a full scan to locate the minimum pair, and up to n merges may be performed. This cost is acceptable in practice because BPE spans are short by construction. The implementation’s value is in its simplicity — the behavior is unambiguous, making it the reference against which correctness of other encoders is verified.
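A toy version of the sweep makes the $O(n^2)$ structure concrete. String tokens and ranks here are invented for illustration; the real encoder works over token ids with a retained buffer:

```python
# Toy O(n^2) sweep BPE, mirroring the BufferSweep structure: each round
# scans the whole buffer for the lowest-ranked adjacent pair, merges it,
# and repeats until no mergeable pair remains.
def buffer_sweep(tokens: list, vocab_ranks: dict) -> list:
    buf = list(tokens)  # retained working buffer in the real encoder
    while True:
        best_i, best_rank = None, None
        for i in range(len(buf) - 1):  # full scan per merge round
            r = vocab_ranks.get((buf[i], buf[i + 1]))
            if r is not None and (best_rank is None or r < best_rank):
                best_i, best_rank = i, r
        if best_i is None:
            return buf
        buf[best_i:best_i + 2] = [buf[best_i] + buf[best_i + 1]]


# Invented merge table: lower rank merges first.
ranks = {("l", "o"): 0, ("lo", "w"): 1, ("h", "e"): 2}
# buffer_sweep(list("hello"), ranks) → ["he", "l", "lo"]
```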
TailSweep Encoder
The TailSweep encoder applies the same $O(n^2)$ sweep
algorithm as BufferSweep, but eliminates the separate working buffer by
operating directly on the tail of the output buffer. The encoder carries no
per-instance state.
The tradeoff against BufferSweep is allocation pattern versus cache
behavior. TailSweep avoids the separate working buffer and the final copy,
but performs merges directly into the shared output buffer, which grows
monotonically across spans. Whether this improves or degrades cache
utilization relative to BufferSweep’s isolated working buffer is
workload-dependent and is one of the questions the benchmark cross-product is
designed to answer.
MergeHeap Encoder
The MergeHeap encoder retains the $O(n^2)$ sweep
structure of TailSweep but eliminates repeated pair vocabulary lookups by
maintaining a parallel p_ranks buffer alongside the token buffer. This
buffer caches the merge rank for every adjacent token pair, kept invariant
across merges by recomputing only the two entries neighboring each merge site.
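The p_ranks invariant can be sketched at toy scale: entry `i` caches the rank of the pair `(buf[i], buf[i+1])`, and after each merge only the two entries neighboring the merge site are recomputed. Tokens and ranks are invented for the sketch:

```python
INF = float("inf")


def merge_with_p_ranks(tokens: list, vocab_ranks: dict) -> list:
    buf = list(tokens)

    def rank(i: int) -> float:
        return vocab_ranks.get((buf[i], buf[i + 1]), INF)

    # p_ranks[i] caches the merge rank of pair (buf[i], buf[i+1]).
    p_ranks = [rank(i) for i in range(len(buf) - 1)]
    while p_ranks and min(p_ranks) != INF:
        i = p_ranks.index(min(p_ranks))       # scan cached ranks, no lookups
        buf[i:i + 2] = [buf[i] + buf[i + 1]]  # merge the pair at i
        del p_ranks[i]                        # that pair no longer exists
        if i > 0:
            p_ranks[i - 1] = rank(i - 1)      # left neighbor changed
        if i < len(p_ranks):
            p_ranks[i] = rank(i)              # right neighbor changed
    return buf


# Invented merge table, shared with the sweep sketches above in spirit.
ranks = {("l", "o"): 0, ("lo", "w"): 1, ("h", "e"): 2}
```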
PriorityMerge Encoder
The PriorityMerge encoder maintains the span as a doubly linked list of
tokens and replaces the linear scan per merge round (as in Sennrich et
al. [10]) with a standard binary min-heap that yields the lowest-ranked pair
directly. The linked list gives $O(1)$ removal and neighbor access; the heap
gives $O(\log n)$ extraction. Stale entries — invalidated by prior merges —
are detected in $O(1)$ by comparing stored token identities against current
node state, a standard lazy-deletion technique. Since each merge reduces the
node count by one and enqueues at most two new pairs, total heap operations
are linear in span length, giving $O(n \log n)$ per span. HuggingFace
tokenizers [11] uses a structurally similar heap-and-linked-list approach.
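The structure can be sketched with index-based prev/next arrays standing in for linked-list nodes, and Python's `heapq` as the min-heap. Ranks are invented; stale heap entries are detected exactly as described, by comparing the stored tokens against the current node contents:

```python
import heapq


def priority_merge(tokens: list, vocab_ranks: dict) -> list:
    toks = list(tokens)
    nxt = list(range(1, len(toks))) + [None]   # linked-list successor
    prv = [None] + list(range(len(toks) - 1))  # linked-list predecessor
    alive = [True] * len(toks)
    heap = []

    def push(i):
        # Enqueue the pair starting at node i, remembering its tokens so
        # staleness can be detected in O(1) at pop time.
        j = nxt[i]
        if j is not None:
            r = vocab_ranks.get((toks[i], toks[j]))
            if r is not None:
                heapq.heappush(heap, (r, i, toks[i], toks[j]))

    for i in range(len(toks)):
        push(i)

    while heap:
        r, i, left, right = heapq.heappop(heap)
        j = nxt[i] if alive[i] else None
        # Lazy deletion: skip entries invalidated by earlier merges.
        if j is None or not alive[j] or toks[i] != left or toks[j] != right:
            continue
        toks[i] = left + right  # merge node j into node i
        alive[j] = False
        nxt[i] = nxt[j]
        if nxt[j] is not None:
            prv[nxt[j]] = i
        if prv[i] is not None:
            push(prv[i])        # at most two new candidate pairs
        push(i)

    return [t for t, a in zip(toks, alive) if a]


ranks = {("l", "o"): 0, ("lo", "w"): 1, ("h", "e"): 2}  # invented ranks
```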
BpeBacktrack Encoder
The BpeBacktrack encoder works top-down rather than
bottom-up: instead of starting from bytes and merging upward, it greedily
matches the longest token at each position via the Aho-Corasick automaton [12]
and backtracks when boundary validation fails. The key invariant is that a
bitfield tracks candidate positions, and positions are only ever cleared,
never re-set. This monotonicity bounds total backtracking to at most n steps
across the entire span — each byte position is visited at most once going
forward and at most once during backtracking. The prefix-chain walk at each
position and the ValidPair decomposition check are bounded by vocabulary
depth, independent of input length. Total work is $O(n)$ amortized in span
length. The algorithm and its correctness argument derive from van Antwerpen
and Neubeck [13].
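Only the top-down greedy direction is sketched below; the boundary validation, bitfield of candidate positions, and monotone backtracking that make the real encoder's output exactly match BPE are omitted, and the vocabulary is invented. A linear scan over candidate lengths stands in for the Aho-Corasick walk:

```python
# Greedy longest-token matching: the top-down half of the algorithm.
def greedy_longest_match(text: str, vocab: set) -> list:
    out, i = [], 0
    max_len = max(map(len, vocab))
    while i < len(text):
        # Try candidate lengths longest-first; the real encoder drives
        # this walk with an Aho-Corasick automaton instead of slicing.
        for length in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + length] in vocab:
                out.append(text[i:i + length])
                i += length
                break
        else:
            raise ValueError(f"no token matches at position {i}")
    return out


vocab = {"h", "e", "l", "o", "he", "lo", "hell"}  # invented vocabulary
# greedy_longest_match("hello", vocab) → ["hell", "o"]
```

Note that greedy longest match alone need not agree with the merge-defined BPE segmentation; that disagreement is precisely what the ValidPair checks and backtracking in BpeBacktrack exist to repair.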