Introduction
What is a tokenizer?
Large language models don't read text the way humans do. Before a model can process the sentence "The cat sat on the mat," it must convert that string of characters into a sequence of numbers called tokens. Each token is an integer that maps to an entry in a fixed vocabulary. The process of converting text to tokens is called tokenization, and the software that does it is a tokenizer.
Tokenization sits at the boundary between human-readable text and machine-readable numbers. Every time you send a prompt to an LLM, a tokenizer runs first. Every time the model generates a response, a tokenizer runs in reverse to turn the output tokens back into text.
A simple example
Consider the string "hello world". A tokenizer with a vocabulary like OpenAI's cl100k_base would
encode it as two tokens:
"hello world" -> [15339, 1917]
The token 15339 maps to the bytes for "hello", and 1917 maps to " world" (note the leading
space). Decoding reverses this: [15339, 1917] becomes the original string.
Not all strings split at word boundaries. The tokenizer might split "tokenization" into
["token", "ization"], or keep "the" as a single token. The exact splits depend on the vocabulary
and algorithm.
Why does tokenization matter?
Tokenization affects LLM behavior in ways that are easy to overlook:
- Cost. API pricing is per-token. A tokenizer that produces fewer tokens for the same text saves money. A tokenizer that splits your language inefficiently costs more.
- Context window. Models have a fixed token limit. Efficient tokenization means more text fits in the window.
- Latency. Each token requires a forward pass through the model. Fewer tokens means faster inference.
- Multilingual fairness. Some tokenizers handle English efficiently but split Chinese, Arabic, or Hindi into many more tokens for the same semantic content.
- Reproducibility. If you're building retrieval, caching, or evaluation pipelines, you need a tokenizer that produces exactly the same output as the model's own tokenizer. Approximate doesn't cut it.
What is BPE?
Byte Pair Encoding (BPE) is the most widely used tokenization algorithm for large language
models. It works by starting with individual bytes and iteratively merging the most frequent
adjacent pairs into new tokens. The result is a vocabulary of subword units that balances two
goals: common words like "the" become a single token, while rare words like "defenestration"
decompose into reusable pieces (["def", "en", "est", "ration"]).
BPE is used by GPT-2 through GPT-4o and most of their derivatives. The How Tokenizers Work chapter walks through the algorithm step by step.
What is wordchipper?
wordchipper is a high-performance BPE tokenizer library written in Rust. It encodes and decodes text using the same vocabularies as OpenAI's models (GPT-2, GPT-3.5, GPT-4, GPT-4o), producing identical output to the reference implementations.
Key properties:
- Fast. Up to 2x the throughput of tiktoken-rs on encode/decode benchmarks, with 30-50x faster pre-tokenization via compile-time DFA lexers.
- Correct. Validated against OpenAI's tiktoken and HuggingFace tokenizers on large corpora. Token-for-token identical output.
- Portable. Core tokenization works in no_std environments. Runs on WASM, embedded targets, and bare-metal Rust.
- Multi-language. Native Rust API, plus Python and JavaScript/TypeScript bindings.
- Extensible. Bring your own vocabulary, write custom lexers, swap BPE algorithms.
Supported models
| Model | Used by | Vocab size |
|---|---|---|
| r50k_base | GPT-2 | ~50k |
| p50k_base | Codex | ~50k |
| p50k_edit | Codex (edit) | ~50k |
| cl100k_base | GPT-3.5, GPT-4 | ~100k |
| o200k_base | GPT-4o | ~200k |
| o200k_harmony | GPT-4o (harmony) | ~200k |
Who is this book for?
This book is for anyone who wants to understand how tokenizers work and how to use wordchipper effectively. It's structured in layers:
- If you're new to tokenizers, start with How Tokenizers Work for a visual, step-by-step explanation of BPE.
- If you just want to encode text, skip to Getting Started for a working example in five lines.
- If you want to train your own tokenizer, read Training Your Own Tokenizer for the CLI quick-start and Rust API deep dive.
- If you're optimizing throughput, read Performance to understand the available BPE algorithms and DFA acceleration.
- If you're building something custom, see Building Custom Logos Lexers and Advanced: Span Encoders.
- If you want to try it now, the Interactive Tokenizer Demo runs in your browser via WASM.
Getting Started
Installation
Add wordchipper to your Cargo.toml:
[dependencies]
wordchipper = "0.7"
The default features include std, fast-hash (foldhash hasher), and parallel (rayon). Add
client for vocabulary download. See Feature Flags for all options and
no_std & Embedded for minimal configurations.
Encode and decode in five lines
use wordchipper::{
    Tokenizer, TokenizerOptions, TokenEncoder, TokenDecoder,
    load_vocab, disk_cache::WordchipperDiskCache,
};

fn main() -> wordchipper::WCResult<()> {
    // Load vocabulary (downloads and caches on first run)
    let mut cache = WordchipperDiskCache::default();
    let (_desc, vocab) = load_vocab("openai:cl100k_base", &mut cache)?;

    // Build a tokenizer
    let tok = TokenizerOptions::default().build(vocab);

    // Encode
    let tokens = tok.try_encode("hello world")?;
    assert_eq!(tokens, vec![15339, 1917]);

    // Decode
    let text = tok.try_decode_to_string(&tokens)?;
    assert_eq!(text, "hello world");

    Ok(())
}
The first call to load_vocab downloads the vocabulary file from OpenAI's CDN and caches it in
~/.cache/wordchipper/. Subsequent calls load from disk.
Available models
use wordchipper::list_models;

fn main() {
    for model in list_models() {
        println!("{}", model);
    }
}
This prints all registered models with their namespace prefix:
openai::r50k_base
openai::p50k_base
openai::p50k_edit
openai::cl100k_base
openai::o200k_base
openai::o200k_harmony
You can also use short names without the openai:: prefix when loading:
#![allow(unused)]
fn main() {
    use wordchipper::{load_vocab, disk_cache::WordchipperDiskCache};

    let mut cache = WordchipperDiskCache::default();
    let (_desc, vocab) = load_vocab("o200k_base", &mut cache).unwrap();
}
Batch encoding
For multiple strings, batch encoding runs spans in parallel across threads (when the parallel
feature is enabled):
use wordchipper::{
    TokenEncoder, TokenizerOptions,
    load_vocab, disk_cache::WordchipperDiskCache,
};

fn main() -> wordchipper::WCResult<()> {
    let mut cache = WordchipperDiskCache::default();
    let (_desc, vocab) = load_vocab("openai:cl100k_base", &mut cache)?;

    let tok = TokenizerOptions::default()
        .with_parallel(true)
        .build(vocab);

    let texts = vec!["hello world", "the cat sat on the mat", "foo bar baz"];
    let batch = tok.try_encode_batch(&texts)?;

    for (text, tokens) in texts.iter().zip(batch.iter()) {
        println!("{:?} -> {:?}", text, tokens);
    }

    Ok(())
}
Choosing a token type
wordchipper is generic over the token integer type. Most users should use u32, which is the
default. If your vocabulary fits in 16 bits (< 65,536 entries), you can use u16 to save memory:
use std::sync::Arc;
use wordchipper::{
    UnifiedTokenVocab, TokenizerOptions,
    load_vocab, disk_cache::WordchipperDiskCache,
};

fn main() -> wordchipper::WCResult<()> {
    let mut cache = WordchipperDiskCache::default();
    let (_desc, vocab) = load_vocab("openai:r50k_base", &mut cache)?;

    // Convert to u16 (works for r50k_base with ~50k tokens)
    let vocab_u16: Arc<UnifiedTokenVocab<u16>> =
        Arc::new(vocab.as_ref().to_token_type());

    let tok = TokenizerOptions::default().build(vocab_u16);
    Ok(())
}
For cl100k_base (~100k tokens) and o200k_base (~200k tokens), use u32.
What happens under the hood
When you call try_encode("hello world"), wordchipper runs a two-phase pipeline:
1. Spanning. The text is split into coarse segments called spans. For OpenAI models, this uses a regex pattern (or a faster DFA lexer) that handles whitespace, punctuation, letters, and digits. "hello world" becomes ["hello", " world"].
2. BPE encoding. Each span is encoded independently using Byte Pair Encoding. The encoder iteratively merges adjacent byte pairs according to the vocabulary's merge table, always picking the lowest-ranked (highest priority) pair first. The result is a sequence of token IDs.
Decoding reverses the process: each token ID maps to a byte sequence, and the byte sequences are concatenated to produce the original text.
The next chapter explains this pipeline in detail, with visual examples of how BPE works.
How Tokenizers Work
This chapter explains the algorithms behind BPE tokenization. If you already know how BPE works, skip to Pretrained Models. To see tokenization in action, try the Interactive Tokenizer Demo.
The problem: turning text into numbers
Neural networks operate on fixed-size numerical inputs. A language model needs a way to convert arbitrary text into a sequence of integers from a finite vocabulary. The simplest approaches have obvious downsides:
- Character-level: Vocabulary is small (~256 for ASCII, ~150k for full Unicode), but sequences are long. "tokenization" becomes 12 tokens. Long sequences are expensive for attention mechanisms.
- Word-level: "tokenization" is one token, but the vocabulary must include every possible word. Rare words, typos, and new terms become <UNK> (unknown). Multilingual text explodes the vocabulary.
- Subword: Split text into pieces that balance vocabulary size against sequence length. "tokenization" might become ["token", "ization"]. Common words stay whole. Rare words decompose into reusable parts.
BPE is a subword tokenization algorithm. It's used by GPT-2 through GPT-4o, and it's what wordchipper implements.
Byte Pair Encoding (BPE)
BPE was originally a data compression algorithm (Gage, 1994). The NLP adaptation (Sennrich et al., 2016) works in two phases: training and encoding.
Training: building the vocabulary
Training starts with a base vocabulary of individual bytes (0-255) and iteratively discovers the most useful multi-byte tokens. In practice, the training corpus is first split into spans by a regex pattern (see Pre-tokenization below), and pair counting happens within each span independently. The regex determines what token boundaries are possible. The steps below describe the core merge algorithm that runs on those spans.
Step 1. Start with every unique byte as a token. The vocabulary has 256 entries.
Step 2. Count every adjacent pair of tokens across all spans. Find the most frequent pair.
Step 3. Create a new token that represents that pair merged together. Add it to the vocabulary.
Record the merge rule: (token_a, token_b) -> new_token.
Step 4. Replace all occurrences of that pair in the corpus with the new token.
Step 5. Repeat from Step 2 until the vocabulary reaches the desired size.
Example: training on "aabaa aab"
Start with bytes: a=0, b=1, space=2
Corpus: [a, a, b, a, a, ' ', a, a, b]
Round 1: Most frequent pair is (a, a), count=3
Merge: (a, a) -> aa (token 3)
Corpus becomes: [aa, b, aa, ' ', aa, b]
Round 2: Most frequent pair is (aa, b), count=2
Merge: (aa, b) -> aab (token 4)
Corpus becomes: [aab, aa, ' ', aab]
Round 3: Most frequent pair is (aab, aa), count=1
(tie with others; pick first)
...
After training, the vocabulary contains bytes plus all discovered merges, and the merge table records the order. The rank of a merge (its position in the table) determines priority during encoding.
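The five steps above can be sketched in plain Rust. This is a toy illustration, not wordchipper's trainer: there is no pre-tokenization, the corpus is a single token sequence, and ties are broken by lowest pair value rather than first occurrence.

```rust
use std::collections::HashMap;

/// Learn `num_merges` BPE merges over one token sequence, in place.
/// Returns the merge rules ((a, b), new_token) in the order learned.
fn train_bpe(corpus: &mut Vec<u32>, num_merges: u32) -> Vec<((u32, u32), u32)> {
    let mut merges = Vec::new();
    let mut next_token = 256u32; // bytes 0-255 are the base vocabulary
    for _ in 0..num_merges {
        // Step 2: count every adjacent pair.
        let mut counts: HashMap<(u32, u32), u32> = HashMap::new();
        for w in corpus.windows(2) {
            *counts.entry((w[0], w[1])).or_insert(0) += 1;
        }
        // Most frequent pair; ties broken by lowest pair, for determinism.
        let Some((&pair, _)) = counts
            .iter()
            .max_by_key(|(&p, &c)| (c, std::cmp::Reverse(p)))
        else { break };
        // Step 3: record the merge rule with a fresh token ID.
        merges.push((pair, next_token));
        // Step 4: replace all occurrences of the pair, left to right.
        let mut out = Vec::with_capacity(corpus.len());
        let mut i = 0;
        while i < corpus.len() {
            if i + 1 < corpus.len() && (corpus[i], corpus[i + 1]) == pair {
                out.push(next_token);
                i += 2;
            } else {
                out.push(corpus[i]);
                i += 1;
            }
        }
        *corpus = out;
        next_token += 1;
    }
    merges
}

fn main() {
    // The "aabaa aab" example from above, as byte tokens (a=97, b=98).
    let mut corpus: Vec<u32> = "aabaa aab".bytes().map(u32::from).collect();
    let merges = train_bpe(&mut corpus, 2);
    assert_eq!(merges, vec![((97, 97), 256), ((256, 98), 257)]); // (a,a)->aa, (aa,b)->aab
    assert_eq!(corpus, vec![257, 256, 32, 257]); // [aab, aa, ' ', aab]
}
```

Running it on the "aabaa aab" corpus reproduces the two merge rounds traced above, just with byte token IDs instead of the toy numbering.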
Encoding: applying the merges
Given a trained vocabulary, encoding a string works as follows:
Step 1. Convert the input to a sequence of byte tokens.
Step 2. Find all adjacent pairs. Look up each pair in the merge table.
Step 3. Merge the pair with the lowest rank (highest priority). This is the pair that was learned earliest during training, meaning it's the most frequent.
Step 4. Repeat from Step 2 until no more merges are possible.
Example: encoding "aab"
Using the vocabulary from above:
Start: [a, a, b] (3 byte tokens)
Pairs: (a,a)=rank 0 (a,b)=not in table
Merge: [aa, b] (merge (a,a) at rank 0)
Pairs: (aa,b)=rank 1
Merge: [aab] (merge (aa,b) at rank 1)
No pairs: done.
Result: [4] (token 4 = "aab")
The key insight: BPE encoding is deterministic. Given the same vocabulary and merge table, the same input always produces the same tokens. This is why wordchipper can be validated against tiktoken and produce identical output.
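A naive encoder that follows these four steps looks like this. It is an O(n²) sketch for clarity, not one of wordchipper's optimized algorithms; the hypothetical merges table maps each pair to its (rank, merged_token).

```rust
use std::collections::HashMap;

/// Encode one span by repeatedly applying the lowest-ranked merge.
/// Lower rank = learned earlier during training = higher priority.
fn bpe_encode(span: &str, merges: &HashMap<(u32, u32), (u32, u32)>) -> Vec<u32> {
    // Step 1: start from raw bytes.
    let mut tokens: Vec<u32> = span.bytes().map(u32::from).collect();
    loop {
        // Steps 2-3: find the adjacent pair with the lowest rank
        // (leftmost occurrence wins among equal ranks).
        let best = tokens
            .windows(2)
            .enumerate()
            .filter_map(|(i, w)| {
                merges.get(&(w[0], w[1])).map(|&(rank, tok)| (rank, i, tok))
            })
            .min();
        // Step 4: stop when no adjacent pair is in the merge table.
        let Some((_rank, i, tok)) = best else { break };
        tokens[i] = tok;
        tokens.remove(i + 1);
    }
    tokens
}

fn main() {
    // The toy vocabulary from above, in byte IDs:
    // rank 0 merges (a,a) -> 256 ("aa"), rank 1 merges (256,b) -> 257 ("aab").
    let merges = HashMap::from([((97, 97), (0, 256)), ((256, 98), (1, 257))]);
    assert_eq!(bpe_encode("aab", &merges), vec![257]);
    assert_eq!(bpe_encode("aaab", &merges), vec![256, 97, 98]);
}
```

The second assertion shows why rank order matters: in "aaab" the leftmost (a,a) merges first, leaving an "a" and "b" that no remaining rule covers.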
Pre-tokenization: splitting before BPE
Running BPE directly on raw text would be wasteful. The string "hello world" would start as
individual bytes and need many merge steps. Worse, the space between words could merge with the
surrounding characters, creating tokens that span word boundaries (like "o w"), which hurts the
model's ability to generalize.
The solution is pre-tokenization (also called spanning in wordchipper). Before BPE runs, the text is split into coarse segments using a regex pattern. Each segment is then encoded independently.
OpenAI's cl100k_base uses this regex pattern:
(?i:'s|'t|'re|'ve|'m|'ll|'d)
|[^\r\n\p{L}\p{N}]?\p{L}+
|\p{N}{1,3}
| ?[^\s\p{L}\p{N}]+
|\s+(?!\S)
|\s+
This splits text into:
- Contractions: 's, 't, 're, etc.
- Words (letters), optionally preceded by a non-letter character
- Numbers, up to 3 digits at a time
- Punctuation sequences, optionally preceded by a space
- Whitespace (with a special rule for trailing space)
For example, "Hello, world! 123" becomes:
["Hello", ",", " world", "!", " 123"]
Each of these spans is then BPE-encoded independently. This keeps word boundaries clean and makes encoding much faster (each span is short).
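To make the idea concrete without a regex engine, here is a drastically simplified spanner. The toy_spans function is hypothetical and mine, not wordchipper's: each span is an optional leading space plus a run of one character class (letters, digits capped at 3, or punctuation). It reproduces the example above but ignores contractions, newlines, and the trailing-space rule.

```rust
/// Split text into spans of one character class, attaching a single
/// leading space to the span that follows it (toy sketch only).
fn toy_spans(text: &str) -> Vec<String> {
    #[derive(PartialEq, Clone, Copy)]
    enum Class { Letter, Digit, Punct }
    fn class(c: char) -> Class {
        if c.is_alphabetic() { Class::Letter }
        else if c.is_ascii_digit() { Class::Digit }
        else { Class::Punct }
    }
    let chars: Vec<char> = text.chars().collect();
    let mut out = Vec::new();
    let mut i = 0;
    while i < chars.len() {
        let mut span = String::new();
        if chars[i] == ' ' {
            span.push(' '); // the space attaches to what follows
            i += 1;
        }
        if i == chars.len() {
            out.push(span);
            break;
        }
        let cls = class(chars[i]);
        let mut digits = 0;
        while i < chars.len() && chars[i] != ' ' && class(chars[i]) == cls {
            if cls == Class::Digit {
                if digits == 3 { break; } // numbers: at most 3 digits per span
                digits += 1;
            }
            span.push(chars[i]);
            i += 1;
        }
        out.push(span);
    }
    out
}

fn main() {
    assert_eq!(
        toy_spans("Hello, world! 123"),
        vec!["Hello", ",", " world", "!", " 123"]
    );
}
```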
The whitespace rule
The pattern \s+(?!\S) is worth understanding. It matches whitespace greedily, then uses a negative
lookahead to back off one character. So "   world" (3 spaces + "world") becomes:
["  ", " world"]
The last space "leaks" into the next word. This is an intentional design choice by OpenAI: it means the model sees leading-space tokens as word starts, improving prediction.
This lookahead is straightforward in a regex engine but impossible in a DFA (deterministic finite automaton), which is why wordchipper's logos-based lexers need a post-processing step. See Building Custom Logos Lexers for details.
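The backoff behavior itself is simple to state procedurally. This std-only sketch (my own split_whitespace_run helper, not a wordchipper API) consumes a whitespace run and leaves the last whitespace character for the word that follows:

```rust
/// Mimic \s+(?!\S): match a whitespace run greedily, but when a
/// non-whitespace character follows, back off one character so the
/// next word keeps its leading space.
/// Returns (whitespace_span, rest_of_text).
fn split_whitespace_run(text: &str) -> (&str, &str) {
    // End of the leading whitespace run.
    let run_end = text
        .char_indices()
        .find(|(_, c)| !c.is_whitespace())
        .map(|(i, _)| i)
        .unwrap_or(text.len());
    if run_end > 0 && run_end < text.len() {
        // A word follows: back off one character.
        let last_ws = text[..run_end].char_indices().last().unwrap().0;
        (&text[..last_ws], &text[last_ws..])
    } else {
        // Trailing whitespace (or none): keep the run whole.
        (&text[..run_end], &text[run_end..])
    }
}

fn main() {
    // "   world": three spaces, then a word. One space leaks forward.
    assert_eq!(split_whitespace_run("   world"), ("  ", " world"));
    // Trailing whitespace with nothing after it stays together.
    assert_eq!(split_whitespace_run("   "), ("   ", ""));
}
```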
The wordchipper pipeline
Putting it all together, here's what happens when you call try_encode("Hello, world!"):
Input: "Hello, world!"
|
v
┌─────────────────────┐
│ 1. Pre-tokenize │ (regex or DFA lexer)
│ Split into spans │
└──────────┬──────────┘
|
["Hello", ",", " world", "!"]
|
v
┌─────────────────────┐
│ 2. BPE encode │ (per-span, possibly parallel)
│ Merge byte pairs │
└──────────┬──────────┘
|
[9906, 11, 1917, 0]
|
v
Token IDs
In wordchipper's architecture:
- Step 1 is handled by a TextSpanner (the spanners module). The default uses regex. For known patterns (cl100k, o200k, r50k), a compile-time DFA lexer is used automatically for 30-50x faster spanning.
- Step 2 is handled by a SpanEncoder (the encoders module). Multiple BPE algorithms are available, optimized for different use cases.
The Tokenizer type combines both steps behind a single try_encode / try_decode interface.
Decoding: tokens back to text
Decoding is simpler than encoding. Each token ID maps to a byte sequence in the vocabulary. The decoder concatenates all byte sequences and interprets the result as UTF-8:
[9906, 11, 1917, 0]
| | | |
v v v v
"Hello" "," " world" "!"
|
v
"Hello, world!"
No merging or splitting needed. Decoding is always O(n) in the number of tokens.
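A sketch of a decoder, assuming a plain map from token ID to bytes (the vocabulary entries here are hypothetical; real IDs come from the model's vocab file):

```rust
use std::collections::HashMap;

/// Decode by concatenating each token's byte sequence, then parsing UTF-8.
fn decode(tokens: &[u32], vocab: &HashMap<u32, Vec<u8>>) -> Option<String> {
    let mut bytes = Vec::new();
    for t in tokens {
        bytes.extend_from_slice(vocab.get(t)?); // unknown ID -> None
    }
    String::from_utf8(bytes).ok() // invalid UTF-8 at a boundary -> None
}

fn main() {
    // A made-up three-entry vocabulary for illustration.
    let vocab = HashMap::from([
        (0, b"Hello".to_vec()),
        (1, b",".to_vec()),
        (2, b" world".to_vec()),
    ]);
    assert_eq!(decode(&[0, 1, 2], &vocab), Some("Hello, world".to_string()));
}
```

Note that tokens carry raw bytes, not characters: a multi-byte UTF-8 character can be split across two tokens, which is why validation happens only after concatenation.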
Other tokenization approaches
BPE is not the only subword algorithm. Two others are worth knowing:
WordPiece (BERT)
WordPiece is similar to BPE but uses a different training objective. Instead of merging the most
frequent pair, it merges the pair that maximizes the likelihood of the training data. Subword pieces
after the first are prefixed with ## (e.g., "token" + "##ization").
Used by: BERT, DistilBERT, ELECTRA.
SentencePiece / Unigram (Llama, Gemini)
SentencePiece treats the input as a raw byte stream and replaces spaces with a special \u2581
character before tokenization. The Unigram variant starts with a large vocabulary and prunes it down
based on a unigram language model.
Used by: Llama, Gemini, T5, ALBERT.
wordchipper focuses on BPE tokenizers with OpenAI-style regex pre-tokenization. Tokenizers that use SentencePiece or WordPiece have fundamentally different pre-tokenization and don't need the spanning machinery described in this book.
Further reading
- Sennrich et al. (2016), "Neural Machine Translation of Rare Words with Subword Units" (the paper that adapted BPE for NLP)
- OpenAI's tiktoken repository for the reference regex patterns
- The Performance chapter for how wordchipper optimizes BPE encoding
Pretrained Models
wordchipper ships with loaders for OpenAI's public BPE vocabularies. Each model defines three things: a regex pattern for pre-tokenization, a merge table (the vocabulary), and a set of special tokens.
Model overview
| Model | Vocab tokens | Pattern | Special tokens | Used by |
|---|---|---|---|---|
| r50k_base | ~50k | GPT-2 pattern | endoftext | GPT-2 |
| p50k_base | ~50k | GPT-2 pattern | endoftext | Codex |
| p50k_edit | ~50k | GPT-2 pattern | endoftext, fim_prefix/middle/suffix | Codex (edit) |
| cl100k_base | ~100k | cl100k pattern | endoftext, fim_prefix/middle/suffix, endofprompt | GPT-3.5, GPT-4 |
| o200k_base | ~200k | o200k pattern | endoftext, endofprompt | GPT-4o |
| o200k_harmony | ~200k | o200k pattern | endoftext, endofprompt, startoftext, + many reserved | GPT-4o (harmony) |
What changed between models
r50k / p50k (GPT-2 era). The simplest pattern. Contractions ('s, 't, etc.) are matched
literally. Words and numbers are preceded by an optional space. Case-sensitive.
cl100k (GPT-3.5 / GPT-4). Case-insensitive contractions. Words can be preceded by a non-letter, non-newline character (allowing punctuation to attach). Numbers limited to 3 digits at a time. Newlines handled explicitly.
o200k (GPT-4o). Much more sophisticated word patterns. Recognizes casing transitions (CamelCase
splits). Uses Unicode general categories (\p{Lu}, \p{Ll}, \p{Lt}, \p{Lm}, \p{Lo}, \p{M})
for precise letter classification. Contractions are appended to words rather than matched
separately.
Shared vocabularies
Some models share the same vocabulary file but differ in special tokens:
- p50k_edit uses the same vocab as p50k_base but adds FIM (fill-in-middle) tokens.
- o200k_harmony uses the same vocab as o200k_base but adds many reserved and named special tokens for structured generation.
Loading a vocabulary
The standard way to load a model:
#![allow(unused)]
fn main() {
    use wordchipper::{load_vocab, disk_cache::WordchipperDiskCache};

    let mut cache = WordchipperDiskCache::default();
    let (desc, vocab) = load_vocab("openai:cl100k_base", &mut cache).unwrap();
}
load_vocab returns a (VocabDescription, Arc<UnifiedTokenVocab<u32>>). The description contains
metadata; the vocab contains everything needed for encoding and decoding.
Short names
The openai:: prefix is optional. Both "cl100k_base" and "openai:cl100k_base" work. Use
list_vocabs() for all registered short names and list_models() for fully qualified names.
Loading from a file path
If you have a .tiktoken file on disk, you can skip the download:
#![allow(unused)]
fn main() {
    use wordchipper::pretrained::openai::OATokenizer;

    let vocab = OATokenizer::Cl100kBase
        .load_path::<u32>("/path/to/cl100k_base.tiktoken")
        .unwrap();
}
Loading from a reader
For maximum flexibility (e.g., loading from an in-memory buffer or a network stream):
#![allow(unused)]
fn main() {
    use std::io::BufReader;
    use wordchipper::pretrained::openai::OATokenizer;

    let data: &[u8] = b"..."; // tiktoken base64 format
    let reader = BufReader::new(data);
    let vocab = OATokenizer::O200kBase.read_vocab::<u32, _>(reader).unwrap();
}
Special tokens
Special tokens are strings with reserved token IDs that are never produced by BPE encoding. They're used for control flow: marking end-of-text, fill-in-middle boundaries, prompt boundaries, etc.
#![allow(unused)]
fn main() {
    use wordchipper::{
        load_vocab, disk_cache::WordchipperDiskCache,
        TokenizerOptions, TokenEncoder,
    };

    let mut cache = WordchipperDiskCache::default();
    let (_, vocab) = load_vocab("openai:cl100k_base", &mut cache).unwrap();
    let tok = TokenizerOptions::default().build(vocab);

    let specials = tok.special_vocab();
    for (bytes, &id) in specials.span_map().iter() {
        let name = String::from_utf8_lossy(bytes);
        println!("{} -> {}", name, id);
    }
}
For cl100k_base, this prints:
<|endoftext|> -> 100257
<|fim_prefix|> -> 100258
<|fim_middle|> -> 100259
<|fim_suffix|> -> 100260
<|endofprompt|> -> 100276
The OATokenizer enum
For programmatic access to all OpenAI models, use the OATokenizer enum:
#![allow(unused)]
fn main() {
    use wordchipper::pretrained::openai::OATokenizer;

    // Iterate over all models
    #[cfg(feature = "std")]
    {
        use strum::IntoEnumIterator;
        for model in OATokenizer::iter() {
            println!("{}", model);
        }
    }
}
Each variant provides:
- pattern() - the regex pattern for pre-tokenization
- special_tokens::<T>() - the special token list
- load_vocab::<T>(loader) - load the vocabulary with download support
- load_path::<T>(path) - load from a local file
The tiktoken format
OpenAI's vocabulary files use a simple base64 format. Each line contains a base64-encoded byte sequence and its integer token ID, separated by a space:
IQ== 0
Ig== 1
Iw== 2
...
The base64 decodes to the raw bytes that the token represents. This format is what
WordchipperDiskCache downloads and caches from OpenAI's CDN.
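Parsing the format is a one-liner per line: split on the space, decode the base64, parse the ID. The sketch below hand-rolls a minimal base64 decoder so it stays dependency-free; a real loader should use a vetted base64 implementation, and this parser skips padding validation entirely.

```rust
/// Decode standard base64 (minimal sketch: no error handling, no
/// validation of padding or alphabet).
fn b64_decode(s: &str) -> Vec<u8> {
    const TABLE: &[u8; 64] =
        b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    let vals: Vec<u32> = s
        .bytes()
        .filter(|&c| c != b'=')
        .map(|c| TABLE.iter().position(|&t| t == c).unwrap() as u32)
        .collect();
    let mut out = Vec::new();
    for chunk in vals.chunks(4) {
        // Pack up to four 6-bit values into a 24-bit buffer...
        let mut buf = 0u32;
        for (i, &v) in chunk.iter().enumerate() {
            buf |= v << (18 - 6 * i);
        }
        // ...then emit the complete bytes it contains.
        for i in 0..(chunk.len() * 6 / 8) {
            out.push((buf >> (16 - 8 * i)) as u8);
        }
    }
    out
}

/// Parse tiktoken's line format: `base64(bytes) <space> token_id`.
fn parse_tiktoken(data: &str) -> Vec<(Vec<u8>, u32)> {
    data.lines()
        .filter(|l| !l.is_empty())
        .map(|line| {
            let (b64, id) = line.split_once(' ').expect("malformed line");
            (b64_decode(b64), id.parse().expect("bad token id"))
        })
        .collect()
}

fn main() {
    let entries = parse_tiktoken("IQ== 0\nIg== 1\nIw== 2\n");
    assert_eq!(entries[0], (b"!".to_vec(), 0)); // IQ== decodes to "!"
    assert_eq!(entries[2], (b"#".to_vec(), 2));
}
```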
Choosing a model
If you're building a tool that interacts with an OpenAI model, use the matching tokenizer:
| If you use... | Load... |
|---|---|
| GPT-4o, GPT-4o-mini | o200k_base |
| GPT-4, GPT-3.5-turbo | cl100k_base |
| GPT-3 (text-davinci-003) | p50k_base |
| GPT-2 | r50k_base |
If you're building your own model or just need token counting, o200k_base has the largest
vocabulary and handles the widest range of languages efficiently.
Training Your Own Tokenizer
This chapter walks through training a BPE tokenizer from scratch. If you haven't read How Tokenizers Work yet, start there. This chapter assumes you know what BPE merges are and why pre-tokenization matters.
Why train a custom tokenizer?
Pretrained tokenizers like cl100k_base and o200k_base are trained on broad internet text. They
work well for general English, but they have trade-offs:
- Domain-specific text. If your corpus is primarily code, legal documents, or scientific notation, a general-purpose tokenizer wastes vocabulary slots on tokens your data rarely uses. A tokenizer trained on your data will encode it in fewer tokens.
- Non-English languages. General tokenizers are biased toward English. The same sentence in Hindi or Korean may produce 3-5x more tokens. Training on your target language fixes this.
- Smaller vocabularies. Pretrained models use 50k-200k tokens. If you're building a small model or running on constrained hardware, a 4k or 8k vocabulary trained on your data can be more efficient than a 100k vocabulary trained on someone else's.
- Research. You want to study how vocabulary size, training data, or regex pattern affect downstream model performance.
The short version: train your own when the pretrained vocabulary doesn't fit your data well.
Quick start: the CLI
The fastest way to train a tokenizer is with wchipper. No Rust code needed.
Training on text files
Create a text file with your training data (one document per line, or free-form text):
cargo run --release -p wchipper -- \
train \
--vocab-size=8000 \
--output=my_tokenizer.tiktoken \
corpus.txt
This reads corpus.txt, splits it using the default regex pattern (r50k_base's pattern), runs BPE
training to learn 8000 - 256 = 7744 merges, and writes the vocabulary to my_tokenizer.tiktoken in
base64-encoded tiktoken format.
Training on Parquet files
For larger datasets, Parquet is more efficient. The file must have a column named "text":
cargo run --release -p wchipper -- \
train \
--input-format=parquet \
--vocab-size=50281 \
--output=nanochat.tiktoken \
~/Data/nanochat/dataset/*.parquet
Parquet files are read in streaming record batches, so memory usage stays bounded even for multi-gigabyte datasets.
What the output looks like
Training logs progress at 1% intervals:
INFO Reading shards:
INFO 0: shard_00000.parquet
INFO 1: shard_00001.parquet
INFO Training Tokenizer...
INFO Starting BPE training: 50025 merges to compute
INFO Building pair index...
INFO Building heap with 16044 unique pairs
INFO Starting merge loop
INFO Progress: 1% (501/50025 merges) - Last merge: (69, 120) -> 756 (frequency: 166814)
INFO Progress: 2% (1001/50025 merges) - Last merge: (66, 77) -> 1256 (frequency: 17847)
...
INFO Progress: 100% (50025/50025 merges) - Last merge: (302, 3405) -> 50280 (frequency: 18)
INFO Finished training: 50025 merges completed
INFO Vocabulary Size: 50280
The early merges have high frequencies (166k+). These are the most common byte pairs in your corpus,
things like (space, t) or (e, space). By the end, merges are rare (frequency 18), scraping the
bottom of what's statistically useful.
The training pipeline
Training has three phases. Understanding them helps you make better choices about regex pattern, vocabulary size, and data preparation.
Phase 1: counting spans
Before any merges happen, the trainer splits every input document using the regex pattern, then counts how many times each unique span appears. This is the same pre-tokenization step that happens during encoding, but here it builds a frequency table.
Input: "the cat sat on the mat"
Regex splits: ["the", " cat", " sat", " on", " the", " mat"]
Span counts:
"the" -> 1
" the" -> 1
" cat" -> 1
" sat" -> 1
" on" -> 1
" mat" -> 1
With more data, common spans like " the" accumulate large counts. The trainer calls
update_from_samples() incrementally, so you can feed it data in batches without loading everything
into memory.
Internally, this is handled by TextSpanCounter, which stores span strings as CompactString keys
(stack-allocated for short strings, heap-allocated when longer) with u32 counts.
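In miniature, Phase 1 is just a frequency table keyed by span. The sketch below uses a plain HashMap<String, u32> where wordchipper uses CompactString, and a hypothetical toy_spans splitter (keep each word with its leading space) stands in for the regex pre-tokenizer:

```rust
use std::collections::HashMap;

/// Stand-in pre-tokenizer: start a new span at every space, so each
/// word after the first carries its leading space (" cat", " sat", ...).
fn toy_spans(text: &str) -> Vec<&str> {
    let mut out = Vec::new();
    let mut start = 0;
    for (i, b) in text.bytes().enumerate() {
        if b == b' ' && i != start {
            out.push(&text[start..i]);
            start = i;
        }
    }
    if start < text.len() {
        out.push(&text[start..]);
    }
    out
}

fn main() {
    // Accumulate span counts across samples, as update_from_samples does.
    let mut counts: HashMap<String, u32> = HashMap::new();
    for sample in ["the cat sat on the mat", "the dog sat on the log"] {
        for span in toy_spans(sample) {
            *counts.entry(span.to_string()).or_insert(0) += 1;
        }
    }
    assert_eq!(counts["the"], 2);  // sentence-initial "the", once per sample
    assert_eq!(counts[" the"], 2); // mid-sentence " the", once per sample
    assert_eq!(counts[" sat"], 2);
}
```

Because only counts accumulate, feeding ten small batches is equivalent to feeding one large one, which is what makes streaming Parquet input possible.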
Phase 2: building the pair index
Once all data is counted, the trainer converts each unique span into a TokenSpanBuf: a buffer of
byte-level token IDs. Every span starts as its raw UTF-8 bytes, mapped through a ByteMapVocab (the
initial 256-entry vocabulary where token 0 = byte 0, token 1 = byte 1, etc.).
Then it scans every span to build two structures:
- Pair counts: for every adjacent pair
(a, b), the total count across all spans (weighted by span frequency). If" the"appears 50,000 times and contains the pair(116, 104)(t,h), that pair gets +50,000. - Pair index: for every pair, which span indices contain it. This avoids scanning all spans when a merge only affects a few.
These go into a PairSpanIndex, then seed an octonary heap (an 8-ary heap, faster than a binary
heap for this workload) sorted by count descending.
Phase 3: the merge loop
This is the core of BPE training. It runs until the vocabulary reaches the target size:
while merges_done < num_merges:
1. Pop the highest-count pair from the heap.
2. If the count is stale (changed since it was pushed), refresh and re-push.
3. Allocate a new token ID for this merge.
4. For every span containing this pair:
a. Replace all occurrences of (a, b) with the new token.
b. Update pair counts: decrement removed pairs, increment new pairs.
c. Track which spans now contain newly created pairs.
5. Push new pairs onto the heap.
The "lazy refresh" in step 2 is important. When pair (a, b) is merged, it changes the neighbors of
a and b in every span where they appeared. This can reduce the count of other pairs in the heap
that haven't been popped yet. Rather than updating every affected heap entry (expensive), the
trainer just checks the current count when a pair is popped. If it's stale, it's re-pushed with the
correct count and the loop continues. This is a standard trick in BPE implementations.
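The lazy-refresh pop can be sketched with std's BinaryHeap standing in for the octonary heap (the pop_best helper is illustrative, not wordchipper's API):

```rust
use std::collections::{BinaryHeap, HashMap};

/// Pop the pair with the highest *current* count.
/// Heap entries are (count_at_push_time, pair); `counts` holds the truth.
fn pop_best(
    heap: &mut BinaryHeap<(u32, (u32, u32))>,
    counts: &HashMap<(u32, u32), u32>,
) -> Option<(u32, u32)> {
    while let Some((pushed_count, pair)) = heap.pop() {
        let current = counts.get(&pair).copied().unwrap_or(0);
        if current == pushed_count {
            return Some(pair); // fresh entry: this really is the best pair
        }
        if current > 0 {
            heap.push((current, pair)); // stale: re-push with the true count
        }
        // count fell to zero: the pair is simply dropped
    }
    None
}

fn main() {
    // (a,b) was pushed with count 10, but merges since then cut it to 3.
    let mut heap = BinaryHeap::from(vec![(10, (97, 98)), (5, (98, 99))]);
    let counts = HashMap::from([((97, 98), 3), ((98, 99), 5)]);
    // The stale (a,b) entry is re-pushed at 3, so (b,c) at 5 wins...
    assert_eq!(pop_best(&mut heap, &counts), Some((98, 99)));
    // ...and (a,b) surfaces on the next pop with its corrected count.
    assert_eq!(pop_best(&mut heap, &counts), Some((97, 98)));
}
```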
A concrete merge example
Given the span "hello" (tokens: [104, 101, 108, 108, 111]) and the merge (108, 108) -> 256
(merging l, l):
Before: [104, 101, 108, 108, 111]
h e l l o
Pairs removed: (101, 108), (108, 108), (108, 111)
Pairs added: (101, 256), (256, 111)
After: [104, 101, 256, 111]
h e ll o
The TokenSpanBuf::merge_pair_cb method handles this in a single pass through the span, reporting
pair deltas through a callback so the trainer can update global counts incrementally.
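The single-pass merge with delta reporting can be sketched like this (the callback shape is illustrative; merge_pair_cb's real signature may differ):

```rust
/// Replace every occurrence of `pair` in `span` with `new_token`,
/// reporting pair-count deltas through `on_delta` as neighbors change.
fn merge_pair(
    span: &mut Vec<u32>,
    pair: (u32, u32),
    new_token: u32,
    mut on_delta: impl FnMut((u32, u32), i64),
) {
    let mut out: Vec<u32> = Vec::with_capacity(span.len());
    let mut i = 0;
    while i < span.len() {
        if i + 1 < span.len() && (span[i], span[i + 1]) == pair {
            // The left neighbor's pair changes.
            if let Some(&prev) = out.last() {
                on_delta((prev, pair.0), -1);
                on_delta((prev, new_token), 1);
            }
            // The right neighbor's pair changes.
            if let Some(&next) = span.get(i + 2) {
                on_delta((pair.1, next), -1);
                on_delta((new_token, next), 1);
            }
            // The merged pair itself disappears.
            on_delta(pair, -1);
            out.push(new_token);
            i += 2;
        } else {
            out.push(span[i]);
            i += 1;
        }
    }
    *span = out;
}

fn main() {
    // "hello" with the merge (l,l) -> 256, as in the example above.
    let mut span = vec![104, 101, 108, 108, 111];
    let mut deltas: Vec<((u32, u32), i64)> = Vec::new();
    merge_pair(&mut span, (108, 108), 256, |p, d| deltas.push((p, d)));
    assert_eq!(span, vec![104, 101, 256, 111]);
    // Removed: (e,l), (l,o), (l,l). Added: (e,ll), (ll,o).
    assert!(deltas.contains(&((101, 108), -1)));
    assert!(deltas.contains(&((108, 111), -1)));
    assert!(deltas.contains(&((108, 108), -1)));
    assert!(deltas.contains(&((101, 256), 1)));
    assert!(deltas.contains(&((256, 111), 1)));
}
```

The trainer applies these deltas to the global pair counts and pushes any newly created pairs onto the heap, closing the loop described in steps 4-5.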
Training via the Rust API
For more control, use wordchipper-training directly. Add it to your Cargo.toml:
[dependencies]
wordchipper = "0.8"
wordchipper-training = "0.8"
Minimal example
use std::sync::Arc;
use wordchipper::{
    TokenEncoder, TokenDecoder, Tokenizer, TokenizerOptions, UnifiedTokenVocab,
    pretrained::openai::OA_CL100K_BASE_PATTERN,
    vocab::ByteMapVocab,
};
use wordchipper_training::BPETRainerOptions;

fn main() {
    // 1. Configure the trainer
    let options = BPETRainerOptions::new(
        OA_CL100K_BASE_PATTERN, // regex pattern for pre-tokenization
        1000,                   // target vocabulary size
    );
    let mut trainer = options.init();

    // 2. Feed it training data
    let samples = vec![
        "the cat sat on the mat",
        "the dog sat on the log",
        "the cat and the dog",
    ];
    trainer.update_from_samples(samples.iter());

    // 3. Train
    let byte_vocab: ByteMapVocab<u32> = Default::default();
    let vocab: Arc<UnifiedTokenVocab<u32>> = trainer
        .train(byte_vocab)
        .expect("training failed")
        .into();

    // 4. Use the tokenizer
    let tokenizer = TokenizerOptions::default().build(vocab);
    let tokens = tokenizer.try_encode("the cat").unwrap();
    let decoded = tokenizer.try_decode_to_string(&tokens).unwrap().unwrap();
    assert_eq!(decoded, "the cat");
    println!("'the cat' -> {:?}", tokens);
}
Feeding data in batches
For large datasets, call update_from_samples multiple times. The trainer accumulates span counts
across calls:
#![allow(unused)]
fn main() {
    use wordchipper::pretrained::openai::OA_CL100K_BASE_PATTERN;
    use wordchipper_training::BPETRainerOptions;

    let mut trainer = BPETRainerOptions::new(OA_CL100K_BASE_PATTERN, 1000).init();

    // First batch
    let batch1 = vec!["hello world", "hello there"];
    trainer.update_from_samples(batch1.iter());

    // Second batch
    let batch2 = vec!["world peace", "hello again"];
    trainer.update_from_samples(batch2.iter());

    // Counts from both batches are combined before training
}
This is how the CLI processes Parquet files: each record batch is fed as a separate call to
update_from_samples.
Saving the vocabulary
The trained vocabulary can be saved in tiktoken's base64 format:
use std::sync::Arc;
use wordchipper::{
    UnifiedTokenVocab, VocabIndex,
    pretrained::openai::OA_CL100K_BASE_PATTERN,
    vocab::{ByteMapVocab, io::save_base64_span_map_path},
};
use wordchipper_training::BPETRainerOptions;

fn main() {
    let mut trainer = BPETRainerOptions::new(OA_CL100K_BASE_PATTERN, 1000).init();
    trainer.update_from_samples(vec!["hello world"].iter());

    let vocab: Arc<UnifiedTokenVocab<u32>> = trainer
        .train(ByteMapVocab::default())
        .expect("training failed")
        .into();

    // Save as .tiktoken file
    save_base64_span_map_path(
        &vocab.span_vocab().span_map(),
        "my_vocab.tiktoken",
    ).expect("failed to save");
}
This produces a file compatible with OpenAI's tiktoken format: one line per token, each line is
base64(byte_sequence) token_id.
Choosing a regex pattern
The regex pattern controls how text is split before BPE runs. This is one of the most consequential decisions in tokenizer design. The pattern determines what kinds of tokens can exist.
Available patterns
wordchipper provides the same patterns used by OpenAI's models as constants:
| Constant | Used by | Key behavior |
|---|---|---|
OA_R50K_BASE_PATTERN | GPT-2 | Basic word/number/punctuation splitting |
OA_CL100K_BASE_PATTERN | GPT-3.5, GPT-4 | Adds case-insensitive contractions |
OA_O200K_BASE_PATTERN | GPT-4o | Unicode-aware, broader letter categories |
All patterns live in wordchipper::pretrained::openai.
What the pattern affects
Consider the difference between splitting on `\w+` versus the cl100k pattern:

```text
Input:           "don't stop"
\w+ splits:      ["don", "t", "stop"]
cl100k pattern:  ["don", "'t", " stop"]
```

The cl100k pattern keeps the contraction `'t` together and attaches the leading space to `stop`.
These design choices propagate through the entire vocabulary: a tokenizer trained with `\w+` will
never learn a token for `'t` or ` stop` because the regex never produces those spans.
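To see the constraint in code, here is a toy stand-in for `\w+` splitting (ASCII-only and illustrative; the real patterns are full Unicode regexes):

```rust
// Toy \w+ pre-tokenizer: runs of ASCII alphanumerics/underscore.
fn word_splits(input: &str) -> Vec<&str> {
    input
        .split(|c: char| !(c.is_ascii_alphanumeric() || c == '_'))
        .filter(|s| !s.is_empty())
        .collect()
}

fn main() {
    // The apostrophe and spaces are discarded: no span ever contains "'t"
    // or " stop", so BPE trained on these spans can never learn those tokens.
    assert_eq!(word_splits("don't stop"), vec!["don", "t", "stop"]);
}
```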
Custom patterns
You can use any valid regex:
```rust
#![allow(unused)]
fn main() {
    use wordchipper_training::BPETRainerOptions;

    // Split only on whitespace boundaries
    let options = BPETRainerOptions::new(r"\S+|\s+", 4000);

    // Split on individual characters (character-level BPE)
    let options = BPETRainerOptions::new(r".", 1000);

    // Domain-specific: keep email addresses and URLs intact
    let options = BPETRainerOptions::new(
        r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}|https?://\S+|\w+|\s+|.",
        8000,
    );
}
```
As a starting point, use `OA_CL100K_BASE_PATTERN`. It's well-tested and handles contractions,
whitespace, and Unicode reasonably. Change it only when you have a specific reason.
Choosing a vocabulary size
The vocabulary size is the total number of tokens, including the 256 byte tokens. So
`--vocab-size=1000` means 256 byte tokens + 744 learned merges.
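The arithmetic is worth spelling out once:

```rust
// vocab_size = 256 byte tokens + learned merges.
fn learned_merges(vocab_size: u32) -> u32 {
    vocab_size - 256
}

fn main() {
    assert_eq!(learned_merges(1000), 744); // --vocab-size=1000
    assert_eq!(learned_merges(256), 0);    // pure byte-level, no merges
}
```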
Trade-offs
| Vocab size | Merges | Effect |
|---|---|---|
| 256 | 0 | Pure byte-level. Every character is 1-4 tokens. Long sequences. |
| 1,000 | 744 | Common bigrams and short words are single tokens. |
| 8,000 | 7,744 | Most common words are single tokens. Good for small models. |
| 50,000 | 49,744 | Covers most English words. Standard for GPT-2 era models. |
| 100,000 | 99,744 | Covers English + common multilingual tokens. GPT-4 range. |
| 200,000 | 199,744 | Broad multilingual coverage. GPT-4o range. |
Larger vocabularies mean shorter token sequences (faster inference, more text per context window) but larger embedding tables (more parameters, more memory). For a small model on a specific domain, 4k-16k is often a good range. For a general-purpose model, 50k-200k.
How merges tail off
Early merges capture high-frequency patterns like common letter pairs and short words. Late merges capture rare combinations. You can see this in the training output: the first merge might have frequency 166,000 while the last merge has frequency 18.
If your late merges have very low frequencies (single digits), you've likely exhausted the useful patterns in your data. A smaller vocabulary size would waste less of the embedding table on tokens that rarely appear.
Performance expectations
Training speed depends on two factors:
- Corpus size. Counting spans is O(n) in the input size. Expect roughly 1 second per 10 MB of input text for the counting phase.
- Vocabulary size. The merge loop runs one iteration per merge (vocab_size - 256 iterations). Each iteration touches only the spans affected by that merge, but the total cost grows with both vocabulary size and corpus diversity.
For reference, training a 50k vocabulary on the nanochat dataset (8 Parquet shards, roughly 80 MB of text) takes around 80 CPU-minutes on a single thread. The counting phase is fast; the merge loop dominates.
The trainer is single-threaded. If your data loading is slow (e.g., reading from disk or network), run it on a separate thread so the trainer isn't blocked on I/O. The CLI handles this for Parquet by streaming record batches.
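One standard way to decouple slow I/O from the single-threaded trainer is a producer/consumer channel. This sketch uses only `std`; the real `trainer.update_from_samples` call is replaced by a counter, since the wordchipper API isn't exercised here:

```rust
use std::sync::mpsc;
use std::thread;

// Drain record batches as they arrive. In real code the loop body would call
// trainer.update_from_samples(batch.iter()); here we just count samples.
fn drain(rx: mpsc::Receiver<Vec<String>>) -> usize {
    let mut total = 0;
    for batch in rx {
        total += batch.len();
    }
    total
}

fn main() {
    let (tx, rx) = mpsc::channel::<Vec<String>>();

    // Producer thread: stands in for slow I/O (disk or network reads).
    let producer = thread::spawn(move || {
        for i in 0..3 {
            tx.send(vec![format!("sample {i}")]).unwrap();
        }
        // tx is dropped here, which ends the consumer's loop.
    });

    let total = drain(rx);
    producer.join().unwrap();
    assert_eq!(total, 3);
}
```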
Putting it all together
Here's a complete workflow: train a tokenizer on your data, save it, and use it for encoding.
```rust
use std::sync::Arc;
use wordchipper::{
    TokenEncoder, TokenDecoder, Tokenizer, TokenizerOptions,
    UnifiedTokenVocab, VocabIndex,
    pretrained::openai::OA_CL100K_BASE_PATTERN,
    vocab::{ByteMapVocab, io::save_base64_span_map_path},
};
use wordchipper_training::BPETRainerOptions;

fn main() -> wordchipper::WCResult<()> {
    // --- Train ---
    let mut trainer = BPETRainerOptions::new(OA_CL100K_BASE_PATTERN, 2000).init();

    // In practice, feed much more data
    let corpus = vec![
        "The quick brown fox jumps over the lazy dog.",
        "Pack my box with five dozen liquor jugs.",
        "How vexingly quick daft zebras jump!",
    ];
    trainer.update_from_samples(corpus.iter());

    let vocab: Arc<UnifiedTokenVocab<u32>> = trainer
        .train(ByteMapVocab::default())
        .expect("training failed")
        .into();

    // --- Save ---
    save_base64_span_map_path(
        &vocab.span_vocab().span_map(),
        "/tmp/my_vocab.tiktoken",
    )?;

    // --- Use ---
    let tokenizer = TokenizerOptions::default().build(vocab);
    let tokens = tokenizer.try_encode("The quick brown fox")?;
    let decoded = tokenizer.try_decode_to_string(&tokens)?.unwrap();
    assert_eq!(decoded, "The quick brown fox");
    println!("Encoded {} tokens: {:?}", tokens.len(), tokens);
    Ok(())
}
```
What's next
Once you have a trained tokenizer, you might want to:
- Benchmark it against pretrained tokenizers. See Performance for how wordchipper measures throughput.
- Build a custom DFA lexer for your regex pattern. See Building Custom Logos Lexers for 30-50x faster pre-tokenization.
- Understand the BPE encoding algorithms available. See Advanced: Span Encoders for the different encoder strategies.
Performance
wordchipper is designed for throughput. This chapter covers the knobs you can turn: BPE algorithm selection, DFA-accelerated pre-tokenization, parallelism, and hash function choice.
The two bottlenecks
Tokenization has two phases, and each can be the bottleneck:
- Pre-tokenization (spanning). Splitting text into spans using a regex. This is scan-bound: the regex engine must visit every byte of the input.
- BPE encoding. Merging byte pairs within each span. This is compute-bound: each span requires iterative pair lookups and merges.
For short texts, BPE dominates. For long texts with many spans, pre-tokenization matters more. wordchipper optimizes both.
DFA-accelerated spanning (logos)
The biggest single optimization is replacing the regex engine with a compile-time DFA. wordchipper uses the logos crate to compile pre-tokenization patterns into deterministic finite automata at build time.
This is enabled by default. When you load a vocabulary whose regex pattern matches a known logos lexer (cl100k, o200k, r50k), the DFA lexer is used automatically.
Benchmarks
Spanning throughput on a single thread:
| Model | Regex | Logos DFA | Speedup |
|---|---|---|---|
| cl100k_base | ~25 MB/s | ~732 MB/s | 29x |
| o200k_base | ~15 MB/s | ~765 MB/s | 52x |
Disabling DFA acceleration
If you want to force regex-based spanning (e.g., for testing or debugging):
```rust
#![allow(unused)]
fn main() {
    use wordchipper::{TokenizerOptions, load_vocab, disk_cache::WordchipperDiskCache};

    let mut cache = WordchipperDiskCache::default();
    let (_, vocab) = load_vocab("openai:cl100k_base", &mut cache).unwrap();

    let tok = TokenizerOptions::default()
        .with_accelerated_lexers(false)
        .build(vocab);
}
```
BPE algorithm selection
wordchipper includes five span encoder implementations. Each trades off differently between single-threaded speed, concurrent access, memory usage, and implementation complexity. See Advanced: Span Encoders for a deep dive into how each works.
The algorithms
| Algorithm | Best for | Notes |
|---|---|---|
| `MergeHeap` | Concurrent / multi-threaded | Default for `ConcurrentDefault`. Heap-based merge tracking. |
| `PriorityMerge` | Single-threaded | Default for `SingleThreadDefault`. Priority-queue merging. |
| `BufferSweep` | Testing / reference | Simple and correct, not optimized. |
| `TailSweep` | Memory-constrained | Alternative scanning strategy. |
| `BpeBacktrack` | Exact BPE semantics | O(n) via Aho-Corasick automaton + backtracking. |
Selecting an algorithm
```rust
#![allow(unused)]
fn main() {
    use wordchipper::{
        TokenizerOptions, TokenEncoderOptions,
        encoders::token_span_encoder::span_encoders::SpanEncoderSelector,
        load_vocab, disk_cache::WordchipperDiskCache,
    };

    let mut cache = WordchipperDiskCache::default();
    let (_, vocab) = load_vocab("openai:cl100k_base", &mut cache).unwrap();

    let mut opts = TokenizerOptions::default();
    opts.encoder.set_span_encoder(SpanEncoderSelector::BpeBacktrack);
    let tok = opts.build(vocab);
}
```
Which algorithm should I use?
For most users, the defaults are correct:
- Multi-threaded workloads (web servers, batch processing): Use the default `ConcurrentDefault`, which selects `MergeHeap`.
- Single-threaded workloads (CLI tools, embedded): Use `SingleThreadDefault`, which selects `PriorityMerge`.
- Exact BPE needed: Use `BpeBacktrack`. It uses an Aho-Corasick automaton pre-built from the vocabulary for O(n) encoding that exactly matches the theoretical BPE algorithm. The other algorithms also produce correct output for standard vocabularies; `BpeBacktrack` is the theoretical gold standard.
Parallelism with rayon
When the `parallel` feature is enabled (it is by default), batch operations parallelize across
threads:
```rust
#![allow(unused)]
fn main() {
    use wordchipper::{TokenizerOptions, TokenEncoder, load_vocab, disk_cache::WordchipperDiskCache};

    let mut cache = WordchipperDiskCache::default();
    let (_, vocab) = load_vocab("openai:cl100k_base", &mut cache).unwrap();

    let tok = TokenizerOptions::default()
        .with_parallel(true)
        .build(vocab);

    let texts: Vec<&str> = vec!["hello"; 1000];
    let batch = tok.try_encode_batch(&texts).unwrap();
}
```
Setting `parallel(true)` affects both encoding and decoding. The `concurrent(true)` option
additionally selects a span encoder optimized for concurrent access from multiple threads.
Controlling thread count
wordchipper uses rayon's global thread pool. Control it with the `RAYON_NUM_THREADS` environment
variable:

```sh
RAYON_NUM_THREADS=8 cargo run --release
```

Or configure it programmatically via `rayon::ThreadPoolBuilder`.
Hash function selection
wordchipper uses hash maps extensively for vocabulary lookups. The `fast-hash` feature (enabled by
default) swaps in `foldhash` for faster hashing:
| Feature | Hash function | Notes |
|---|---|---|
| `fast-hash` | `foldhash` | Good general-purpose, works in no_std |
| (none) | default | SipHash (std) or hashbrown default |
Unlike previous versions, `fast-hash` works in `no_std` environments. See
Feature Flags for details.
End-to-end benchmarks
Encode + decode throughput on 90 MB shards, 48 threads:
| Model | wordchipper | tiktoken-rs | HuggingFace tokenizers |
|---|---|---|---|
| r50k_base | 239 MiB/s | 169 MiB/s | 22 MiB/s |
| p50k_base | 251 MiB/s | 163 MiB/s | 22 MiB/s |
| p50k_edit | 242 MiB/s | 170 MiB/s | 21 MiB/s |
| cl100k_base | 214 MiB/s | 125 MiB/s | 22 MiB/s |
| o200k_base | 119 MiB/s | 124 MiB/s | 22 MiB/s |
| o200k_harmony | 122 MiB/s | 122 MiB/s | 22 MiB/s |
Running benchmarks yourself
The `sample-timer` tool runs wordchipper vs. tiktoken-rs side-by-side:

```sh
RAYON_NUM_THREADS=48 cargo run --release -p sample-timer -- \
    --dataset-dir $DATASET_DIR --shards 0 --model openai::cl100k_base
```
For criterion-based microbenchmarks:

```sh
cargo bench -p wordchipper-bench
```
Performance tips
- Use the default features. DFA acceleration and rayon parallelism (`parallel`) are both on by default.
- Batch your inputs. `try_encode_batch` is significantly faster than calling `try_encode` in a loop because it amortizes thread pool overhead.
- Reuse the tokenizer. Building a `Tokenizer` pre-computes data structures. Build it once, share via `Arc`.
- Match the encoder to your workload. Use `ConcurrentDefault` for multi-threaded, `SingleThreadDefault` for single-threaded.
- Profile spanning vs. encoding. If spanning is the bottleneck, make sure DFA acceleration is active. If encoding is the bottleneck, experiment with `SpanEncoderSelector` variants.
Building Custom Logos Lexers
This chapter assumes familiarity with the two-phase pipeline (spanning + BPE encoding) described in How Tokenizers Work.
BPE tokenizers split text in two phases: first into words (pre-tokenization), then each word into
subword tokens. The first phase is typically a big regex. OpenAI's cl100k_base pattern, for
example, uses alternations with Unicode property classes and lookaheads to segment "hello world"
into ["hello", " world"].
Regex is correct but slow. Each match backtracks through Unicode property tables, and the engine runs single-threaded. The logos crate takes a different approach: it compiles regex patterns into a deterministic finite automaton (DFA) at build time via a derive macro. No backtracking, no runtime regex compilation. For wordchipper's cl100k and o200k patterns, this gives 30-50x speedups (700+ MB/s).
But a logos DFA can't express everything a regex can. The OpenAI patterns use `\s+(?!\S)`, a negative
lookahead that makes the match backtrack so the last whitespace character becomes a prefix of the
next word. Logos has no lookaheads. So we need a post-processing step that corrects the token stream
after the DFA runs.
This post-processing is extracted into composable, public building blocks. You supply the patterns. We handle the rest.
When to use this
The building blocks in this chapter are designed for tokenizers that use OpenAI-style regex
pre-tokenization with the `\s+(?!\S)` lookahead idiom. This includes:
- cl100k_base (GPT-4, GPT-3.5-turbo)
- o200k_base (GPT-4o)
- p50k_base / p50k_edit (GPT-3, Codex)
- r50k_base (GPT-2)
- Any custom tokenizer that copies the OpenAI regex structure
Tokenizers with fundamentally different pre-tokenization don't need this machinery:
- SentencePiece (Llama, Gemini) replaces spaces with `▁` before tokenization. No regex pre-tokenization.
- Byte-level BPE (GPT-NeoX, Falcon) uses HuggingFace's `ByteLevel` pre-tokenizer.
- BERT-style tokenizers split on whitespace and punctuation with simple rules, no lookaheads.
If your tokenizer's regex pattern doesn't use `\s+(?!\S)` or a similar whitespace-backtracking
idiom, you can still use logos for DFA speed, but you won't need `Gpt2FamilyTokenRole` or
`for_each_classified_span`. Just implement `SpanLexer` directly.
The whitespace problem, concretely
To understand why we need post-processing, consider what happens with the input `"hello   world"`
(three spaces).

The regex `\s+(?!\S)` matches whitespace greedily, then backtracks one character. So the three
spaces become `"  "` (two spaces as one token) + `" world"` (the last space merges into the next
word). This is how OpenAI's patterns work: whitespace "leaks" into the following word.

Logos has no lookaheads. Its DFA matches `"   "` as a single `Whitespace` token and `"world"` as a
separate `Letters` token. Without correction, you'd get different spans than the regex.
The post-processing engine fixes this. It buffers whitespace tokens and, when the next token arrives, decides how to split them. The rules depend on what kind of token follows:
- A letter token (`world`): split off the last whitespace character and merge it with the word. Result: `"  "` + `" world"`. Matches the regex.
- A punctuation token (`!`): if the last whitespace character is an ASCII space, merge it with the punctuation (matching the ` ?` prefix in patterns like ` ?[^\s\p{L}\p{N}]+`).
- A digit or newline: emit the whitespace as its own span. No merging.
You don't need to implement any of this. You just tell the engine what kind of token each logos variant represents, and it applies the correct rule.
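As a rough illustration of the letter-token rule (a simplified sketch, not wordchipper's implementation), splitting a buffered whitespace run before a word might look like:

```rust
// Split a buffered whitespace run so its last character prefixes the next
// word, mimicking the \s+(?!\S) backtracking behavior.
fn split_ws_before_word<'a>(ws: &'a str, word: &'a str) -> (Option<&'a str>, String) {
    match ws.char_indices().last() {
        Some((i, last)) if !word.is_empty() => {
            let head = &ws[..i];
            ((!head.is_empty()).then_some(head), format!("{last}{word}"))
        }
        _ => (None, word.to_string()),
    }
}

fn main() {
    let (head, merged) = split_ws_before_word("   ", "world");
    assert_eq!(head, Some("  ")); // two spaces remain a whitespace span
    assert_eq!(merged, " world"); // the last space merges into the word
}
```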
The building blocks
Three public items in wordchipper::spanners::span_lexers::logos::gpt2_family:
Gpt2FamilyTokenRole
An enum that classifies how each logos token interacts with preceding whitespace:
```rust
#![allow(unused)]
fn main() {
    pub enum Gpt2FamilyTokenRole {
        Whitespace,
        Punctuation,
        Word { check_contraction: bool, first_char_is_letter: bool },
        Standalone,
        Newline,
        Gap,
    }
}
```
Each variant tells the engine a different whitespace rule:
- `Whitespace`: this token is whitespace. The engine buffers it and decides later how to split it based on what comes next.
- `Punctuation`: absorbs a preceding ASCII space. The engine merges the last buffered space character into this token's span (matching the ` ?` regex prefix).
- `Word`: absorbs a preceding space if `first_char_is_letter` is true. If the token starts with a non-letter prefix (like `"hello` where `"` is the first char), set `first_char_is_letter` to false and the engine handles the prefix separately. The `check_contraction` field enables cl100k-style splitting where `'The` becomes `'T` + `he`.
- `Standalone`: never absorbs whitespace. Digits, explicit contractions. Any preceding whitespace becomes its own span.
- `Newline`: newline-containing whitespace (`\s*[\r\n]+`). Buffered separately; at end of string, adjacent Newline + Whitespace merge (matching the regex `\s++$` behavior).
- `Gap`: unrecognized bytes. Use this for logos `Err(())`.
Gpt2FamilyLogos trait
Maps a logos token enum to Gpt2FamilyTokenRole:
```rust
#![allow(unused)]
fn main() {
    pub trait Gpt2FamilyLogos<'a>: Logos<'a> {
        fn family_role(&self) -> Gpt2FamilyTokenRole;
    }
}
```
Implement this for your token enum and the engine handles all the post-processing.
contraction_split
An optional utility for cl100k-compatible lexers. Logos's longest-match rule picks `'The` as one
token, but cl100k's regex first-match rule picks `'T` (contraction) then `he` (letters). This
function detects the contraction prefix and returns the split point.

Most custom lexers won't need this. Set `check_contraction: false` and ignore it.
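For intuition, a simplified, hypothetical version of the contraction check might look like this (the real `contraction_split` in wordchipper handles byte offsets and Unicode casing more carefully):

```rust
// Find the longest case-insensitive contraction prefix and return its byte
// length, or None if the token doesn't start with a contraction.
fn contraction_split_point(tok: &str) -> Option<usize> {
    const PREFIXES: [&str; 7] = ["'s", "'t", "'re", "'ve", "'m", "'ll", "'d"];
    let lower = tok.to_lowercase();
    PREFIXES
        .iter()
        .filter(|p| lower.starts_with(**p))
        .map(|p| p.len())
        .max()
}

fn main() {
    assert_eq!(contraction_split_point("'The"), Some(2)); // 'T + he
    assert_eq!(contraction_split_point("hello"), None);
}
```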
Building a custom lexer: step by step
Let's build a lexer from scratch. We'll target a simplified pattern that handles letters, digits, punctuation, and whitespace.
Step 1: Define the logos token enum
```rust
#![allow(unused)]
fn main() {
    use logos::Logos;

    #[derive(Logos, Debug, PartialEq, Clone)]
    enum MyToken {
        #[regex(r"\p{Letter}+")]
        Letters,
        #[regex(r"\p{Number}{1,3}")]
        Digits,
        #[regex(r" ?[^\s\p{Letter}\p{Number}]+[\r\n]*")]
        Punctuation,
        #[regex(r"\s*[\r\n]+")]
        Newline,
        #[regex(r"[^\S\r\n]+")]
        Whitespace,
    }
}
```
Each variant maps to a regex fragment. Logos compiles all of them into a single DFA at build time.
Step 2: Implement Gpt2FamilyLogos
Map each token variant to its role. This is where you make design decisions: for each token type, ask "How should this interact with preceding whitespace?"
```rust
#![allow(unused)]
fn main() {
    use wordchipper::spanners::span_lexers::logos::gpt2_family::{
        Gpt2FamilyLogos, Gpt2FamilyTokenRole,
    };

    impl Gpt2FamilyLogos<'_> for MyToken {
        fn family_role(&self) -> Gpt2FamilyTokenRole {
            match self {
                // Whitespace is buffered; last char may merge into next token.
                Self::Whitespace => Gpt2FamilyTokenRole::Whitespace,
                // Letters absorb a preceding space when the token starts with
                // a letter. No contraction splitting needed for our pattern.
                Self::Letters => Gpt2FamilyTokenRole::Word {
                    check_contraction: false,
                    first_char_is_letter: true,
                },
                // Punctuation absorbs a preceding ASCII space (the ` ?` prefix).
                Self::Punctuation => Gpt2FamilyTokenRole::Punctuation,
                // Newlines are buffered separately.
                Self::Newline => Gpt2FamilyTokenRole::Newline,
                // Digits stand alone. They never merge with preceding whitespace.
                Self::Digits => Gpt2FamilyTokenRole::Standalone,
            }
        }
    }
}
```
The key insight: you don't need to understand the whitespace-splitting algorithm. You just need to
know what each token is, and `Gpt2FamilyTokenRole` maps that to the correct behavior.
Step 3: Use the logos_lexer! macro
The `logos_lexer!` macro generates the `SpanLexer` impl and registers the lexer with `inventory` for
automatic acceleration:
```rust
#![allow(unused)]
fn main() {
    logos_lexer! {
        /// Logos DFA word scanner for my custom pattern.
        pub struct MyLexer;
        token = MyToken;
        pattern = MY_PATTERN;
    }
}
```
This generates a struct that implements `SpanLexer`, which returns an iterator of word byte ranges.
The macro also registers the lexer via `inventory` so it automatically replaces the regex spanner
when the pattern matches.
If you need a manual `SpanLexer` impl instead (e.g. for a non-standard pattern), you can call
`logos_span_iter` directly:
```rust
#![allow(unused)]
fn main() {
    use core::ops::Range;
    use wordchipper::spanners::span_lexers::{SpanLexer, logos::gpt2_family::logos_span_iter};

    impl SpanLexer for MyLexer {
        fn find_span_iter<'a>(
            &'a self,
            text: &'a str,
        ) -> Box<dyn Iterator<Item = Range<usize>> + 'a> {
            Box::new(logos_span_iter(text, MyToken::lexer(text).spanned()))
        }
    }
}
```
The real thing: cl100k in ~65 lines
The built-in `Cl100kLexer` follows exactly this pattern. The token enum has 7 variants. The
`Gpt2FamilyLogos` impl is 15 lines. The `SpanLexer` impl is generated by `logos_lexer!`. You can
read the full source at `crates/wordchipper/src/spanners/span_lexers/logos/cl100k.rs`.
Gpt2FamilyTokenRole reference
| Variant | Absorbs preceding whitespace? | Use for |
|---|---|---|
| `Whitespace` | N/A (is whitespace) | Horizontal whitespace tokens (`[^\S\r\n]+`) |
| `Punctuation` | Yes, ASCII space only | Punctuation with ` ?` prefix (` ?[^\s\p{L}\p{N}]+`) |
| `Word { check_contraction, first_char_is_letter }` | Yes, if `first_char_is_letter` | Letter/word tokens (`\p{L}+`, case-sensitive word patterns) |
| `Standalone` | No | Digits, explicit contractions, anything that stands alone |
| `Newline` | N/A (buffered separately) | Newline-containing whitespace (`\s*[\r\n]+`) |
| `Gap` | No | Unrecognized bytes (logos errors). Always use for `Err(())` |
When in doubt, use `Standalone`. It's the safest default: the token is emitted as-is, and any
preceding whitespace becomes its own span.
Testing your lexer
A logos lexer must produce identical span boundaries to the regex it replaces. If they diverge on any input, benchmark comparisons are invalid because the two code paths tokenize different spans.
Equivalence testing with lexer-equivalence
The `lexer-equivalence` crate in `dev-crates/` provides exhaustive combinatorial testing. It
generates all k-character strings (k=1..4) from a set of Unicode representative codepoints and
compares the span output of each logos lexer against the regex reference.

The representative set is derived from the Unicode predicates used in the OpenAI patterns (`\p{Lu}`,
`\p{Ll}`, `\p{N}`, `\s`, etc.), which partition Unicode into 22 equivalence cells. Two codepoints in
the same cell are indistinguishable to every regex predicate, so testing one representative per cell
covers the full Unicode space. The test suite checks ~732,540 inputs per lexer and runs in under 10
seconds:

```sh
cargo test -p lexer-equivalence
```
The `validate_representative_completeness` test programmatically verifies that the representative
set covers every equivalence cell, so you can be confident the coverage is complete.
Adding tests for a custom lexer
To test a custom lexer against its regex, add a test that calls `assert_k_tuple_equivalence` from
the `lexer_equivalence::harness` module with your lexer and the regex pattern it targets. See
`dev-crates/lexer-equivalence/tests/equivalence.rs` for the pattern used by the built-in lexers.
Python & WASM Bindings
wordchipper has first-class bindings for Python and JavaScript/TypeScript. Both expose the same core operations: load a vocabulary, encode text to tokens, decode tokens back to text.
Python
Installation
```sh
pip install wordchipper
```
Basic usage
```python
from wordchipper import Tokenizer

tok = Tokenizer.from_pretrained("cl100k_base")

# Encode and decode
tokens = tok.encode("hello world")  # [15339, 1917]
text = tok.decode(tokens)           # "hello world"

# Batch operations (parallel via rayon)
results = tok.encode_batch(["hello", "world", "foo bar"])
texts = tok.decode_batch(results)
```
Vocabulary inspection
```python
tok.vocab_size                  # 100256
tok.token_to_id("hello")        # 15339
tok.id_to_token(15339)          # "hello"
tok.token_to_id("nonexistent")  # None

# Special tokens
tok.get_special_tokens()
# [('<|endoftext|>', 100257), ('<|fim_prefix|>', 100258), ...]
```
Available models
```python
Tokenizer.available_models()
# ['r50k_base',
#  'p50k_base',
#  'p50k_edit',
#  'cl100k_base',
#  'o200k_base',
#  'o200k_harmony']
```
Saving vocabularies
Export a vocabulary in tiktoken's base64 format:
```python
tok.save_base64_vocab("vocab.tiktoken")
```
Compatibility wrappers
Drop-in replacements for tiktoken and HuggingFace tokenizers. Change one import line and
the rest of your code stays the same.
tiktoken
```python
# Before
import tiktoken
enc = tiktoken.get_encoding("cl100k_base")

# After
from wordchipper.compat import tiktoken
enc = tiktoken.get_encoding("cl100k_base")

tokens = enc.encode("hello world")
text = enc.decode(tokens)

# Model lookup works too
enc = tiktoken.encoding_for_model("gpt-4o")
```
The `Encoding` class exposes `encode`, `encode_ordinary`, `encode_batch`, `decode`,
`decode_batch`, and properties `name`, `n_vocab`, `max_token_value`, `eot_token`, and
`special_tokens_set`. Parameters accepted for API compatibility but not implemented
(`allowed_special`, `disallowed_special`) raise `NotImplementedError` when set to
non-default values.
HuggingFace tokenizers
```python
# Before
from tokenizers import Tokenizer

# After
from wordchipper.compat.tokenizers import Tokenizer

tok = Tokenizer.from_pretrained("Xenova/gpt-4o")
output = tok.encode("hello world")
output.ids     # [24912, 2375]
output.tokens  # ["hello", " world"]
text = tok.decode(output.ids)
```
The `Tokenizer` class exposes `encode`, `encode_batch`, `decode`, `decode_batch`,
`get_vocab_size`, `token_to_id`, and `id_to_token`. Known HuggingFace identifiers
(e.g. `Xenova/gpt-4o`) are mapped automatically; bare encoding names like `cl100k_base`
also work. Parameters accepted for API compatibility but not implemented (e.g.
`pair`, `is_pretokenized`, `add_special_tokens=False`, `skip_special_tokens=False`)
raise `NotImplementedError` when set to non-default values.
Building from source
```sh
cd bindings/python
uv venv .venv && source .venv/bin/activate
uv pip install maturin pytest
maturin develop            # debug build
maturin develop --release  # release build for benchmarks
pytest tests/ -v
```
JavaScript / TypeScript (WASM)
wordchipper compiles to WebAssembly and runs in browsers and Node.js. The WASM build uses
default-features = false (no std, no parallelism, no file I/O), so all core tokenization works in
the browser without a server.
Quick start
```js
import { Tokenizer } from "./js/dist/index.js";

const tok = await Tokenizer.fromPretrained("o200k_base");

const tokens = tok.encode("hello world"); // Uint32Array [24912, 2375]
const text = tok.decode(tokens);          // "hello world"

tok.free(); // release WASM memory when done
```
Loading
Two ways to load a tokenizer:
```js
// Fetch from OpenAI's CDN (convenience)
const tok1 = await Tokenizer.fromPretrained("cl100k_base");

// Or from your own vocab bytes (no network request)
const data = new Uint8Array(/* .tiktoken file contents */);
const tok2 = await Tokenizer.fromVocabData("cl100k_base", data);
```

`fromPretrained` uses `fetch()` internally, so it works in both browser and Node.js 18+
environments.
Encode and decode
```js
// Single
const tokens = tok.encode("hello world"); // Uint32Array
const text = tok.decode(tokens);          // string

// Batch
const results = tok.encodeBatch(["hello", "world"]); // Uint32Array[]
const texts = tok.decodeBatch(results);              // string[]
```
Vocabulary inspection
```js
tok.vocabSize;                // 100256
tok.maxToken;                 // 100255 (or null)
tok.tokenToId("hello");       // 15339 (or null)
tok.idToToken(15339);         // "hello" (or null)
tok.getSpecialTokens();       // [["<|endoftext|>", 100257], ...]
Tokenizer.availableModels();  // ["r50k_base", "p50k_base", ...]
```
Memory management
WASM objects must be freed manually. Call `tok.free()` when you're done with a tokenizer to release
its WASM memory.
Building from source
Requires Rust, wasm-pack, and Node.js:
```sh
# Build the WASM package
wasm-pack build bindings/wasm --target web

# Build the TypeScript wrapper
cd bindings/wasm/js
npm install
npm run build
```
Examples
Working examples are included in the repository:
- Node.js: `examples/wasm-node/` - Encode/decode from a Node.js script
- Browser: `examples/wasm-browser/` - In-browser tokenization with a simple HTML page
- Live demo: Interactive Tokenizer Demo - Try it directly in this book
API comparison
All three produce identical token sequences for the same input and model.
Rust
- Load: `load_vocab("cl100k_base", &mut cache)`
- Encode: `tok.try_encode(text)`
- Decode: `tok.try_decode_to_string(&tokens)`
- Batch: `tok.try_encode_batch(&texts)`
- Vocab size: `tok.vocab().len()`
- Special tokens: `tok.special_vocab().span_map()`
Python
- Load: `Tokenizer.from_pretrained("cl100k_base")`
- Encode: `tok.encode(text)`
- Decode: `tok.decode(tokens)`
- Batch: `tok.encode_batch(texts)`
- Vocab size: `tok.vocab_size`
- Special tokens: `tok.get_special_tokens()`
JavaScript
- Load: `await Tokenizer.fromPretrained("cl100k_base")`
- Encode: `tok.encode(text)`
- Decode: `tok.decode(tokens)`
- Batch: `tok.encodeBatch(texts)`
- Vocab size: `tok.vocabSize`
- Special tokens: `tok.getSpecialTokens()`
no_std & Embedded
wordchipper's core tokenization pipeline works without the Rust standard library. This means you can
run BPE encoding and decoding on WASM targets, microcontrollers, and other no_std environments.
What works without std
The following all work with `default-features = false`:

- Vocabulary types (`UnifiedTokenVocab`, `ByteMapVocab`, `PairMapVocab`, `SpanMapVocab`)
- Text spanning (pre-tokenization)
- BPE encoding (`SpanEncoder` implementations)
- Token decoding
- Logos DFA lexers (compile-time, no runtime regex)
- Special token handling
What requires std
These features are behind feature flags that imply std:
| Feature | Requires std because |
|---|---|
| `download` | Network I/O, file system caching |
| `datagym` | JSON parsing, file I/O |
| `concurrent` | Thread pool, OS threads |
| `parallel` | Rayon requires OS threads |
| Regex-based spanning | `regex` and `fancy-regex` crates |
Note that `fast-hash` does not require `std`. See Feature Flags for the
full reference.
Configuration
```toml
[dependencies]
wordchipper = { version = "0.7", default-features = false }
```
That's it. No separate `no_std` feature flag is needed. The crate uses an unconditional `#![no_std]`
and conditionally links `std` when the `std` feature is active.
How it works internally
wordchipper uses the "Reddit PSA" pattern for no_std support:
```rust
#![no_std]

#[cfg(feature = "std")]
extern crate std;

extern crate alloc;
```
This means the crate always starts in no_std mode. The `std` feature adds standard library support
on top. All collection types come from `alloc` (`Vec`, `String`, `Box`) via an internal prelude
module.

For hash maps, `hashbrown` is always available as a non-optional dependency. When `fast-hash` is
enabled, `foldhash` is used as the hasher (with either `std::collections::HashMap` or
`hashbrown::HashMap`, depending on whether `std` is active). Without `fast-hash`, the default hasher
is used.
WASM targets
The WASM bindings (bindings/wasm/) demonstrate a full no_std integration. The WASM crate uses
default-features = false and builds the entire tokenization pipeline without std:
```toml
[dependencies]
wordchipper = { path = "../../crates/wordchipper", default-features = false }
```
Vocabulary loading in WASM works by parsing tiktoken-format data from `&[u8]` directly, bypassing
the std-gated `vocab::io` module.
CI verifies the no_std build against two targets:
```sh
# WASM (browser/Node.js)
cargo check --target wasm32-unknown-unknown --no-default-features

# ARM Cortex-M (bare metal)
cargo check --target thumbv7m-none-eabi --no-default-features
```
Embedded considerations
On memory-constrained targets, be aware that:
- Vocabulary size matters. `cl100k_base` has ~100k entries. Each entry stores a byte sequence and merge pairs. Budget several megabytes for the vocabulary data structures.
- No regex. Without `std`, regex-based spanning is unavailable. The logos DFA lexers work in `no_std` since they're compiled at build time, but only for known patterns (r50k, cl100k, o200k). For custom patterns, you'd need to implement `SpanLexer` directly.
- No parallelism. The `parallel` and `concurrent` features require `std`. All encoding runs single-threaded. Use `SingleThreadDefault` for the best single-threaded span encoder.
- Allocator required. wordchipper uses `alloc` (`Vec`, `String`, `Box`, `Arc`). Your target must provide a global allocator.
Example: building a vocabulary in no_std
Since `load_vocab` and file I/O require `std`, you need to construct the vocabulary from raw data.
The WASM bindings show one approach: parse a tiktoken-format `&[u8]` buffer into a `SpanMapVocab`,
combine with a `TextSpanningConfig`, and build a `UnifiedTokenVocab`.
The key types involved:
```rust
use wordchipper::{
    UnifiedTokenVocab,
    vocab::{SpanMapVocab, ByteMapVocab},
    spanners::TextSpanningConfig,
};

// 1. Parse your vocabulary data into a SpanMapVocab
// 2. Combine with a TextSpanningConfig for your pattern
// 3. Build: UnifiedTokenVocab::from_span_vocab(config, span_vocab)
```
The exact parsing depends on your data format. See bindings/wasm/src/lib.rs for a working
implementation.
Advanced: Span Encoders
This chapter covers wordchipper's span encoder architecture: what a span encoder does, how the five built-in implementations work, and when you might choose one over another.
What a span encoder does
After pre-tokenization splits text into spans, each span needs to be converted into a sequence of
token IDs. This is the job of the SpanEncoder trait:
pub trait SpanEncoder<T: TokenType>: Send {
fn encode_append_compound_span(
&mut self,
vocab: &UnifiedTokenVocab<T>,
span: &[u8],
tokens: &mut Vec<T>,
);
}
The encoder receives a byte slice (one span) and appends the resulting token IDs to a buffer. The vocabulary provides two resources for this:
- Span lookup (`lookup_token`): check if the entire span is a single known token. If so, return it directly. This is the fast path.
- Pair merge table (`pair_vocab`): a map from `(token_a, token_b) -> merged_token`. This drives the BPE merge loop.
When a span is a single token, encoding is a single hash map lookup. When it's not, the encoder must run BPE: start with byte-level tokens and iteratively merge pairs according to the merge table's rank ordering.
The BPE merge problem
The core BPE algorithm is simple to state: repeatedly merge the lowest-ranked pair in the sequence. But implementing it efficiently is tricky.
Consider the span "abcd" with merges ranked: (a,b)=0, (c,d)=1, (ab,cd)=2.
Start: [a, b, c, d]
Step 1: [ab, c, d] merge (a,b) at rank 0
Step 2: [ab, cd] merge (c,d) at rank 1
Step 3: [abcd] merge (ab,cd) at rank 2
The challenge is that after each merge, the neighboring pairs change. Merging (a,b) into ab
creates a new pair (ab,c) that didn't exist before. A naive implementation scans the entire
sequence after each merge to find the next lowest-ranked pair, giving O(n^2) behavior.
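The naive approach is simple enough to write down directly. The sketch below reproduces the "abcd" trace; the string tokens, integer ranks, and the `bpe_merge` name are illustrative only, not wordchipper's API:

```rust
use std::collections::HashMap;

/// Naive BPE merge: each round rescans the whole sequence for the globally
/// lowest-ranked adjacent pair, then merges it. Worst case O(n^2).
fn bpe_merge(mut seq: Vec<String>, ranks: &HashMap<(String, String), u32>) -> Vec<String> {
    loop {
        // Full scan: find the position of the lowest-ranked mergeable pair.
        let mut best: Option<(u32, usize)> = None;
        for i in 0..seq.len().saturating_sub(1) {
            if let Some(&r) = ranks.get(&(seq[i].clone(), seq[i + 1].clone())) {
                if best.map_or(true, |(b, _)| r < b) {
                    best = Some((r, i));
                }
            }
        }
        let Some((_, i)) = best else { return seq };
        // Merge: the neighbors of position i now form new candidate pairs,
        // which only the next full rescan will discover.
        let right = seq.remove(i + 1);
        seq[i].push_str(&right);
    }
}

fn main() {
    // Ranks from the "abcd" example: (a,b)=0, (c,d)=1, (ab,cd)=2.
    let ranks: HashMap<(String, String), u32> = HashMap::from([
        (("a".into(), "b".into()), 0),
        (("c".into(), "d".into()), 1),
        (("ab".into(), "cd".into()), 2),
    ]);
    let start: Vec<String> = vec!["a".into(), "b".into(), "c".into(), "d".into()];
    assert_eq!(bpe_merge(start, &ranks), vec!["abcd".to_string()]);
}
```

The rescan after every merge is exactly what the optimized encoders below avoid.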
Why shift-reduce doesn't work
A tempting optimization is a shift-reduce parser: push tokens onto a stack, and whenever the top two tokens form a mergeable pair with a lower rank than the next input pair, reduce (merge) them. This runs in O(n) and works for most inputs.
But it fails on some inputs because BPE requires global rank ordering, not just local comparison. A local decision to merge might miss a lower-ranked pair further to the right. This was verified empirically against cl100k and o200k corpora.
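A contrived rank table makes the failure concrete. In the sketch below, both functions and all rank values are invented for the demonstration (they are not wordchipper code): `(c,d)` is the globally cheapest merge, but a local shift-reduce strategy commits to `(a,b)` before it ever sees it, because the intervening pair `(b,c)` is not mergeable.

```rust
use std::collections::HashMap;

type Ranks = HashMap<(String, String), u32>;

/// Global BPE: always merge the lowest-ranked pair anywhere in the sequence.
fn bpe_global(mut seq: Vec<String>, ranks: &Ranks) -> Vec<String> {
    loop {
        let best = (0..seq.len().saturating_sub(1))
            .filter_map(|i| ranks.get(&(seq[i].clone(), seq[i + 1].clone())).map(|&r| (r, i)))
            .min();
        let Some((_, i)) = best else { return seq };
        let right = seq.remove(i + 1);
        seq[i].push_str(&right);
    }
}

/// Local shift-reduce: merge the top-of-stack pair whenever it has a lower
/// rank than the pair the top token would form with the next input token.
fn bpe_shift_reduce(input: Vec<String>, ranks: &Ranks) -> Vec<String> {
    let mut stack: Vec<String> = Vec::new();
    let mut iter = input.into_iter().peekable();
    while let Some(tok) = iter.next() {
        stack.push(tok);
        loop {
            let len = stack.len();
            if len < 2 {
                break;
            }
            let Some(&top) = ranks.get(&(stack[len - 2].clone(), stack[len - 1].clone())) else {
                break;
            };
            // Rank of the pair the top token would form with the next input.
            let ahead = iter
                .peek()
                .and_then(|nxt| ranks.get(&(stack[len - 1].clone(), nxt.clone())));
            if ahead.map_or(true, |&a| top < a) {
                let right = stack.pop().unwrap();
                stack.last_mut().unwrap().push_str(&right);
            } else {
                break;
            }
        }
    }
    stack
}

fn main() {
    let ranks: Ranks = HashMap::from([
        (("c".into(), "d".into()), 0),
        (("b".into(), "cd".into()), 1),
        (("a".into(), "b".into()), 2),
    ]);
    let input: Vec<String> = vec!["a".into(), "b".into(), "c".into(), "d".into()];
    // Global BPE merges (c,d), then (b,cd): [a, bcd].
    assert_eq!(bpe_global(input.clone(), &ranks), vec!["a".to_string(), "bcd".to_string()]);
    // Shift-reduce merges (a,b) early and ends up with [ab, cd].
    assert_eq!(bpe_shift_reduce(input, &ranks), vec!["ab".to_string(), "cd".to_string()]);
}
```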
The five implementations
BufferSweep (Reference)
The simplest correct implementation. Maintains a buffer of tokens and repeatedly sweeps through it, merging the globally lowest-ranked pair in each pass.
- Time: O(n * m) where m is the number of merge rounds
- Space: O(n)
- Use case: Testing and correctness comparison. Not optimized.
Selected via SpanEncoderSelector::Reference or SpanEncoderSelector::BufferSweep.
PriorityMerge (SingleThreadDefault)
Uses a priority queue (min-heap) to track mergeable pairs. Instead of scanning the whole buffer each round, it pops the lowest-ranked pair from the heap. After merging, it updates the neighboring pairs in the heap.
- Time: O(n log n) amortized
- Space: O(n) for the heap
- Use case: Single-threaded workloads. Best single-thread performance.
Selected via SpanEncoderSelector::SingleThreadDefault or SpanEncoderSelector::PriorityMerge.
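The idea can be illustrated with a simplified, self-contained sketch (string tokens, a doubly-linked list over indices, and lazy invalidation of stale heap entries; this is not wordchipper's actual PriorityMerge code):

```rust
use std::cmp::Reverse;
use std::collections::{BinaryHeap, HashMap};

/// Heap-driven BPE sketch: candidate pairs live in a min-heap keyed by
/// (rank, position); tokens form a doubly-linked list so a merge only
/// touches its two neighbors. Stale heap entries are skipped on pop.
fn bpe_merge_heap(mut tok: Vec<String>, ranks: &HashMap<(String, String), u32>) -> Vec<String> {
    let n = tok.len();
    let mut prev: Vec<isize> = (0..n).map(|i| i as isize - 1).collect();
    let mut next: Vec<isize> = (0..n).map(|i| if i + 1 == n { -1 } else { i as isize + 1 }).collect();
    let mut alive = vec![true; n];
    let mut heap: BinaryHeap<Reverse<(u32, usize)>> = BinaryHeap::new();

    // Seed the heap with every initially mergeable adjacent pair.
    for i in 0..n.saturating_sub(1) {
        if let Some(&r) = ranks.get(&(tok[i].clone(), tok[i + 1].clone())) {
            heap.push(Reverse((r, i)));
        }
    }

    while let Some(Reverse((r, i))) = heap.pop() {
        // Lazy invalidation: re-check that the pair at position i still
        // exists and still has the rank it was queued with.
        if !alive[i] || next[i] < 0 {
            continue;
        }
        let j = next[i] as usize;
        if ranks.get(&(tok[i].clone(), tok[j].clone())) != Some(&r) {
            continue; // stale entry from before a neighboring merge
        }
        // Merge j into i and unlink j from the list.
        let right = std::mem::take(&mut tok[j]);
        tok[i].push_str(&right);
        alive[j] = false;
        next[i] = next[j];
        if next[j] >= 0 {
            prev[next[j] as usize] = i as isize;
        }
        // The merge created (up to) two new candidate pairs.
        for &(a, b) in &[(prev[i], i as isize), (i as isize, next[i])] {
            if a >= 0 && b >= 0 {
                if let Some(&nr) = ranks.get(&(tok[a as usize].clone(), tok[b as usize].clone())) {
                    heap.push(Reverse((nr, a as usize)));
                }
            }
        }
    }
    (0..n).filter(|&i| alive[i]).map(|i| tok[i].clone()).collect()
}

fn main() {
    let ranks: HashMap<(String, String), u32> = HashMap::from([
        (("a".into(), "b".into()), 0),
        (("c".into(), "d".into()), 1),
        (("ab".into(), "cd".into()), 2),
    ]);
    let seq: Vec<String> = vec!["a".into(), "b".into(), "c".into(), "d".into()];
    assert_eq!(bpe_merge_heap(seq, &ranks), vec!["abcd".to_string()]);
}
```

Each merge does O(log n) heap work instead of an O(n) rescan, which is where the O(n log n) bound comes from.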
MergeHeap (ConcurrentDefault)
Similar to PriorityMerge but optimized for concurrent access patterns. The internal state is designed to play well when multiple threads encode different spans through the same tokenizer.
- Time: O(n log n) amortized
- Space: O(n)
- Use case: Multi-threaded workloads with rayon. Default for `parallel(true)`.
Selected via SpanEncoderSelector::ConcurrentDefault or SpanEncoderSelector::MergeHeap.
TailSweep
An alternative scanning strategy that sweeps from the tail of the buffer. Useful in memory-constrained scenarios.
- Time: O(n * m)
- Space: O(n)
- Use case: When memory allocation patterns matter.
Selected via SpanEncoderSelector::TailSweep.
BpeBacktrack
A fundamentally different approach. Instead of starting with bytes and merging up, it starts with the longest possible tokens and backtracks when merge boundaries don't align.
The algorithm:
- Build an Aho-Corasick automaton from all vocabulary token byte sequences.
- Scan the input with leftmost-longest matching to find the longest token at each position.
- Validate boundaries: check that each token pair boundary is a valid BPE merge point using `is_valid_token_pair`. If not, backtrack to a shorter token.
- Walk the next-prefix chain when backtracking: each token has a precomputed "next longest prefix" that allows O(1) fallback.
This gives O(n) encoding time (amortized, since backtracking visits each byte at most twice).
- Time: O(n) amortized
- Space: O(V) for the AC automaton (built once, shared via Arc)
- Use case: When exact BPE semantics are required, or for very long spans.
- Trade-off: Higher upfront cost to build the automaton from the vocabulary.
Selected via SpanEncoderSelector::BpeBacktrack.
The implementation is based on the work by Hendrik van Antwerpen and Alexander Neubeck at GitHub Next (2023), described in the blog post "So many tokens, so little time."
Precomputed data structures
BpeBacktrack is the only encoder that pre-builds data structures from the vocabulary. The
BpeVocab struct contains:
- `pair_lookup`: the standard `(T, T) -> T` merge table.
- `split_table`: inverse of `pair_lookup`. For each token, stores its canonical split into two sub-tokens.
- `next_prefix`: for each token, the longest prefix token that is shorter by one or more bytes. This enables O(1) backtracking.
- `ac`: the Aho-Corasick automaton over all token byte sequences.
- `token_lens`: byte length of each token, indexed by token ID.
This is built once by BpeVocab::from_vocab and shared via Arc across all encoder instances.
Choosing the right encoder
| Scenario | Recommended | Why |
|---|---|---|
| Web server, many concurrent requests | ConcurrentDefault | MergeHeap plays well with thread pools |
| CLI tool, single-threaded | SingleThreadDefault | PriorityMerge is fastest single-threaded |
| Correctness testing | Reference | BufferSweep is simplest to reason about |
| Very long spans (>1KB) | BpeBacktrack | O(n) vs O(n log n) matters for long inputs |
| Memory-constrained embedded | TailSweep or SingleThreadDefault | Lower allocation pressure |
For most users, the defaults (ConcurrentDefault when parallel, SingleThreadDefault otherwise)
are the right choice. The SpanEncoderSelector enum makes switching a one-line change for
benchmarking.
The fast path: single-token spans
Before any BPE algorithm runs, the encoder checks whether the entire span is already a single token in the vocabulary. This is a single hash map lookup and is O(1). For common words like "the", "is", " world", this fast path handles the majority of spans.
The encode_append_span_ref method in the SpanEncoder trait implements this:
SpanRef::Word(range) => {
let span = text[range].as_bytes();
if let Some(token) = vocab.lookup_token(span) {
tokens.push(token); // fast path: single lookup
} else {
self.encode_append_compound_span(vocab, span, tokens); // BPE merge loop
}
}
In practice, with large vocabularies like o200k_base (200k tokens), the vast majority of spans hit
the fast path. BPE only runs on uncommon words, misspellings, and multilingual text that doesn't
appear verbatim in the vocabulary.
Feature Flags
These are the Cargo features available on wordchipper. Features can be enabled in your
Cargo.toml:
[dependencies]
wordchipper = { version = "0.8", features = ["client"] }
The default features are std, fast-hash, and parallel.
features = ["std"]
Enabled by default.
Provide standard library integration: regex-based spanning via regex and fancy-regex, file I/O,
and std::collections::HashMap. Building with default-features = false removes the standard
library dependency, leaving a pure no_std crate that uses hashbrown for hash maps and logos DFA
lexers for pre-tokenization.
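For example, a no_std build might keep only features that don't require the standard library (the version number follows the example above; adjust to your release):

```toml
[dependencies]
wordchipper = { version = "0.8", default-features = false, features = ["fast-hash"] }
```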
features = ["fast-hash"]
Enabled by default.
Use foldhash for faster hash maps. When combined with std, the standard library's HashMap is
used with foldhash's hasher. Without std, hashbrown::HashMap is used with foldhash's hasher.
This feature has no std requirement, so it works in no_std environments.
features = ["parallel"]
Enabled by default. Implies concurrent.
Enable rayon-based parallelism for batch encoding and decoding. Control the thread count with the
RAYON_NUM_THREADS environment variable.
features = ["concurrent"]
Implies std.
Enable the thread pool (PoolToy) and concurrency utilities used for concurrent encoder access.
The parallel feature enables this automatically; use concurrent directly if you want the thread
pool without pulling in rayon.
features = ["client"]
Implies download and datagym (both of which imply std).
Everything needed to load pretrained vocabularies: downloading from the network and parsing DataGym-format files.
features = ["download"]
Implies std.
Download and cache vocabulary files from the internet.
features = ["datagym"]
Implies std.
Load DataGym-format vocabularies (used by older OpenAI models like GPT-2). Pulls in serde_json.
features = ["tracing"]
Add tracing instrumentation points throughout the encoding pipeline. Only useful for profiling
the library itself.
features = ["testing"]
Export test utilities for downstream crates to use in their own test suites.
Interactive Tokenizer Demo
Try wordchipper's BPE tokenizer directly in the browser via WebAssembly. Select a model, type some text, and click Encode & Decode to see the token IDs, roundtrip verification, and timing.