Introduction

What is a tokenizer?

Large language models don't read text the way humans do. Before a model can process the sentence "The cat sat on the mat," it must convert that string of characters into a sequence of numbers called tokens. Each token is an integer that maps to an entry in a fixed vocabulary. The process of converting text to tokens is called tokenization, and the software that does it is a tokenizer.

Tokenization sits at the boundary between human-readable text and machine-readable numbers. Every time you send a prompt to an LLM, a tokenizer runs first. Every time the model generates a response, a tokenizer runs in reverse to turn the output tokens back into text.

A simple example

Consider the string "hello world". A tokenizer using OpenAI's cl100k_base vocabulary encodes it as two tokens:

"hello world" -> [15339, 1917]

The token 15339 maps to the bytes for "hello", and 1917 maps to " world" (note the leading space). Decoding reverses this: [15339, 1917] becomes the original string.

Not all strings split at word boundaries. The tokenizer might split "tokenization" into ["token", "ization"], or keep "the" as a single token. The exact splits depend on the vocabulary and algorithm.

Why does tokenization matter?

Tokenization affects LLM behavior in ways that are easy to overlook:

  • Cost. API pricing is per-token. A tokenizer that produces fewer tokens for the same text saves money. A tokenizer that splits your language inefficiently costs more.
  • Context window. Models have a fixed token limit. Efficient tokenization means more text fits in the window.
  • Latency. Each token requires a forward pass through the model. Fewer tokens means faster inference.
  • Multilingual fairness. Some tokenizers handle English efficiently but split Chinese, Arabic, or Hindi into many more tokens for the same semantic content.
  • Reproducibility. If you're building retrieval, caching, or evaluation pipelines, you need a tokenizer that produces exactly the same output as the model's own tokenizer. Approximate doesn't cut it.

What is BPE?

Byte Pair Encoding (BPE) is the most widely used tokenization algorithm for large language models. It works by starting with individual bytes and iteratively merging the most frequent adjacent pairs into new tokens. The result is a vocabulary of subword units that balances two goals: common words like "the" become a single token, while rare words like "defenestration" decompose into reusable pieces (["def", "en", "est", "ration"]).

BPE is used by GPT-2 through GPT-4o and most of their derivatives. The How Tokenizers Work chapter walks through the algorithm step by step.

What is wordchipper?

wordchipper is a high-performance BPE tokenizer library written in Rust. It encodes and decodes text using the same vocabularies as OpenAI's models (GPT-2, GPT-3.5, GPT-4, GPT-4o), producing identical output to the reference implementations.

Key properties:

  • Fast. Up to 2x the throughput of tiktoken-rs on encode/decode benchmarks, with 30-50x faster pre-tokenization via compile-time DFA lexers.
  • Correct. Validated against OpenAI's tiktoken and HuggingFace tokenizers on large corpora. Token-for-token identical output.
  • Portable. Core tokenization works in no_std environments. Runs on WASM, embedded targets, and bare-metal Rust.
  • Multi-language. Native Rust API, plus Python and JavaScript/TypeScript bindings.
  • Extensible. Bring your own vocabulary, write custom lexers, swap BPE algorithms.

Supported models

Model          Used by           Vocab size
r50k_base      GPT-2             ~50k
p50k_base      Codex             ~50k
p50k_edit      Codex (edit)      ~50k
cl100k_base    GPT-3.5, GPT-4    ~100k
o200k_base     GPT-4o            ~200k
o200k_harmony  GPT-4o (harmony)  ~200k

Who is this book for?

This book is for anyone who wants to understand how tokenizers work and how to use wordchipper effectively. It's structured in layers: a hands-on introduction, the algorithms behind BPE, the pretrained model catalog, and training your own tokenizer.

Getting Started

Installation

Add wordchipper to your Cargo.toml:

[dependencies]
wordchipper = "0.7"

The default features include std, fast-hash (foldhash hasher), and parallel (rayon). Add client for vocabulary download. See Feature Flags for all options and no_std & Embedded for minimal configurations.

Encode and decode in five lines

use wordchipper::{
    Tokenizer, TokenizerOptions, TokenEncoder, TokenDecoder,
    load_vocab, disk_cache::WordchipperDiskCache,
};

fn main() -> wordchipper::WCResult<()> {
    // Load vocabulary (downloads and caches on first run)
    let mut cache = WordchipperDiskCache::default();
    let (_desc, vocab) = load_vocab("openai:cl100k_base", &mut cache)?;

    // Build a tokenizer
    let tok = TokenizerOptions::default().build(vocab);

    // Encode
    let tokens = tok.try_encode("hello world")?;
    assert_eq!(tokens, vec![15339, 1917]);

    // Decode
    let text = tok.try_decode_to_string(&tokens)?;
    assert_eq!(text, "hello world");

    Ok(())
}

The first call to load_vocab downloads the vocabulary file from OpenAI's CDN and caches it in ~/.cache/wordchipper/. Subsequent calls load from disk.

Available models

use wordchipper::list_models;

fn main() {
    for model in list_models() {
        println!("{}", model);
    }
}

This prints all registered models with their namespace prefix:

openai::r50k_base
openai::p50k_base
openai::p50k_edit
openai::cl100k_base
openai::o200k_base
openai::o200k_harmony

You can also use short names without the openai:: prefix when loading:

#![allow(unused)]
fn main() {
use wordchipper::{load_vocab, disk_cache::WordchipperDiskCache};
let mut cache = WordchipperDiskCache::default();
let (_desc, vocab) = load_vocab("o200k_base", &mut cache).unwrap();
}

Batch encoding

For multiple strings, batch encoding runs spans in parallel across threads (when the parallel feature is enabled):

use wordchipper::{
    TokenEncoder, TokenizerOptions,
    load_vocab, disk_cache::WordchipperDiskCache,
};

fn main() -> wordchipper::WCResult<()> {
    let mut cache = WordchipperDiskCache::default();
    let (_desc, vocab) = load_vocab("openai:cl100k_base", &mut cache)?;
    let tok = TokenizerOptions::default()
        .with_parallel(true)
        .build(vocab);

    let texts = vec!["hello world", "the cat sat on the mat", "foo bar baz"];
    let batch = tok.try_encode_batch(&texts)?;

    for (text, tokens) in texts.iter().zip(batch.iter()) {
        println!("{:?} -> {:?}", text, tokens);
    }

    Ok(())
}

Choosing a token type

wordchipper is generic over the token integer type. Most users should use u32, which is the default. If your vocabulary fits in 16 bits (< 65,536 entries), you can use u16 to save memory:

use std::sync::Arc;
use wordchipper::{UnifiedTokenVocab, TokenizerOptions, load_vocab, disk_cache::WordchipperDiskCache};

fn main() -> wordchipper::WCResult<()> {
    let mut cache = WordchipperDiskCache::default();
    let (_desc, vocab) = load_vocab("openai:r50k_base", &mut cache)?;

    // Convert to u16 (works for r50k_base with ~50k tokens)
    let vocab_u16: Arc<UnifiedTokenVocab<u16>> = Arc::new(vocab.as_ref().to_token_type());
    let tok = TokenizerOptions::default().build(vocab_u16);

    Ok(())
}

For cl100k_base (~100k tokens) and o200k_base (~200k tokens), use u32.

What happens under the hood

When you call try_encode("hello world"), wordchipper runs a two-phase pipeline:

  1. Spanning. The text is split into coarse segments called spans. For OpenAI models, this uses a regex pattern (or a faster DFA lexer) that handles whitespace, punctuation, letters, and digits. "hello world" becomes ["hello", " world"].

  2. BPE encoding. Each span is encoded independently using Byte Pair Encoding. The encoder iteratively merges adjacent byte pairs according to the vocabulary's merge table, always picking the lowest-ranked (highest priority) pair first. The result is a sequence of token IDs.

Decoding reverses the process: each token ID maps to a byte sequence, and the byte sequences are concatenated to produce the original text.

The next chapter explains this pipeline in detail, with visual examples of how BPE works.

How Tokenizers Work

This chapter explains the algorithms behind BPE tokenization. If you already know how BPE works, skip to Pretrained Models. To see tokenization in action, try the Interactive Tokenizer Demo.

The problem: turning text into numbers

Neural networks operate on fixed-size numerical inputs. A language model needs a way to convert arbitrary text into a sequence of integers from a finite vocabulary. The simplest approaches have obvious downsides:

  • Character-level: Vocabulary is small (~256 for ASCII, ~150k for full Unicode), but sequences are long. "tokenization" becomes 12 tokens. Long sequences are expensive: attention cost grows quadratically with sequence length.
  • Word-level: "tokenization" is one token, but the vocabulary must include every possible word. Rare words, typos, and new terms become <UNK> (unknown). Multilingual text explodes the vocabulary.
  • Subword: Split text into pieces that balance vocabulary size against sequence length. "tokenization" might become ["token", "ization"]. Common words stay whole. Rare words decompose into reusable parts.

BPE is a subword tokenization algorithm. It's used by GPT-2 through GPT-4o, and it's what wordchipper implements.

Byte Pair Encoding (BPE)

BPE was originally a data compression algorithm (Gage, 1994). The NLP adaptation (Sennrich et al., 2016) works in two phases: training and encoding.

Training: building the vocabulary

Training starts with a base vocabulary of individual bytes (0-255) and iteratively discovers the most useful multi-byte tokens. In practice, the training corpus is first split into spans by a regex pattern (see Pre-tokenization below), and pair counting happens within each span independently. The regex determines what token boundaries are possible. The steps below describe the core merge algorithm that runs on those spans.

Step 1. Start with every unique byte as a token. The vocabulary has 256 entries.

Step 2. Count every adjacent pair of tokens across all spans. Find the most frequent pair.

Step 3. Create a new token that represents that pair merged together. Add it to the vocabulary. Record the merge rule: (token_a, token_b) -> new_token.

Step 4. Replace all occurrences of that pair in the corpus with the new token.

Step 5. Repeat from Step 2 until the vocabulary reaches the desired size.

Example: training on "aabaa aab"

Start with bytes: a=0, b=1, space=2

Corpus: [a, a, b, a, a, ' ', a, a, b]

Round 1: Most frequent pair is (a, a), count=3
  Merge: (a, a) -> aa (token 3)
  Corpus becomes: [aa, b, aa, ' ', aa, b]

Round 2: Most frequent pair is (aa, b), count=2
  Merge: (aa, b) -> aab (token 4)
  Corpus becomes: [aab, aa, ' ', aab]

Round 3: Most frequent pair is (aab, aa), count=1
  (tie with others; pick first)
  ...

After training, the vocabulary contains bytes plus all discovered merges, and the merge table records the order. The rank of a merge (its position in the table) determines priority during encoding.
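The rounds above can be sketched as a naive training loop. This is a toy, quadratic version for illustration only; a real trainer uses the pair index and heap described in the Training chapter:

```rust
use std::collections::HashMap;

/// One naive BPE training round: find the most frequent adjacent pair
/// and replace every occurrence of it with a fresh token ID.
fn train_round(corpus: &mut Vec<u32>, next_id: u32) -> Option<(u32, u32)> {
    let mut counts: HashMap<(u32, u32), usize> = HashMap::new();
    for w in corpus.windows(2) {
        *counts.entry((w[0], w[1])).or_insert(0) += 1;
    }
    // Break count ties deterministically by preferring the smallest pair.
    let (&best, _) = counts
        .iter()
        .max_by_key(|(&pair, &c)| (c, std::cmp::Reverse(pair)))?;
    let mut merged = Vec::with_capacity(corpus.len());
    let mut i = 0;
    while i < corpus.len() {
        if i + 1 < corpus.len() && (corpus[i], corpus[i + 1]) == best {
            merged.push(next_id); // replace the pair with the new token
            i += 2;
        } else {
            merged.push(corpus[i]);
            i += 1;
        }
    }
    *corpus = merged;
    Some(best)
}

fn main() {
    // "aabaa aab" with a=0, b=1, space=2, as in the example above.
    let mut corpus = vec![0, 0, 1, 0, 0, 2, 0, 0, 1];
    let mut merges = Vec::new();
    for round in 0..2 {
        let pair = train_round(&mut corpus, 3 + round).unwrap();
        merges.push((pair, 3 + round));
    }
    assert_eq!(merges[0], ((0, 0), 3)); // Round 1: (a, a) -> aa
    assert_eq!(merges[1], ((3, 1), 4)); // Round 2: (aa, b) -> aab
    assert_eq!(corpus, vec![4, 3, 2, 4]); // [aab, aa, ' ', aab]
}
```

Running two rounds reproduces the merge table and corpus state from the worked example.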

Encoding: applying the merges

Given a trained vocabulary, encoding a string works as follows:

Step 1. Convert the input to a sequence of byte tokens.

Step 2. Find all adjacent pairs. Look up each pair in the merge table.

Step 3. Merge the pair with the lowest rank (highest priority). This is the pair that was learned earliest during training, meaning it's the most frequent.

Step 4. Repeat from Step 2 until no more merges are possible.

Example: encoding "aab"

Using the vocabulary from above:

Start:    [a, a, b]       (3 byte tokens)
Pairs:    (a,a)=rank 0    (a,b)=not in table
Merge:    [aa, b]         (merge (a,a) at rank 0)
Pairs:    (aa,b)=rank 1
Merge:    [aab]           (merge (aa,b) at rank 1)
No pairs: done.
Result:   [4]             (token 4 = "aab")

The key insight: BPE encoding is deterministic. Given the same vocabulary and merge table, the same input always produces the same tokens. This is why wordchipper can be validated against tiktoken and produce identical output.
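The encode loop can be sketched in a few lines of Rust. This is an illustrative, unoptimized version; it also assumes a toy ID scheme where each merge's token ID is 256 plus its rank, as in a freshly trained byte-level vocabulary:

```rust
use std::collections::HashMap;

/// Encode one span: repeatedly merge the adjacent pair with the lowest
/// rank in the merge table until no pair is mergeable.
fn bpe_encode(bytes: &[u8], ranks: &HashMap<(u32, u32), u32>) -> Vec<u32> {
    // Start from byte-level tokens (token ID == byte value).
    let mut toks: Vec<u32> = bytes.iter().map(|&b| b as u32).collect();
    loop {
        // Find the mergeable pair with the lowest rank (highest priority).
        let best = toks
            .windows(2)
            .enumerate()
            .filter_map(|(i, w)| ranks.get(&(w[0], w[1])).map(|&r| (r, i)))
            .min();
        let Some((rank, i)) = best else { break };
        let new_tok = 256 + rank; // toy scheme: merge rank -> token ID
        toks.splice(i..i + 2, [new_tok]);
    }
    toks
}

fn main() {
    // Merge table from the toy example, using real byte values:
    // a=97, b=98; rank 0 = (a,a) -> 256, rank 1 = (aa,b) -> 257.
    let ranks = HashMap::from([((97, 97), 0), ((256, 98), 1)]);
    assert_eq!(bpe_encode(b"aab", &ranks), vec![257]);
    assert_eq!(bpe_encode(b"ab", &ranks), vec![97, 98]); // no merge applies
}
```

Note the loop always scans for the globally lowest-ranked pair before merging, which is exactly what makes the output deterministic.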

Pre-tokenization: splitting before BPE

Running BPE directly on raw text would be wasteful. The string "hello world" would start as individual bytes and need many merge steps. Worse, the space between words could merge with the surrounding characters, creating tokens that span word boundaries (like "o w"), which hurts the model's ability to generalize.

The solution is pre-tokenization (also called spanning in wordchipper). Before BPE runs, the text is split into coarse segments using a regex pattern. Each segment is then encoded independently.

OpenAI's cl100k_base uses a regex pattern along these lines (shown slightly simplified here; the full pattern also handles newlines explicitly):

(?i:'s|'t|'re|'ve|'m|'ll|'d)
|[^\r\n\p{L}\p{N}]?\p{L}+
|\p{N}{1,3}
| ?[^\s\p{L}\p{N}]+
|\s+(?!\S)
|\s+

This splits text into:

  • Contractions: 's, 't, 're, etc.
  • Words (letters), optionally preceded by a non-letter character
  • Numbers, up to 3 digits at a time
  • Punctuation sequences, optionally preceded by a space
  • Whitespace (with a special rule for trailing space)

For example, "Hello, world! 123" becomes:

["Hello", ",", " world", "!", " 123"]

Each of these spans is then BPE-encoded independently. This keeps word boundaries clean and makes encoding much faster (each span is short).

The whitespace rule

The pattern \s+(?!\S) is worth understanding. It matches whitespace greedily, but the negative lookahead forces it to stop one character short whenever a non-whitespace character follows. So "   world" (3 spaces + "world") becomes:

["  ", " world"]

The last space "leaks" into the next word. This is an intentional design choice by OpenAI: it means the model sees leading-space tokens as word starts, improving prediction.
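The back-off behavior can be mimicked without a regex engine. The helper below is hypothetical, not part of wordchipper's API, and handles only ASCII spaces, where the real patterns match all Unicode whitespace:

```rust
/// Split a non-empty run of spaces the way `\s+(?!\S)` does: if the run
/// is followed by a non-space character, the last space "leaks" out so
/// it can attach to the next word's span.
fn split_space_run(run: &str, followed_by_nonspace: bool) -> (&str, &str) {
    if followed_by_nonspace {
        // Back off one character; it joins the following word.
        run.split_at(run.len() - 1)
    } else {
        // Trailing whitespace (end of text): keep the whole run.
        (run, "")
    }
}

fn main() {
    // "   world": three spaces before a word.
    let (head, leak) = split_space_run("   ", true);
    assert_eq!(head, "  "); // first two spaces form their own span
    assert_eq!(leak, " ");  // the last space attaches to "world"
    assert_eq!(split_space_run("  ", false), ("  ", ""));
}
```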

This lookahead is straightforward in a regex engine but impossible in a DFA (deterministic finite automaton), which is why wordchipper's logos-based lexers need a post-processing step. See Building Custom Logos Lexers for details.

The wordchipper pipeline

Putting it all together, here's what happens when you call try_encode("Hello, world!"):

Input:     "Hello, world!"
             |
             v
  ┌─────────────────────┐
  │   1. Pre-tokenize   │  (regex or DFA lexer)
  │   Split into spans  │
  └──────────┬──────────┘
             |
   ["Hello", ",", " world", "!"]
             |
             v
  ┌─────────────────────┐
  │   2. BPE encode     │  (per-span, possibly parallel)
  │   Merge byte pairs  │
  └──────────┬──────────┘
             |
   [9906, 11, 1917, 0]
             |
             v
         Token IDs

In wordchipper's architecture:

  • Step 1 is handled by a TextSpanner (the spanners module). The default uses regex. For known patterns (cl100k, o200k, r50k), a compile-time DFA lexer is used automatically for 30-50x faster spanning.
  • Step 2 is handled by a SpanEncoder (the encoders module). Multiple BPE algorithms are available, optimized for different use cases.

The Tokenizer type combines both steps behind a single try_encode / try_decode interface.

Decoding: tokens back to text

Decoding is simpler than encoding. Each token ID maps to a byte sequence in the vocabulary. The decoder concatenates all byte sequences and interprets the result as UTF-8:

[9906,   11,  1917, 0]
  |      |    |     |
  v      v    v     v
"Hello" "," " world" "!"
  |
  v
"Hello, world!"

No merging or splitting needed. Decoding is always O(n) in the number of tokens.
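A minimal decoder looks like this, with a plain Vec<Vec<u8>> token-to-bytes table standing in for the library's vocabulary type:

```rust
/// Decode token IDs by concatenating each token's byte sequence, then
/// interpreting the result as UTF-8.
fn decode(tokens: &[u32], vocab: &[Vec<u8>]) -> Option<String> {
    let mut bytes = Vec::new();
    for &t in tokens {
        // Unknown token IDs make decoding fail.
        bytes.extend_from_slice(vocab.get(t as usize)?);
    }
    // Also fails if the bytes don't form valid UTF-8 (e.g. a multi-byte
    // character split across a truncated token sequence).
    String::from_utf8(bytes).ok()
}

fn main() {
    // A 4-entry toy vocabulary mirroring the diagram above.
    let vocab: Vec<Vec<u8>> = vec![
        b"Hello".to_vec(),
        b",".to_vec(),
        b" world".to_vec(),
        b"!".to_vec(),
    ];
    assert_eq!(decode(&[0, 1, 2, 3], &vocab), Some("Hello, world!".into()));
    assert_eq!(decode(&[9], &vocab), None); // unknown token ID
}
```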

Other tokenization approaches

BPE is not the only subword algorithm. Two others are worth knowing:

WordPiece (BERT)

WordPiece is similar to BPE but uses a different training objective. Instead of merging the most frequent pair, it merges the pair that maximizes the likelihood of the training data. Subword pieces after the first are prefixed with ## (e.g., "token" + "##ization").

Used by: BERT, DistilBERT, ELECTRA.

SentencePiece / Unigram (Llama, Gemini)

SentencePiece treats the input as a raw byte stream and replaces spaces with a special ▁ character (U+2581) before tokenization. The Unigram variant starts with a large vocabulary and prunes it down based on a unigram language model.

Used by: Llama, Gemini, T5, ALBERT.

wordchipper focuses on BPE tokenizers with OpenAI-style regex pre-tokenization. Tokenizers that use SentencePiece or WordPiece have fundamentally different pre-tokenization and don't need the spanning machinery described in this book.

Further reading

  • Sennrich et al. (2016), "Neural Machine Translation of Rare Words with Subword Units" (the paper that adapted BPE for NLP)
  • OpenAI's tiktoken repository for the reference regex patterns
  • The Performance chapter for how wordchipper optimizes BPE encoding

Pretrained Models

wordchipper ships with loaders for OpenAI's public BPE vocabularies. Each model defines three things: a regex pattern for pre-tokenization, a merge table (the vocabulary), and a set of special tokens.

Model overview

Model          Vocab tokens  Pattern         Special tokens                                        Used by
r50k_base      ~50k          GPT-2 pattern   <|endoftext|>                                         GPT-2
p50k_base      ~50k          GPT-2 pattern   <|endoftext|>                                         Codex
p50k_edit      ~50k          GPT-2 pattern   endoftext, fim_prefix/middle/suffix                   Codex (edit)
cl100k_base    ~100k         cl100k pattern  endoftext, fim_prefix/middle/suffix, endofprompt      GPT-3.5, GPT-4
o200k_base     ~200k         o200k pattern   endoftext, endofprompt                                GPT-4o
o200k_harmony  ~200k         o200k pattern   endoftext, endofprompt, startoftext, + many reserved  GPT-4o (harmony)

What changed between models

r50k / p50k (GPT-2 era). The simplest pattern. Contractions ('s, 't, etc.) are matched literally. Words and numbers are preceded by an optional space. Case-sensitive.

cl100k (GPT-3.5 / GPT-4). Case-insensitive contractions. Words can be preceded by a non-letter, non-newline character (allowing punctuation to attach). Numbers limited to 3 digits at a time. Newlines handled explicitly.

o200k (GPT-4o). Much more sophisticated word patterns. Recognizes casing transitions (CamelCase splits). Uses Unicode general categories (\p{Lu}, \p{Ll}, \p{Lt}, \p{Lm}, \p{Lo}, \p{M}) for precise letter classification. Contractions are appended to words rather than matched separately.

Shared vocabularies

Some models share the same vocabulary file but differ in special tokens:

  • p50k_edit uses the same vocab as p50k_base but adds FIM (fill-in-middle) tokens.
  • o200k_harmony uses the same vocab as o200k_base but adds many reserved and named special tokens for structured generation.

Loading a vocabulary

The standard way to load a model:

#![allow(unused)]
fn main() {
use wordchipper::{load_vocab, disk_cache::WordchipperDiskCache};

let mut cache = WordchipperDiskCache::default();
let (desc, vocab) = load_vocab("openai:cl100k_base", &mut cache).unwrap();
}

load_vocab returns a (VocabDescription, Arc<UnifiedTokenVocab<u32>>). The description contains metadata; the vocab contains everything needed for encoding and decoding.

Short names

The openai:: prefix is optional. Both "cl100k_base" and "openai:cl100k_base" work. Use list_vocabs() for all registered short names and list_models() for fully qualified names.

Loading from a file path

If you have a .tiktoken file on disk, you can skip the download:

#![allow(unused)]
fn main() {
use wordchipper::pretrained::openai::OATokenizer;

let vocab = OATokenizer::Cl100kBase
    .load_path::<u32>("/path/to/cl100k_base.tiktoken")
    .unwrap();
}

Loading from a reader

For maximum flexibility (e.g., loading from an in-memory buffer or a network stream):

#![allow(unused)]
fn main() {
use std::io::BufReader;
use wordchipper::pretrained::openai::OATokenizer;

let data: &[u8] = b"..."; // tiktoken base64 format
let reader = BufReader::new(data);
let vocab = OATokenizer::O200kBase.read_vocab::<u32, _>(reader).unwrap();
}

Special tokens

Special tokens are strings with reserved token IDs that are never produced by BPE encoding. They're used for control flow: marking end-of-text, fill-in-middle boundaries, prompt boundaries, etc.

#![allow(unused)]
fn main() {
use wordchipper::{load_vocab, disk_cache::WordchipperDiskCache, TokenizerOptions, TokenEncoder};
let mut cache = WordchipperDiskCache::default();
let (_, vocab) = load_vocab("openai:cl100k_base", &mut cache).unwrap();
let tok = TokenizerOptions::default().build(vocab);
let specials = tok.special_vocab();
for (bytes, &id) in specials.span_map().iter() {
    let name = String::from_utf8_lossy(bytes);
    println!("{} -> {}", name, id);
}
}

For cl100k_base, this prints:

<|endoftext|> -> 100257
<|fim_prefix|> -> 100258
<|fim_middle|> -> 100259
<|fim_suffix|> -> 100260
<|endofprompt|> -> 100276

The OATokenizer enum

For programmatic access to all OpenAI models, use the OATokenizer enum:

#![allow(unused)]
fn main() {
use wordchipper::pretrained::openai::OATokenizer;

// Iterate over all models
#[cfg(feature = "std")]
{
use strum::IntoEnumIterator;
for model in OATokenizer::iter() {
    println!("{}", model);
}
}
}

Each variant provides:

  • pattern() - the regex pattern for pre-tokenization
  • special_tokens::<T>() - the special token list
  • load_vocab::<T>(loader) - load the vocabulary with download support
  • load_path::<T>(path) - load from a local file

The tiktoken format

OpenAI's vocabulary files use a simple base64 format. Each line contains a base64-encoded byte sequence and its integer token ID, separated by a space:

IQ== 0
Ig== 1
Iw== 2
...

The base64 decodes to the raw bytes that the token represents. This format is what WordchipperDiskCache downloads and caches from OpenAI's CDN.
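Parsing the format takes only a few lines. The sketch below hand-rolls a standard-alphabet base64 decoder to stay dependency-free; real code would use a base64 crate or the library's own loader:

```rust
/// Minimal standard-alphabet base64 decoder (no padding validation),
/// included only to keep this sketch dependency-free.
fn b64_decode(s: &str) -> Option<Vec<u8>> {
    const ALPHABET: &[u8; 64] =
        b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    let (mut buf, mut bits, mut out) = (0u32, 0u32, Vec::new());
    for &c in s.as_bytes() {
        if c == b'=' {
            break; // padding: remaining bits are discarded
        }
        let v = ALPHABET.iter().position(|&a| a == c)? as u32;
        buf = (buf << 6) | v; // accumulate 6 bits per character
        bits += 6;
        if bits >= 8 {
            bits -= 8;
            out.push((buf >> bits) as u8); // emit a complete byte
            buf &= (1 << bits) - 1;        // keep only the leftover bits
        }
    }
    Some(out)
}

/// Parse one `base64(byte_sequence) token_id` line of a .tiktoken file.
fn parse_line(line: &str) -> Option<(Vec<u8>, u32)> {
    let (b64, id) = line.split_once(' ')?;
    Some((b64_decode(b64)?, id.parse().ok()?))
}

fn main() {
    // The first sample line above: token 0 is the single byte '!'.
    assert_eq!(parse_line("IQ== 0"), Some((vec![b'!'], 0)));
}
```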

Choosing a model

If you're building a tool that interacts with an OpenAI model, use the matching tokenizer:

If you use...              Load...
GPT-4o, GPT-4o-mini        o200k_base
GPT-4, GPT-3.5-turbo       cl100k_base
GPT-3 (text-davinci-003)   p50k_base
GPT-2                      r50k_base

If you're building your own model or just need token counting, o200k_base has the largest vocabulary and handles the widest range of languages efficiently.

Training Your Own Tokenizer

This chapter walks through training a BPE tokenizer from scratch. If you haven't read How Tokenizers Work yet, start there. This chapter assumes you know what BPE merges are and why pre-tokenization matters.

Why train a custom tokenizer?

Pretrained tokenizers like cl100k_base and o200k_base are trained on broad internet text. They work well for general English, but they have trade-offs:

  • Domain-specific text. If your corpus is primarily code, legal documents, or scientific notation, a general-purpose tokenizer wastes vocabulary slots on tokens your data rarely uses. A tokenizer trained on your data will encode it in fewer tokens.
  • Non-English languages. General tokenizers are biased toward English. The same sentence in Hindi or Korean may produce 3-5x more tokens. Training on your target language fixes this.
  • Smaller vocabularies. Pretrained models use 50k-200k tokens. If you're building a small model or running on constrained hardware, a 4k or 8k vocabulary trained on your data can be more efficient than a 100k vocabulary trained on someone else's.
  • Research. You want to study how vocabulary size, training data, or regex pattern affect downstream model performance.

The short version: train your own when the pretrained vocabulary doesn't fit your data well.

Quick start: the CLI

The fastest way to train a tokenizer is with wchipper. No Rust code needed.

Training on text files

Create a text file with your training data (one document per line, or free-form text):

cargo run --release -p wchipper -- \
    train \
    --vocab-size=8000 \
    --output=my_tokenizer.tiktoken \
    corpus.txt

This reads corpus.txt, splits it using the default regex pattern (r50k_base's pattern), runs BPE training to learn 8000 - 256 = 7744 merges, and writes the vocabulary to my_tokenizer.tiktoken in base64-encoded tiktoken format.

Training on Parquet files

For larger datasets, Parquet is more efficient. The file must have a column named "text":

cargo run --release -p wchipper -- \
    train \
    --input-format=parquet \
    --vocab-size=50281 \
    --output=nanochat.tiktoken \
    ~/Data/nanochat/dataset/*.parquet

Parquet files are read in streaming record batches, so memory usage stays bounded even for multi-gigabyte datasets.

What the output looks like

Training logs progress at 1% intervals:

INFO Reading shards:
INFO 0: shard_00000.parquet
INFO 1: shard_00001.parquet
INFO Training Tokenizer...
INFO Starting BPE training: 50025 merges to compute
INFO Building pair index...
INFO Building heap with 16044 unique pairs
INFO Starting merge loop
INFO Progress: 1% (501/50025 merges) - Last merge: (69, 120) -> 756 (frequency: 166814)
INFO Progress: 2% (1001/50025 merges) - Last merge: (66, 77) -> 1256 (frequency: 17847)
...
INFO Progress: 100% (50025/50025 merges) - Last merge: (302, 3405) -> 50280 (frequency: 18)
INFO Finished training: 50025 merges completed
INFO Vocabulary Size: 50280

The early merges have high frequencies (166k+). These are the most common byte pairs in your corpus, things like (space, t) or (e, space). By the end, merges are rare (frequency 18), scraping the bottom of what's statistically useful.

The training pipeline

Training has three phases. Understanding them helps you make better choices about regex pattern, vocabulary size, and data preparation.

Phase 1: counting spans

Before any merges happen, the trainer splits every input document using the regex pattern, then counts how many times each unique span appears. This is the same pre-tokenization step that happens during encoding, but here it builds a frequency table.

Input: "the cat sat on the mat"

Regex splits: ["the", " cat", " sat", " on", " the", " mat"]

Span counts:
  "the"  -> 1
  " the" -> 1
  " cat" -> 1
  " sat" -> 1
  " on"  -> 1
  " mat" -> 1

With more data, common spans like " the" accumulate large counts. You can call update_from_samples() repeatedly, feeding data in batches so the full corpus never has to fit in memory at once.

Internally, this is handled by TextSpanCounter, which stores span strings as CompactString keys (stack-allocated for short strings, heap-allocated when longer) with u32 counts.
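The counting phase can be sketched with a plain HashMap. The split_spans helper below is a hypothetical stand-in for the real regex pre-tokenizer; it only attaches leading spaces to words:

```rust
use std::collections::HashMap;

/// Stand-in pre-tokenizer: split on spaces, attaching the leading space
/// to every word after the first (roughly what the real regex does for
/// plain words; it handles much more).
fn split_spans(text: &str) -> Vec<String> {
    let mut spans = Vec::new();
    for (i, word) in text.split(' ').enumerate() {
        if word.is_empty() {
            continue;
        }
        if i == 0 {
            spans.push(word.to_string());
        } else {
            spans.push(format!(" {}", word));
        }
    }
    spans
}

/// Accumulate span frequencies across samples, as Phase 1 does.
fn count_spans(counts: &mut HashMap<String, u32>, sample: &str) {
    for span in split_spans(sample) {
        *counts.entry(span).or_insert(0) += 1;
    }
}

fn main() {
    let mut counts = HashMap::new();
    count_spans(&mut counts, "the cat sat on the mat");
    assert_eq!(counts["the"], 1);  // first word: no leading space
    assert_eq!(counts[" the"], 1); // later occurrence keeps its space
    assert_eq!(counts.len(), 6);   // six unique spans, as shown above
}
```

Calling count_spans once per batch gives the same incremental accumulation that update_from_samples provides.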

Phase 2: building the pair index

Once all data is counted, the trainer converts each unique span into a TokenSpanBuf: a buffer of byte-level token IDs. Every span starts as its raw UTF-8 bytes, mapped through a ByteMapVocab (the initial 256-entry vocabulary where token 0 = byte 0, token 1 = byte 1, etc.).

Then it scans every span to build two structures:

  • Pair counts: for every adjacent pair (a, b), the total count across all spans (weighted by span frequency). If " the" appears 50,000 times and contains the pair (116, 104) (t, h), that pair gets +50,000.
  • Pair index: for every pair, which span indices contain it. This avoids scanning all spans when a merge only affects a few.

These go into a PairSpanIndex, then seed an octonary heap (an 8-ary heap, faster than a binary heap for this workload) sorted by count descending.

Phase 3: the merge loop

This is the core of BPE training. It runs until the vocabulary reaches the target size:

while merges_done < num_merges:
    1. Pop the highest-count pair from the heap.
    2. If the count is stale (changed since it was pushed), refresh and re-push.
    3. Allocate a new token ID for this merge.
    4. For every span containing this pair:
       a. Replace all occurrences of (a, b) with the new token.
       b. Update pair counts: decrement removed pairs, increment new pairs.
       c. Track which spans now contain newly created pairs.
    5. Push new pairs onto the heap.

The "lazy refresh" in step 2 is important. When pair (a, b) is merged, it changes the neighbors of a and b in every span where they appeared. This can reduce the count of other pairs in the heap that haven't been popped yet. Rather than updating every affected heap entry (expensive), the trainer just checks the current count when a pair is popped. If it's stale, it's re-pushed with the correct count and the loop continues. This is a standard trick in BPE implementations.
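The lazy refresh can be sketched with std's BinaryHeap, keeping the true counts in a separate map. This is an illustrative shape, not wordchipper-training's actual types; the real trainer uses an 8-ary heap:

```rust
use std::collections::{BinaryHeap, HashMap};

/// Pop the pair with the highest *current* count. Stale heap entries are
/// re-pushed with their real count instead of being updated in place.
fn pop_best(
    heap: &mut BinaryHeap<(u32, (u32, u32))>,
    counts: &HashMap<(u32, u32), u32>,
) -> Option<(u32, u32)> {
    while let Some((heap_count, pair)) = heap.pop() {
        let current = *counts.get(&pair).unwrap_or(&0);
        if current == 0 {
            continue; // pair no longer exists anywhere
        }
        if current == heap_count {
            return Some(pair); // fresh entry: this is the true max
        }
        heap.push((current, pair)); // stale: re-push and keep going
    }
    None
}

fn main() {
    let mut counts = HashMap::from([((1, 2), 5u32), ((2, 3), 9)]);
    let mut heap: BinaryHeap<_> = counts.iter().map(|(&p, &c)| (c, p)).collect();

    // A merge elsewhere reduced (2, 3)'s count after it was pushed.
    counts.insert((2, 3), 4);

    assert_eq!(pop_best(&mut heap, &counts), Some((1, 2))); // 5 beats 4
    assert_eq!(pop_best(&mut heap, &counts), Some((2, 3)));
}
```

The first pop_best call pops the stale (9, (2, 3)) entry, notices its count is now 4, re-pushes it, and correctly returns (1, 2) instead.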

A concrete merge example

Given the span "hello" (tokens: [104, 101, 108, 108, 111]) and the merge (108, 108) -> 256 (merging l, l):

Before: [104, 101, 108, 108, 111]
         h    e    l    l    o

Pairs removed: (101, 108), (108, 108), (108, 111)
Pairs added:   (101, 256), (256, 111)

After:  [104, 101, 256, 111]
         h    e    ll   o

The TokenSpanBuf::merge_pair_cb method handles this in a single pass through the span, reporting pair deltas through a callback so the trainer can update global counts incrementally.
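The shape of such a merge-with-deltas step can be written as a standalone sketch. For clarity, this version diffs pair counts before and after the merge rather than computing the deltas in a single pass:

```rust
use std::collections::HashMap;

/// Merge every occurrence of `pair` into `new_tok`, then report how the
/// span's neighbor-pair counts changed via `on_delta`
/// (negative = removed, positive = added).
fn merge_pair_cb(
    span: &mut Vec<u32>,
    pair: (u32, u32),
    new_tok: u32,
    mut on_delta: impl FnMut((u32, u32), i64),
) {
    let mut deltas: HashMap<(u32, u32), i64> = HashMap::new();
    for w in span.windows(2) {
        *deltas.entry((w[0], w[1])).or_insert(0) -= 1; // old pairs removed
    }
    let mut out = Vec::with_capacity(span.len());
    let mut i = 0;
    while i < span.len() {
        if i + 1 < span.len() && (span[i], span[i + 1]) == pair {
            out.push(new_tok);
            i += 2;
        } else {
            out.push(span[i]);
            i += 1;
        }
    }
    for w in out.windows(2) {
        *deltas.entry((w[0], w[1])).or_insert(0) += 1; // new pairs added
    }
    *span = out;
    for (p, d) in deltas {
        if d != 0 {
            on_delta(p, d);
        }
    }
}

fn main() {
    // "hello" = [104, 101, 108, 108, 111]; merge (l, l) -> 256.
    let mut span = vec![104, 101, 108, 108, 111];
    let mut log = Vec::new();
    merge_pair_cb(&mut span, (108, 108), 256, |p, d| log.push((p, d)));
    assert_eq!(span, vec![104, 101, 256, 111]);
    log.sort();
    assert_eq!(log, vec![
        ((101, 108), -1), // (e, l) removed
        ((101, 256), 1),  // (e, ll) added
        ((108, 108), -1), // (l, l) removed
        ((108, 111), -1), // (l, o) removed
        ((256, 111), 1),  // (ll, o) added
    ]);
}
```

The reported deltas match the "pairs removed / pairs added" lists in the example above; the trainer applies them to its global counts.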

Training via the Rust API

For more control, use wordchipper-training directly. Add it to your Cargo.toml:

[dependencies]
wordchipper = "0.8"
wordchipper-training = "0.8"

Minimal example

use std::sync::Arc;

use wordchipper::{
    TokenEncoder, TokenDecoder,
    Tokenizer, TokenizerOptions, UnifiedTokenVocab,
    pretrained::openai::OA_CL100K_BASE_PATTERN,
    vocab::ByteMapVocab,
};
use wordchipper_training::BPETRainerOptions;

fn main() {
    // 1. Configure the trainer
    let options = BPETRainerOptions::new(
        OA_CL100K_BASE_PATTERN,  // regex pattern for pre-tokenization
        1000,                     // target vocabulary size
    );
    let mut trainer = options.init();

    // 2. Feed it training data
    let samples = vec![
        "the cat sat on the mat",
        "the dog sat on the log",
        "the cat and the dog",
    ];
    trainer.update_from_samples(samples.iter());

    // 3. Train
    let byte_vocab: ByteMapVocab<u32> = Default::default();
    let vocab: Arc<UnifiedTokenVocab<u32>> = trainer
        .train(byte_vocab)
        .expect("training failed")
        .into();

    // 4. Use the tokenizer
    let tokenizer = TokenizerOptions::default().build(vocab);

    let tokens = tokenizer.try_encode("the cat").unwrap();
    let decoded = tokenizer.try_decode_to_string(&tokens).unwrap().unwrap();
    assert_eq!(decoded, "the cat");

    println!("'the cat' -> {:?}", tokens);
}

Feeding data in batches

For large datasets, call update_from_samples multiple times. The trainer accumulates span counts across calls:

#![allow(unused)]
fn main() {
use wordchipper::pretrained::openai::OA_CL100K_BASE_PATTERN;
use wordchipper_training::BPETRainerOptions;
let mut trainer = BPETRainerOptions::new(OA_CL100K_BASE_PATTERN, 1000).init();
// First batch
let batch1 = vec!["hello world", "hello there"];
trainer.update_from_samples(batch1.iter());

// Second batch
let batch2 = vec!["world peace", "hello again"];
trainer.update_from_samples(batch2.iter());

// Counts from both batches are combined before training
}

This is how the CLI processes Parquet files: each record batch is fed as a separate call to update_from_samples.

Saving the vocabulary

The trained vocabulary can be saved in tiktoken's base64 format:

use std::sync::Arc;

use wordchipper::{
    UnifiedTokenVocab, VocabIndex,
    pretrained::openai::OA_CL100K_BASE_PATTERN,
    vocab::{ByteMapVocab, io::save_base64_span_map_path},
};
use wordchipper_training::BPETRainerOptions;

fn main() {
    let mut trainer = BPETRainerOptions::new(OA_CL100K_BASE_PATTERN, 1000).init();
    trainer.update_from_samples(vec!["hello world"].iter());

    let vocab: Arc<UnifiedTokenVocab<u32>> = trainer
        .train(ByteMapVocab::default())
        .expect("training failed")
        .into();

    // Save as .tiktoken file
    save_base64_span_map_path(
        &vocab.span_vocab().span_map(),
        "my_vocab.tiktoken",
    ).expect("failed to save");
}

This produces a file compatible with OpenAI's tiktoken format: one line per token, each line is base64(byte_sequence) token_id.

Choosing a regex pattern

The regex pattern controls how text is split before BPE runs. This is one of the most consequential decisions in tokenizer design. The pattern determines what kinds of tokens can exist.

Available patterns

wordchipper provides the same patterns used by OpenAI's models as constants:

Constant                Used by         Key behavior
OA_R50K_BASE_PATTERN    GPT-2           Basic word/number/punctuation splitting
OA_CL100K_BASE_PATTERN  GPT-3.5, GPT-4  Adds case-insensitive contractions
OA_O200K_BASE_PATTERN   GPT-4o          Unicode-aware, broader letter categories

All patterns live in wordchipper::pretrained::openai.

What the pattern affects

Consider the difference between splitting on \w+ versus the cl100k pattern:

Input: "don't stop"

\w+ splits:          ["don", "t", "stop"]
cl100k pattern:      ["don", "'t", " stop"]

The cl100k pattern keeps the contraction 't together and attaches the leading space to stop. These design choices propagate through the entire vocabulary: a tokenizer trained with \w+ will never learn a token for 't or " stop" (with its leading space) because the regex never produces those spans.

Custom patterns

You can use any valid regex:

#![allow(unused)]
fn main() {
use wordchipper_training::BPETRainerOptions;
// Split only on whitespace boundaries
let options = BPETRainerOptions::new(r"\S+|\s+", 4000);

// Split on individual characters (character-level BPE)
let options = BPETRainerOptions::new(r".", 1000);

// Domain-specific: keep email addresses and URLs intact
let options = BPETRainerOptions::new(
    r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}|https?://\S+|\w+|\s+|.",
    8000,
);
}

As a starting point, use OA_CL100K_BASE_PATTERN. It's well-tested and handles contractions, whitespace, and Unicode reasonably. Change it only when you have a specific reason.

Choosing a vocabulary size

The vocabulary size is the total number of tokens, including the 256 byte tokens. So --vocab-size=1000 means 256 byte tokens + 744 learned merges.

Trade-offs

| Vocab size | Merges | Effect |
|---|---|---|
| 256 | 0 | Pure byte-level. Every character is 1-4 tokens. Long sequences. |
| 1,000 | 744 | Common bigrams and short words are single tokens. |
| 8,000 | 7,744 | Most common words are single tokens. Good for small models. |
| 50,000 | 49,744 | Covers most English words. Standard for GPT-2 era models. |
| 100,000 | 99,744 | Covers English + common multilingual tokens. GPT-4 range. |
| 200,000 | 199,744 | Broad multilingual coverage. GPT-4o range. |

Larger vocabularies mean shorter token sequences (faster inference, more text per context window) but larger embedding tables (more parameters, more memory). For a small model on a specific domain, 4k-16k is often a good range. For a general-purpose model, 50k-200k.
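To make the embedding-table cost concrete, here is a back-of-envelope sketch. The d_model value and f32 parameters are illustrative assumptions, not wordchipper values:

```rust
// Bytes needed for an embedding table: one f32 vector per token.
fn embedding_bytes(vocab_size: usize, d_model: usize) -> usize {
    vocab_size * d_model * 4 // 4 bytes per f32 parameter
}

fn main() {
    // Growing the vocabulary from 8k to 200k at d_model = 768 multiplies
    // the table size by 25x.
    assert_eq!(embedding_bytes(8_000, 768), 24_576_000); // ~25 MB
    assert_eq!(embedding_bytes(200_000, 768), 614_400_000); // ~614 MB
}
```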

How merges tail off

Early merges capture high-frequency patterns like common letter pairs and short words. Late merges capture rare combinations. You can see this in the training output: the first merge might have frequency 166,000 while the last merge has frequency 18.

If your late merges have very low frequencies (single digits), you've likely exhausted the useful patterns in your data. A smaller vocabulary size would waste less of the embedding table on tokens that rarely appear.

Performance expectations

Training speed depends on two factors:

  1. Corpus size. Counting spans is O(n) in the input size. Expect roughly 1 second per 10 MB of input text for the counting phase.
  2. Vocabulary size. The merge loop runs one iteration per merge (vocab_size - 256 iterations). Each iteration touches only the spans affected by that merge, but the total cost grows with both vocabulary size and corpus diversity.
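The iteration count follows directly from the vocabulary size, since the 256 byte tokens are built in rather than learned. A one-line sketch:

```rust
// Merges learned for a target vocabulary: 256 byte tokens are built in.
fn merge_iterations(vocab_size: usize) -> usize {
    vocab_size - 256
}

fn main() {
    assert_eq!(merge_iterations(1_000), 744);
    assert_eq!(merge_iterations(50_000), 49_744);
}
```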

For reference, training a 50k vocabulary on the nanochat dataset (8 Parquet shards, roughly 80 MB of text) takes around 80 CPU-minutes on a single thread. The counting phase is fast; the merge loop dominates.

The trainer is single-threaded. If your data loading is slow (e.g., reading from disk or network), run it on a separate thread so the trainer isn't blocked on I/O. The CLI handles this for Parquet by streaming record batches.
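One way to decouple loading from training is a channel between a loader thread and the training loop. This is a generic std-threads sketch, not the CLI's actual implementation; the trainer.update_from_samples call is shown commented where it would go:

```rust
use std::sync::mpsc;
use std::thread;

fn main() {
    // Loader thread reads batches (here: synthesized) and sends them on.
    let (tx, rx) = mpsc::channel::<Vec<String>>();
    let loader = thread::spawn(move || {
        for i in 0..3 {
            // Stand-in for reading one record batch from disk or network.
            tx.send(vec![format!("sample text {i}")]).unwrap();
        }
        // Dropping tx closes the channel, ending the receive loop below.
    });

    // Training loop consumes batches as they arrive, never blocked on I/O.
    let mut batches = 0;
    for batch in rx {
        // trainer.update_from_samples(batch.iter());
        batches += batch.len();
    }
    loader.join().unwrap();
    assert_eq!(batches, 3);
}
```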

Putting it all together

Here's a complete workflow: train a tokenizer on your data, save it, and use it for encoding.

use std::sync::Arc;

use wordchipper::{
    TokenEncoder, TokenDecoder,
    Tokenizer, TokenizerOptions, UnifiedTokenVocab, VocabIndex,
    pretrained::openai::OA_CL100K_BASE_PATTERN,
    vocab::{ByteMapVocab, io::save_base64_span_map_path},
};
use wordchipper_training::BPETRainerOptions;

fn main() -> wordchipper::WCResult<()> {
    // --- Train ---
    let mut trainer = BPETRainerOptions::new(OA_CL100K_BASE_PATTERN, 2000).init();

    // In practice, feed much more data
    let corpus = vec![
        "The quick brown fox jumps over the lazy dog.",
        "Pack my box with five dozen liquor jugs.",
        "How vexingly quick daft zebras jump!",
    ];
    trainer.update_from_samples(corpus.iter());

    let vocab: Arc<UnifiedTokenVocab<u32>> = trainer
        .train(ByteMapVocab::default())
        .expect("training failed")
        .into();

    // --- Save ---
    save_base64_span_map_path(
        &vocab.span_vocab().span_map(),
        "/tmp/my_vocab.tiktoken",
    )?;

    // --- Use ---
    let tokenizer = TokenizerOptions::default().build(vocab);
    let tokens = tokenizer.try_encode("The quick brown fox")?;
    let decoded = tokenizer.try_decode_to_string(&tokens)?.unwrap();
    assert_eq!(decoded, "The quick brown fox");

    println!("Encoded {} tokens: {:?}", tokens.len(), tokens);

    Ok(())
}

What's next

Once you have a trained tokenizer, you might want to:

  • Benchmark it against pretrained tokenizers. See Performance for how wordchipper measures throughput.
  • Build a custom DFA lexer for your regex pattern. See Building Custom Logos Lexers for 30-50x faster pre-tokenization.
  • Understand the BPE encoding algorithms available. See Advanced: Span Encoders for the different encoder strategies.

Performance

wordchipper is designed for throughput. This chapter covers the knobs you can turn: BPE algorithm selection, DFA-accelerated pre-tokenization, parallelism, and hash function choice.

The two bottlenecks

Tokenization has two phases, and each can be the bottleneck:

  1. Pre-tokenization (spanning). Splitting text into spans using a regex. This is memory-bound: the regex engine must scan every byte of the input.
  2. BPE encoding. Merging byte pairs within each span. This is compute-bound: each span requires iterative pair lookups and merges.

For short texts, BPE dominates. For long texts with many spans, pre-tokenization matters more. wordchipper optimizes both.

DFA-accelerated spanning (logos)

The biggest single optimization is replacing the regex engine with a compile-time DFA. wordchipper uses the logos crate to compile pre-tokenization patterns into deterministic finite automata at build time.

This is enabled by default. When you load a vocabulary whose regex pattern matches a known logos lexer (cl100k, o200k, r50k), the DFA lexer is used automatically.

Benchmarks

Spanning throughput on a single thread:

| Model | Regex | Logos DFA | Speedup |
|---|---|---|---|
| cl100k_base | ~25 MB/s | ~732 MB/s | 29x |
| o200k_base | ~15 MB/s | ~765 MB/s | 52x |

Disabling DFA acceleration

If you want to force regex-based spanning (e.g., for testing or debugging):

#![allow(unused)]
fn main() {
use wordchipper::{TokenizerOptions, load_vocab, disk_cache::WordchipperDiskCache};
let mut cache = WordchipperDiskCache::default();
let (_, vocab) = load_vocab("openai:cl100k_base", &mut cache).unwrap();
let tok = TokenizerOptions::default()
    .with_accelerated_lexers(false)
    .build(vocab);
}

BPE algorithm selection

wordchipper includes five span encoder implementations. Each trades off differently between single-threaded speed, concurrent access, memory usage, and implementation complexity. See Advanced: Span Encoders for a deep dive into how each works.

The algorithms

| Algorithm | Best for | Notes |
|---|---|---|
| MergeHeap | Concurrent / multi-threaded | Default for ConcurrentDefault. Heap-based merge tracking. |
| PriorityMerge | Single-threaded | Default for SingleThreadDefault. Priority-queue merging. |
| BufferSweep | Testing / reference | Simple and correct, not optimized. |
| TailSweep | Memory-constrained | Alternative scanning strategy. |
| BpeBacktrack | Exact BPE semantics | O(n) via Aho-Corasick automaton + backtracking. |

Selecting an algorithm

#![allow(unused)]
fn main() {
use wordchipper::{
    TokenizerOptions, TokenEncoderOptions,
    encoders::token_span_encoder::span_encoders::SpanEncoderSelector,
    load_vocab, disk_cache::WordchipperDiskCache,
};

let mut cache = WordchipperDiskCache::default();
let (_, vocab) = load_vocab("openai:cl100k_base", &mut cache).unwrap();

let mut opts = TokenizerOptions::default();
opts.encoder.set_span_encoder(SpanEncoderSelector::BpeBacktrack);
let tok = opts.build(vocab);
}

Which algorithm should I use?

For most users, the defaults are correct:

  • Multi-threaded workloads (web servers, batch processing): Use the default ConcurrentDefault, which selects MergeHeap.
  • Single-threaded workloads (CLI tools, embedded): Use SingleThreadDefault, which selects PriorityMerge.
  • Exact BPE needed: Use BpeBacktrack. It uses an Aho-Corasick automaton pre-built from the vocabulary for O(n) encoding that exactly matches the theoretical BPE algorithm. The other algorithms also produce correct output for standard vocabularies; BpeBacktrack is the theoretical gold standard.

Parallelism with rayon

When the parallel feature is enabled (it is by default), batch operations parallelize across threads:

#![allow(unused)]
fn main() {
use wordchipper::{TokenizerOptions, TokenEncoder, load_vocab, disk_cache::WordchipperDiskCache};
let mut cache = WordchipperDiskCache::default();
let (_, vocab) = load_vocab("openai:cl100k_base", &mut cache).unwrap();
let tok = TokenizerOptions::default()
    .with_parallel(true)
    .build(vocab);

let texts: Vec<&str> = vec!["hello"; 1000];
let batch = tok.try_encode_batch(&texts).unwrap();
}

Setting parallel(true) affects both encoding and decoding. The concurrent(true) option additionally selects a span encoder optimized for concurrent access from multiple threads.

Controlling thread count

wordchipper uses rayon's global thread pool. Control it with the RAYON_NUM_THREADS environment variable:

RAYON_NUM_THREADS=8 cargo run --release

Or configure it programmatically via rayon::ThreadPoolBuilder.

Hash function selection

wordchipper uses hash maps extensively for vocabulary lookups. The fast-hash feature (enabled by default) swaps in foldhash for faster hashing:

| Feature | Hash function | Notes |
|---|---|---|
| fast-hash | foldhash | Good general-purpose, works in no_std |
| (none) | default | SipHash (std) or hashbrown default |

Unlike previous versions, fast-hash works in no_std environments. See Feature Flags for details.

End-to-end benchmarks

Encode + decode throughput on 90 MB shards, 48 threads:

| Model | wordchipper | tiktoken-rs | HuggingFace tokenizers |
|---|---|---|---|
| r50k_base | 239 MiB/s | 169 MiB/s | 22 MiB/s |
| p50k_base | 251 MiB/s | 163 MiB/s | 22 MiB/s |
| p50k_edit | 242 MiB/s | 170 MiB/s | 21 MiB/s |
| cl100k_base | 214 MiB/s | 125 MiB/s | 22 MiB/s |
| o200k_base | 119 MiB/s | 124 MiB/s | 22 MiB/s |
| o200k_harmony | 122 MiB/s | 122 MiB/s | 22 MiB/s |

Running benchmarks yourself

The sample-timer tool runs wordchipper vs. tiktoken-rs side-by-side:

RAYON_NUM_THREADS=48 cargo run --release -p sample-timer -- \
    --dataset-dir $DATASET_DIR --shards 0 --model openai::cl100k_base

For criterion-based microbenchmarks:

cargo bench -p wordchipper-bench

Performance tips

  1. Use the default features. DFA acceleration and rayon parallelism (parallel) are both on by default.
  2. Batch your inputs. try_encode_batch is significantly faster than calling try_encode in a loop because it amortizes thread pool overhead.
  3. Reuse the tokenizer. Building a Tokenizer pre-computes data structures. Build it once, share via Arc.
  4. Match the encoder to your workload. Use ConcurrentDefault for multi-threaded, SingleThreadDefault for single-threaded.
  5. Profile spanning vs. encoding. If spanning is the bottleneck, make sure DFA acceleration is active. If encoding is the bottleneck, experiment with SpanEncoderSelector variants.

Building Custom Logos Lexers

This chapter assumes familiarity with the two-phase pipeline (spanning + BPE encoding) described in How Tokenizers Work.

BPE tokenizers split text in two phases: first into words (pre-tokenization), then each word into subword tokens. The first phase is typically a big regex. OpenAI's cl100k_base pattern, for example, uses alternations with Unicode property classes and lookaheads to segment "hello world" into ["hello", " world"].

Regex is correct but slow. Each match backtracks through Unicode property tables, and the engine runs single-threaded. The logos crate takes a different approach: it compiles regex patterns into a deterministic finite automaton (DFA) at build time via a derive macro. No backtracking, no runtime regex compilation. For wordchipper's cl100k and o200k patterns, this gives 30-50x speedups (700+ MB/s).

But logos DFA can't express everything a regex can. The OpenAI patterns use \s+(?!\S), a negative lookahead that backtracks so the last whitespace character becomes a prefix of the next word. Logos has no lookaheads. So we need a post-processing step that corrects the token stream after the DFA runs.

This post-processing is extracted into composable, public building blocks. You supply the patterns. We handle the rest.

When to use this

The building blocks in this chapter are designed for tokenizers that use OpenAI-style regex pre-tokenization with the \s+(?!\S) lookahead idiom. This includes:

  • cl100k_base (GPT-4, GPT-3.5-turbo)
  • o200k_base (GPT-4o)
  • p50k_base / p50k_edit (GPT-3, Codex)
  • r50k_base (GPT-2)
  • Any custom tokenizer that copies the OpenAI regex structure

Tokenizers with fundamentally different pre-tokenization don't need this machinery:

  • SentencePiece (Llama, Gemini) replaces spaces with ▁ (U+2581) before tokenization. No regex pre-tokenization.
  • Byte-level BPE (GPT-NeoX, Falcon) uses HuggingFace's ByteLevel pre-tokenizer.
  • Bert-style tokenizers split on whitespace and punctuation with simple rules, no lookaheads.

If your tokenizer's regex pattern doesn't use \s+(?!\S) or a similar whitespace-backtracking idiom, you can still use logos for DFA speed, but you won't need Gpt2FamilyTokenRole or for_each_classified_span. Just implement SpanLexer directly.

The whitespace problem, concretely

To understand why we need post-processing, consider what happens with the input "hello   world" (three spaces).

The regex \s+(?!\S) matches whitespace greedily, then backtracks one character. So "   " (three spaces) becomes "  " (two spaces as one token) + " world" (the last space merges into the next word). This is how OpenAI's patterns work: whitespace "leaks" into the following word.

Logos has no lookaheads. Its DFA matches "   " as a single Whitespace token and "world" as a separate Letters token. Without correction, you'd get different spans than the regex.

The post-processing engine fixes this. It buffers whitespace tokens and, when the next token arrives, decides how to split them. The rules depend on what kind of token follows:

  • A letter token (world): split off the last whitespace character and merge it with the word. Result: "  " + " world". Matches the regex.
  • A punctuation token (!): if the last whitespace character is an ASCII space, merge it with the punctuation (matching the " ?" prefix in patterns like " ?[^\s\p{L}\p{N}]+").
  • A digit or newline: emit the whitespace as its own span. No merging.

You don't need to implement any of this. You just tell the engine what kind of token each logos variant represents, and it applies the correct rule.
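As a toy model of the first rule (split before a letter token), a hedged sketch, not the library's implementation: everything but the last whitespace character stays together, and the last character joins the word.

```rust
// Toy model of the letter rule: buffered whitespace splits so its last
// character becomes a prefix of the following word.
fn split_before_word(ws: &str, word: &str) -> (String, String) {
    let cut = ws.char_indices().last().map_or(0, |(i, _)| i);
    (ws[..cut].to_string(), format!("{}{}", &ws[cut..], word))
}

fn main() {
    // Three spaces before "world": two stay together, one joins the word.
    let (head, tail) = split_before_word("   ", "world");
    assert_eq!(head, "  ");
    assert_eq!(tail, " world");
}
```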

The building blocks

Three public items in wordchipper::spanners::span_lexers::logos::gpt2_family:

Gpt2FamilyTokenRole

An enum that classifies how each logos token interacts with preceding whitespace:

#![allow(unused)]
fn main() {
pub enum Gpt2FamilyTokenRole {
    Whitespace,
    Punctuation,
    Word { check_contraction: bool, first_char_is_letter: bool },
    Standalone,
    Newline,
    Gap,
}
}

Each variant tells the engine a different whitespace rule:

  • Whitespace: this token is whitespace. The engine buffers it and decides later how to split it based on what comes next.
  • Punctuation: absorbs a preceding ASCII space. The engine merges the last buffered space character into this token's span (matching the " ?" regex prefix).
  • Word: absorbs a preceding space if first_char_is_letter is true. If the token starts with a non-letter prefix (like "hello, where the first char is "), set first_char_is_letter to false and the engine handles the prefix separately. The check_contraction field enables cl100k-style splitting where 'The becomes 'T + he.
  • Standalone: never absorbs whitespace. Digits, explicit contractions. Any preceding whitespace becomes its own span.
  • Newline: newline-containing whitespace (\s*[\r\n]+). Buffered separately; at end of string, adjacent Newline + Whitespace merge (matching the regex \s++$ behavior).
  • Gap: unrecognized bytes. Use this for logos Err(()).

Gpt2FamilyLogos trait

Maps a logos token enum to Gpt2FamilyTokenRole:

#![allow(unused)]
fn main() {
pub trait Gpt2FamilyLogos<'a>: Logos<'a> {
    fn family_role(&self) -> Gpt2FamilyTokenRole;
}
}

Implement this for your token enum and the engine handles all the post-processing.

contraction_split

An optional utility for cl100k-compatible lexers. Logos longest-match picks 'The as one token, but cl100k's regex first-match picks 'T (contraction) then he (letters). This function detects the contraction prefix and returns the split point.

Most custom lexers won't need this. Set check_contraction: false and ignore it.

Building a custom lexer: step by step

Let's build a lexer from scratch. We'll target a simplified pattern that handles letters, digits, punctuation, and whitespace.

Step 1: Define the logos token enum

#![allow(unused)]
fn main() {
use logos::Logos;

#[derive(Logos, Debug, PartialEq, Clone)]
enum MyToken {
    #[regex(r"\p{Letter}+")]
    Letters,

    #[regex(r"\p{Number}{1,3}")]
    Digits,

    #[regex(r" ?[^\s\p{Letter}\p{Number}]+[\r\n]*")]
    Punctuation,

    #[regex(r"\s*[\r\n]+")]
    Newline,

    #[regex(r"[^\S\r\n]+")]
    Whitespace,
}
}

Each variant maps to a regex fragment. Logos compiles all of them into a single DFA at build time.

Step 2: Implement Gpt2FamilyLogos

Map each token variant to its role. This is where you make design decisions: for each token type, ask "How should this interact with preceding whitespace?"

#![allow(unused)]
fn main() {
use wordchipper::spanners::span_lexers::logos::gpt2_family::{
    Gpt2FamilyLogos, Gpt2FamilyTokenRole,
};

impl Gpt2FamilyLogos<'_> for MyToken {
    fn family_role(&self) -> Gpt2FamilyTokenRole {
        match self {
            // Whitespace is buffered; last char may merge into next token.
            Self::Whitespace => Gpt2FamilyTokenRole::Whitespace,

            // Letters absorb a preceding space when the token starts with
            // a letter. No contraction splitting needed for our pattern.
            Self::Letters => Gpt2FamilyTokenRole::Word {
                check_contraction: false,
                first_char_is_letter: true,
            },

            // Punctuation absorbs a preceding ASCII space (the ` ?` prefix).
            Self::Punctuation => Gpt2FamilyTokenRole::Punctuation,

            // Newlines are buffered separately.
            Self::Newline => Gpt2FamilyTokenRole::Newline,

            // Digits stand alone. They never merge with preceding whitespace.
            Self::Digits => Gpt2FamilyTokenRole::Standalone,
        }
    }
}
}

The key insight: you don't need to understand the whitespace-splitting algorithm. You just need to know what each token is, and Gpt2FamilyTokenRole maps that to the correct behavior.

Step 3: Use the logos_lexer! macro

The logos_lexer! macro generates the SpanLexer impl and registers the lexer with inventory for automatic acceleration:

#![allow(unused)]
fn main() {
logos_lexer! {
    /// Logos DFA word scanner for my custom pattern.
    pub struct MyLexer;
    token = MyToken;
    pattern = MY_PATTERN;
}
}

This generates a struct that implements SpanLexer, which returns an iterator of word byte ranges. The macro also registers the lexer via inventory so it automatically replaces the regex spanner when the pattern matches.

If you need a manual SpanLexer impl instead (e.g. for a non-standard pattern), you can call logos_span_iter directly:

#![allow(unused)]
fn main() {
use wordchipper::spanners::span_lexers::{SpanLexer, logos::gpt2_family::logos_span_iter};

impl SpanLexer for MyLexer {
    fn find_span_iter<'a>(
        &'a self,
        text: &'a str,
    ) -> Box<dyn Iterator<Item = Range<usize>> + 'a> {
        Box::new(logos_span_iter(text, MyToken::lexer(text).spanned()))
    }
}
}

The real thing: cl100k in ~65 lines

The built-in Cl100kLexer follows exactly this pattern. The token enum has 7 variants. The Gpt2FamilyLogos impl is 15 lines. The SpanLexer impl is generated by logos_lexer!. You can read the full source at crates/wordchipper/src/spanners/span_lexers/logos/cl100k.rs.

Gpt2FamilyTokenRole reference

| Variant | Absorbs preceding whitespace? | Use for |
|---|---|---|
| Whitespace | N/A (is whitespace) | Horizontal whitespace tokens ([^\S\r\n]+) |
| Punctuation | Yes, ASCII space only | Punctuation with " ?" prefix ( ?[^\s\p{L}\p{N}]+) |
| Word { check_contraction, first_char_is_letter } | Yes, if first_char_is_letter | Letter/word tokens (\p{L}+, case-sensitive word patterns) |
| Standalone | No | Digits, explicit contractions, anything that stands alone |
| Newline | N/A (buffered separately) | Newline-containing whitespace (\s*[\r\n]+) |
| Gap | No | Unrecognized bytes (logos errors). Always use for Err(()) |

When in doubt, use Standalone. It's the safest default: the token is emitted as-is, and any preceding whitespace becomes its own span.

Testing your lexer

A logos lexer must produce identical span boundaries to the regex it replaces. If they diverge on any input, benchmark comparisons are invalid because the two code paths tokenize different spans.

Equivalence testing with lexer-equivalence

The lexer-equivalence crate in dev-crates/ provides exhaustive combinatorial testing. It generates all k-character strings (k=1..4) from a set of Unicode representative codepoints and compares the span output of each logos lexer against the regex reference.

The representative set is derived from the Unicode predicates used in the OpenAI patterns (\p{Lu}, \p{Ll}, \p{N}, \s, etc.), which partition Unicode into 22 equivalence cells. Two codepoints in the same cell are indistinguishable to every regex predicate, so testing one representative per cell covers the full Unicode space. The test suite checks ~732,540 inputs per lexer and runs in under 10 seconds:

cargo test -p lexer-equivalence

The validate_representative_completeness test programmatically verifies that the representative set covers every equivalence cell, so you can be confident the coverage is complete.

Adding tests for a custom lexer

To test a custom lexer against its regex, add a test that calls assert_k_tuple_equivalence from the lexer_equivalence::harness module with your lexer and the regex pattern it targets. See dev-crates/lexer-equivalence/tests/equivalence.rs for the pattern used by the built-in lexers.

Python & WASM Bindings

wordchipper has first-class bindings for Python and JavaScript/TypeScript. Both expose the same core operations: load a vocabulary, encode text to tokens, decode tokens back to text.

Python

Installation

pip install wordchipper

Basic usage

from wordchipper import Tokenizer

tok = Tokenizer.from_pretrained("cl100k_base")

# Encode and decode
tokens = tok.encode("hello world")       # [15339, 1917]
text = tok.decode(tokens)                 # "hello world"

# Batch operations (parallel via rayon)
results = tok.encode_batch(["hello", "world", "foo bar"])
texts = tok.decode_batch(results)

Vocabulary inspection

tok.vocab_size                             # 100256
tok.token_to_id("hello")                   # 15339
tok.id_to_token(15339)                     # "hello"
tok.token_to_id("nonexistent")             # None

# Special tokens
tok.get_special_tokens()
# [('<|endoftext|>', 100257), ('<|fim_prefix|>', 100258), ...]

Available models

Tokenizer.available_models()
# ['r50k_base',
#  'p50k_base',
#  'p50k_edit',
#  'cl100k_base',
#  'o200k_base',
#  'o200k_harmony']

Saving vocabularies

Export a vocabulary in tiktoken's base64 format:

tok.save_base64_vocab("vocab.tiktoken")

Compatibility wrappers

Drop-in replacements for tiktoken and HuggingFace tokenizers. Change one import line and the rest of your code stays the same.

tiktoken

# Before
import tiktoken
enc = tiktoken.get_encoding("cl100k_base")

# After
from wordchipper.compat import tiktoken
enc = tiktoken.get_encoding("cl100k_base")

tokens = enc.encode("hello world")
text = enc.decode(tokens)

# Model lookup works too
enc = tiktoken.encoding_for_model("gpt-4o")

The Encoding class exposes encode, encode_ordinary, encode_batch, decode, decode_batch, and properties name, n_vocab, max_token_value, eot_token, and special_tokens_set. Parameters accepted for API compatibility but not implemented (allowed_special, disallowed_special) raise NotImplementedError when set to non-default values.

HuggingFace tokenizers

# Before
from tokenizers import Tokenizer

# After
from wordchipper.compat.tokenizers import Tokenizer

tok = Tokenizer.from_pretrained("Xenova/gpt-4o")
output = tok.encode("hello world")
output.ids      # [24912, 2375]
output.tokens   # ["hello", " world"]
text = tok.decode(output.ids)

The Tokenizer class exposes encode, encode_batch, decode, decode_batch, get_vocab_size, token_to_id, and id_to_token. Known HuggingFace identifiers (e.g. Xenova/gpt-4o) are mapped automatically; bare encoding names like cl100k_base also work. Parameters accepted for API compatibility but not implemented (e.g. pair, is_pretokenized, add_special_tokens=False, skip_special_tokens=False) raise NotImplementedError when set to non-default values.

Building from source

Requires Rust and uv:

cd bindings/python
uv venv .venv && source .venv/bin/activate
uv pip install maturin pytest
maturin develop           # debug build
maturin develop --release # release build for benchmarks
pytest tests/ -v

JavaScript / TypeScript (WASM)

wordchipper compiles to WebAssembly and runs in browsers and Node.js. The WASM build uses default-features = false (no std, no parallelism, no file I/O), so all core tokenization works in the browser without a server.

Quick start

import { Tokenizer } from "./js/dist/index.js";

const tok = await Tokenizer.fromPretrained("o200k_base");

const tokens = tok.encode("hello world"); // Uint32Array [24912, 2375]
const text = tok.decode(tokens); // "hello world"

tok.free(); // release WASM memory when done

Loading

Two ways to load a tokenizer:

// Fetch from OpenAI's CDN (convenience)
const tok1 = await Tokenizer.fromPretrained("cl100k_base");

// Or from your own vocab bytes (no network request)
const data = new Uint8Array(/* .tiktoken file contents */);
const tok2 = await Tokenizer.fromVocabData("cl100k_base", data);

fromPretrained uses fetch() internally, so it works in both browser and Node.js 18+ environments.

Encode and decode

// Single
const tokens = tok.encode("hello world"); // Uint32Array
const text = tok.decode(tokens); // string

// Batch
const results = tok.encodeBatch(["hello", "world"]); // Uint32Array[]
const texts = tok.decodeBatch(results); // string[]

Vocabulary inspection

tok.vocabSize; // 100256
tok.maxToken; // 100255 (or null)
tok.tokenToId("hello"); // 15339 (or null)
tok.idToToken(15339); // "hello" (or null)
tok.getSpecialTokens(); // [["<|endoftext|>", 100257], ...]
Tokenizer.availableModels(); // ["r50k_base", "p50k_base", ...]

Memory management

WASM objects must be freed manually. Call tok.free() when you're done with a tokenizer to release its WASM memory.

Building from source

Requires Rust, wasm-pack, and Node.js:

# Build the WASM package
wasm-pack build bindings/wasm --target web

# Build the TypeScript wrapper
cd bindings/wasm/js
npm install
npm run build

Examples

Working examples are included in the repository:

  • Node.js: examples/wasm-node/ - Encode/decode from a Node.js script
  • Browser: examples/wasm-browser/ - In-browser tokenization with a simple HTML page
  • Live demo: Interactive Tokenizer Demo - Try it directly in this book

API comparison

All three produce identical token sequences for the same input and model.

Rust

  • Load: load_vocab("cl100k_base", &mut cache)
  • Encode: tok.try_encode(text)
  • Decode: tok.try_decode_to_string(&tokens)
  • Batch: tok.try_encode_batch(&texts)
  • Vocab size: tok.vocab().len()
  • Special tokens: tok.special_vocab().span_map()

Python

  • Load: Tokenizer.from_pretrained("cl100k_base")
  • Encode: tok.encode(text)
  • Decode: tok.decode(tokens)
  • Batch: tok.encode_batch(texts)
  • Vocab size: tok.vocab_size
  • Special tokens: tok.get_special_tokens()

JavaScript

  • Load: await Tokenizer.fromPretrained("cl100k_base")
  • Encode: tok.encode(text)
  • Decode: tok.decode(tokens)
  • Batch: tok.encodeBatch(texts)
  • Vocab size: tok.vocabSize
  • Special tokens: tok.getSpecialTokens()

no_std & Embedded

wordchipper's core tokenization pipeline works without the Rust standard library. This means you can run BPE encoding and decoding on WASM targets, microcontrollers, and other no_std environments.

What works without std

The following all work with default-features = false:

  • Vocabulary types (UnifiedTokenVocab, ByteMapVocab, PairMapVocab, SpanMapVocab)
  • Text spanning (pre-tokenization)
  • BPE encoding (SpanEncoder implementations)
  • Token decoding
  • Logos DFA lexers (compile-time, no runtime regex)
  • Special token handling

What requires std

These features are behind feature flags that imply std:

| Feature | Requires std because |
|---|---|
| download | Network I/O, file system caching |
| datagym | JSON parsing, file I/O |
| concurrent | Thread pool, OS threads |
| parallel | Rayon requires OS threads |
| Regex-based spanning | regex and fancy-regex crates |

Note that fast-hash does not require std. See Feature Flags for the full reference.

Configuration

[dependencies]
wordchipper = { version = "0.7", default-features = false }

That's it. No separate no_std feature flag needed. The crate uses unconditional #![no_std] and conditionally links std when the std feature is active.

How it works internally

wordchipper uses the "Reddit PSA" pattern for no_std support:

#![no_std]

#[cfg(feature = "std")]
extern crate std;

extern crate alloc;

This means the crate always starts in no_std mode. The std feature adds standard library support on top. All collection types come from alloc (Vec, String, Box) via an internal prelude module.
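A minimal sketch of that pattern (generic, not wordchipper's actual prelude): collection types are pulled from alloc, so the same code compiles with or without std:

```rust
// Compiles in both std and no_std builds: `alloc` is always linked.
extern crate alloc;
use alloc::{string::String, vec::Vec};

fn collect_words(s: &str) -> Vec<String> {
    s.split_whitespace().map(String::from).collect()
}

fn main() {
    let words = collect_words("hello world");
    assert_eq!(words, vec![String::from("hello"), String::from("world")]);
}
```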

For hash maps, hashbrown is always available as a non-optional dependency. When fast-hash is enabled, foldhash is used as the hasher (with either std::collections::HashMap or hashbrown::HashMap depending on whether std is active). Without fast-hash, the default hasher is used.

WASM targets

The WASM bindings (bindings/wasm/) demonstrate a full no_std integration. The WASM crate uses default-features = false and builds the entire tokenization pipeline without std:

[dependencies]
wordchipper = { path = "../../crates/wordchipper", default-features = false }

Vocabulary loading in WASM works by parsing tiktoken-format data from &[u8] directly, bypassing the std-gated vocab::io module.

CI verifies the no_std build against two targets:

# WASM (browser/Node.js)
cargo check --target wasm32-unknown-unknown --no-default-features

# ARM Cortex-M (bare metal)
cargo check --target thumbv7m-none-eabi --no-default-features

Embedded considerations

On memory-constrained targets, be aware that:

  • Vocabulary size matters. cl100k_base has ~100k entries. Each entry stores a byte sequence and merge pairs. Budget several megabytes for the vocabulary data structures.
  • No regex. Without std, regex-based spanning is unavailable. The logos DFA lexers work in no_std since they're compiled at build time, but only for known patterns (r50k, cl100k, o200k). For custom patterns, you'd need to implement SpanLexer directly.
  • No parallelism. The parallel and concurrent features require std. All encoding runs single-threaded. Use SingleThreadDefault for the best single-threaded span encoder.
  • Allocator required. wordchipper uses alloc (Vec, String, Box, Arc). Your target must provide a global allocator.

Example: building a vocabulary in no_std

Since load_vocab and file I/O require std, you need to construct the vocabulary from raw data. The WASM bindings show one approach: parse a tiktoken-format &[u8] buffer into a SpanMapVocab, combine with a TextSpanningConfig, and build a UnifiedTokenVocab.

The key types involved:

use wordchipper::{
    UnifiedTokenVocab,
    vocab::{SpanMapVocab, ByteMapVocab},
    spanners::TextSpanningConfig,
};

// 1. Parse your vocabulary data into a SpanMapVocab
// 2. Combine with a TextSpanningConfig for your pattern
// 3. Build: UnifiedTokenVocab::from_span_vocab(config, span_vocab)

The exact parsing depends on your data format. See bindings/wasm/src/lib.rs for a working implementation.

Advanced: Span Encoders

This chapter covers wordchipper's span encoder architecture: what a span encoder does, how the five built-in implementations work, and when you might choose one over another.

What a span encoder does

After pre-tokenization splits text into spans, each span needs to be converted into a sequence of token IDs. This is the job of the SpanEncoder trait:

pub trait SpanEncoder<T: TokenType>: Send {
    fn encode_append_compound_span(
        &mut self,
        vocab: &UnifiedTokenVocab<T>,
        span: &[u8],
        tokens: &mut Vec<T>,
    );
}

The encoder receives a byte slice (one span) and appends the resulting token IDs to a buffer. The vocabulary provides two resources for this:

  1. Span lookup (lookup_token): Check if the entire span is a single known token. If so, return it directly. This is the fast path.
  2. Pair merge table (pair_vocab): A map from (token_a, token_b) -> merged_token. This drives the BPE merge loop.

When a span is a single token, encoding is a single hash map lookup. When it's not, the encoder must run BPE: start with byte-level tokens and iteratively merge pairs according to the merge table's rank ordering.

The BPE merge problem

The core BPE algorithm is simple to state: repeatedly merge the lowest-ranked pair in the sequence. But implementing it efficiently is tricky.

Consider the span "abcd" with merges ranked: (a,b)=0, (c,d)=1, (ab,cd)=2.

Start:   [a, b, c, d]
Step 1:  [ab, c, d]    merge (a,b) at rank 0
Step 2:  [ab, cd]      merge (c,d) at rank 1
Step 3:  [abcd]        merge (ab,cd) at rank 2

The challenge is that after each merge, the neighboring pairs change. Merging (a,b) into ab creates a new pair (ab,c) that didn't exist before. A naive implementation scans the entire sequence after each merge to find the next lowest-ranked pair, giving O(n^2) behavior.
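The naive approach can be sketched in a few lines. This is an illustrative toy, not wordchipper's code: tokens are plain strings and `ranks` stands in for the pair merge table.

```rust
use std::collections::HashMap;

fn naive_bpe(mut tokens: Vec<String>, ranks: &HashMap<(String, String), u32>) -> Vec<String> {
    loop {
        // Full scan: find the position of the lowest-ranked adjacent pair.
        let mut best: Option<(usize, u32)> = None;
        for i in 0..tokens.len().saturating_sub(1) {
            if let Some(&rank) = ranks.get(&(tokens[i].clone(), tokens[i + 1].clone())) {
                if best.map_or(true, |(_, r)| rank < r) {
                    best = Some((i, rank));
                }
            }
        }
        let Some((i, _)) = best else { return tokens };
        // Merge in place; rescanning the whole buffer after every merge
        // is exactly the O(n^2) cost described above.
        let merged = format!("{}{}", tokens[i], tokens[i + 1]);
        tokens[i] = merged;
        tokens.remove(i + 1);
    }
}

fn main() {
    // The "abcd" example from the text: (a,b)=0, (c,d)=1, (ab,cd)=2.
    let ranks: HashMap<(String, String), u32> = HashMap::from([
        (("a".to_string(), "b".to_string()), 0),
        (("c".to_string(), "d".to_string()), 1),
        (("ab".to_string(), "cd".to_string()), 2),
    ]);
    let start = vec!["a".to_string(), "b".to_string(), "c".to_string(), "d".to_string()];
    assert_eq!(naive_bpe(start, &ranks), vec!["abcd".to_string()]);
}
```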

Why shift-reduce doesn't work

A tempting optimization is a shift-reduce parser: push tokens onto a stack, and whenever the top two tokens form a mergeable pair with a lower rank than the next input pair, reduce (merge) them. This runs in O(n) and works for most inputs.

But it fails on some inputs because BPE requires global rank ordering, not just local comparison. A local decision to merge might miss a lower-ranked pair further to the right. This was verified empirically against cl100k and o200k corpora.

The five implementations

BufferSweep (Reference)

The simplest correct implementation. Maintains a buffer of tokens and repeatedly sweeps through it, merging the globally lowest-ranked pair in each pass.

  • Time: O(n * m) where m is the number of merge rounds
  • Space: O(n)
  • Use case: Testing and correctness comparison. Not optimized.

Selected via SpanEncoderSelector::Reference or SpanEncoderSelector::BufferSweep.

PriorityMerge (SingleThreadDefault)

Uses a priority queue (min-heap) to track mergeable pairs. Instead of scanning the whole buffer each round, it pops the lowest-ranked pair from the heap. After merging, it updates the neighboring pairs in the heap.

  • Time: O(n log n) amortized
  • Space: O(n) for the heap
  • Use case: Single-threaded workloads. Best single-thread performance.

Selected via SpanEncoderSelector::SingleThreadDefault or SpanEncoderSelector::PriorityMerge.
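The heap-driven strategy can be sketched as follows. This is a self-contained illustration of the technique, not wordchipper's actual PriorityMerge code: tokens are strings, `ranks` plays the role of the pair merge table, and stale heap entries are skipped via validation on pop.

```rust
use std::cmp::Reverse;
use std::collections::{BinaryHeap, HashMap};

// Heap entry: (rank, left index, right index, left token, right token).
type Entry = Reverse<(u32, usize, usize, String, String)>;

fn heap_bpe(tokens: Vec<String>, ranks: &HashMap<(String, String), u32>) -> Vec<String> {
    let n = tokens.len();
    let mut toks: Vec<Option<String>> = tokens.into_iter().map(Some).collect();
    // Doubly linked list over live positions; `n` acts as the end sentinel.
    let mut next: Vec<usize> = (1..=n).collect();
    let mut prev: Vec<isize> = (0..n as isize).map(|i| i - 1).collect();

    let push = |heap: &mut BinaryHeap<Entry>, toks: &[Option<String>], i: usize, j: usize| {
        if let (Some(a), Some(b)) = (&toks[i], &toks[j]) {
            if let Some(&r) = ranks.get(&(a.clone(), b.clone())) {
                heap.push(Reverse((r, i, j, a.clone(), b.clone())));
            }
        }
    };

    let mut heap = BinaryHeap::new();
    for i in 0..n.saturating_sub(1) {
        push(&mut heap, &toks, i, i + 1);
    }

    while let Some(Reverse((_, i, j, a, b))) = heap.pop() {
        // Skip stale entries: the pair must still be adjacent and unchanged.
        if toks[i].as_deref() != Some(a.as_str()) || next[i] != j { continue; }
        if toks[j].as_deref() != Some(b.as_str()) { continue; }
        // Merge j into i and unlink j from the list.
        toks[i] = Some(format!("{a}{b}"));
        toks[j] = None;
        next[i] = next[j];
        if next[i] < n { prev[next[i]] = i as isize; }
        // Only the two pairs touching the merged token change; push just those.
        if prev[i] >= 0 { push(&mut heap, &toks, prev[i] as usize, i); }
        if next[i] < n { push(&mut heap, &toks, i, next[i]); }
    }
    toks.into_iter().flatten().collect()
}

fn main() {
    let ranks: HashMap<(String, String), u32> = HashMap::from([
        (("a".to_string(), "b".to_string()), 0),
        (("c".to_string(), "d".to_string()), 1),
        (("ab".to_string(), "cd".to_string()), 2),
    ]);
    let start = vec!["a".to_string(), "b".to_string(), "c".to_string(), "d".to_string()];
    assert_eq!(heap_bpe(start, &ranks), vec!["abcd".to_string()]);
}
```

Each merge pushes at most two new heap entries, so the total work is O(n log n) rather than a full O(n) rescan per merge.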

MergeHeap (ConcurrentDefault)

Similar to PriorityMerge but optimized for concurrent access patterns. The internal state is designed to play well when multiple threads encode different spans through the same tokenizer.

  • Time: O(n log n) amortized
  • Space: O(n)
  • Use case: Multi-threaded workloads with rayon. Default for parallel(true).

Selected via SpanEncoderSelector::ConcurrentDefault or SpanEncoderSelector::MergeHeap.

TailSweep

An alternative scanning strategy that sweeps from the tail of the buffer. Useful in memory-constrained scenarios.

  • Time: O(n * m)
  • Space: O(n)
  • Use case: When memory allocation patterns matter.

Selected via SpanEncoderSelector::TailSweep.

BpeBacktrack

A fundamentally different approach. Instead of starting with bytes and merging up, it starts with the longest possible tokens and backtracks when merge boundaries don't align.

The algorithm:

  1. Build an Aho-Corasick automaton from all vocabulary token byte sequences.
  2. Scan the input with leftmost-longest matching to find the longest token at each position.
  3. Validate boundaries: check that each token pair boundary is a valid BPE merge point using is_valid_token_pair. If not, backtrack to a shorter token.
  4. Walk the next-prefix chain when backtracking: each token has a precomputed "next longest prefix" that allows O(1) fallback.

This gives O(n) encoding time (amortized, since backtracking visits each byte at most twice).

  • Time: O(n) amortized
  • Space: O(V) for the AC automaton (built once, shared via Arc)
  • Use case: When exact BPE semantics are required, or for very long spans.
  • Trade-off: Higher upfront cost to build the automaton from the vocabulary.

Selected via SpanEncoderSelector::BpeBacktrack.

The implementation is based on the work by Hendrik van Antwerpen and Alexander Neubeck at GitHub Next (2023), described in the blog post "So many tokens, so little time."

Precomputed data structures

BpeBacktrack is the only encoder that pre-builds data structures from the vocabulary. The BpeVocab struct contains:

  • pair_lookup: The standard (T, T) -> T merge table.
  • split_table: Inverse of pair_lookup. For each token, stores its canonical split into two sub-tokens.
  • next_prefix: For each token, the longest prefix token that is shorter by one or more bytes. This enables O(1) backtracking.
  • ac: The Aho-Corasick automaton over all token byte sequences.
  • token_lens: Byte length of each token, indexed by token ID.

This is built once by BpeVocab::from_vocab and shared via Arc across all encoder instances.

Choosing the right encoder

Scenario                              | Recommended                      | Why
--------------------------------------|----------------------------------|--------------------------------------------
Web server, many concurrent requests  | ConcurrentDefault                | MergeHeap plays well with thread pools
CLI tool, single-threaded             | SingleThreadDefault              | PriorityMerge is fastest single-threaded
Correctness testing                   | Reference                        | BufferSweep is simplest to reason about
Very long spans (>1KB)                | BpeBacktrack                     | O(n) vs O(n log n) matters for long inputs
Memory-constrained embedded           | TailSweep or SingleThreadDefault | Lower allocation pressure

For most users, the defaults (ConcurrentDefault when parallel, SingleThreadDefault otherwise) are the right choice. The SpanEncoderSelector enum makes switching a one-line change for benchmarking.

The fast path: single-token spans

Before any BPE algorithm runs, the encoder checks whether the entire span is already a single token in the vocabulary. This is a single hash map lookup and is O(1). For common words like "the", "is", " world", this fast path handles the majority of spans.

The encode_append_span_ref method in the SpanEncoder trait implements this:

SpanRef::Word(range) => {
    let span = text[range].as_bytes();
    if let Some(token) = vocab.lookup_token(span) {
        tokens.push(token);  // fast path: single lookup
    } else {
        self.encode_append_compound_span(vocab, span, tokens);  // BPE merge loop
    }
}

In practice, with large vocabularies like o200k_base (200k tokens), the vast majority of spans hit the fast path. BPE only runs on uncommon words, misspellings, and multilingual text that doesn't appear verbatim in the vocabulary.

Feature Flags

These are the Cargo features available on wordchipper. Features can be enabled in your Cargo.toml:

[dependencies]
wordchipper = { version = "0.8", features = ["client"] }

The default features are std, fast-hash, and parallel.
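If you want a leaner build, you can opt out of the defaults and re-enable features selectively. A sketch (the version number follows the snippet above; adjust to your release):

```toml
# Drop the defaults (std, fast-hash, parallel), then opt back in to
# std and fast-hash only -- no rayon-based parallelism is pulled in.
[dependencies]
wordchipper = { version = "0.8", default-features = false, features = ["std", "fast-hash"] }
```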


features = ["std"]

Enabled by default.

Provides standard library integration: regex-based spanning via regex and fancy-regex, file I/O, and std::collections::HashMap. Building with default-features = false removes the standard library dependency, leaving a pure no_std crate that uses hashbrown for hash maps and logos DFA lexers for pre-tokenization.

features = ["fast-hash"]

Enabled by default.

Use foldhash for faster hash maps. When combined with std, the standard library's HashMap is used with foldhash's hasher. Without std, hashbrown::HashMap is used with foldhash's hasher. This feature has no std requirement, so it works in no_std environments.

features = ["parallel"]

Enabled by default. Implies concurrent.

Enable rayon-based parallelism for batch encoding and decoding. Control the thread count with the RAYON_NUM_THREADS environment variable.

features = ["concurrent"]

Implies std.

Enable the thread pool (PoolToy) and concurrency utilities used for concurrent encoder access. The parallel feature enables this automatically; use concurrent directly if you want the thread pool without pulling in rayon.

features = ["client"]

Implies download and datagym (both of which imply std).

Everything needed to load pretrained vocabularies: downloading from the network and parsing DataGym-format files.

features = ["download"]

Implies std.

Download and cache vocabulary files from the internet.

features = ["datagym"]

Implies std.

Load DataGym-format vocabularies (used by older OpenAI models like GPT-2). Pulls in serde_json.

features = ["tracing"]

Add tracing instrumentation points throughout the encoding pipeline. Only useful for profiling the library itself.

features = ["testing"]

Export test utilities for downstream crates to use in their own test suites.

Interactive Tokenizer Demo

Try wordchipper's BPE tokenizer directly in the browser via WebAssembly. Select a model, type some text, and click Encode & Decode to see the token IDs, roundtrip verification, and timing.
