Performance

wordchipper is designed for throughput. This chapter covers the knobs you can turn: BPE algorithm selection, DFA-accelerated pre-tokenization, parallelism, and hash function choice.

The two bottlenecks

Tokenization has two phases, and each can be the bottleneck:

  1. Pre-tokenization (spanning). Splitting text into spans using a regex. This phase is scan-bound: the regex engine must examine every byte of the input, so throughput is limited by how fast it can consume bytes.
  2. BPE encoding. Merging byte pairs within each span. This is compute-bound: each span requires iterative pair lookups and merges.

For short texts, BPE dominates. For long texts with many spans, pre-tokenization matters more. wordchipper optimizes both.
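To make the two phases concrete, here is a deliberately simplified, self-contained sketch (not wordchipper's actual implementation): spanning is reduced to whitespace splitting, and BPE encoding to a greedy lowest-rank merge loop over a toy merge table.

```rust
use std::collections::HashMap;

/// Phase 1 (spanning): split text into spans. Real pre-tokenizers use a
/// regex or DFA; whitespace splitting stands in for that here.
fn span(text: &str) -> Vec<&str> {
    text.split_whitespace().collect()
}

/// Phase 2 (BPE encoding): repeatedly merge the adjacent pair with the
/// lowest merge rank until no mergeable pair remains.
fn encode(span: &str, ranks: &HashMap<(String, String), u32>) -> Vec<String> {
    let mut parts: Vec<String> = span.chars().map(|c| c.to_string()).collect();
    loop {
        // Find the adjacent pair with the lowest rank, if any.
        let best = parts
            .windows(2)
            .enumerate()
            .filter_map(|(i, w)| ranks.get(&(w[0].clone(), w[1].clone())).map(|&r| (r, i)))
            .min();
        match best {
            Some((_, i)) => {
                let merged = format!("{}{}", parts[i], parts[i + 1]);
                parts.splice(i..=i + 1, [merged]);
            }
            None => return parts,
        }
    }
}

fn main() {
    let mut ranks = HashMap::new();
    ranks.insert(("h".to_string(), "e".to_string()), 0);
    ranks.insert(("he".to_string(), "l".to_string()), 1);
    for s in span("hello world") {
        println!("{:?}", encode(s, &ranks));
    }
}
```

Note how the cost profiles differ: `span` touches each byte once, while `encode` loops over each span repeatedly, which is why short inputs are dominated by the merge loop.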

DFA-accelerated spanning (logos)

The biggest single optimization is replacing the regex engine with a compile-time DFA. wordchipper uses the logos crate to compile pre-tokenization patterns into deterministic finite automata at build time.

This is enabled by default. When you load a vocabulary whose regex pattern matches a known logos lexer (cl100k, o200k, r50k), the DFA lexer is used automatically.
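To illustrate why a DFA is so much faster than a general regex engine, here is a toy, hand-rolled DFA (nothing like the code logos generates, and ASCII-only) that segments text into runs of letters, digits, and other bytes in a single linear pass, one state transition per byte:

```rust
/// Toy DFA states for a drastically simplified pre-tokenization pattern:
/// runs of ASCII letters, runs of digits, runs of anything else.
#[derive(Clone, Copy, PartialEq)]
enum State { Start, Alpha, Digit, Other }

fn class(b: u8) -> State {
    match b {
        b'a'..=b'z' | b'A'..=b'Z' => State::Alpha,
        b'0'..=b'9' => State::Digit,
        _ => State::Other,
    }
}

/// One pass over the bytes; a span ends whenever the character class
/// (and thus the DFA state) changes. ASCII-only for simplicity.
fn dfa_spans(text: &str) -> Vec<&str> {
    let mut spans = Vec::new();
    let mut start = 0;
    let mut state = State::Start;
    for (i, &b) in text.as_bytes().iter().enumerate() {
        let next = class(b);
        if state != State::Start && next != state {
            spans.push(&text[start..i]);
            start = i;
        }
        state = next;
    }
    if !text.is_empty() {
        spans.push(&text[start..]);
    }
    spans
}

fn main() {
    assert_eq!(dfa_spans("abc 123"), vec!["abc", " ", "123"]);
    println!("{:?}", dfa_spans("hello42 world"));
}
```

There is no backtracking and no per-match setup: each byte is classified once and either extends the current span or closes it, which is the shape of the speedups in the table below.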

Benchmarks

Spanning throughput on a single thread:

Model         Regex      Logos DFA   Speedup
cl100k_base   ~25 MB/s   ~732 MB/s   29x
o200k_base    ~15 MB/s   ~765 MB/s   52x

Disabling DFA acceleration

If you want to force regex-based spanning (e.g., for testing or debugging):

#![allow(unused)]
fn main() {
    use wordchipper::{TokenizerOptions, load_vocab, disk_cache::WordchipperDiskCache};

    let mut cache = WordchipperDiskCache::default();
    let (_, vocab) = load_vocab("openai:cl100k_base", &mut cache).unwrap();
    let tok = TokenizerOptions::default()
        .with_accelerated_lexers(false)
        .build(vocab);
}

BPE algorithm selection

wordchipper includes five span encoder implementations. Each makes a different trade-off among single-threaded speed, concurrent access, memory usage, and implementation complexity. See Advanced: Span Encoders for a deep dive into how each works.

The algorithms

Algorithm       Best for                      Notes
MergeHeap       Concurrent / multi-threaded   Default for ConcurrentDefault; heap-based merge tracking.
PriorityMerge   Single-threaded               Default for SingleThreadDefault; priority-queue merging.
BufferSweep     Testing / reference           Simple and correct, not optimized.
TailSweep       Memory-constrained            Alternative scanning strategy.
BpeBacktrack    Exact BPE semantics           O(n) via Aho-Corasick automaton + backtracking.

Selecting an algorithm

#![allow(unused)]
fn main() {
    use wordchipper::{
        TokenizerOptions, TokenEncoderOptions,
        encoders::token_span_encoder::span_encoders::SpanEncoderSelector,
        load_vocab, disk_cache::WordchipperDiskCache,
    };

    let mut cache = WordchipperDiskCache::default();
    let (_, vocab) = load_vocab("openai:cl100k_base", &mut cache).unwrap();

    let mut opts = TokenizerOptions::default();
    opts.encoder.set_span_encoder(SpanEncoderSelector::BpeBacktrack);
    let tok = opts.build(vocab);
}

Which algorithm should I use?

For most users, the defaults are correct:

  • Multi-threaded workloads (web servers, batch processing): Use the default ConcurrentDefault, which selects MergeHeap.
  • Single-threaded workloads (CLI tools, embedded): Use SingleThreadDefault, which selects PriorityMerge.
  • Exact BPE needed: Use BpeBacktrack. It uses an Aho-Corasick automaton pre-built from the vocabulary for O(n) encoding that exactly matches the theoretical BPE algorithm. The other algorithms also produce correct output for standard vocabularies; BpeBacktrack is the theoretical gold standard.

Parallelism with rayon

When the parallel feature is enabled (it is by default), batch operations parallelize across threads:

#![allow(unused)]
fn main() {
    use wordchipper::{TokenizerOptions, TokenEncoder, load_vocab, disk_cache::WordchipperDiskCache};

    let mut cache = WordchipperDiskCache::default();
    let (_, vocab) = load_vocab("openai:cl100k_base", &mut cache).unwrap();
    let tok = TokenizerOptions::default()
        .with_parallel(true)
        .build(vocab);

    let texts: Vec<&str> = vec!["hello"; 1000];
    let batch = tok.try_encode_batch(&texts).unwrap();
}

Setting with_parallel(true) affects both encoding and decoding. The concurrent option additionally selects a span encoder optimized for access from multiple threads.

Controlling thread count

wordchipper uses rayon's global thread pool. Control it with the RAYON_NUM_THREADS environment variable:

RAYON_NUM_THREADS=8 cargo run --release

Or configure it programmatically via rayon::ThreadPoolBuilder.

Hash function selection

wordchipper uses hash maps extensively for vocabulary lookups. The fast-hash feature (enabled by default) swaps in foldhash for faster hashing:

Feature     Hash function   Notes
fast-hash   foldhash        Good general-purpose; works in no_std
(none)      default         SipHash (std) or hashbrown default

Unlike previous versions, fast-hash works in no_std environments. See Feature Flags for details.
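To see why the hasher matters, here is a sketch of swapping the hash function on std's HashMap via a custom BuildHasher. The FNV-1a hasher below is only a stand-in for a fast non-cryptographic hash; it is not foldhash's algorithm, and the token id is made up for illustration.

```rust
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};

/// Minimal FNV-1a hasher: a stand-in for a fast non-cryptographic
/// hash like foldhash (this is NOT foldhash's algorithm).
struct Fnv1a(u64);

impl Default for Fnv1a {
    fn default() -> Self {
        Fnv1a(0xcbf29ce484222325) // FNV-1a 64-bit offset basis
    }
}

impl Hasher for Fnv1a {
    fn finish(&self) -> u64 {
        self.0
    }
    fn write(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.0 ^= u64::from(b);
            self.0 = self.0.wrapping_mul(0x100000001b3); // FNV prime
        }
    }
}

/// HashMap with the custom hasher swapped in, the same mechanism a
/// fast-hash feature flag uses to replace SipHash.
type FastMap<K, V> = HashMap<K, V, BuildHasherDefault<Fnv1a>>;

fn main() {
    let mut vocab: FastMap<&str, u32> = FastMap::default();
    vocab.insert("hello", 42); // illustrative token id
    assert_eq!(vocab.get("hello"), Some(&42));
}
```

SipHash is deliberately slow-ish because it resists hash-flooding attacks; for lookups against a fixed, trusted vocabulary that defense buys little, which is why a faster hash pays off.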

End-to-end benchmarks

Encode + decode throughput on 90 MB shards, 48 threads:

Model          wordchipper   tiktoken-rs   HuggingFace tokenizers
r50k_base      239 MiB/s     169 MiB/s     22 MiB/s
p50k_base      251 MiB/s     163 MiB/s     22 MiB/s
p50k_edit      242 MiB/s     170 MiB/s     21 MiB/s
cl100k_base    214 MiB/s     125 MiB/s     22 MiB/s
o200k_base     119 MiB/s     124 MiB/s     22 MiB/s
o200k_harmony  122 MiB/s     122 MiB/s     22 MiB/s

Running benchmarks yourself

The sample-timer tool runs wordchipper vs. tiktoken-rs side-by-side:

RAYON_NUM_THREADS=48 cargo run --release -p sample-timer -- \
    --dataset-dir $DATASET_DIR --shards 0 --model openai::cl100k_base

For criterion-based microbenchmarks:

cargo bench -p wordchipper-bench

Performance tips

  1. Use the default features. DFA acceleration and rayon parallelism (parallel) are both on by default.
  2. Batch your inputs. try_encode_batch is significantly faster than calling try_encode in a loop because it amortizes thread pool overhead.
  3. Reuse the tokenizer. Building a Tokenizer pre-computes data structures. Build it once, share via Arc.
  4. Match the encoder to your workload. Use ConcurrentDefault for multi-threaded, SingleThreadDefault for single-threaded.
  5. Profile spanning vs. encoding. If spanning is the bottleneck, make sure DFA acceleration is active. If encoding is the bottleneck, experiment with SpanEncoderSelector variants.
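Tip 3 (build once, share via Arc) can be sketched with std::thread and a stand-in struct; in real code, wordchipper's Tokenizer would take the stand-in's place:

```rust
use std::sync::Arc;
use std::thread;

/// Stand-in for an expensive-to-build tokenizer.
struct Tokenizer;

impl Tokenizer {
    /// Expensive: pre-computes lookup tables. Do this once.
    fn build() -> Self {
        Tokenizer
    }
    /// Placeholder for real encoding; returns a whitespace token count.
    fn encode(&self, text: &str) -> usize {
        text.split_whitespace().count()
    }
}

fn main() {
    // Build once, then hand each worker a cheap Arc clone instead of
    // rebuilding the tokenizer per thread.
    let tok = Arc::new(Tokenizer::build());
    let handles: Vec<_> = (0..4)
        .map(|_| {
            let tok = Arc::clone(&tok);
            thread::spawn(move || tok.encode("hello world"))
        })
        .collect();
    let total: usize = handles.into_iter().map(|h| h.join().unwrap()).sum();
    assert_eq!(total, 8); // 4 threads x 2 "tokens" each
}
```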