Getting Started
Installation
Add wordchipper to your Cargo.toml:
```toml
[dependencies]
wordchipper = "0.7"
```
The default features include std, fast-hash (the foldhash hasher), and parallel (rayon). Add
the client feature for vocabulary download. See Feature Flags for all options and
no_std & Embedded for minimal configurations.
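For example, to keep the defaults and also enable downloading (a minimal sketch, assuming the client feature is additive to the default feature set):

```toml
[dependencies]
wordchipper = { version = "0.7", features = ["client"] }
```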
Encode and decode in five lines
```rust
use wordchipper::{
    Tokenizer, TokenizerOptions, TokenEncoder, TokenDecoder,
    load_vocab, disk_cache::WordchipperDiskCache,
};

fn main() -> wordchipper::WCResult<()> {
    // Load vocabulary (downloads and caches on first run)
    let mut cache = WordchipperDiskCache::default();
    let (_desc, vocab) = load_vocab("openai:cl100k_base", &mut cache)?;

    // Build a tokenizer
    let tok = TokenizerOptions::default().build(vocab);

    // Encode
    let tokens = tok.try_encode("hello world")?;
    assert_eq!(tokens, vec![15339, 1917]);

    // Decode
    let text = tok.try_decode_to_string(&tokens)?;
    assert_eq!(text, "hello world");

    Ok(())
}
```
The first call to load_vocab downloads the vocabulary file from OpenAI's CDN and caches it in
~/.cache/wordchipper/. Subsequent calls load from disk.
Available models
```rust
use wordchipper::list_models;

fn main() {
    for model in list_models() {
        println!("{}", model);
    }
}
```
This prints all registered models with their namespace prefix:
```text
openai::r50k_base
openai::p50k_base
openai::p50k_edit
openai::cl100k_base
openai::o200k_base
openai::o200k_harmony
```
You can also use short names, without the namespace prefix, when loading:
```rust
#![allow(unused)]

use wordchipper::{load_vocab, disk_cache::WordchipperDiskCache};

fn main() {
    let mut cache = WordchipperDiskCache::default();
    let (_desc, vocab) = load_vocab("o200k_base", &mut cache).unwrap();
}
```
Batch encoding
For multiple strings, batch encoding processes spans in parallel across threads (when the
parallel feature is enabled):
```rust
use wordchipper::{
    TokenEncoder, TokenizerOptions,
    load_vocab, disk_cache::WordchipperDiskCache,
};

fn main() -> wordchipper::WCResult<()> {
    let mut cache = WordchipperDiskCache::default();
    let (_desc, vocab) = load_vocab("openai:cl100k_base", &mut cache)?;

    let tok = TokenizerOptions::default()
        .with_parallel(true)
        .build(vocab);

    let texts = vec!["hello world", "the cat sat on the mat", "foo bar baz"];
    let batch = tok.try_encode_batch(&texts)?;

    for (text, tokens) in texts.iter().zip(batch.iter()) {
        println!("{:?} -> {:?}", text, tokens);
    }

    Ok(())
}
```
Choosing a token type
wordchipper is generic over the token integer type. Most users should use u32, which is the
default. If your vocabulary fits in 16 bits (every token ID below 65,536), you can use u16 to save memory:
```rust
use std::sync::Arc;
use wordchipper::{
    UnifiedTokenVocab, TokenizerOptions,
    load_vocab, disk_cache::WordchipperDiskCache,
};

fn main() -> wordchipper::WCResult<()> {
    let mut cache = WordchipperDiskCache::default();
    let (_desc, vocab) = load_vocab("openai:r50k_base", &mut cache)?;

    // Convert to u16 (works for r50k_base with ~50k tokens)
    let vocab_u16: Arc<UnifiedTokenVocab<u16>> =
        Arc::new(vocab.as_ref().to_token_type());

    let tok = TokenizerOptions::default().build(vocab_u16);
    Ok(())
}
```
For cl100k_base (~100k tokens) and o200k_base (~200k tokens), use u32.
What happens under the hood
When you call try_encode("hello world"), wordchipper runs a two-phase pipeline:
- Spanning. The text is split into coarse segments called spans. For OpenAI models, this uses a regex pattern (or a faster DFA lexer) that handles whitespace, punctuation, letters, and digits. "hello world" becomes ["hello", " world"].
- BPE encoding. Each span is encoded independently using Byte Pair Encoding (sketched below). The encoder iteratively merges adjacent byte pairs according to the vocabulary's merge table, always picking the lowest-ranked (highest-priority) pair first. The result is a sequence of token IDs.
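To make the merge loop concrete, here is a deliberately naive sketch of greedy BPE over a single span. It illustrates the algorithm only and is not wordchipper's implementation: the bpe_encode function and ranks table are hypothetical, and a real encoder works over token IDs with far faster merge strategies than this quadratic scan.

```rust
use std::collections::HashMap;

// Greedy BPE over one span: repeatedly merge the adjacent pair with the
// lowest (best) rank until no adjacent pair appears in the merge table.
fn bpe_encode(span: &[u8], ranks: &HashMap<Vec<u8>, u32>) -> Vec<Vec<u8>> {
    // Start with one piece per byte.
    let mut pieces: Vec<Vec<u8>> = span.iter().map(|&b| vec![b]).collect();
    loop {
        // Scan all adjacent pairs and keep the one with the lowest rank.
        let best = pieces
            .windows(2)
            .enumerate()
            .filter_map(|(i, pair)| {
                let merged = [pair[0].as_slice(), pair[1].as_slice()].concat();
                ranks.get(&merged).map(|&rank| (rank, i))
            })
            .min();
        match best {
            // Merge pieces[i] and pieces[i + 1] into a single piece.
            Some((_, i)) => {
                let right = pieces.remove(i + 1);
                pieces[i].extend(right);
            }
            // No adjacent pair is in the merge table: we are done.
            None => return pieces,
        }
    }
}

fn main() {
    // Toy merge table: lower rank = merged earlier.
    let mut ranks: HashMap<Vec<u8>, u32> = HashMap::new();
    ranks.insert(b"he".to_vec(), 0);
    ranks.insert(b"ll".to_vec(), 1);
    ranks.insert(b"hell".to_vec(), 2);
    ranks.insert(b"hello".to_vec(), 3);

    // "he" + "ll" merge first, then "hell", then "hello".
    let pieces = bpe_encode(b"hello", &ranks);
    assert_eq!(pieces, vec![b"hello".to_vec()]);
}
```

On each iteration the lowest-ranked adjacent pair wins, so "he" (rank 0) merges before "ll" (rank 1) even though both are present, which is exactly the priority rule described above.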
Decoding reverses the process: each token ID maps to a byte sequence, and the byte sequences are concatenated to produce the original text.
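In the same toy spirit, decoding is just a table lookup plus concatenation; the decode function and id_to_bytes map below are hypothetical, for illustration only.

```rust
use std::collections::HashMap;

// Toy decoder: look up each token's bytes and concatenate them.
// A real decoder also handles unknown IDs and invalid UTF-8.
fn decode(tokens: &[u32], id_to_bytes: &HashMap<u32, Vec<u8>>) -> Vec<u8> {
    tokens
        .iter()
        .flat_map(|id| id_to_bytes[id].iter().copied())
        .collect()
}

fn main() {
    let mut id_to_bytes = HashMap::new();
    id_to_bytes.insert(0u32, b"hello".to_vec());
    id_to_bytes.insert(1u32, b" world".to_vec());
    assert_eq!(decode(&[0, 1], &id_to_bytes), b"hello world");
}
```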
The next chapter explains this pipeline in detail, with visual examples of how BPE works.