# Pretrained Models
wordchipper ships with loaders for OpenAI's public BPE vocabularies. Each model defines three things: a regex pattern for pre-tokenization, a merge table (the vocabulary), and a set of special tokens.
## Model overview
| Model | Vocab tokens | Pattern | Special tokens | Used by |
|---|---|---|---|---|
| `r50k_base` | ~50k | GPT-2 pattern | `endoftext` | GPT-2 |
| `p50k_base` | ~50k | GPT-2 pattern | `endoftext` | Codex |
| `p50k_edit` | ~50k | GPT-2 pattern | `endoftext`, `fim_prefix`/`middle`/`suffix` | Codex (edit) |
| `cl100k_base` | ~100k | cl100k pattern | `endoftext`, `fim_prefix`/`middle`/`suffix`, `endofprompt` | GPT-3.5, GPT-4 |
| `o200k_base` | ~200k | o200k pattern | `endoftext`, `endofprompt` | GPT-4o |
| `o200k_harmony` | ~200k | o200k pattern | `endoftext`, `endofprompt`, `startoftext`, plus many reserved | gpt-oss (harmony) |
## What changed between models
**r50k / p50k (GPT-2 era).** The simplest pattern. Contractions (`'s`, `'t`, etc.) are matched literally, which also makes matching case-sensitive. Words and numbers are preceded by an optional space.

**cl100k (GPT-3.5 / GPT-4).** Contractions are matched case-insensitively. Words can be preceded by a non-letter, non-newline character (allowing punctuation to attach). Numbers are limited to three digits at a time. Newlines are handled explicitly.

**o200k (GPT-4o).** Much more sophisticated word patterns. The pattern recognizes casing transitions (so CamelCase splits) and uses Unicode general categories (`\p{Lu}`, `\p{Ll}`, `\p{Lt}`, `\p{Lm}`, `\p{Lo}`, `\p{M}`) for precise letter classification. Contractions are appended to words rather than matched separately.
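To see the progression concretely, you can print the raw regexes side by side. A minimal sketch: `R50kBase` is an assumed variant name (by analogy with `Cl100kBase` and `O200kBase` used elsewhere in this chapter), and it assumes `pattern()` returns something printable:

```rust
use wordchipper::pretrained::openai::OATokenizer;

// Print the pre-tokenization regex for one model of each generation.
// Note how the regex grows from the GPT-2 era to o200k.
for model in [
    OATokenizer::R50kBase, // assumed variant name
    OATokenizer::Cl100kBase,
    OATokenizer::O200kBase,
] {
    println!("{}: {}", model, model.pattern());
}
```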
## Shared vocabularies
Some models share the same vocabulary file but differ in special tokens:

- `p50k_edit` uses the same vocab as `p50k_base` but adds FIM (fill-in-the-middle) tokens.
- `o200k_harmony` uses the same vocab as `o200k_base` but adds many reserved and named special tokens for structured generation.
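In practice, loading the second model of such a pair is cheap: both names resolve to the same `.tiktoken` file, so the second load should be served from the disk cache rather than the network (assumed caching behavior; the calls themselves are the ones shown below):

```rust
use wordchipper::{load_vocab, disk_cache::WordchipperDiskCache};

let mut cache = WordchipperDiskCache::default();
// Same underlying vocabulary file; only the special-token set differs.
let (_, base) = load_vocab("openai:p50k_base", &mut cache).unwrap();
let (_, edit) = load_vocab("openai:p50k_edit", &mut cache).unwrap();
```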
## Loading a vocabulary
The standard way to load a model:
```rust
use wordchipper::{load_vocab, disk_cache::WordchipperDiskCache};

let mut cache = WordchipperDiskCache::default();
let (desc, vocab) = load_vocab("openai:cl100k_base", &mut cache).unwrap();
```
`load_vocab` returns a `(VocabDescription, Arc<UnifiedTokenVocab<u32>>)` pair. The description contains metadata; the vocab contains everything needed for encoding and decoding.
### Short names
The `openai:` prefix is optional. Both `"cl100k_base"` and `"openai:cl100k_base"` work. Use `list_vocabs()` for all registered short names and `list_models()` for fully qualified names.
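A discovery sketch; this chapter doesn't show the return types of `list_vocabs()` and `list_models()`, so the assumption here is that both return something iterable over printable names:

```rust
use wordchipper::{list_models, list_vocabs};

// Enumerate every registered name (assumes iterable, printable returns).
for short in list_vocabs() {
    println!("short name: {}", short);
}
for full in list_models() {
    println!("model: {}", full);
}
```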
## Loading from a file path
If you have a `.tiktoken` file on disk, you can skip the download:
```rust
use wordchipper::pretrained::openai::OATokenizer;

let vocab = OATokenizer::Cl100kBase
    .load_path::<u32>("/path/to/cl100k_base.tiktoken")
    .unwrap();
```
## Loading from a reader
For maximum flexibility (e.g., loading from an in-memory buffer or a network stream):
```rust
use std::io::BufReader;
use wordchipper::pretrained::openai::OATokenizer;

let data: &[u8] = b"..."; // tiktoken base64 format
let reader = BufReader::new(data);
let vocab = OATokenizer::O200kBase.read_vocab::<u32, _>(reader).unwrap();
```
## Special tokens
Special tokens are strings with reserved token IDs that are never produced by BPE encoding. They're used for control flow: marking end-of-text, fill-in-middle boundaries, prompt boundaries, etc.
```rust
use wordchipper::{load_vocab, disk_cache::WordchipperDiskCache, TokenizerOptions, TokenEncoder};

let mut cache = WordchipperDiskCache::default();
let (_, vocab) = load_vocab("openai:cl100k_base", &mut cache).unwrap();
let tok = TokenizerOptions::default().build(vocab);

let specials = tok.special_vocab();
for (bytes, &id) in specials.span_map().iter() {
    let name = String::from_utf8_lossy(bytes);
    println!("{} -> {}", name, id);
}
```
For `cl100k_base`, this prints:

```text
<|endoftext|> -> 100257
<|fim_prefix|> -> 100258
<|fim_middle|> -> 100259
<|fim_suffix|> -> 100260
<|endofprompt|> -> 100276
```
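To look up a single token instead of dumping them all, you can filter the same map; this sketch reuses only the bindings and calls from the snippet above:

```rust
// Scan the span map for one specific special token.
for (bytes, &id) in specials.span_map().iter() {
    if String::from_utf8_lossy(bytes) == "<|endoftext|>" {
        println!("end-of-text id: {}", id); // 100257 for cl100k_base
    }
}
```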
## The `OATokenizer` enum

For programmatic access to all OpenAI models, use the `OATokenizer` enum:
```rust
use wordchipper::pretrained::openai::OATokenizer;

// Iterate over all models
#[cfg(feature = "std")]
{
    use strum::IntoEnumIterator;
    for model in OATokenizer::iter() {
        println!("{}", model);
    }
}
```
Each variant provides:

- `pattern()` - the regex pattern for pre-tokenization
- `special_tokens::<T>()` - the special token list
- `load_vocab::<T>(loader)` - load the vocabulary with download support
- `load_path::<T>(path)` - load from a local file
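A combined sketch of the first two accessors; their return types aren't spelled out in this chapter, so the assumption is that `pattern()` is printable and the special-token list is iterable and `Debug`-printable:

```rust
use wordchipper::pretrained::openai::OATokenizer;

let model = OATokenizer::Cl100kBase;
// The raw pre-tokenization regex for this model.
println!("pattern: {}", model.pattern());
// Special tokens (assumed iterable and Debug-printable).
for special in model.special_tokens::<u32>() {
    println!("{:?}", special);
}
```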
## The tiktoken format
OpenAI's vocabulary files use a simple base64 format. Each line contains a base64-encoded byte sequence and its integer token ID, separated by a space:
```text
IQ== 0
Ig== 1
Iw== 2
...
```
The base64 decodes to the raw bytes that the token represents. This format is what `WordchipperDiskCache` downloads and caches from OpenAI's CDN.
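To make the format concrete, here's how one line could be parsed by hand. This is not a wordchipper API, just a sketch using the widely used `base64` crate:

```rust
use base64::{engine::general_purpose::STANDARD, Engine as _};

// One line of the tiktoken format: "<base64 bytes> <token id>".
let line = "IQ== 0";
let (b64, id_str) = line.split_once(' ').unwrap();
let bytes = STANDARD.decode(b64).unwrap();
let id: u32 = id_str.parse().unwrap();
assert_eq!(bytes, b"!"); // "IQ==" decodes to the single byte 0x21 ('!')
assert_eq!(id, 0);
```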
## Choosing a model
If you're building a tool that interacts with an OpenAI model, use the matching tokenizer:
| If you use... | Load... |
|---|---|
| GPT-4o, GPT-4o-mini | `o200k_base` |
| GPT-4, GPT-3.5-turbo | `cl100k_base` |
| GPT-3 (text-davinci-003) | `p50k_base` |
| GPT-2 | `r50k_base` |
If you're building your own model or just need token counting, `o200k_base` has the largest vocabulary and handles the widest range of languages efficiently.
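For example, a counting setup using only calls shown earlier in this chapter; the actual encode/count methods live on the `TokenEncoder` trait and aren't reproduced here:

```rust
use wordchipper::{load_vocab, disk_cache::WordchipperDiskCache, TokenizerOptions, TokenEncoder};

let mut cache = WordchipperDiskCache::default();
let (_, vocab) = load_vocab("openai:o200k_base", &mut cache).unwrap();
let tok = TokenizerOptions::default().build(vocab);
// `tok` now encodes and decodes with the o200k_base vocabulary; counting
// text is a matter of calling the TokenEncoder methods on it.
```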