Pretrained Models

wordchipper ships with loaders for OpenAI's public BPE vocabularies. Each model defines three things: a regex pattern for pre-tokenization, a merge table (the vocabulary), and a set of special tokens.

Model overview

| Model | Vocab tokens | Pattern | Special tokens | Used by |
|---|---|---|---|---|
| r50k_base | ~50k | GPT-2 pattern | endoftext | GPT-2 |
| p50k_base | ~50k | GPT-2 pattern | endoftext | Codex |
| p50k_edit | ~50k | GPT-2 pattern | endoftext, fim_prefix/middle/suffix | Codex (edit) |
| cl100k_base | ~100k | cl100k pattern | endoftext, fim_prefix/middle/suffix, endofprompt | GPT-3.5, GPT-4 |
| o200k_base | ~200k | o200k pattern | endoftext, endofprompt | GPT-4o |
| o200k_harmony | ~200k | o200k pattern | endoftext, endofprompt, startoftext, + many reserved | GPT-4o (harmony) |

What changed between models

r50k / p50k (GPT-2 era). The simplest pattern. Contractions ('s, 't, etc.) are matched literally. Words and numbers are preceded by an optional space. Case-sensitive.

cl100k (GPT-3.5 / GPT-4). Case-insensitive contractions. Words can be preceded by a non-letter, non-newline character (allowing punctuation to attach). Numbers limited to 3 digits at a time. Newlines handled explicitly.

o200k (GPT-4o). Much more sophisticated word patterns. Recognizes casing transitions (CamelCase splits). Uses Unicode general categories (\p{Lu}, \p{Ll}, \p{Lt}, \p{Lm}, \p{Lo}, \p{M}) for precise letter classification. Contractions are appended to words rather than matched separately.
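
These differences are visible in the raw patterns themselves. A minimal sketch for comparing them, assuming pattern() returns the pre-tokenization regex as a plain printable string and that the GPT-2-era variant is named R50kBase (only Cl100kBase and O200kBase appear by name on this page):

#![allow(unused)]
fn main() {
use wordchipper::pretrained::openai::OATokenizer;

// Assumptions: pattern() yields the regex as a printable string, and the
// R50kBase variant name follows the same convention as Cl100kBase / O200kBase.
println!("r50k:   {}", OATokenizer::R50kBase.pattern());
println!("cl100k: {}", OATokenizer::Cl100kBase.pattern());
println!("o200k:  {}", OATokenizer::O200kBase.pattern());
}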

Shared vocabularies

Some models share the same vocabulary file but differ in special tokens:

  • p50k_edit uses the same vocab as p50k_base but adds FIM (fill-in-middle) tokens.
  • o200k_harmony uses the same vocab as o200k_base but adds many reserved and named special tokens for structured generation.
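
A rough way to confirm this from code, as a hedged sketch: the P50kBase and P50kEdit variant names are assumed by analogy with Cl100kBase, and special_tokens::<T>() is assumed to return a collection exposing len().

#![allow(unused)]
fn main() {
use wordchipper::pretrained::openai::OATokenizer;

// Assumptions: P50kBase / P50kEdit variant names, and that special_tokens()
// returns a collection with a len().
let base = OATokenizer::P50kBase.special_tokens::<u32>();
let edit = OATokenizer::P50kEdit.special_tokens::<u32>();
// p50k_edit registers the extra FIM tokens on top of p50k_base's set.
assert!(edit.len() > base.len());
}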

Loading a vocabulary

The standard way to load a model:

#![allow(unused)]
fn main() {
use wordchipper::{load_vocab, disk_cache::WordchipperDiskCache};

let mut cache = WordchipperDiskCache::default();
let (desc, vocab) = load_vocab("openai:cl100k_base", &mut cache).unwrap();
}

load_vocab returns a (VocabDescription, Arc<UnifiedTokenVocab<u32>>) pair. The description carries metadata about the model; the vocab contains everything needed for encoding and decoding.

Short names

The openai: prefix is optional. Both "cl100k_base" and "openai:cl100k_base" work. Use list_vocabs() to enumerate all registered short names and list_models() for the fully qualified names.
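
For example, to see every name that can be passed to load_vocab (a sketch that assumes list_vocabs() and list_models() live at the crate root and return iterable collections of name strings):

#![allow(unused)]
fn main() {
use wordchipper::{list_vocabs, list_models};

// Assumption: both functions return iterable collections of name strings.
for name in list_vocabs() {
    println!("short name: {}", name);
}
for name in list_models() {
    println!("full name:  {}", name);
}
}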

Loading from a file path

If you have a .tiktoken file on disk, you can skip the download:

#![allow(unused)]
fn main() {
use wordchipper::pretrained::openai::OATokenizer;

let vocab = OATokenizer::Cl100kBase
    .load_path::<u32>("/path/to/cl100k_base.tiktoken")
    .unwrap();
}

Loading from a reader

For maximum flexibility (e.g., loading from an in-memory buffer or a network stream):

#![allow(unused)]
fn main() {
use std::io::BufReader;
use wordchipper::pretrained::openai::OATokenizer;

let data: &[u8] = b"..."; // tiktoken base64 format
let reader = BufReader::new(data);
let vocab = OATokenizer::O200kBase.read_vocab::<u32, _>(reader).unwrap();
}
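
The same entry point accepts any buffered reader, so you can also hand it a file you open yourself. A sketch (the path is illustrative):

#![allow(unused)]
fn main() {
use std::fs::File;
use std::io::BufReader;
use wordchipper::pretrained::openai::OATokenizer;

// Open the .tiktoken file manually and pass the buffered reader to read_vocab.
let file = File::open("/path/to/o200k_base.tiktoken").unwrap();
let vocab = OATokenizer::O200kBase
    .read_vocab::<u32, _>(BufReader::new(file))
    .unwrap();
}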

Special tokens

Special tokens are strings with reserved token IDs that are never produced by BPE encoding. They're used for control flow: marking end-of-text, fill-in-middle boundaries, prompt boundaries, etc.

#![allow(unused)]
fn main() {
use wordchipper::{load_vocab, disk_cache::WordchipperDiskCache, TokenizerOptions, TokenEncoder};
let mut cache = WordchipperDiskCache::default();
let (_, vocab) = load_vocab("openai:cl100k_base", &mut cache).unwrap();
let tok = TokenizerOptions::default().build(vocab);
let specials = tok.special_vocab();
for (bytes, &id) in specials.span_map().iter() {
    let name = String::from_utf8_lossy(bytes);
    println!("{} -> {}", name, id);
}
}

For cl100k_base, this prints:

<|endoftext|> -> 100257
<|fim_prefix|> -> 100258
<|fim_middle|> -> 100259
<|fim_suffix|> -> 100260
<|endofprompt|> -> 100276

The OATokenizer enum

For programmatic access to all OpenAI models, use the OATokenizer enum:

#![allow(unused)]
fn main() {
use wordchipper::pretrained::openai::OATokenizer;

// Iterate over all models
#[cfg(feature = "std")]
{
use strum::IntoEnumIterator;
for model in OATokenizer::iter() {
    println!("{}", model);
}
}
}

Each variant provides:

  • pattern() - the regex pattern for pre-tokenization
  • special_tokens::<T>() - the special token list
  • load_vocab::<T>(loader) - load the vocabulary with download support
  • load_path::<T>(path) - load from a local file
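
For instance, the enum iterator can be combined with pattern() to dump every model's pre-tokenization regex. A sketch, assuming pattern() returns the regex as a printable string:

#![allow(unused)]
fn main() {
use wordchipper::pretrained::openai::OATokenizer;

#[cfg(feature = "std")]
{
    use strum::IntoEnumIterator;
    // Assumption: pattern() yields the regex as a printable string.
    for model in OATokenizer::iter() {
        let pattern = model.pattern();
        println!("{}: {}", model, pattern);
    }
}
}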

The tiktoken format

OpenAI's vocabulary files use a simple base64 format. Each line contains a base64-encoded byte sequence and its integer token ID, separated by a space:

IQ== 0
Ig== 1
Iw== 2
...

The base64 decodes to the raw bytes that the token represents. This format is what WordchipperDiskCache downloads and caches from OpenAI's CDN.
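
For illustration, the first example line above ("IQ== 0") decodes to the single byte 0x21, i.e. "!". A standalone sketch that parses one line of the format using the widely used base64 crate (not part of wordchipper):

#![allow(unused)]
fn main() {
use base64::Engine;

// One line of the tiktoken format: "<base64 bytes> <token id>".
let line = "IQ== 0";
let (b64, id) = line.split_once(' ').unwrap();
// Assumption: the external `base64` crate is available for decoding.
let bytes = base64::engine::general_purpose::STANDARD.decode(b64).unwrap();
let id: u32 = id.parse().unwrap();
assert_eq!(bytes, b"!"); // 0x21, the byte this example line maps to token 0
assert_eq!(id, 0);
}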

Choosing a model

If you're building a tool that interacts with an OpenAI model, use the matching tokenizer:

| If you use... | Load... |
|---|---|
| GPT-4o, GPT-4o-mini | o200k_base |
| GPT-4, GPT-3.5-turbo | cl100k_base |
| GPT-3 (text-davinci-003) | p50k_base |
| GPT-2 | r50k_base |

If you're building your own model or just need token counting, o200k_base has the largest vocabulary and handles the widest range of languages efficiently.
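
As a hedged sketch of that default choice: loading o200k_base by its short name and building a tokenizer uses only the calls shown earlier on this page; the counting call itself is left commented out because its exact name is not shown here and comes from the TokenEncoder trait.

#![allow(unused)]
fn main() {
use wordchipper::{load_vocab, disk_cache::WordchipperDiskCache, TokenizerOptions, TokenEncoder};

let mut cache = WordchipperDiskCache::default();
let (_, vocab) = load_vocab("o200k_base", &mut cache).unwrap();
let tok = TokenizerOptions::default().build(vocab);
// Hypothetical call for illustration only; substitute whichever encoding or
// counting method the TokenEncoder trait actually exposes.
// let n_tokens = tok.count_tokens("Grüße, 世界!");
}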