History
The Big Project: The Hot Dense Grid
The over-arching goal of the ZSpaceLabs work is something I call the “Hot Dense Grid”.
“The Hot Dense Grid” is shorthand for:
- coherent dense finite-grid space-time dynamics,
- hosted on GPU-accelerated tensors,
- distributed across a cluster of GPUs,
- such that the compute and network throughput is maximized,
- and iterative development cost is minimized.
This problem is interesting because many important physics and engineering problems can be expressed as finite-grid space-time dynamics models, so common foundations can be shared across many different problems.
Interestingly, there’s a very strong overlap between the Hot Dense Grid and distributed AI problems. Both need high-performance tensor processing, cluster GPU acceleration, and effective development tools.
The pathway I’ve been pursuing is to identify the tensor expression environment most closely aligned with high-performance distributed tensor simulation, and then work backwards to identify gateway problems (maximum payoff to the community for minimum investment) in the current Rust tensor/AI stacks that would garner additional R&D toward the long-term problem.
Despite its current popularity for AI development and benchtop notebook evaluation, Python is not a good fit for high-performance computing. It lacks meaningful compiler tooling, strong asynchronous programming, and the kind of strong typing and tooling support needed for rapid development in thread-safe, highly parallel environments.
I’ve explored a number of different approaches to solving this problem,
and I’ve come to the conclusion that the Rust burn
library has the best chance of growing into a viable high-performance
tensor expression environment.
This document is an attempt to record that journey.
The Road So Far
The work spans roughly 15 months, from early 2025 through May 2026, across six repositories. The
through-line is the Hot Dense Grid thesis: build burn into a capable distributed tensor computing
platform, validate the API with real model training and physics simulation, and establish tooling
that makes the ecosystem usable by others.
There are three main threads that run concurrently and reinforce each other:
- Upstream burn contributions: adding missing tensor primitives, fixing API ergonomics, and extending the distributed/training infrastructure.
- bimm + bimm-contracts: an image model training framework that exercises the burn API, acts as a test bench for shape-contract tooling, and pushes toward SOTA image classification.
- wordchipper / zsl-chat: building a production-quality Rust BPE tokenizer (compatible with tiktoken/OpenAI vocabularies) and LLM training infrastructure, which drives further burn optimizer and record API work.
- clockmill: physics simulation demos (Conway’s Game of Life, D2Q9 Lattice-Boltzmann) that demonstrate the `Tensor::unfold()` machinery and prove the folded-window simulation approach works.
February–May 2025: Laying Groundwork in Burn
The earliest contributions establish the tensor primitives that everything else depends on. The focus is on filling gaps that any serious numerical library needs: spatial utilities, linalg operations, and ergonomic slice APIs.
- #3107 ports numpy/pytorch `meshgrid()` to `burn_tensor::grid`, enabling coordinate-grid construction needed for spatial models (see the sketch after this list).
- #3131 adds `burn::linalg::{vector_norm, l2_norm}`, foundational for any normalization-heavy model.
- #3191 adds `meshgrid_stack` as a convenience wrapper.
- #3201 adds `matmul` for `Tensor<_, _, Int>`, needed for integer-domain graph and indexing operations.
- #3221 / #3235 add `Tensor::slice_fill()` and `Tensor::slice_dim()` with the `RangeArg` trait: ergonomic in-place slice assignment that will be used heavily in simulation boundary conditions.
- #3147 extracts `Linear::forward` into `nn::functional::linear`, a structural step toward functional-style layer composition.
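To make the coordinate-grid idea concrete, here is a small standalone sketch of numpy-style `meshgrid` semantics over plain vectors. It is illustrative only and does not use burn’s actual `burn_tensor::grid` API; the function name `meshgrid_xy` is mine.

```rust
/// Illustrative stand-in for numpy-style meshgrid over two 1-D axes.
/// Returns (xs, ys), each of shape [ny][nx], so that (xs[j][i], ys[j][i])
/// is the coordinate of grid cell (i, j). A plain-Rust sketch, not the burn API.
fn meshgrid_xy(x: &[f32], y: &[f32]) -> (Vec<Vec<f32>>, Vec<Vec<f32>>) {
    let xs = y.iter().map(|_| x.to_vec()).collect::<Vec<_>>();
    let ys = y.iter().map(|&yv| vec![yv; x.len()]).collect::<Vec<_>>();
    (xs, ys)
}

fn main() {
    let (xs, ys) = meshgrid_xy(&[0.0, 1.0, 2.0], &[0.0, 1.0]);
    assert_eq!(xs, vec![vec![0.0, 1.0, 2.0], vec![0.0, 1.0, 2.0]]);
    assert_eq!(ys, vec![vec![0.0, 0.0, 0.0], vec![1.0, 1.0, 1.0]]);
}
```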
July 2025: bimm Starts; DataLoader Improvements in Burn
bimm begins (#1) as an industrial-grade image
model training framework. The initial focus is the firehose data pipeline: a columnar operator
DSL for loading and transforming training data, designed to feed burn models efficiently. This
immediately exposes gaps in burn’s dataset tooling.
Concurrently, upstream contributions address those gaps:
- #3281 adds `tensor.roll()` for circular-shift operations needed in spatial models.
- #3390 pre-shuffles multithreaded DataLoaders on shuffle, eliminating a performance bottleneck.
- #3406 adds `SelectionDataset` and refactors `ShuffledDataset`, giving downstream users more composable dataset primitives.
The bimm firehose pipeline takes shape through July: operator registration via inventory,
symbolic CallBuilder API for schema planning, ImageLoader, and early image augmentation stages
(#15–#28).
August 2025: bimm Matures; Burn Activation and Norm Layer Architecture
A concentrated month of API work in both repos.
In burn, the activation and normalization layer architecture gets a major overhaul:
- #3452 replaces the magic `tensor^T` transpose marker with an explicit `.T()` method.
- #3562 lifts `.full()` / `.full_like()` into the base `Tensor` type, adding support for `Tensor<B, D, Bool>`.
- #3603 introduces the `nn.activation` module with a unified `Activation` enum.
- #3619 / #3620 extend `swap_dims`, `permute`, and `flatten` to use `AsIndex` dim arguments, a significant ergonomics improvement for dynamic dimension handling.
- #3625 removes the `D` const generic from `BatchNorm<B, D>`, eliminating a common source of type-level friction.
- #3630 introduces `NormLayer` as a unified normalization abstraction, mirroring what bimm needs.
- #3490 fixes `SamplerDataset` distribution and adds a builder, directly motivated by bimm’s training pipeline needs.
In bimm, the image augmentation pipeline stabilizes
(#58–#73):
speckle noise, ChooseOneStage, a dir-scanning experimental dataset, and the bimm-firehose-image
crate extraction. The SwinTransformer Tiny training
pipeline appears as a real workload driving all this tooling.
The month closes with bimm’s first published releases:
v0.2.4, including
ActivationLayer (#95),
configurable activation functions and RPB MLP support (#96),
DropBlock2d (#91, #97),
and dynamic ModuleMapper/ModuleVisitor wrappers (#94).
September 2025: Tensor::unfold() and the Clockmill Demo
The pivotal month. The Tensor::unfold() PR lands in burn, and immediately becomes the engine for
physics simulation.
In burn:
- #3688 adds `NormalizationConfig::with_num_features()`.
- #3694 adds `Shape::into_iter()`, `into_ranges()`, `to_vec()`, `slice()`, making `Shape` a proper iterable.
- #3751 adds `Tensor::unfold(dim, size, step)`, the key primitive: it creates no-copy sliding-window views over a tensor dimension, enabling vectorized neighborhood operations without data movement (see the sketch after this list).
- #3782 / #3783 fix the unfold underflow edge case and simplify the ndarray backend impl.
- #3785 adds `bool_xor` for boolean tensor logic.
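As a mental model for what `unfold` computes (independent of burn’s actual implementation, which produces views rather than copies), here is a plain-Rust sketch of the 1-D sliding-window semantics; the window count is `(len - size) / step + 1`.

```rust
/// Plain-Rust sketch of 1-D unfold(dim, size, step) semantics.
/// burn's version produces no-copy views; this copies, purely to show
/// which elements each window contains.
fn unfold_1d<T: Copy>(data: &[T], size: usize, step: usize) -> Vec<Vec<T>> {
    if data.len() < size || step == 0 {
        return Vec::new();
    }
    let windows = (data.len() - size) / step + 1;
    (0..windows)
        .map(|w| data[w * step..w * step + size].to_vec())
        .collect()
}

fn main() {
    // 5 elements, window size 3, step 1 -> 3 windows.
    assert_eq!(
        unfold_1d(&[1, 2, 3, 4, 5], 3, 1),
        vec![vec![1, 2, 3], vec![2, 3, 4], vec![3, 4, 5]]
    );
}
```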
clockmill launches (2025-09-23) as a direct demonstration of `Tensor::unfold()`. Conway’s
Game of Life is the first proof: `convolve_func_2d` applies unfold twice (once per spatial
dimension) to produce a `[batch, h_wins, w_wins, c_in, kernel_h, kernel_w]` tensor, running the
entire neighbor-count in a single vectorized operation. The interactive fishbowl visualization
confirms the approach is fast enough for real-time rendering.
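To make the folded-window semantics concrete, here is a scalar reference for the quantity the double-unfold computes in one shot: for each cell, the sum of its 3x3 window minus the centre gives the live-neighbor count. This is a loop-based illustration only, not the clockmill code, and boundary handling is omitted; the tensor version replaces all four loops with the double unfold followed by a sum over the two trailing window dimensions.

```rust
/// Scalar reference for the neighbor count that the unfold expression vectorizes.
/// Assumes a non-empty grid; boundary cells are left at zero for brevity.
fn neighbor_counts(grid: &[Vec<u8>]) -> Vec<Vec<u8>> {
    let (h, w) = (grid.len(), grid[0].len());
    let mut counts = vec![vec![0u8; w]; h];
    for y in 1..h - 1 {
        for x in 1..w - 1 {
            // Sum the 3x3 window centred on (y, x) ...
            let mut sum = 0u8;
            for dy in 0..3 {
                for dx in 0..3 {
                    sum += grid[y - 1 + dy][x - 1 + dx];
                }
            }
            // ... then subtract the centre cell to get the live-neighbor count.
            counts[y][x] = sum - grid[y][x];
        }
    }
    counts
}

fn main() {
    // A vertical blinker in a 5x5 grid: the centre cell has 2 live neighbors.
    let mut grid = vec![vec![0u8; 5]; 5];
    for y in 1..4 {
        grid[y][2] = 1;
    }
    assert_eq!(neighbor_counts(&grid)[2][2], 2);
}
```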
bimm-contracts (#1) splits into its
own repo, providing shape-contract macros (`unpack_shape_contract!`, `assert_shape_contract_periodically!`)
that both bimm and clockmill depend on.
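The idea behind a shape contract is a runtime (or periodically sampled) assertion that a tensor’s dimensions satisfy a named pattern, with the matched sizes unpacked for later use. The sketch below only illustrates that idea; it is not the bimm-contracts macro API, and `unpack_shape` is a hypothetical stand-in.

```rust
/// Conceptual sketch of a shape contract: bind each named dimension in the
/// pattern to the concrete size, and fail if a name is bound inconsistently.
/// Not the bimm-contracts API, just the idea behind it.
fn unpack_shape<'a>(
    pattern: &[&'a str],
    shape: &[usize],
) -> Result<Vec<(&'a str, usize)>, String> {
    if pattern.len() != shape.len() {
        return Err(format!("rank mismatch: {} vs {}", pattern.len(), shape.len()));
    }
    let mut bindings: Vec<(&str, usize)> = Vec::new();
    for (&name, &size) in pattern.iter().zip(shape.iter()) {
        let existing = bindings.iter().find(|&&(n, _)| n == name).map(|&(_, s)| s);
        match existing {
            Some(bound) if bound != size => {
                return Err(format!("dim '{name}' bound to {bound}, saw {size}"));
            }
            Some(_) => {}
            None => bindings.push((name, size)),
        }
    }
    Ok(bindings)
}

fn main() {
    // e.g. an activation with contract [b, s, s, c], where the two spatial dims must agree:
    let dims = unpack_shape(&["b", "s", "s", "c"], &[8, 224, 224, 3]).unwrap();
    assert_eq!(dims, vec![("b", 8usize), ("s", 224), ("c", 3)]);
    assert!(unpack_shape(&["b", "s", "s", "c"], &[8, 224, 192, 3]).is_err());
}
```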
October 2025: LBM Fluid Simulation; Burn API Expansion
clockmill scales from cellular automata to a full D2Q9 Lattice-Boltzmann fluid dynamics
simulation. The `dist_windows` function (`dist.unfold(0, 3, 1).unfold(1, 3, 1)`) creates
`[H, W, VY, VX, WIN_Y, WIN_X]` neighbor windows for every cell simultaneously, making the LBM
streaming step a single tensor operation. BGK collision, spatially-varying omega relaxation,
solid-mask boundaries, and energy conservation checks follow. Demo videos are published.
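For reference, the collision step described above implements the standard single-relaxation-time BGK update (standard LBM background rather than a transcription of the clockmill code). With distributions $f_i$, lattice velocities $\mathbf{e}_i$, and relaxation rate $\omega = 1/\tau$:

$$
f_i(\mathbf{x} + \mathbf{e}_i\,\Delta t,\; t + \Delta t)
  = f_i(\mathbf{x}, t) - \omega \left[ f_i(\mathbf{x}, t) - f_i^{\mathrm{eq}}(\mathbf{x}, t) \right],
$$

with the D2Q9 equilibrium distribution

$$
f_i^{\mathrm{eq}} = w_i\,\rho \left[ 1 + \frac{\mathbf{e}_i \cdot \mathbf{u}}{c_s^2}
  + \frac{(\mathbf{e}_i \cdot \mathbf{u})^2}{2 c_s^4}
  - \frac{\mathbf{u} \cdot \mathbf{u}}{2 c_s^2} \right],
\qquad c_s^2 = \tfrac{1}{3},
$$

where $\rho = \sum_i f_i$, $\rho\,\mathbf{u} = \sum_i f_i\,\mathbf{e}_i$, and $w_i$ are the standard D2Q9 weights ($4/9$ for the rest direction, $1/9$ for the axis directions, $1/36$ for the diagonals). The left-hand shift of $f_i$ to the neighboring cell is the streaming step that the unfolded neighbor windows vectorize.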
Burn sees a dense cluster of API improvements, most directly motivated by the simulation and model work:
- #3797 `Tensor::num_dims()`.
- #3811 `Tensor::<agg>_dims()` aggregation variants.
- #3817 `Tensor::sum_and_squeeze_dims()`.
- #3869 `cautious_weight_decay` for AdamW, relevant for stable training of vision models.
- #3879 `Shape::ravel` for row-major index raveling (see the sketch after this list).
- #3923 generalizes `linalg::outer` semantics and adds `outer_dim`.
- #3953 adds comparable Int/Float dtypes, no-op casts, and `*_like()` dtype preservation, important for mixed-precision simulation.
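Row-major index raveling is the usual strided flattening of a multi-index into a single offset. As a conceptual illustration (not the burn `Shape::ravel` implementation; `ravel_index` is a hypothetical name):

```rust
/// Conceptual sketch of row-major ("C order") index raveling: the flat offset
/// of a multi-index under a shape, computed Horner-style.
fn ravel_index(index: &[usize], shape: &[usize]) -> usize {
    index
        .iter()
        .zip(shape)
        .fold(0, |flat, (&i, &dim)| flat * dim + i)
}

fn main() {
    // For shape [2, 3, 4], index [1, 2, 3] -> ((1 * 3) + 2) * 4 + 3 = 23.
    assert_eq!(ravel_index(&[1, 2, 3], &[2, 3, 4]), 23);
}
```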
November–December 2025: Slice/Shape/Distributed Infrastructure
With the core tensor API in good shape, attention turns to the distributed compute layer and shape/slice ergonomics needed for production training pipelines.
- #3983 implements `FromStr` for `Slice`, enabling slice specification via config strings (see the sketch after this list).
- #4041 adds warmup epochs to `MetricEarlyStoppingStrategy`.
- #4042 implements a full `Slice` iterator and utility methods.
- #4113 refactors `RemoteDevice` to use a thread-safe global address registry, a prerequisite for multi-node training.
- #4127 adds `slice_dyn`, `slice_assign_dyn`, and `slice_fill_dyn`, dynamic rank-agnostic slice operations needed by simulation code.
- #4157 adds `tracing::instrument` to collective operations for distributed debugging.
- #4189 adds `flatten_dims` to `Shape`.
- #4196 replaces `canonicalize_dim` with `expect_dim` for cleaner error messages.
- #4218 consolidates shape and slice error handling into `ExpressionError`.
- #4221 unifies `ReshapeArgs` / `Shape::reshape()`.
- #4234 plumbs `tracing` as a deep optional feature.
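The exact string format accepted by burn’s `FromStr` impl for `Slice` is not specified here; as a purely hypothetical illustration of the idea, a Python-style `start:end:step` parser might look like:

```rust
/// Hypothetical illustration of slice-from-string parsing with a Python-style
/// "start:end:step" syntax. The actual format accepted by burn's `Slice` may
/// differ; this only shows the shape of the idea.
#[derive(Debug, PartialEq)]
struct SliceSpec {
    start: Option<i64>,
    end: Option<i64>,
    step: Option<i64>,
}

fn parse_slice(s: &str) -> Result<SliceSpec, std::num::ParseIntError> {
    let mut parts = s.splitn(3, ':').map(|p| {
        // Empty components mean "unspecified", as in Python slicing.
        if p.is_empty() { Ok(None) } else { p.trim().parse::<i64>().map(Some) }
    });
    Ok(SliceSpec {
        start: parts.next().transpose()?.flatten(),
        end: parts.next().transpose()?.flatten(),
        step: parts.next().transpose()?.flatten(),
    })
}

fn main() {
    assert_eq!(
        parse_slice("1:10:2").unwrap(),
        SliceSpec { start: Some(1), end: Some(10), step: Some(2) }
    );
    assert_eq!(
        parse_slice(":-1").unwrap(),
        SliceSpec { start: None, end: Some(-1), step: None }
    );
}
```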
January 2026: Tokenizer Work Begins; wordchipper Is Born
The work pivots toward LLM infrastructure. A BPE tokenizer (wordchuck, later renamed
wordchipper) begins in zsl-chat and quickly spins out into its own repo.
In zsl-chat, the January work establishes the full tokenizer stack:
TextSegmentor (#18),
byte tables and ByteTokenTable (#61),
BPE pair training (BinaryPairVocabTrainer → bpe_trainer
#63),
UnifiedTokenVocab (#79),
OpenAI GPT vocabulary special-token handling
(#52),
and a critical bug fix in the diff calculation
(#78 — “Eureka. My diff calc was wrong!”).
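For background, the core of BPE pair training is: count adjacent token pairs across the corpus, merge the most frequent pair into a new token, and repeat until the vocabulary budget is reached. The sketch below shows one generic merge step; it is not the `bpe_trainer` or wordchipper implementation.

```rust
use std::collections::HashMap;

/// Generic sketch of one BPE merge step: find the most frequent adjacent pair
/// in a token sequence and replace every occurrence with a new token id.
fn merge_most_frequent(tokens: &[u32], next_id: u32) -> (Vec<u32>, Option<(u32, u32)>) {
    // Count adjacent pairs.
    let mut counts: HashMap<(u32, u32), usize> = HashMap::new();
    for pair in tokens.windows(2) {
        *counts.entry((pair[0], pair[1])).or_insert(0) += 1;
    }
    let Some((&best, _)) = counts.iter().max_by_key(|entry| *entry.1) else {
        return (tokens.to_vec(), None);
    };
    // Replace every occurrence of the winning pair with the new token id.
    let mut merged = Vec::with_capacity(tokens.len());
    let mut i = 0;
    while i < tokens.len() {
        if i + 1 < tokens.len() && (tokens[i], tokens[i + 1]) == best {
            merged.push(next_id);
            i += 2;
        } else {
            merged.push(tokens[i]);
            i += 1;
        }
    }
    (merged, Some(best))
}

fn main() {
    // "abab" as bytes: the pair (97, 98) is most frequent and becomes token 256.
    let (merged, pair) = merge_most_frequent(&[97, 98, 97, 98], 256);
    assert_eq!(pair, Some((97, 98)));
    assert_eq!(merged, vec![256u32, 256]);
}
```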
By late January, wordchipper exists as its own repo with initial crates, disk cache, pretrained
OpenAI model loaders, and the first public release. Test coverage and API stabilization work runs
through the end of the month.
Burn contributions this month focus on training infrastructure.
February 2026: wordchipper Reaches Production Quality
An intense month of performance and API work in wordchipper. The major milestones:
- #38: a massive overhaul tracking down a speed regression vs `tiktoken-rs`. Root cause: a wrong version of `fancy_regex`. Nearly all `Arc` usage eliminated; `TextSegmentor` overhauled. This brings wordchipper to competitive performance.
- #44 / #47 refactor encoders toward span-based encoding, abstracting over a common base.
- #45 adds `LruPoolToy` for thread-local caching.
- #185 adds accelerated DFA-backed lexer support using `regex-automata`, significantly improving segmentation throughput.
- #192 enables parallel processing via Rayon in `TokenizerOptions` (see the sketch after this list).
- #221 adds the `wordchipper-cli` crate with a full CLI.
- #238 adds the `train` command to the CLI.
- #267 / #272 add Python bindings via PyO3, exporting `TokenizerOptions` and core encoding/decoding to Python.
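The Rayon integration follows the standard data-parallel pattern: encode many documents with a parallel map. The sketch below assumes the `rayon` crate and a stand-in `encode` function; it is not the wordchipper API.

```rust
use rayon::prelude::*;

/// Placeholder "tokenizer": one token per whitespace-separated word (its length).
/// A stand-in for a real per-document encoder.
fn encode(text: &str) -> Vec<u32> {
    text.split_whitespace().map(|w| w.len() as u32).collect()
}

/// Data-parallel batch tokenization: each document is encoded independently,
/// so a parallel map over the batch scales with the core count.
fn encode_batch(texts: &[String]) -> Vec<Vec<u32>> {
    texts.par_iter().map(|t| encode(t)).collect()
}

fn main() {
    let docs = vec!["hello world".to_string(), "parallel tokenization".to_string()];
    let encoded = encode_batch(&docs);
    assert_eq!(encoded, vec![vec![5u32, 5], vec![8, 12]]);
}
```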
Multiple releases land: v0.2.1, v0.4.0, v0.5.0, v0.6.2, v0.7.0.
March 2026: Benchmarks, Cross-Platform, and Python Parity
wordchipper shifts into benchmarking and comparison work. The goal is to establish that it is a
viable drop-in for tiktoken with significant performance advantages on multi-core workloads.
- Benchmark tooling (#284, #287–#297) builds a full reporting and plotting pipeline with SVG/CSV output.
- #305 adds HuggingFace tokenizers as a benchmark comparison target.
- Cross-platform benchmark data lands for M2 Mac (#296) and AMD (#337, #342).
- #358 adds `allowed_specials` support to `TextSpanner`, matching tiktoken’s interface.
- #365 / #368 complete Python bindings with compat test suite.
- v0.9 pre-release (#345), v1.0 release (#380).
April 2026: LLM Training Infrastructure; Burn Optimizer Work
Attention shifts to zsl-chat as an LLM training framework, and burn’s optimizer and record APIs
get the extensions needed to support it.
In burn, the multi-group optimizer work:
- #4818 preps the optimizer infrastructure for group-based multi-optimizer support.
- #4822 cleans up the `OptimizerAdaptor` / `GradAdaptor` API.
- #4823 removes the unused `M` type parameter from `SimpleOptimizerMapper`.
- #4825 adds `Record<(R0,)>` for single-element tuple records.
In zsl-chat, the LLM architecture takes shape:
- #106–#109 elaborate the DataLoader machinery around iterators and shuffle configuration.
- #115–#118 build the module tree builder and XPath-style parameter selection (`Linear/*[2]` syntax for selecting specific parameter groups in a model tree).
- #122 adds per-group learning rate selectors with fixed and named LR support (see the sketch after this list).
- #123 begins nanochat-alike partial tuning experiments.
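The underlying idea of the selector work is to partition a model’s named parameters into groups by path and give each group its own learning rate. The sketch below is a conceptual illustration only; it does not reproduce zsl-chat’s XPath-style selector syntax, and `split_groups` and the example paths are hypothetical.

```rust
/// Conceptual sketch of per-group learning-rate selection: partition named
/// parameters by a path predicate and give each group its own LR.
struct ParamGroup<'a> {
    names: Vec<&'a str>,
    lr: f64,
}

fn split_groups<'a>(param_paths: &[&'a str]) -> Vec<ParamGroup<'a>> {
    // Split parameters into "head" vs everything else by path prefix.
    let (head, rest): (Vec<&str>, Vec<&str>) = param_paths
        .iter()
        .partition(|p| p.starts_with("head/"));
    vec![
        ParamGroup { names: head, lr: 1e-3 }, // freshly initialized head: higher LR
        ParamGroup { names: rest, lr: 1e-5 }, // pretrained backbone: gentle fine-tune
    ]
}

fn main() {
    let paths = ["embed/weight", "block0/Linear/weight", "head/Linear/weight"];
    let groups = split_groups(&paths);
    assert_eq!(groups[0].names, vec!["head/Linear/weight"]);
    assert_eq!(groups[1].names.len(), 2);
    assert!(groups[0].lr > groups[1].lr);
}
```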
May 2026: LrScheduler and Record API Completion
The final burn PRs in this timeline complete the infrastructure needed for full LLM fine-tuning workflows:
- #4881 adds `ParamId::try_deserialize()`, enabling checkpoint loading with partial parameter matching, essential for transfer learning and partial fine-tuning.
- #4905 adds `Clone + 'static` bounds to `LrScheduler::Record` and derives `Clone` for scheduler records, a prerequisite for serializing and restoring training state across multi-group optimizers.
These land as zsl-chat experiments with partial fine-tuning of chat models — the first real
end-to-end LLM training runs against the infrastructure that has been building since July 2025.