The Big Project: The Hot Dense Grid

The overarching goal of the ZSpaceLabs work is something I call the “Hot Dense Grid”.

“The Hot Dense Grid” is shorthand for:

  • coherent dense finite-grid space-time dynamics,
  • hosted on GPU-accelerated tensors,
  • distributed across a cluster of GPUs,
  • such that the compute and network throughput is maximized,
  • and iterative development cost is minimized.

This problem is interesting because many important physics and engineering problems can be represented as finite-grid space-time dynamics models, so common foundations can be shared across many different problems.

Interestingly, there’s a very strong overlap between the Hot Dense Grid and distributed AI problems. Both need high-performance tensor processing, cluster GPU acceleration, and effective development tools.

The pathway I’ve been pursuing is to identify the tensor expression environment most closely aligned with high-performance distributed tensor simulation, and then work backwards to identify gateway problems (maximum payoff to the community for minimum investment) in the current Rust tensor/AI stacks that would attract additional R&D toward the long-term goal.

Despite its current popularity for AI development and benchtop notebook evaluation, Python is not a good fit for high-performance computing. It lacks meaningful compiler tooling, strong asynchronous programming support, and the strong typing and tooling needed for rapid development of thread-safe, highly parallel environments.

I’ve explored a number of different approaches to this problem, and I’ve come to the conclusion that the Rust burn library has the best chance of growing into a viable high-performance tensor expression environment.

This document is an attempt to chronicle that journey.

The Road So Far

The work spans roughly 15 months, from early 2025 through May 2026, across six repositories. The through-line is the Hot Dense Grid thesis: build burn into a capable distributed tensor computing platform, validate the API with real model training and physics simulation, and establish tooling that makes the ecosystem usable by others.

There are four main threads that run concurrently and reinforce each other:

  1. Upstream burn contributions — adding missing tensor primitives, fixing API ergonomics, and extending the distributed/training infrastructure.
  2. bimm + bimm-contracts — an image model training framework that exercises the burn API, acts as a test bench for shape-contract tooling, and pushes toward SOTA image classification.
  3. wordchipper / zsl-chat — building a production-quality Rust BPE tokenizer (compatible with tiktoken/OpenAI vocabularies) and LLM training infrastructure, which drives further burn optimizer and record API work.
  4. clockmill — physics simulation demos (Conway’s Game of Life, D2Q9 Lattice-Boltzmann) that demonstrate the Tensor::unfold() machinery and prove the folded-window simulation approach works.

February–May 2025: Laying Groundwork in Burn

The earliest contributions establish the tensor primitives that everything else depends on. The focus is on filling gaps that any serious numerical library needs: spatial utilities, linalg operations, and ergonomic slice APIs.

  • #3107 ports numpy/pytorch meshgrid() to burn_tensor::grid, enabling coordinate-grid construction needed for spatial models.
  • #3131 adds burn::linalg::{vector_norm, l2_norm}, foundational for any normalization-heavy model.
  • #3191 adds meshgrid_stack as a convenience wrapper.
  • #3201 adds matmul for Tensor<_, _, Int>, needed for integer-domain graph and indexing operations.
  • #3221 / #3235 add Tensor::slice_fill() and Tensor::slice_dim() with the RangeArg trait — ergonomic in-place slice assignment that will be used heavily in simulation boundary conditions (see the sketch after this list).
  • #3147 extracts Linear::forward into nn::functional::linear, a structural step toward functional-style layer composition.
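
To make the payoff concrete, here is a minimal sketch of the boundary-condition idiom these slice APIs enable. The call shape is an assumption based on the descriptions above (slice-style ranges plus a fill value); see #3221/#3235 for the exact RangeArg forms.

```rust
use burn::tensor::{backend::Backend, Tensor};

/// Pin Dirichlet-style wall values on a 2-D simulation grid. A sketch:
/// slice_fill() is assumed to take the same range arguments as slice()
/// plus the fill value, per the PR descriptions above.
fn pin_walls<B: Backend>(grid: Tensor<B, 2>, wall: f32) -> Tensor<B, 2> {
    let [h, w] = grid.dims();
    grid.slice_fill([0..1, 0..w], wall)     // top row
        .slice_fill([h - 1..h, 0..w], wall) // bottom row
        .slice_fill([0..h, 0..1], wall)     // left column
        .slice_fill([0..h, w - 1..w], wall) // right column
}
```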

July 2025: bimm Starts; DataLoader Improvements in Burn

bimm begins (#1) as an industrial-grade image model training framework. The initial focus is the firehose data pipeline: a columnar operator DSL for loading and transforming training data, designed to feed burn models efficiently. This immediately exposes gaps in burn’s dataset tooling.

Concurrently, upstream contributions address those gaps:

  • #3281 adds tensor.roll() for circular-shift operations needed in spatial models (see the sketch after this list).
  • #3390 pre-shuffles multithreaded DataLoaders on shuffle, eliminating a performance bottleneck.
  • #3406 adds SelectionDataset and refactors ShuffledDataset, giving downstream users more composable dataset primitives.
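
As a taste of what tensor.roll() buys spatial code, here is the wrap-around shift that gives a grid toroidal topology. The signature below assumes a PyTorch-style roll(shifts, dims); check #3281 for the actual argument types.

```rust
use burn::tensor::{backend::Backend, Tensor};

/// Shift a 2-D grid one cell "north" with wrap-around: each cell takes
/// the value of the row below it, and the last row wraps to the top.
/// (PyTorch-style (shifts, dims) arguments are an assumption here.)
fn shift_north<B: Backend>(grid: Tensor<B, 2>) -> Tensor<B, 2> {
    grid.roll(&[-1], &[0])
}
```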

The bimm firehose pipeline takes shape through July: operator registration via inventory, symbolic CallBuilder API for schema planning, ImageLoader, and early image augmentation stages (#15–#28).


August 2025: bimm Matures; Burn Activation and Norm Layer Architecture

A concentrated month of API work in both repos.

In burn, the activation and normalization layer architecture gets a major overhaul:

  • #3452 replaces the magic tensor^T transpose marker with an explicit .T() method.
  • #3562 lifts .full()/.full_like() into the base Tensor type, adding support for Tensor<B, D, Bool>.
  • #3603 introduces the nn.activation module with a unified Activation enum.
  • #3619 / #3620 extend swap_dims, permute, and flatten to use AsIndex dim arguments — a significant ergonomics improvement for dynamic dimension handling (see the sketch after this list).
  • #3625 removes the D const generic from BatchNorm<B, D>, eliminating a common source of type-level friction.
  • #3630 introduces NormLayer as a unified normalization abstraction, mirroring what bimm needs.
  • #3490 fixes SamplerDataset distribution and adds a builder — directly motivated by bimm’s training pipeline needs.
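
A sketch of the AsIndex ergonomics, under the assumption (from the PR framing) that it accepts Python-style negative dimension indices:

```rust
use burn::tensor::{backend::Backend, Tensor};

/// Transpose the last two dimensions of a rank-4 tensor. With AsIndex
/// dims, -1 and -2 count from the end, so the call no longer has to
/// spell out rank-dependent absolute indices. (Negative-index support
/// is assumed from the #3619/#3620 descriptions.)
fn transpose_last_two<B: Backend>(x: Tensor<B, 4>) -> Tensor<B, 4> {
    x.swap_dims(-1, -2)
}
```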

In bimm, the image augmentation pipeline stabilizes (#58–#73): speckle noise, ChooseOneStage, a dir-scanning experimental dataset, and the bimm-firehose-image crate extraction. The SwinTransformer Tiny training pipeline appears as a real workload driving all this tooling.

The month closes with bimm’s first published releases, culminating in v0.2.4: ActivationLayer (#95), configurable activation functions and RPB MLP support (#96), DropBlock2d (#91, #97), and dynamic ModuleMapper/ModuleVisitor wrappers (#94).


September 2025: Tensor::unfold() and the Clockmill Demo

The pivotal month. The Tensor::unfold() PR lands in burn, and immediately becomes the engine for physics simulation.

In burn:

  • #3688 adds NormalizationConfig::with_num_features().
  • #3694 adds Shape::into_iter(), into_ranges(), to_vec(), slice() — making Shape a proper iterable.
  • #3751 Tensor::unfold(dim, size, step) — the key primitive. Creates no-copy sliding-window views over a tensor dimension, enabling vectorized neighborhood operations without data movement.
  • #3782 / #3783 fix the unfold underflow edge case and simplify the ndarray backend impl.
  • #3785 adds bool_xor for boolean tensor logic.

clockmill launches (2025-09-23) as a direct demonstration of Tensor::unfold(). Conway’s Game of Life is the first proof: convolve_func_2d applies unfold twice — once per spatial dimension — to produce a [batch, h_wins, w_wins, c_in, kernel_h, kernel_w] tensor, running the entire neighbor-count in a single vectorized operation. The interactive fishbowl visualization confirms the approach is fast enough for real-time rendering.
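
Concretely, the neighbor count looks roughly like the sketch below. This is an illustration of the double-unfold approach, not clockmill’s actual convolve_func_2d; it assumes bool_and/bool_or exist alongside the bool_xor added in #3785, and it updates only the grid interior (boundary handling omitted).

```rust
use burn::tensor::{backend::Backend, Tensor};

/// One Game of Life step for the interior of an [h, w] grid of 0.0/1.0
/// cells, via two no-copy unfolds.
fn life_step<B: Backend>(grid: Tensor<B, 2>) -> Tensor<B, 2> {
    let [h, w] = grid.dims();

    // [h-2, w-2, 3, 3]: every interior cell sees its full 3x3
    // neighborhood at once, without any data movement.
    let windows: Tensor<B, 4> = grid.clone().unfold(0, 3, 1).unfold(1, 3, 1);

    // Sum each window, subtract the center: the neighbor count for every
    // interior cell falls out of one vectorized reduction.
    let window_sums = windows.sum_dim(3).sum_dim(2).reshape([h - 2, w - 2]);
    let center = grid.slice([1..h - 1, 1..w - 1]);
    let neighbors = window_sums - center.clone();

    // Life rule: born with exactly 3 neighbors, or alive with exactly 2.
    let born = neighbors.clone().equal_elem(3.0);
    let survives = neighbors.equal_elem(2.0).bool_and(center.equal_elem(1.0));
    born.bool_or(survives).float()
}
```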

bimm-contracts (#1) splits into its own repo, providing shape-contract macros (unpack_shape_contract!, assert_shape_contract_periodically!) that both bimm and clockmill depend on.


October 2025: LBM Fluid Simulation; Burn API Expansion

clockmill scales from cellular automata to a full D2Q9 Lattice-Boltzmann fluid dynamics simulation. The dist_windows function (dist.unfold(0, 3, 1).unfold(1, 3, 1)) creates [H, W, VY, VX, WIN_Y, WIN_X] neighbor windows for every cell simultaneously, making the LBM streaming step a single tensor operation. BGK collision, spatially-varying omega relaxation, solid-mask boundaries, and energy conservation checks follow. Demo videos are published.
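
Since windowed streaming is the heart of the method, here is an illustrative version (not clockmill’s dist_windows code) under simplifying assumptions: populations stored as [h, w, 3, 3] with the D2Q9 velocities laid out on a 3x3 grid (index 0/1/2 meaning velocity -1/0/+1), interior cells only, boundaries handled separately.

```rust
use burn::tensor::{backend::Backend, Tensor};

/// LBM streaming over unfold windows: the population with velocity
/// (vy, vx) arriving at a cell is the one that left the neighbor at
/// spatial offset (-vy, -vx), i.e. window position (2 - iy, 2 - ix).
fn stream<B: Backend>(dist: Tensor<B, 4>) -> Tensor<B, 4> {
    let [h, w, _, _] = dist.dims();

    // [h-2, w-2, 3, 3, 3, 3]: each interior cell's full 3x3 spatial
    // neighborhood of populations, as a no-copy view.
    let windows: Tensor<B, 6> = dist.unfold(0, 3, 1).unfold(1, 3, 1);

    let mut planes = Vec::with_capacity(9);
    for iy in 0..3usize {
        for ix in 0..3usize {
            let plane: Tensor<B, 2> = windows
                .clone()
                .slice([
                    0..h - 2,
                    0..w - 2,
                    iy..iy + 1,
                    ix..ix + 1,
                    (2 - iy)..(3 - iy),
                    (2 - ix)..(3 - ix),
                ])
                .reshape([h - 2, w - 2]);
            planes.push(plane);
        }
    }
    // Reassemble the streamed populations as [h-2, w-2, 3, 3].
    Tensor::stack::<3>(planes, 2).reshape([h - 2, w - 2, 3, 3])
}
```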

Burn sees a dense cluster of API improvements, most directly motivated by the simulation and model work:

  • #3797 Tensor::num_dims().
  • #3811 Tensor::<agg>_dims() aggregation variants.
  • #3817 Tensor::sum_and_squeeze_dims().
  • #3869 cautious_weight_decay for AdamW — relevant for stable training of vision models.
  • #3879 Shape::ravel for row-major index raveling.
  • #3923 generalizes linalg::outer semantics and adds outer_dim.
  • #3953 adds comparable Int/Float dtypes, no-op casts, and *_like() dtype preservation — important for mixed-precision simulation.

November–December 2025: Slice/Shape/Distributed Infrastructure

With the core tensor API in good shape, attention turns to the distributed compute layer and shape/slice ergonomics needed for production training pipelines.

  • #3983 implements FromStr for Slice, enabling slice specification via config strings (see the sketch after this list).
  • #4041 adds warmup epochs to MetricEarlyStoppingStrategy.
  • #4042 implements a full Slice iterator and utility methods.
  • #4113 refactors RemoteDevice to use a thread-safe global address registry — a prerequisite for multi-node training.
  • #4127 adds slice_dyn, slice_assign_dyn, and slice_fill_dyn — dynamic rank-agnostic slice operations needed by simulation code.
  • #4157 adds tracing::instrument to collective operations for distributed debugging.
  • #4189 adds flatten_dims to Shape.
  • #4196 replaces canonicalize_dim with expect_dim for cleaner error messages.
  • #4218 consolidates shape and slice error handling into ExpressionError.
  • #4221 unifies ReshapeArgs / Shape::reshape().
  • #4234 plumbs tracing as a deep optional feature.
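
For example, the FromStr work lets training configs carry slice specs as plain strings. The grammar below is an assumption (mirroring Rust range syntax); see #3983 for the real one.

```rust
use burn::tensor::Slice;

/// Parse a slice spec like "1..-1" (trim one element from each end) out
/// of a config file. The exact accepted grammar is assumed here.
fn slice_from_config(spec: &str) -> Slice {
    spec.parse::<Slice>().expect("valid slice spec in config")
}
```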

January 2026: Tokenizer Work Begins; wordchipper Is Born

The work pivots toward LLM infrastructure. A BPE tokenizer (wordchuck, later renamed wordchipper) begins in zsl-chat and quickly spins out into its own repo.

In zsl-chat, the January work establishes the full tokenizer stack: TextSegmentor (#18), byte tables and ByteTokenTable (#61), BPE pair training (BinaryPairVocabTrainer → bpe_trainer, #63), UnifiedTokenVocab (#79), OpenAI GPT vocabulary special-token handling (#52), and a critical bug fix in the diff calculation (#78 — “Eureka. My diff calc was wrong!”).
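
The core of BPE pair training is a simple loop: count adjacent token pairs, merge the most frequent pair, repeat. The sketch below is the textbook algorithm, not wordchipper’s BinaryPairVocabTrainer:

```rust
use std::collections::HashMap;

/// One merge round of textbook BPE training over words-as-token-lists:
/// count adjacent pairs, fuse the winner into `next_id`, and report
/// which pair was merged. (Ties are broken arbitrarily here; a real
/// trainer defines a deterministic order.)
fn merge_round(words: &mut Vec<Vec<u32>>, next_id: u32) -> Option<(u32, u32)> {
    let mut counts: HashMap<(u32, u32), usize> = HashMap::new();
    for word in words.iter() {
        for pair in word.windows(2) {
            *counts.entry((pair[0], pair[1])).or_default() += 1;
        }
    }
    let (&best, _) = counts.iter().max_by_key(|(_, &c)| c)?;

    // Replace every occurrence of the winning pair with the new token.
    for word in words.iter_mut() {
        let mut merged = Vec::with_capacity(word.len());
        let mut i = 0;
        while i < word.len() {
            if i + 1 < word.len() && (word[i], word[i + 1]) == best {
                merged.push(next_id);
                i += 2;
            } else {
                merged.push(word[i]);
                i += 1;
            }
        }
        *word = merged;
    }
    Some(best)
}
```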

By late January, wordchipper exists as its own repo with initial crates, disk cache, pretrained OpenAI model loaders, and the first public release. Test coverage and API stabilization work runs through the end of the month.

Burn contributions this month focus on training infrastructure:

  • #4288 refactors dop_timer for warmup trials.
  • #4318 performance tweaks to lp_norm.

February 2026: wordchipper Reaches Production Quality

An intense month of performance and API work in wordchipper. The major milestones:

  • #38 — a massive overhaul tracking down a speed regression vs tiktoken-rs. Root cause: a wrong version of fancy_regex. Nearly all Arc usage eliminated; TextSegmentor overhauled. This brings wordchipper to competitive performance.
  • #44 / #47 refactor encoders toward span-based encoding, abstracting over a common base.
  • #45 adds LruPoolToy for thread-local caching.
  • #185 adds accelerated DFA-backed lexer support using regex-automata, significantly improving segmentation throughput (see the sketch after this list).
  • #192 enables parallel processing via Rayon in TokenizerOptions.
  • #221 adds the wordchipper-cli crate with a full CLI.
  • #238 adds the train command to the CLI.
  • #267 / #272 add Python bindings via PyO3, exporting TokenizerOptions and core encoding/decoding to Python.
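
For a feel of the DFA-backed approach, here is a toy segmenter on regex-automata’s dense DFA API. It is illustrative only: neither wordchipper’s lexer nor the real GPT segmentation pattern, just the anchored scan-advance loop that avoids backtracking entirely.

```rust
use regex_automata::{
    dfa::{dense, Automaton},
    Anchored, Input,
};

/// Split text into word/number/whitespace/other spans with one anchored
/// DFA search per span. (In real code the DFA would be built once, not
/// per call, and the pattern would be the full segmentation ruleset.)
fn segment(text: &str) -> Vec<&str> {
    let dfa = dense::DFA::new(r"[A-Za-z]+|[0-9]+|\s+|.").unwrap();
    let mut spans = Vec::new();
    let mut at = 0;
    while at < text.len() {
        let input = Input::new(text).range(at..).anchored(Anchored::Yes);
        let end = dfa
            .try_search_fwd(&input)
            .expect("search failed")
            .map(|m| m.offset())
            .unwrap_or(at + 1); // toy fallback; unreachable for this pattern
        spans.push(&text[at..end]);
        at = end;
    }
    spans
}
```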

Multiple releases land: v0.2.1, v0.4.0, v0.5.0, v0.6.2, v0.7.0.


March 2026: Benchmarks, Cross-Platform, and Python Parity

wordchipper shifts into benchmarking and comparison work. The goal is to establish it as a viable drop-in replacement for tiktoken, with significant performance advantages on multi-core workloads.

  • Benchmark tooling (#284, #287–#297) builds a full reporting and plotting pipeline with SVG/CSV output.
  • #305 adds HuggingFace tokenizers as a benchmark comparison target.
  • Cross-platform benchmark data lands for M2 Mac (#296) and AMD (#337, #342).
  • #358 adds allowed_specials support to TextSpanner, matching tiktoken’s interface.
  • #365 / #368 complete Python bindings with compat test suite.
  • v0.9 pre-release (#345), v1.0 release (#380).

April 2026: LLM Training Infrastructure; Burn Optimizer Work

Attention shifts to zsl-chat as an LLM training framework, and burn’s optimizer and record APIs get the extensions needed to support it.

In burn, the multi-group optimizer work:

  • #4818 preps the optimizer infrastructure for group-based multi-optimizer support.
  • #4822 cleans up OptimizerAdaptor / GradAdaptor API.
  • #4823 removes the unused M type parameter from SimpleOptimizerMapper.
  • #4825 adds Record<(R0,)> for single-element tuple records.

In zsl-chat, the LLM architecture takes shape:

  • #106–#109 elaborate the DataLoader machinery around iterators and shuffle configuration.
  • #115–#118 build the module tree builder and XPath-style parameter selection (Linear/*[2] syntax for selecting specific parameter groups in a model tree).
  • #122 adds per-group learning rate selectors with fixed and named LR support (see the sketch after this list).
  • #123 begins nanochat-alike partial tuning experiments.
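
At bottom, the selector machinery maps a parameter’s module path to a learning-rate group. A toy illustration of that idea (not zsl-chat’s API, whose XPath-like grammar above is richer):

```rust
use std::collections::HashMap;

/// Pick the learning rate for a parameter by longest-prefix match of its
/// module path against configured group patterns, falling back to a
/// default. E.g. groups = {"decoder/Linear": 1e-4}, default = 3e-4.
fn lr_for(path: &str, groups: &HashMap<&str, f64>, default: f64) -> f64 {
    groups
        .iter()
        .filter(|(pattern, _)| path.starts_with(**pattern))
        .max_by_key(|(pattern, _)| pattern.len())
        .map(|(_, &lr)| lr)
        .unwrap_or(default)
}
```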

May 2026: LrScheduler and Record API Completion

The final burn PRs in this timeline complete the infrastructure needed for full LLM fine-tuning workflows:

  • #4881 adds ParamId::try_deserialize(), enabling checkpoint loading with partial parameter matching — essential for transfer learning and partial fine-tuning.
  • #4905 adds Clone + 'static bounds to LrScheduler::Record and derives Clone for scheduler records, a prerequisite for serializing and restoring training state across multi-group optimizers.

These land alongside zsl-chat experiments in partial fine-tuning of chat models — the first real end-to-end LLM training runs on the infrastructure that has been building since July 2025.