
What actually beat vanilla.

A 10M-parameter routed transformer, in FP32, beat a same-size vanilla GPT on TinyStories byte-LM. Every Lite seed we ran was below the single vanilla seed we had budget for. Here's exactly what we measured — and what's still missing.

18 May 2026 Retraction ~7 min read

We had a story we wanted to tell: "ternary couldn't talk, ternary now talks." It was tidy and it was wrong. What our benchmark actually showed is that the 3-pathway routing architecture — in FP32, on a same-size param budget — beat vanilla. The ternary recipe is in the code as a supported path, but the model that beat vanilla wasn't running it. We're going to be precise about that here.

The number, with every caveat that comes with it

| Variant | Precision | val bpc (lower is better) | Seeds |
| Vanilla pre-norm transformer (10.09 M params) | FP32 | 0.5707 | 1 |
| Tilelli Lite 3-pathway (10.18 M params) | FP32 | 0.5686 mean (0.5679 / 0.5685 / 0.5693) | 3 |
| Plain ternary (sibling 6-pathway) | ~2-bit weights | ~0.64 | 1 |
| Spectrum-Lite (power-of-3, 7 levels) | ~2.81 bit | ~0.638 | 1 |
TinyStories byte-LM, 50K steps, seq=256. The win is FP32 vs FP32. Plain ternary and Spectrum-Lite both still lose to vanilla FP32 by about 12% on the same benchmark — that's an active research direction, not a shipped claim.
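The "about 12%" figure can be checked from the table with a few lines of arithmetic. This is a minimal sketch; the ternary and Spectrum-Lite values are the approximate ones from the table, and the function name is ours, not something from the kit.

```python
# val bpc figures as read from the table (ternary / Spectrum-Lite are approximate).
VANILLA_FP32 = 0.5707

results = {
    "Tilelli Lite (FP32, 3-seed mean)": 0.5686,
    "Plain ternary": 0.64,
    "Spectrum-Lite": 0.638,
}


def relative_gap_pct(bpc: float, baseline: float = VANILLA_FP32) -> float:
    """Percent by which `bpc` exceeds the baseline (negative means better)."""
    return 100.0 * (bpc - baseline) / baseline


for name, bpc in results.items():
    print(f"{name}: {relative_gap_pct(bpc):+.1f}% vs vanilla FP32")
```

The Lite mean comes out a fraction of a percent below vanilla, while both quantized variants land roughly 12% above it.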

Why the comparison is one-sided (and why we're saying so)

Three Lite seeds. One vanilla seed. A Welch test needs a variance estimate on each side, which means at least two seeds per arm; we planned three. We queued a 3-seed vanilla replication on RunPod; it would have cost ~$15 of community-priced A40 time, and we ran out of budget before running it. (RunPod balance as of writing: $1.95.) So what we can defend is a directional finding: three Lite seeds all landed below the single vanilla seed, with margins between −0.0014 and −0.0028 bpc. The previously cited "6.7σ" headline was retracted in our own benchmark audit because the σ measured Lite-side variance only, treating vanilla as exact.
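To make the one-sidedness concrete, here is a small sketch using only the numbers reported above. It computes the per-seed margins and shows that Welch's t statistic cannot be formed while one arm has a single sample. The `welch_t` helper is our own illustration, not code from the benchmark audit.

```python
from math import sqrt
from statistics import mean, variance

lite = [0.5679, 0.5685, 0.5693]  # three Lite seeds (val bpc)
vanilla = [0.5707]               # the single vanilla seed (val bpc)

# Per-seed margin against the one vanilla run (negative means Lite is better).
margins = [round(x - vanilla[0], 4) for x in lite]
print("margins (bpc):", margins)


def welch_t(a, b):
    """Welch's t statistic; needs a sample variance from each group."""
    if len(a) < 2 or len(b) < 2:
        raise ValueError("Welch's t needs at least two samples per group")
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


try:
    welch_t(lite, vanilla)
except ValueError as err:
    # With one vanilla seed there is no vanilla-side variance to estimate.
    print("cannot test yet:", err)
```

Once a multi-seed vanilla replication exists, the same function applies unchanged to the two lists.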

We could have buried that. We didn't.

What the architecture is, in two sentences

For each token, a small per-block router (a 3-vector softmax) decides how much of the token's residual to send through three pathways: a local 1-D causal convolution (k=5), a top-k sparse causal attention (8 heads, k≤16), and a dense feed-forward expansion (×4). The router is trained jointly with a load-balance auxiliary loss; the same routed block can run with FP32 weights, ternary {−1, 0, +1} weights with a per-tensor scale, or spectrum-quantised weights — but the version that beat vanilla on TinyStories was FP32.
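The routing step described above can be sketched in plain Python. This is an illustration under stated assumptions, not the kit's implementation: the pathways are stand-in callables rather than a real convolution, attention, or feed-forward layer; we assume the mixed output is added back to the residual; and the load-balance penalty shown (squared deviation of the mean routing weight from uniform) is one common form, since the post does not specify the exact loss.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a short list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def routed_block(x, router_logits, pathways):
    """Mix pathway outputs by router weight, then add to the residual (assumed)."""
    weights = softmax(router_logits)  # 3-vector, one weight per pathway
    outs = [pathway(x) for pathway in pathways]
    mixed = [sum(w * out[i] for w, out in zip(weights, outs)) for i in range(len(x))]
    return [xi + mi for xi, mi in zip(x, mixed)], weights


def load_balance_penalty(weights_per_token):
    """Assumed auxiliary loss: penalise mean routing weights that drift from uniform."""
    n_tokens = len(weights_per_token)
    n_paths = len(weights_per_token[0])
    mean_w = [sum(w[j] for w in weights_per_token) / n_tokens for j in range(n_paths)]
    return sum((m - 1.0 / n_paths) ** 2 for m in mean_w)


# Stand-in pathways (hypothetical placeholders for conv / sparse attention / FFN).
pathways = [
    lambda v: [0.5 * t for t in v],
    lambda v: [-t for t in v],
    lambda v: [t * t for t in v],
]
y, w = routed_block([1.0, 2.0], router_logits=[0.2, -0.1, 0.4], pathways=pathways)
print(w, load_balance_penalty([w]))
```

The penalty is zero when the three pathways receive equal average weight and grows as the router collapses onto one of them.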

What about ternary?

The ternary path is shipped and trainable, but on the same byte-LM at this scale it currently loses to FP32 by about 12%. That's not a knock on ternary; it's the cost of representational coarseness at very small parameter counts. The quantize=False field in the deployed v4 checkpoint confirms that it ships in FP32, and the README in the public kit calls this out explicitly. The supported path is there if you want to exercise it (toggle one flag), but the headline win is not the quantization.
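For readers unfamiliar with ternary weights, here is a minimal sketch of quantizing a tensor to {−1, 0, +1} with a per-tensor scale. The scale rule (mean absolute value) is an assumption borrowed from common ternary recipes; the post does not say which rule the kit uses, and the function names are ours.

```python
def ternary_quantize(weights, eps=1e-8):
    """Map weights to codes in {-1, 0, +1} plus one per-tensor scale.

    Assumption: the scale is the mean absolute weight; the kit may differ.
    """
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Reconstruct approximate weights from ternary codes."""
    return [c * scale for c in codes]


codes, scale = ternary_quantize([0.9, -0.05, 0.4, -1.2])
print(codes, scale)
print(dequantize(codes, scale))
```

Every weight collapses to one of three values, which is the representational coarseness the paragraph above refers to.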

What about spectrum?

Spectrum is a separate research direction: a power-of-3 quantizer with 7 levels ({0, ±1, ±3, ±9} × α) instead of plain ternary's 3 levels. It costs ~2.81 bits per weight vs ternary's 1.58, requires no new kernels (multiply-by-3 and -by-9 are shift-and-add), and on our most recent run closes about 49% of the gap between plain ternary and FP32 — still ~12% behind FP32 vanilla. It's not in this kit because it's a different architecture; the goal here is to ship the model that beat vanilla, not the research that's still trying to close the ternary gap.
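The bit counts and the shift-and-add claim are easy to verify. The sketch below snaps weights to the nearest of the seven levels {0, ±1, ±3, ±9} × α and computes the information-theoretic bits per weight. How α is chosen is not stated in the post, so it is passed in explicitly here, and all names are hypothetical.

```python
import math

LEVELS = (-9, -3, -1, 0, 1, 3, 9)  # power-of-3 levels, before scaling by alpha


def spectrum_quantize(weights, alpha):
    """Snap each weight to the nearest level; returns integer codes."""
    return [min(LEVELS, key=lambda level: abs(w / alpha - level)) for w in weights]


def times3(x: int) -> int:
    """Multiply by 3 with one shift and one add."""
    return (x << 1) + x


def times9(x: int) -> int:
    """Multiply by 9 with one shift and one add."""
    return (x << 3) + x


bits_ternary = math.log2(3)   # entropy bound for 3 levels
bits_spectrum = math.log2(7)  # entropy bound for 7 levels

print(f"ternary: {bits_ternary:.3f} bits, spectrum: {bits_spectrum:.3f} bits")
print(spectrum_quantize([0.9, 0.28, -0.1, 0.0], alpha=0.1))
```

The two `log2` values reproduce the 1.58 and ~2.81 bits per weight quoted above.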

Check it yourself

The public kit at github.com/TilelliLab/Tilelli-llm bundles the v4 checkpoint, the 3-pathway architecture in 8 source files, and a reproduce/01_benchmark.py script that verifies the architecture loads with a parameter count within ±5% of 10M. The val-bpc number itself requires re-running the training pipeline (we haven't bundled the training data); the docs say so plainly and tell you what the GPU cost would be. If you have budget, run the 3-seed vanilla replication we couldn't, and tell us whether the Lite advantage holds. Either result is interesting.
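The ±5% check amounts to a one-line tolerance test. The snippet below is a hypothetical stand-in for that logic, not the contents of reproduce/01_benchmark.py; the parameter counts are the ones listed in the results table.

```python
def within_tolerance(n_params: int, target: int = 10_000_000, tol: float = 0.05) -> bool:
    """True if n_params is within +/- tol (as a fraction) of the target count."""
    return abs(n_params - target) <= tol * target


for name, n_params in [("vanilla", 10_090_000), ("Tilelli Lite", 10_180_000)]:
    print(name, within_tolerance(n_params))
```

Both models from the table fall inside the band, so the check confirms size parity only; it says nothing about val bpc.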

The honest thing to say about this benchmark is: directional finding, one vanilla seed, three Lite seeds, margins consistent in sign, full Welch test queued behind a 3-seed vanilla replication we ran out of budget for. If someone runs the replication and the gap collapses, we update the page.