We had a story we wanted to tell: "ternary couldn't talk, ternary now talks." It was tidy and it was wrong. What our benchmark actually showed is that the 3-pathway routing architecture — in FP32, on a same-size param budget — beat vanilla. The ternary recipe is in the code as a supported path, but the model that beat vanilla wasn't running it. We're going to be precise about that here.
The number, with every caveat that comes with it
| Variant | Precision | val bpc (lower is better) | Seeds |
|---|---|---|---|
| Vanilla pre-norm transformer (10.09 M params) | FP32 | 0.5707 | 1 |
| Tilelli Lite 3-pathway (10.18 M params) | FP32 | 0.5686 mean (0.5679 / 0.5685 / 0.5693) | 3 |
| Plain ternary (sibling 6-pathway) | ~2-bit weights | ~0.64 | 1 |
| Spectrum-Lite (power-of-3, 7 levels) | ~2.81 bit | ~0.638 | 1 |
Why the comparison is one-sided (and why we're saying so)
Three Lite seeds. One vanilla seed. A clean Welch test needs several seeds on each side, because it estimates a variance per arm; we planned three. We queued
a 3-seed vanilla replication on RunPod; it would have cost ~$15 of community-priced A40 time and we ran out of
budget before running it. (RunPod balance as of writing: $1.95.) So what we can defend is a
directional finding: three Lite seeds all landed below the single vanilla seed, with margins between
−0.0014 and −0.0028 bpc. The previously cited "6.7σ" headline was retracted in our own benchmark audit
because the σ measured Lite-side variance only, treating vanilla as exact.
We could have buried that. We didn't.
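For readers who want to check the arithmetic, here is a minimal sketch that recomputes the Lite mean and the per-seed margins from the table, and shows why a Welch test cannot be run with a single vanilla seed. The helper name welch_t is ours for illustration; it is not part of the kit.

```python
from math import sqrt
from statistics import mean, variance

# Values copied from the table above (val bpc, lower is better).
lite_seeds = [0.5679, 0.5685, 0.5693]
vanilla_seeds = [0.5707]  # only one vanilla seed was run

lite_mean = mean(lite_seeds)
# Margin of each Lite seed against the single vanilla seed.
margins = [round(x - vanilla_seeds[0], 4) for x in lite_seeds]


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom.

    Returns None when either sample has fewer than two values, because
    the sample variance (and therefore the test) is undefined there.
    """
    if len(a) < 2 or len(b) < 2:
        return None
    va = variance(a) / len(a)
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


print(f"Lite mean: {lite_mean:.4f}")
print(f"margins vs vanilla: {margins}")
print(f"Welch test with one vanilla seed: {welch_t(lite_seeds, vanilla_seeds)}")
```

With one value on the vanilla side the function returns None, which is the whole problem in one line: there is no vanilla variance to put in the denominator.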
What the architecture is, in two sentences
For each token, a small per-block router (a 3-vector softmax) decides how much of the token's residual to send
through three pathways: a local 1-D causal convolution (k=5), a top-k sparse causal attention (8 heads, k≤16),
and a dense feed-forward expansion (×4). The router is trained jointly with a load-balance auxiliary loss; the
same routed block can run with FP32 weights, ternary {−1, 0, +1} weights with a per-tensor scale, or
spectrum-quantised weights — but the version that beat vanilla on TinyStories was FP32.
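To make the two sentences above concrete, here is a toy sketch of the routing step in plain Python. It is not the kit's code. We assume the router output is used as soft mixing weights over the three pathway outputs, and the load-balance term shown (squared deviation of mean pathway usage from uniform) is a hypothetical stand-in, since the exact loss is not spelled out here. All function names are illustrative.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a short list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def causal_conv1d(seq, kernel):
    """y[t] = sum_j kernel[j] * seq[t - j]; positions before 0 count as zero."""
    return [
        sum(kernel[j] * seq[t - j] for j in range(len(kernel)) if t - j >= 0)
        for t in range(len(seq))
    ]


def route_token(router_logits, conv_out, attn_out, ffn_out):
    """Mix three pathway outputs for one token with softmax router weights.

    Assumption: soft mixing (a weighted sum), not hard selection.
    """
    w = softmax(router_logits)
    mixed = [
        w[0] * c + w[1] * a + w[2] * f
        for c, a, f in zip(conv_out, attn_out, ffn_out)
    ]
    return w, mixed


def load_balance_penalty(weights_per_token):
    """Hypothetical auxiliary loss: penalise mean usage far from uniform."""
    n = len(weights_per_token)
    k = len(weights_per_token[0])
    mean_usage = [sum(w[i] for w in weights_per_token) / n for i in range(k)]
    return sum((u - 1.0 / k) ** 2 for u in mean_usage)


# Equal logits give equal weights, so the mix is the plain average.
w, mixed = route_token([0.0, 0.0, 0.0], [1.0, 2.0], [3.0, 4.0], [5.0, 6.0])
print(w, mixed)
# A k=5 causal kernel only ever looks at the current and four previous positions.
print(causal_conv1d([1, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1]))
```

The penalty is zero when the three pathways are used equally on average and grows as the router collapses onto one of them, which is the behaviour a load-balance term is there to discourage.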
What about ternary?
The ternary path is shipped and trainable, but on the same byte-LM at this scale it currently trails
FP32 by about 12% in val bpc (~0.64 vs ~0.57). That's not a knock on ternary; it's the cost of representational coarseness at
very small parameter counts. The deployed v4 checkpoint's quantize=False field
confirms it ships in FP32. The README in the public kit calls this out explicitly. The supported path is there
if you want to exercise it — toggle one flag — but the headline win is not the quantization.
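If you have not met ternary weights before, this sketch shows what "{−1, 0, +1} with a per-tensor scale" means in practice. The scale rule used here (mean absolute weight) is our assumption for illustration; the kit's actual rule may differ, and the function names are hypothetical.

```python
def ternary_quantize(weights, eps=1e-8):
    """Map weights to codes in {-1, 0, +1} plus one per-tensor scale.

    Assumption: the scale is the mean absolute weight (an absmean rule).
    """
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    # Round to the nearest integer, then clip into the ternary range.
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Reconstruct approximate weights from ternary codes and the scale."""
    return [c * scale for c in codes]


weights = [0.9, -0.05, 0.4, -1.2]
codes, scale = ternary_quantize(weights)
print(codes, round(scale, 4))
print(dequantize(codes, scale))
```

Every reconstructed weight is one of three values, so small weights vanish and large ones saturate. That is the representational coarseness the paragraph above refers to.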
What about spectrum?
Spectrum is a separate research direction: a power-of-3 quantizer with 7 levels ({0, ±1, ±3, ±9} × α) instead of plain ternary's 3 levels. It costs ~2.81 bits per weight (log2 7) vs ternary's ~1.58 (log2 3), requires
no new kernels (multiply-by-3 and -by-9 are shift-and-add), and on our most recent run closes only about 3%
of the gap between plain ternary and FP32 (~0.64 to ~0.638 bpc), leaving it still ~12% behind FP32 vanilla. It's not in this
kit because it's a different architecture; the goal here is to ship the model that beat vanilla, not the
research that's still trying to close the ternary gap.
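The bit costs and the shift-and-add claim are easy to verify. The sketch below computes the information-theoretic bits per weight for 3 and 7 levels, implements multiply-by-3 and multiply-by-9 with shifts, and snaps weights to the nearest of the 7 levels. How α is chosen is not covered here, so it is simply passed in; the function names are illustrative, not the research code.

```python
import math

# The 7 spectrum levels before scaling by alpha: {0, ±1, ±3, ±9}.
LEVELS = [-9, -3, -1, 0, 1, 3, 9]


def times3(x):
    """3x as one shift and one add: 2x + x."""
    return (x << 1) + x


def times9(x):
    """9x as one shift and one add: 8x + x."""
    return (x << 3) + x


def spectrum_quantize(weights, alpha):
    """Snap each weight to the nearest value in LEVELS * alpha.

    alpha is taken as given; how it is fitted is outside this sketch.
    """
    return [min(LEVELS, key=lambda lv: abs(w - lv * alpha)) for w in weights]


bits_ternary = math.log2(3)   # 3 levels
bits_spectrum = math.log2(7)  # 7 levels

print(f"ternary: {bits_ternary:.2f} bits, spectrum: {bits_spectrum:.2f} bits")
print(times3(7), times9(-4))
print(spectrum_quantize([0.02, -0.28, 0.95, 0.12], alpha=0.1))
```

These are entropy bounds. A practical packed format rounds up, which is why the table lists plain ternary as ~2-bit weights.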
Check it yourself
The public kit at github.com/TilelliLab/Tilelli-llm
bundles the v4 checkpoint, the 3-pathway architecture in 8 source files, and a reproduce/01_benchmark.py
script that verifies the architecture loads with a parameter count within ±5% of 10M. The val-bpc number itself requires
re-running the train pipeline (we haven't bundled the training data); the docs say so plainly and tell you what
the GPU cost would be. If you have budget, run the 3-seed vanilla replication we couldn't — and tell us if the
Lite advantage holds or doesn't. Either result is interesting.
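The ±5% parameter check is simple enough to state in a few lines. This is a hypothetical restatement of that kind of check, not the contents of reproduce/01_benchmark.py; the parameter counts are the rounded figures from the table.

```python
def within_tolerance(n_params, target=10_000_000, tol=0.05):
    """True when n_params lies within ±tol (relative) of target."""
    return abs(n_params - target) <= tol * target


# Rounded counts from the table: 10.09 M (vanilla) and 10.18 M (Lite).
for name, n in [("vanilla", 10_090_000), ("Tilelli Lite", 10_180_000)]:
    print(name, within_tolerance(n))
```

Passing this check only tells you the architecture was built at the stated size. It says nothing about val bpc, which still needs the training run.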
The honest thing to say about this benchmark is: directional finding, one vanilla seed, three Lite seeds, margins consistent in sign, full Welch test queued behind a 3-seed vanilla replication we ran out of budget for. If someone runs the replication and the gap collapses, we update the page.