Ternary quantization gets sold as a free lunch — same model, fraction of the size. It isn't free. It's a trade, and whether the trade is worth it depends entirely on where the model has to run.
The case for three values
Weights in {−1, 0, +1} turn the expensive part of inference, matrix multiplication, into adds and sign flips. No floating-point multiply on the hot path. On a $2 microcontroller with no FPU, that's not an optimization, it's the difference between running and not running. At that end of the scale, ternary is the point.
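To make "adds and sign flips" concrete, here is a minimal sketch of a ternary matrix-vector product. It assumes integer activations and weights stored as plain Python lists; the function name `ternary_matvec` is made up for illustration and is not taken from the kit.

```python
def ternary_matvec(weights, x):
    """Multiply a ternary matrix by an integer vector using only add and subtract.

    weights: list of rows, every entry in {-1, 0, +1}
    x:       list of ints (for example int8 activations)
    """
    out = []
    for row in weights:
        acc = 0
        for w, xi in zip(row, x):
            if w == 1:
                acc += xi   # +1: add the activation
            elif w == -1:
                acc -= xi   # -1: sign flip, i.e. subtract the activation
            # 0: the weight contributes nothing, so skip it
        out.append(acc)
    return out


W = [[1, 0, -1],
     [-1, 1, 1]]
x = [5, -3, 2]
print(ternary_matvec(W, x))  # [3, -6]
```

The inner loop never multiplies. A real kernel would pack the weights far more tightly than one Python int each, but the arithmetic on the hot path stays the same shape.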
The honest cost
At 10 million parameters on a TinyStories byte-LM, the ternary path currently loses to FP32 by about 12%. That's the cost of representational coarseness at small parameter counts: with fewer weights, each one carries more, and rounding it to three values hurts more. The deployed Tilelli chat model ships in FP32 for exactly this reason: we put the better model in front of users and say so plainly.
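For a feel of what "rounding it to three values" does to a single weight vector, here is a small sketch. It assumes an absmean scheme (scale by the mean absolute weight, round, clip to [−1, +1]), which is one common way to ternarize and not necessarily what the kit does. The names `ternarize` and `mean_squared_error` and the sample weights are invented for illustration.

```python
def ternarize(weights):
    """Absmean ternary quantization (an assumed scheme, not confirmed as the kit's)."""
    scale = sum(abs(w) for w in weights) / len(weights)
    if scale == 0.0:
        return [0] * len(weights), 0.0
    # Round to the nearest integer, then clip into {-1, 0, +1}.
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale


def mean_squared_error(weights, codes, scale):
    """Average squared distance between the original and reconstructed weights."""
    return sum((w - c * scale) ** 2 for w, c in zip(weights, codes)) / len(weights)


weights = [0.9, -0.05, 0.4, -1.2, 0.1]   # made-up values
codes, scale = ternarize(weights)
print(codes)                                       # [1, 0, 1, -1, 0]
print(mean_squared_error(weights, codes, scale))   # reconstruction error
```

Every weight lands on −scale, 0, or +scale, so 0.9 and 0.4 become indistinguishable after quantization. The fewer weights a model has, the less room it has to absorb that kind of collapse.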
Closing the gap
There's a middle road we're exploring: a power-of-three quantizer with seven levels ({0, ±1, ±3, ±9} × a scale) instead of plain ternary's three. It costs about 2.81 bits per weight (log2 of 7) versus ternary's 1.58 (log2 of 3), and it needs no new kernels, since multiply-by-3 and multiply-by-9 are shift-and-add. On our most recent run it closes roughly 49% of the gap between plain ternary and FP32. Still behind FP32, but meaningfully less so.
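The bit counts and the shift-and-add claim are easy to check by hand. The sketch below assumes integer activations and nearest-level rounding; the helper names (`bits_per_weight`, `times3`, `times9`, `apply_level`, `quantize`) and the sample numbers are hypothetical, not the kit's API or measured data.

```python
import math

LEVELS = (-9, -3, -1, 0, 1, 3, 9)  # the seven power-of-three levels


def bits_per_weight(num_levels):
    """Information-theoretic cost of storing one of num_levels symbols."""
    return math.log2(num_levels)


def times3(x):
    return (x << 1) + x   # 3x = 2x + x


def times9(x):
    return (x << 3) + x   # 9x = 8x + x


def apply_level(level, x):
    """Multiply integer x by a level from LEVELS without a multiply instruction."""
    mag = abs(level)
    if mag == 0:
        return 0
    y = x if mag == 1 else times3(x) if mag == 3 else times9(x)
    return -y if level < 0 else y


def quantize(w, scale):
    """Pick the level closest to w / scale (nearest-level rounding is an assumption)."""
    return min(LEVELS, key=lambda level: abs(w / scale - level))


print(round(bits_per_weight(7), 2), round(bits_per_weight(3), 2))  # 2.81 1.58

codes = [quantize(w, 0.3) for w in (0.35, -1.1, 2.6)]   # made-up weights and scale
print(codes)                                             # [1, -3, 9]
print(sum(apply_level(c, x) for c, x in zip(codes, (4, -2, 1))))  # 19
```

In a layout like this the floating-point scale would be applied once per output rather than once per weight, which is what would keep the inner loop free of multiplies. That placement is an assumption of the sketch.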
The pattern underneath
Scale flips the verdict. At the very small end — tens of thousands of parameters, as in Atome — aggressive quantization can even come out ahead of a vanilla FP32 baseline. Push the same idea up to roughly a million parameters and the FP32 baseline pulls back in front. There's no universal "ternary wins" or "ternary loses." There's a curve, and the only honest thing is to report which point you measured.
The model that beat vanilla on TinyStories did it in FP32 — the full story is in "What actually beat vanilla". Ternary is shipped and trainable in the kit; toggle one flag and measure it yourself.