
Three pathways, one small brain.

Why Tilelli routes every token through three lightweight pathways instead of one quadratic-attention monolith.

23 May 2026 · Architecture · ~6 min read

A vanilla transformer is one big hammer: self-attention everywhere, dense FFN everywhere, quadratic in context. Tilelli is three small tools and a routing decision per token. It is cheaper, smaller, and, at a near-identical parameter budget (10.18 M vs 10.09 M), 0.37% better in validation bpc on byte-level language modeling.

The three pathways

1. Local convolution (k=5)

A causal 1D convolution captures the obvious: bigrams, trigrams, common letter patterns. Attention is expensive overkill for that work. The conv learns it in a fixed window and costs almost nothing.
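A minimal sketch of what "causal" means for a k=5 window, in plain Python on a single scalar channel. The function name and the list-based layout are illustrative assumptions, not Tilelli's actual code, which presumably runs a batched, multi-channel convolution.

```python
def causal_conv1d(x, kernel):
    """Causal 1D convolution over a scalar sequence (illustrative helper).

    x: list of floats, one per position.
    kernel: list of k weights; kernel[-1] multiplies the current position,
            kernel[0] multiplies the position k-1 steps in the past.
    """
    k = len(kernel)
    # Left-pad with k-1 zeros so output t only sees inputs t-k+1 .. t.
    padded = [0.0] * (k - 1) + list(x)
    return [
        sum(w * padded[t + j] for j, w in enumerate(kernel))
        for t in range(len(x))
    ]


# A k=5 kernel that only looks 4 steps back delays the sequence by 4.
delayed = causal_conv1d([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], [1.0, 0.0, 0.0, 0.0, 0.0])
```

Because the padding sits entirely on the left, changing a later input can never change an earlier output, and the cost per position is k multiply-adds regardless of sequence length.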

2. Sparse attention (top-k, k≤16, 8 heads)

Top-k causal attention. Pays for what it uses, ignores what it doesn't. We keep a learned position embedding so the model doesn't have to recover order from scratch. This is the long-range pathway.
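The selection step can be sketched as follows, under simplifying assumptions: one head, no position embedding, plain Python lists, and selection by raw scaled dot-product score. The function name is hypothetical, and this naive version still scores every earlier position before keeping the top k, so only the softmax and the value mix are sparse here.

```python
import math


def topk_causal_attention(q, k, v, top_k=16):
    """Single-head top-k causal attention over lists of vectors (sketch)."""
    d = len(q[0])
    out = []
    for t in range(len(q)):
        # Causal mask: score only against positions 0..t.
        scores = [
            sum(a * b for a, b in zip(q[t], k[s])) / math.sqrt(d)
            for s in range(t + 1)
        ]
        # Keep the top_k highest-scoring positions, drop the rest.
        keep = sorted(range(t + 1), key=lambda s: scores[s], reverse=True)[:top_k]
        # Numerically stable softmax over the kept positions only.
        m = max(scores[s] for s in keep)
        weights = {s: math.exp(scores[s] - m) for s in keep}
        z = sum(weights.values())
        out.append([
            sum(weights[s] / z * v[s][j] for s in keep)
            for j in range(len(v[0]))
        ])
    return out
```

With top_k=1 each position simply copies the value of its best-matching earlier position; with top_k at or above the sequence length the function reduces to ordinary causal attention.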

3. Dense ternary FFN (expand=4)

A wide feed-forward block where the weights are {−1, 0, +1} with a per-tensor scale. This is where the model's knowledge lives, and where most of the storage savings come from. A 4× expansion at ternary precision needs only a fraction of the memory of the same expansion at FP16: a ternary weight carries at most log2(3) ≈ 1.58 bits, against 16 bits for FP16.
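A rough sketch of ternary quantization with one scale per tensor, plus the storage arithmetic. The absmean scaling rule, the function names, and d_model=256 are assumptions made for illustration; the post does not say which quantization rule or width Tilelli uses.

```python
import math


def ternarize(weights):
    """Quantize a flat list of weights to {-1, 0, +1} plus one scale.

    Assumption: the scale is the mean absolute weight (absmean).
    """
    scale = sum(abs(w) for w in weights) / len(weights)
    if scale == 0.0:
        return [0] * len(weights), 0.0
    # Round to the nearest integer, then clamp into the ternary set.
    q = [max(-1, min(1, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from ternary codes."""
    return [scale * t for t in q]


def ffn_weight_bits(d_model, expand, bits_per_weight):
    """Bits for the two FFN matrices: d_model -> expand*d_model -> d_model."""
    return 2 * d_model * (expand * d_model) * bits_per_weight


q, scale = ternarize([0.9, -1.1, 0.05, 2.0])  # q == [1, -1, 0, 1]

fp16_bits = ffn_weight_bits(256, 4, 16)
packed_bits = ffn_weight_bits(256, 4, 2)             # simple 2-bit packing
ideal_bits = ffn_weight_bits(256, 4, math.log2(3))   # information-theoretic floor
```

Under these assumptions a 2-bit packing is exactly 8× smaller than FP16, and the log2(3) floor is a little over 10× smaller.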

The router

For each token, a small Linear layer per block predicts a 3-vector over the pathways. Softmax → weights → the block output is a weighted combination of the three pathway outputs. The router is trained jointly with a load-balancing auxiliary loss that pushes average pathway usage toward uniform.
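The routing step can be sketched for a single token as below. The function names are hypothetical, and the auxiliary loss shown (squared distance of mean pathway usage from uniform) is one plausible form, assumed here because the post does not give the exact formula.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a short list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def route_token(h, router_w, router_b, pathway_outputs):
    """Mix three pathway outputs for one token.

    h: hidden vector; router_w: 3 rows of len(h); router_b: 3 biases;
    pathway_outputs: [conv_out, attn_out, ffn_out], equal-length vectors.
    """
    # Linear layer: one logit per pathway.
    logits = [
        sum(w * x for w, x in zip(row, h)) + b
        for row, b in zip(router_w, router_b)
    ]
    gates = softmax(logits)
    # Block output is the gate-weighted sum of the pathway outputs.
    mixed = [
        sum(g * out[j] for g, out in zip(gates, pathway_outputs))
        for j in range(len(pathway_outputs[0]))
    ]
    return mixed, gates


def load_balance_loss(gates_per_token):
    """Assumed auxiliary loss: squared distance of mean usage from uniform."""
    n = len(gates_per_token)
    p = len(gates_per_token[0])
    mean_usage = [sum(g[i] for g in gates_per_token) / n for i in range(p)]
    return sum((u - 1.0 / p) ** 2 for u in mean_usage)
```

Note that this loss is zero whenever average usage is uniform, even if individual tokens route almost entirely to one pathway, so it does not by itself force every token toward a flat mix.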

The headline number

Model | Params | val bpc | Δ vs vanilla
Vanilla transformer (pre-norm) | 10.09 M | 0.5707 | baseline
Tilelli Lite (3 pathways) | 10.18 M | 0.5686 | −0.37%

TinyStories byte-LM, 50K steps, seq=256. 3 Lite seeds vs 1 vanilla seed: every Lite seed is below vanilla, with margins of −0.0014 to −0.0028 nats. A 3-seed vanilla replication is queued; we ran out of budget to ship it. The "6.7σ" headline we cited earlier was retracted in BENCHMARKS.md because the σ was Lite-side noise only.
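For readers who want to check the arithmetic, the relative delta follows directly from the two validation scores reported above. The helper name is illustrative; the inputs are the figures reported in this post.

```python
def relative_delta_percent(model, baseline):
    """Relative change of `model` versus `baseline`, in percent."""
    return 100.0 * (model - baseline) / baseline


val_delta = relative_delta_percent(0.5686, 0.5707)
param_delta = relative_delta_percent(10.18, 10.09)

print(f"val bpc: {0.5686 - 0.5707:+.4f} absolute, {val_delta:+.2f}% relative")
# val bpc: -0.0021 absolute, -0.37% relative
print(f"params:  {param_delta:+.2f}%")
# params:  +0.89%
```

The absolute gap of 0.0021 sits inside the reported per-seed margin range, and the Lite model carries slightly under 1% more parameters than the baseline.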

Where the edge comes from

The vanilla baseline spends FLOPs uniformly. Tilelli spends them where they're needed. On a byte inside a common pattern like "th", the router sends almost everything to local-conv. On a long-range coreference, it pays for attention. On something the model has to recall, the dense FFN takes the load.

Same parameter budget. Different deployment.

What grows with context

At seq=256 the win is 0.37%. At seq=1024 the directional win grows to ~4.5%, but the long-context comparison is not param-fair: the V2 long-context variant carries +11% parameters and roughly 3.5× the training compute of the vanilla long-context baseline. We cannot attribute the seq=1024 delta to architecture alone; the honest reading is that sparse attention scales more cheaply than full attention, and a clean param-matched re-run at seq=1024 is queued behind the same RunPod budget as the 3-seed vanilla replication.
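One way to see why the gap could widen with context is to count how many key/value positions each scheme mixes per sequence. This is a back-of-the-envelope sketch with hypothetical helper names; it counts only the softmax and value-mixing terms after selection, not the cost of scoring candidates to pick the top k, which a naive implementation still pays in full.

```python
def attended_pairs_full(n):
    """Causal full attention: position t mixes t+1 values."""
    return n * (n + 1) // 2


def attended_pairs_topk(n, k=16):
    """Causal top-k attention: position t mixes at most k values."""
    return sum(min(t + 1, k) for t in range(n))


for n in (256, 1024):
    full = attended_pairs_full(n)
    sparse = attended_pairs_topk(n)
    print(n, full, sparse, round(full / sparse, 1))
# 256 32896 3976 8.3
# 1024 524800 16264 32.3
```

Under this count the mixed terms grow linearly in sequence length for top-k and quadratically for full attention, so the ratio roughly quadruples from seq=256 to seq=1024.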

What this is not

This is not Mixture-of-Experts (MoE). MoE adds parameters and routes between them. Tilelli has a fixed parameter budget and routes between three kinds of computation. It's closer to a hybrid SSM/attention model than to a sparse MoE.

For the full reproducible recipe, see "Built from zero. Under twenty dollars."