A vanilla transformer is one big hammer: self-attention everywhere, dense FFN everywhere, quadratic in context. Tilelli is three small tools and a routing decision per token. Cheaper, smaller, and, at a near-identical parameter budget, slightly better on byte-level language modeling: 0.37% lower validation bpc in the comparison below.
The three pathways
1. Local convolution (k=5)
A causal 1D convolution captures the obvious: bigrams, trigrams, common letter patterns. Attention is expensive overkill for that work. The conv learns it in a fixed window and costs almost nothing.
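To make "causal" and "fixed window" concrete, here is a minimal sketch of a k=5 causal convolution over a sequence of scalars. It is an illustration only: the real pathway presumably runs over embedding channels with learned kernels, and the function name and example kernel here are hypothetical.

```python
def causal_conv1d(x, kernel):
    """Causal 1D convolution over a list of scalars.

    y[t] is a weighted sum of x[t - (k - 1)] .. x[t], with positions
    before the start of the sequence treated as zero, so y[t] never
    depends on anything after position t.
    """
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(x)  # left-pad only: no access to the future
    return [
        sum(kernel[j] * padded[t + j] for j in range(k))
        for t in range(len(x))
    ]


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
# Hypothetical k=5 kernel: averages the current and the previous input.
kernel = [0.0, 0.0, 0.0, 0.5, 0.5]
y = causal_conv1d(x, kernel)
print(y)  # [0.5, 1.5, 2.5, 3.5, 4.5, 5.5]
```

The cost per position is k multiply-adds regardless of sequence length, which is why this pathway is so cheap compared with attention.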
2. Sparse attention (top-k, k≤16, 8 heads)
Top-k causal attention. Pays for what it uses, ignores what it doesn't. We keep a learned position embedding so the model doesn't have to recover order from scratch. This is the long-range pathway.
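As a rough sketch of what top-k causal attention means, the single-head toy below scores each query against past and current keys only, keeps the `top_k` best, and softmaxes over just those. It assumes plain scaled dot-product scoring and leaves out the 8 heads and the learned position embedding; the function name and inputs are hypothetical, and the source does not say how Tilelli selects its top-k keys.

```python
import math


def topk_causal_attention(q, k, v, top_k=16):
    """Single-head top-k causal attention on lists of vectors.

    For each position t: score keys 0..t, keep the top_k highest
    scores, softmax over only those, and return the weighted sum of
    the matching values.
    """
    d = len(q[0])
    out = []
    for t in range(len(q)):
        # Scaled dot-product scores against keys at positions <= t (causal).
        scores = [
            sum(qi * ki for qi, ki in zip(q[t], k[s])) / math.sqrt(d)
            for s in range(t + 1)
        ]
        # Positions of the top_k largest scores; all others get zero weight.
        keep = sorted(range(t + 1), key=lambda s: scores[s], reverse=True)[:top_k]
        m = max(scores[s] for s in keep)  # subtract the max for numerical stability
        w = {s: math.exp(scores[s] - m) for s in keep}
        z = sum(w.values())
        out.append([
            sum(w[s] / z * v[s][j] for s in keep)
            for j in range(len(v[0]))
        ])
    return out


q = [[1, 0], [0, 1], [1, 0]]
k = [[1, 0], [0, 1], [2, 0]]
v = [[10, 0], [0, 20], [30, 0]]
out = topk_causal_attention(q, k, v, top_k=1)
print(out)  # [[10.0, 0.0], [0.0, 20.0], [30.0, 0.0]]
```

With `top_k=1` each position simply copies the value of its best-matching visible key. With `top_k` at least as large as the sequence length, the function reduces to ordinary causal attention.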
3. Dense ternary FFN (expand=4)
A wide feed-forward block where the weights are {−1, 0, +1} with a per-tensor scale. This is where the model's knowledge lives, and where most of the storage savings come from. A 4× expansion at ternary precision is roughly the memory of a 2× expansion at FP16 when each ternary weight is held in one byte, and it compresses further on disk, since a ternary value carries only log2(3) ≈ 1.58 bits of information.
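A minimal sketch of ternary quantization with a per-tensor scale, plus the memory arithmetic behind the 4× versus 2× comparison. Two things here are assumptions rather than anything the text confirms: the scale rule (mean absolute value) and the reading that ternary weights occupy one byte each in memory. The helper names and `d_model = 256` are hypothetical.

```python
def ternarize(weights):
    """Quantize a flat list of floats to {-1, 0, +1} with one shared scale.

    Assumption: the scale is the mean absolute weight (absmean). The
    source only says "per-tensor scale", not how it is computed.
    """
    scale = sum(abs(w) for w in weights) / len(weights)
    if scale == 0.0:
        return [0] * len(weights), 0.0
    ternary = [max(-1, min(1, round(w / scale))) for w in weights]
    return ternary, scale


def ffn_weight_bits(d_model, expand, bits_per_weight):
    """Bits for the two FFN matrices: d x (expand*d) and (expand*d) x d."""
    return 2 * expand * d_model * d_model * bits_per_weight


w = [0.9, -0.05, 0.4, -1.2, 0.02, -0.43]
q, s = ternarize(w)
print(q)  # [1, 0, 1, -1, 0, -1]

d_model = 256  # hypothetical width, for illustration only
ternary_int8 = ffn_weight_bits(d_model, expand=4, bits_per_weight=8)
fp16_2x = ffn_weight_bits(d_model, expand=2, bits_per_weight=16)
ternary_packed = ffn_weight_bits(d_model, expand=4, bits_per_weight=2)
print(ternary_int8 == fp16_2x)        # True
print(fp16_2x // ternary_packed)      # 4
```

Under these assumptions, a 4× ternary FFN stored one weight per byte matches a 2× FP16 FFN exactly, and packing the same weights at 2 bits each shrinks them by a further factor of four.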
The router
For each token, a small Linear layer per block predicts a 3-vector of logits over the pathways. A softmax turns the logits into weights, and the block output is the weighted combination of the three pathway outputs. The router is trained jointly with a load-balancing auxiliary loss that pushes average pathway usage toward uniform.
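The routing step can be sketched in a few lines. The softmax mixing follows the description above; the exact form of the load-balancing loss is not given in the text, so the squared distance from uniform used here is an assumption, and all function names are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def route(logits, pathway_outputs):
    """Mix the pathway output vectors for one token using router weights."""
    weights = softmax(logits)
    dim = len(pathway_outputs[0])
    mixed = [
        sum(w * out[j] for w, out in zip(weights, pathway_outputs))
        for j in range(dim)
    ]
    return weights, mixed


def load_balance_loss(all_weights):
    """Squared distance between mean pathway usage and uniform.

    Assumption: this particular penalty is illustrative; the source only
    says the auxiliary loss pushes average usage toward uniform.
    """
    n = len(all_weights[0])
    mean_usage = [
        sum(w[i] for w in all_weights) / len(all_weights) for i in range(n)
    ]
    return sum((u - 1.0 / n) ** 2 for u in mean_usage)


# One token, three pathway outputs (conv, attention, FFN), equal logits.
weights, mixed = route([0.0, 0.0, 0.0], [[3.0, 0.0], [0.0, 3.0], [3.0, 3.0]])
balanced = load_balance_loss([weights, weights])
collapsed = load_balance_loss([[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
```

With equal logits each pathway gets weight 1/3 and the penalty is zero. If every token sends all its weight to one pathway, the penalty rises to 2/3, which is the pressure that keeps the router from collapsing onto a single tool.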
The headline number
| Model | Params | Val bpc (lower is better) | Δ bpc vs vanilla |
|---|---|---|---|
| Vanilla transformer (pre-norm) | 10.09 M | 0.5707 | — |
| Tilelli Lite (3 pathways) | 10.18 M | 0.5686 | −0.37% |
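For readers who want to check the last column, the delta is the relative change in validation bpc. The sketch below recomputes it from the table and applies the same formula to the parameter counts; the helper name is hypothetical.

```python
def relative_change(baseline, candidate):
    """Percentage change of candidate relative to baseline."""
    return (candidate - baseline) / baseline * 100.0


bpc_delta = relative_change(0.5707, 0.5686)
param_delta = relative_change(10.09, 10.18)
print(round(bpc_delta, 2))    # -0.37
print(round(param_delta, 2))  # 0.89
```

So the 0.37% bpc improvement comes with a parameter count about 0.89% above the baseline.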
Where the edge comes from
The vanilla baseline spends FLOPs uniformly. Tilelli spends them where they're needed. On a token like "th" the router sends almost everything to local-conv. On a long-range coreference, the router pays for attention. On something the model has to recall, the dense FFN takes the load.
Same parameter budget. Different deployment.
What grows with context
At seq=256 the win is 0.37%. At seq=1024 the directional win grows to ~4.5%, but the long-context comparison is not param-fair: the V2 long-context variant carries +11% parameters and roughly 3.5× the training compute of the vanilla long-context baseline. We cannot attribute the seq=1024 delta to architecture alone; the honest reading is that sparse attention scales more cheaply than full attention, and a clean param-matched re-run at seq=1024 is queued behind the same RunPod budget as the 3-seed vanilla replication.
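One way to see why sparse attention might scale more cheaply is to count how many key-value pairs each position aggregates. This is a back-of-the-envelope sketch, not a measurement: it ignores the cost of selecting the top-k keys (which may still require scoring every key, depending on the implementation), and the function names are hypothetical.

```python
def attended_pairs_full(seq_len):
    """Causal full attention: position t attends to t keys (1-indexed)."""
    return seq_len * (seq_len + 1) // 2


def attended_pairs_topk(seq_len, k=16):
    """Top-k causal attention: position t attends to at most k keys."""
    return sum(min(t, k) for t in range(1, seq_len + 1))


for n in (256, 1024):
    full = attended_pairs_full(n)
    sparse = attended_pairs_topk(n, k=16)
    print(n, full, sparse, round(full / sparse, 2))
# 256 32896 3976 8.27
# 1024 524800 16264 32.27
```

Under this count, going from seq=256 to seq=1024 multiplies the full-attention pairs by about 16 but the top-k pairs by only about 4. That is consistent with the direction of the claim above, but it does not substitute for the param-matched re-run.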
What this is not
This is not Mixture-of-Experts (MoE). MoE adds parameters and routes between them. Tilelli has a fixed parameter budget and routes between three kinds of computation. It's closer to a hybrid SSM/attention model than to a sparse MoE.
For the full reproducible recipe, see "Built from zero. Under twenty dollars."