We publish both.
Working notes from a small lab — the architecture, the benchmarks, and the experiments that didn’t work. Some wins, some failures, every one with the receipts.
A tiny model whose facts you can edit, that knows when it doesn’t know.
Each fact lives in its own atom: create, read, update, or delete one at a time with provable locality, and the model abstains when unsure which fact you mean. What works, what doesn’t, and why it’s a step, not a breakthrough.
Read the write-up →
State-space beat attention. Then we went further.
On the Featherweight bench, RWKV/Hyena/SSM beat softmax by 18–32%. So we asked what beats RWKV — and a narrow, honest win is now firming up across seeds.
Read the update →
Biomedicine on a diet: three values per weight.
Ternary knowledge-graph embeddings — 5.3× smaller than the FP32 baseline, 24 MB packed, and still ahead on OGBL-biokg and PrimeKG. A research tool, not a clinic.
Read the explainer →
Every number on this page is bound to a script.
We don’t ask you to trust our benchmarks. The kit ships scripts that recompute each claim and exit non-zero when one doesn’t hold — and it has caught us.
Read the method →
Ranking drug candidates with a confidence dial.
Petitri ranks candidate drugs per disease on the Tilelli Med stack — and keeps its confidence honestly low when the graph evidence is thin. For review, not treatment.
Read the preview →
A five-vendor jury, and nobody grades their own homework.
NEO grades seven leading chat models on honest uncertainty — every answer judged by a council of five vendors, each family barred from scoring its own model.
Read how it grades →
Featherweight: five divisions, one belt.
Inside Mizan’s first bench — precision, optimizer, attention, state-space, community — and the rule that the smallest honest loss wins. Plus what the bench has shown.
Read the breakdown →
A language model that boots as firmware.
Atome runs a ternary LM on a $2 microcontroller — no OS, no internet, no app — with bit-exact Python↔C99 parity in a zero-heap engine. The honest, narrowed claim.
Read the build →
The hardest test for a chat model.
Most benchmarks reward confident answers. NEO rewards the right refusal — and penalizes the wrong one. Two ways to fail, only one of which the industry talks about.
Read the case →
Read the whole model in an afternoon.
A guided tour of the public kit — two 39 MB checkpoints, the 3-pathway architecture in a handful of files, and four scripts that check our claims. Apache-2.0, CPU-only.
Take the tour →
Why ternary, and where it still loses.
Three-value weights are the whole point at $2-chip scale. On a 10M-parameter byte-LM they still trail FP32 by ~12%. Both facts are true; we ship both, plus a 7-level middle road.
Read the trade-off →
Five attempts, $0.78 of GPU, hypothesis disproven.
The full v5 → v8b sweep — what we tried, where each attempt broke, and the clean mechanism we finally identified. The router is fragile, and we now know why.
Read the postmortem →
When the small model learns to say “I don’t know.”
We ran five experiments to test whether router entropy correlates with semantic uncertainty. It doesn’t. Here’s what does work — and why we’re publishing the failure.
Read the write-up →
Three pathways, one small brain.
Why Tilelli routes every token through a local conv, a sparse attention head, and a ternary dense FFN — and why this beats a vanilla transformer at the same parameter count.
Read the explainer →
Why a 10M-parameter model still matters.
In an era of trillion-parameter frontier models, what’s the point of a 39-megabyte one? Local inference, audit, reproducibility, and a model your laptop can carry.
Read the case →
What actually beat vanilla.
The “6.7σ” headline is retracted. What we can defend: every Lite seed below the single vanilla seed, FP32 vs FP32, on TinyStories byte-LM. The receipts, and what’s still missing.
Read the retraction →
Built from zero. Under twenty dollars.
The full reproducible recipe to train Tilelli v0.1 on one rented GPU for less than the price of a movie ticket. No pretrained weights, no shortcuts.
Read the recipe →