Every number on this page is bound to a script.

We don't ask you to trust our benchmarks. The kit ships the scripts that fail loudly when a claim doesn't hold.

2 Jun 2026 Method ~4 min read

The cheapest thing in machine learning is a confident number with no way to check it. We decided early that Tilelli would not ship any of those.

The rule

Every headline claim we make has a file behind it and a script that recomputes it. The script either reproduces the documented number to within ±5%, or it exits non-zero. There is no third option, no "directionally correct," no screenshot of a run nobody can repeat.
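As a rough illustration of that rule, here is a minimal sketch of a pass-or-fail check. The function names and the example values are hypothetical, not taken from the kit; the only things borrowed from the rule above are the ±5% tolerance and the non-zero exit on failure.

```python
def within_tolerance(measured: float, documented: float, rel_tol: float = 0.05) -> bool:
    """Return True if measured is within rel_tol (relative) of documented."""
    return abs(measured - documented) <= rel_tol * abs(documented)


def check_claim(name: str, measured: float, documented: float, rel_tol: float = 0.05) -> int:
    """Print a one-line verdict and return a process exit code (0 = pass, 1 = fail)."""
    ok = within_tolerance(measured, documented, rel_tol)
    verdict = "PASS" if ok else "FAIL"
    print(f"{verdict} {name}: measured={measured:g}, documented={documented:g}, "
          f"tolerance=±{rel_tol:.0%}")
    return 0 if ok else 1


# A script built on this would end by handing the code to the shell, e.g.:
#   import sys
#   sys.exit(check_claim("param_count", measured_value, 10_000_000))
```

Returning the exit code instead of calling `sys.exit` inside the check keeps the function easy to test; the script's last line decides whether the process fails.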

What a claim looks like

Open the public kit and look in reproduce/. Each script maps to a claim file in results/: the parameter count (the architecture loads to within ±5% of 10M parameters), the cross-regime uncertainty sweep, the abstain gate, and the NEO false-inability slice. The numbers on this website are copied out of those files, not the other way around.
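If you would rather run everything in one go, a small runner along these lines would do it. This is a sketch under assumptions: it supposes the scripts in reproduce/ are Python files, which this post does not state, and `run_all` and `failed` are hypothetical helpers, not part of the kit.

```python
import subprocess
import sys
from pathlib import Path


def run_all(reproduce_dir: str = "reproduce") -> dict[str, int]:
    """Run every *.py script in reproduce_dir and return {script name: exit code}."""
    results = {}
    for script in sorted(Path(reproduce_dir).glob("*.py")):
        # Each script signals its own verdict: 0 on success, non-zero on failure.
        proc = subprocess.run([sys.executable, str(script)])
        results[script.name] = proc.returncode
    return results


def failed(results: dict[str, int]) -> list[str]:
    """Names of the scripts that exited non-zero."""
    return [name for name, code in results.items() if code != 0]
```

Calling `failed(run_all())` from the root of the kit would then list any claim that did not reproduce; an empty list means every script passed.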

What the discipline caught

This is not decoration. An earlier "6.7σ" headline did not survive its own audit: the sigma accounted for variance on one side of the comparison only and treated the baseline as exact. We retracted it in writing and replaced it with the narrower, defensible claim. A process that can only ever confirm you is not a process. Ours has told us no, and we published the no.
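To see why that kind of error inflates a sigma, here is a small sketch with made-up numbers. It does not reproduce the retracted analysis; the sample values and function names are hypothetical, and the two-sided form shown is the standard unpooled (Welch-style) standard error, which may differ from the exact statistic used in the audit.

```python
import math
import statistics


def standard_error(samples: list[float]) -> float:
    """Standard error of the mean, using the sample standard deviation."""
    return statistics.stdev(samples) / math.sqrt(len(samples))


def sigma_baseline_exact(model: list[float], baseline: list[float]) -> float:
    """Flawed form: only the model's noise is counted; the baseline mean is treated as exact."""
    diff = statistics.mean(model) - statistics.mean(baseline)
    return diff / standard_error(model)


def sigma_both_sides(model: list[float], baseline: list[float]) -> float:
    """Two-sided form: the standard errors of both means are combined in quadrature."""
    diff = statistics.mean(model) - statistics.mean(baseline)
    return diff / math.hypot(standard_error(model), standard_error(baseline))


# Hypothetical per-seed scores, for illustration only.
model_runs = [0.80, 0.82, 0.81, 0.79, 0.83]
baseline_runs = [0.70, 0.78, 0.66, 0.74, 0.72]

print("baseline treated as exact:", sigma_baseline_exact(model_runs, baseline_runs))
print("variance on both sides:   ", sigma_both_sides(model_runs, baseline_runs))
```

Because the combined standard error can never be smaller than the model's alone, the two-sided sigma is never larger than the one-sided one for a positive difference. A noisy baseline can shrink the headline considerably.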

What it can't do (yet)

Some numbers cost real GPU time to regenerate — the full training run, the queued multi-seed replication. Those are marked pending with the dollar figure attached, not quietly rounded into the win column. When we can't run a test honestly, we say "not measured." That's the only honest placeholder.
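One way to keep pending numbers out of the win column is to make the status explicit in the data itself. The sketch below is an assumption about how such a ledger could look, not the kit's actual format; the class names, field names, and the dollar figure are placeholders.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    VERIFIED = "verified"
    PENDING = "pending"
    NOT_MEASURED = "not measured"


@dataclass(frozen=True)
class Claim:
    name: str
    status: Status
    est_cost_usd: Optional[float] = None  # required when the claim is pending

    def __post_init__(self) -> None:
        # A pending claim without a price tag is rejected outright.
        if self.status is Status.PENDING and self.est_cost_usd is None:
            raise ValueError(f"pending claim {self.name!r} needs a cost estimate")


def wins(claims: list[Claim]) -> list[str]:
    """Only verified claims count; pending and not-measured ones never do."""
    return [c.name for c in claims if c.status is Status.VERIFIED]


# Placeholder entries, for illustration only.
ledger = [
    Claim("param_count", Status.VERIFIED),
    Claim("multi_seed_replication", Status.PENDING, est_cost_usd=100.0),
    Claim("some_untested_slice", Status.NOT_MEASURED),
]
print(wins(ledger))
```

Making the cost mandatory for pending entries means a number cannot sit in limbo without someone stating what it would take to check it.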

The whole point of a small, open model is that you don't have to take our word for any of this. Clone the kit, run the scripts, and watch them pass — or catch us if they don't.