One small lab. Three releases that share a single throughline — metacognition, the discipline of a model that knows what it knows. A 10M-parameter ternary language model. A calibration-first benchmark that ranks frontier chat models by honesty under pressure. A compressed biomedical knowledge-graph model that beats two leaderboards.
Each release is small enough to study end-to-end. Click in.
A 10M-parameter transformer descendant in {−1, 0, +1} weights. At matched parameter count, it beats a vanilla GPT-style baseline on TinyStories byte-level language modeling by 6.7σ. The router-entropy signal is wired straight into a confidence head: when the model is unsure, it says so.
13 probes, 1,015 items, 7 frontier chat models graded by a 5-vendor deliberative council. NEO measures the dangerous failure mode the leaderboards miss: a confident wrong answer, indistinguishable from a confident right one until the consequences arrive.
A ternary knowledge-graph embedding trained on OGBL-biokg + PrimeKG. Float teacher beats the OGBL ComplEx leaderboard. The ternary student edges out TransE at 5.3× compression — the first three-valued KGE to do so. Same recipe, replicated on PrimeKG (2023): quantization improves the teacher on both benchmarks.
Small enough to study. Big enough to surprise. Runs on a CPU.
Bigrams, common phrases, the obvious. Captures roughly 25% of FLOPs and lets attention skip the easy stuff.
Top-k causal attention with 8 heads. Pays for what it uses, ignores what it doesn't. Long-range pathway.
Wide ternary feed-forward at expand=4. Where the actual knowledge lives, quantized to {−1, 0, +1} plus a per-tensor scale (sketched below).
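For the curious, a minimal sketch of the ternary trick, assuming a BitNet-style mean-absolute per-tensor scale; the names and rounding rule here are illustrative, not the released training recipe.

```python
import torch

def ternarize(w: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    # One scale per tensor; weights rounded into {-1, 0, +1}.
    scale = w.abs().mean().clamp_min(1e-8)
    q = (w / scale).round().clamp_(-1, 1)
    return q, scale

def ternary_linear(x: torch.Tensor, q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Dequantize on the fly: y = x @ (scale * q)^T.
    return x @ (scale * q).t()
```

Storage only needs two bits per weight plus one float per tensor; the matmul above dequantizes on the fly for clarity.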
Most LLMs hallucinate confidently. Tilelli watches its own router entropy and a small confidence head — and when the signal goes flat, it just says so. No theatrics. No invented facts. The mechanism is auditable in five lines: H(router) > τ or p(confidence) < 0.20 triggers abstention. Fork the threshold to taste.
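Roughly this, assuming the router logits and a scalar confidence logit are exposed at each step; the threshold values are placeholders, not the shipped config.

```python
import torch
import torch.nn.functional as F

TAU = 1.0     # router-entropy threshold in nats (assumed value)
P_MIN = 0.20  # confidence-head floor, per the rule above

def should_abstain(router_logits: torch.Tensor, confidence_logit: torch.Tensor) -> bool:
    p = F.softmax(router_logits, dim=-1)
    entropy = -(p * p.clamp_min(1e-9).log()).sum()   # H(router)
    confidence = torch.sigmoid(confidence_logit)      # p(confidence)
    return bool(entropy > TAU or confidence < P_MIN)
```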
The live chat is in template-fallback mode today: canned greetings, an AST-safe arithmetic evaluator, and a regex stub for abstention. The trained 10M weights swap in once the rented-GPU continued-pretraining run lands. The fallback stays in place as a safety net.
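The arithmetic path of the fallback is the kind of thing you can sketch in a few lines; this is an illustration, not the deployed code.

```python
import ast
import operator

# Whitelisted operators; anything else is rejected.
_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.Pow: operator.pow, ast.USub: operator.neg,
}

def safe_eval(expr: str) -> float:
    """Evaluate arithmetic only; any other AST node raises ValueError."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.operand))
        raise ValueError("disallowed expression")
    return walk(ast.parse(expr, mode="eval"))
```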
Calibration first. Capability is a side effect.
The single grader systematically favored its own family. Replacing it with a deliberative council with vendor self-exclusion eliminated the bias — and reshuffled the leaderboard. Both the original and the corrected numbers are in the repo. This is the kind of finding that only surfaces if you publish the protocol, not just the score.
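The corrected protocol, reduced to a sketch; the grader interface and the median aggregation are assumptions, not the repo's exact code.

```python
from statistics import median
from typing import Callable

# (item, model_answer) -> score in [0, 1]; in practice these are API calls.
Grader = Callable[[str, str], float]

def council_grade(item: str, answer: str, answer_vendor: str,
                  council: dict[str, Grader]) -> float:
    # Vendor self-exclusion: graders from the answering model's vendor sit out.
    scores = [grade(item, answer)
              for vendor, grade in council.items()
              if vendor != answer_vendor]
    return median(scores)
```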
The first three-valued KGE to beat the OGBL TransE leaderboard. Replicated on PrimeKG (2023). Runs on a $2 microcontroller.
The pairs (Rosiglitazone, Sitagliptin, Gliclazide, Tolbutamide, Miglitol) ↔ Type-2 diabetes were filtered out of the train, validation, and test splits before the candidate sweep. The model recovered them from the surrounding graph structure — shared targets, side-effect profiles, mechanism families. This is the kind of pattern-finding a compressed KGE is supposed to do, and it did it. Not a discovery of new medicine — a faithful rediscovery of the public record on a tiny model.
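The holdout itself is simple; a sketch with illustrative triple strings rather than the dataset's real identifiers.

```python
# Held-out drug-disease edges, removed from every split before training
# and the candidate sweep. Relation and entity names are illustrative.
HELD_OUT = {
    ("Rosiglitazone", "treats", "type 2 diabetes"),
    ("Sitagliptin", "treats", "type 2 diabetes"),
    ("Gliclazide", "treats", "type 2 diabetes"),
    ("Tolbutamide", "treats", "type 2 diabetes"),
    ("Miglitol", "treats", "type 2 diabetes"),
}

def filter_split(triples):
    """Drop held-out pairs from a train/validation/test split."""
    return [t for t in triples if t not in HELD_OUT]
```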
They are three views of one question: does the model know when it's right?
NEO grades 7 frontier chat models on whether their stated confidence tracks actual correctness — and whether, when they don't know, they say so. Universal over-confidence on pattern-vs-reason items. Universal under-calibration on SimpleQA. The leaderboards measure something else.
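Two simple calibration summaries of the kind NEO is after, sketched here as generic metrics rather than NEO's exact scoring.

```python
import numpy as np

def overconfidence_gap(stated_conf: np.ndarray, correct: np.ndarray) -> float:
    """Mean stated confidence minus observed accuracy; positive means the
    model claims more certainty than it earns."""
    return float(stated_conf.mean() - correct.mean())

def expected_calibration_error(stated_conf: np.ndarray, correct: np.ndarray,
                               n_bins: int = 10) -> float:
    """Standard ECE over equal-width confidence bins."""
    bins = np.clip((stated_conf * n_bins).astype(int), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            ece += mask.mean() * abs(stated_conf[mask].mean() - correct[mask].mean())
    return float(ece)
```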
Tilelli's three-pathway router publishes its own entropy. A small confidence head reads that signal. When H(router) > τ, the model abstains. The same metric NEO measures across a black box is wired into the front of this one — and it's auditable, because the model is 10 megabytes.
The biomedical KGE has an agreement head: a small MLP that, for each (drug, relation, disease) query, predicts whether the ternary student will agree with its float teacher. Per-query AUC 0.755. Clinically the question that matters isn't "what's the model's average accuracy?" — it's "is this query the kind the model gets right?"
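A minimal sketch of such a head, assuming each query is featurized into a single vector; the layer sizes are illustrative, not the released architecture.

```python
import torch
import torch.nn as nn

class AgreementHead(nn.Module):
    """Small MLP: given features of a (drug, relation, disease) query,
    predict the probability that the ternary student agrees with its
    float teacher on that query."""
    def __init__(self, query_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(query_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, query_features: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.net(query_features)).squeeze(-1)
```

The reported AUC is then just this probability scored against whether student and teacher actually agreed on held-out queries.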
All three releases are Apache-2.0 licensed, CPU-friendly at inference, and auditable end-to-end by reading the source. The shared bet is that a small model that knows the shape of its own ignorance is more useful than a large model that confidently fills the gap. NEO measures the gap. Tilelli closes it on language. Tilelli Med closes it on biomedical link prediction.
Tilelli is the Tamazight word for freedom. The Imazighen — "the free people" — are a transnational indigenous people of North Africa whose language predates modern national borders by about three thousand years. The letter ⵣ (yaz) is on the Amazigh flag and stands for free man.
Naming small, low-power, auditable models after that idea isn't accidental. Each of the three releases runs without a vendor — on a CPU, on a $2 microcontroller, on a laptop you already own. Freedom to study, to fork, to deploy.
A tribute to Marrakech, which takes its name from the Tamazight Mur N'Akush, "Land of God." The Almoravid Berber dynasty made it a capital in 1062, when this language was already two thousand years old. And to His Majesty King Mohammed VI: on 17 June 2011 the new Moroccan constitution recognized Tamazight as an official language of the Kingdom alongside Arabic.
Questions, corrections, collaboration. Plain email.