One small lab. Three releases that share a single throughline — metacognition, the discipline of a model that knows what it knows. A 10M-parameter ternary language model. A calibration-first benchmark that ranks frontier chat models by honesty under pressure. A compressed biomedical knowledge-graph model that beats two leaderboards.
Each release is small enough to study end-to-end. Click in.
A 10M-parameter transformer descendant in {−1, 0, +1} weights. At matched parameter count, it beats a vanilla GPT-style baseline on TinyStories byte-level language modeling by 6.7σ. The router-entropy signal is wired straight into a confidence head: when the model is unsure, it says so.
13 probes, 1,015 items, 7 frontier chat models graded by a 5-vendor deliberative council. NEO measures the dangerous failure mode the leaderboards miss: a confident wrong answer, indistinguishable from a confident right one until the consequences arrive.
A ternary knowledge-graph embedding trained on OGBL-biokg + PrimeKG. Float teacher beats the OGBL ComplEx leaderboard. The ternary student edges out TransE at 5.3× compression — the first three-valued KGE to do so. Same recipe, replicated on PrimeKG (2023): quantization improves the teacher on both benchmarks.
Small enough to study. Big enough to surprise. Runs on a CPU.
Bigrams, common phrases, the obvious. Captures roughly 25% of FLOPs and lets attention skip the easy stuff.
Top-k causal attention with 8 heads. Pays for what it uses, ignores what it doesn't. Long-range pathway.
Wide ternary feed-forward at expand=4. Where the actual knowledge lives, quantized to {−1, 0, +1} plus a per-tensor scale (sketched below).
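For the curious, a minimal sketch of the ternary trick, assuming a BitNet-style mean-absolute per-tensor scale; the names and rounding rule here are illustrative, not the released training recipe.

```python
import torch

def ternarize(w: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    # One scale per tensor; weights rounded into {-1, 0, +1}.
    scale = w.abs().mean().clamp_min(1e-8)
    q = (w / scale).round().clamp_(-1, 1)
    return q, scale

def ternary_linear(x: torch.Tensor, q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Dequantize on the fly: y = x @ (scale * q)^T.
    return x @ (scale * q).t()
```

Storage only needs two bits per weight plus one float per tensor; the matmul above dequantizes on the fly for clarity.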
Most LLMs hallucinate confidently. Tilelli watches its own router entropy and a small confidence head — and when the signal goes flat, it just says so. No theatrics. No invented facts. The mechanism is auditable in five lines: H(router) > τ or p(confidence) < 0.20 triggers abstention. Fork the threshold to taste.
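Roughly this, assuming the router logits and a scalar confidence logit are exposed at each step; the threshold values are placeholders, not the shipped config.

```python
import torch
import torch.nn.functional as F

TAU = 1.0     # router-entropy threshold in nats (assumed value)
P_MIN = 0.20  # confidence-head floor, per the rule above

def should_abstain(router_logits: torch.Tensor, confidence_logit: torch.Tensor) -> bool:
    p = F.softmax(router_logits, dim=-1)
    entropy = -(p * p.clamp_min(1e-9).log()).sum()   # H(router)
    confidence = torch.sigmoid(confidence_logit)      # p(confidence)
    return bool(entropy > TAU or confidence < P_MIN)
```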
The live chat is in template-fallback mode today: canned greetings, an AST-safe arithmetic evaluator, and a regex stub for abstention. The trained 10M weights swap in once the rented-GPU continued-pretraining run lands. The fallback stays in place as a safety net.
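The arithmetic path of the fallback is the kind of thing you can sketch in a few lines; this is an illustration, not the deployed code.

```python
import ast
import operator

# Whitelisted operators; anything else is rejected.
_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.Pow: operator.pow, ast.USub: operator.neg,
}

def safe_eval(expr: str) -> float:
    """Evaluate arithmetic only; any other AST node raises ValueError."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.operand))
        raise ValueError("disallowed expression")
    return walk(ast.parse(expr, mode="eval"))
```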
Calibration first. Capability is a side effect.
The single grader systematically favored its own family. Replacing it with a deliberative council with vendor self-exclusion eliminated the bias — and reshuffled the leaderboard. Both the original and the corrected numbers are in the repo. This is the kind of finding that only surfaces if you publish the protocol, not just the score.
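The corrected protocol, reduced to a sketch; the grader interface and the median aggregation are assumptions, not the repo's exact code.

```python
from statistics import median
from typing import Callable

# (item, model_answer) -> score in [0, 1]; in practice these are API calls.
Grader = Callable[[str, str], float]

def council_grade(item: str, answer: str, answer_vendor: str,
                  council: dict[str, Grader]) -> float:
    # Vendor self-exclusion: graders from the answering model's vendor sit out.
    scores = [grade(item, answer)
              for vendor, grade in council.items()
              if vendor != answer_vendor]
    return median(scores)
```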
The first three-valued KGE to beat the OGBL TransE leaderboard. Replicated on PrimeKG (2023). Runs on a $2 microcontroller.
The pairs (Rosiglitazone, Sitagliptin, Gliclazide, Tolbutamide, Miglitol) ↔ Type-2 diabetes were filtered out of the train, validation, and test splits before the candidate sweep. The model recovered them from the surrounding graph structure — shared targets, side-effect profiles, mechanism families. This is the kind of pattern-finding a compressed KGE is supposed to do, and it did it. Not a discovery of new medicine — a faithful rediscovery of the public record on a tiny model.
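The holdout itself is simple; a sketch with illustrative triple strings rather than the dataset's real identifiers.

```python
# Held-out drug-disease edges, removed from every split before training
# and the candidate sweep. Relation and entity names are illustrative.
HELD_OUT = {
    ("Rosiglitazone", "treats", "type 2 diabetes"),
    ("Sitagliptin", "treats", "type 2 diabetes"),
    ("Gliclazide", "treats", "type 2 diabetes"),
    ("Tolbutamide", "treats", "type 2 diabetes"),
    ("Miglitol", "treats", "type 2 diabetes"),
}

def filter_split(triples):
    """Drop held-out pairs from a train/validation/test split."""
    return [t for t in triples if t not in HELD_OUT]
```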
They are three views of one question: does the model know when it's right?
NEO grades 7 frontier chat models on whether their stated confidence tracks actual correctness — and whether, when they don't know, they say so. Universal over-confidence on pattern-vs-reason items. Universal under-calibration on SimpleQA. The leaderboards measure something else.
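Two simple calibration summaries of the kind NEO is after, sketched here as generic metrics rather than NEO's exact scoring.

```python
import numpy as np

def overconfidence_gap(stated_conf: np.ndarray, correct: np.ndarray) -> float:
    """Mean stated confidence minus observed accuracy; positive means the
    model claims more certainty than it earns."""
    return float(stated_conf.mean() - correct.mean())

def expected_calibration_error(stated_conf: np.ndarray, correct: np.ndarray,
                               n_bins: int = 10) -> float:
    """Standard ECE over equal-width confidence bins."""
    bins = np.clip((stated_conf * n_bins).astype(int), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            ece += mask.mean() * abs(stated_conf[mask].mean() - correct[mask].mean())
    return float(ece)
```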
Tilelli's three-pathway router publishes its own entropy. A small confidence head reads that signal. When H(router) > τ, the model abstains. The same metric NEO measures across a black box is wired into the front of this one — and it's auditable, because the model is 10 megabytes.
The biomedical KGE has an agreement head: a small MLP that, for each (drug, relation, disease) query, predicts whether the ternary student will agree with its float teacher. Per-query AUC 0.755. Clinically the question that matters isn't "what's the model's average accuracy?" — it's "is this query the kind the model gets right?"
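A minimal sketch of such a head, assuming each query is featurized into a single vector; the layer sizes are illustrative, not the released architecture.

```python
import torch
import torch.nn as nn

class AgreementHead(nn.Module):
    """Small MLP: given features of a (drug, relation, disease) query,
    predict the probability that the ternary student agrees with its
    float teacher on that query."""
    def __init__(self, query_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(query_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, query_features: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.net(query_features)).squeeze(-1)
```

The reported AUC is then just this probability scored against whether student and teacher actually agreed on held-out queries.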
All three releases are Apache-2.0 licensed, CPU-friendly at inference, and auditable end-to-end by reading the source. The shared bet is that a small model that knows the shape of its own ignorance is more useful than a large model that confidently fills the gap. NEO measures the gap. Tilelli closes it on language. Tilelli Med closes it on biomedical link prediction.
Tilelli is the Tamazight word for freedom. The Imazighen — "the free people" — are a transnational indigenous people of North Africa whose language predates modern national borders by about three thousand years. The letter ⵣ (yaz) is on the Amazigh flag and stands for free man.
Naming small, low-power, auditable models after that idea isn't accidental. Each of the three releases runs without a vendor — on a CPU, on a $2 microcontroller, on a laptop you already own. Freedom to study, to fork, to deploy.
A tribute to Marrakech, which takes its name from the Tamazight Mur N'Akush, "Land of God." The Almoravid Berber dynasty made it a capital in 1062, when this language was already two thousand years old. And to His Majesty King Mohammed VI: on 17 June 2011 the new Moroccan constitution recognized Tamazight as an official language of the Kingdom alongside Arabic.
Questions, corrections, collaboration. Plain email.