
When the small model learns to say "I don't know."

An honest write-up of five experiments testing whether router entropy is a metacognition signal in a 10M-parameter LM.

24 May 2026 · Negative result · ~9 min read

Most language models hallucinate confidently. We wanted Tilelli — a 10M-parameter ternary LM — to do something different: notice when it doesn't know, and say so. We had a specific hypothesis about how that signal would emerge. After five experiments, the hypothesis is disproven. But the model still abstains gracefully, just for different reasons than we thought. This is the write-up we owe.

The hypothesis

Tilelli routes every token through three lightweight pathways — a local convolution, a sparse-attention head, and a ternary dense feed-forward — chosen per token by a small router. Our hypothesis was that the router's entropy would correlate with the model's uncertainty: when the input was clearly in-domain, the router would commit to one pathway (low entropy); when the input was strange or unanswerable, the router would hesitate (high entropy). Call that signal H(router). The deploy plan was simple: H(router) > τ → the model abstains.
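For concreteness, here is a minimal sketch of that deploy rule. It assumes the router emits one logit per pathway for each token, that entropy is measured in nats, and that per-token entropies are averaged over the sequence before thresholding. The function names, the averaging step, and any value of tau are illustrative assumptions, not the project's actual code.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def router_entropy(router_logits):
    """Shannon entropy (nats) of the routing distribution for one token."""
    probs = softmax(router_logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def should_abstain(per_token_router_logits, tau):
    """H(router) > tau -> abstain. Averaging over tokens is an assumption."""
    entropies = [router_entropy(l) for l in per_token_router_logits]
    return sum(entropies) / len(entropies) > tau
```

With three pathways the entropy is bounded above by ln 3 ≈ 1.10 nats (a perfectly undecided router) and approaches 0 when the router commits to one pathway, so any useful tau has to sit between those two values.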

This is a clean, publishable story — if it worked.

What we measured

We built a 210-prompt evaluation set spanning seven regimes: in-domain chat, out-of-distribution topic, out-of-distribution style, factually misleading, long input, near-miss factuals (the "NEO" probe), and pure gibberish. For each regime, we measured whether any of seven uncertainty signals — including H(router), the abstain-head probability abstain_p, and the simple output-side baseline max_softmax_mean — could discriminate that regime from in-domain chat.
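The discrimination test can be sketched with the pairwise (Mann-Whitney) definition of AUROC: the probability that a randomly chosen prompt from the target regime gets a higher uncertainty score than a randomly chosen in-domain chat prompt. This is a generic illustration, not the project's evaluation script. The function name is hypothetical, and it assumes scores are oriented so that higher means more uncertain (a confidence score such as max_softmax_mean would be negated first).

```python
def auroc(regime_scores, in_domain_scores):
    """Pairwise AUROC: P(regime score > in-domain score), ties count half.

    0.5 is chance; 1.0 means the signal separates the regime perfectly.
    """
    wins = 0.0
    for r in regime_scores:
        for c in in_domain_scores:
            if r > c:
                wins += 1.0
            elif r == c:
                wins += 0.5
    return wins / (len(regime_scores) * len(in_domain_scores))
```

The quadratic loop is fine at this scale: with 210 prompts split across seven regimes, each comparison involves only a few hundred to a few thousand pairs.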

The five attempts

| Variant | Where | What changed | Cross-regime AUROC | Chat coherent? |
|---|---|---|---|---|
| v4 (baseline) | | abstain-aware SFT, deployed | 0.51 (chance) | yes |
| v5 | CPU | router finetune, raw text in-domain | | NO (degenerate) |
| v6 | CPU | router finetune, chat-format in-domain | 0.42 entropy / 0.69 chat-fmt | partial (syntax-OOD only) |
| v7 | RTX 4090 | joint router+abstain, MC weight 20 | 0.76 | NO (garbled) |
| v8a | A40 | MC=5, abstain=1 | 0.80 | NO |
| v8b | A40 | MC=0, abstain=5 (pure abstain SFT) | 0.85 (best) | NO |
| splice | CPU | graft v7 abstain head onto v4 base | 0.54 (collapsed) | yes |
Five training attempts plus a graft control. Every successful AUROC came with broken generation; every attempt to preserve generation collapsed the signal.

The headline: disproven

Under our pre-registered decision rule — "router entropy must beat the output-side baseline on at least 4 of 7 regimes" — the result is 0/7. Across CPU, GPU, raw text, chat-format, three different loss-weight schedules, and a controlled head-graft, we could not find a setting where the router encoded a transferable semantic-uncertainty signal.
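The decision rule itself is simple enough to state as code. The sketch below assumes each signal has already been reduced to one AUROC per regime, and that "beat" means strictly greater (a tie does not count as a win). The function name and the dictionary layout are illustrative assumptions.

```python
def router_beats_baseline(router_auroc, baseline_auroc, min_wins=4):
    """Apply the rule: router entropy must beat the baseline on >= min_wins regimes.

    Both arguments map regime name -> AUROC for that signal.
    Returns (rule_satisfied, sorted list of regimes the router won).
    """
    if router_auroc.keys() != baseline_auroc.keys():
        raise ValueError("both signals must be scored on the same regimes")
    wins = sorted(r for r in router_auroc if router_auroc[r] > baseline_auroc[r])
    return len(wins) >= min_wins, wins
```

Under this rule a 0/7 outcome returns `(False, [])`: the router signal did not win a single regime, so there is no subset of regimes on which the hypothesis could be rescued after the fact.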

The counter-intuitive part

v8b, trained with the metacognition loss weight set to zero, produced the strongest signal of the whole project: cross-regime AUROC 0.85 and a 100% abstention rate on gibberish. That should be impossible if the router were doing the work. So we ran the graft control: load v4 (the deployed, well-generating checkpoint) and splice v7's trained abstain head onto it. The AUROC dropped from v7's 0.76 to 0.54, back to chance.

The abstain head is not a transferable module. Its signal is bound up with the specific router state it was trained against. You cannot peel it off and apply it elsewhere.

Why the router breaks

Even with the metacognition loss weight set to zero, every router finetune broke generation. The mechanism is now clear: the cross-entropy (CE) loss on the in-domain subset still backpropagates through the unfrozen router linear layers. 500 steps × batch 32 = 16,000 in-domain examples, spread over 500 gradient updates. That is enough to shift the routing distribution far from what the rest of the (frozen) network was tuned against. The router is fragile at this scale: in our runs it could not be retrained on any subset distribution without breaking the joint optimum.
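One way to see why zeroing the metacognition weight does not protect the router is to write the gradient out. The decomposition below assumes the training objective is a plain weighted sum of the three losses named in this post; the symbols (the lambda weights, theta_r for the router parameters) are our notation for this sketch, not taken from the training code.

```latex
% Assumed objective: weighted sum of CE, metacognition (MC), and abstain losses
\mathcal{L} = \mathcal{L}_{\mathrm{CE}}
            + \lambda_{\mathrm{MC}}\,\mathcal{L}_{\mathrm{MC}}
            + \lambda_{\mathrm{abs}}\,\mathcal{L}_{\mathrm{abs}}

% Gradient with respect to the unfrozen router parameters \theta_r
\nabla_{\theta_r}\mathcal{L}
  = \nabla_{\theta_r}\mathcal{L}_{\mathrm{CE}}
  + \lambda_{\mathrm{MC}}\,\nabla_{\theta_r}\mathcal{L}_{\mathrm{MC}}
  + \lambda_{\mathrm{abs}}\,\nabla_{\theta_r}\mathcal{L}_{\mathrm{abs}}

% Setting \lambda_{\mathrm{MC}} = 0 removes only the middle term
\lambda_{\mathrm{MC}} = 0 \;\Rightarrow\;
\nabla_{\theta_r}\mathcal{L}
  = \nabla_{\theta_r}\mathcal{L}_{\mathrm{CE}}
  + \lambda_{\mathrm{abs}}\,\nabla_{\theta_r}\mathcal{L}_{\mathrm{abs}}
```

Under that assumption the CE term is generally nonzero, because the router's output feeds the token prediction, so the router keeps moving as long as its parameters are trainable.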

What actually works on tilelli.tech today

The deployed model (v4) does abstain gracefully. It just does so for a simpler reason than our hypothesis: the output-side max-softmax baseline. When the model is confident, the highest-probability next token is sharp (max-softmax near 1.0). When the model is unsure, the distribution is flat (max-softmax low). This is an architecture-agnostic baseline that:

  • Reaches AUROC 0.93 on gibberish detection.
  • Triggers abstention on 9 of 10 prompts in our held-out "I don't know" probe, which meets the script's PASS gate of ≥ 9. (The 2026-05-20 deploy probe scored 10/10 on slightly different phrasing; the bundled v4 checkpoint in the kit re-verifies at 9/10.)
  • Has a 0% false-positive rate on in-domain chat.
  • Required no special training to expose — it falls out of the abstain-aware SFT we did months ago.
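As a rough illustration of the baseline, the sketch below computes the maximum softmax probability at each generated position and averages it over the sequence, which is what we take the name max_softmax_mean to mean. The function names and the abstention threshold are assumptions for illustration; the deployed threshold is not stated in this post.

```python
import math

def max_softmax(logits):
    """Largest softmax probability for one position's logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    # After subtracting the max, the largest exponent is exp(0) = 1.
    return max(exps) / sum(exps)

def max_softmax_mean(per_token_logits):
    """Mean of the per-position max-softmax values over a sequence."""
    values = [max_softmax(l) for l in per_token_logits]
    return sum(values) / len(values)

def abstain(per_token_logits, threshold):
    """Abstain when average confidence falls below the threshold (assumed rule)."""
    return max_softmax_mean(per_token_logits) < threshold
```

A flat distribution over a vocabulary of size V gives a max-softmax of 1/V, while a sharply peaked one approaches 1.0. The signal needs nothing beyond the logits the model already produces, which is why it works regardless of architecture.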

The flashy story we wanted to tell — "the router learns to hesitate" — is wrong. The boring story we can actually defend — "the model's own confidence is observable from its output distribution" — is right.

Does this solve hallucination?

No. Be clear about that. The model abstains reliably on:

  • Pure gibberish input (AUROC 0.93).
  • Questions of the form "What did I have for breakfast?" / "Who won the 2034 World Cup?" — things the model has zero evidence for.

The model still fails on the dangerous class: questions where it has partial, incorrect knowledge and is confidently wrong. That's the same failure mode as a 70-billion-parameter model, just at smaller scale. Knowing when you don't know is not the same as knowing when you're wrong about what you think you know.

Why we're publishing this

A model whose "I don't know" we have measured and found reliable is more trustworthy than one backed by an unverified claim about how it knows. A negative result, clearly published, is more useful than a confident overclaim, both for the field and for anyone who would deploy this code.

The full data, weights, and reproducible scripts are open. Every numerical claim above is verified by an exit-non-zero check against the bundled v4 checkpoint. If a number on this page is wrong, the verification will say so.

What's next

Disproving the separable-module story is not the end of the metacognition arc; it is the start of a different one. The v8b experiment gave us something narrow but useful: an existence proof that an abstain signal lives inside the joint state of the model. We just could not peel it off as a module — the splice control made that unambiguous.

The next concept does not try to. In our Barbary Lion research line we're validating a read-only confidence signal that doesn't require retraining the base model: no weight updates, no module bolted on. It reads what's already there, and decides. Results are coming; we're not ready to share the mechanism yet.

Honest about what we built. Honest about what we didn't. Curious about what comes next.