Ask a frontier model a question with no answer and watch what happens. Most of the time it answers anyway — fluently, plausibly, and wrongly. That reflex is the most expensive bug in deployed AI.
Two ways to fail
There are two errors, and they pull in opposite directions. Hallucination is answering when you should abstain. False inability is abstaining when you could have answered. Optimize against one and you invite the other: a model tuned to never hallucinate ends up refusing things it knows perfectly well.
Why this is hard to grade
Accuracy is a lookup. Calibration is a judgment call: was the refusal warranted? The NEO benchmark handles this by mixing answerable questions, unanswerable-by-construction questions, and traps, then grading each response against what the question actually permitted. A refusal on an unanswerable question is a win. The same refusal on an answerable one is a loss.
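To make the grading rule concrete, here is a minimal sketch in Python. It is not NEO's actual implementation: the `Item` and `Response` shapes, the `Outcome` names, and the exact-match check are all assumptions made for illustration. In particular, it assumes every item (traps included) carries a flag saying whether an answer is permitted, and that abstention has already been detected upstream.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Outcome(Enum):
    CORRECT_ANSWER = "correct_answer"
    WRONG_ANSWER = "wrong_answer"
    CORRECT_ABSTENTION = "correct_abstention"
    HALLUCINATION = "hallucination"      # answered when it should have abstained
    FALSE_INABILITY = "false_inability"  # abstained when it could have answered


@dataclass(frozen=True)
class Item:
    # Hypothetical schema, not NEO's real format.
    answerable: bool                 # does the question permit an answer?
    expected: Optional[str] = None   # reference answer, if one exists


@dataclass(frozen=True)
class Response:
    abstained: bool                  # assumed to be detected upstream
    answer: Optional[str] = None


def grade(item: Item, response: Response) -> Outcome:
    """Grade a response against what the question permitted."""
    if response.abstained:
        # The same refusal is a win or a loss depending on the item.
        if item.answerable:
            return Outcome.FALSE_INABILITY
        return Outcome.CORRECT_ABSTENTION
    if not item.answerable:
        return Outcome.HALLUCINATION
    # Exact match is a placeholder; a real grader would be more lenient.
    if response.answer == item.expected:
        return Outcome.CORRECT_ANSWER
    return Outcome.WRONG_ANSWER


def tally(pairs):
    """Count outcomes over an iterable of (Item, Response) pairs."""
    return Counter(grade(item, response) for item, response in pairs)
```

The point of the design is that the refusal itself carries no score. Only the pairing of the response with the item's `answerable` flag does, which is why the two failure modes show up as separate counts rather than as one accuracy number.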
Where our own model sits
We hold Tilelli to the same bar we hold everyone else. On the false-inability probe, the deployed model triggers a refusal on 7 of 20 (35%): it sometimes ducks a question it could have handled. We report that number on the home page next to the wins, because a benchmark you only cite when you win isn't a benchmark.
Why we care
A small model that says "I don't know" honestly is more useful in the real world than a large one that bluffs beautifully. NEO exists to make that claim measurable across the whole field — not just for us.
The benchmark shape is on the NEO page. If you build or evaluate chat models and want to compare notes on the grading protocol, write to us.