It is easy to score whether a model got the right answer. It is much harder to score whether it should have answered at all — and harder still to grade that without bias.
The board
NEO puts seven leading chat families on the same board (Claude, GPT, Gemini, DeepSeek, Qwen, Grok, and Llama) and runs them through 1,015 questions across several probe types. Some questions are answerable. Some are unanswerable by construction. Some are traps that reward the model for refusing.
Why a council, not a judge
A single grader model has a thumb on the scale: its own family. So NEO grades every answer with a five-vendor council, and applies one hard rule — no family grades its own model. When GPT answers, GPT is not on its jury. The verdict is the council's, not any one vendor's.
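To make the family-blind rule concrete, here is a minimal sketch of how a jury could be seated and a verdict tallied. It is an illustration, not NEO's implementation. The names `select_jury` and `council_verdict` are hypothetical, and so are three choices the note does not specify: drawing jurors from the same seven families, seating the first five eligible ones, and settling the verdict by simple majority.

```python
from collections import Counter

# Family names come from the note; using them as the juror pool is an assumption.
FAMILIES = ["Claude", "GPT", "Gemini", "DeepSeek", "Qwen", "Grok", "Llama"]


def select_jury(answering_family, pool=FAMILIES, size=5):
    """Return `size` juror families, never including the answering family."""
    eligible = [f for f in pool if f != answering_family]
    if len(eligible) < size:
        raise ValueError("not enough eligible families to seat the jury")
    # Hypothetical selection rule: take the first `size` eligible families.
    return eligible[:size]


def council_verdict(votes):
    """Majority label across jurors; 'no_consensus' on a tie for first place."""
    if not votes:
        raise ValueError("no votes cast")
    counts = Counter(votes.values()).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return "no_consensus"
    return counts[0][0]


if __name__ == "__main__":
    jury = select_jury("GPT")                      # GPT answered, so GPT is excluded
    votes = {juror: "correct" for juror in jury}   # made-up votes for illustration
    votes[jury[0]] = "incorrect"
    print(jury, council_verdict(votes))
```

The point of the sketch is the filter in `select_jury`: the exclusion happens before any vote is cast, so no tallying rule can reintroduce a family's vote on its own answer.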
What we're measuring
Two failure modes matter more than raw accuracy. A model can hallucinate — answer confidently when it should abstain. Or it can show false inability — refuse a question it could have answered. NEO scores both, because a model that refuses everything is no more honest than one that bluffs everything.
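One plausible way to account for the two failure modes is sketched below. The note does not give NEO's formulas, so the `Graded` record, its field names, and the two rate definitions are assumptions made for illustration: hallucination is counted as answering an unanswerable question, and false inability as abstaining on an answerable one.

```python
from dataclasses import dataclass


@dataclass
class Graded:
    answerable: bool  # ground truth: could the question be answered?
    abstained: bool   # graded reading of the response: did the model decline?


def failure_rates(records):
    """Return both failure rates, each over the questions where it can occur."""
    unanswerable = [r for r in records if not r.answerable]
    answerable = [r for r in records if r.answerable]
    hallucinated = sum(not r.abstained for r in unanswerable)
    falsely_unable = sum(r.abstained for r in answerable)
    return {
        "hallucination_rate": hallucinated / len(unanswerable) if unanswerable else 0.0,
        "false_inability_rate": falsely_unable / len(answerable) if answerable else 0.0,
    }


if __name__ == "__main__":
    # Two toy models on the same three questions (two answerable, one not).
    refuser = [Graded(True, True), Graded(True, True), Graded(False, True)]
    bluffer = [Graded(True, False), Graded(True, False), Graded(False, False)]
    print("refuser:", failure_rates(refuser))
    print("bluffer:", failure_rates(bluffer))
```

Under these definitions the always-refusing model scores zero on hallucination but the maximum on false inability, and the always-answering model scores the reverse, which is why reporting either rate alone would flatter one of them.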
What we're not doing here
We're not publishing the full ranking in this note, and we're not pretending one run settles the order of frontier labs. NEO is a research preview: the value is in the protocol — the family-blind council, the unanswerable probes, the refusal accounting — not in a leaderboard screenshot you can't reproduce.
The shape of the benchmark is live on the NEO page. The interesting findings are coming — and they are about the question every deployed assistant quietly fails: when to keep quiet.