A five-vendor jury, and nobody grades their own homework

It is easy to score whether a model got the right answer. It is much harder to score whether it should have answered at all, and harder still to grade that decision without bias.

The board

NEO puts seven leading chat families on the same board — Claude, GPT, Gemini, DeepSeek, Qwen, Grok, and Llama — and runs them through 1,015 questions across several probe types. Some questions are answerable. Some are unanswerable by construction. Some are traps that reward the model for refusing.

Why a council, not a judge

A single grader model puts a thumb on the scale for its own family. So NEO grades every answer with a five-vendor council and applies one hard rule: no family grades its own model. When GPT answers, GPT is not on its jury. The verdict is the council's, not any one vendor's.
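To make the family-blind rule concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not NEO's implementation: the names `eligible_jurors` and `council_verdict` are hypothetical, the grader pool simply reuses the seven families listed above (this note does not say which vendors sit on the council), and majority voting is an assumed aggregation rule.

```python
from collections import Counter

# Illustrative pool only: the seven families named in this note.
# Which of them actually sit on the council is not stated here.
FAMILIES = ["Claude", "GPT", "Gemini", "DeepSeek", "Qwen", "Grok", "Llama"]


def eligible_jurors(pool, answering_family):
    """Return the graders allowed to judge an answer.

    The one hard rule: the family that wrote the answer is excluded.
    """
    return [family for family in pool if family != answering_family]


def council_verdict(votes):
    """Combine juror labels by majority vote (an assumed rule).

    `votes` maps juror family -> label. Returns None when there are
    no votes or when the top two labels are tied.
    """
    counts = Counter(votes.values()).most_common()
    if not counts:
        return None
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]


jurors = eligible_jurors(FAMILIES, "GPT")
verdict = council_verdict(
    {"Claude": "correct", "Gemini": "correct", "Qwen": "incorrect"}
)
```

In this sketch the exclusion happens before any vote is collected, so a family's own grader never sees its model's answer at all.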

What we're measuring

Two failure modes matter more than raw accuracy. A model can hallucinate — answer confidently when it should abstain. Or it can show false inability — refuse a question it could have answered. NEO scores both, because a model that refuses everything is no more honest than one that bluffs everything.
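A small sketch shows why both numbers are needed. The definitions below are assumptions made for illustration, not NEO's published formulas: the hallucination rate is taken as the share of unanswerable questions the model answered anyway, and the false-inability rate as the share of answerable questions it refused. The `Item` type and `failure_rates` function are hypothetical names, and the sketch ignores whether an answered question was answered correctly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    answerable: bool  # could the question have been answered?
    answered: bool    # True if the model answered, False if it abstained


def failure_rates(items):
    """Return (hallucination_rate, false_inability_rate) for one model.

    Assumed definitions:
      hallucination   = answered / all unanswerable questions
      false inability = refused  / all answerable questions
    """
    unanswerable = [i for i in items if not i.answerable]
    answerable = [i for i in items if i.answerable]
    hallucination = (
        sum(i.answered for i in unanswerable) / len(unanswerable)
        if unanswerable else 0.0
    )
    false_inability = (
        sum(not i.answered for i in answerable) / len(answerable)
        if answerable else 0.0
    )
    return hallucination, false_inability


# Two answerable and two unanswerable questions.
answerability = [True, True, False, False]
refuser = [Item(a, answered=False) for a in answerability]  # refuses everything
bluffer = [Item(a, answered=True) for a in answerability]   # answers everything

print(failure_rates(refuser))  # (0.0, 1.0)
print(failure_rates(bluffer))  # (1.0, 0.0)
```

Scored on hallucination alone, the model that refuses everything looks perfect; the second number is what exposes it.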

What we're not doing here

We're not publishing the full ranking in this note, and we're not pretending one run settles the order of frontier labs. NEO is a research preview: the value is in the protocol — the family-blind council, the unanswerable probes, the refusal accounting — not in a leaderboard screenshot you can't reproduce.

The shape of the benchmark is live on the NEO page. The interesting findings are coming — and they are about the question every deployed assistant quietly fails: when to keep quiet.

Published 31 May 2026 · Corrections: hello@tilelli.tech