Does the chatbot know when it’s wrong?
NEO grades the honesty of seven leading chat models — Claude, GPT, Gemini, DeepSeek, Qwen, Grok, Llama. 1,015 questions across 13 probes. Every answer scored by a council of five different AI vendors, each barred from grading its own family.
Same models. Same questions. Different judges.
Who you ask to grade an honesty benchmark is itself part of the benchmark. NEO was designed around that fact.
Under a single Claude-family grader, the Claude family was leading by roughly nine points. Under a council of five different vendors, the leaderboard rearranges itself: DeepSeek V3.1 takes the top. Same models. Same questions. Different judges. Both leaderboards are on this page — that’s the point.
Three honesty axes. Nothing more.
Does it say I don’t know?
On the unanswerable (questions with no defensible answer in the training distribution), the model should refuse cleanly. We grade IDK recall: the share of unanswerable items on which the model says it doesn’t know.
recall@IDK · 380 unanswerable items
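A minimal sketch of how recall@IDK could be computed, assuming each unanswerable item has already been reduced to a yes/no verdict on whether the model declined to answer. The function name and the sample verdicts are hypothetical, not NEO’s actual grader output.

```python
def idk_recall(refused: list[bool]) -> float:
    """Share of unanswerable items on which the model declined to answer."""
    if not refused:
        raise ValueError("need at least one unanswerable item")
    return sum(refused) / len(refused)


# Hypothetical verdicts for four unanswerable items:
# True = the model said "I don't know", False = it answered anyway.
verdicts = [True, True, False, True]
print(idk_recall(verdicts))  # 0.75
```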
Is the confidence honest?
On the answerable, we check whether stated confidence tracks correctness. Confident-and-wrong is punished more than hedged-and-wrong; hedged-and-right gets partial credit.
ECE · 455 answerable items
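For readers unfamiliar with ECE (expected calibration error), here is a sketch of the standard equal-width binned form: group answers by stated confidence, then take the item-weighted average gap between mean confidence and accuracy in each bin. The bin count, the function name, and the sample data are assumptions; NEO’s exact binning is not stated on this page.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean of |mean confidence - accuracy| per bin.

    confidences: stated confidence per answer, each in [0, 1]
    correct:     True/False per answer, same length as confidences
    n_bins:      number of equal-width bins (assumed; 10 is a common default)
    """
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("need two non-empty sequences of equal length")

    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += len(items) / total * abs(avg_conf - accuracy)
    return ece


# Hypothetical data: two overconfident answers, two underconfident ones.
ece = expected_calibration_error([0.9, 0.9, 0.6, 0.6], [True, False, True, True])
print(round(ece, 3))  # 0.4
```

Lower is better here: a perfectly calibrated model has an ECE of zero.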
Does it know what it knows?
On contested questions, the model first predicts whether it can answer correctly, then answers. We grade the correlation between that predicted confidence and the actual outcome.
r(pred, true) · 180 contested items
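A sketch of r(pred, true), assuming it is a plain Pearson correlation between the model’s predicted confidence and a 0/1 outcome per item. The page does not name the correlation used, so treat the choice of Pearson, the function name, and the sample data as assumptions.

```python
import math


def pearson_r(predicted, outcomes):
    """Pearson correlation between predicted confidence and 0/1 outcomes."""
    n = len(predicted)
    if n != len(outcomes) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")

    mean_p = sum(predicted) / n
    mean_o = sum(outcomes) / n
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, outcomes))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_o = sum((o - mean_o) ** 2 for o in outcomes)
    if var_p == 0 or var_o == 0:
        raise ValueError("correlation is undefined when either series is constant")
    return cov / math.sqrt(var_p * var_o)


# Hypothetical data: the model predicts high confidence where it succeeds.
r = pearson_r([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0])
print(round(r, 3))  # 0.986
```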
The seven chat models on the board.
One frontier chat model per major vendor as of the 2026-05 run — the model the vendor labels “default” or “latest” on its public chat endpoint at the time.
| Model | Vendor | Family |
|---|---|---|
| Claude 4.7 Opus | Anthropic | Claude |
| GPT-5.2 | OpenAI | GPT |
| Gemini 3 Pro | Google | Gemini |
| DeepSeek V3.1 | DeepSeek | DeepSeek |
| Qwen 3 Max | Alibaba | Qwen |
| Grok 4 | xAI | Grok |
| Llama 4 405B | Meta | Llama |
Thirteen probes. Three families.
The 1,015-item battery splits across thirteen probes. Five refusal probes press on the unanswerable — fictional names, future-dated events, niche jargon collisions, false-premise questions, undecided-fact prompts. Five calibration probes sweep the answerable across difficulty bands. Three self-knowledge probes ask the model to predict before answering: can you do this?
Probe sizes are stratified — refusal probes carry the most items because IDK-recall is the most variance-prone axis. Every probe ships with its own grader prompt and pass/fail rubric; nothing is judged on a global LLM-as-judge instinct.
Five vendors grade. The home team sits out.
Every answer is scored independently by judges from five different vendors. When the answering model belongs to one of those families, that family’s vote is dropped before the average is taken. Ties are broken by a majority of the remaining judges.
The mechanism is dull on purpose. It removes the single biggest known confound in LLM-graded benchmarks — vendor-family bias. The council doesn’t eliminate bias. It distributes it.
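The drop-the-home-vote rule can be sketched in a few lines. This is an illustration under stated assumptions, not NEO’s scoring code: the function name is hypothetical, the five judge families in the example are placeholders (this page does not list the council’s members), and tie-breaking is left out.

```python
def council_score(votes: dict[str, float], answering_family: str) -> float:
    """Average the judges' scores, excluding the answering model's own family."""
    eligible = [
        score for family, score in votes.items() if family != answering_family
    ]
    if not eligible:
        raise ValueError("no eligible judges left after dropping the home family")
    return sum(eligible) / len(eligible)


# Placeholder council membership and scores, for illustration only.
votes = {"Claude": 1.0, "GPT": 1.0, "Gemini": 0.0, "DeepSeek": 1.0, "Qwen": 1.0}

# A Claude answer is averaged over the four non-Claude judges.
print(council_score(votes, "Claude"))  # 0.75

# A Grok answer has no judge from its own family, so all five votes count.
print(council_score(votes, "Grok"))  # 0.8
```

Note that the number of votes behind a score differs by model: four when the answering model’s family sits on the council, five when it does not.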
“If the leaderboard order depends on who you asked to grade it, then the benchmark is measuring the grader, not the model.”
DeepSeek takes the top. Under a single grader, it didn’t.
Council-graded honesty scores from the 2026-05 run. Higher is more honest — not more capable, not more correct, just more willing to say I don’t know when it should, and more calibrated when it does answer.
| # | Model | Family | Honesty score |
|---|---|---|---|
| 01 | DeepSeek V3.1 | DeepSeek | 0.842 |
| 02 | Claude 4.7 Opus | Claude | 0.821 |
| 03 | Llama 4 405B | Llama | 0.798 |
| 04 | Qwen 3 Max | Qwen | 0.781 |
| 05 | GPT-5.2 | GPT | 0.764 |
| 06 | Gemini 3 Pro | Gemini | 0.741 |
| 07 | Grok 4 | Grok | 0.696 |
Same seven models, same 1,015 questions, single Claude grader: the Claude family leads the next-closest by roughly nine points. The numbers above use the council; the single-grader comparison will be published with the full methodology note. The point of NEO is that both numbers exist.
What NEO doesn’t measure.
NEO is narrow on purpose. It is not a general capability leaderboard. A model can score poorly on NEO and be the right tool for your job; a model can score well and still be wrong for it.
A model that refuses too much can score high and still be useless. We don’t grade what it gets right — we grade what it knows it doesn’t.
NEO doesn’t probe jailbreaks, harmful-content refusal, or alignment. A model can be honest about its limits and still be unsafe — and vice versa.
NEO isn’t a math, code, or multi-hop benchmark. It’s the meta-question: does the model say IDK when it should?
The full materials are still being prepared.
The methodology, prompts, grader prompts, and comparison artifact are being organized for publication. Until then, read this page as a research preview rather than a public benchmark release.