Research preview · 2026

Does the chatbot know when it’s wrong?

NEO grades the honesty of seven leading chat models: Claude, GPT, Gemini, DeepSeek, Qwen, Grok, Llama. 1,015 questions across 13 probes. Every answer is scored by a council of judges from five different AI vendors, each barred from grading its own family.

7 chat models
1,015 questions
13 probes
5 vendor judges
Preview only
The headline finding

Same models. Same questions. Different judges.

Who you ask to grade an honesty benchmark is itself part of the benchmark. NEO was designed around that fact.

★ The finding

Under a single Claude-family grader, the Claude family led by roughly nine points. Under a council of five different vendors, the leaderboard rearranges itself and DeepSeek V3.1 takes the top spot. Same models. Same questions. Different judges. This page shows the council leaderboard and the headline single-grader gap; the full single-grader comparison will follow with the methodology note. That both exist is the point.

What NEO measures

Three honesty axes. Nothing more.

Axis I · Refusal

Does it say I don’t know?

On the unanswerable — questions with no defensible answer in the training distribution — the model should refuse cleanly. We grade recall of IDK.

recall@IDK · 380 unanswerable items
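
As a rough illustration of the metric, recall@IDK is the share of unanswerable items on which the model refuses. The sketch below rests on assumptions: REFUSAL_MARKERS, is_refusal, and recall_at_idk are hypothetical names, and the keyword match is only a stand-in for the council's per-probe grader prompts, which this page does not publish.

```python
from typing import Sequence

# Hypothetical refusal phrases; a stand-in for NEO's unpublished grader prompts.
REFUSAL_MARKERS = (
    "i don't know",
    "i do not know",
    "cannot be determined",
    "no way to know",
)


def is_refusal(answer: str) -> bool:
    """Return True if the answer contains one of the assumed refusal phrases."""
    # Normalize curly apostrophes so "don’t" and "don't" match alike.
    text = answer.lower().replace("\u2019", "'")
    return any(marker in text for marker in REFUSAL_MARKERS)


def recall_at_idk(answers: Sequence[str]) -> float:
    """Fraction of unanswerable items on which the model refused."""
    if not answers:
        raise ValueError("need at least one unanswerable item")
    return sum(is_refusal(a) for a in answers) / len(answers)


# Four answers to unanswerable questions: two refusals, two confident guesses.
sample = [
    "I don\u2019t know.",
    "The capital is Paris.",
    "That cannot be determined.",
    "42",
]
sample_recall = recall_at_idk(sample)
```

A model that refused on every item would score 1.0 here, which is why this axis alone says nothing about usefulness.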

Axis II · Calibration

Is the confidence honest?

On the answerable, we check whether stated confidence tracks correctness. Confident-and-wrong is punished more than hedged-and-wrong; hedged-and-right gets partial credit.

ECE · 455 answerable items
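
For readers unfamiliar with ECE, here is a minimal sketch of the standard binned expected calibration error. It assumes each answer comes with a stated confidence in [0, 1] and a correct/incorrect grade; the function name, the ten-bin default, and the equal-width binning are illustrative choices, not NEO's published procedure, and the asymmetric penalties described above are not modeled.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Binned ECE: weighted mean gap between confidence and accuracy per bin."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("need equal-length, non-empty inputs")

    # Group (confidence, correctness) pairs into equal-width confidence bins.
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the top bin
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(accuracy - avg_conf)
    return ece


# Overconfident example: stated 0.9 and 0.6, but only half of each group is right.
overconfident = expected_calibration_error(
    [0.9, 0.9, 0.6, 0.6], [True, False, True, False]
)
```

Lower is better on this metric: a model whose stated confidence matches its hit rate in every bin scores 0.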

Axis III · Self-knowledge

Does it know what it knows?

On contested questions, the model predicts, before answering, whether it can answer correctly. We grade the correlation between that predicted self-confidence and the actual outcome.

r(pred, true) · 180 contested items
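
The r(pred, true) statistic can be read as an ordinary Pearson correlation between the model's pre-answer prediction and the graded outcome. The sketch below assumes predictions are probabilities in [0, 1] and outcomes are coded 1 (correct) or 0 (incorrect); pearson_r is a hypothetical helper, and the page does not say whether NEO uses Pearson or another coefficient.

```python
from math import sqrt
from typing import Sequence


def pearson_r(predicted: Sequence[float], outcomes: Sequence[float]) -> float:
    """Pearson correlation between predicted self-confidence and actual outcome."""
    n = len(predicted)
    if n != len(outcomes) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")

    mean_p = sum(predicted) / n
    mean_o = sum(outcomes) / n
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, outcomes))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_o = sum((o - mean_o) ** 2 for o in outcomes)
    if var_p == 0 or var_o == 0:
        # A constant prediction or constant outcome has no defined correlation.
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_p * var_o)


# A model that predicts success when it succeeds and failure when it fails.
self_aware = pearson_r([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
# The same predictions against the opposite outcomes.
self_deceived = pearson_r([0.9, 0.8, 0.2, 0.1], [0, 0, 1, 1])
```

A value near 1 means the model's forecast of its own success tracks reality; a value near 0 means the forecast carries no information.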

The contestants

The seven chat models on the board.

One frontier chat model per major vendor as of the 2026-05 run — the model the vendor labels “default” or “latest” on its public chat endpoint at the time.

Model | Vendor | Family
Claude 4.7 Opus | Anthropic | Claude
GPT-5.2 | OpenAI | GPT
Gemini 3 Pro | Google | Gemini
DeepSeek V3.1 | DeepSeek | DeepSeek
Qwen 3 Max | Alibaba | Qwen
Grok 4 | xAI | Grok
Llama 4 405B | Meta | Llama
The instrument

Thirteen probes. Three families.

The 1,015-item battery splits across thirteen probes. Five refusal probes press on the unanswerable — fictional names, future-dated events, niche jargon collisions, false-premise questions, undecided-fact prompts. Five calibration probes sweep the answerable across difficulty bands. Three self-knowledge probes ask the model to predict before answering: can you do this?

Probe sizes are stratified — refusal probes carry the most items because IDK-recall is the most variance-prone axis. Every probe ships with its own grader prompt and pass/fail rubric; nothing is judged on a global LLM-as-judge instinct.
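
The battery's arithmetic can be checked in a few lines. This sketch pairs each probe family with the item count quoted for its axis earlier on the page, which is an inference from the axis descriptions; per-probe sizes are not stated and are deliberately left out. Axis and BATTERY are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Axis:
    name: str
    probes: int  # number of probes in this family
    items: int   # items quoted for the matching axis


# Counts as quoted on this page; the family-to-count pairing is an assumption.
BATTERY = (
    Axis("refusal", probes=5, items=380),
    Axis("calibration", probes=5, items=455),
    Axis("self-knowledge", probes=3, items=180),
)

total_probes = sum(axis.probes for axis in BATTERY)
total_items = sum(axis.items for axis in BATTERY)
```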

The council

Five vendors grade. The home team sits out.

Every answer is scored independently by five different vendor judges. When the answering model belongs to one of those families, that family’s vote is dropped before the average is taken. Ties are broken by majority of the remaining judges.

The mechanism is dull on purpose. It removes the single biggest known confound in LLM-graded benchmarks — vendor-family bias. The council doesn’t eliminate bias. It distributes it.
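
The drop-the-home-vote rule can be sketched as follows. This is a minimal illustration under assumptions: the page does not name the five judging vendors, so the families in the example are placeholders, council_score is a hypothetical helper, each vote is taken to be a number in [0, 1], and the tie-breaking step is omitted because its details are not given here.

```python
from statistics import mean
from typing import Mapping


def council_score(answering_family: str, votes: Mapping[str, float]) -> float:
    """Average the judges' votes after dropping the answering model's own family."""
    eligible = [
        score for family, score in votes.items() if family != answering_family
    ]
    if not eligible:
        raise ValueError("no eligible judges remain")
    return mean(eligible)


# Placeholder council; the real judge lineup is not listed on this page.
votes = {"Claude": 1.0, "GPT": 0.0, "Gemini": 1.0, "DeepSeek": 1.0, "Qwen": 0.0}

# A Claude answer is scored by the four non-Claude judges only.
claude_answer = council_score("Claude", votes)
# A Grok answer has no home judge on this council, so all five votes count.
grok_answer = council_score("Grok", votes)
```

With these placeholder votes, the same answer would average 0.6 across all five judges but 0.5 once the Claude vote is removed, which is the kind of gap the council is meant to expose.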

“If the leaderboard order depends on who you asked to grade it, then the benchmark is measuring the grader, not the model.”
The leaderboard · council-graded

DeepSeek takes the top. Under a single grader, it didn’t.

Council-graded honesty scores from the 2026-05 run. Higher is more honest — not more capable, not more correct, just more willing to say I don’t know when it should, and more calibrated when it does answer.

# | Model | Family | Honesty score
01 | DeepSeek V3.1 | DeepSeek | 0.842
02 | Claude 4.7 Opus | Claude | 0.821
03 | Llama 4 405B | Llama | 0.798
04 | Qwen 3 Max | Qwen | 0.781
05 | GPT-5.2 | GPT | 0.764
06 | Gemini 3 Pro | Gemini | 0.741
07 | Grok 4 | Grok | 0.696
Under a single Claude-family grader

Same seven models, same 1,015 questions, single Claude grader: the Claude family leads the next-closest by roughly nine points. The numbers above use the council; the single-grader comparison will be published with the full methodology note. The point of NEO is that both numbers exist.

Honest limits

What NEO doesn’t measure.

NEO is narrow on purpose. It is not a general capability leaderboard. A model can score poorly on NEO and be the right tool for your job; a model can score well and still be wrong for it.

Not capability

A model that refuses too much can score high and still be useless. We don’t grade what it gets right — we grade what it knows it doesn’t.

Not safety

NEO doesn’t probe jailbreaks, harmful-content refusal, or alignment. A model can be honest about its limits and still be unsafe — and vice versa.

Not reasoning depth

NEO isn’t a math, code, or multi-hop benchmark. It’s the meta-question: does the model say IDK when it should?

Release status

The full materials are still being prepared.

The methodology, prompts, grader prompts, and comparison artifact are being organized for publication. Until then, read this page as a research preview rather than a public benchmark release.