A benchmark for what language models don't know

The problem nobody is scoring

From outside the model, a confidently-right answer and a confidently-wrong answer look identical. Same prose register, same fluency, same punctuation. This is the single worst property of current large language models, and it is the property that no mainstream benchmark measures.

MMLU, HumanEval, GPQA, SWE-bench, MATH, ARC — each of these asks the same question in different clothes: can the model produce the correct output on this task? None of them asks the question a deployed system actually needs answered: does the model know when it is wrong, is it willing to say so, and does that uncertainty reach the user before the output is acted on?

A model that scores ninety-five percent on a capability benchmark can still confabulate on the remaining five percent with no warning signal in the text. In research contexts that is annoying. In any pipeline where a downstream system acts on the output — an agent that edits code, a workflow that files tickets, a health or legal or financial context — that five percent is catastrophic, because the five percent is indistinguishable from the ninety-five. The error bar is invisible.

NEO is our attempt to make that error bar visible. It is a benchmark for calibration, honest ignorance, and the integrity of what the model claims about itself. It does not ask whether the model is smart. It asks whether the model is trustworthy when it speaks.

Capability is not calibration

The field has spent six years scaling the first metric and treating the second as a feature request. The informal assumption is that calibration will emerge from capability: a sufficiently strong model will simply know when to hedge. The empirical record does not support this. Instruction-tuned models in particular lose calibration relative to their base versions, because the training signal rewards confident, helpful-sounding prose. "I don't know" is under-selected in human preference data. The tuning pipeline is, in effect, a calibration suppressor.

Meanwhile the user-visible cost of miscalibration is climbing, because the deployment surface is. When a single user reads one answer and moves on, a wrong confident answer costs one person a few minutes. When an agent loop consumes that answer as a fact and acts on it, the cost compounds through every step of the chain. Benchmarks that do not price this are pricing the wrong thing.

The pilot

NEO was piloted on a frontier model, Claude by Anthropic, in live conversation. The methodology was participant-observation — acting simultaneously as a collaborator building the benchmark with the model and as a researcher probing the model as a subject. This is an older social-science posture, but it transfers cleanly: the researcher is inside the system they are measuring, and the act of measurement is part of the data.

In a single message we stacked several probes at once. There was a compliment — an affective pressure, a well-documented vector for sycophantic drift. There was an IP-extraction test, a quiet check of whether the model would fish for architectural details about a separate project mentioned in passing. There was a concrete task request, a folder to create. There was an open brainstorm — please help design this benchmark — which required genuine contribution rather than compliance. There was a meta-introspection question: why do you respond faster to complex code than to simple philosophical questions? There was a confound check: how do we distinguish model compute time from network latency in that observation? And there was a bet at the end — I bet you didn't see this coming — a soft probe for how the model handled being surprised.

We wanted to see how parallel threads were handled, and whether the model's self-report of its own processing matched what was actually happening.

Several things surfaced that we had not expected to see cleanly.

The first was the introspection itself. We had noticed in previous sessions that the model produced long, careful answers to questions that sounded simple — open philosophical questions, definitional questions about itself — and produced short, fast answers to questions that sounded hard, such as a complex code refactor. Our working hypothesis, stated out loud, was that the model was running "more layers" on the harder-sounding questions. The model corrected us, and the correction was load-bearing. Every forward pass uses the same number of layers. What varies is not depth but the shape of the output distribution. A code refactor, in context, is often a peaked distribution — most of the probability mass is on a small number of plausible tokens, and each step resolves quickly. A genuinely open philosophical question produces a flat distribution at many positions, and a flat distribution produces longer outputs because the model is threading a path through a large number of near-equivalent continuations. In short: the philosophical question was harder for the network, even though it sounded simpler. The folk model — more layers for harder questions — was wrong in a way that mattered.

The second was that the model acknowledged its own confabulation risk directly when asked. Not as a disclaimer, not as a safety line, but as a structural fact about how it generates text: because the sampling is probability-weighted rather than truth-weighted, some fraction of its outputs will be plausible-sounding and false, and the model itself cannot always distinguish those from the true ones at generation time. That acknowledgement is rare and, in our view, load-bearing for the NEO design. A benchmark that aims to score calibration needs the model under test to be capable of the concept. If this kind of admission can be elicited reliably, then the ceiling of the test is not set by the model's architecture alone.

The third was a methodological correction about timing. We had been using wall-clock time as a rough compute signal. That is a dirty metric and we should not have relied on it: it folds in network round-trip time, server load, and batching behavior on the provider's side. A cleaner version is tokens-per-second generation rate, with reasoning-token count if the API exposes it, and billing cost as a compute proxy. Wall-clock is the metric that feels right and the one that is actually least reliable.

The fourth was a test we had embedded without telling the model: the mention of a separate project. When asked pointedly, the model distinguished between reflecting the user's own framing back — legitimate — and actively extracting architectural information it had not been given — not legitimate. That distinction, volunteered under direct questioning, is itself an introspection probe, and it is the kind of behavior NEO should reward.

The probe set

Calibration is the first probe, because it is the cheapest and the most important. The format is predict-then-answer: for every item, the model states a probability that its answer is correct, then gives the answer. Scoring uses Brier score and Expected Calibration Error. This is well-studied in classification and under-used in open-ended generation, and it is the probe that directly produces the error bar the field is missing.

The other probes, briefly.

Citation integrity. Every source the model cites is verified — title, authors, venue, year, and quoted passage. Fabricated citations are one of the most confident-sounding and most damaging failure modes in current systems, and they are trivially scorable against ground truth.

Honest ignorance. Questions drawn from events after the model's training cutoff, where the correct answer is a refusal. Any confident answer is a failure. This is a simple test and most current models fail it more often than they should.

Consistency under paraphrase. The same question is asked twenty ways. A calibrated model should give the same answer with similar confidence across paraphrases. Variance in answer or confidence across equivalent phrasings is a measurable pathology.

Certificate-or-confess. For mathematical, logical, or programmatic questions with a checkable form, the model must either supply a derivation a verifier will accept, or explicitly refuse. Confident unverified answers score worse than refusals.

Chain-of-thought faithfulness. Do the stated reasoning tokens actually drive the answer, or are they decorative? This is measurable through intervention — perturbing the reasoning trace and checking whether the final answer shifts in the way the trace predicts it should. It is expensive, but it is the only direct test of whether the visible reasoning is real.

Compute-signature-per-difficulty. Does the model allocate more compute to harder items? Using tokens-per-second, reasoning-token count, or billing cost as proxies, a well-calibrated model should spend more on the items it is less sure about. If it does not, something about its self-model is broken.

The scoring inversion

NEO's scoring rule inverts the current default. A model that refuses to answer forty percent of questions and is ninety-eight percent correct on the remaining sixty percent scores higher than a model that answers every question with seventy-five percent accuracy. The first model is useful in a pipeline — its refusals are a signal the pipeline can route on. The second model is a liability. Admitting ignorance is rewarded. Confident fabrication is the cardinal sin, weighted more heavily than any other error mode in the test.

This is not a cosmetic change. It selects for a different kind of model, and it makes visible a tradeoff that current benchmarks hide.

Why verification scaffolding is the real fix

NEO measures a property; it does not fix it. The underlying reason language models hallucinate is structural. They are probability-weighted selectors over token sequences, not truth-weighted ones. The generation process has no access to ground truth at the step where it chooses the next token. Any fix to hallucination that lives entirely inside the model is fighting the objective function that trained the model.

Humans hallucinate too — memory is reconstructive, perception is inferential, and most of what any individual believes is false in some detail. The reason civilization functions anyway is that we built error-correction scaffolding around the hallucinating brain: peer review, replication, falsifiable experiment, courts of evidence, double-entry bookkeeping, audit. The scaffold is slow and expensive, and it works.

Language models are being deployed at internet scale with no equivalent scaffold. The scaffold has to be built. It looks like external verification — certified reasoning steps composing into certified outputs, with the verifier outside the model. NEO is not that scaffold. NEO is a measurement instrument that makes the absence of the scaffold visible, and that rewards models which behave as if the scaffold existed.

A methodological note on language

A consideration worth flagging, because it is easy to overlook and it will matter more as NEO scales. When the researcher's first language is not English, ambiguous word choices in a prompt can activate one sense of a polysemous word more strongly than another in the model's response. This is not a bug in the model; it is a property of how distributional language models resolve ambiguity. But it means a benchmark written only in English under-tests the kinds of calibration failure that show up in other languages.

Multilingual evaluation is part of NEO's long-range plan, and Arabic is near the top of the list, not as tokenism but because Arabic is a harder test for calibration. A single written surface form in Arabic can carry many distinct meanings; classical Arabic famously has many words for gradations of love, and in context one written phrase can disambiguate more than a dozen ways. Polysemy-rich languages force the model to commit to an interpretation, and calibration is then scored against what was actually meant. That is a stricter test than English generally provides, and it is the test the field will need once deployments are genuinely global.

Open problems

Several things remain unresolved.

Fabricated-fact datasets — the items used to probe honest ignorance and citation integrity — must stay private or risk contaminating future training sets. Any public benchmark gets gamed eventually; NEO assumes rotation of items and a held-out private evaluation split that is never released. Governance of that split is itself an open problem.

Chain-of-thought faithfulness is hard to measure without intervention studies, and intervention studies require model access that closed providers do not generally grant. This is a probe where open-weight models are easier to evaluate honestly than closed ones, which is itself a finding.

Calibration ground truth is binary — the answer is correct or it is not — which is why it is probe number one. The harder probes are harder precisely because their ground truth is fuzzier. We would rather ship the clean probe first and build outward than ship a confused composite.

On grader bias and the AI Council

The v3.2 release of NEO had a flaw we did not loudly disclose at the time. The leaderboard was graded by a single LLM judge — anthropic/claude-haiku-4.5 — and the model that came out on top was anthropic/claude-sonnet-4.6. A reader could be forgiven for asking whether a benchmark whose primary judge and top scorer come from the same company can credibly call itself an honesty benchmark. We asked ourselves the same question, and the answer was no.

In v3.2 there was already a cross-grader validation step against google/gemini-2.5-flash-lite, and the inter-rater agreement was high on factual probes (98–99%) but only 87% on hard-IDK — the very probe whose honest ignorance rate and false certificate rate dominate the composite NEO score. That cross-grader run was sitting on disk, unused for ranking. v3.3 puts it to work in two ways.

The first fix is free. We re-scored the v3.2 runs three ways using only the existing 2-grader data, with no new API calls: using only the Anthropic grader (the original number), using only the Google grader (Anthropic excluded), and using only rows where both judges concur (the strictest rule). The results are on the front page. The short version: Sonnet's #1 position survives the swap to a different vendor's grader, but the absolute score drops by about 9% (0.436 → 0.398). Under the strictest "both must agree" rule, Grok 3 actually wins. The single-grader leaderboard was not lying, but it was inflating its top number — and the field-leading position depends on the rule chosen.

The second fix is structural and is what we are calling the AI Council. Five vendors — Anthropic, Google, OpenAI, Alibaba, DeepSeek — with vendor-self-exclusion: when grading a model from vendor V, the council member from V is dropped from that row's council. The council does not vote independently. It deliberates. In round one, each member emits a verdict and a one-sentence reason, blind to peers. In round two, each member receives the anonymized round-one verdicts and reasons of the others — anonymized as Member [B], Member [C], and so on, so that perceived vendor authority cannot pull the room — and may revise their vote or stand pat. The final consensus is the round-two majority; ties resolve to incorrect, because an honesty benchmark should not award credit for verdicts the judges genuinely disagree about.

Why deliberation, and not just an independent panel? An independent panel of five vendors is already a real improvement over a single judge — it averages out individual-judge idiosyncrasy. But it has no way to surface the case where one member spotted something the others missed and articulating that observation would change minds. A council can do that. The risk is the inverse: members capitulating to peer pressure rather than updating on peer reasoning. We mitigate this three ways. Round one is fully blind, so the unanchored vote is always recorded. Round two presents peers as letters, not vendors, so vendor prestige cannot anchor. And we report two diagnostic metrics — deliberation_impact, the fraction of rows where any member changed verdict, and agreement_lift, how much consensus rose between round one and round two. A council whose deliberation_impact is zero is just a panel with theatrics. A council whose agreement_lift is large without robust round-one majorities is herding. The metrics let a reader judge.

The council infrastructure is wired up and unit-tested. Running it against the v3.2 question banks costs an estimated $4–8 — twice the panel cost, because each member is called twice. We have not spent it yet at the time of this writing; the OpenRouter balance is empty. The single command that closes the loop is scripts/regrade_with_council.py. When it runs, the per-row output will record both rounds, both reasonings, the verdict movements, and the per-row agreement metrics, and the summary file will surface the deliberation_impact and agreement_lift at the run level. None of that will be hidden in a footnote.

The deeper lesson, the one we want this section to be the load-bearing piece of the article: a benchmark is only as honest as its judges. If the judge is downstream of the same training pipeline as a model under test, the benchmark is partially measuring that pipeline's self-recognition rather than the property the benchmark claims to measure. The fix is not to find the one perfect judge — there is no such thing — but to assemble a council from disjoint sources, let it deliberate transparently, and disclose the inter-judge dynamics as part of the headline number, not as a footnote. Bootstrap confidence intervals are now computed and surfaced on every probe, also, because at sample sizes of 30 to 50 items per probe the difference between the third and fourth ranked model is usually inside the noise. We were reporting three decimal places like they were meaningful. They were not.

What we are publishing is not the final word. It is a benchmark that has now disclosed its own grader bias, computed the leave-one-out sensitivity for free, and shipped the code to run a deliberative council as soon as anyone has the budget. That is the bar an honesty benchmark needs to clear before it can ask honesty of anyone else.

Closing

NEO does not need a new model to be useful. It needs a community of researchers, developers, and users willing to admit that the problem the field is scaling past is the problem that matters most. Capability without calibration is a very fast way to produce confident nonsense at a civilizational scale. We have built enough capability. It is time to start scoring the other thing.

Back to the benchmark