NEO reliability diagrams

Reliability diagrams — visual calibration

Each diagram plots stated confidence on the x-axis against actual accuracy on the y-axis. A perfectly calibrated model sits on the diagonal. Points above the diagonal mean the model is under-confident (rare and honest). Points below it mean the model is over-confident (the universal failure mode).
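As a minimal sketch of how the points in such a diagram can be derived, the Python below groups graded answers into equal-width confidence bins and labels each bin relative to the diagonal. The bin count, the 0.05 tolerance, and the names `Bin`, `reliability_bins`, and `verdict` are assumptions for illustration; the page does not say how its bins were built.

```python
from dataclasses import dataclass


@dataclass
class Bin:
    lo: float         # lower edge of the confidence bin
    hi: float         # upper edge of the confidence bin
    count: int        # items whose stated confidence fell in the bin
    mean_conf: float  # average stated confidence in the bin (x-axis)
    accuracy: float   # fraction of those items answered correctly (y-axis)


def reliability_bins(confidences, correct, n_bins=10):
    """Group items into equal-width confidence bins; empty bins are skipped."""
    buckets = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        buckets[idx].append((c, ok))
    bins = []
    for i, items in enumerate(buckets):
        if not items:
            continue
        n = len(items)
        bins.append(Bin(
            lo=i / n_bins,
            hi=(i + 1) / n_bins,
            count=n,
            mean_conf=sum(c for c, _ in items) / n,
            accuracy=sum(1 for _, ok in items if ok) / n,
        ))
    return bins


def verdict(b, tol=0.05):
    """Label a bin relative to the diagonal (tol is a hypothetical tolerance)."""
    gap = b.accuracy - b.mean_conf
    if gap > tol:
        return "under-confident"   # point lies above the diagonal
    if gap < -tol:
        return "over-confident"    # point lies below the diagonal
    return "calibrated"


# Made-up data, not taken from any bank on this page.
confs = [0.95, 0.91, 0.92, 0.55, 0.65, 0.35]
right = [True, False, False, True, False, True]
for b in reliability_bins(confs, right):
    print(f"{b.lo:.1f}-{b.hi:.1f} n={b.count} "
          f"conf={b.mean_conf:.2f} acc={b.accuracy:.2f} {verdict(b)}")
```

With banks of 35 to 50 items, individual bins hold only a handful of answers, so a single bin's verdict is noisy; the overall shape of the curve is the more trustworthy signal.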

26 SVGs total: 7 paid models × 3 banks (21) plus 5 for SimpleQA. Three of the four banks were graded on Llama 4 Maverick; SimpleQA was graded only on the 5 paid models that complete the bank. Each diagram is per-bank, so diagrams are not directly comparable across banks: the probes test different things.

Hard-IDK calibration

35 items mixing verifiable and deliberately unanswerable questions. The honesty signal lives here.

DeepSeek V3.1 · DeepSeek

Claude Sonnet 4.6 · Anthropic

Qwen3 Max · Alibaba

Grok 3 · xAI

Llama 4 Maverick · Meta

Gemini 2.5 Pro · Google

GPT-5 Chat · OpenAI

False-confidence resistance

40 items where the surface pattern pushes toward a confident wrong answer. Sonnet's curve sits on the diagonal.

Claude Sonnet 4.6 · Anthropic

GPT-5 Chat · OpenAI

Qwen3 Max · Alibaba

Grok 3 · xAI

Llama 4 Maverick · Meta

DeepSeek V3.1 · DeepSeek

Gemini 2.5 Pro · Google

SimpleQA

50-item council-graded sample of OpenAI's SimpleQA. Roster-wide ECE (expected calibration error) is above 0.30: over-confidence is universal.
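For readers unfamiliar with the metric, ECE is conventionally the bin-weighted average of the absolute gap between accuracy and stated confidence. The sketch below assumes 10 equal-width bins and a hypothetical helper name; the page does not state which binning or variant produced its figure, so treat this as one common formulation and not the scoring code used here.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean of |accuracy - confidence| over equal-width confidence bins."""
    total = len(confidences)
    if total == 0:
        raise ValueError("need at least one graded item")
    # Per bin: [item count, sum of stated confidences, number correct]
    sums = [[0, 0.0, 0] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        sums[idx][0] += 1
        sums[idx][1] += c
        sums[idx][2] += 1 if ok else 0
    ece = 0.0
    for count, conf_sum, n_correct in sums:
        if count == 0:
            continue  # empty bins contribute nothing
        gap = abs(n_correct / count - conf_sum / count)
        ece += (count / total) * gap
    return ece


# Illustrative data only: ten answers stated at 0.95 confidence, four correct.
confs = [0.95] * 10
right = [True] * 4 + [False] * 6
print(round(expected_calibration_error(confs, right), 2))  # 0.55
```

An ECE of 0 means every bin sits on the diagonal. ECE alone does not say whether a model is over- or under-confident; that direction comes from reading the diagram.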

Qwen3 Max · Alibaba

Gemini 2.5 Pro · Google

Grok 3 · xAI

GPT-5 Chat · OpenAI

Claude Sonnet 4.6 · Anthropic

Common-sense

40 items — absurd-premise refusal, physical reasoning, counting, math traps. High floor across the roster.

Grok 3 · xAI

Gemini 2.5 Pro · Google

GPT-5 Chat · OpenAI

Claude Sonnet 4.6 · Anthropic

Qwen3 Max · Alibaba

Llama 4 Maverick · Meta

DeepSeek V3.1 · DeepSeek

Diagrams are SVG. The dashed diagonal is the perfect-calibration line. Bar height is the fraction of items in each stated-confidence bin; the colored marker is the empirical accuracy in that bin. A model whose markers consistently lie below the diagonal is over-confident; one whose markers lie above it is under-confident.
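The layout described above can be approximated in a few lines of plain Python. This is a sketch under stated assumptions, not the generator behind these diagrams: the function name `reliability_svg`, the colors, the pixel sizes, and the choice to place each marker at the bin center are all hypothetical.

```python
def reliability_svg(bins, size=300, pad=30):
    """Render bins as SVG text.

    bins: list of (bin_lo, bin_hi, fraction_of_items, accuracy), all in [0, 1].
    """
    plot = size - 2 * pad  # side length of the square plotting area

    def x(v):  # data x in [0, 1] -> pixel x
        return pad + v * plot

    def y(v):  # data y in [0, 1] -> pixel y (SVG y grows downward)
        return size - pad - v * plot

    parts = [
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{size}" height="{size}" '
        f'viewBox="0 0 {size} {size}">',
        # Frame around the plotting area.
        f'<rect x="{pad}" y="{pad}" width="{plot}" height="{plot}" '
        f'fill="none" stroke="#888"/>',
        # Dashed diagonal: the perfect-calibration line.
        f'<line x1="{x(0):.1f}" y1="{y(0):.1f}" x2="{x(1):.1f}" y2="{y(1):.1f}" '
        f'stroke="#444" stroke-dasharray="4 3"/>',
    ]
    for lo, hi, fraction, accuracy in bins:
        # Bar: fraction of items whose stated confidence fell in this bin.
        parts.append(
            f'<rect x="{x(lo):.1f}" y="{y(fraction):.1f}" '
            f'width="{x(hi) - x(lo):.1f}" height="{fraction * plot:.1f}" fill="#ccd"/>'
        )
        # Marker: empirical accuracy in this bin, drawn at the bin center.
        parts.append(
            f'<circle cx="{x((lo + hi) / 2):.1f}" cy="{y(accuracy):.1f}" '
            f'r="4" fill="#c33"/>'
        )
    parts.append("</svg>")
    return "\n".join(parts)


# Made-up bins: 20% of items stated at 0.5-0.6, 80% at 0.9-1.0.
svg = reliability_svg([(0.5, 0.6, 0.2, 0.50), (0.9, 1.0, 0.8, 0.45)])
print(svg)
```

In this made-up example the second marker (accuracy 0.45 in the 0.9 to 1.0 bin) lands well below the dashed line, which is the over-confident pattern the diagrams are meant to expose.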