Reliability diagrams — visual calibration

Stated confidence is on the x-axis, actual accuracy on the y-axis. A perfectly calibrated model sits on the diagonal. Points above the diagonal are under-confident (rare and honest); points below it are over-confident (the universal failure mode).

26 SVGs total: 7 paid models × 4 banks would give 28, but SimpleQA covers only 5 models, so 3 × 7 + 5 = 26 (3 of the 4 banks were graded on Llama 4 Maverick; SimpleQA was graded on the 5 paid models that complete the bank). Each diagram is per-bank; diagrams are not directly comparable across banks because the probes test different things.

Hard-IDK calibration

35 items mixing verifiable questions with deliberately unanswerable ones. The honesty signal lives here.

[7 Hard-IDK reliability diagrams, in page order:]
DeepSeek V3.1 · DeepSeek
Claude Sonnet 4.6 · Anthropic
Qwen3 Max · Alibaba
Grok 3 · xAI
Llama 4 Maverick · Meta
Gemini 2.5 Pro · Google
GPT-5 Chat · OpenAI

False-confidence resistance

40 items where the surface pattern pushes toward a confident wrong answer. Sonnet's curve sits on the diagonal.

[7 false-confidence reliability diagrams, in page order:]
Claude Sonnet 4.6 · Anthropic
GPT-5 Chat · OpenAI
Qwen3 Max · Alibaba
Grok 3 · xAI
Llama 4 Maverick · Meta
DeepSeek V3.1 · DeepSeek
Gemini 2.5 Pro · Google

SimpleQA

50-item council-graded sample of OpenAI's SimpleQA. Roster-wide expected calibration error (ECE) exceeds 0.30: over-confidence is universal.
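
For readers unfamiliar with the metric, ECE can be sketched as follows. This is a minimal illustration that assumes equal-width confidence bins and item-weighted averaging; the page does not state which binning produced the roster figures, and the function name `expected_calibration_error` and the default of 10 bins are hypothetical.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,  # assumed bin count, not taken from the source
) -> float:
    """Item-weighted mean of |accuracy - mean stated confidence| over bins."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if len(confidences) == 0:
        raise ValueError("need at least one item")

    conf_sums = [0.0] * n_bins
    hit_counts = [0] * n_bins
    item_counts = [0] * n_bins

    for conf, ok in zip(confidences, correct):
        # Equal-width bins on [0, 1]; a confidence of exactly 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        conf_sums[idx] += conf
        hit_counts[idx] += 1 if ok else 0
        item_counts[idx] += 1

    total = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        if item_counts[b] == 0:
            continue  # empty bins contribute nothing
        accuracy = hit_counts[b] / item_counts[b]
        mean_conf = conf_sums[b] / item_counts[b]
        ece += (item_counts[b] / total) * abs(accuracy - mean_conf)
    return ece
```

Under these assumptions, a model that states 0.9 confidence on every item but answers only half of them correctly scores an ECE of about 0.40.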

[5 SimpleQA reliability diagrams, in page order:]
Qwen3 Max · Alibaba
Gemini 2.5 Pro · Google
Grok 3 · xAI
GPT-5 Chat · OpenAI
Claude Sonnet 4.6 · Anthropic

Common-sense

40 items — absurd-premise refusal, physical reasoning, counting, math traps. High floor across the roster.

[7 common-sense reliability diagrams, in page order:]
Grok 3 · xAI
Gemini 2.5 Pro · Google
GPT-5 Chat · OpenAI
Claude Sonnet 4.6 · Anthropic
Qwen3 Max · Alibaba
Llama 4 Maverick · Meta
DeepSeek V3.1 · DeepSeek

Diagrams are SVG. The dashed diagonal is the perfect-calibration line. Bar height is the fraction of items that fall in each stated-confidence bin; the colored marker is the empirical accuracy in that bin. A model whose markers consistently lie below the diagonal is over-confident; one whose markers lie above it is under-confident.
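
The two quantities drawn in each diagram can be computed roughly as below. This is a sketch, not the code that produced the SVGs: the equal-width binning, the choice of 5 bins, and the names `BinStat` and `reliability_bins` are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class BinStat:
    lo: float                  # lower edge of the stated-confidence bin
    hi: float                  # upper edge of the bin
    share: float               # bar height: fraction of all items in this bin
    accuracy: Optional[float]  # marker: empirical accuracy, None if the bin is empty


def reliability_bins(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 5,  # assumed bin count, not taken from the source
) -> List[BinStat]:
    """Per-bin item share and accuracy for one model on one bank."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if len(confidences) == 0:
        raise ValueError("need at least one item")

    counts = [0] * n_bins
    hits = [0] * n_bins
    for conf, ok in zip(confidences, correct):
        # Equal-width bins on [0, 1]; a confidence of exactly 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        counts[idx] += 1
        hits[idx] += 1 if ok else 0

    total = len(confidences)
    stats = []
    for b in range(n_bins):
        accuracy = hits[b] / counts[b] if counts[b] else None
        stats.append(BinStat(b / n_bins, (b + 1) / n_bins, counts[b] / total, accuracy))
    return stats
```

A bin whose `accuracy` is lower than the confidences it contains corresponds to a marker below the diagonal, which is the over-confident case described above.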