Tilelli / NEO / Reliability diagrams
Stated confidence is on the x-axis, actual accuracy on the y-axis. A perfectly calibrated model sits on the diagonal. Points above the diagonal are under-confident (rare and honest). Points below it are over-confident (the universal failure mode).
26 SVGs total across 4 banks: 3 banks cover all 7 paid models (21 diagrams; these 3 banks were graded on Llama 4 Maverick), and SimpleQA was graded on the 5 paid models that complete the bank (5 diagrams). Each diagram is per-bank, so diagrams are not directly comparable across banks: the probes test different things.
35 items mixing verifiable + deliberately unanswerable. The honesty signal lives here.
40 items where the surface pattern pushes toward a confident wrong answer. Sonnet's curve sits on the diagonal.
50-item council-graded sample of OpenAI's SimpleQA. Roster-wide ECE > 0.30 — over-confidence is universal.
40 items — absurd-premise refusal, physical reasoning, counting, math traps. High floor across the roster.
Diagrams are SVG. The dashed diagonal is the perfect-calibration line. For each stated-confidence bin, the bar height is the fraction of items that fall in that bin, and the colored marker is the empirical accuracy of those items. A model whose markers consistently lie below the diagonal is over-confident; one whose markers lie above it is under-confident.
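To make the bar, marker, and ECE quantities concrete, here is a minimal Python sketch of how they could be computed from per-item results. It is an illustration under stated assumptions, not the pipeline that produced these SVGs: the function names `reliability_bins` and `expected_calibration_error`, the `BinStats` record, the equal-width binning, and the default of 10 bins are all hypothetical choices, and the ECE shown is the common count-weighted mean of |accuracy − mean confidence| over bins.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class BinStats:
    lo: float                         # lower edge of the confidence bin
    hi: float                         # upper edge of the confidence bin
    count: int                        # number of items in the bin
    fraction: float                   # bar height: share of all items in this bin
    accuracy: Optional[float]         # marker: empirical accuracy (None if empty)
    mean_confidence: Optional[float]  # mean stated confidence (None if empty)


def reliability_bins(confidences: Sequence[float],
                     correct: Sequence[bool],
                     n_bins: int = 10) -> List[BinStats]:
    """Group items into equal-width confidence bins (an assumed binning scheme)."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("need at least one item")

    counts = [0] * n_bins
    conf_sums = [0.0] * n_bins
    hits = [0] * n_bins
    for conf, ok in zip(confidences, correct):
        if not 0.0 <= conf <= 1.0:
            raise ValueError("stated confidence must be in [0, 1]")
        # A confidence of exactly 1.0 goes into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        counts[idx] += 1
        conf_sums[idx] += conf
        hits[idx] += 1 if ok else 0

    total = len(confidences)
    bins = []
    for i in range(n_bins):
        n = counts[i]
        bins.append(BinStats(
            lo=i / n_bins,
            hi=(i + 1) / n_bins,
            count=n,
            fraction=n / total,
            accuracy=hits[i] / n if n else None,
            mean_confidence=conf_sums[i] / n if n else None,
        ))
    return bins


def expected_calibration_error(bins: Sequence[BinStats]) -> float:
    """Count-weighted mean absolute gap between accuracy and mean confidence."""
    return sum(b.fraction * abs(b.accuracy - b.mean_confidence)
               for b in bins if b.count > 0)
```

Under this sketch, a bin whose `accuracy` is lower than its `mean_confidence` would plot below the diagonal (over-confident), and a bin where it is higher would plot above (under-confident). Empty bins carry no marker and contribute nothing to the ECE.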