Five attempts, $0.78 of GPU, hypothesis disproven.

The full v5 → v8b sweep — what we tried, where each broke, and the clean mechanism we finally identified.

24 May 2026 · Postmortem · ~8 min read

Companion piece to "When the small model learns to say I don't know." That post explained the conclusion. This post is the receipts: every attempt, every cost, every dead-end, in order.

v5 — raw text in-domain (CPU, $0)

First attempt: define "in-domain" as raw FineWeb-Edu text. Train the router (6,144 parameters) to output low entropy on in-domain, high entropy on synthetic gibberish. 500 steps. Training loss dropped cleanly. Chat generation: destroyed. The chat ckpt had been SFT'd on USER:/TILELLI: framing. Telling the router that raw text is "in-domain" made it treat real chat inputs as OOD. Outputs became things like 1010101010 and 'LELLLLLLL.
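For readers who want the objective spelled out, here is a minimal sketch of an entropy target of this kind. It assumes the router emits logits over a handful of routes and that entropy is normalized by log(n) so it falls in [0, 1]. The function names and the exact loss form are illustrative assumptions, not the training code used in these runs.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def normalized_entropy(logits):
    """Shannon entropy of softmax(logits), divided by log(n) to land in [0, 1]."""
    probs = softmax(logits)
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(probs))

def entropy_target_loss(logits, in_domain):
    """Hypothetical v5-style objective: low entropy on in-domain inputs,
    high entropy on out-of-domain inputs."""
    h = normalized_entropy(logits)
    return h if in_domain else 1.0 - h

# A peaked routing distribution is cheap when labeled in-domain and
# expensive when labeled OOD; a flat one is the reverse.
peaked = [10.0, 0.0, 0.0, 0.0]
flat = [0.0, 0.0, 0.0, 0.0]
peaked_in = entropy_target_loss(peaked, in_domain=True)
flat_in = entropy_target_loss(flat, in_domain=True)
```

The point of the sketch is that the loss only sees whatever is labeled "in-domain". If that label is attached to raw text, chat-framed inputs end up on the high-entropy side.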

Lesson: if you finetune the router, the in-domain distribution has to match what the rest of the network expects to see at inference.

v6 — chat-format in-domain (CPU, $0)

Fix: define "in-domain" using the same USER:/TILELLI: framing as deployment. 5K chat-formatted Q+A as in-domain, 5K chat-formatted with gibberish or mismatched-topic answers as OOD. Chat generation survived. The router learned something:

| Regime | router_conf | H |
| --- | --- | --- |
| in-domain chat | 0.62 | 0.42 |
| gibberish (no chat framing) | 0.37 | 0.69 |
| factually misleading | 0.61 | 0.43 |
| ood topic (well-formed question) | 0.60 | 0.44 |

v6: the router learned a syntax-level "is this chat-formatted?" detector. It cannot reach inside the residual stream to detect semantic OOD.

Lesson: at this scale, the router can detect surface form but not meaning.

v7 — joint router + abstain SFT (RTX 4090, $0.43)

Move to GPU. Train the router AND the abstain head jointly, against a semantic-OOD dataset (fictional entity names, made-up jargon, real niche topics — all in chat format). 88.5 seconds of train time. Cross-regime AUROC on abstain_p jumped from 0.50 (chance) to 0.76. Gibberish abstention reached 93%. But chat generation became garbled — "cat is a named in the", "interney on, notight". The router was no longer in its trained optimum.
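Since AUROC carries most of the argument from here on, a minimal reference implementation may help. It computes the probability that a randomly chosen OOD example gets a higher abstain_p than a randomly chosen in-domain example, with ties counted as half. The pairwise loop is O(n²) and meant for clarity, not speed. The example scores are made up for illustration and are not data from these runs.

```python
def auroc(scores, labels):
    """Pairwise AUROC: P(score of a positive > score of a negative).

    labels: 1 for OOD (should abstain), 0 for in-domain. Ties count as 0.5.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Illustrative values only: an abstain head that outputs the same score
# for everything sits at chance, regardless of the threshold chosen.
constant_head = auroc([0.5, 0.5, 0.5, 0.5], [1, 1, 0, 0])
perfect_head = auroc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
```

Because the metric is rank-based, it says nothing about whether generation still works. That is why AUROC and output quality can move in opposite directions in the runs below.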

Lesson: training the router shifts the routing distribution that the rest of the network depends on.

v8a — lighter MC weight (A40, $0.18)

Hypothesis: maybe the metacog loss weight was too aggressive. Cut it from 20 to 5. AUROC went up (0.76 to 0.80), but generation was still broken. So MC weight wasn't the lever.

v8b — zero MC weight (A40, $0.17)

Counter-intuitive: set the MC weight to zero. Pure abstain-head BCE only. AUROC reached 0.85 — the best of the whole project. Gibberish abstention 100%. Generation: still broken.

Lower metacog weight gave a stronger signal. That broke our mental model.
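To make the knob concrete, here is a sketch of how a weighted objective of this shape could be assembled. The post does not give the exact composition, so the three-term sum, the names lm_ce and metacog, and the per-example form are all assumptions for illustration.

```python
import math

def bce(p, target, eps=1e-7):
    """Binary cross-entropy for a single probability p against target 0 or 1."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))

def total_loss(lm_ce, metacog, abstain_p, is_ood, mc_weight):
    """Hypothetical composition: language-model CE, plus a weighted
    metacognition term, plus abstain-head BCE."""
    return lm_ce + mc_weight * metacog + bce(abstain_p, 1.0 if is_ood else 0.0)

# Same per-example terms under the three weights tried across v7, v8a, v8b.
terms = dict(lm_ce=2.0, metacog=0.5, abstain_p=0.5, is_ood=True)
loss_by_weight = {w: total_loss(mc_weight=w, **terms) for w in (20, 5, 0)}
```

Under this assumed form, setting the weight to zero removes only the middle term. Every other term still produces gradient for whatever parameters are left unfrozen.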

The splice control (CPU, $0)

If the signal lived in the abstain head, we should be able to graft it. We loaded v4 (the deployed, well-generating ckpt) and replaced only the abstain-head weights with v7's. AUROC: 0.54. Back to chance.
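The graft itself is a few lines of dictionary surgery. The sketch below works on plain dicts standing in for checkpoint state dicts. The key prefix "abstain_head." and the example keys are hypothetical, since the post does not list the real parameter names.

```python
def splice_head(base_state, donor_state, prefix="abstain_head."):
    """Return a copy of base_state with every key under `prefix`
    replaced by the donor's value. Neither input is modified."""
    donor_keys = [k for k in donor_state if k.startswith(prefix)]
    if not donor_keys:
        raise KeyError(f"no keys starting with {prefix!r} in donor state")
    spliced = dict(base_state)
    for key in donor_keys:
        if key not in base_state:
            raise KeyError(f"{key!r} missing from base state")
        spliced[key] = donor_state[key]
    return spliced

# Stand-in checkpoints: strings in place of tensors, hypothetical key names.
v4_state = {"router.linear.weight": "v4-router", "abstain_head.weight": "v4-head"}
v7_state = {"router.linear.weight": "v7-router", "abstain_head.weight": "v7-head"}
spliced_state = splice_head(v4_state, v7_state)
```

With real PyTorch checkpoints the same function should apply to the dicts returned by `state_dict()`, with the result passed to `load_state_dict`. Everything outside the prefix, including the router, stays at its v4 values, which is what makes this a control.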

The abstain head is not a module. Its signal is bound to the routing distribution it was trained against.

The mechanism, finally clean

The pieces fit together. Even with MC=0, the cross-entropy loss on in-domain samples backpropagates through the unfrozen router linears. 500 steps × batch 32 = 16,000 in-domain examples' worth of gradient flowing into the router. That shifts the router. The rest of the network was tuned for the old routing distribution. The result is always the same: the abstain head learns to fire on the new routing pattern, AUROC rises, and generation collapses.

Falsifiable corollary, queued: freeze the router linears as well as everything else, and leave only the abstain head trainable. Prediction: AUROC stays high, generation stays coherent. Cost: ~$0.20 of GPU. That experiment is the one we owe the field next.
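A sketch of the freezing step for that queued run, under stated assumptions: parameters are addressed by name, the abstain head lives under a hypothetical "abstain_head." prefix, and each parameter exposes a `requires_grad` flag the way PyTorch parameters do. Stand-in objects are used so the snippet runs without PyTorch installed.

```python
from types import SimpleNamespace

def freeze_all_but(named_params, trainable_prefix="abstain_head."):
    """Mark only parameters under trainable_prefix as trainable.

    named_params: iterable of (name, param) pairs, e.g. the output of
    model.named_parameters() in PyTorch. Returns the names left trainable.
    """
    trainable = []
    for name, param in named_params:
        param.requires_grad = name.startswith(trainable_prefix)
        if param.requires_grad:
            trainable.append(name)
    return trainable

# Stand-in parameters with hypothetical names.
params = [
    ("router.linear.weight", SimpleNamespace(requires_grad=True)),
    ("blocks.0.mlp.weight", SimpleNamespace(requires_grad=True)),
    ("abstain_head.weight", SimpleNamespace(requires_grad=True)),
    ("abstain_head.bias", SimpleNamespace(requires_grad=True)),
]
trainable = freeze_all_but(params)
```

Checking the returned list before training is a cheap guard against the failure mode described above, where a router layer is left unfrozen by accident and drifts.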

The bill

| Run | Hardware | Wall time | Cost |
| --- | --- | --- | --- |
| v5 | CPU (local) | | $0.00 |
| v6 | CPU (local) | | $0.00 |
| v7 | RTX 4090 SECURE (RunPod) | 88.5 s + 5 min setup | $0.43 |
| v8a | A40 (RunPod) | ~3 min | $0.18 |
| v8b | A40 (RunPod) | ~3 min | $0.17 |
| splice | CPU (local) | | $0.00 |
| Total GPU | | | $0.78 (v7 + v8a + v8b; v5 / v6 / splice were CPU at $0) |

Five independent attempts to falsify or rescue the hypothesis. A pre-registered DISPROVEN is itself a result.