Companion piece to "When the small model learns to say I don't know." That post explained the conclusion. This post is the receipts: every attempt, every cost, every dead-end, in order.
v5 — raw text in-domain (CPU, $0)
First attempt: define "in-domain" as raw FineWeb-Edu text. Train the router (6,144 parameters) to output low
entropy on in-domain, high entropy on synthetic gibberish. 500 steps. Training loss dropped cleanly.
Chat generation: destroyed. The chat ckpt had been SFT'd on USER:/TILELLI:
framing. Telling the router that raw text is "in-domain" made it treat real chat inputs as OOD. Outputs became
things like 1010101010 and 'LELLLLLLL.
Lesson: if you finetune the router, the in-domain distribution has to match what the rest of the network expects to see at inference.
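The v5 objective (low router entropy on in-domain inputs, high entropy on gibberish) can be sketched roughly as below. The post does not give the actual loss, so the function names, the two-way routing in the example, and the `h_max - h` form for OOD samples are all assumptions made for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def router_entropy_loss(logits, in_domain):
    """Hypothetical per-sample loss (not the post's actual code).

    In-domain: penalize entropy, so the router is pushed to commit.
    OOD: penalize the gap to maximum entropy, so the router is pushed to spread out.
    """
    h = entropy(softmax(logits))
    h_max = math.log(len(logits))  # entropy of a uniform routing distribution
    return h if in_domain else h_max - h

# A confident router is cheap on in-domain input and expensive on OOD input.
confident = [10.0, 0.0]
uniform = [0.0, 0.0]
print(router_entropy_loss(confident, in_domain=True))
print(router_entropy_loss(confident, in_domain=False))
print(router_entropy_loss(uniform, in_domain=False))
```

Note that nothing in this loss looks at the chat framing: it rewards whatever routing pattern separates the two sets it is shown, which is consistent with the failure described above.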
v6 — chat-format in-domain (CPU, $0)
Fix: define "in-domain" using the same USER:/TILELLI: framing as deployment. 5K
chat-formatted Q+A as in-domain, 5K chat-formatted with gibberish or mismatched-topic answers as OOD. Chat
generation survived. The router learned something:
| Regime | router_conf | H (router entropy) |
|---|---|---|
| in-domain chat | 0.62 | 0.42 |
| gibberish (no chat framing) | 0.37 | 0.69 |
| factually misleading | 0.61 | 0.43 |
| ood topic (well-formed question) | 0.60 | 0.44 |
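Lesson: at this scale, the router can detect surface form but not meaning. Only unframed gibberish separates from in-domain chat (router_conf 0.37 vs 0.62); factually misleading and off-topic questions land within 0.02 of the in-domain numbers.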
v7 — joint router + abstain SFT (RTX 4090, $0.43)
Move to GPU. Train the router AND the abstain head jointly, against a semantic-OOD dataset (fictional entity
names, made-up jargon, real niche topics — all in chat format). 88.5 seconds of train time. Cross-regime
AUROC on abstain_p jumped from 0.50 (chance) to 0.76. Gibberish
abstention reached 93%. But chat generation became garbled — "cat is a named in the", "interney on,
notight". The router had moved away from its previously trained optimum.
Lesson: training the router shifts the routing distribution that the rest of the network depends on.
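For readers who want the AUROC numbers made concrete: AUROC is the probability that a randomly chosen positive scores higher than a randomly chosen negative, so 0.50 means the score carries no ranking information. A minimal pairwise sketch follows, assuming positives are OOD samples (where abstention is wanted) and negatives are in-domain samples. How the post's "cross-regime" pairs were actually assembled is not specified, and the scores below are made up for illustration.

```python
def auroc(scores_pos, scores_neg):
    """Pairwise AUROC: fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a win. O(n*m), which is fine for small eval sets.
    """
    if not scores_pos or not scores_neg:
        raise ValueError("need at least one positive and one negative score")
    wins = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Illustrative abstain_p values only, not measurements from any run.
ood_abstain_p = [0.9, 0.4, 0.7]
in_domain_abstain_p = [0.5, 0.2]
print(auroc(ood_abstain_p, in_domain_abstain_p))
```

Because the metric only depends on ranking, a head can score well on AUROC while the generator around it is broken, which is the pattern the later runs show.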
v8a — lighter MC weight (A40, $0.18)
Hypothesis: maybe the metacog (MC) loss weight was too aggressive. Cut it from 20 to 5. AUROC went up
(0.76 to 0.80), but generation was still broken. So MC weight wasn't the lever.
v8b — zero MC weight (A40, $0.17)
Counter-intuitive: set the MC weight to zero, leaving the abstain-head BCE as the only auxiliary loss. AUROC reached
0.85, the best of the whole project. Gibberish abstention 100%. Generation:
still broken.
Lower metacog weight gave a stronger signal. That broke our mental model.
The splice control (CPU, $0)
If the signal lived in the abstain head, we should be able to graft it. We loaded v4 (the deployed,
well-generating ckpt) and replaced only the abstain-head weights with v7's. AUROC: 0.54.
Back to chance.
The abstain head is not a module. Its signal is bound to the routing distribution it was trained against.
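The splice itself is a small operation on checkpoint dictionaries. The sketch below uses plain Python dicts as stand-ins for state dicts; the `abstain_head.` key prefix and the parameter names are hypothetical, since the post does not show the checkpoint layout.

```python
def splice_head(base_state, donor_state, prefix="abstain_head."):
    """Return a copy of base_state with only the `prefix` entries taken from donor_state.

    Raises if the donor has a head parameter the base model lacks, so a
    mismatched architecture fails loudly instead of being silently ignored.
    """
    spliced = dict(base_state)  # shallow copy; the base checkpoint is left untouched
    replaced = []
    for name, tensor in donor_state.items():
        if name.startswith(prefix):
            if name not in base_state:
                raise KeyError(f"donor parameter {name!r} not present in base checkpoint")
            spliced[name] = tensor
            replaced.append(name)
    return spliced, replaced

# Toy stand-ins: lists play the role of weight tensors.
v4 = {"router.linear.weight": [1.0, 2.0], "abstain_head.weight": [0.0], "abstain_head.bias": [0.0]}
v7 = {"router.linear.weight": [9.0, 9.0], "abstain_head.weight": [0.7], "abstain_head.bias": [-0.1]}

spliced, replaced = splice_head(v4, v7)
print(replaced)                         # only abstain-head entries
print(spliced["router.linear.weight"])  # still v4's router
```

With real PyTorch checkpoints the two dicts would typically come from `torch.load` and the result would go back in through `load_state_dict`. The point of the control is that the router entries stay v4's, so any signal that survives must live in the head alone.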
The mechanism, finally clean
The pieces fit together. Even with MC=0, the cross-entropy loss on in-domain samples backpropagates through
the unfrozen router linears. 500 steps × batch 32 = 16K in-domain samples' worth of gradient over 500 optimizer updates.
That shifts the router. The rest of the network was tuned for the old routing distribution.
The result is always the same: the abstain head learns to fire on the new routing pattern, AUROC rises,
and generation collapses.
Falsifiable corollary, queued: freeze the router linears as well as everything else, and leave only
the abstain head trainable. Prediction: AUROC stays high, generation stays coherent. Cost: ~$0.20 of GPU.
That experiment is the one we owe the field next.
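The queued experiment comes down to which parameters are allowed to receive gradient. A minimal sketch of that selection, assuming (hypothetically) that abstain-head parameters share an `abstain_head.` name prefix; the parameter names below are invented for the example and not taken from the actual model.

```python
def split_trainable(param_names, trainable_prefix="abstain_head."):
    """Partition parameter names into (trainable, frozen) by name prefix.

    Everything outside the prefix, including the router linears, is frozen.
    """
    trainable = [n for n in param_names if n.startswith(trainable_prefix)]
    frozen = [n for n in param_names if not n.startswith(trainable_prefix)]
    if not trainable:
        raise ValueError(f"no parameters match prefix {trainable_prefix!r}")
    return trainable, frozen

# Hypothetical parameter names for illustration.
names = [
    "blocks.0.mlp.weight",
    "router.linear.weight",
    "router.linear.bias",
    "abstain_head.weight",
    "abstain_head.bias",
]
trainable, frozen = split_trainable(names)
print("trainable:", trainable)
print("frozen:", frozen)

# In PyTorch the same split would be applied with something like:
#   for name, p in model.named_parameters():
#       p.requires_grad_(name.startswith("abstain_head."))
```

If the prediction holds, the router entries in the frozen list are the ones that matter: with them fixed, the routing distribution cannot drift, whatever loss is applied to the head.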
The bill
| Run | Hardware | Wall time | Cost |
|---|---|---|---|
| v5 | CPU (local) | — | $0.00 |
| v6 | CPU (local) | — | $0.00 |
| v7 | RTX 4090 SECURE (RunPod) | 88.5 s + 5 min setup | $0.43 |
| v8a | A40 (RunPod) | ~3 min | $0.18 |
| v8b | A40 (RunPod) | ~3 min | $0.17 |
| splice | CPU (local) | — | $0.00 |
| Total GPU (v7 + v8a + v8b; v5, v6, and splice were CPU at $0) | | | $0.78 |