Same dataset. Same budget.
Best loss wins.
Mizan (Arabic for balance, scale) is the arena for small-model benches. Featherweight is the first bench inside it: one dataset, one step budget, one eval. No bigger-is-better, no moving the goalposts. The board names a community DoubleConv at val 1.3745, ahead of our own RWKV-mini at 1.4352.
That board is from last season. Since then we have kept iterating, and a new in-house architecture now edges past RWKV head-to-head in seed-matched runs. It is a narrow, honest win, and it is still firming up across seeds before it earns a place on the board. The benchmark stays the point.
Four moves, one verdict.
Mizan behaves like a real arena, not a vague challenge page: same dataset, same budget, automated grading, and a leaderboard that changes only once the release is actually open.
Same fight, every time.
Same dataset (TinyStories byte-LM), same step budget (500 default, configurable), same eval (held-out val loss). No moving the goalposts.
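One way to picture "same fight, every time" is as a frozen spec that every run must match. This is a minimal sketch, not the arena's actual config format: the `BenchSpec` class and its field names are hypothetical, and only the three values come from the rules above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchSpec:
    """Hypothetical description of one bench; field names are illustrative."""
    dataset: str = "TinyStories byte-LM"
    step_budget: int = 500               # default from the rules, configurable
    eval_metric: str = "held-out val loss"


def same_fight(a: BenchSpec, b: BenchSpec) -> bool:
    """Two runs are directly comparable only if every field of the spec matches."""
    return a == b


default_run = BenchSpec()
longer_run = BenchSpec(step_budget=800)

print(same_fight(default_run, BenchSpec()))  # identical specs
print(same_fight(default_run, longer_run))   # step budgets differ
```

Freezing the dataclass means a spec cannot be edited after a run starts, which is the code-level version of not moving the goalposts.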
Pick your weight class.
Precision, Optimizer, Attention, State-Space, Community. Each fighter is registered in exactly one division.
Bot trains, bot grades.
The bot trains your config on the bench, evaluates on held-out val, and posts the result. No human in the loop.
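For readers who want to see what a held-out val loss on a byte-level LM could look like, here is a rough sketch. It assumes the loss is the mean negative log-likelihood per byte in nats, which this page does not state. The `ByteModel` interface, the `held_out_val_loss` function, and the `context_len` default are all hypothetical and are not the bot's actual code.

```python
import math
from typing import Callable, Sequence

# Hypothetical interface: a model maps a byte context to 256 next-byte probabilities.
ByteModel = Callable[[bytes], Sequence[float]]


def held_out_val_loss(model: ByteModel, val_data: bytes, context_len: int = 96) -> float:
    """Mean negative log-likelihood per byte (assumed to be in nats)."""
    if not val_data:
        raise ValueError("val_data must not be empty")
    total = 0.0
    for i in range(len(val_data)):
        context = val_data[max(0, i - context_len):i]
        probs = model(context)
        # Indexing bytes yields an int in 0..255, the id of the true next byte.
        total -= math.log(probs[val_data[i]])
    return total / len(val_data)


def uniform_model(context: bytes) -> Sequence[float]:
    """Baseline that knows nothing: every byte is equally likely."""
    return [1.0 / 256] * 256


loss = held_out_val_loss(uniform_model, b"Once upon a time")
print(round(loss, 3))  # equals ln(256), about 5.545
```

Under that assumption, a model that guesses uniformly scores ln(256), roughly 5.545, so any trained fighter should land well below it.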
Best loss takes the belt.
Best val loss per division wins the belt. Best across all divisions is champion of champions — until someone knocks them out.
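The belt rule is simple enough to sketch in a few lines. The `Result` record and the function names below are hypothetical. The two entries are the ones named on this page, which were recorded at different step counts.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Result:
    """Hypothetical leaderboard record."""
    fighter: str
    division: str
    val_loss: float


def belts(results: Iterable[Result]) -> Dict[str, Result]:
    """Lowest val loss in each division takes that division's belt."""
    best: Dict[str, Result] = {}
    for r in results:
        if r.division not in best or r.val_loss < best[r.division].val_loss:
            best[r.division] = r
    return best


def champion(results: Iterable[Result]) -> Result:
    """Champion of champions: the lowest val loss among all belt-holders."""
    return min(belts(results).values(), key=lambda r: r.val_loss)


board = [
    Result("DoubleConv", "Community", 1.3745),
    Result("RWKV-mini", "State-Space", 1.4352),
]

print(champion(board).fighter)  # DoubleConv
```

Lower is better throughout, so a knockout is just a new entry with a smaller `val_loss` in the same division.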
Featherweight already has a benchmark to beat.
The current internal Featherweight leader is a community-template DoubleConv configuration. The wider arena is not open publicly yet.
The leader is a community-template DoubleConv at val 1.3745 after 800 steps. It beat our own RWKV-mini (val 1.4352 at 500 steps). The public submission flow is not open yet, but this is the bar the first Featherweight release will ship with. The benchmark stays the point.
Five divisions inside the first bench.
Every fighter is registered in one division. The current belt-holders are placeholder leaders until the arena opens.
| Division | What we’re measuring | Current title-holder |
|---|---|---|
| Precision | FP32 / FP16 / Int8 / Ternary at the same FLOPs budget | Int8 STE: ties FP32 at 41% size |
| Optimizer | AdamW / Lion / Sophia / Muon at the same step budget | Lion: beats AdamW by 6% |
| Attention | Softmax / Sliding window / GQA / Sparse | Sliding window: beats softmax at L=96 |
| State-Space | RWKV / Hyena / SSM / Mamba | RWKV-mini: beats softmax by 18–32% |
| Community | Anything else that fits the Featherweight bench | DoubleConv (1.3745) |
What the bench has told us — so far.
Patterns we’ve seen across the first divisions to date. They’re meant to be challenged by better submissions once the arena opens.
State-space beats attention at this scale. RWKV, Hyena and SSM all beat softmax by 18–32% — even with 300 fewer training steps.
Lion beats AdamW by ~6% on the same bench, same step budget.
Sliding-window attention beats full softmax at sequence length 96 — and uses fewer parameters doing it.
GQA matches softmax at 5% fewer params. Lossless at this scale.
Int8 with straight-through estimator ties FP32 at 41% size. No accuracy delta. Just smaller.
Sophia is under-tuned in this bench. An honest failure — we post the loss curve and cost ledger. Better tunings welcome.
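The int8 finding above relies on a straight-through estimator (STE). As a rough illustration of the idea only, and not the bench's actual training code, here is a pure-Python sketch. It assumes symmetric per-tensor quantization, and all function names are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: floats to integer codes in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def fake_quant_forward(weights):
    """Forward pass: the model sees weights snapped to the int8 grid."""
    codes, scale = quantize_int8(weights)
    return [c * scale for c in codes]


def ste_backward(upstream_grads):
    """Backward pass: round() has zero gradient almost everywhere, so the STE
    treats it as identity and passes gradients straight to the float weights."""
    return list(upstream_grads)


weights = [0.5, -1.27, 0.003]
codes, scale = quantize_int8(weights)
dequantized = fake_quant_forward(weights)
grads = ste_backward([0.1, -0.2, 0.3])

print(codes)  # [50, -127, 0]
print(grads)  # unchanged: [0.1, -0.2, 0.3]
```

The float weights keep learning from full-precision gradients while the forward pass only ever sees int8 values. That is why the stored model can shrink without the training signal collapsing.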
Build small models for a living? Think yours can beat 1.3745? The public submission flow is not open yet. If you want updates or early access, email us. Featherweight is first; the wider arena follows.
The first Mizan materials are still being prepared.
The arena framing, the Featherweight ruleset, and the benchmark materials are still being organized. Until then, treat this page as a research preview of how the first bench will work rather than a public release.
# Mizan release status
Arena kit: in preparation
Submission flow: not open yet
Release updates: hello@tilelli.tech
Five rules. No referee.
Open-weights only. No API-gated models. If we can’t load it, it can’t fight.
Same dataset, same step budget. No swapping the corpus to find a friendlier one.
Reproducible on CPU. If it needs more than one GPU, it isn’t featherweight.
Negative results count. A clean failure with the cost ledger is a valid submission. We post it.
Release timing is still private. The rules are ready before the public materials are.