
Built from zero. Under twenty dollars.

The full recipe to reproduce Tilelli v0.1 on one rented GPU for less than the price of a movie ticket.

15 May 2026 · Recipe · ~5 min read

The point of an open small model is that you can re-derive it yourself. This is the actual recipe. No pretrained weights, no shortcuts, no hidden steps.

1. Get the data

FineWeb-Edu, document-quality filtered. Roughly 2 GB of compressed text is enough for v0.1; we used 6 GB for the final checkpoint. The dataset is openly licensed; download it with the Hugging Face datasets library.
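As a rough illustration of the download step, the sketch below streams documents and stops at a byte budget. It is not the project's own script: the dataset id HuggingFaceFW/fineweb-edu, the sample-10BT subset, the output file name train.bin, the newline separator, and the choice to measure the budget in uncompressed bytes are all assumptions made for this example.

```python
from typing import Iterable, Iterator


def take_bytes(docs: Iterable[dict], budget_bytes: int) -> Iterator[bytes]:
    """Yield UTF-8 encoded documents until the next one would exceed the budget."""
    seen = 0
    for doc in docs:
        raw = doc["text"].encode("utf-8")
        if seen + len(raw) > budget_bytes:
            break
        seen += len(raw)
        yield raw


def download_sample(path: str = "train.bin", budget_bytes: int = 2 * 1024**3) -> None:
    """Stream FineWeb-Edu and write raw bytes to `path` (hypothetical file name)."""
    # Imported here so take_bytes stays usable without the library installed.
    from datasets import load_dataset

    stream = load_dataset(
        "HuggingFaceFW/fineweb-edu",  # assumed dataset id
        name="sample-10BT",           # assumed subset; any subset would do
        split="train",
        streaming=True,               # avoids downloading the full dataset
    )
    with open(path, "wb") as f:
        for raw in take_bytes(stream, budget_bytes):
            f.write(raw + b"\n")      # newline as document separator (assumption)
```

Calling download_sample() once on the rented machine would produce a single flat byte file; because the model is byte-level, that file can be read directly as training data.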

2. Rent a GPU

A community-priced A6000 on RunPod runs about $0.40/hour. Training v0.1 to 50K steps takes ~5 hours of wall-clock time, so the GPU budget is ~$2. The rest of the ~$20 is a generous buffer for restarts and tokenizer experiments.

3. Tokenize (or don't)

Tilelli is byte-level. Tokenization is the identity function: b"hello" → bytes 104, 101, 108, 108, 111. Vocabulary size 256. No tokenizer training run, no merge rules, no BPE artifacts. This is one of the reasons the model stays small.
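To make the "identity function" concrete, here is a minimal byte-level codec. It is a stand-in written for this post's explanation, not the library's ByteTokenizer; the class name ByteCodec and the choice to replace invalid UTF-8 on decode are assumptions.

```python
class ByteCodec:
    """Illustrative byte-level codec: token ids are the UTF-8 byte values."""

    vocab_size = 256  # one id per possible byte value

    def encode(self, text: str) -> list[int]:
        # No merges, no lookup table: the ids are just the bytes.
        return list(text.encode("utf-8"))

    def decode(self, ids: list[int]) -> str:
        # A model can emit byte sequences that are not valid UTF-8,
        # so invalid sequences are replaced rather than raising.
        return bytes(ids).decode("utf-8", errors="replace")


codec = ByteCodec()
print(codec.encode("hello"))  # [104, 101, 108, 108, 111]
```

Note that non-ASCII characters take more than one id: "é" becomes the two bytes 195 and 169, so sequence length in tokens equals length in bytes, not in characters.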

4. Train

Run the training script:

python scripts/train.py --config configs/lite_10m.yaml

Architecture: 6 routed blocks, each with (local-conv, sparse-attn-8h-k16, ternary-FFN expand=4) and a 3-pathway router. Optimizer: AdamW with a peak learning rate of 3e-4 on a cosine schedule, weight decay 0.01, 50K steps, seq=256, batch tokens ≈ 32K. A straight-through estimator passes gradients through the ternary quantization.
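The hyperparameters above imply a few numbers worth working out. The sketch below computes them and shows one common form of cosine decay. The source only says "cosine", so decaying to zero with no warmup is an assumption, as is reading "≈ 32K" as 32,768; the function name cosine_lr is made up for this example.

```python
import math

PEAK_LR = 3e-4
TOTAL_STEPS = 50_000
SEQ_LEN = 256
BATCH_TOKENS = 32_768  # "≈ 32K" read as 2**15 (assumption)


def cosine_lr(step: int, peak: float = PEAK_LR,
              total: int = TOTAL_STEPS, floor: float = 0.0) -> float:
    """Cosine decay from `peak` at step 0 to `floor` at step `total`."""
    progress = min(max(step, 0), total) / total
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))


# Sequences per optimizer step at seq=256.
seqs_per_batch = BATCH_TOKENS // SEQ_LEN

# Byte-level model: one token is one byte, so this is also bytes seen.
bytes_seen = BATCH_TOKENS * TOTAL_STEPS

print(seqs_per_batch)                 # 128
print(f"{bytes_seen / 1e9:.2f} GB")   # 1.64 GB
print(cosine_lr(0), cosine_lr(TOTAL_STEPS // 2))
```

Under these assumptions the run sees about 1.6 billion bytes in total, at 128 sequences per step.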

5. SFT for chat (optional)

The "tilelli" chat persona is the result of a small abstain-aware SFT on top of the pretrained checkpoint. The data is about 50K USER:/TILELLI: pairs, including deliberately unanswerable questions where the target is "I don't know." This takes another hour or two of GPU time.
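One way such pairs could be serialized for a byte-level model is sketched below. Only the USER:/TILELLI: prefixes and the "I don't know." target come from the description above; the exact template (spaces, newlines) and the helper name format_pair are hypothetical, and the real format lives in the abstain config.

```python
from typing import Optional

ABSTAIN = "I don't know."


def format_pair(question: str, answer: Optional[str] = None) -> str:
    """Render one SFT example; a missing answer becomes the abstain target."""
    target = ABSTAIN if answer is None else answer
    # Template layout is an assumption, not the project's actual format.
    return f"USER: {question}\nTILELLI: {target}\n"


examples = [
    format_pair("What is the capital of France?", "Paris."),
    format_pair("What did I have for breakfast today?"),  # unanswerable
]

# Byte-level training data: each example is just its UTF-8 bytes.
encoded = [list(e.encode("utf-8")) for e in examples]
print(examples[1], end="")
```

Mixing unanswerable questions into the data this way gives the model an explicit target for abstaining, rather than leaving that behavior to chance.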

6. Run it anywhere

The inference loop is in tilelli/inference.py. It is pure PyTorch and runs on CPU, single-threaded or multi-threaded. Loading the checkpoint takes about a second.

## five lines, on CPU, no GPU required
from tilelli import TilelliLM, ByteTokenizer
tok = ByteTokenizer()
lm  = TilelliLM.from_pretrained("tilelli/lite-10m")
out = lm.generate(tok.encode("Once upon a time"), max_new=128)
print(tok.decode(out))

What we spent (in total)

| Item | Cost |
| --- | --- |
| FineWeb-Edu data download (bandwidth) | $0 |
| Pretrain (50K steps, A6000) | ~$2 |
| Abstain-aware SFT | ~$0.50 |
| Metacognition experiments (v5–v8b) | $1.05 |
| Buffer (failed runs, debugging) | ~$3 |
| Total to v0.1 working ckpt | ~$7 |

Less than a movie ticket. Less than the average lunch in San Francisco.
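For anyone adapting the budget to a different GPU price, the arithmetic can be rechecked in a few lines. The figures are the ones quoted in this post; the approximate entries are treated as exact values here, which is a simplification.

```python
GPU_RATE_USD_PER_HOUR = 0.40  # community-priced A6000, as quoted above
PRETRAIN_HOURS = 5            # ~5 hours to 50K steps

costs = {
    "FineWeb-Edu data download": 0.00,
    "Pretrain (50K steps, A6000)": GPU_RATE_USD_PER_HOUR * PRETRAIN_HOURS,
    "Abstain-aware SFT": 0.50,
    "Metacognition experiments": 1.05,
    "Buffer (failed runs, debugging)": 3.00,
}

total = sum(costs.values())
for item, usd in costs.items():
    print(f"{item:<34} ${usd:5.2f}")
print(f"{'Total':<34} ${total:5.2f}")
```

The items sum to $6.55, which the table rounds to ~$7. Changing GPU_RATE_USD_PER_HOUR or PRETRAIN_HOURS updates the pretraining line and the total.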

Code, weights, and the abstain config are all in the public kit. If you want to talk to the maintainers about reproducing it, email hello@tilelli.tech.