Tilelli / Tilelli Med / Methods
A short, honest technical description. What we did, how we evaluated it, and what we didn't do.
We use ComplEx (Trouillon et al., 2016): a complex-valued tensor factorization scoring function over (head, relation, tail) triples. For a triple (h, r, t), the score is the real part of <h, r, conj(t)> over complex embeddings.
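For concreteness, a minimal NumPy sketch of the scoring function (the dimension and random initialization are illustrative, not our training setup):

```python
import numpy as np

def complex_score(h: np.ndarray, r: np.ndarray, t: np.ndarray) -> float:
    """ComplEx score for one triple: Re(<h, r, conj(t)>), a trilinear
    product over complex-valued embedding vectors."""
    return float(np.real(np.sum(h * r * np.conj(t))))

rng = np.random.default_rng(0)
d = 256  # embedding dimension, illustrative only
h, r, t = (rng.normal(size=d) + 1j * rng.normal(size=d) for _ in range(3))
print(complex_score(h, r, t))
```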
We add two ingredients from Lacroix et al. (2018): the N3 regularizer, a weighted nuclear 3-norm that consistently dominates L2/Frobenius regularization for ComplEx-family models, and reciprocal relations, where each training triple (h, r, t) is augmented with (t, r⁻¹, h). The combined recipe ("ComplEx-N3") is the published high-water mark on standard knowledge-graph benchmarks.
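A sketch of both ingredients, under the same complex-embedding convention as above; the regularization weight and the relation-id encoding of r⁻¹ are illustrative choices, not necessarily ours:

```python
import numpy as np

def n3_penalty(h, r, t, weight=1e-2):
    """N3 regularizer: penalize the cubed moduli of the embedding factors
    appearing in a training triple. `weight` is illustrative."""
    return weight * sum(np.sum(np.abs(x) ** 3) for x in (h, r, t))

def add_reciprocals(triples, num_relations):
    """Reciprocal relations: augment each (h, r, t) with (t, r_inv, h),
    encoding r_inv as r + num_relations (one common convention)."""
    return triples + [(t, r + num_relations, h) for (h, r, t) in triples]
```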
Each entity's real and imaginary embedding tables are quantized independently to {−1, 0, +1} with a small per-block float scale. The number of blocks per row, B, is the knob: B=1 means a single scale per row (highest compression); B=512 means a scale per dimension (no compression, since rows are 512-dimensional). At B=128 we get 5.3× compression of the entity tables, and the model still beats the OGBL TransE leaderboard baseline.
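A sketch of the quantizer. The text above fixes the block structure but not the threshold or scale rule; the sketch uses a common ternary heuristic (threshold at 0.7 × mean|w|, scale = mean magnitude of the surviving entries) and fp16 scales, all of which are assumptions:

```python
import numpy as np

def ternarize_row(row: np.ndarray, num_blocks: int):
    """Quantize one embedding row to {-1, 0, +1} with one float scale per
    block. Threshold and scale rules are a common ternary heuristic, not
    necessarily the exact rule used here."""
    blocks = row.reshape(num_blocks, -1)                    # (B, d/B)
    thresh = 0.7 * np.abs(blocks).mean(axis=1, keepdims=True)
    codes = np.sign(blocks) * (np.abs(blocks) > thresh)     # {-1, 0, +1}
    kept = (codes != 0)
    scales = (np.abs(blocks) * kept).sum(axis=1, keepdims=True) \
             / np.maximum(kept.sum(axis=1, keepdims=True), 1)
    return codes.astype(np.int8), scales.astype(np.float16)

def dequantize_row(codes: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Reconstruct the float row: per-block scale times ternary code."""
    return (codes * scales.astype(np.float32)).reshape(-1)
```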
We use the official OGB filtered protocol: each test triple ships with 500 type-constrained negatives, already filtered against the train + valid + test splits. We rank the gold against (gold + 500 negatives) on both head and tail sides, and report mean reciprocal rank.
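A sketch of the tail-side computation, assuming the triple scores are precomputed; ties are counted against the gold (the pessimistic convention), which is an assumption on our part:

```python
import numpy as np

def mrr_tail_side(gold_scores: np.ndarray, neg_scores: np.ndarray) -> float:
    """gold_scores: (N,) score of each gold tail. neg_scores: (N, 500)
    scores of the pre-filtered negatives shipped with each test triple.
    Rank the gold among (gold + 500 negatives); average reciprocal ranks."""
    ranks = 1 + (neg_scores >= gold_scores[:, None]).sum(axis=1)
    return float((1.0 / ranks).mean())

# The head side is computed identically with 500 negative heads; the
# reported MRR averages the two sides.
```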
We trained 3 independent ComplEx-N3 models with random seeds 1, 2, 3:
| Seed | Validation MRR | Hits@1 | Hits@10 |
|---|---|---|---|
| 1 | 0.8378 | 0.774 | 0.946 |
| 2 | 0.8427 | 0.786 | 0.945 |
| 3 | 0.8436 | 0.785 | 0.949 |
| Mean ± SD | 0.8414 ± 0.003 | 0.782 ± 0.005 | 0.947 ± 0.002 |
An SD of 0.003 MRR across seeds indicates the result is robust to the choice of random seed.

Quantizing the float teacher at different block counts gives:
| Blocks per row (B) | MRR | Hits@1 | Hits@10 | Compression |
|---|---|---|---|---|
| Float teacher | 0.847 | 0.790 | 0.949 | 1× |
| B=256 | 0.794 | 0.717 | 0.939 | 3.2× |
| B=128 | 0.752 | 0.667 | 0.923 | 5.3× |
| B=64 | 0.730 | 0.637 | 0.914 | 8.0× |
| B=1 (per-row) | 0.696 | 0.592 | 0.901 | 15.8× |
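The compression column is reproduced exactly by one plausible storage model (our reconstruction, not stated above): 512-dimensional fp32 rows replaced by 2-bit ternary codes plus one fp16 scale per block.

```python
# Assumed storage model: 2-bit ternary codes + fp16 per-block scales,
# versus fp32 rows of dimension 512. Reproduces the table's column.
d = 512
for B in (256, 128, 64, 1):
    ratio = (d * 32) / (d * 2 + B * 16)
    print(f"B={B}: {ratio:.1f}x")   # 3.2x, 5.3x, 8.0x, 15.8x
```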
Reference leaderboard entries: TransE 0.745, RotatE 0.799, ComplEx 0.810. Our B=128 ternary edges out TransE; B=256 sits between TransE and RotatE.
We trained a small MLP that takes the float (h, r) embeddings and predicts whether the ternary student will agree with the float teacher on the top-1 tail for that query. This gives a per-query confidence signal — useful clinically, where "is the cheap model reliable for this case?" is the question that matters.
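A sketch of the predictor in PyTorch; the feature choice (concatenated float h and r embeddings) is from the text above, while the width, depth, and training details are illustrative:

```python
import torch
import torch.nn as nn

class AgreementPredictor(nn.Module):
    """Given float (h, r) query embeddings, predict whether the ternary
    student's top-1 tail will match the float teacher's. Width and depth
    are illustrative."""
    def __init__(self, dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, h_emb: torch.Tensor, r_emb: torch.Tensor) -> torch.Tensor:
        # Returns a logit; sigmoid gives the per-query confidence.
        return self.net(torch.cat([h_emb, r_emb], dim=-1)).squeeze(-1)

# Train with nn.BCEWithLogitsLoss against observed agree/disagree labels;
# evaluate AUC and Brier score on a held-out split.
```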
This is the medical analogue of the NEO-style metacognition we measure across frontier chat models — exposed not as a black-box signal but as a small, auditable predictor whose AUC and Brier score are reported on every release.
For a target disease D (UMLS CUI), the query is a ranking: score candidate entities against D under the relevant relation and sort (sketch below).
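A sketch of that ranking, with complex embeddings as before; the relation, the candidate set, and all names here are hypothetical and dataset-specific:

```python
import numpy as np

def rank_candidates_for_disease(cand_emb: np.ndarray, rel_emb: np.ndarray,
                                disease_emb: np.ndarray, top_k: int = 20):
    """Score each candidate head against the fixed disease tail D under a
    chosen relation (e.g. a drug-disease relation; hypothetical), then
    return the indices of the top_k candidates."""
    scores = np.real((cand_emb * rel_emb[None, :]
                      * np.conj(disease_emb)[None, :]).sum(axis=1))
    return np.argsort(-scores)[:top_k]
```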
We do not yet model contraindication as a first-class relation; see the PrimeKG follow-up.