Tilelli / Tilelli Med / Methods
A short, honest technical description. What we did, how we evaluated it, and what we didn't do.
We use ComplEx (Trouillon et al., 2016): a complex-valued tensor factorization scoring function over (head, relation, tail) triples. For a triple (h, r, t), the score is the real part of <h, r, conj(t)> over complex embeddings.
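For concreteness, a minimal NumPy sketch of the scoring function (the dimension and random initialization are illustrative, not our training setup):

```python
import numpy as np

def complex_score(h: np.ndarray, r: np.ndarray, t: np.ndarray) -> float:
    """ComplEx score for one triple: Re(<h, r, conj(t)>), a trilinear
    product over complex-valued embedding vectors."""
    return float(np.real(np.sum(h * r * np.conj(t))))

rng = np.random.default_rng(0)
d = 256  # embedding dimension, illustrative only
h, r, t = (rng.normal(size=d) + 1j * rng.normal(size=d) for _ in range(3))
print(complex_score(h, r, t))
```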
We add two ingredients from Lacroix et al. (2018): the N3 regularizer, a weighted nuclear 3-norm that consistently dominates L2/Frobenius regularization for ComplEx-family models, and reciprocal relations, where each training triple (h, r, t) is augmented with (t, r⁻¹, h). The combined recipe ("ComplEx-N3") is the published high-water mark on standard knowledge-graph benchmarks.
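A sketch of both ingredients, under the same complex-embedding convention as above; the regularization weight and the relation-id encoding of r⁻¹ are illustrative choices, not necessarily ours:

```python
import numpy as np

def n3_penalty(h, r, t, weight=1e-2):
    """N3 regularizer: penalize the cubed moduli of the embedding factors
    appearing in a training triple. `weight` is illustrative."""
    return weight * sum(np.sum(np.abs(x) ** 3) for x in (h, r, t))

def add_reciprocals(triples, num_relations):
    """Reciprocal relations: augment each (h, r, t) with (t, r_inv, h),
    encoding r_inv as r + num_relations (one common convention)."""
    return triples + [(t, r + num_relations, h) for (h, r, t) in triples]
```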
Each entity's real and imaginary embedding tables are quantized independently to {−1, 0, +1} with a small per-block float scale. The number of blocks per row, B, is the knob: B=1 means a single scale per row (highest compression); B=512 means a scale per dimension (no compression, since rows are 512-dimensional). At B=128 we get 5.3× compression of the entity tables, and the model still beats the OGBL TransE leaderboard baseline.
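A sketch of the quantizer. The text above fixes the block structure but not the threshold or scale rule; the sketch uses a common ternary heuristic (threshold at 0.7 × mean|w|, scale = mean magnitude of the surviving entries) and fp16 scales, all of which are assumptions:

```python
import numpy as np

def ternarize_row(row: np.ndarray, num_blocks: int):
    """Quantize one embedding row to {-1, 0, +1} with one float scale per
    block. Threshold and scale rules are a common ternary heuristic, not
    necessarily the exact rule used here."""
    blocks = row.reshape(num_blocks, -1)                    # (B, d/B)
    thresh = 0.7 * np.abs(blocks).mean(axis=1, keepdims=True)
    codes = np.sign(blocks) * (np.abs(blocks) > thresh)     # {-1, 0, +1}
    kept = (codes != 0)
    scales = (np.abs(blocks) * kept).sum(axis=1, keepdims=True) \
             / np.maximum(kept.sum(axis=1, keepdims=True), 1)
    return codes.astype(np.int8), scales.astype(np.float16)

def dequantize_row(codes: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Reconstruct the float row: per-block scale times ternary code."""
    return (codes * scales.astype(np.float32)).reshape(-1)
```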
We use the official OGB filtered protocol: each test triple ships with 500 type-constrained negatives, already filtered against the train + valid + test splits. We rank the gold against (gold + 500 negatives) on both head and tail sides, and report mean reciprocal rank.
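A sketch of the tail-side computation, assuming the triple scores are precomputed; ties are counted against the gold (the pessimistic convention), which is an assumption on our part:

```python
import numpy as np

def mrr_tail_side(gold_scores: np.ndarray, neg_scores: np.ndarray) -> float:
    """gold_scores: (N,) score of each gold tail. neg_scores: (N, 500)
    scores of the pre-filtered negatives shipped with each test triple.
    Rank the gold among (gold + 500 negatives); average reciprocal ranks."""
    ranks = 1 + (neg_scores >= gold_scores[:, None]).sum(axis=1)
    return float((1.0 / ranks).mean())

# The head side is computed identically with 500 negative heads; the
# reported MRR averages the two sides.
```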
We trained 3 independent ComplEx-N3 models with random seeds 1, 2, 3:
| Seed | Validation MRR | Hits@1 | Hits@10 |
|---|---|---|---|
| 1 | 0.8378 | 0.774 | 0.946 |
| 2 | 0.8427 | 0.786 | 0.945 |
| 3 | 0.8436 | 0.785 | 0.949 |
| Mean ± SD | 0.8414 ± 0.003 | 0.782 ± 0.005 | 0.947 ± 0.002 |
An SD of 0.003 MRR across seeds indicates the result is robust to the choice of random seed.

Quantizing the float teacher at different block counts gives:
| Blocks per row (B) | MRR | Hits@1 | Hits@10 | Compression |
|---|---|---|---|---|
| Float teacher | 0.847 | 0.790 | 0.949 | 1× |
| B=256 | 0.794 | 0.717 | 0.939 | 3.2× |
| B=128 | 0.752 | 0.667 | 0.923 | 5.3× |
| B=64 | 0.730 | 0.637 | 0.914 | 8.0× |
| B=1 (per-row) | 0.696 | 0.592 | 0.901 | 15.8× |
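The compression column is reproduced exactly by one plausible storage model (our reconstruction, not stated above): 512-dimensional fp32 rows replaced by 2-bit ternary codes plus one fp16 scale per block.

```python
# Assumed storage model: 2-bit ternary codes + fp16 per-block scales,
# versus fp32 rows of dimension 512. Reproduces the table's column.
d = 512
for B in (256, 128, 64, 1):
    ratio = (d * 32) / (d * 2 + B * 16)
    print(f"B={B}: {ratio:.1f}x")   # 3.2x, 5.3x, 8.0x, 15.8x
```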
Reference leaderboard entries: TransE 0.745, RotatE 0.799, ComplEx 0.810. Our B=128 ternary edges out TransE; B=256 sits between TransE and RotatE.
We trained a small MLP that takes the float (h, r) embeddings and predicts whether the ternary student will agree with the float teacher on the top-1 tail for that query. This gives a per-query confidence signal — useful clinically, where "is the cheap model reliable for this case?" is the question that matters.
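A sketch of the predictor in PyTorch; the feature choice (concatenated float h and r embeddings) is from the text above, while the width, depth, and training details are illustrative:

```python
import torch
import torch.nn as nn

class AgreementPredictor(nn.Module):
    """Given float (h, r) query embeddings, predict whether the ternary
    student's top-1 tail will match the float teacher's. Width and depth
    are illustrative."""
    def __init__(self, dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, h_emb: torch.Tensor, r_emb: torch.Tensor) -> torch.Tensor:
        # Returns a logit; sigmoid gives the per-query confidence.
        return self.net(torch.cat([h_emb, r_emb], dim=-1)).squeeze(-1)

# Train with nn.BCEWithLogitsLoss against observed agree/disagree labels;
# evaluate AUC and Brier score on a held-out split.
```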
This is the medical analogue of the NEO-style metacognition we measure across frontier chat models — exposed not as a black-box signal but as a small, auditable predictor whose AUC and Brier score are reported on every release.
For a target disease D (UMLS CUI), the query is a ranking: score candidate entities against D under the relevant relation and sort (sketch below).
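A sketch of that ranking, with complex embeddings as before; the relation, the candidate set, and all names here are hypothetical and dataset-specific:

```python
import numpy as np

def rank_candidates_for_disease(cand_emb: np.ndarray, rel_emb: np.ndarray,
                                disease_emb: np.ndarray, top_k: int = 20):
    """Score each candidate head against the fixed disease tail D under a
    chosen relation (e.g. a drug-disease relation; hypothetical), then
    return the indices of the top_k candidates."""
    scores = np.real((cand_emb * rel_emb[None, :]
                      * np.conj(disease_emb)[None, :]).sum(axis=1))
    return np.argsort(-scores)[:top_k]
```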
We do not yet model contraindication as a first-class relation; see the PrimeKG follow-up.