DriftingGloVeStudent (rank = 300) over a frozen 300-d cl100k BPE-GloVe embedding table, distilled from
Qwen/Qwen3-Embedding-8B (MRL-truncated to the first 300 dims, then re-L2-normalized).
Same architecture and hyperparameters as the previous best 300-d run
(r300/one_more_try_train_consolidated (3).ipynb); the only change is the teacher source.
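The teacher preparation described above (keep the first 300 dimensions, then re-normalize) can be sketched roughly as follows. This is an illustration only, not the notebook's code: the function name `mrl_truncate`, the random stand-in teacher matrix, and its width of 4096 are assumptions.

```python
import numpy as np

def mrl_truncate(teacher: np.ndarray, dims: int = 300) -> np.ndarray:
    """Keep the first `dims` columns of each teacher vector and re-L2-normalize."""
    head = teacher[:, :dims]
    norms = np.linalg.norm(head, axis=1, keepdims=True)
    # Guard against division by zero for degenerate rows.
    return head / np.clip(norms, 1e-12, None)

# Hypothetical stand-in for a batch of full-width teacher embeddings.
rng = np.random.default_rng(0)
full = rng.standard_normal((4, 4096))
targets = mrl_truncate(full)
```

Re-normalizing matters because a truncated unit vector generally has norm below 1, which would change the scale of the dot products in the loss.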
Loss: cross_entropy(v @ v_T^T / τ) + λ_MSE · MSE(v, v_T) with τ = 0.05, λ_MSE = 1.0.
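A minimal NumPy sketch of this loss, under two assumptions that the card does not state: the cross-entropy targets are the in-batch diagonal (row i of the student should match row i of the teacher), and MSE is averaged over all elements. The name `distill_loss` is hypothetical.

```python
import numpy as np

def distill_loss(v, v_t, tau=0.05, lambda_mse=1.0):
    """cross_entropy(v @ v_t^T / tau) + lambda_mse * MSE(v, v_t).

    Assumption: the correct class for row i is column i (in-batch negatives).
    """
    logits = v @ v_t.T / tau
    # Subtract the row max before exponentiating for numerical stability.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    ce = -np.mean(np.diag(log_probs))
    mse = np.mean((v - v_t) ** 2)
    return ce + lambda_mse * mse

# Stand-in data: unit-norm "teacher" rows, and a student that either
# matches them exactly or is misaligned by one row.
rng = np.random.default_rng(0)
t = rng.standard_normal((8, 300))
t /= np.linalg.norm(t, axis=1, keepdims=True)
loss_matched = distill_loss(t, t)
loss_shifted = distill_loss(np.roll(t, 1, axis=0), t)
```

With τ = 0.05 a perfectly matched pair has a logit of 20, so the contrastive term pushes hard on ranking the right teacher row first, while the MSE term ties the student to the teacher's actual coordinates.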
Training: 150,000 steps × batch 256 × lr 0.0005 (cosine → 1e-05).
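For reference, a cosine decay from 0.0005 to 1e-05 over 150,000 steps could look like the sketch below. It assumes a plain half-cosine with no warmup, which the card does not confirm; `cosine_lr` is a hypothetical helper.

```python
import math

def cosine_lr(step, total_steps=150_000, lr_max=5e-4, lr_min=1e-5):
    """Half-cosine decay from lr_max at step 0 to lr_min at total_steps."""
    progress = min(max(step, 0), total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

lr_start = cosine_lr(0)
lr_mid = cosine_lr(75_000)
lr_end = cosine_lr(150_000)
```

At batch 256, 150,000 steps correspond to 38,400,000 training examples seen.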
Files under rank_300/:
checkpoint_final.pt: LoRA A.weight + B.weight (E excluded; re-inject from jsanzolac/drifting-glove-distilled-r300/vectors.txt).
config.json
vectors_drifted.txt / .parquet: E + B(A(·)) per vocab row.
train_log.jsonl
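The drifted vectors, E + B(A(·)) applied per vocab row, can be recomputed from the frozen table and the two LoRA matrices roughly as follows. This sketch assumes `A.weight` has shape (rank, 300) and `B.weight` has shape (300, rank), as for bias-free linear layers, and that no extra scaling factor is applied; the arrays here are random stand-ins, not the real checkpoint.

```python
import numpy as np

def drift(E: np.ndarray, A_weight: np.ndarray, B_weight: np.ndarray) -> np.ndarray:
    """Return E + B(A(E)) for every row of E.

    Assumed shapes: E (vocab, d), A_weight (rank, d), B_weight (d, rank).
    """
    low_rank = E @ A_weight.T          # (vocab, rank)
    return E + low_rank @ B_weight.T   # (vocab, d)

# Hypothetical stand-ins for the frozen table and the LoRA weights.
rng = np.random.default_rng(0)
E = rng.standard_normal((5, 300))
A_w = rng.standard_normal((300, 300)) * 0.01
B_zero = np.zeros((300, 300))

drifted_identity = drift(E, A_w, B_zero)  # a zero B leaves E unchanged
drifted = drift(E, A_w, rng.standard_normal((300, 300)) * 0.01)
```

Because E is excluded from checkpoint_final.pt, the table has to be loaded from vectors.txt first; only then do the A and B weights reproduce vectors_drifted.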