language:
- en
license: apache-2.0
tags:
- glove
- lora
- distillation
- hard-negatives
- qwen3-embedding
base_model: jsanzolac/bpe_glove_300_lora_r300_qwen3
datasets:
- jsanzolac/qwen3_emb_300_packed_cl100k
- jsanzolac/qwen3_emb_512_hard_negatives
bpe_glove_300_lora_r300_qwen3_hardnegs_nce_only
Continuation of jsanzolac/bpe_glove_300_lora_r300_qwen3 — same DriftingGloVeStudent rank=300 over a frozen
300-d cl100k BPE-GloVe — trained for an additional 150,000 steps with mined
hard negatives from jsanzolac/qwen3_emb_512_hard_negatives.
Loss: cross_entropy(v @ [v_T ‖ v_hards]^T / τ), with τ = 0.05 and H = 64 mined
hard negatives per anchor. The MSE term is disabled in this variant.
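The loss above can be sketched in NumPy as follows. This is a minimal illustration, not the training code: it assumes each anchor is scored only against its own target and its own H mined negatives (no in-batch negatives), that the positive sits at index 0, and that vectors are used as given without extra normalization. The function name `nce_loss` and the tensor shapes are hypothetical.

```python
import numpy as np

def nce_loss(v, v_t, v_hards, tau=0.05):
    """Cross-entropy over [v_T ‖ v_hards] with the positive at index 0.

    Assumed shapes (not confirmed by the model card):
      v:       (B, d)     student embeddings
      v_t:     (B, d)     cached teacher targets
      v_hards: (B, H, d)  mined hard negatives per anchor
    """
    # Stack the positive in front of the negatives: (B, 1 + H, d).
    cands = np.concatenate([v_t[:, None, :], v_hards], axis=1)
    # Dot product of each anchor with its own candidates, scaled by 1/tau.
    logits = np.einsum("bd,bkd->bk", v, cands) / tau
    # Numerically stable log-softmax over the candidate axis.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Negative log-likelihood of the positive (index 0), averaged over anchors.
    return float(-log_probs[:, 0].mean())
```

With τ = 0.05, a dot-product gap of 1.0 between the positive and a negative becomes a logit gap of 20, so the loss concentrates on negatives that score close to the target.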
Warm-start: A.weight + B.weight from jsanzolac/bpe_glove_300_lora_r300_qwen3/rank_300/checkpoint_final.pt. Optimizer
state was not in the source checkpoint, so this run uses a fresh LR schedule
(5e-4 → 1e-5 cosine over 150,000 steps).
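For reference, a cosine schedule with these endpoints can be written as below. This is a sketch under the assumption of a plain half-cosine decay with no warmup phase; the model card does not say whether warmup was used, and `cosine_lr` is a hypothetical helper name.

```python
import math

def cosine_lr(step, total_steps=150_000, lr_max=5e-4, lr_min=1e-5):
    """Half-cosine decay from lr_max at step 0 to lr_min at total_steps.

    Assumes no warmup; steps outside [0, total_steps] are clamped.
    """
    progress = min(max(step, 0), total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))
```

Under this assumption the learning rate at the midpoint (step 75,000) is the average of the two endpoints, 2.55e-4.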
Frozen: E (300-d GloVe from jsanzolac/drifting-glove-distilled-r300), teacher (only used to produce the
cached v_T targets in jsanzolac/qwen3_emb_300_packed_cl100k — not loaded here).
Files under rank_300/:
- checkpoint_final.pt: A.weight + B.weight (E excluded; reinject from jsanzolac/drifting-glove-distilled-r300).
- config.json
- vectors_drifted.txt / .parquet: E + B(A(·)) per vocab row.
- train_log.jsonl
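The drifted vectors E + B(A(·)) can be recomputed from the checkpoint weights and the reinjected E roughly as follows. This is a hedged sketch: it assumes A and B are bias-free linear maps with no nonlinearity or scaling between them, and that their weights use the PyTorch `nn.Linear` layout of (out_features, in_features). The function name `drifted_vectors` is hypothetical, and the arrays are assumed to be already loaded as NumPy arrays.

```python
import numpy as np

def drifted_vectors(E, A_weight, B_weight):
    """Compute E + B(A(E)) row by row for the whole vocabulary.

    Assumed shapes (not confirmed by the model card):
      E:        (V, d)  frozen GloVe table, d = 300
      A_weight: (r, d)  down/in projection, r = 300
      B_weight: (d, r)  up/out projection
    """
    # nn.Linear applies x @ W.T, so A maps (V, d) -> (V, r) and B maps back to (V, d).
    delta = (E @ A_weight.T) @ B_weight.T
    # Residual form: the frozen base embedding plus the learned drift.
    return E + delta
```

If B.weight is all zeros the output equals E, which is a quick sanity check after reinjecting the frozen table.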