---
language:
  - en
license: apache-2.0
tags:
  - glove
  - lora
  - distillation
  - hard-negatives
  - qwen3-embedding
base_model: jsanzolac/bpe_glove_300_lora_r300_qwen3
datasets:
  - jsanzolac/qwen3_emb_300_packed_cl100k
  - jsanzolac/qwen3_emb_512_hard_negatives
---

bpe_glove_300_lora_r300_qwen3_hardnegs

This model continues jsanzolac/bpe_glove_300_lora_r300_qwen3. It uses the same DriftingGloVeStudent (rank = 300) over a frozen 300-d cl100k BPE-GloVe embedding, and was trained for an additional 150,000 steps with mined hard negatives from jsanzolac/qwen3_emb_512_hard_negatives.

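Loss: cross_entropy(v @ [v_T ‖ v_hards]^T / τ) + λ_MSE · MSE(v, v_T), with τ = 0.05, λ_MSE = 1.0, and H = 64 mined hard negatives per anchor. Here v is the student vector, v_T is the cached teacher target, v_hards are the mined hard negatives, and ‖ denotes concatenation.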
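The loss can be sketched in NumPy as follows. This is one reading of the formula, not the training code: it assumes each anchor is scored only against its own teacher target (placed at index 0 as the positive class) and its own H hard negatives, with no in-batch negatives and no normalization beyond what the inputs already have. The function name `hard_negative_loss` and the array shapes are hypothetical.

```python
import numpy as np

def hard_negative_loss(v, v_t, v_hards, tau=0.05, lambda_mse=1.0):
    """Sketch of cross_entropy(v @ [v_T || v_hards]^T / tau) + lambda_MSE * MSE(v, v_T).

    v:       (B, d)    student vectors
    v_t:     (B, d)    cached teacher targets
    v_hards: (B, H, d) mined hard negatives per anchor
    """
    # ASSUMPTION: candidates are per anchor, teacher target first, then H negatives.
    candidates = np.concatenate([v_t[:, None, :], v_hards], axis=1)  # (B, 1+H, d)
    logits = np.einsum("bd,bkd->bk", v, candidates) / tau            # (B, 1+H)

    # Numerically stable log-softmax; the positive sits in column 0.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    ce = -log_probs[:, 0].mean()

    # Element-wise mean squared error between student and teacher vectors.
    mse = np.mean((v - v_t) ** 2)
    return ce + lambda_mse * mse
```

With the values above, a call would pass `v_hards` of shape (B, 64, 300) and keep the defaults `tau=0.05` and `lambda_mse=1.0`.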

Warm-start: A.weight and B.weight are loaded from jsanzolac/bpe_glove_300_lora_r300_qwen3/rank_300/checkpoint_final.pt. The source checkpoint did not include optimizer state, so this run starts a fresh LR schedule (cosine decay from 5e-4 to 1e-5 over 150,000 steps).
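For reference, a cosine decay between those two endpoints can be written in closed form. The sketch below assumes the standard cosine annealing formula with no warmup phase; whether the actual run used warmup or a different variant is not stated here, and the function name `cosine_lr` is hypothetical.

```python
import math

def cosine_lr(step, total_steps=150_000, lr_max=5e-4, lr_min=1e-5):
    """Learning rate at `step` for a cosine decay from lr_max to lr_min."""
    # ASSUMPTION: no warmup; steps outside [0, total_steps] are clamped.
    progress = min(max(step, 0), total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))
```

Under this form the rate starts at 5e-4, passes through the midpoint of the two endpoints halfway through training, and reaches 1e-5 at step 150,000.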

Frozen: E (300-d GloVe from jsanzolac/drifting-glove-distilled-r300) and the teacher. The teacher is used only to produce the cached v_T targets in jsanzolac/qwen3_emb_300_packed_cl100k and is not loaded here.

Files under rank_300/:

checkpoint_final.pt — A.weight + B.weight (E excluded; reinject from jsanzolac/drifting-glove-distilled-r300).
config.json
vectors_drifted.txt / .parquet — E + B(A(·)) per vocab row.
train_log.jsonl
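Because checkpoint_final.pt excludes E, the drifted vectors have to be recomposed from E plus the two weight matrices. The sketch below shows the E + B(A(·)) composition in NumPy. It assumes that A and B are bias-free linear maps applied to each row of E, and that their weights use the (out_features, in_features) layout of PyTorch linear layers. Loading E and the checkpoint is left out, and the function name `drift_vectors` is hypothetical.

```python
import numpy as np

def drift_vectors(E, A_weight, B_weight):
    """Compose E + B(A(E)) row by row.

    E:        (vocab, d) frozen base embeddings
    A_weight: (r, d)     down/across projection, ASSUMED layout (out, in)
    B_weight: (d, r)     projection back to d,   ASSUMED layout (out, in)
    """
    # ASSUMPTION: no bias terms and no nonlinearity between A and B.
    low_rank_update = (E @ A_weight.T) @ B_weight.T  # (vocab, d)
    return E + low_rank_update

# Toy shapes only; the real model uses d = 300 and rank r = 300.
E_toy = np.ones((5, 4))
A_toy = np.zeros((3, 4))
B_toy = np.zeros((4, 3))
drifted_toy = drift_vectors(E_toy, A_toy, B_toy)  # equals E_toy when the update is zero
```

If the recomposed rows do not match vectors_drifted.txt, the assumptions about weight layout or bias terms are the first things to check against config.json.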