metadata
language:
- en
license: apache-2.0
tags:
- glove
- lora
- distillation
- bpe
- cl100k_base
- ffn
base_model: jsanzolac/bpe_glove_512
datasets:
- jsanzolac/qwen3_emb_512
- jsanzolac/qwen3_emb_512_packed
bpe_glove_512_lora_v1_ffn
Warm-started from jsanzolac/bpe_glove_512_lora_v1/rank_512, with a per-token FFN inserted
between the GloVe-attention output and the alpha-pool collapse.
Trainable: A, B, FFN. Frozen: E, teacher.
Loss: λ_c·InfoNCE + λ_D·‖ρ_T − ρ_S‖²_F with λ_c=1.0, λ_D=0.1.
Density is computed on the post-FFN per-token states; InfoNCE is on the alpha-pooled sentence vector.
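The objective above can be sketched in plain Python. This is an illustration under stated assumptions, not the training code: InfoNCE is assumed to use in-batch negatives with cosine similarity between student and teacher sentence vectors, the temperature value is hypothetical, and ρ_T and ρ_S are passed in as precomputed matrices because this card does not define how density is computed. All function names are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(student, teacher, temperature=0.05):
    """Mean InfoNCE over a batch; row i of `teacher` is the positive for row i
    of `student`, all other rows are negatives (ASSUMED in-batch scheme)."""
    total = 0.0
    for i, s in enumerate(student):
        logits = [cosine(s, t) / temperature for t in teacher]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(student)


def frobenius_sq(rho_t, rho_s):
    """Squared Frobenius norm of rho_t - rho_s (matrices as nested lists)."""
    return sum(
        (a - b) ** 2
        for row_t, row_s in zip(rho_t, rho_s)
        for a, b in zip(row_t, row_s)
    )


def combined_loss(student, teacher, rho_t, rho_s,
                  lam_c=1.0, lam_d=0.1, temperature=0.05):
    """lam_c * InfoNCE + lam_d * ||rho_t - rho_s||_F^2, with the weights from
    this card. `temperature` is a HYPOTHETICAL value, not taken from the card."""
    return (lam_c * info_nce(student, teacher, temperature)
            + lam_d * frobenius_sq(rho_t, rho_s))
```

In this sketch the contrastive term sees only the alpha-pooled sentence vectors, while the density term sees only whatever matrices are built from the post-FFN per-token states, which mirrors the split described above.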
Files:
- rank_512/checkpoint_final.pt: A + B + FFN state dict (E is non-persistent; re-inject from jsanzolac/bpe_glove_512/vectors.txt).
- rank_512/config.json: full hyperparameters.
- rank_512/vectors_drifted.txt: E + B(A(·)) per vocab row, GloVe text format. Note: this captures only the static drifted embedding lookup, not the FFN's effect (which is contextual). To use the model end-to-end, instantiate DriftingGloVeStudentFFN and run forward.
- rank_512/train_log.jsonl: per-step metrics.
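For the static lookup only, the drifted vectors can be read without the model class. GloVe text format is conventionally one row per line: the token followed by space-separated floats. The loader below is a minimal sketch assuming that layout; the name load_glove_text and its dim argument are illustrative and not part of this repository. It splits from the right so that tokens containing spaces (plausible for a BPE vocabulary) stay intact, though how this file actually encodes such tokens is not stated here.

```python
def load_glove_text(path, dim):
    """Read a GloVe-style text file into {token: [float, ...]}.

    `dim` is the expected vector width; the caller must supply it
    (ASSUMPTION: check rank_512/config.json for the real value).
    """
    vectors = {}
    with open(path, encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            line = line.rstrip("\n")
            if not line:
                continue  # tolerate blank lines
            # Split from the right: the last `dim` fields are the floats,
            # everything before them is the token (which may contain spaces).
            parts = line.rsplit(" ", dim)
            if len(parts) != dim + 1:
                raise ValueError(f"line {line_no}: expected {dim} values")
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


# Usage (path and width are placeholders):
# drifted = load_glove_text("rank_512/vectors_drifted.txt", dim=512)
```

Vectors loaded this way reflect only E + B(A(·)); the contextual FFN contribution still requires a forward pass through the full model.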