duico committed on
Commit
dcae30d
·
verified ·
1 Parent(s): 3f971d2

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +114 -0
README.md ADDED
@@ -0,0 +1,114 @@
+ ---
+ license: apache-2.0
+ base_model: poolside/Laguna-XS.2
+ tags:
+ - code
+ - distillation
+ - moe-to-dense
+ - laguna
+ library_name: transformers
+ pipeline_tag: text-generation
+ ---
+
+ # Laguna-XS.2-dense
+
+ A **≈3B dense** model distilled from **[poolside/Laguna-XS.2](https://huggingface.co/poolside/Laguna-XS.2)**, a 33B Mixture-of-Experts coding model with a ≈3B active path. We replace the MoE feed-forward layers with a single **dense** FFN of the same size as the active path (8 routed + 1 shared expert), turning the ≈3B *active* compute into a genuine ≈3B *dense* model that keeps XS.2's attention.
+
+ > ⚠️ **Research / hackathon artifact, heavily under-trained.** This checkpoint was produced in a time-boxed hackathon with a tiny distillation budget (≈22M assistant tokens). It is **not** production-ready. After switching to **chat-format KD**, however, it produces **coherent, runnable code** and scores **6.7% on HumanEval** (up from 0%); see [Results](#results). It demonstrates the *method* and is a starting point for longer distillation.
+
+ ## Method
+
+ Two stages, both distilling from the frozen fp8 XS.2 teacher:
+
+ 1. **[Stage 1](https://huggingface.co/poolside-laguna-hackathon/laguna-xs2-dense-stage1): per-layer MoE→dense init** (RADLADS-style). Each of the 39 sparse MoE blocks is replaced by a dense SwiGLU FFN (intermediate 4608) and trained *independently, in parallel* to match the teacher MoE block's output (NMSE on the residual contribution). Each FFN is fed the teacher's own hidden states, so errors do not compound during training. Budget: ≈90M tokens. Result: a dense init with held-out perplexity **≈25** (vs teacher **≈4.4**). It is functional but rough, because cross-layer error compounding is left uncorrected by design.
+ 2. **Stage 2: synchronous logit-KD** (this model). The stitched ≈3B dense student is trained **end-to-end** against the fp8 teacher's full-vocab logits (forward-KL). Two variants:
+    - **Raw-text KD** (initial): 50/50 code+general raw text, packed. ≈14M tokens, **KL 2.5 → 1.40**. *This destroyed the instruct behavior* (see [Diagnosis](#diagnosis--what-fixed-it)): ≈0% on HumanEval (0.6% raw completion, 0.0% chat template).
+    - **Chat-format KD** (the fix, [separate repo](https://huggingface.co/poolside-laguna-hackathon/laguna-xs2-dense-stage2-chat)): coding instruction→response examples rendered through the model's **native chat template** (special tokens, EOS-terminated turn), with the **KL loss masked to the assistant tokens** so the student learns to answer *and stop*. ≈22M assistant tokens on Magicoder-Evol-Instruct, one H100, **KL → 0.87**. Recovered coherent code and **6.7% HumanEval**.
+
+ Data: raw-text KD used 50% [DCLM](https://huggingface.co/datasets/mlfoundations/dclm-baseline-1.0) + 50% [StarCoder2/the-stack-v2-train](https://huggingface.co/datasets/bigcode/the-stack-v2-train-smol-ids); chat-format KD used [Magicoder-Evol-Instruct-110K](https://huggingface.co/datasets/ise-uiuc/Magicoder-Evol-Instruct-110K).
+
+ **Stage-1 per-layer NMSE** (all 39 dense FFNs converging against their MoE-block targets):
+
+ ![Stage-1 per-layer NMSE curves](stage1_nmse_curves.png)
+
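The Stage-1 objective can be sketched in a few lines. This assumes NMSE means squared error divided by the energy of the teacher block's output; the README does not give the exact normalization, and `nmse` plus the toy vectors are illustrative names, not the training code.

```python
def nmse(student_out, teacher_out, eps=1e-8):
    """Normalized MSE (assumed definition): ||student - teacher||^2 / ||teacher||^2."""
    err = sum((s - t) ** 2 for s, t in zip(student_out, teacher_out))
    ref = sum(t ** 2 for t in teacher_out)
    return err / (ref + eps)


# Toy residual contribution of one teacher MoE block for a single token.
teacher_block_out = [0.5, -1.0, 2.0, 0.25]

perfect = nmse(teacher_block_out, teacher_block_out)  # student matches the teacher
zero_student = nmse([0.0] * 4, teacher_block_out)     # student outputs nothing

print(perfect, zero_student)
```

Under this definition a student that outputs zeros scores about 1, so values well below 1 indicate that the dense FFN has captured most of the block's contribution.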
+ **Stage-2 KD loss** (forward-KL teacher‖student over training, 2.5 → 1.40, i.e. the raw-text KD run):
+
+ ![Stage-2 KD KL loss](stage2_kl_loss.png)
+
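A minimal pure-Python sketch of the Stage-2 objective: forward KL(teacher ‖ student), averaged over selected positions. The function names and toy logits are hypothetical, and the real training presumably operates on full-vocab tensors. An all-ones mask corresponds to the raw-text variant; masking to assistant positions corresponds to the chat-format variant.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one position's logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def masked_forward_kl(teacher_logits, student_logits, mask):
    """Mean of KL(teacher || student) over positions where mask is truthy."""
    total, count = 0.0, 0
    for t_row, s_row, keep in zip(teacher_logits, student_logits, mask):
        if not keep:
            continue  # masked-out positions (e.g. prompt tokens) contribute nothing
        t_logp = log_softmax(t_row)
        s_logp = log_softmax(s_row)
        total += sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t_logp, s_logp))
        count += 1
    return total / max(count, 1)


# Three positions, vocabulary of three tokens (toy values).
teacher = [[2.0, 0.0, -1.0], [0.0, 1.0, 0.0], [1.0, 1.0, 1.0]]
student = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [3.0, 0.0, 0.0]]
assistant_mask = [0, 1, 1]  # position 0 is treated as prompt

loss = masked_forward_kl(teacher, student, assistant_mask)
print(loss)
```

Because position 0 is masked, the student's mismatch there does not affect the loss; only the two assistant positions are averaged.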
+ ## Architecture / loading note
+
+ The dense FFNs are intermediate 4608 (layers 1–39) and 8192 (layer 0). For a uniform config that loads with the **stock** `modeling_laguna.py`, the 4608 FFNs are **zero-padded to 8192**. This is numerically identical, since `silu(0)·0 = 0`. As a result, the exported checkpoint reports ≈3.8B params (padded); the true model is ≈3.0B.
+
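The padding argument can be checked on a toy SwiGLU FFN. The sketch below is an illustration in plain Python, not the export script: the helper names and the tiny dimensions (hidden 2, intermediate 3 padded to 5) are made up, and it assumes the usual SwiGLU form `down(silu(gate(x)) * up(x))`.

```python
import math


def silu(x):
    return x / (1.0 + math.exp(-x))


def matvec(W, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def swiglu_ffn(x, W_gate, W_up, W_down):
    """W_gate and W_up are [inter][hidden]; W_down is [hidden][inter]."""
    g = matvec(W_gate, x)
    u = matvec(W_up, x)
    h = [silu(a) * b for a, b in zip(g, u)]
    return matvec(W_down, h)


def zero_pad(W_gate, W_up, W_down, new_inter):
    """Append zero rows to gate/up and zero columns to down."""
    hidden = len(W_gate[0])
    extra = new_inter - len(W_gate)
    pad_gate = W_gate + [[0.0] * hidden for _ in range(extra)]
    pad_up = W_up + [[0.0] * hidden for _ in range(extra)]
    pad_down = [row + [0.0] * extra for row in W_down]
    return pad_gate, pad_up, pad_down


W_gate = [[0.1, -0.2], [0.3, 0.4], [-0.5, 0.6]]
W_up = [[0.7, 0.8], [-0.9, 1.0], [1.1, -1.2]]
W_down = [[0.2, -0.1, 0.05], [0.3, 0.25, -0.4]]
x = [1.5, -0.5]

out = swiglu_ffn(x, W_gate, W_up, W_down)
out_padded = swiglu_ffn(x, *zero_pad(W_gate, W_up, W_down, new_inter=5))
print(out, out_padded)
```

The padded channels have gate and up pre-activations of exactly zero, so their hidden value is `silu(0) * 0 = 0` and the zero columns in the down projection add nothing to the output.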
+ ```python
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ # Load the zero-padded export in bf16 on a single GPU.
+ m = AutoModelForCausalLM.from_pretrained(
+     "poolside-laguna-hackathon/laguna-xs2-dense-stage2",
+     trust_remote_code=True, torch_dtype=torch.bfloat16, device_map="cuda")
+ tok = AutoTokenizer.from_pretrained(
+     "poolside-laguna-hackathon/laguna-xs2-dense-stage2", trust_remote_code=True)
+
+ # Raw-completion prompt: the model continues the function body.
+ ids = tok("def add(a, b):\n    ", return_tensors="pt").to("cuda")
+ print(tok.decode(m.generate(**ids, max_new_tokens=64)[0], skip_special_tokens=True))
+ ```
+
+ (`student_last.pt`, the raw training state_dict, is also in this repo.)
+
+ ## Results
+
+ HumanEval pass@1 (greedy, [evalplus](https://github.com/evalplus/evalplus)), base / plus:
+
+ | Model | Params (resident) | PPL | HumanEval (raw completion) | HumanEval (chat template) |
+ |---|---|---|---|---|
+ | Teacher (Laguna XS.2, fp8) | 33B (3B active) | ≈4.4 | n/a | **88.4% / 84.8%** |
+ | [Stage-1 dense](https://huggingface.co/poolside-laguna-hackathon/laguna-xs2-dense-stage1) | ≈3B | ≈25 | 0.0% | 0.0% |
+ | Stage-2 dense, **raw-text** KD | ≈3B | n/a | 0.6% | 0.0% |
+ | [**Stage-2 dense, chat-format KD**](https://huggingface.co/poolside-laguna-hackathon/laguna-xs2-dense-stage2-chat) | ≈3B | n/a | n/a | **6.7% / 6.1%** |
+
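For a sense of scale, the percentages above can be converted back into problem counts. HumanEval has 164 problems; the counts printed below are inferred from the rounded percentages in the table, not reported figures, and `implied_solved` is a hypothetical helper.

```python
HUMANEVAL_PROBLEMS = 164  # size of the HumanEval benchmark


def implied_solved(pass_rate_percent, n=HUMANEVAL_PROBLEMS):
    """Nearest whole number of solved problems for a reported pass@1 percentage."""
    return round(pass_rate_percent / 100 * n)


# Percentages copied from the results table.
reported = {
    "teacher (base)": 88.4,
    "teacher (plus)": 84.8,
    "chat-format KD (base)": 6.7,
    "chat-format KD (plus)": 6.1,
    "raw-text KD (raw completion)": 0.6,
}

for name, pct in reported.items():
    solved = implied_solved(pct)
    print(f"{name}: about {solved}/{HUMANEVAL_PROBLEMS} problems")
```

Read this way, the chat-format checkpoint solves roughly eleven problems under greedy decoding, and the raw-text checkpoint's 0.6% corresponds to a single problem.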
+ > **Headline:** switching the Stage-2 distillation from raw text to the model's **native chat format** took the dense model from **0% → 6.7%** pass@1, its first non-trivial coding ability, on ≈22M assistant tokens. The chat-format checkpoint lives in [its own repo](https://huggingface.co/poolside-laguna-hackathon/laguna-xs2-dense-stage2-chat).
+
+ ### Sample generation (chat-format KD)
+
+ Prompt: *"Write a Python function `is_prime(n)` that returns True if n is prime."* The chat-KD model returns a correct, documented implementation (and **stops**):
+
+ ```python
+ def is_prime(n):
+     """Return True if n is a prime number, False otherwise."""
+     if n < 2:
+         return False
+     for i in range(2, int(n**0.5) + 1):
+         if n % i == 0:
+             return False
+     return True
+ ```
+
+ The raw-text KD model, by contrast, emitted control-token spam (`</think>…</assistant>`) and never produced runnable code, hence its 0% with the chat template.
+
+ ## Diagnosis & what fixed it
+
+ **Symptom:** the raw-text-KD dense model scored **≈0% on HumanEval in both eval formats** (0.6% raw completion, 0.0% chat template), while the teacher scores a normal **88.4%** with its chat template. The harness was therefore sound; the model was genuinely broken.
+
+ **Root cause:** XS.2 is an **instruct/agentic** model (chat template, special tokens, EOS-terminated turns), but we first distilled it on **raw concatenated pretraining text** (DCLM + code, packed). That pushed the student off its instruct distribution, and it never learned to *stop*: generations degenerated (Stage-1: `return n;` repeated; raw-KD Stage-2 in chat mode: control-token spam `</think>…</assistant>`).
+
+ **Fix (applied, and it worked):** distill in the model's **native chat format**: coding instruction→response conversations rendered through `chat_template.jinja`, EOS-terminated, with the **KL loss masked to the assistant tokens**. Same KD loss/loop; only the data + tokenization changed. Result: coherent code and **6.7% HumanEval**, up from 0%. The lesson: *distilling an instruct model requires chat-format, EOS-terminated data; more raw tokens would not have fixed it.*
+
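The assistant-token masking described above can be illustrated with plain lists. This is a sketch under assumptions: the token ids are made up, the prompt is taken to be already rendered through the chat template, and `build_example` and `prediction_mask` are hypothetical helpers rather than the training code.

```python
def build_example(prompt_ids, response_ids, eos_id):
    """Concatenate prompt + response + EOS and mark which tokens are assistant tokens."""
    input_ids = list(prompt_ids) + list(response_ids) + [eos_id]
    # 0 = prompt/template token, 1 = assistant token (EOS included, so stopping is learned).
    token_mask = [0] * len(prompt_ids) + [1] * (len(response_ids) + 1)
    return input_ids, token_mask


def prediction_mask(token_mask):
    """Mask over logit positions: position i predicts token i + 1, so shift by one."""
    return token_mask[1:]


prompt_ids = [101, 7, 8, 102]  # hypothetical ids for the rendered user turn
response_ids = [55, 56, 57]    # hypothetical ids for the assistant answer
EOS_ID = 2                     # hypothetical end-of-turn id

input_ids, token_mask = build_example(prompt_ids, response_ids, EOS_ID)
loss_positions = prediction_mask(token_mask)
print(input_ids)
print(token_mask)
print(loss_positions)
```

Including the EOS token in the mask is what lets the KD loss teach the student where a turn ends; leaving it out would reproduce the never-stops failure mode on the student side.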
+ This release demonstrates three things: the MoE→dense **architecture** works (per-layer init converges, see the NMSE curves); it gives a **≈11× weight-VRAM reduction** (below) at matched active compute; and chat-format KD recovers usable instruct behavior. Closing the remaining gap to the teacher is a matter of **more chat-format KD tokens** (we used ≈22M; recovery budgets are typically 250M–4B).
+
+ ## VRAM / footprint
+
+ The MoE keeps all 33B params resident even though only ≈3B are active per token; the dense model keeps only the ≈3B.
+
+ | Model | Params resident | bf16 weights | fp8 weights |
+ |---|---|---|---|
+ | XS.2 (33B MoE) | 33.4 B | ≈67 GB | ≈34 GB |
+ | **XS.2-dense (stage_2)** | ≈3.0 B | **≈6 GB** | ≈3 GB |
+
+ → **≈11× smaller weight footprint** (≈61 GB saved, bf16). Caveats: attention is unchanged, so **KV-cache memory is identical** to the teacher (all savings are in the weights); per-token **FLOPs are roughly unchanged** (the MoE was already ≈3B-active). The win is **memory/deployability**, not speed. XS.2 needs an 80 GB-class GPU (or fp8 on 48 GB); the dense model fits a 16 GB consumer GPU. _(The exported checkpoint here is zero-padded to ≈3.8B / 7.7 GB for stock-modeling compat; the true model is 3.0B / ≈6 GB.)_
+
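The footprint figures follow from parameter count times bytes per parameter. The short check below uses the parameter counts from the table and assumes decimal gigabytes (1 GB = 10^9 bytes) and weights only, with no KV cache or activations; `weight_gb` is an illustrative helper.

```python
def weight_gb(params_billion, bytes_per_param):
    """Weight memory in decimal GB: (params * 1e9) * bytes / 1e9."""
    return params_billion * bytes_per_param


BF16_BYTES = 2
FP8_BYTES = 1

moe_bf16 = weight_gb(33.4, BF16_BYTES)    # resident MoE weights
dense_bf16 = weight_gb(3.0, BF16_BYTES)   # true dense model
dense_fp8 = weight_gb(3.0, FP8_BYTES)

ratio = moe_bf16 / dense_bf16
saved = moe_bf16 - dense_bf16

print(f"MoE bf16:   {moe_bf16:.1f} GB")
print(f"dense bf16: {dense_bf16:.1f} GB")
print(f"ratio:      {ratio:.1f}x, saved {saved:.1f} GB")
```

The ratio depends only on the parameter counts, so it is the same ≈11× in fp8 as in bf16; only the absolute savings change with precision.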
+ ## Limitations & next steps
+
+ - **Severely under-trained.** ≈22M chat-KD assistant tokens is 10–200× below typical recovery budgets (RADLADS used 250–700M; MoE→dense work ≈4B). 6.7% HumanEval is a proof-of-life, not a usable coder yet.
+ - **Next:** extend chat-format Stage 2 substantially: more Magicoder/teacher-generated conversations, ideally with **cached teacher top-K logits** to remove the teacher forward from the loop (3–5× throughput), reaching 100M+ assistant tokens. Then run the paper-faithful agentic evals (SWE-bench / Terminal-Bench via Harbor).
+
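One possible shape for the cached top-K idea, sketched in plain Python: store the K most likely teacher tokens with their log-probabilities once, then compute a truncated forward KL against the student without rerunning the teacher. This is an assumption about how such a cache could work, not the planned implementation; the helper names are hypothetical, and truncating to K tokens drops the tail of the teacher distribution, so it only approximates the full-vocab loss.

```python
import math


def log_softmax(logits):
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def cache_top_k(teacher_logits, k):
    """Offline step: keep (token_id, teacher_logprob) for the k most likely tokens."""
    logp = log_softmax(teacher_logits)
    top = sorted(range(len(logp)), key=lambda i: logp[i], reverse=True)[:k]
    return [(i, logp[i]) for i in top]


def top_k_forward_kl(cached, student_logits):
    """Training step: KL(teacher || student) restricted to the cached token ids."""
    s_logp = log_softmax(student_logits)
    return sum(math.exp(t_lp) * (t_lp - s_logp[i]) for i, t_lp in cached)


teacher_logits = [4.0, 1.0, 0.5, -2.0, -3.0]  # one position, toy vocabulary of 5
student_logits = [2.0, 2.0, 0.0, 0.0, -1.0]

cached = cache_top_k(teacher_logits, k=2)
approx_kl = top_k_forward_kl(cached, student_logits)
full_kl = top_k_forward_kl(cache_top_k(teacher_logits, k=5), student_logits)
print(cached, approx_kl, full_kl)
```

With K equal to the vocabulary size the truncated loss coincides with the full forward KL, which gives a simple way to sanity-check a cache before shrinking K.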
+ Code: https://github.com/postscarcity-inc/laguna-xs.2-dense · [Stage-1 model](https://huggingface.co/poolside-laguna-hackathon/laguna-xs2-dense-stage1)