---
license: apache-2.0
base_model: poolside/Laguna-XS.2
tags:
- code
- distillation
- moe-to-dense
- laguna
library_name: transformers
pipeline_tag: text-generation
---

# Laguna-XS.2-dense

A **≈3B dense** model distilled from **[poolside/Laguna-XS.2](https://huggingface.co/poolside/Laguna-XS.2)**, a 33B Mixture-of-Experts coding model with a ≈3B active path. We replace the MoE feed-forward layers with a single **dense** FFN of the same size as the active path (8 routed + 1 shared expert), turning the ≈3B *active* compute into a genuine ≈3B *dense* model that keeps XS.2's attention.

> ⚠️ **Research / hackathon artifact: heavily under-trained.** This checkpoint was produced in a time-boxed hackathon with a tiny distillation budget (≈22M assistant tokens). It is **not** production-ready. But after switching to **chat-format KD** it produces **coherent, runnable code** and scores **6.7% on HumanEval** (up from 0%); see [Results](#results). It demonstrates the *method* and is a starting point for longer distillation.

## Method

Two stages, both distilling from the frozen fp8 XS.2 teacher:

1. **[Stage 1](https://huggingface.co/poolside-laguna-hackathon/laguna-xs2-dense-stage1): per-layer MoE→dense init** (RADLADS-style). Each of the 39 sparse MoE blocks is replaced by a dense SwiGLU FFN (intermediate 4608) and trained *independently, in parallel* to match the teacher MoE block's output (NMSE on the residual contribution), fed the teacher's own hidden states (no error compounding). ≈90M tokens. Result: a dense init with held-out perplexity **≈25** (vs teacher **≈4.4**), functional but rough, because cross-layer error compounding is left uncorrected by design.
2. **Stage 2: synchronous logit-KD** (this model). The stitched ≈3B dense student is trained **end-to-end** against the fp8 teacher's full-vocab logits (forward-KL). Two variants:
   - **Raw-text KD** (initial): 50/50 code+general raw text, packed. ≈14M tokens, **KL 2.5 → 1.40**. *This destroyed the instruct behavior* (see [Diagnosis](#diagnosis--what-fixed-it)): 0% on HumanEval.
   - **Chat-format KD** (the fix, [separate repo](https://huggingface.co/poolside-laguna-hackathon/laguna-xs2-dense-stage2-chat)): coding instruction→response examples rendered through the model's **native chat template** (special tokens, EOS-terminated turn), with the **KL loss masked to the assistant tokens** so the student learns to answer *and stop*. ≈22M assistant tokens on Magicoder-Evol-Instruct, one H100, **KL → 0.87**. Recovered coherent code and **6.7% HumanEval**.

Data: raw-text KD used 50% [DCLM](https://huggingface.co/datasets/mlfoundations/dclm-baseline-1.0) + 50% [StarCoder2/the-stack-v2-train](https://huggingface.co/datasets/bigcode/the-stack-v2-train-smol-ids); chat-format KD used [Magicoder-Evol-Instruct-110K](https://huggingface.co/datasets/ise-uiuc/Magicoder-Evol-Instruct-110K).

**Stage-1 per-layer NMSE** (all 39 dense FFNs converging against their MoE-block targets):

![Stage-1 per-layer NMSE curves](stage1_nmse_curves.png)

**Stage-2 KD loss** (forward-KL teacher‖student over training, 2.5 → 1.40):

![Stage-2 KD KL loss](stage2_kl_loss.png)

## Architecture / loading note

The dense FFNs are intermediate 4608 (layers 1–39) and 8192 (layer 0). For a uniform config that loads with the **stock** `modeling_laguna.py`, the 4608 FFNs are **zero-padded to 8192** (numerically identical, since `silu(0)·0 = 0`). So the exported checkpoint reports ≈3.8B params (padded); the true model is ≈3.0B.
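
To see why the zero-padding described above leaves the outputs unchanged, here is a minimal pure-Python sketch of a SwiGLU FFN with toy dimensions (4, 6, and 10 stand in for the hidden size, 4608, and 8192). The helper names `swiglu_ffn` and `pad_ffn` are hypothetical and are not taken from `modeling_laguna.py`; the sketch assumes the standard bias-free SwiGLU form `W_down · (silu(W_gate x) * (W_up x))`.

```python
import math
import random


def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def matvec(w, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def swiglu_ffn(x, w_gate, w_up, w_down):
    """y = W_down @ (silu(W_gate @ x) * (W_up @ x)), assumed bias-free."""
    gate = matvec(w_gate, x)
    up = matvec(w_up, x)
    hidden = [silu(g) * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)


def pad_ffn(w_gate, w_up, w_down, new_inter):
    """Zero-pad the intermediate dimension of all three matrices to new_inter."""
    d_model = len(w_gate[0])
    extra = new_inter - len(w_gate)
    padded_gate = w_gate + [[0.0] * d_model for _ in range(extra)]  # gate output 0
    padded_up = w_up + [[0.0] * d_model for _ in range(extra)]      # up output 0
    padded_down = [row + [0.0] * extra for row in w_down]           # zero columns
    return padded_gate, padded_up, padded_down


random.seed(0)
D_MODEL, INTER, PADDED = 4, 6, 10  # toy stand-ins for hidden size, 4608, 8192

w_gate = [[random.uniform(-1, 1) for _ in range(D_MODEL)] for _ in range(INTER)]
w_up = [[random.uniform(-1, 1) for _ in range(D_MODEL)] for _ in range(INTER)]
w_down = [[random.uniform(-1, 1) for _ in range(INTER)] for _ in range(D_MODEL)]
x = [random.uniform(-1, 1) for _ in range(D_MODEL)]

y_small = swiglu_ffn(x, w_gate, w_up, w_down)
y_padded = swiglu_ffn(x, *pad_ffn(w_gate, w_up, w_down, PADDED))
max_diff = max(abs(a - b) for a, b in zip(y_small, y_padded))
print(f"max |y_small - y_padded| = {max_diff:.3e}")
```

Every padded hidden unit evaluates to `silu(0) · 0 = 0` and multiplies a zero column of `W_down`, so the padded and unpadded outputs agree; the padding only adds parameters that never contribute.
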
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# trust_remote_code=True lets transformers run the modeling code shipped with the repo.
m = AutoModelForCausalLM.from_pretrained(
    "poolside-laguna-hackathon/laguna-xs2-dense-stage2",
    trust_remote_code=True, torch_dtype=torch.bfloat16, device_map="cuda")
tok = AutoTokenizer.from_pretrained(
    "poolside-laguna-hackathon/laguna-xs2-dense-stage2", trust_remote_code=True)

# Complete a raw-text prompt and print the decoded continuation.
ids = tok("def add(a, b):\n    ", return_tensors="pt").to("cuda")
print(tok.decode(m.generate(**ids, max_new_tokens=64)[0], skip_special_tokens=True))
```

(`student_last.pt`, the raw training state_dict, is also in this repo.)

## Results

HumanEval pass@1 (greedy, [evalplus](https://github.com/evalplus/evalplus)), base / plus:

| Model | Params (resident) | PPL | HumanEval (raw completion) | HumanEval (chat template) |
|---|---|---|---|---|
| Teacher (Laguna XS.2, fp8) | 33B (3B active) | ≈4.4 | n/a | **88.4% / 84.8%** |
| [Stage-1 dense](https://huggingface.co/poolside-laguna-hackathon/laguna-xs2-dense-stage1) | ≈3B | ≈25 | 0.0% | 0.0% |
| Stage-2 dense, **raw-text** KD | ≈3B | n/a | 0.6% | 0.0% |
| [**Stage-2 dense, chat-format KD**](https://huggingface.co/poolside-laguna-hackathon/laguna-xs2-dense-stage2-chat) | ≈3B | n/a | n/a | **6.7% / 6.1%** |

> **Headline:** switching the Stage-2 distillation from raw text to the model's **native chat format** took the dense model from **0% → 6.7%** pass@1, its first non-trivial coding ability, on ≈22M assistant tokens. The chat-format checkpoint lives in [its own repo](https://huggingface.co/poolside-laguna-hackathon/laguna-xs2-dense-stage2-chat).

### Sample generation (chat-format KD)

Prompt: *"Write a Python function `is_prime(n)` that returns True if n is prime."* The chat-KD model returns a correct, documented implementation (and **stops**):

```python
def is_prime(n):
    """Return True if n is a prime number, False otherwise."""
    if n < 2:
        return False
    for i in range(2, int(n**0.5) + 1):
        if n % i == 0:
            return False
    return True
```

The raw-text KD model, by contrast, emitted control-token spam (`…`) and never produced runnable code, hence its 0%.

## Diagnosis & what fixed it

**Symptom:** the raw-text-KD dense model scored **≈0% on HumanEval in both eval formats** (0.6% raw completion, 0.0% with the chat template), while the teacher scores a normal **88.4%** with its chat template. So the harness was sound; the model was genuinely broken.

**Root cause:** XS.2 is an **instruct/agentic** model (chat template, special tokens, EOS-terminated turns), but we first distilled it on **raw concatenated pretraining text** (DCLM + code, packed). That pushed the student off its instruct distribution and it never learned to *stop*, so generations degenerated (Stage-1: `return n;` repeated; raw-KD Stage-2 in chat mode: control-token spam `…`).

**Fix (applied, and it worked):** distill in the model's **native chat format**: coding instruction→response conversations rendered through `chat_template.jinja`, EOS-terminated, with the **KL loss masked to the assistant tokens**. Same KD loss/loop; only the data + tokenization changed. Result: coherent code and **6.7% HumanEval**, up from 0%. The lesson: *distilling an instruct model requires chat-format, EOS-terminated data; more raw tokens would not have fixed it.*

What this release demonstrates: that the MoE→dense **architecture** works (per-layer init converges, see the NMSE curves), that it gives an **≈11× weight-VRAM reduction** (see VRAM / footprint below) at matched active compute, and that chat-format KD recovers usable instruct behavior.
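
The fix above keeps the KD loss but restricts it to assistant tokens. A minimal sketch of that masked forward-KL objective, in plain Python on toy logits, might look like the following. The function name `masked_forward_kl`, the list-of-lists layout, and the 0/1 `assistant_mask` are illustrative assumptions, not the project's training code, which presumably operates on batched tensors.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary row."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]


def masked_forward_kl(teacher_logits, student_logits, assistant_mask):
    """Mean KL(teacher || student) over positions where assistant_mask is 1.

    teacher_logits, student_logits: [seq_len][vocab] lists of floats.
    assistant_mask: [seq_len] list of 0/1; 1 marks assistant tokens.
    """
    total, count = 0.0, 0
    for t_row, s_row, keep in zip(teacher_logits, student_logits, assistant_mask):
        if not keep:
            continue  # system / user tokens contribute no loss
        t_logp = log_softmax(t_row)
        s_logp = log_softmax(s_row)
        # Forward KL: expectation under the teacher of (log p_teacher - log p_student).
        total += sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t_logp, s_logp))
        count += 1
    return total / max(count, 1)


# Toy example: 3 positions, vocabulary of 4 (values are made up).
teacher = [[2.0, 0.5, -1.0, 0.0], [0.1, 1.5, 0.3, -0.7], [1.0, 1.0, -2.0, 0.5]]
student = [[-3.0, 4.0, 0.0, 1.0], [0.1, 1.5, 0.3, -0.7], [0.2, 1.4, -1.0, 0.0]]
mask = [0, 1, 1]  # position 0 is prompt; positions 1 and 2 are assistant tokens

loss = masked_forward_kl(teacher, student, mask)
print(f"masked forward-KL = {loss:.4f}")
```

Positions with mask 0 are skipped entirely, so the student is only pulled toward the teacher's distribution where the assistant is speaking. If the mask also covers the EOS token that ends the turn, the student gets a training signal for when to stop.
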
Closing the remaining gap to the teacher is a matter of **more chat-format KD tokens** (we used ≈22M; recovery budgets are typically 250M–4B).

## VRAM / footprint

The MoE keeps all 33B params resident even though only ≈3B are active per token; the dense model keeps only the ≈3B.

| Model | Params resident | bf16 weights | fp8 weights |
|---|---|---|---|
| XS.2 (33B MoE) | 33.4 B | ≈67 GB | ≈34 GB |
| **XS.2-dense (stage_2)** | ≈3.0 B | **≈6 GB** | ≈3 GB |

→ **≈11× smaller weight footprint** (≈61 GB saved, bf16).

Caveats: attention is unchanged, so **KV-cache memory is identical** to the teacher (all savings are in the weights); per-token **FLOPs are ≈unchanged** (the MoE was already ≈3B-active). The win is **memory/deployability**, not speed. XS.2 needs an 80 GB-class GPU (or fp8 on 48 GB); the dense model fits a 16 GB consumer GPU.

_(The exported checkpoint here is zero-padded to ≈3.8B / 7.7 GB for stock-modeling compatibility; the true model is 3.0B / ≈6 GB.)_

## Limitations & next steps

- **Severely under-trained.** ≈22M chat-KD assistant tokens is 10–200× below typical recovery budgets (RADLADS used 250–700M; MoE→dense work ≈4B). 6.7% HumanEval is a proof-of-life, not a usable coder yet.
- **Next:** extend chat-format Stage 2 substantially: more Magicoder/teacher-generated conversations, ideally with **cached teacher top-K logits** to remove the teacher forward from the loop (3–5× throughput), reaching 100M+ assistant tokens. Then run the paper-faithful agentic evals (SWE-bench / Terminal-Bench via Harbor).

Code: https://github.com/postscarcity-inc/laguna-xs.2-dense · [Stage-1 model](https://huggingface.co/poolside-laguna-hackathon/laguna-xs2-dense-stage1)
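
The footprint and budget figures quoted in the VRAM and Limitations sections are back-of-envelope arithmetic: parameters times bytes per parameter, and ratios of token counts. A small sketch that reproduces the bf16 numbers from the values in this card (the helper `weight_gb` is hypothetical, and decimal gigabytes are assumed):

```python
BF16_BYTES = 2  # bytes per parameter in bf16


def weight_gb(params_billion, bytes_per_param):
    """Weight memory in decimal GB for a parameter count given in billions."""
    return params_billion * bytes_per_param


# Weight footprint, using the resident parameter counts from the table.
moe_bf16 = weight_gb(33.4, BF16_BYTES)
dense_bf16 = weight_gb(3.0, BF16_BYTES)
footprint_ratio = moe_bf16 / dense_bf16
saved_gb = moe_bf16 - dense_bf16

# Distillation budget gap, using the token counts from the Limitations section.
kd_tokens = 22e6
budget_low, budget_high = 250e6, 4e9
gap_low = budget_low / kd_tokens
gap_high = budget_high / kd_tokens

print(f"MoE bf16: {moe_bf16:.1f} GB, dense bf16: {dense_bf16:.1f} GB")
print(f"ratio: {footprint_ratio:.1f}x, saved: {saved_gb:.1f} GB")
print(f"budget gap: {gap_low:.0f}x to {gap_high:.0f}x")
```

The sketch covers weights only. KV-cache and activation memory are not included, and, as noted above, the KV cache is the same size as the teacher's.
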