---
license: apache-2.0
base_model: poolside/Laguna-XS.2
tags:
  - code
  - distillation
  - moe-to-dense
  - laguna
library_name: transformers
pipeline_tag: text-generation
---

# Laguna-XS.2-dense

A **≈3B dense** model distilled from **[poolside/Laguna-XS.2](https://huggingface.co/poolside/Laguna-XS.2)** — a 33B Mixture-of-Experts coding model with a ≈3B active path. We replace the MoE feed-forward layers with a single **dense** FFN of the same size as the active path (8 routed + 1 shared expert), turning the ≈3B *active* compute into a genuine ≈3B *dense* model that keeps XS.2's attention.

> ⚠️ **Research / hackathon artifact — heavily under-trained.** This checkpoint was produced in a time-boxed hackathon with a tiny distillation budget (≈22M assistant tokens). It is **not** production-ready. But after switching to **chat-format KD** it produces **coherent, runnable code** and scores **6.7% on HumanEval** (up from 0%) — see [Results](#results). It demonstrates the *method* and is a starting point for longer distillation.

## Method

Two stages, both distilling from the frozen fp8 XS.2 teacher:

1. **[Stage 1](https://huggingface.co/poolside-laguna-hackathon/laguna-xs2-dense-stage1): per-layer MoE→dense init** (RADLADS-style). Each of the 39 sparse MoE blocks is replaced by a dense SwiGLU FFN (intermediate 4608) and trained *independently, in parallel* to match the teacher MoE block's output (NMSE on the residual contribution). Each FFN is fed the teacher's own hidden states, so no upstream student error enters training. ≈90M tokens. Result: a dense init with held-out perplexity **≈25** (vs teacher **≈4.4**). It is functional but rough: once the layers are stitched together, each FFN sees the student's hidden states instead of the teacher's, and the resulting cross-layer error compounding is left uncorrected by design.
2. **Stage 2: synchronous logit-KD** (this model). The stitched ≈3B dense student is trained **end-to-end** against the fp8 teacher's full-vocab logits (forward-KL). Two variants:
   - **Raw-text KD** (initial): 50/50 code+general raw text, packed. ≈14M tokens, **KL 2.5 → 1.40**. *This destroyed the instruct behavior* (see [Diagnosis](#diagnosis--what-fixed-it)): 0.6% on HumanEval as raw completion and 0.0% with the chat template.
   - **Chat-format KD** (the fix, [separate repo](https://huggingface.co/poolside-laguna-hackathon/laguna-xs2-dense-stage2-chat)): coding instruction→response examples rendered through the model's **native chat template** (special tokens, EOS-terminated turn), with the **KL loss masked to the assistant tokens** so the student learns to answer *and stop*. ≈22M assistant tokens on Magicoder-Evol-Instruct, one H100, **KL → 0.87**. Recovered coherent code and **6.7% HumanEval**.

Data: raw-text KD used 50% [DCLM](https://huggingface.co/datasets/mlfoundations/dclm-baseline-1.0) + 50% [StarCoder2/the-stack-v2-train](https://huggingface.co/datasets/bigcode/the-stack-v2-train-smol-ids); chat-format KD used [Magicoder-Evol-Instruct-110K](https://huggingface.co/datasets/ise-uiuc/Magicoder-Evol-Instruct-110K).

**Stage-1 per-layer NMSE** (all 39 dense FFNs converging against their MoE-block targets):

![Stage-1 per-layer NMSE curves](stage1_nmse_curves.png)
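
The per-layer objective can be sketched as follows. This is a minimal illustration, not the training code: `DenseSwiGLU`, `nmse`, the toy sizes, the Adam settings, and the use of a second random SwiGLU as a stand-in for the frozen MoE block are all assumptions, and the exact NMSE normalization used in Stage 1 may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DenseSwiGLU(nn.Module):
    """Dense SwiGLU FFN: down(silu(gate(x)) * up(x))."""

    def __init__(self, hidden, intermediate):
        super().__init__()
        self.gate_proj = nn.Linear(hidden, intermediate, bias=False)
        self.up_proj = nn.Linear(hidden, intermediate, bias=False)
        self.down_proj = nn.Linear(intermediate, hidden, bias=False)

    def forward(self, x):
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))


def nmse(pred, target, eps=1e-8):
    """Squared error normalized by the energy of the target (assumed definition)."""
    return (pred - target).pow(2).sum() / (target.pow(2).sum() + eps)


torch.manual_seed(0)
hidden, intermediate = 16, 32                        # toy sizes, not the real config
teacher_block = DenseSwiGLU(hidden, intermediate)    # stand-in for one frozen MoE block
student_ffn = DenseSwiGLU(hidden, intermediate)      # the dense replacement being trained
optimizer = torch.optim.Adam(student_ffn.parameters(), lr=1e-2)

x = torch.randn(64, hidden)                          # stand-in for teacher hidden states
with torch.no_grad():
    target = teacher_block(x)                        # the block output to match

first = nmse(student_ffn(x), target).item()
for _ in range(200):
    loss = nmse(student_ffn(x), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
last = nmse(student_ffn(x), target).item()
print(f"NMSE before: {first:.3f}, after: {last:.3f}")
```

Because every layer only needs the teacher's input and output for that layer, the 39 fits are independent of each other and can run in parallel.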

**Stage-2 KD loss** (forward KL, teacher‖student, over the raw-text KD run: 2.5 → 1.40):

![Stage-2 KD KL loss](stage2_kl_loss.png)
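
For reference, a forward-KL distillation loss with an assistant-token mask (as used in the chat-format variant) could look like the sketch below. The function name `masked_forward_kl`, the tensor shapes, and the random toy inputs are assumptions for illustration; the actual training loop lives in the linked code repo and may differ.

```python
import torch
import torch.nn.functional as F


def masked_forward_kl(student_logits, teacher_logits, assistant_mask):
    """Mean KL(teacher || student) over positions where assistant_mask == 1.

    student_logits, teacher_logits: [batch, seq, vocab]
    assistant_mask:                 [batch, seq], 1 on assistant tokens, else 0
    """
    s_logp = F.log_softmax(student_logits.float(), dim=-1)
    t_logp = F.log_softmax(teacher_logits.float(), dim=-1)
    # Per-position KL: sum_v p_teacher(v) * (log p_teacher(v) - log p_student(v))
    kl = (t_logp.exp() * (t_logp - s_logp)).sum(dim=-1)
    mask = assistant_mask.to(kl.dtype)
    # Average only over the unmasked (assistant) positions.
    return (kl * mask).sum() / mask.sum().clamp(min=1.0)


torch.manual_seed(0)
teacher = torch.randn(2, 5, 11)   # toy logits: batch=2, seq=5, vocab=11
student = torch.randn(2, 5, 11)
mask = torch.tensor([[0, 0, 1, 1, 1],
                     [0, 1, 1, 0, 0]])

loss = masked_forward_kl(student, teacher, mask)
zero = masked_forward_kl(teacher, teacher, mask)  # identical distributions
print(loss.item(), zero.item())
```

With an all-ones mask this reduces to plain full-sequence logit KD. In a real loop the mask has to be aligned with next-token prediction (logits at position t score token t+1); that shift is left out here.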

## Architecture / loading note

The dense FFNs have intermediate size 4608 (layers 1–39) and 8192 (layer 0). To give a uniform config that loads with the **stock** `modeling_laguna.py`, the 4608 FFNs are **zero-padded to 8192**. The padding is numerically a no-op: each padded channel contributes `silu(0)·0 = 0`. As a result the exported checkpoint reports ≈3.8B params (padded), while the true model is ≈3.0B.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "poolside-laguna-hackathon/laguna-xs2-dense-stage2"

# trust_remote_code=True lets transformers run the modeling code shipped with the repo.
model = AutoModelForCausalLM.from_pretrained(
    repo, trust_remote_code=True, torch_dtype=torch.bfloat16, device_map="cuda")
tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)

# Raw-completion prompt: the model is asked to continue the function body.
ids = tok("def add(a, b):\n    ", return_tensors="pt").to("cuda")
out = model.generate(**ids, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```

(`student_last.pt`, the raw training state_dict, is also in this repo.)
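
The zero-padding described above can be checked on a toy SwiGLU. The sketch below is illustrative only: `swiglu`, `zero_pad_ffn`, and the tiny sizes (12 → 20 standing in for 4608 → 8192) are assumptions, and the weight layout follows `nn.Linear` (`[out_features, in_features]`), which may not match the export script.

```python
import torch
import torch.nn.functional as F


def swiglu(x, w_gate, w_up, w_down):
    """SwiGLU FFN with nn.Linear-style weights: down(silu(gate(x)) * up(x))."""
    return F.linear(F.silu(F.linear(x, w_gate)) * F.linear(x, w_up), w_down)


def zero_pad_ffn(w_gate, w_up, w_down, new_intermediate):
    """Grow the intermediate dimension with zeros without changing the output."""
    extra = new_intermediate - w_gate.shape[0]
    w_gate_p = F.pad(w_gate, (0, 0, 0, extra))  # extra zero rows -> gate output 0
    w_up_p = F.pad(w_up, (0, 0, 0, extra))      # extra zero rows -> up output 0
    w_down_p = F.pad(w_down, (0, extra))        # extra zero columns in the down projection
    return w_gate_p, w_up_p, w_down_p


torch.manual_seed(0)
hidden, small, large = 8, 12, 20
x = torch.randn(4, hidden, dtype=torch.float64)
w_gate = torch.randn(small, hidden, dtype=torch.float64)
w_up = torch.randn(small, hidden, dtype=torch.float64)
w_down = torch.randn(hidden, small, dtype=torch.float64)

y = swiglu(x, w_gate, w_up, w_down)
y_padded = swiglu(x, *zero_pad_ffn(w_gate, w_up, w_down, large))
print(torch.allclose(y, y_padded))
```

Each padded channel has gate and up pre-activations of exactly 0, so it contributes `silu(0)·0 = 0` before the down projection, and the matching zero columns in the down projection make the result independent of those channels in any case.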

## Results

HumanEval pass@1 (greedy decoding, [evalplus](https://github.com/evalplus/evalplus)). Cells written as `a / b` report base (HumanEval) / plus (HumanEval+):

| Model | Params (resident) | PPL | HumanEval (raw completion) | HumanEval (chat template) |
|---|---|---|---|---|
| Teacher (Laguna XS.2, fp8) | 33B (3B active) | ≈4.4 | — | **88.4% / 84.8%** |
| [Stage-1 dense](https://huggingface.co/poolside-laguna-hackathon/laguna-xs2-dense-stage1) | ≈3B | ≈25 | 0.0% | 0.0% |
| Stage-2 dense, **raw-text** KD | ≈3B | — | 0.6% | 0.0% |
| [**Stage-2 dense, chat-format KD**](https://huggingface.co/poolside-laguna-hackathon/laguna-xs2-dense-stage2-chat) | ≈3B | — | — | **6.7% / 6.1%** |

> **Headline:** switching the Stage-2 distillation from raw text to the model's **native chat format** took the dense model from **0% → 6.7%** pass@1 — the first non-trivial coding ability, on ≈22M assistant tokens. The chat-format checkpoint lives in [its own repo](https://huggingface.co/poolside-laguna-hackathon/laguna-xs2-dense-stage2-chat).
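
As a sanity check on the table, the percentages can be mapped back to solved-problem counts. HumanEval and HumanEval+ each contain 164 problems; the counts printed by this sketch are inferred from the rounded percentages and are not figures reported by the evaluation itself. The helper name `solved_counts` is made up for this example.

```python
N_PROBLEMS = 164  # number of problems in HumanEval / HumanEval+


def solved_counts(pass_at_1_percent, n=N_PROBLEMS):
    """All integer counts k for which 100*k/n rounds to the given percentage."""
    return [k for k in range(n + 1) if round(100 * k / n, 1) == pass_at_1_percent]


reported = {
    "teacher, chat template (base)": 88.4,
    "teacher, chat template (plus)": 84.8,
    "raw-text KD, raw completion": 0.6,
    "chat-format KD (base)": 6.7,
    "chat-format KD (plus)": 6.1,
}
counts = {name: solved_counts(pct) for name, pct in reported.items()}
for name, ks in counts.items():
    print(f"{name}: {ks} of {N_PROBLEMS}")
```

Under that reading, the chat-format checkpoint solves 11 of 164 base problems, and the raw-text checkpoint's 0.6% corresponds to a single problem.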

### Sample generation (chat-format KD)

Prompt: *"Write a Python function `is_prime(n)` that returns True if n is prime."* The chat-KD model returns a correct, documented implementation (and **stops**):

```python
def is_prime(n):
    """Return True if n is a prime number, False otherwise."""
    if n < 2:
        return False
    for i in range(2, int(n**0.5) + 1):
        if n % i == 0:
            return False
    return True
```

The raw-text KD model, by contrast, emitted control-token spam (`</think>…</assistant>`) under the chat template and never produced runnable code, hence its 0.0% chat-template score.

## Diagnosis & what fixed it

**Symptom:** the raw-text-KD dense model scored **≈0% on HumanEval in both eval formats** (0.6% raw completion, 0.0% chat template), while the teacher scores a normal **88.4%** with its chat template. The harness was therefore sound and the model was genuinely broken.

**Root cause:** XS.2 is an **instruct/agentic** model (chat template, special tokens, EOS-terminated turns), but we first distilled it on **raw concatenated pretraining text** (DCLM + code, packed). That pushed the student off its instruct distribution and it never learned to *stop* — generations degenerated (Stage-1: `return n;` repeated; raw-KD Stage-2 in chat mode: control-token spam `</think>…</assistant>`).

**Fix (applied, and it worked):** distill in the model's **native chat format** — coding instruction→response conversations rendered through `chat_template.jinja`, EOS-terminated, with the **KL loss masked to the assistant tokens**. Same KD loss/loop; only the data + tokenization changed. Result: coherent code and **6.7% HumanEval**, up from 0%. The lesson: *distilling an instruct model requires chat-format, EOS-terminated data — more raw tokens would not have fixed it.*
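
The masking idea can be sketched without any model: tokenize each turn, concatenate, and mark only the assistant span (including its end-of-turn token) as contributing to the loss. Everything here is hypothetical: `build_assistant_mask`, the token ids, and the `EOS` id are placeholders, and the real pipeline renders turns through `chat_template.jinja` rather than taking pre-split segments.

```python
def build_assistant_mask(segments):
    """Concatenate per-turn token ids and mark assistant tokens with 1.

    segments: list of (role, token_ids) pairs in conversation order.
    Returns (input_ids, loss_mask), two lists of equal length.
    """
    input_ids, loss_mask = [], []
    for role, token_ids in segments:
        input_ids.extend(token_ids)
        loss_mask.extend([int(role == "assistant")] * len(token_ids))
    return input_ids, loss_mask


EOS = 2  # hypothetical end-of-turn token id
segments = [
    ("user", [101, 7, 8, 9]),                # prompt tokens: no loss
    ("assistant", [102, 20, 21, 22, EOS]),   # response tokens plus EOS: loss applied
]
input_ids, loss_mask = build_assistant_mask(segments)
print(loss_mask)  # [0, 0, 0, 0, 1, 1, 1, 1, 1]
```

Keeping the end-of-turn token inside the masked-in span is what gives the student a training signal for stopping.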

This release demonstrates three things: the MoE→dense **architecture** works (per-layer init converges, see the NMSE curves); it gives an **≈11× weight-VRAM reduction** (below) at matched active compute; and chat-format KD recovers usable instruct behavior. Closing the remaining gap to the teacher is most likely a matter of **more chat-format KD tokens** (we used ≈22M; recovery budgets are typically 250M–4B).

## VRAM / footprint

The MoE keeps all 33B params resident even though only ≈3B are active per token; the dense model keeps only the ≈3B.

| Model | Params resident | bf16 weights | fp8 weights |
|---|---|---|---|
| XS.2 (33B MoE) | 33.4 B | ≈67 GB | ≈34 GB |
| **XS.2-dense (stage_2)** | ≈3.0 B | **≈6 GB** | ≈3 GB |

→ **≈11× smaller weight footprint** (≈61 GB saved, bf16). Caveats: attention is unchanged, so **KV-cache memory is identical** to the teacher (all savings are in the weights); per-token **FLOPs are ≈unchanged** (the MoE was already ≈3B-active) — the win is **memory/deployability**, not speed. XS.2 needs an 80 GB-class GPU (or fp8 on 48 GB); the dense model fits a 16 GB consumer GPU. _(The exported checkpoint here is zero-padded to ≈3.8B / 7.7 GB for stock-modeling compat; the true model is 3.0B / ≈6 GB.)_
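
The weight figures in the table follow directly from parameter count times bytes per parameter. The short calculation below reproduces them, assuming decimal gigabytes (1 GB = 10⁹ bytes), 2 bytes per bf16 weight, and 1 byte per fp8 weight; `weight_gb` is a helper defined only for this example.

```python
def weight_gb(params_billion, bytes_per_param):
    """Weight memory in decimal GB: (params_billion * 1e9 * bytes) / 1e9."""
    return params_billion * bytes_per_param


MOE_PARAMS_B = 33.4    # XS.2, all experts resident
DENSE_PARAMS_B = 3.0   # XS.2-dense, unpadded

bf16 = {"moe": weight_gb(MOE_PARAMS_B, 2), "dense": weight_gb(DENSE_PARAMS_B, 2)}
fp8 = {"moe": weight_gb(MOE_PARAMS_B, 1), "dense": weight_gb(DENSE_PARAMS_B, 1)}
ratio = MOE_PARAMS_B / DENSE_PARAMS_B
saved_bf16 = bf16["moe"] - bf16["dense"]

print(f"bf16: {bf16['moe']:.1f} GB vs {bf16['dense']:.1f} GB")
print(f"fp8:  {fp8['moe']:.1f} GB vs {fp8['dense']:.1f} GB")
print(f"ratio: {ratio:.1f}x, saved (bf16): {saved_bf16:.1f} GB")
```

These numbers cover weights only; KV cache and activations come on top and, as noted above, the KV cache is the same size for both models.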

## Limitations & next steps

- **Severely under-trained.** ≈22M chat-KD assistant tokens is 10–200× below typical recovery budgets (RADLADS used 250–700M; MoE→dense work ≈4B). 6.7% HumanEval is a proof-of-life, not a usable coder yet.
- **Next:** extend chat-format Stage 2 substantially — more Magicoder/teacher-generated conversations, ideally with **cached teacher top-K logits** to remove the teacher forward from the loop (3–5× throughput), reaching 100M+ assistant tokens. Then run the paper-faithful agentic evals (SWE-bench / Terminal-Bench via Harbor).
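
One possible shape for the cached top-K idea is sketched below. It is an assumption about how such a cache could work, not a description of existing code: `cache_topk` and `topk_forward_kl` are hypothetical names, the teacher distribution is renormalized over its top-K tokens (an approximation to the full-vocab KL), and the toy sizes are arbitrary.

```python
import torch
import torch.nn.functional as F


def cache_topk(teacher_logits, k):
    """Keep only the k largest teacher logits per position (values and vocab indices)."""
    values, indices = teacher_logits.topk(k, dim=-1)
    return values, indices


def topk_forward_kl(student_logits, topk_values, topk_indices):
    """KL(teacher renormalized over its top-k || student over the full vocab)."""
    t_logp = F.log_softmax(topk_values.float(), dim=-1)
    # Student log-probs are computed over the full vocab, then read at the cached indices.
    s_logp = F.log_softmax(student_logits.float(), dim=-1).gather(-1, topk_indices)
    return (t_logp.exp() * (t_logp - s_logp)).sum(dim=-1).mean()


torch.manual_seed(0)
vocab = 50
teacher_logits = torch.randn(3, 7, vocab)   # would be computed once, offline
student_logits = torch.randn(3, 7, vocab)

values, indices = cache_topk(teacher_logits, k=8)
approx_kl = topk_forward_kl(student_logits, values, indices)
print(values.shape, indices.shape, approx_kl.item())
```

Storing `k` values and `k` indices per position instead of a full-vocab row is what makes the cache practical, and with `k` equal to the vocabulary size the loss reduces to the exact forward KL.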

Code: https://github.com/postscarcity-inc/laguna-xs.2-dense · [Stage-1 model](https://huggingface.co/poolside-laguna-hackathon/laguna-xs2-dense-stage1)