Upload README.md with huggingface_hub
README.md
ADDED
@@ -0,0 +1,114 @@
---
license: apache-2.0
base_model: poolside/Laguna-XS.2
tags:
- code
- distillation
- moe-to-dense
- laguna
library_name: transformers
pipeline_tag: text-generation
---

# Laguna-XS.2-dense

A **≈3B dense** model distilled from **[poolside/Laguna-XS.2](https://huggingface.co/poolside/Laguna-XS.2)**, a 33B Mixture-of-Experts coding model with a ≈3B active path. We replace each MoE feed-forward layer with a single **dense** FFN of the same size as the active path (8 routed + 1 shared expert), turning the ≈3B *active* compute into a genuine ≈3B *dense* model that keeps XS.2's attention.

> ⚠️ **Research / hackathon artifact, heavily under-trained.** This checkpoint was produced in a time-boxed hackathon with a tiny distillation budget (≈22M assistant tokens). It is **not** production-ready. After switching to **chat-format KD**, however, it produces **coherent, runnable code** and scores **6.7% on HumanEval** (up from 0%); see [Results](#results). It demonstrates the *method* and is a starting point for longer distillation.

## Method

Two stages, both distilling from the frozen fp8 XS.2 teacher:

1. **[Stage 1](https://huggingface.co/poolside-laguna-hackathon/laguna-xs2-dense-stage1): per-layer MoE→dense init** (RADLADS-style). Each of the 39 sparse MoE blocks is replaced by a dense SwiGLU FFN (intermediate size 4608) and trained *independently, in parallel* to match the teacher MoE block's output (NMSE on the residual contribution). During training each FFN is fed the teacher's own hidden states, so errors do not compound across layers. ≈90M tokens. Result: a dense init with held-out perplexity **≈25** (vs teacher **≈4.4**). It is functional but rough: once the layers are stitched together, each one sees the student's hidden states instead, and that cross-layer error compounding is left uncorrected by design.
2. **Stage 2: synchronous logit-KD** (this model). The stitched ≈3B dense student is trained **end-to-end** against the fp8 teacher's full-vocab logits (forward-KL). Two variants:
   - **Raw-text KD** (initial): 50/50 code + general raw text, packed. ≈14M tokens, **KL 2.5 → 1.40**. *This destroyed the instruct behavior* (see [Diagnosis](#diagnosis--what-fixed-it)): 0.0% on HumanEval with the chat template (0.6% as raw completion).
   - **Chat-format KD** (the fix, [separate repo](https://huggingface.co/poolside-laguna-hackathon/laguna-xs2-dense-stage2-chat)): coding instruction→response examples rendered through the model's **native chat template** (special tokens, EOS-terminated turn), with the **KL loss masked to the assistant tokens** so the student learns to answer *and stop*. ≈22M assistant tokens on Magicoder-Evol-Instruct, one H100, **KL → 0.87**. Recovered coherent code and **6.7% HumanEval**.
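
The Stage-2 objective described above (forward-KL against the teacher's full-vocab logits, restricted to assistant tokens) can be sketched as follows. This is a minimal, dependency-free illustration and not the training code from the repo: the function names, the toy logits, and the list-based layout are assumptions made for the example, and a real implementation would operate on batched tensors.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary row."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def masked_forward_kl(teacher_logits, student_logits, assistant_mask):
    """Mean KL(teacher || student) over positions where assistant_mask is 1."""
    total, count = 0.0, 0
    for t_row, s_row, keep in zip(teacher_logits, student_logits, assistant_mask):
        if not keep:
            continue  # prompt tokens contribute no loss
        t_logp = log_softmax(t_row)
        s_logp = log_softmax(s_row)
        total += sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t_logp, s_logp))
        count += 1
    return total / max(count, 1)

# Toy example (hypothetical numbers): 3 positions, vocabulary of 4.
# Only the last two positions are assistant tokens.
teacher = [[2.0, 0.0, 0.0, 0.0], [0.0, 3.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]
student = [[0.0, 0.0, 0.0, 5.0], [0.0, 3.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
mask = [0, 1, 1]
loss = masked_forward_kl(teacher, student, mask)
print(f"masked forward-KL: {loss:.4f}")
```

The student's large disagreement at position 0 does not affect the loss, because that position is masked out; only the assistant positions are averaged.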

Data: raw-text KD used 50% [DCLM](https://huggingface.co/datasets/mlfoundations/dclm-baseline-1.0) + 50% [StarCoder2/the-stack-v2-train](https://huggingface.co/datasets/bigcode/the-stack-v2-train-smol-ids); chat-format KD used [Magicoder-Evol-Instruct-110K](https://huggingface.co/datasets/ise-uiuc/Magicoder-Evol-Instruct-110K).

**Stage-1 per-layer NMSE** (all 39 dense FFNs converging against their MoE-block targets):

![Stage-1 per-layer NMSE](stage1_loss_curves.png)

**Stage-2 KD loss** (forward-KL teacher‖student over the raw-text run, 2.5 → 1.40):

![Stage-2 KD loss](stage2_kd_loss.png)

## Architecture / loading note

The dense FFNs have intermediate size 4608 (layers 1–39) and 8192 (layer 0). For a uniform config that loads with the **stock** `modeling_laguna.py`, the 4608-wide FFNs are **zero-padded to 8192**. This is numerically identical, because every padded unit contributes `silu(0)·0 = 0`. The exported checkpoint therefore reports ≈3.8B params (padded); the true model is ≈3.0B.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# trust_remote_code=True lets transformers load the Laguna modeling code from the Hub.
m = AutoModelForCausalLM.from_pretrained(
    "poolside-laguna-hackathon/laguna-xs2-dense-stage2",
    trust_remote_code=True, torch_dtype=torch.bfloat16, device_map="cuda")
tok = AutoTokenizer.from_pretrained(
    "poolside-laguna-hackathon/laguna-xs2-dense-stage2", trust_remote_code=True)

# Raw-completion prompt: the model continues the function body.
ids = tok("def add(a, b):\n    ", return_tensors="pt").to("cuda")
print(tok.decode(m.generate(**ids, max_new_tokens=64)[0], skip_special_tokens=True))
```

(`student_last.pt`, the raw training state_dict, is also in this repo.)
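
The zero-padding claim above can be checked on a toy SwiGLU FFN. The sketch below is illustrative only: the helper names and the tiny sizes (intermediate 3 padded to 5, standing in for 4608 → 8192) are assumptions, and the weights are made-up numbers rather than anything from the checkpoint.

```python
import math

def silu(x):
    return x / (1.0 + math.exp(-x))

def matvec(w, x):
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def swiglu_ffn(x, w_gate, w_up, w_down):
    """down( silu(gate(x)) * up(x) ), with weights stored as row-major lists."""
    hidden = [silu(g) * u for g, u in zip(matvec(w_gate, x), matvec(w_up, x))]
    return matvec(w_down, hidden)

def zero_pad(w_gate, w_up, w_down, new_inter):
    """Grow the intermediate size by appending all-zero hidden units."""
    d_model = len(w_gate[0])
    extra = new_inter - len(w_gate)
    pad_rows = [[0.0] * d_model for _ in range(extra)]
    return (w_gate + pad_rows,
            w_up + [row[:] for row in pad_rows],
            [row + [0.0] * extra for row in w_down])

# Toy sizes: d_model=2, intermediate 3 padded to 5.
x = [0.7, -1.2]
w_gate = [[0.1, 0.2], [-0.3, 0.4], [0.5, -0.6]]
w_up = [[0.3, -0.1], [0.2, 0.2], [-0.4, 0.1]]
w_down = [[0.5, -0.2, 0.1], [0.3, 0.3, -0.7]]

original = swiglu_ffn(x, w_gate, w_up, w_down)
padded = swiglu_ffn(x, *zero_pad(w_gate, w_up, w_down, 5))
print(original, padded)
```

Each padded hidden unit evaluates to `silu(0) * 0 = 0` and is multiplied by a zero column in the down projection, so the two outputs match while the parameter count grows.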

## Results

HumanEval pass@1 (greedy decoding, scored with [evalplus](https://github.com/evalplus/evalplus)). Cells with two values report base / plus.

| Model | Params (resident) | PPL | HumanEval (raw completion) | HumanEval (chat template) |
|---|---|---|---|---|
| Teacher (Laguna XS.2, fp8) | 33B (3B active) | ≈4.4 | n/a | **88.4% / 84.8%** |
| [Stage-1 dense](https://huggingface.co/poolside-laguna-hackathon/laguna-xs2-dense-stage1) | ≈3B | ≈25 | 0.0% | 0.0% |
| Stage-2 dense, **raw-text** KD | ≈3B | n/a | 0.6% | 0.0% |
| [**Stage-2 dense, chat-format KD**](https://huggingface.co/poolside-laguna-hackathon/laguna-xs2-dense-stage2-chat) | ≈3B | n/a | n/a | **6.7% / 6.1%** |

> **Headline:** switching the Stage-2 distillation from raw text to the model's **native chat format** took the dense model from **0% → 6.7%** pass@1, its first non-trivial coding ability, on ≈22M assistant tokens. The chat-format checkpoint lives in [its own repo](https://huggingface.co/poolside-laguna-hackathon/laguna-xs2-dense-stage2-chat).
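
To put the percentages above in absolute terms, the sketch below converts each reported pass@1 into the nearest number of solved problems, using the fact that HumanEval contains 164 problems. The counts are inferred from the rounded percentages and are not reported in this card; the helper name is made up for the example.

```python
HUMANEVAL_PROBLEMS = 164  # size of the HumanEval benchmark

def solved_count(pass_at_1_percent, n_problems=HUMANEVAL_PROBLEMS):
    """Nearest whole number of solved problems for a reported pass@1 percentage."""
    return round(pass_at_1_percent / 100 * n_problems)

reported = {
    "teacher base": 88.4,
    "teacher plus": 84.8,
    "raw-text KD (raw completion)": 0.6,
    "chat-format KD base": 6.7,
    "chat-format KD plus": 6.1,
}
for name, pct in reported.items():
    k = solved_count(pct)
    print(f"{name}: ~{k}/{HUMANEVAL_PROBLEMS} problems "
          f"({100 * k / HUMANEVAL_PROBLEMS:.1f}%)")
```

Read this way, the chat-format model solves roughly 11 problems with greedy decoding and the raw-text model roughly 1, which matches the "proof-of-life" framing used elsewhere in this card.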

### Sample generation (chat-format KD)

Prompt: *"Write a Python function `is_prime(n)` that returns True if n is prime."* The chat-KD model returns a correct, documented implementation (and **stops**):

```python
def is_prime(n):
    """Return True if n is a prime number, False otherwise."""
    if n < 2:
        return False
    for i in range(2, int(n**0.5) + 1):
        if n % i == 0:
            return False
    return True
```

The raw-text KD model, by contrast, emitted control-token spam (`</think>…</assistant>`) and never produced runnable code, hence its 0%.

## Diagnosis & what fixed it

**Symptom:** the raw-text-KD dense model scored **≈0% on HumanEval in both eval formats** (0.6% raw completion, 0.0% chat template), while the teacher scores a normal **88.4%** with its chat template. The harness was therefore sound; the model was genuinely broken.

**Root cause:** XS.2 is an **instruct/agentic** model (chat template, special tokens, EOS-terminated turns), but we first distilled it on **raw concatenated pretraining text** (DCLM + code, packed). That pushed the student off its instruct distribution and it never learned to *stop*, so generations degenerated (Stage-1: `return n;` repeated; raw-KD Stage-2 in chat mode: control-token spam `</think>…</assistant>`).

**Fix (applied, and it worked):** distill in the model's **native chat format**: coding instruction→response conversations rendered through `chat_template.jinja`, EOS-terminated, with the **KL loss masked to the assistant tokens**. Same KD loss/loop; only the data + tokenization changed. Result: coherent code and **6.7% HumanEval**, up from 0%. The lesson: *distilling an instruct model requires chat-format, EOS-terminated data; more raw tokens would not have fixed it.*
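
The data-side change described in the fix (render conversations in chat format, terminate turns with EOS, and supervise only assistant tokens) can be sketched like this. Everything here is a stand-in: the role markers and EOS string are hypothetical placeholders for whatever `chat_template.jinja` actually emits, and whitespace splitting replaces the real tokenizer.

```python
# Hypothetical role markers and EOS; the real ones come from chat_template.jinja.
USER, ASSISTANT, EOS = "<user>", "<assistant>", "<eos>"

def render_with_mask(messages):
    """Return (tokens, mask); mask is 1 on assistant content and its closing EOS."""
    tokens, mask = [], []
    for msg in messages:
        is_assistant = msg["role"] == "assistant"
        tokens.append(ASSISTANT if is_assistant else USER)
        mask.append(0)                      # role markers are never a target
        words = msg["content"].split()      # whitespace split stands in for a tokenizer
        tokens.extend(words)
        mask.extend([1 if is_assistant else 0] * len(words))
        tokens.append(EOS)
        mask.append(1 if is_assistant else 0)  # supervising EOS teaches the model to stop
    return tokens, mask

messages = [
    {"role": "user", "content": "Write is_prime"},
    {"role": "assistant", "content": "def is_prime(n): ..."},
]
tokens, mask = render_with_mask(messages)
for token, keep in zip(tokens, mask):
    print(keep, token)
```

Including the assistant's closing EOS in the mask is the part that matters for the failure mode above: with packed raw text the student never sees a supervised end-of-turn token.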

What this release demonstrates: the MoE→dense **architecture** works (per-layer init converges, see the NMSE curves), it delivers an **≈11× weight-VRAM reduction** (below) at matched active compute, and chat-format KD recovers usable instruct behavior. Closing the remaining gap to the teacher is a matter of **more chat-format KD tokens** (we used ≈22M; recovery budgets are typically 250M–4B).

## VRAM / footprint

The MoE keeps all 33B params resident even though only ≈3B are active per token; the dense model keeps only the ≈3B.

| Model | Params resident | bf16 weights | fp8 weights |
|---|---|---|---|
| XS.2 (33B MoE) | 33.4 B | ≈67 GB | ≈34 GB |
| **XS.2-dense (stage_2)** | ≈3.0 B | **≈6 GB** | ≈3 GB |

→ **≈11× smaller weight footprint** (≈61 GB saved, bf16). Caveats: attention is unchanged, so **KV-cache memory is identical** to the teacher (all savings are in the weights); per-token **FLOPs are ≈unchanged** (the MoE was already ≈3B-active). The win is **memory/deployability**, not speed. XS.2 needs an 80 GB-class GPU (or fp8 on 48 GB); the dense model fits a 16 GB consumer GPU. _(The exported checkpoint here is zero-padded to ≈3.8B / 7.7 GB for stock-modeling compat; the true model is 3.0B / ≈6 GB.)_
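
The footprint figures above follow from parameter count times bytes per parameter. The short calculation below reproduces them from the rounded parameter counts in the table; small differences from the quoted numbers (for example 7.6 vs 7.7 GB for the padded export) come from that rounding. The helper name is made up for the example.

```python
GB = 1e9  # decimal gigabytes, which is what the rounded table figures correspond to

def weight_gb(params, bytes_per_param):
    """Weight memory in GB for a given parameter count and precision."""
    return params * bytes_per_param / GB

moe_params, dense_params, padded_params = 33.4e9, 3.0e9, 3.8e9
moe_bf16 = weight_gb(moe_params, 2)      # bf16 stores 2 bytes per parameter
dense_bf16 = weight_gb(dense_params, 2)

print(f"MoE bf16:    {moe_bf16:.1f} GB")
print(f"dense bf16:  {dense_bf16:.1f} GB")
print(f"ratio:       {moe_bf16 / dense_bf16:.1f}x")
print(f"saved:       {moe_bf16 - dense_bf16:.1f} GB")
print(f"padded bf16: {weight_gb(padded_params, 2):.1f} GB")
```

Only weights are counted here; KV cache and activations come on top and, as noted above, the KV cache is the same size for both models.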

## Limitations & next steps

- **Severely under-trained.** ≈22M chat-KD assistant tokens is 10–200× below typical recovery budgets (RADLADS used 250–700M; MoE→dense work ≈4B). 6.7% HumanEval is a proof-of-life, not a usable coder yet.
- **Next:** extend chat-format Stage 2 substantially: more Magicoder/teacher-generated conversations, ideally with **cached teacher top-K logits** to remove the teacher forward from the loop (3–5× throughput), reaching 100M+ assistant tokens. Then run the paper-faithful agentic evals (SWE-bench / Terminal-Bench via Harbor).
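
One way the cached top-K idea from the list above could look is sketched below. This is not an implemented part of the project: the function names, the choice to renormalize both distributions over the cached token ids, and the toy logits are all assumptions, and other truncation schemes are possible.

```python
import math

def top_k_cache(teacher_logits, k):
    """Keep only the k largest teacher logits for one position as (token_id, logit)."""
    ranked = sorted(enumerate(teacher_logits), key=lambda p: p[1], reverse=True)
    return ranked[:k]

def log_softmax(values):
    m = max(values)
    log_z = m + math.log(sum(math.exp(v - m) for v in values))
    return [v - log_z for v in values]

def top_k_forward_kl(cached, student_logits):
    """KL(teacher || student), both renormalized over the cached token ids."""
    ids = [i for i, _ in cached]
    t_logp = log_softmax([logit for _, logit in cached])
    s_logp = log_softmax([student_logits[i] for i in ids])
    return sum(math.exp(t) * (t - s) for t, s in zip(t_logp, s_logp))

# Hypothetical logits for one position over a 6-token vocabulary.
teacher_row = [4.0, 1.0, 3.5, -2.0, 0.0, -1.0]
student_row = [3.0, 1.5, 3.0, -1.0, 0.5, -0.5]

cached = top_k_cache(teacher_row, k=3)  # computed once, then stored alongside the data
print(cached)
print(top_k_forward_kl(cached, student_row))
```

With the cache on disk, the training loop needs only the student forward pass, at the cost of ignoring teacher probability mass outside the top K.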

Code: https://github.com/postscarcity-inc/laguna-xs.2-dense · [Stage-1 model](https://huggingface.co/poolside-laguna-hackathon/laguna-xs2-dense-stage1)