Instructions to use poolside-laguna-hackathon/laguna-xs2-dense-stage2-chat with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use poolside-laguna-hackathon/laguna-xs2-dense-stage2-chat with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="poolside-laguna-hackathon/laguna-xs2-dense-stage2-chat", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("poolside-laguna-hackathon/laguna-xs2-dense-stage2-chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("poolside-laguna-hackathon/laguna-xs2-dense-stage2-chat", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use poolside-laguna-hackathon/laguna-xs2-dense-stage2-chat with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "poolside-laguna-hackathon/laguna-xs2-dense-stage2-chat"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "poolside-laguna-hackathon/laguna-xs2-dense-stage2-chat",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
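
The same request can also be sent from Python. This is a minimal sketch that assumes only the standard library and the OpenAI-compatible endpoint shown above; the helper names `build_chat_request` and `send_chat_request` are illustrative and not part of vLLM.

```python
import json
import urllib.request


def build_chat_request(prompt, model, base_url="http://localhost:8000"):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request, timeout=60):
    """Send the request and return the first reply's text.

    Requires a server that is already running at the request's URL.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Example (hypothetical usage, needs the server started above):
# req = build_chat_request(
#     "What is the capital of France?",
#     model="poolside-laguna-hackathon/laguna-xs2-dense-stage2-chat",
# )
# print(send_chat_request(req))
```

Passing `base_url="http://localhost:30000"` should target a server listening on port 30000 instead, since the request body is the same.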

Use Docker

docker model run hf.co/poolside-laguna-hackathon/laguna-xs2-dense-stage2-chat

SGLang

How to use poolside-laguna-hackathon/laguna-xs2-dense-stage2-chat with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "poolside-laguna-hackathon/laguna-xs2-dense-stage2-chat" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "poolside-laguna-hackathon/laguna-xs2-dense-stage2-chat",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "poolside-laguna-hackathon/laguna-xs2-dense-stage2-chat" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "poolside-laguna-hackathon/laguna-xs2-dense-stage2-chat",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use poolside-laguna-hackathon/laguna-xs2-dense-stage2-chat with Docker Model Runner:
```
docker model run hf.co/poolside-laguna-hackathon/laguna-xs2-dense-stage2-chat
```

duico committed 23 days ago

Commit dcae30d (verified) · 1 Parent(s): 3f971d2

Upload README.md with huggingface_hub

Files changed (1)

README.md +114 -0

README.md ADDED

	@@ -0,0 +1,114 @@

+---
+license: apache-2.0
+base_model: poolside/Laguna-XS.2
+tags:
+  - code
+  - distillation
+  - moe-to-dense
+  - laguna
+library_name: transformers
+pipeline_tag: text-generation
+---
+# Laguna-XS.2-dense
+A **≈3B dense** model distilled from **[poolside/Laguna-XS.2](https://huggingface.co/poolside/Laguna-XS.2)** — a 33B Mixture-of-Experts coding model with a ≈3B active path. We replace the MoE feed-forward layers with a single **dense** FFN of the same size as the active path (8 routed + 1 shared expert), turning the ≈3B *active* compute into a genuine ≈3B *dense* model that keeps XS.2's attention.
+> ⚠️ **Research / hackathon artifact, heavily under-trained.** This checkpoint was produced in a time-boxed hackathon with a tiny distillation budget (≈22M assistant tokens) and is **not** production-ready. After switching to **chat-format KD**, however, it produces **coherent, runnable code** and scores **6.7% on HumanEval** pass@1 (up from 0%); see [Results](#results). It demonstrates the *method* and is a starting point for longer distillation.
+## Method
+Two stages, both distilling from the frozen fp8 XS.2 teacher:
+1. **[Stage 1](https://huggingface.co/poolside-laguna-hackathon/laguna-xs2-dense-stage1) — per-layer MoE→dense init** (RADLADS-style). Each of the 39 sparse MoE blocks is replaced by a dense SwiGLU FFN (intermediate 4608) and trained *independently, in parallel* to match the teacher MoE block's output (NMSE on the residual contribution), fed the teacher's own hidden states (no error compounding). ≈90M tokens. Result: a dense init with held-out perplexity **≈25** (vs teacher **≈4.4**) — functional but rough, because cross-layer error compounding is left uncorrected by design.
+2. **Stage 2 — synchronous logit-KD** (this model). The stitched ≈3B dense student is trained **end-to-end** against the fp8 teacher's full-vocab logits (forward-KL). Two variants:
+   - **Raw-text KD** (initial): 50/50 code+general raw text, packed. ≈14M tokens, **KL 2.5 → 1.40**. *This destroyed the instruct behavior* (see [Diagnosis](#diagnosis--what-fixed-it)) — 0% on HumanEval.
+   - **Chat-format KD** (the fix, [separate repo](https://huggingface.co/poolside-laguna-hackathon/laguna-xs2-dense-stage2-chat)): coding instruction→response examples rendered through the model's **native chat template** (special tokens, EOS-terminated turn), with the **KL loss masked to the assistant tokens** so the student learns to answer *and stop*. ≈22M assistant tokens on Magicoder-Evol-Instruct, one H100, **KL → 0.87**. Recovered coherent code and **6.7% HumanEval**.
+Data: raw-text KD used 50% [DCLM](https://huggingface.co/datasets/mlfoundations/dclm-baseline-1.0) + 50% [StarCoder2/the-stack-v2-train](https://huggingface.co/datasets/bigcode/the-stack-v2-train-smol-ids); chat-format KD used [Magicoder-Evol-Instruct-110K](https://huggingface.co/datasets/ise-uiuc/Magicoder-Evol-Instruct-110K).
+**Stage-1 per-layer NMSE** (all 39 dense FFNs converging against their MoE-block targets):
+![Stage-1 per-layer NMSE curves](stage1_nmse_curves.png)
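+One common way to write the per-layer objective plotted above is sketched here. The exact normalisation used in training is not stated in this README, so the form below (squared error divided by the target's squared norm) and the symbols $h_\ell$, $f^{\text{MoE}}_\ell$ and $f^{\text{dense}}_\ell$ are assumptions for illustration:
+$$
+\mathrm{NMSE}_\ell = \frac{\mathbb{E}_{h_\ell}\left[\lVert f^{\text{dense}}_\ell(h_\ell) - f^{\text{MoE}}_\ell(h_\ell)\rVert_2^2\right]}{\mathbb{E}_{h_\ell}\left[\lVert f^{\text{MoE}}_\ell(h_\ell)\rVert_2^2\right]}
+$$
+Here $h_\ell$ stands for the teacher's hidden state entering block $\ell$, so each layer's target does not depend on any other student layer.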
+**Stage-2 KD loss** (forward-KL teacher‖student over training, 2.5 → 1.40):
+![Stage-2 KD KL loss](stage2_kl_loss.png)
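+The Stage-2 objective can be written roughly as follows. This sketch assumes the loss is averaged over the unmasked positions; the averaging and the mask symbol $m_t$ are assumptions, not taken from the training code:
+$$
+\mathcal{L}_{\text{KD}} = \frac{1}{\sum_t m_t} \sum_t m_t \sum_{v \in V} p_T(v \mid x_{<t}) \log \frac{p_T(v \mid x_{<t})}{p_S(v \mid x_{<t})}
+$$
+where $p_T$ and $p_S$ are the teacher and student next-token distributions over the full vocabulary $V$. Under this reading, raw-text KD would count every position ($m_t = 1$), while chat-format KD sets $m_t = 1$ only on assistant tokens.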
+## Architecture / loading note
+The dense FFNs have intermediate size 4608 in layers 1–39 and 8192 in layer 0. To get a uniform config that loads with the **stock** `modeling_laguna.py`, the 4608-wide FFNs are **zero-padded to 8192**, which is numerically identical because `silu(0)·0 = 0`. As a result, the exported checkpoint reports ≈3.8B params (padded), while the true model is ≈3.0B.
+```python
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+repo_id = "poolside-laguna-hackathon/laguna-xs2-dense-stage2"
+
+# trust_remote_code=True lets transformers run the modeling code shipped with the repo.
+model = AutoModelForCausalLM.from_pretrained(
+    repo_id, trust_remote_code=True, torch_dtype=torch.bfloat16, device_map="cuda")
+tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
+
+# Raw-completion prompt: no chat template is applied here.
+inputs = tokenizer("def add(a, b):\n    ", return_tensors="pt").to("cuda")
+output_ids = model.generate(**inputs, max_new_tokens=64)
+print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
+```
+(`student_last.pt`, the raw training state_dict, is also in this repo.)
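+Why the zero-padding is exact can be seen from the SwiGLU form. The sketch below assumes the usual gate/up/down parameterisation without biases; the matrix names are illustrative and not taken from `modeling_laguna.py`:
+$$
+\mathrm{FFN}(x) = W_{\text{down}}\left(\mathrm{silu}(W_{\text{gate}}\,x) \odot (W_{\text{up}}\,x)\right)
+$$
+Appending $8192 - 4608 = 3584$ zero rows to $W_{\text{gate}}$ and $W_{\text{up}}$, and the matching zero columns to $W_{\text{down}}$, gives every padded channel the activation
+$$
+\mathrm{silu}(0) \cdot 0 = 0,
+$$
+so the padded channels add nothing to the output and $\mathrm{FFN}_{\text{padded}}(x) = \mathrm{FFN}(x)$.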
+## Results
+HumanEval pass@1 (greedy, [evalplus](https://github.com/evalplus/evalplus)), base / plus:
+| Model | Params (resident) | Held-out PPL | HumanEval (raw completion) | HumanEval (chat template) |
+|---|---|---|---|---|
+| Teacher (Laguna XS.2, fp8) | 33B (3B active) | ≈4.4 | — | **88.4% / 84.8%** |
+| [Stage-1 dense](https://huggingface.co/poolside-laguna-hackathon/laguna-xs2-dense-stage1) | ≈3B | ≈25 | 0.0% | 0.0% |
+| Stage-2 dense, **raw-text** KD | ≈3B | — | 0.6% | 0.0% |
+| [**Stage-2 dense, chat-format KD**](https://huggingface.co/poolside-laguna-hackathon/laguna-xs2-dense-stage2-chat) | ≈3B | — | — | **6.7% / 6.1%** |
+> **Headline:** switching the Stage-2 distillation from raw text to the model's **native chat format** took the dense model from **0% → 6.7%** pass@1 — the first non-trivial coding ability, on ≈22M assistant tokens. The chat-format checkpoint lives in [its own repo](https://huggingface.co/poolside-laguna-hackathon/laguna-xs2-dense-stage2-chat).
+### Sample generation (chat-format KD)
+Prompt: *"Write a Python function `is_prime(n)` that returns True if n is prime."* The chat-KD model returns a correct, documented implementation (and **stops**):
+```python
+def is_prime(n):
+    """Return True if n is a prime number, False otherwise."""
+    if n < 2:
+        return False
+    for i in range(2, int(n**0.5) + 1):
+        if n % i == 0:
+            return False
+    return True
+```
+The raw-text KD model, by contrast, emitted control-token spam (`</think>…</assistant>`) and never produced runnable code — hence its 0%.
+## Diagnosis & what fixed it
+**Symptom:** the raw-text-KD dense model scored **≈0% on HumanEval in both eval formats** (0.6% raw completion, 0.0% chat template), while the teacher scores a normal **88.4%** with its chat template. The harness was therefore sound; the model was genuinely broken.
+**Root cause:** XS.2 is an **instruct/agentic** model (chat template, special tokens, EOS-terminated turns), but we first distilled it on **raw concatenated pretraining text** (DCLM + code, packed). That pushed the student off its instruct distribution and it never learned to *stop* — generations degenerated (Stage-1: `return n;` repeated; raw-KD Stage-2 in chat mode: control-token spam `</think>…</assistant>`).
+**Fix (applied, and it worked):** distill in the model's **native chat format** — coding instruction→response conversations rendered through `chat_template.jinja`, EOS-terminated, with the **KL loss masked to the assistant tokens**. Same KD loss/loop; only the data + tokenization changed. Result: coherent code and **6.7% HumanEval**, up from 0%. The lesson: *distilling an instruct model requires chat-format, EOS-terminated data — more raw tokens would not have fixed it.*
+What this release demonstrates: the MoE→dense **architecture** works (per-layer init converges, see the NMSE curves); it yields an **≈11× weight-VRAM reduction** (below) at matched active compute; and chat-format KD recovers usable instruct behavior. Closing the remaining gap to the teacher is a matter of **more chat-format KD tokens** (we used ≈22M; recovery budgets are typically 250M–4B).
+## VRAM / footprint
+The MoE keeps all 33B params resident even though only ≈3B are active per token; the dense model keeps only the ≈3B.
+| Model | Params resident | bf16 weights | fp8 weights |
+|---|---|---|---|
+| XS.2 (33B MoE) | 33.4 B | ≈67 GB | ≈34 GB |
+| **XS.2-dense (stage_2)** | ≈3.0 B | **≈6 GB** | ≈3 GB |
+→ **≈11× smaller weight footprint** (≈61 GB saved, bf16). Caveats: attention is unchanged, so **KV-cache memory is identical** to the teacher (all savings are in the weights); per-token **FLOPs are ≈unchanged** (the MoE was already ≈3B-active) — the win is **memory/deployability**, not speed. XS.2 needs an 80 GB-class GPU (or fp8 on 48 GB); the dense model fits a 16 GB consumer GPU. _(The exported checkpoint here is zero-padded to ≈3.8B / 7.7 GB for stock-modeling compat; the true model is 3.0B / ≈6 GB.)_
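+The figures in the table follow from parameter count times bytes per parameter (2 for bf16, 1 for fp8), ignoring activations and KV cache. A quick check using the table's own numbers:
+$$
+33.4\,\text{B} \times 2\,\text{bytes} \approx 67\,\text{GB}, \qquad 3.0\,\text{B} \times 2\,\text{bytes} = 6\,\text{GB}, \qquad \frac{33.4}{3.0} \approx 11\times, \qquad 67 - 6 = 61\,\text{GB saved}
+$$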
+## Limitations & next steps
+- **Severely under-trained.** ≈22M chat-KD assistant tokens is 10–200× below typical recovery budgets (RADLADS used 250–700M; MoE→dense work ≈4B). 6.7% HumanEval is a proof-of-life, not a usable coder yet.
+- **Next:** extend chat-format Stage 2 substantially — more Magicoder/teacher-generated conversations, ideally with **cached teacher top-K logits** to remove the teacher forward from the loop (3–5× throughput), reaching 100M+ assistant tokens. Then run the paper-faithful agentic evals (SWE-bench / Terminal-Bench via Harbor).
+Code: https://github.com/postscarcity-inc/laguna-xs.2-dense · [Stage-1 model](https://huggingface.co/poolside-laguna-hackathon/laguna-xs2-dense-stage1)