iamrahulreddy committed 7 days ago

Commit 4fc1bb9 (verified) · 1 parent: cbe6941

release: publish Quintus project files

Files changed (41)

.gitattributes +1 -0
LICENSE +21 -0
README.md +367 -63
assets/benchmark_scoreboard.png +3 -0
assets/offline_vs_online_kd.svg +3 -0
assets/pipeline_hardening_flow.svg +3 -0
assets/quintus_architecture.svg +3 -0
configs/__init__.py +186 -0
configs/config.yaml +77 -0
configs/ds_zero2.json +20 -0
docs/architecture.md +88 -0
docs/benchmarks.md +58 -0
docs/engineering_insights.md +152 -0
docs/evaluation_methodology.md +234 -0
docs/experiment_timeline.md +181 -0
docs/huggingface_model_card.md +178 -0
docs/index.md +42 -0
docs/pipeline_hardening.md +208 -0
docs/training_playbook.md +199 -0
docs/weight_audit.md +66 -0
requirements-eval.txt +6 -0
requirements-train.txt +12 -0
requirements.txt +13 -0
sft/chat.py +89 -0
sft/evaluate.py +267 -0
sft/train_sft.py +690 -0
src/__init__.py +1 -0
src/checkpoints.py +241 -0
src/download.py +574 -0
src/kd_contracts.py +95 -0
src/losses.py +180 -0
src/optim.py +44 -0
src/provenance.py +173 -0
src/sequence_packing.py +183 -0
src/train.py +1219 -0
src/training_data.py +375 -0
src/training_schedule.py +165 -0
src/transformers_compat.py +110 -0
src/validation.py +70 -0
weight_audit/quintus_weight_audit.py +818 -0
weight_audit/weight_audit_report.txt +0 -0

.gitattributes CHANGED

@@ -34,3 +34,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
 tokenizer.json filter=lfs diff=lfs merge=lfs -text
+assets/benchmark_scoreboard.png filter=lfs diff=lfs merge=lfs -text

LICENSE ADDED

	@@ -0,0 +1,21 @@

+MIT License
+Copyright (c) 2026 Muskula Rahul
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

README.md CHANGED

@@ -1,94 +1,326 @@
 ---
 license: mit
 language:
-- en
 tags:
-- text-generation
-- conversational
-- qwen3
-- knowledge-distillation
-base_model: Qwen/Qwen3-1.7B
 ---
-# Quintus 1.7B
-**Quintus-1.7B** is a compact, instruction-following AI assistant, Built on the **Qwen3-1.7B** architecture, Quintus bridges the gap between massive parameter sizes and low-resource edge deployment through a two-stage training paradigm: Online Knowledge Distillation (KD) followed by Supervised Fine-Tuning (SFT).
-The final model weights are publicly available on Hugging Face: [iamrahulreddy/Quintus](https://huggingface.co/iamrahulreddy/Quintus).
-## Model Details
-- **Architecture**: Qwen3-1.7B
-- **Teacher Model**: Qwen3-8B-Instruct
-- **Training Paradigm**: Online Full-Vocab Knowledge Distillation + Targeted Persona SFT
-- **Language**: English
-- **License**: MIT
-## Core Methodology & Architecture
-The Quintus pipeline implements two primary phases to overcome the performance limitations of compact base models without standard SFT dataset scaling limits:
-1. **Online Knowledge Distillation (KD)**: Rather than caching teacher logits offline, the Quintus engine streams the 8B teacher's full-vocabulary probability distribution live during the student's forward pass.
-2. **Targeted Persona SFT**: A final fine-tuning phase on LIMA and identity data grounds the model's persona and prevents infinite repetition loops.
-## Dataset & Training Details
-- **Training Dataset**: Fine-tuned using the [DistilQwen_100k](https://huggingface.co/datasets/alibaba-pai/DistilQwen_100k) dataset. Approximately 90,000 instruction-following examples were used after filtering out non-English (Chinese, Japanese, Korean) samples.
-- **High-Throughput Optimizations**:
-  - **Sequence Packing**: Dense sequence packing utilizing a First-Fit Decreasing (FFD) binning algorithm to eliminate VRAM waste from padding.
-  - **Memory & Compute Kernels**: Accelerated gradient computations using **FlashAttention-2** and **Liger Kernels** (fused operators).
-  - **Optimizer**: Fused AdamW optimizer configuration for faster, memory-efficient weight updates.
 ## Benchmark Scoreboard
-Quintus 1.7B demonstrates a crossover phenomenon, successfully outperforming the official instruction-tuned `Qwen3-1.7B-Instruct` model on multiple reasoning and coding tasks.
-| Benchmark | Qwen3-1.7B-Base | Qwen3-1.7B-Instruct | **Quintus 1.7B** |
-| :--- | :---: | :---: | :---: |
-| **HumanEval** pass@1 | 67.1% | **70.7%** | 67.7% |
-| **MBPP** pass@1 | 67.2% | 58.2% | **64.8%** |
-| **GSM8K** (10-shot, flexible) | 69.98% | 69.75% | **74.30%** |
-| **ARC-Challenge** acc_norm | 55.72% | 52.99% | **58.36%** |
-| **WinoGrande** (5-shot) | 65.67% | 61.01% | **66.38%** |
-| **PIQA** acc_norm | 75.63% | 72.09% | **75.57%** |
-## Usage: Quick Run in Google Colab / CLI
-You can easily run Quintus interactively. The following script sets up a conversational loop with streaming text output, perfect for Google Colab or a local terminal.
-Make sure you have the required libraries installed:
-```python
-# Install if necessary - pip install torch transformers accelerate
 import torch
 from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer
-# Repo name
 PUBLIC_REPO_ID = "iamrahulreddy/Quintus"
 print(f"Loading Quintus from {PUBLIC_REPO_ID}...")
 tokenizer = AutoTokenizer.from_pretrained(PUBLIC_REPO_ID, trust_remote_code=True)
 model = AutoModelForCausalLM.from_pretrained(
-    PUBLIC_REPO_ID,
-    device_map="auto",
-    dtype=torch.float16,
-    trust_remote_code=True
 )
-# Stopping criteria
 stop_tokens = ["<|endoftext|>", "<|im_end|>"]
 eos_token_ids = [tokenizer.eos_token_id] if tokenizer.eos_token_id is not None else []
 for token in stop_tokens:
-    t_id = tokenizer.convert_tokens_to_ids(token)
-    if t_id is not None and t_id not in eos_token_ids:
-        eos_token_ids.append(t_id)
 streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
 conversation_history = [
-    {"role": "system", "content": "You are Quintus, a highly capable AI assistant created by Muskula Rahul. You are helpful, precise, and logically sound."}
 ]
-print("\nQuintus Chat (type 'quit' to exit)\n")
 while True:
     try:
@@ -100,17 +332,17 @@ while True:
             continue
         conversation_history.append({"role": "user", "content": user_input})
         prompt = tokenizer.apply_chat_template(
-            conversation_history,
-            tokenize=False,
-            add_generation_prompt=True
         )
         inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
         print("Quintus: ", end="", flush=True)
         with torch.no_grad():
             outputs = model.generate(
                 **inputs,
@@ -120,15 +352,87 @@ while True:
                 do_sample=True,
                 streamer=streamer,
                 pad_token_id=tokenizer.eos_token_id,
-                eos_token_id=eos_token_ids
             )
-        # Extract response for history
         generated_ids = outputs[0][inputs.input_ids.shape[-1]:]
-        assistant_response = tokenizer.decode(generated_ids, skip_special_tokens=True).strip()
         conversation_history.append({"role": "assistant", "content": assistant_response})
         print()
     except KeyboardInterrupt:
         print("\n\nGoodbye!")
-        break

 ---
 license: mit
 language:
+  - en
+library_name: transformers
+pipeline_tag: text-generation
+base_model: Qwen/Qwen3-1.7B-Base
+base_model_relation: finetune
+datasets:
+  - alibaba-pai/DistilQwen_100k
+metrics:
+  - accuracy
+  - exact_match
+  - code_eval
 tags:
+  - qwen3
+  - qwen
+  - qwen3-1.7b
+  - qwen3-8b
+  - quintus
+  - quintus-1.7b
+  - causal-lm
+  - text-generation
+  - language-model
+  - chat
+  - assistant
+  - compact-llm
+  - small-language-model
+  - knowledge-distillation
+  - online-kd
+  - full-vocabulary-kd
+  - supervised-fine-tuning
+  - sft
+  - reasoning
+  - code-generation
+  - english
+  - pytorch
+  - transformers
+  - vllm
+widget:
+  - text: "Explain knowledge distillation in simple terms."
+  - text: "Solve this step by step: If a train travels 180 km in 3 hours, what is its average speed?"
 ---
+# Quintus
+[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1TdMSN5HzD1mToCFVf_qQoj10NGZLy2V0?usp=sharing)
+[![Hugging Face Model](https://img.shields.io/badge/Hugging%20Face-Quintus-ffcc4d?style=flat-square&logo=huggingface&logoColor=yellow)](https://huggingface.co/iamrahulreddy/Quintus)
+[![Docs](https://img.shields.io/badge/Docs-Project%20Guide-0f766e?style=flat-square&logo=googledocs&logoColor=white)](docs/index.md)
+[![Benchmarks](https://img.shields.io/badge/Benchmarks-Scoreboard-2563eb?style=flat-square&logo=speedtest&logoColor=white)](docs/benchmarks.md)
+[![License: MIT](https://img.shields.io/badge/License-MIT-111827?style=flat-square)](LICENSE)
+[![Base Model](https://img.shields.io/badge/Base-Qwen3--1.7B--Base-7c3aed?style=flat-square)](https://huggingface.co/Qwen/Qwen3-1.7B-Base)
+[![Teacher](https://img.shields.io/badge/Teacher-Qwen3--8B-b45309?style=flat-square)](https://huggingface.co/Qwen/Qwen3-8B)
+**Quintus-1.7B** is a compact English-focused assistant built from
+`Qwen/Qwen3-1.7B-Base`. The project uses **online full-vocabulary knowledge
+distillation** from a `Qwen/Qwen3-8B` teacher, followed by a targeted SFT stage
+for assistant behavior, identity grounding, and generation stability.
+Final model weights:
+[iamrahulreddy/Quintus](https://huggingface.co/iamrahulreddy/Quintus)
+## Core Technical Points
+- **Dense KD signal:** the final training path streams the teacher's full
+  vocabulary distribution live instead of relying on sparse cached top-k logits.
+- **Base-student strategy:** the student starts from `Qwen/Qwen3-1.7B-Base`,
+  leaving more room for distillation before assistant-format tuning.
+- **Assistant-only supervision:** prompt text, chat headers, separators, and
+  padding are masked out of the supervised target region.
+- **Sequence packing:** deterministic first-fit decreasing packing improves
+  useful-token throughput at 4096-token context length.
+- **Public benchmark controls:** raw/chat prompt format, metric extraction,
+  generation budget, and artifact hygiene are documented explicitly.
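
To illustrate the assistant-only supervision point above, here is a minimal sketch. It is not the code in `src/download.py`. It assumes a per-token boolean `assistant_mask` has already been derived from the chat template, and it uses `-100`, the default `ignore_index` of PyTorch's cross-entropy loss. The name `build_labels` is hypothetical.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_labels(input_ids, assistant_mask):
    """Copy token ids into labels only where the assistant is speaking.

    Prompt text, chat headers, separators, and padding get IGNORE_INDEX,
    so they contribute nothing to the supervised loss.
    """
    if len(input_ids) != len(assistant_mask):
        raise ValueError("input_ids and assistant_mask must have equal length")
    return [
        token if keep else IGNORE_INDEX
        for token, keep in zip(input_ids, assistant_mask)
    ]


# Toy sequence: 3 prompt/header tokens, 2 assistant tokens, 1 padding token.
ids = [11, 12, 13, 21, 22, 0]
mask = [False, False, False, True, True, False]
print(build_labels(ids, mask))
```

Only the two assistant tokens keep their ids; every other position is replaced by the ignore value.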
+## Training Summary
+The release training path is a two-stage pipeline:
+1. **Online KD:** train the 1.7B base student against live teacher logits from a
+   Qwen3-8B teacher.
+2. **Targeted SFT:** tune the distilled checkpoint for assistant-style
+   interaction, persona consistency, and repetition control.
+## Reuse As A KD Framework
+Quintus is released as a trained 1.7B assistant, but the repository is also a
+reusable reference pipeline for compact-model distillation. The same structure
+can be adapted to other teacher/student pairs with changes to the model IDs,
+tokenizer, dataset source, local paths, sequence length, batch schedule, and
+hardware-specific memory settings in [configs/config.yaml](configs/config.yaml).
+The reusable pieces are split across the codebase: assistant-only masking,
+sequence packing, online full-vocabulary KD loss, checkpoint/resume metadata,
+validation, provenance checks, SFT, and evaluation. The final pattern is:
+1. Distill a smaller base student from a stronger teacher with online KD.
+2. Apply targeted SFT to recover assistant behavior, formatting, identity, and
+   generation stability.
+![Quintus Architecture](assets/quintus_architecture.svg)
+Core KD objective:
+$$
+\mathcal{L}_{\text{total}}
+= \alpha \mathcal{L}_{\text{CE}}
++ (1 - \alpha)\mathcal{L}_{\text{KD}}
+$$
+For the final run,
+$$
+\alpha = 0.3,\quad T = 2.0
+$$
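
The objective above can be sketched for a single token position in plain Python. This is an illustration under stated assumptions, not the implementation in `src/losses.py`: the KD term is taken as the KL divergence from the temperature-softened teacher distribution to the softened student distribution, and it is multiplied by `T**2`, a common convention that the README does not confirm. The function names are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_total_loss(student_logits, teacher_logits, target, alpha=0.3, temperature=2.0):
    """alpha * CE(student, target) + (1 - alpha) * KD(teacher, student)."""
    # Hard-label cross-entropy at temperature 1.
    ce = -math.log(softmax(student_logits)[target])
    # KL(teacher || student) over the full vocabulary, on softened distributions.
    q_teacher = softmax(teacher_logits, temperature)
    q_student = softmax(student_logits, temperature)
    kl = sum(
        t * (math.log(t) - math.log(s))
        for t, s in zip(q_teacher, q_student)
        if t > 0.0
    )
    kd = (temperature ** 2) * kl  # assumption: T^2 scaling convention
    return alpha * ce + (1.0 - alpha) * kd


print(kd_total_loss([1.0, 2.0, 3.0], [0.5, 2.5, 3.5], target=2))
```

When the student and teacher logits are identical the KD term vanishes and the loss reduces to `alpha` times the cross-entropy.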
+Configuration snapshot:
+| Setting | Value |
+| :--- | :--- |
+| Teacher | `Qwen/Qwen3-8B` |
+| Student | `Qwen/Qwen3-1.7B-Base` |
+| Tokenizer | `Qwen/Qwen3-1.7B` |
+| Data | ~90K English-only samples from DistilQwen_100k |
+| Max sequence length | 4096 |
+| Epochs | 1 |
+| Learning rate | `5.0e-6` |
+| Weight decay | `0.1` |
+| Warmup ratio | `0.05` |
+| Online KD token chunk | 2048 |
+| Micro batch | 4 |
+| Gradient accumulation | 2 |
+| Sequence packing | enabled, `pack_length = 4096` |
+| Attention | FlashAttention-2 when available |
+| Liger kernels | enabled for compatible Qwen-family ops |
+| Optimizer | fused AdamW |
+| `torch.compile` | disabled |
+| Gradient checkpointing | disabled |
+| Seed | 25 |
+> [!NOTE]
+> FlashAttention-2, Liger kernels, and fused AdamW are acceleration paths. Keep
+> the baseline load path compatible with standard Transformers and vLLM APIs
+> before publishing checkpoints. `torch.compile` stayed disabled because this
+> KD shape showed high Inductor memory overhead, dynamic-shape graph breaks,
+> recompile overhead, and checkpoint portability risk from `_orig_mod.` state
+> dict prefixes when compiled modules are not unwrapped before saving.
+> [!TIP]
+> The B200-oriented defaults are conservative for the 8B-teacher-to-1.7B-student
+> workload. Smaller teacher/student pairs may tolerate larger micro-batches,
+> but the full-vocabulary KD workspace grows with tokens times vocabulary size.
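
The `_orig_mod.` portability risk mentioned in the note above can be handled at save time. The following is a minimal sketch, not code from this repository: it assumes a flat state dict whose keys carry a leading `_orig_mod.` prefix, as produced when a `torch.compile`-wrapped module is saved without unwrapping. The helper name `strip_compile_prefix` is hypothetical.

```python
def strip_compile_prefix(state_dict, prefix="_orig_mod."):
    """Return a copy of state_dict with a leading compile prefix removed.

    Only a prefix at the start of each key is handled; keys without the
    prefix pass through unchanged.
    """
    cleaned = {}
    for key, value in state_dict.items():
        new_key = key[len(prefix):] if key.startswith(prefix) else key
        if new_key in cleaned:
            raise KeyError(f"duplicate key after stripping prefix: {new_key}")
        cleaned[new_key] = value
    return cleaned


# Placeholder values stand in for tensors.
compiled = {
    "_orig_mod.model.embed_tokens.weight": "tensor-a",
    "_orig_mod.lm_head.weight": "tensor-b",
}
print(sorted(strip_compile_prefix(compiled)))
```

Keeping `torch.compile` disabled, as the final run did, avoids the issue entirely; the sketch only matters if compilation is re-enabled.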
+The editable run configuration lives in [configs/config.yaml](configs/config.yaml).
+Paths and Hub destinations are left as placeholders so each runner can set local
+directories and repository names directly.
+## Why Online KD Replaced Offline Top-K KD
+Earlier experiments cached only the teacher's top-k logits. That made storage
+smaller, but with a Qwen vocabulary around 151K tokens, $k = 8$ exposes only:
+$$
+\frac{k}{|V|}
+= \frac{8}{151{,}665}
+\approx 5.3 \times 10^{-5}
+= 0.0053\%
+$$
+of the vocabulary support at each position. The sparse signal could perturb the
+student, but it did not consistently transfer deeper reasoning behavior.
+The final online path keeps the teacher and student in memory together and
+computes KL divergence against the teacher's full-vocabulary distribution. Token
+chunking keeps that dense objective feasible without materializing a single
+large KL workspace.
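
The coverage figure and the motivation for token chunking can be checked with a few lines of arithmetic. This is a rough sketch under stated assumptions: the workspace estimate counts a single float32 logits tensor of shape tokens by vocabulary and ignores the second model's logits and softmax intermediates, so real memory use is higher. The function names are hypothetical.

```python
VOCAB_SIZE = 151_665  # vocabulary size quoted in the README


def topk_coverage(k, vocab_size=VOCAB_SIZE):
    """Fraction of the vocabulary visible when only top-k logits are cached."""
    return k / vocab_size


def logits_workspace_bytes(num_tokens, vocab_size=VOCAB_SIZE, bytes_per_value=4):
    """Size of one dense logits tensor (assumes float32 values)."""
    return num_tokens * vocab_size * bytes_per_value


print(f"top-8 coverage: {topk_coverage(8):.2e}")
print(f"4096-token workspace: {logits_workspace_bytes(4096) / 2**30:.2f} GiB")
print(f"2048-token chunk:     {logits_workspace_bytes(2048) / 2**30:.2f} GiB")
```

Processing 2048 tokens at a time halves the peak size of this tensor relative to a full 4096-token sequence, at the cost of two passes over the KL computation.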
 ## Benchmark Scoreboard
+The final public scoreboard compares `Qwen/Qwen3-1.7B-Base`,
+`Qwen/Qwen3-1.7B-Instruct`, and Quintus-1.7B.
+![Model Evaluation Scoreboard](assets/benchmark_scoreboard.png)
+The strongest signal is the reasoning crossover: Quintus beats both the base
+model and the official 1.7B instruct model on GSM8K, ARC-Challenge, and
+WinoGrande while remaining at the same parameter scale.
+See [docs/benchmarks.md](docs/benchmarks.md) for the numeric table and
+interpretation. See
+[docs/evaluation_methodology.md](docs/evaluation_methodology.md) for benchmark
+controls.
+## Evaluation Notes
+Evaluation uses a mix of EvalPlus and `lm-evaluation-harness`/vLLM-style
+benchmarks. The repository documents evaluation methodology separately because
+prompt format can change the result:
+- Raw completion comparisons are used for base capability.
+- Chat-template comparisons are used for assistant-format behavior.
+- Log-likelihood tasks such as ARC-Challenge and PIQA should usually stay raw.
+- GSM8K can differ between strict `####` parsing and flexible number
+  extraction.
+- Metric extraction must ignore `stderr`, aliases, and wrong filter keys.
+- Runtime versions, checkpoint identity, generation budget, and stale output
+  cleanup are part of the evaluation contract.
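
The GSM8K point in the list above is easy to see with a small example. The sketch below is illustrative only: the regular expressions are simplified stand-ins, not the filters shipped with `lm-evaluation-harness`, and the function names are hypothetical. Strict parsing accepts only a number after `####`, while flexible extraction takes the last number in the text.

```python
import re

# Simplified patterns for illustration; not the harness's own filters.
_NUMBER = r"-?\d[\d,]*(?:\.\d+)?"
_STRICT = re.compile(r"####\s*(" + _NUMBER + r")")
_ANY_NUMBER = re.compile(_NUMBER)


def strict_answer(text):
    """Return the number following '####', or None if the marker is absent."""
    match = _STRICT.search(text)
    return match.group(1).replace(",", "") if match else None


def flexible_answer(text):
    """Return the last number that appears anywhere in the text."""
    numbers = _ANY_NUMBER.findall(text)
    return numbers[-1].replace(",", "") if numbers else None


chatty = "She sells 3 boxes at 6 dollars each, so she earns 18 dollars."
formal = "3 * 6 = 18\n#### 18"
print(strict_answer(chatty), flexible_answer(chatty))
print(strict_answer(formal), flexible_answer(formal))
```

A chat-tuned model that answers correctly in prose scores zero under the strict rule, which is why the two metrics should not be compared against each other.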
+The active benchmark runner is [sft/evaluate.py](sft/evaluate.py). It covers
+EvalPlus code tasks and `lm-evaluation-harness`/vLLM tasks, including GSM8K
+10-shot evaluation with an extended generation budget.
+## Repository Map
+```text
+configs/        Public run profile and DeepSpeed Zero-2 template.
+src/            Data prep, online KD, losses, packing, checkpoints, provenance.
+sft/            Post-KD SFT, local chat, and consolidated evaluation runner.
+docs/           Public architecture, training, evaluation, and release notes.
+weight_audit/   Checkpoint structure and weight-divergence audit material.
+```
+Key files:
+- [src/train.py](src/train.py): SFT, offline KD compatibility, and final
+  `online_kd` training entry point.
+- [src/download.py](src/download.py): model setup, dataset loading, schema
+  normalization, tokenization, and assistant-only loss masks.
+- [src/losses.py](src/losses.py): CE/KD objective, including online full-vocab
+  KD token chunking.
+- [src/sequence_packing.py](src/sequence_packing.py): deterministic first-fit
+  decreasing sequence packing.
+- [src/checkpoints.py](src/checkpoints.py): checkpoint save/resume metadata and
+  packing compatibility checks.
+- [src/provenance.py](src/provenance.py): tokenizer/model/data contract checks.
+- [sft/train_sft.py](sft/train_sft.py): post-KD supervised fine-tuning.
+- [sft/evaluate.py](sft/evaluate.py): EvalPlus and
+  `lm-evaluation-harness`/vLLM benchmark runner.
+- [sft/chat.py](sft/chat.py): local interactive chat wrapper.
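
For readers unfamiliar with first-fit decreasing packing, here is a minimal sketch of the general algorithm. It is not the code in `src/sequence_packing.py`: it packs sequence indices by length only, breaks ties by original index to stay deterministic, and assumes the 4096-token `pack_length` from the configuration snapshot. The function name is hypothetical.

```python
def first_fit_decreasing(lengths, pack_length=4096):
    """Group sequence indices into packs whose total length fits pack_length."""
    # Longest first; ties broken by original index so the result is deterministic.
    order = sorted(range(len(lengths)), key=lambda i: (-lengths[i], i))
    packs, free_space = [], []
    for idx in order:
        size = lengths[idx]
        if size > pack_length:
            raise ValueError(f"sequence {idx} is longer than pack_length")
        # Place the sequence in the first pack that still has room.
        for pack_id, free in enumerate(free_space):
            if size <= free:
                packs[pack_id].append(idx)
                free_space[pack_id] -= size
                break
        else:
            # No existing pack fits, so open a new one.
            packs.append([idx])
            free_space.append(pack_length - size)
    return packs


lengths = [3000, 2000, 1500, 1000, 500]
for pack in first_fit_decreasing(lengths):
    print(pack, sum(lengths[i] for i in pack))
```

In this toy case five sequences fit into two packs of 4000 tokens each, so only 96 of every 4096 positions are padding.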
+## Commands
+Install the base dependencies:
+```bash
+pip install -r requirements.txt
+```
+For training and benchmark runs, install the matching extras:
+```bash
+pip install -r requirements-train.txt
+pip install -r requirements-eval.txt
+```
+Inspect or prepare data/model assets:
+```bash
+python -m src.download --help
+```
+Run the final KD path after editing [configs/config.yaml](configs/config.yaml)
+for local paths and hardware:
+```bash
+python -m src.train --phase online_kd
+```
+Hub checkpoint uploads are off by default for local runs. Pass
+`--upload_last_checkpoint` or the step/epoch upload flags only after setting the
+target repository and `HF_TOKEN`.
+Run the consolidated benchmark suite:
+```bash
+python sft/evaluate.py
+```
+Start local chat with a downloaded or local checkpoint:
+```bash
+python sft/chat.py --model_path path/to/quintus/checkpoint
+```
+## Interactive Chat
+```python
 import torch
 from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer
 PUBLIC_REPO_ID = "iamrahulreddy/Quintus"
 print(f"Loading Quintus from {PUBLIC_REPO_ID}...")
 tokenizer = AutoTokenizer.from_pretrained(PUBLIC_REPO_ID, trust_remote_code=True)
 model = AutoModelForCausalLM.from_pretrained(
+    PUBLIC_REPO_ID,
+    device_map="auto",
+    dtype=torch.float16,
+    trust_remote_code=True,
 )
 stop_tokens = ["<|endoftext|>", "<|im_end|>"]
 eos_token_ids = [tokenizer.eos_token_id] if tokenizer.eos_token_id is not None else []
 for token in stop_tokens:
+    token_id = tokenizer.convert_tokens_to_ids(token)
+    if token_id is not None and token_id not in eos_token_ids:
+        eos_token_ids.append(token_id)
 streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
 conversation_history = [
+    {
+        "role": "system",
+        "content": (
+            "You are Quintus, a highly capable AI assistant created by "
+            "Muskula Rahul. You are helpful, precise, and logically sound."
+        ),
+    }
 ]
+print()
+print("Quintus Chat (type 'quit' to exit)")
+print()
 while True:
     try:
             continue
         conversation_history.append({"role": "user", "content": user_input})
         prompt = tokenizer.apply_chat_template(
+            conversation_history,
+            tokenize=False,
+            add_generation_prompt=True,
         )
         inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
         print("Quintus: ", end="", flush=True)
         with torch.no_grad():
             outputs = model.generate(
                 **inputs,
                 do_sample=True,
                 streamer=streamer,
                 pad_token_id=tokenizer.eos_token_id,
+                eos_token_id=eos_token_ids,
             )
         generated_ids = outputs[0][inputs.input_ids.shape[-1]:]
+        assistant_response = tokenizer.decode(
+            generated_ids,
+            skip_special_tokens=True,
+        ).strip()
         conversation_history.append({"role": "assistant", "content": assistant_response})
         print()
     except KeyboardInterrupt:
         print("\n\nGoodbye!")
+        break
+```
+## Documentation
+- [Documentation Index](docs/index.md): recommended public reading order.
+- [Architecture](docs/architecture.md): end-to-end data flow, modules, and
+  training phases.
+- [Experiment Timeline](docs/experiment_timeline.md): why the project moved
+  from offline top-k KD to online full-vocabulary KD.
+- [Training Playbook](docs/training_playbook.md): memory rules, packing,
+  kernels, checkpointing, and B200-oriented guidance.
+- [Pipeline Hardening](docs/pipeline_hardening.md): silent-failure classes,
+  artifact contracts, and safety checks.
+- [Evaluation Methodology](docs/evaluation_methodology.md): raw/chat controls,
+  parser traps, metric extraction, and qualitative evaluation rules.
+- [Engineering Insights](docs/engineering_insights.md): condensed lessons and
+  design decisions.
+- [Benchmarks](docs/benchmarks.md): verified scoreboard and interpretation.
+- [Weight Audit](docs/weight_audit.md): structural checkpoint sanity checks and
+  weight-divergence summary.
+- [Hugging Face Model Card](docs/huggingface_model_card.md): release-page
+  copy for the public model card.
+## Limitations
+- Quintus is still a 1.7B model and inherits compact-model capacity limits.
+- Factual answers can be confidently wrong and should be verified.
+- Code generation may still contradict stated complexity or edge-case
+  requirements.
+- Raw and chat-template results are not interchangeable.
+- Additional preference tuning, such as DPO, would likely improve calibration,
+  refusal behavior, and open-ended assistant polish.
+## Credits
+Quintus builds on open model, dataset, and tooling work from the broader LLM
+community:
+- [Qwen Team](https://qwenlm.github.io/) and the
+  [Qwen Hugging Face organization](https://huggingface.co/Qwen) for the Qwen3
+  model family.
+- [`Qwen/Qwen3-8B`](https://huggingface.co/Qwen/Qwen3-8B), used as the
+  distillation teacher.
+- [`Qwen/Qwen3-1.7B-Base`](https://huggingface.co/Qwen/Qwen3-1.7B-Base), used
+  as the base student checkpoint.
+- [`Qwen/Qwen3-1.7B`](https://huggingface.co/Qwen/Qwen3-1.7B), used for the
+  tokenizer and chat-template contract.
+- [Alibaba PAI](https://huggingface.co/alibaba-pai) for the
+  [`DistilQwen_100k`](https://huggingface.co/datasets/alibaba-pai/DistilQwen_100k)
+  dataset used as the primary instruction source after filtering.
+- [Hugging Face Transformers](https://github.com/huggingface/transformers) for
+  model loading, tokenization, and generation APIs.
+- [vLLM](https://github.com/vllm-project/vllm),
+  [EvalPlus](https://github.com/evalplus/evalplus), and
+  [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness)
+  for evaluation infrastructure.
+- [FlashAttention](https://github.com/Dao-AILab/flash-attention) and
+  [Liger Kernel](https://github.com/linkedin/Liger-Kernel) for performance
+  kernels used or validated during training.
+## License And Author
+This software is distributed under the MIT License. Refer to the
+[LICENSE](LICENSE) file for full text.
+Author: Muskula Rahul - [@iamrahulreddy](https://github.com/iamrahulreddy)
+## Citation
+If this model, codebase, or training pipeline is useful in your work, please cite this repository and acknowledge the upstream Qwen3 models.

assets/benchmark_scoreboard.png ADDED

Git LFS Details

SHA256: 2070f6f33338dab31006a253b2eb4a3f9c1655490c444af5daa7c2cb07bb9b15
Pointer size: 131 Bytes
Size of remote file: 131 kB

assets/offline_vs_online_kd.svg ADDED

assets/pipeline_hardening_flow.svg ADDED

assets/quintus_architecture.svg ADDED

configs/__init__.py ADDED

	@@ -0,0 +1,186 @@

+from __future__ import annotations
+import logging
+import os
+import sys
+import time
+from datetime import timezone, timedelta
+from pathlib import Path
+from zoneinfo import ZoneInfo
+from omegaconf import OmegaConf
+_THIS_DIR = Path(__file__).resolve().parent
+_YAML_PATH = _THIS_DIR / "config.yaml"
+def _load_cfg():
+    return OmegaConf.load(_YAML_PATH)
+cfg = _load_cfg()
+_LOG_TZ_NAME = os.environ.get("QUINTUS_LOG_TZ", "Asia/Kolkata")
+try:
+    _LOG_TZ = ZoneInfo(_LOG_TZ_NAME)
+except Exception:
+    _LOG_TZ = timezone(timedelta(hours=5, minutes=30))
+    _LOG_TZ_NAME = "Asia/Kolkata"
+os.environ["TZ"] = _LOG_TZ_NAME
+if hasattr(time, "tzset"):
+    time.tzset()
+_LOG_TZ_LABEL = "IST" if _LOG_TZ_NAME == "Asia/Kolkata" else _LOG_TZ_NAME
+def _read_bool_env(name: str) -> bool | None:
+    raw = os.environ.get(name)
+    if raw is None:
+        return None
+    normalised = raw.strip().lower()
+    if normalised in {"1", "true", "yes", "on"}:
+        return True
+    if normalised in {"0", "false", "no", "off"}:
+        return False
+    raise ValueError(
+        f"Invalid boolean value for {name}: {raw!r}. "
+        "Use 1/0, true/false, yes/no, or on/off."
+    )
+# Environment variable overrides used by the wrapper.
+if os.environ.get("QUINTUS_TEACHER_MODEL"):
+    cfg.model.teacher = os.environ["QUINTUS_TEACHER_MODEL"]
+if os.environ.get("QUINTUS_TEACHER_REVISION"):
+    cfg.model.teacher_revision = os.environ["QUINTUS_TEACHER_REVISION"]
+if os.environ.get("QUINTUS_STUDENT_MODEL"):
+    cfg.model.student = os.environ["QUINTUS_STUDENT_MODEL"]
+if os.environ.get("QUINTUS_STUDENT_REVISION"):
+    cfg.model.student_revision = os.environ["QUINTUS_STUDENT_REVISION"]
+if os.environ.get("QUINTUS_TOKENIZER_MODEL"):
+    cfg.model.tokenizer = os.environ["QUINTUS_TOKENIZER_MODEL"]
+if os.environ.get("QUINTUS_TOKENIZER_REVISION"):
+    cfg.model.tokenizer_revision = os.environ["QUINTUS_TOKENIZER_REVISION"]
+if os.environ.get("QUINTUS_STUDENT_DIR"):
+    cfg.paths.student_dir = os.environ["QUINTUS_STUDENT_DIR"]
+if os.environ.get("QUINTUS_TOKENIZER_DIR"):
+    cfg.paths.tokenizer_dir = os.environ["QUINTUS_TOKENIZER_DIR"]
+if os.environ.get("NUM_SAMPLES"):
+    cfg.data.num_samples = int(os.environ["NUM_SAMPLES"])
+if os.environ.get("TRAIN_NUM_EPOCHS"):
+    cfg.training.num_epochs = int(os.environ["TRAIN_NUM_EPOCHS"])
+if os.environ.get("TRAIN_LEARNING_RATE"):
+    cfg.training.learning_rate = float(os.environ["TRAIN_LEARNING_RATE"])
+if os.environ.get("TRAIN_ALPHA"):
+    cfg.training.alpha = float(os.environ["TRAIN_ALPHA"])
+if os.environ.get("TRAIN_TEMPERATURE"):
+    cfg.training.temperature = float(os.environ["TRAIN_TEMPERATURE"])
+if os.environ.get("TRAIN_TOP_K"):
+    cfg.training.top_k = int(os.environ["TRAIN_TOP_K"])
+if os.environ.get("QUINTUS_ONLINE_KD_TOKEN_CHUNK_SIZE"):
+    cfg.training.online_kd_token_chunk_size = int(os.environ["QUINTUS_ONLINE_KD_TOKEN_CHUNK_SIZE"])
+if os.environ.get("TRAIN_MICRO_BATCH_SIZE"):
+    cfg.training.micro_batch_size = int(os.environ["TRAIN_MICRO_BATCH_SIZE"])
+if os.environ.get("TRAIN_GRAD_ACCUM_STEPS"):
+    cfg.training.grad_accum_steps = int(os.environ["TRAIN_GRAD_ACCUM_STEPS"])
+if os.environ.get("TRAIN_DATALOADER_WORKERS"):
+    cfg.training.dataloader_workers = int(os.environ["TRAIN_DATALOADER_WORKERS"])
+if os.environ.get("TRAIN_PREFETCH_FACTOR"):
+    cfg.training.prefetch_factor = int(os.environ["TRAIN_PREFETCH_FACTOR"])
+sequence_packing_override = _read_bool_env("QUINTUS_SEQUENCE_PACKING")
+if sequence_packing_override is not None:
+    cfg.training.sequence_packing.enabled = sequence_packing_override
+if os.environ.get("QUINTUS_PACK_LENGTH"):
+    cfg.training.sequence_packing.pack_length = int(os.environ["QUINTUS_PACK_LENGTH"])
+compile_override = _read_bool_env("QUINTUS_COMPILE_MODEL")
+if compile_override is not None:
+    cfg.training.compile_model = compile_override
+fused_adamw_override = _read_bool_env("TRAIN_FUSED_ADAMW")
+if fused_adamw_override is not None:
+    cfg.training.fused_adamw = fused_adamw_override
+if os.environ.get("QUINTUS_DISTILLED_DIR"):
+    cfg.paths.distilled_dir = os.environ["QUINTUS_DISTILLED_DIR"]
+if os.environ.get("DATA_STREAM_SHUFFLE_BUFFER_SIZE"):
+    cfg.data.stream_shuffle_buffer_size = int(os.environ["DATA_STREAM_SHUFFLE_BUFFER_SIZE"])
+if os.environ.get("DATA_STREAM_SHUFFLE_SEED"):
+    cfg.data.stream_shuffle_seed = int(os.environ["DATA_STREAM_SHUFFLE_SEED"])
+remote_code_override = _read_bool_env("QUINTUS_ALLOW_REMOTE_CODE")
+if remote_code_override is not None:
+    cfg.model.allow_remote_code = remote_code_override
+class _TagFormatter(logging.Formatter):
+    def __init__(self, tag: str, fmt: str, datefmt: str | None = None):
+        super().__init__(fmt=fmt, datefmt=datefmt)
+        self.tag = tag
+    def formatTime(self, record: logging.LogRecord, datefmt: str | None = None) -> str:
+        dt = datetime_from_timestamp(record.created)
+        if datefmt:
+            return dt.strftime(datefmt)
+        return dt.isoformat(timespec="seconds")
+    def format(self, record: logging.LogRecord) -> str:
+        record.tag = self.tag  # type: ignore[attr-defined]
+        return super().format(record)
+def datetime_from_timestamp(timestamp: float):
+    from datetime import datetime
+    return datetime.fromtimestamp(timestamp, tz=_LOG_TZ)
+def setup_logger(module_tag: str, rank: int = -1) -> logging.Logger:
+    name = f"quintus.{module_tag}"
+    logger = logging.getLogger(name)
+    if logger.handlers:
+        return logger
+    logger.setLevel(logging.DEBUG)
+    logger.propagate = False
+    # Suppress duplicate output from non-primary ranks.
+    if rank not in (-1, 0):
+        logger.addHandler(logging.NullHandler())
+        return logger
+    # Plain text file handler.
+    file_fmt = _TagFormatter(
+        tag=module_tag,
+        fmt=f"[%(asctime)s {_LOG_TZ_LABEL}] [%(levelname)-5s] [%(tag)-8s] %(message)s",
+        datefmt="%Y-%m-%d %H:%M:%S",
+    )
+    log_dir = os.path.dirname(cfg.paths.log_file)
+    if log_dir:
+        os.makedirs(log_dir, exist_ok=True)
+    file_handler = logging.FileHandler(cfg.paths.log_file, mode="a", encoding="utf-8")
+    file_handler.setLevel(logging.DEBUG)
+    file_handler.setFormatter(file_fmt)
+    logger.addHandler(file_handler)
+    # Plain text console handler. Keep the runtime logs stable across terminals,
+    # notebooks and log files.
+    console_handler = logging.StreamHandler(sys.stdout)
+    console_handler.setLevel(logging.INFO)
+    console_handler.setFormatter(file_fmt)
+    logger.addHandler(console_handler)
+    return logger
+def emit_log_spacing(logger: logging.Logger, count: int = 2) -> None:
+    if count <= 0:
+        return
+    blank_block = "\n" * count
+    for handler in logger.handlers:
+        if isinstance(handler, logging.NullHandler):
+            continue
+        stream = getattr(handler, "stream", None)
+        if stream is not None and hasattr(stream, "write"):
+            stream.write(blank_block)
+            flush = getattr(stream, "flush", None)
+            if callable(flush):
+                flush()
+            continue
+        console = getattr(handler, "console", None)
+        if console is not None:
+            console.print(blank_block, end="")

configs/config.yaml ADDED

	@@ -0,0 +1,77 @@

+# Quintus Distillation Pipeline
+# Run profile: online full-vocabulary KD, 8B teacher -> 1.7B-Base student.
+# Data: ~90K English-only samples from DistilQwen_100k.
+data:
+  dataset_path: "<REDACTED_ON_PURPOSE>"
+  num_samples: 90234
+  max_seq_len: 4096
+  stream_shuffle_buffer_size: 20000
+  stream_shuffle_seed: 25
+model:
+  teacher: "Qwen/Qwen3-8B"
+  student: "Qwen/Qwen3-1.7B-Base"
+  # The instruct tokenizer carries the chat template used to format the base
+  # student into assistant-style training examples.
+  tokenizer: "Qwen/Qwen3-1.7B"
+  teacher_revision: "main"
+  student_revision: "main"
+  tokenizer_revision: "main"
+  allow_remote_code: false
+training:
+  # Schedule
+  num_epochs: 1
+  validation_ratio: 0.02
+  split_seed: 25
+  # Optimizer
+  learning_rate: 5.0e-6
+  weight_decay: 0.1
+  warmup_ratio: 0.05
+  # Loss mix used by src/losses.py:
+  # total = alpha * CE + (1 - alpha) * KD
+  alpha: 0.3
+  temperature: 2.0
+  # Online KD streams full-vocabulary teacher logits. top_k is retained for
+  # offline-KD compatibility/provenance checks.
+  top_k: 8
+  online_kd_token_chunk_size: 2048
+  # Conservative B200 profile. Effective batch = 4 * 2 = 8.
+  # If VRAM headroom is comfortable and Liger is installed, try 8 * 1.
+  micro_batch_size: 4
+  grad_accum_steps: 2
+  gradient_checkpointing: false
+  compile_model: false
+  fused_adamw: true
+  dataloader_workers: 8
+  prefetch_factor: 2
+  sequence_packing:
+    enabled: true
+    pack_length: 4096
+    mask_first_token_after_separator: true
+hub:
+  # Prefer HF_TOKEN or huggingface-cli login for real runs.
+  token: null
+  username: "<REDACTED_ON_PURPOSE>"
+  repo_name: "<REDACTED_ON_PURPOSE>"
+paths:
+  teacher_dir: "<REDACTED_ON_PURPOSE>"
+  student_dir: "<REDACTED_ON_PURPOSE>"
+  tokenizer_dir: "<REDACTED_ON_PURPOSE>"
+  tokenized_dir: "<REDACTED_ON_PURPOSE>"
+  logits_dir: "<REDACTED_ON_PURPOSE>"
+  distilled_dir: "<REDACTED_ON_PURPOSE>"
+  log_file: "<REDACTED_ON_PURPOSE>"
+  system_info: "<REDACTED_ON_PURPOSE>"
+  loss_csv: "<REDACTED_ON_PURPOSE>"

configs/ds_zero2.json ADDED

	@@ -0,0 +1,20 @@

+{
+  "zero_optimization": {
+    "stage": 2,
+    "allgather_partitions": true,
+    "allgather_bucket_size": 500000000,
+    "reduce_scatter": true,
+    "reduce_bucket_size": 500000000,
+    "overlap_comm": true,
+    "contiguous_gradients": true
+  },
+  "bf16": {
+    "enabled": true
+  },
+  "gradient_clipping": 1.0,
+  "steps_per_print": 50,
+  "wall_clock_breakdown": false,
+  "comms_logger": {
+    "enabled": false
+  }
+}

docs/architecture.md ADDED

	@@ -0,0 +1,88 @@

+# Architecture
+Quintus is built as a two-stage model development pipeline:
+1. Online full-vocabulary knowledge distillation from a larger Qwen3 teacher into a Qwen3-1.7B base student.
+2. Targeted SFT to improve instruction-following behavior, persona consistency, and generation stability.
+![Quintus Architecture](../assets/quintus_architecture.svg)
+## Core Training Path
+The main training entry point is `src/train.py`. It supports three phases:
+- `sft`: Cross-entropy training on assistant response tokens.
+- `kd`: Offline top-k teacher-logit distillation, retained for compatibility and provenance checks.
+- `online_kd`: The final, preferred path. Teacher logits are computed live by a teacher forward pass in the same training step as the student forward pass, rather than read from a cache.
+The final KD objective is implemented in `src/losses.py`:
+$$
+\mathcal{L}_{\text{total}}
+= \alpha \mathcal{L}_{\text{CE}}
++ (1 - \alpha)\mathcal{L}_{\text{KD}}
+$$
+For the final run, $\alpha = 0.3$ and $T = 2.0$, where $T$ is the softmax temperature used by the KD term. In this codebase, $\alpha$ is the cross-entropy weight, so the KD term receives the complementary weight $1 - \alpha = 0.7$.
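The objective above can be sketched for a single supervised token in plain Python. This is an illustration only, not the contents of `src/losses.py`: the function names are hypothetical, and it assumes a forward KL from teacher to student on temperature-softened distributions with no extra $T^2$ scaling.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of logits / temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_z for x in scaled]


def mixed_token_loss(student_logits, teacher_logits, target_id,
                     alpha=0.3, temperature=2.0):
    """Return (total, ce, kd) for one supervised token position."""
    # Hard-label cross-entropy on the student distribution at T = 1.
    ce = -log_softmax(student_logits)[target_id]
    # Forward KL(teacher || student) on temperature-softened distributions.
    # Assumption: the real loss may differ in direction or scaling.
    s_logp = log_softmax(student_logits, temperature)
    t_logp = log_softmax(teacher_logits, temperature)
    kd = sum(math.exp(t) * (t - s) for t, s in zip(t_logp, s_logp))
    return alpha * ce + (1.0 - alpha) * kd, ce, kd


total, ce, kd = mixed_token_loss([2.0, 0.5, -1.0], [3.0, 0.0, -2.0], target_id=0)
print(f"total={total:.4f} ce={ce:.4f} kd={kd:.4f}")
```

With `alpha=0.3`, the KD term carries 70% of the total, so the soft teacher distribution dominates the gradient while the CE term keeps the student anchored to the reference tokens.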
+## Data Flow
+`src/download.py` prepares the training data. It handles both pre-tokenized rows and raw instruction data. For raw rows, it normalizes common conversation schemas, applies the tokenizer chat template, and builds an assistant-only `loss_mask`.
+Important details:
+- Prompt and formatting tokens are masked out.
+- Assistant response tokens receive loss.
+- Samples longer than `max_seq_len` are rejected rather than silently truncated.
+- The tokenizer contract is later validated to avoid teacher/student vocabulary mismatches.
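The masking and rejection rules listed above can be sketched as follows. The segment representation (a list of `(role, token_ids)` pairs, with `"template"` standing in for chat headers and separators) is a hypothetical simplification; the real pipeline derives these spans from the tokenizer chat template.

```python
def build_example(segments, max_seq_len):
    """Build input_ids and an assistant-only loss_mask, or return None to reject.

    segments: list of (role, token_ids); role is "user", "assistant",
    or "template" (hypothetical label for formatting tokens).
    """
    input_ids, loss_mask = [], []
    for role, token_ids in segments:
        input_ids.extend(token_ids)
        # Only assistant response tokens are supervised targets.
        loss_mask.extend([1 if role == "assistant" else 0] * len(token_ids))
    if len(input_ids) > max_seq_len:
        return None  # reject over-length rows instead of truncating them
    if not any(loss_mask):
        return None  # reject rows that carry no assistant targets
    return {"input_ids": input_ids, "loss_mask": loss_mask}


segments = [
    ("template", [1, 2]),
    ("user", [10, 11, 12]),
    ("template", [3]),
    ("assistant", [20, 21]),
    ("template", [4]),
]
example = build_example(segments, max_seq_len=16)
rejected = build_example(segments, max_seq_len=4)
print(example["loss_mask"], rejected)
```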
+## Sequence Packing
+`src/sequence_packing.py` implements deterministic first-fit decreasing packing. It places multiple shorter samples into fixed-length bins, separated by EOS tokens.
+Packing properties:
+- Training split is packed; validation can remain unpacked for interpretability.
+- Bins are fixed at `pack_length = 4096` in the final profile.
+- EOS separators have `loss_mask = 0`.
+- The first token after a separator is optionally masked to avoid cross-sample target leakage.
+- Attention masks are built from the true packed length, not by comparing token IDs against `pad_token_id`.
+The attention-mask detail is important because Qwen tokenizers can assign the same ID to the padding token and an EOS-like token. A mask derived from `input_ids != pad_token_id` would then also hide real EOS separators inside a packed bin.
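A minimal sketch of the packing properties described above, assuming each sample is a dict with `input_ids` and `loss_mask`. The function name, the tie-breaking rule, and the choice to pad with the EOS ID are assumptions made for illustration; they are not taken from `src/sequence_packing.py`.

```python
def pack_first_fit_decreasing(samples, pack_length, eos_id, mask_first_after_sep=True):
    """Pack samples into fixed-length bins with EOS separators between samples."""
    # Longest first; ties broken by original index so the result is deterministic.
    order = sorted(range(len(samples)), key=lambda i: (-len(samples[i]["input_ids"]), i))
    bins, used = [], []
    for i in order:
        n = len(samples[i]["input_ids"])
        if n > pack_length:
            raise ValueError(f"sample {i} is longer than pack_length")
        for b in range(len(bins)):
            if used[b] + 1 + n <= pack_length:  # +1 for the EOS separator
                bins[b].append(i)
                used[b] += 1 + n
                break
        else:
            bins.append([i])
            used.append(n)

    packed = []
    for members in bins:
        ids, mask = [], []
        for pos, i in enumerate(members):
            sample_mask = list(samples[i]["loss_mask"])
            if pos > 0:
                ids.append(eos_id)
                mask.append(0)  # separators never receive loss
                if mask_first_after_sep and sample_mask:
                    sample_mask[0] = 0  # avoid a target predicted across samples
            ids.extend(samples[i]["input_ids"])
            mask.extend(sample_mask)
        true_len = len(ids)
        pad = pack_length - true_len
        packed.append({
            # Padding deliberately reuses eos_id to show why the mask below
            # must come from true_len and not from token identity.
            "input_ids": ids + [eos_id] * pad,
            "loss_mask": mask + [0] * pad,
            "attention_mask": [1] * true_len + [0] * pad,
        })
    return packed


samples = [{"input_ids": [10 + i] * n, "loss_mask": [1] * n}
           for i, n in enumerate([5, 3, 2, 6])]
packed = pack_first_fit_decreasing(samples, pack_length=8, eos_id=0)
for row in packed:
    print(row["input_ids"], row["attention_mask"])
```

In the second bin the separator sits in the middle of attended tokens, so a mask computed as `input_ids != eos_id` would wrongly hide it whenever the pad and EOS IDs coincide.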
+## Online KD Memory Strategy
+Full-vocabulary KD is expensive because both student and teacher produce logits shaped as:
+$$
+\text{student\_logits},\ \text{teacher\_logits}
+\in \mathbb{R}^{B \times S \times |V|}
+$$
+The implementation keeps this feasible by chunking along the token dimension with:
+$$
+C_{\text{KD}} = 2048
+$$
+Each chunk computes the teacher softmax, student log-softmax, and masked KL contribution, then accumulates the result. This preserves the dense teacher distribution while avoiding a single large KL workspace.
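The chunked accumulation can be illustrated without tensors. The sketch below assumes flattened, token-major rows of logits and a mean over supervised tokens; both are assumptions, and the real implementation operates on GPU tensors. The point it demonstrates is that the result does not depend on the chunk size, only the peak workspace does.

```python
import math


def log_softmax(row, temperature):
    scaled = [x / temperature for x in row]
    peak = max(scaled)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_z for x in scaled]


def chunked_masked_kl(student_rows, teacher_rows, loss_mask,
                      temperature=2.0, chunk_size=2048):
    """Mean KL(teacher || student) over masked-in tokens, one chunk at a time."""
    kl_sum, n_targets = 0.0, 0
    for start in range(0, len(loss_mask), chunk_size):
        stop = start + chunk_size
        # Only this slice of rows needs softmax workspace at any one time.
        chunk = zip(student_rows[start:stop], teacher_rows[start:stop],
                    loss_mask[start:stop])
        for s_row, t_row, keep in chunk:
            if not keep:
                continue  # prompt, separator, and padding positions
            s_logp = log_softmax(s_row, temperature)
            t_logp = log_softmax(t_row, temperature)
            kl_sum += sum(math.exp(t) * (t - s) for t, s in zip(t_logp, s_logp))
            n_targets += 1
    return kl_sum / max(n_targets, 1)


student_rows = [[0.1 * i, 1.0, -0.5] for i in range(10)]
teacher_rows = [[0.2 * i, 0.5, -1.0] for i in range(10)]
loss_mask = [0, 0, 1, 1, 1, 0, 1, 1, 1, 1]
small = chunked_masked_kl(student_rows, teacher_rows, loss_mask, chunk_size=3)
whole = chunked_masked_kl(student_rows, teacher_rows, loss_mask, chunk_size=10)
print(small, whole)
```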
+## Validation, Provenance, And Safety Checks
+Several modules exist to prevent silent training corruption:
+- `src/provenance.py`: Validates tokenizer contracts, vocab sizes, revisions, and teacher-logit metadata.
+- `src/kd_contracts.py`: Builds deterministic tokenizer fingerprints.
+- `src/training_schedule.py`: Aligns train/validation splits with batch and gradient-accumulation constraints.
+- `src/checkpoints.py`: Saves model, tokenizer, scheduler, trainer state, and packing metadata; validates resume compatibility.
+- `src/transformers_compat.py`: Resolves attention backend and formats model-loading errors.
+## SFT Layer
+The `sft/` directory contains the post-KD alignment layer:
+- `sft/train_sft.py`: SFT training with optional sequence packing, LoRA/QLoRA paths, and built-in spot evaluations.
+- `sft/evaluate.py`: EvalPlus and lm-evaluation-harness orchestration.
+- `sft/chat.py`: Local interactive chat wrapper using the tokenizer chat template.
+This stage is intentionally separate from KD. KD transfers the teacher's probability structure; SFT teaches the model how to expose that capability in the intended assistant format.

docs/benchmarks.md ADDED

	@@ -0,0 +1,58 @@

+# Benchmarks
+The release scoreboard compares Qwen3-1.7B-Base, Qwen3-1.7B-Instruct, and Quintus-1.7B. Evaluations use a mixture of EvalPlus and lm-evaluation-harness style benchmarks, with greedy or deterministic settings where applicable.
+For the detailed benchmark-control rules, see [Evaluation Methodology](evaluation_methodology.md).
+## Final Scoreboard
+| Benchmark | Qwen3-1.7B-Base | Qwen3-1.7B-Instruct | Quintus-1.7B |
+| :--- | :---: | :---: | :---: |
+| HumanEval pass@1 | 67.1% | 70.7% | 67.7% |
+| MBPP pass@1 | 67.2% | 58.2% | 64.8% |
+| GSM8K, 10-shot flexible | 69.98% | 69.75% | 74.30% |
+| ARC-Challenge acc_norm | 55.72% | 52.99% | 58.36% |
+| WinoGrande, 5-shot | 65.67% | 61.01% | 66.38% |
+| PIQA acc_norm | 75.63% | 72.09% | 75.57% |
+## Interpretation
+The strongest result is the reasoning crossover: Quintus beats both the base and the official 1.7B instruct model on GSM8K, ARC-Challenge, and WinoGrande, despite remaining at the same parameter scale.
+The coding picture is mixed but useful:
+- HumanEval remains slightly below Qwen3-1.7B-Instruct.
+- MBPP is substantially above Qwen3-1.7B-Instruct, though still below the base model.
+This suggests the model gained useful instruction-following and reasoning behavior, but it does not close the HumanEval gap to the instruct model, and it should not be expected to match larger or more heavily aligned code-specialized models.
+## What The Benchmarks Support
+These results support four claims:
+1. Online KD transferred reasoning capability into a compact student.
+2. The final model did not merely memorize assistant formatting; it improved several reasoning and commonsense metrics.
+3. SFT helped expose the distilled capability in an assistant setting.
+4. The model still has capacity limits typical of the 1.7B scale, especially on code execution reliability and long multi-step algorithm generation.
+## Evaluation Caveats
+Benchmark comparisons are sensitive to prompt format. Raw completion, chat-template generation, and log-likelihood multiple-choice scoring can produce different rankings. For fair interpretation:
+- Compare raw models against raw models when measuring base reasoning.
+- Compare chat-wrapped models against chat-wrapped models when measuring format alignment.
+- Treat open-ended qualitative prompts as alignment tests, not as a replacement for standardized benchmarks.
+Important implementation caveats:
+- GSM8K extraction can differ between strict `####` parsing and flexible number extraction.
+- Multiple-choice log-likelihood tasks can be distorted by chat templates.
+- `acc_norm` is preferred when answer-option length bias can change the ranking.
+- Metric extraction scripts must reject `stderr` and `alias` fields when looking for the actual score.
+- Runtime versions should be recorded with benchmark outputs because harness behavior can change across releases.
+## Earlier Development Signals
+Before the final Qwen3 8B -> 1.7B run, earlier experiments showed that sparse offline top-k KD could not consistently outperform strong baselines. Those runs were useful because they identified the bottleneck: sparse cached teacher logits were not dense enough to transfer deeper reasoning pathways.
+The final move to online full-vocabulary KD is the key methodological change behind the stronger final results.

docs/engineering_insights.md ADDED

	@@ -0,0 +1,152 @@

+# Engineering Insights
+This project evolved through several failed and successful training designs. The useful lessons are summarized here as public engineering notes.
+For expanded operational detail, see [Training Playbook](training_playbook.md), [Pipeline Hardening](pipeline_hardening.md), and [Evaluation Methodology](evaluation_methodology.md).
+## 1. Sparse Offline KD Hit A Ceiling
+The earliest distillation path cached only a small top-k slice of teacher logits. That made training cheaper, but it discarded most of the teacher distribution. With a vocabulary of roughly 151K tokens and $k = 8$, the visible support was:
+$$
+\frac{k}{|V|}
+= \frac{8}{151{,}665}
+\approx 5.3 \times 10^{-5}
+= 0.0053\%
+$$
+The result was clear: top-k KD could perturb the student, but it did not transfer enough "dark knowledge" to reliably improve reasoning. Different alphas, epochs, and student initializations could not escape this sparse-signal ceiling.
+The final fix was to use online KD: load teacher and student together, run both forward passes, and compute KL against the teacher's full vocabulary distribution.
+## 2. Base Student Was Better Than Fighting An Aligned Space
+Distilling into an already instruction-tuned student can cause destructive interference. The student's weights already encode one aligned behavior manifold, while the teacher's soft logits pull toward another. Training can look numerically stable while reasoning metrics regress.
+The final path uses `Qwen/Qwen3-1.7B-Base` as the student. The base model has more plasticity, while the CE term and later SFT stage teach assistant formatting.
+## 3. KD And Alignment Are Different Problems
+Standardized benchmarks showed that KD can improve reasoning and calibration, but open-ended chat quality still needs alignment data.
+The important diagnosis:
+- A distillation failure means the student did not absorb the teacher's useful probability structure.
+- An alignment gap means the student has capability, but the generation path is not yet trained to behave like a polished assistant.
+The project therefore separates the pipeline into KD first, then SFT.
+## 4. Assistant-Only Loss Masking Matters
+A key bug class was assigning loss to chat formatting tokens instead of only assistant response content. If the model is trained to optimize structural tokens too heavily, it can learn formatting before substance.
+The current tokenization path derives an assistant-only `loss_mask`, so:
+- User prompts are context, not targets.
+- Chat headers and separators are masked.
+- Assistant response tokens are the only supervised targets.
+This keeps training focused on semantic outputs rather than wrapper reproduction.
+## 5. Sequence Packing Was The Main Throughput Win
+The dataset contains many sequences shorter than the maximum context length. Dynamic padding wastes a large fraction of compute. First-fit decreasing sequence packing converted that waste into useful tokens.
+Observed engineering outcome:
+- Unpacked B200 online KD ran around the low-20K tokens/sec range in earlier probes.
+- Packed B200 online KD reached roughly the mid-40K tokens/sec range after warmup.
+- Packed utilization was close to full 4096-token bins.
+The final code keeps packing deterministic and stores packing metadata in checkpoints so packed/unpacked resume mismatches fail loudly.
+## 6. Full-Vocab KD Needed Token Chunking
+Online KD preserves the full teacher distribution, but a full KL workspace at Qwen vocabulary scale is too large to materialize casually:
+$$
+\text{KL workspace} \sim \mathbb{R}^{B \times S \times |V|}
+$$
+The solution is token-dimension chunking. The current implementation uses:
+$$
+C_{\text{KD}} = 2048
+$$
+Larger chunks reduce loop overhead, but increase temporary memory pressure. The selected value is a practical B200-oriented balance for the 8B -> 1.7B workload.
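A rough size estimate makes the trade-off concrete. The sketch assumes a float32 workspace, the vocabulary size quoted earlier in this document, and the `micro_batch_size` and `pack_length` values from `configs/config.yaml`. It also assumes the chunk size counts flattened tokens. Real peak memory depends on dtype, on how many intermediate buffers are live, and on allocator behavior.

```python
def logits_bytes(num_tokens, vocab_size, bytes_per_element=4):
    """Size of one dense [num_tokens, vocab_size] buffer in bytes."""
    return num_tokens * vocab_size * bytes_per_element


VOCAB_SIZE = 151_665       # value quoted in this document
MICRO_BATCH, PACK_LEN = 4, 4096
CHUNK_TOKENS = 2048        # online_kd_token_chunk_size

full_buffer = logits_bytes(MICRO_BATCH * PACK_LEN, VOCAB_SIZE)
chunk_buffer = logits_bytes(CHUNK_TOKENS, VOCAB_SIZE)

print(f"full  : {full_buffer / 2**30:.2f} GiB per buffer")
print(f"chunk : {chunk_buffer / 2**30:.2f} GiB per buffer")
print(f"ratio : {full_buffer // chunk_buffer}x")
```

Each live intermediate (teacher probabilities, student log-probabilities, the elementwise KL product) costs one such buffer, which is why shrinking the per-buffer size matters more than the loop overhead it introduces.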
+## 7. Shape Churn And Synchronization Can Quietly Drain Throughput
+Several performance bugs were not correctness bugs:
+- Dynamic sequence lengths caused allocator churn.
+- Repeated `.item()` calls forced CPU-GPU synchronization.
+- Single-GPU DeepSpeed could add overhead when the model already fit comfortably.
+- `torch.compile` added memory overhead, dynamic-shape graph breaks, recompile overhead, and checkpoint portability risk.
+The final training loop favors stable shapes, fewer scalar syncs, fused AdamW when available, FlashAttention when available, and Liger kernels where they do not conflict with KD logits.
+## 8. Evaluation Requires Controlled Comparisons
+Raw completion and chat-template evaluation activate different behavior. A base model can perform well in raw mode and poorly under chat markup. A chat-aligned model can underperform on raw continuation-style tasks if the benchmark asks for direct option likelihoods.
+The project uses both controls:
+- Raw-to-raw comparisons isolate distilled base capability.
+- Chat-to-chat comparisons estimate template robustness and assistant-format alignment.
+This distinction avoids blaming KD for failures that belong to alignment or benchmark formatting.
+## 9. Post-KD SFT Is Not Optional For Assistant Quality
+KD transfers probability structure; it does not guarantee careful behavior, refusal policy, calibrated uncertainty, or code reliability. Targeted SFT was added to address:
+- Confident hallucination in open-ended answers.
+- Persona and identity consistency.
+- Repetition loops.
+- Chat-format stability.
+- Practical assistant presentation.
+Preference training or DPO would be the natural next layer if the project continues beyond the current release.
+## 10. Training Loss Is Not The Release Gate
+Several development runs looked numerically healthy while downstream benchmarks moved in the wrong direction. That pattern is expected when the training objective is only a proxy for the release objective.
+Useful release gates:
+- Standardized benchmarks.
+- Raw and chat controls.
+- Mismatch inspection.
+- Qualitative prompts after benchmark checks.
+- Weight and checkpoint structure audits.
+Held-out KD validation loss is important, but it cannot prove that the model improved on math, code, multiple-choice reasoning, or assistant behavior.
+## 11. Fail-Fast Beats Silent Recovery
+The pipeline hardened around a simple rule: corrupt artifacts should stop the run.
+Examples:
+- Missing teacher-logit shards fail instead of becoming zero tensors.
+- Tokenization with zero usable rows fails immediately.
+- Shard schema mismatches are rejected.
+- Packed/unpacked checkpoint resume mismatches are rejected.
+- Stale evaluation outputs are cleaned before new scores are written.
+This makes errors louder, but it keeps published numbers trustworthy.
+## 12. Public Docs Should Preserve Decisions
+A release-quality project should expose durable engineering conclusions:
+- why online KD replaced offline top-k KD,
+- why assistant-only masking matters,
+- why raw/chat evaluation controls are required,
+- why sequence packing changed throughput,
+- why SFT remains necessary after KD,
+- why checkpoint and provenance checks exist.
+That level of detail is enough for technical readers without turning the documentation into a chronological run journal.

docs/evaluation_methodology.md ADDED

	@@ -0,0 +1,234 @@

+# Evaluation Methodology
+Evaluation was one of the hardest parts of Quintus. Several early scores were misleading until prompt format, metric extraction, parser behavior, and runtime artifacts were audited carefully.
+## Evaluation Principle
+A model comparison is only meaningful when the prompt format and metric path match the question being asked.
+Two distinct questions matter:
+- Base capability: Does the distilled model improve raw reasoning and likelihood behavior?
+- Assistant behavior: Does the distilled model handle chat formatting and produce usable responses?
+Those questions need separate controls.
+## Run Identity And Determinism
+A benchmark record should identify the checkpoint role (`best` or `last`), exact model directory or revision, prompt mode, seeds, decoding mode, and runtime versions. Greedy decoding with fixed seeds makes repeated runs easier to compare, but hardware and kernel drift can still change edge cases.
+Treat determinism as an artifact contract, not a vague claim.
+## Raw-To-Raw And Chat-To-Chat
+Raw completion and chat-template prompting activate different model behavior. A base model can be strong in raw mode and weak under chat markup. An instruct model can be strong in chat, but weak on raw continuation-style likelihood tasks.
+Recommended controls:
+- Raw-to-raw: compare base-style prompts against base-style prompts.
+- Chat-to-chat: compare chat-wrapped prompts against chat-wrapped prompts.
+- Raw-vs-chat within the same model: measure format tax.
+Avoid comparing a chat-wrapped distilled model directly against a raw base baseline and treating the delta as pure capability transfer.
+## Log-Likelihood Tasks Should Usually Stay Raw
+Multiple-choice tasks such as ARC-Challenge, HellaSwag, and PIQA often score options by likelihood:
+$$
+P(\text{option}\mid\text{prompt})
+$$
+Wrapping the prompt in chat markup changes the next-token distribution. An aligned model may not want to begin a response with a bare option string after `<|im_start|>assistant`, so option likelihoods can fall for formatting reasons rather than reasoning reasons.
+For log-likelihood tasks:
+- Use raw completion format unless the benchmark was designed for chat.
+- Prefer `acc_norm` where length bias matters.
+- Record whether chat templates were applied.
+## GSM8K Parser Traps
+GSM8K evaluation can be distorted by parser behavior.
+Two common filters behave differently:
+- `strict-match`: looks for an answer after the `####` delimiter.
+- `flexible-extract`: searches for numbers and may choose the last matched number.
+A chat model can solve the problem, emit the correct `####` answer, miss EOS, and continue into a hallucinated next dialogue turn containing another number. In that case:
+- `strict-match` may score the response correct.
+- `flexible-extract` may grab the later hallucinated number and score it wrong.
+This is not just a parser detail. It reveals an EOS and prompt-format interaction.
+Mitigations:
+- Register all relevant EOS tokens, including `<|im_end|>` and `<|endoftext|>`.
+- Use deterministic generation for benchmark runs.
+- Avoid excessive `fewshot_as_multiturn` wrapping unless the model was trained for that shape.
+- Inspect mismatches, not just aggregate scores.
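The strict/flexible divergence described above can be reproduced with a toy example. The regular expressions here are simplified stand-ins, not the patterns used by `lm-evaluation-harness`, and the response text is invented for illustration.

```python
import re

NUMBER = r"-?\d[\d,]*(?:\.\d+)?"


def strict_match(text):
    """Return the number that follows the #### delimiter, if any."""
    m = re.search(r"####\s*(" + NUMBER + ")", text)
    return m.group(1).replace(",", "") if m else None


def flexible_extract(text):
    """Return the last number found anywhere in the text, if any."""
    found = re.findall(NUMBER, text)
    return found[-1].replace(",", "") if found else None


def truncate_at_eos(text, eos_markers=("<|im_end|>", "<|endoftext|>")):
    """Cut the text at the first registered EOS marker."""
    cut = len(text)
    for marker in eos_markers:
        idx = text.find(marker)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]


# Correct answer, then a run-on into a hallucinated next turn.
response = (
    "She sells 9 eggs at $2 each, so she makes 18 dollars.\n"
    "#### 18<|im_end|>\n<|im_start|>user\nWhat is 7 + 5?"
)

print(strict_match(response))                       # answer after ####
print(flexible_extract(response))                   # number from the run-on turn
print(flexible_extract(truncate_at_eos(response)))  # answer once EOS is honored
```

Once generation stops at the first registered EOS marker, both filters agree, which is why the EOS registration belongs in the mitigation list.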
+## Reasoning Models Need Enough Generation Budget
+Instruction-tuned reasoning models may spend hundreds of tokens inside a reasoning trace before reaching the final answer. If `max_new_tokens` is too small, the model can be cut off before emitting the final answer marker.
+That can make a capable model appear weak under exact-match metrics.
+For fair GSM8K-style generation:
+- Set a sufficient generation limit.
+- Track truncation rate.
+- Compare extracted answers against raw responses during audits.
+## Batched Generation Details
+Decoder-only batched generation should use left-padding. Right-padding can put the next-token position on padding for shorter prompts and make batched outputs differ from single-sample outputs.
+Generation parsers should:
+- Set `tokenizer.padding_side = "left"` for batched generation.
+- Slice decoded continuations by each prompt's true input length.
+- Stop at the first registered EOS token.
+- Record truncation and empty-generation counts.
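A token-level sketch of these parser rules, assuming the generator returns prompt-plus-continuation rows (as Hugging Face `generate` does for decoder-only models). The helper names are hypothetical. Under left-padding every row's continuation starts at the padded batch width, so that is the slice point used here.

```python
def left_pad(prompts, pad_id):
    """Left-pad token-id lists to a common width and build attention masks."""
    width = max(len(p) for p in prompts)
    input_ids = [[pad_id] * (width - len(p)) + list(p) for p in prompts]
    attention_mask = [[0] * (width - len(p)) + [1] * len(p) for p in prompts]
    return input_ids, attention_mask


def extract_continuation(output_ids, prompt_width, eos_ids):
    """Return (continuation_tokens, truncated) for one generated row."""
    continuation = output_ids[prompt_width:]
    for i, token in enumerate(continuation):
        if token in eos_ids:
            return continuation[:i], False  # stopped at the first EOS
    return continuation, True  # no EOS seen: count as truncated


prompts = [[5, 6, 7], [8]]
input_ids, attention_mask = left_pad(prompts, pad_id=0)
width = len(input_ids[0])

# Simulated generator outputs: row 0 never emits EOS, row 1 stops at 99.
outputs = [
    [5, 6, 7, 31, 32, 33, 34, 35],
    [0, 0, 8, 21, 22, 99, 0, 0],
]
results = [extract_continuation(row, width, eos_ids={99}) for row in outputs]
truncation_rate = sum(truncated for _, truncated in results) / len(results)
print(results, truncation_rate)
```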
+## English-Only Evaluation Controls
+For English-only release checks, filtering the dataset is necessary but not sufficient. Evaluation should also use an English-only system instruction when chat prompts are enabled, register all relevant EOS IDs, and clean generated artifacts that continue into another language after the intended answer.
+This cleanup is an evaluation-artifact guard. It is not a substitute for training data quality, SFT, preference tuning, or behavioral calibration.
+## Metric Extraction Must Be Strict
+Post-processing scripts should never fall back loosely to any metric key that starts with the right prefix. A loose fallback can accidentally read:
+- `*_stderr`
+- `alias`
+- a different filter result
+Robust extraction should:
+- Match the exact metric and filter name.
+- Ignore stderr and alias fields when extracting scores.
+- Fail loudly if the expected key is absent.
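A sketch of strict extraction, assuming a per-task results dict keyed as `"<metric>,<filter>"` with sibling `"<metric>_stderr,<filter>"` and `"alias"` entries. That key layout is an assumption about the harness output and should be checked against the installed version. The numbers below are placeholders, not benchmark results.

```python
def extract_metric(task_results, metric, filter_name):
    """Return the score stored under the exact '<metric>,<filter>' key."""
    key = f"{metric},{filter_name}"
    if key not in task_results:
        # No prefix matching and no fallback: a missing key is an error.
        raise KeyError(f"expected key {key!r}; found {sorted(task_results)}")
    value = task_results[key]
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise TypeError(f"{key!r} is not numeric: {value!r}")
    return float(value)


# Placeholder values for illustration only.
example_results = {
    "alias": "gsm8k",
    "exact_match,strict-match": 0.25,
    "exact_match_stderr,strict-match": 0.01,
    "exact_match,flexible-extract": 0.5,
    "exact_match_stderr,flexible-extract": 0.02,
}

score = extract_metric(example_results, "exact_match", "flexible-extract")
print(score)
```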
+## Boolean CLI Flags
+Some harness flags use `action="store_true"`. Passing `"False"` after such a flag does not disable it; the presence of the flag enables it.
+Correct pattern:
+- Include the flag only when true.
+- Omit the flag when false.
+This matters for options such as multiturn few-shot formatting.
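The behavior can be checked with a standalone `argparse` parser. The flag name mirrors the option mentioned above, but the parser is a stand-in, not the harness CLI. Depending on how a tool parses its arguments, the stray `False` is either rejected as an unrecognized argument or left over and ignored; in neither case does it disable the flag.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument("--fewshot_as_multiturn", action="store_true")
    return parser


def flag_args(fewshot_as_multiturn):
    """Emit the flag only when it should be enabled."""
    return ["--fewshot_as_multiturn"] if fewshot_as_multiturn else []


parser = build_parser()

# Wrong: the flag is present, so it is enabled; "False" is just a leftover.
wrong, leftover = parser.parse_known_args(["--fewshot_as_multiturn", "False"])
print(wrong.fewshot_as_multiturn, leftover)

# Right: omit the flag entirely to disable it.
right = parser.parse_args(flag_args(False))
print(right.fewshot_as_multiturn)
```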
+## Sample Log Format
+`lm-evaluation-harness` may log different filters for the same document as separate JSONL objects with the same `doc_id`. A parser that assumes one object contains all filters can crash or silently compare the wrong fields.
+Correct approach:
+- Group sample records by `doc_id`.
+- Index filter-specific records inside each group.
+- Compare strict and flexible outputs from the same document.
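The grouping approach can be sketched as below. The field names `filter` and `exact_match` are assumptions about the sample-log schema, and the records are invented; adjust them to whatever the installed harness version writes.

```python
from collections import defaultdict


def group_by_doc(records):
    """Index sample records as {doc_id: {filter_name: record}}."""
    grouped = defaultdict(dict)
    for record in records:
        doc_id, filter_name = record["doc_id"], record["filter"]
        if filter_name in grouped[doc_id]:
            raise ValueError(f"duplicate record for doc {doc_id}, filter {filter_name}")
        grouped[doc_id][filter_name] = record
    return dict(grouped)


def filter_disagreements(grouped, metric="exact_match",
                         filters=("strict-match", "flexible-extract")):
    """Return doc_ids whose two filters scored the same document differently."""
    first, second = filters
    disagreeing = []
    for doc_id in sorted(grouped):
        by_filter = grouped[doc_id]
        missing = set(filters) - set(by_filter)
        if missing:
            raise KeyError(f"doc {doc_id} is missing filters: {sorted(missing)}")
        if by_filter[first][metric] != by_filter[second][metric]:
            disagreeing.append(doc_id)
    return disagreeing


records = [
    {"doc_id": 0, "filter": "strict-match", "exact_match": 1.0},
    {"doc_id": 0, "filter": "flexible-extract", "exact_match": 0.0},
    {"doc_id": 1, "filter": "strict-match", "exact_match": 1.0},
    {"doc_id": 1, "filter": "flexible-extract", "exact_match": 1.0},
]
grouped = group_by_doc(records)
disagreements = filter_disagreements(grouped)
print(disagreements)
```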
+## JSONL Parsing With Unicode Line Separators
+Model outputs can contain Unicode line separator characters such as `\u2028` or `\u2029`. Calling `str.splitlines()` on a whole JSONL file can split a valid JSON string into invalid fragments.
+Robust JSONL parsing:
+```python
+import json
+
+# Iterate the file handle so lines end only at real newline characters.
+with open(path, "r", encoding="utf-8") as f:
+    for line in f:
+        if line.strip():
+            record = json.loads(line)
+Iterating the file handle respects actual line endings and does not split on Unicode separators inside JSON strings.
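The failure mode is easy to reproduce. The sketch writes two invented records to a temporary JSONL file with `ensure_ascii=False`, so that the `\u2028` character is stored raw, then compares `str.splitlines()` against file-handle iteration.

```python
import json
import os
import tempfile

records = [{"text": "line one\u2028line two"}, {"text": "plain"}]
payload = "\n".join(json.dumps(r, ensure_ascii=False) for r in records) + "\n"

# str.splitlines() also splits on U+2028, cutting the first record in two.
fragments = payload.splitlines()

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "samples.jsonl")
    with open(path, "w", encoding="utf-8") as f:
        f.write(payload)
    # File iteration splits only on real line endings.
    with open(path, "r", encoding="utf-8") as f:
        parsed = [json.loads(line) for line in f if line.strip()]

print(len(fragments), len(parsed))
```

Writers that keep the default `ensure_ascii=True` escape the separator and avoid the problem, but a parser cannot assume every producer does that.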
+## Hub Loading And Snapshot Hygiene
+If weights or datasets are stored on the Hub, the client should be told the correct repository type (for example, `repo_type="dataset"` for dataset repositories in `huggingface_hub`). Download or snapshot the artifact first, verify that expected files exist, then pass the local directory to Transformers, vLLM, or the evaluation harness.
+This separates transfer failures from engine construction and avoids repeated downloads during long benchmark runs.
+Optional high-throughput Hub transfer backends such as `hf_transfer` can reduce setup time, but the correctness contract is still local snapshot validation.
+## Path-Length And Output Artifacts
+Evaluation tools can derive output paths from model paths. Deep Hugging Face cache paths can become extremely long after sanitization, especially on Windows.
+Public guidance:
+- Copy or symlink model weights to a short local directory before evaluation.
+- Pass short relative paths to the evaluator.
+- Keep result directories shallow.
+- Fail if expected sample files are missing.
+This prevents silent write failures and missing-output confusion.
+## vLLM Evaluation Settings
+For large benchmark runs, vLLM can greatly reduce runtime through continuous batching and KV-cache management.
+Useful settings in development:
+- `batch_size = auto`
+- prefix caching enabled
+- PagedAttention-backed KV-cache management when available
+- bounded GPU memory utilization
+- explicit `max_model_len` where context bounds matter
+- explicit attention backend where the runtime supports it
+- local pre-caching of model snapshots before engine construction
+- explicit engine teardown between model runs
+The benchmark artifact should record runtime versions for:
+- `lm-eval`
+- `vllm`
+- `transformers`
+- `torch`
+- `datasets`
+- `accelerate`
+Version drift can change metric keys, generation behavior, attention backends, and output formats.
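One way to capture these versions next to the benchmark output is the standard-library `importlib.metadata` module. The distribution names below are assumptions (in particular `lm_eval` for the harness) and the function name is hypothetical. Packages that are not installed are recorded as `None` instead of being skipped.

```python
import json
from importlib import metadata

# Assumed distribution names; adjust to match the environment.
PACKAGES = ("lm_eval", "vllm", "transformers", "torch", "datasets", "accelerate")


def collect_versions(packages=PACKAGES):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


versions = collect_versions()
print(json.dumps({"runtime_versions": versions}, indent=2))
```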
+## Qualitative Evaluation
+Open-ended prompt suites are useful, but they are not replacements for standardized benchmarks.
+A good qualitative suite should:
+- Compare raw and chat modes separately.
+- Use fixed prompts and deterministic ordering.
+- Include benchmark-template leakage probes.
+- Include factual, math, code, system design, and LLM-internals prompts.
+- Record complete outputs.
+- Inspect inherited base-model errors separately from new chat-mode errors.
+Qualitative failures should be classified:
+- Distillation failure: the student did not absorb useful teacher probability structure.
+- Alignment gap: capability exists, but the generation path lacks SFT, preference tuning, or calibration.
+- Data contamination: the model repeats benchmark or pretraining artifacts.
+- Code reliability gap: prose is correct, but generated code violates stated constraints.
+This distinction prevents the wrong fix. Distillation failures need KD changes. Alignment gaps need SFT, DPO, RLHF, or curated behavior data.
+## Release Gate
+The final checkpoint should pass all of these before public claims are made:
+- Benchmark tasks use the intended prompt format.
+- Metric keys are exact.
+- Sample counts match the full benchmark set.
+- Raw and chat comparisons are not mixed.
+- Generation limits are sufficient for the model style.
+- Checkpoint identity is explicit.
+- Missing requested checkpoints fail instead of falling back to older local weights.
+- Runtime versions are recorded.
+- Mismatch samples are inspected for parser artifacts.
+- No stale result directory or old JSON file is reused.

docs/experiment_timeline.md ADDED

	@@ -0,0 +1,181 @@

+# Experiment Timeline
+This timeline explains why the final Quintus design looks the way it does. It focuses on the technical evolution from sparse offline distillation to the final online full-vocabulary pipeline.
+## 1. Offline Top-K KD Prototype
+The earliest design precomputed teacher logits to disk and trained the student from cached top-k supports.
+Why it was attractive:
+- Avoided loading teacher and student together.
+- Reduced KD memory from full vocabulary to top-k support.
+- Made cloud interruptions easier to survive because teacher logits were already saved.
+Main lessons:
+- Serialization contracts matter as much as loss math.
+- Top-k token IDs need safe dtypes.
+- Teacher-logit shards must preserve original row order.
+- Missing or stale shards should fail loudly.
+## 2. Static Audit And Fail-Fast Hardening
+The project then moved through a static-audit phase focused on silent failure modes.
+Major hardening themes:
+- Dataset zero-retention checks.
+- Missing-shard hard failures.
+- Stale artifact cleanup.
+- DeepSpeed accumulation correctness.
+- Rank-safe writes.
+- Explicit model revision and remote-code policy.
+- Stronger provenance metadata.
+This phase turned the code from a script bundle into a more reliable training pipeline.
+## 3. Assistant-Only Supervision
+The tokenization path originally risked supervising the whole conversation. That can over-train prompts, headers, and formatting tokens.
+The corrected path derives `loss_mask` and trains only on assistant response tokens.
+This changed the training contract:
+- Prompt tokens provide context.
+- Assistant tokens receive CE and KD loss.
+- Rows without assistant targets are rejected.
+- Checkpoints and datasets must agree on the mask schema.
+## 4. Top-K Plus Residual Bucket
+A later offline-KD pass improved the sparse support by adding an "other" bucket for teacher probability mass outside top-k.
+This fixed a mathematical weakness: the student should be normalized against the full vocabulary before comparison, not only inside top-k. The residual bucket made the offline objective better posed, but it still collapsed all teacher probability mass outside the top-k into a single scalar.
+That design was useful, but the signal remained too sparse for flagship results.
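The residual-bucket comparison described above can be sketched in plain Python. This is a minimal illustration, not the Quintus loss code: the helper names `softmax` and `topk_residual_kl`, the forward-KL direction, and the toy logits are all assumptions made for the example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a full list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def topk_residual_kl(topk_ids, topk_probs, student_logits, eps=1e-12):
    """KL(teacher || student) over k kept tokens plus one 'other' bucket.

    Hypothetical helper: the student is normalized over the full vocabulary
    first, then both distributions are reduced to k + 1 entries.
    """
    q = softmax(student_logits)
    q_topk = [q[i] for i in topk_ids]
    # Probability mass that falls outside the cached top-k support.
    p_other = max(0.0, 1.0 - sum(topk_probs))
    q_other = max(0.0, 1.0 - sum(q_topk))
    kl = 0.0
    for p, q_i in zip(list(topk_probs) + [p_other], q_topk + [q_other]):
        if p > 0.0:
            kl += p * math.log(p / max(q_i, eps))
    return kl


# Toy 4-token vocabulary with k = 2 (values chosen only for illustration).
teacher_probs = softmax([2.0, 1.0, 0.0, -1.0])
cached_ids, cached_probs = [0, 1], teacher_probs[:2]
kl_matched = topk_residual_kl(cached_ids, cached_probs, [2.0, 1.0, 0.0, -1.0])
kl_uniform = topk_residual_kl(cached_ids, cached_probs, [0.0, 0.0, 0.0, 0.0])
```

A student that matches the teacher gives a loss near zero, while the uniform student does not. The bucket still cannot tell the student how the residual mass is spread across the remaining tokens, which is the limitation noted above.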
+## 5. Dataset And Objective Mismatch
+Smoke runs showed a pattern that became important later: held-out KD validation loss can improve while benchmark quality worsens.
+Key diagnosis:
+- Matching teacher token distributions on a training corpus is not identical to improving GSM8K, ARC, coding, or open-ended assistant quality.
+- Dataset order and first-N streaming can bias sample selection.
+- Long reasoning traces can overweight style and process tokens relative to final answers.
+- Small students can forget useful baseline behavior when full-parameter training is too aggressive.
+This motivated stricter downstream evaluation gates.
+## 6. Base Student Pivot
+Several runs tested whether distilling into an already-instruct-tuned student caused destructive interference. The base-student hypothesis was plausible: a raw base model has more plasticity and fewer alignment paths to overwrite.
+The result was only a marginal improvement under offline top-k KD. That was the decisive clue.
+Conclusion:
+The student choice was not the main bottleneck. Offline top-k sparsity was the main bottleneck.
+## 7. Offline Top-K Ceiling
+With $k = 8$, the student saw only a tiny fraction of the teacher vocabulary distribution per target token:
+$$
+\frac{k}{|V|}
+= \frac{8}{151{,}665}
+\approx 5.3 \times 10^{-5}
+= 0.0053\%
+$$
+Different $\alpha$ values, epochs, and student initializations did not remove this limit.
+Offline top-k KD could perturb the student and sometimes improve narrow metrics, but it could not reliably transfer the teacher's broader reasoning distribution.
+The project stopped treating offline top-k KD as the path to a flagship model.
+## 8. Online Full-Vocabulary KD
+![Offline vs Online KD](../assets/offline_vs_online_kd.svg)
+Online KD became the final architecture.
+Instead of reading cached teacher shards, the training loop loads a frozen teacher and runs live teacher forward passes beside the student. The KD loss uses the teacher's full-vocabulary distribution.
+Benefits:
+- No top-k sparsity ceiling.
+- No shard-order mismatch risk.
+- No stale teacher-logit cache.
+- Stronger transfer signal for reasoning.
+Cost:
+- Higher VRAM footprint.
+- Teacher and student must fit together.
+- KL computation needs chunking.
+- Throughput depends heavily on packing and kernels.
+## 9. Sequence Packing And B200 Tuning
+Sequence packing converted padding waste into useful tokens.
+The packing implementation:
+- Packs only training data.
+- Keeps validation easier to interpret.
+- Uses fixed 4096-token bins.
+- Inserts masked EOS separators.
+- Stores packing metadata in checkpoints.
+- Rejects packed/unpacked resume mismatches.
+Development probes showed the expected throughput improvement, from roughly the low-20K to the mid-40K tokens/sec range on a B200, and made online KD fast enough for serious single-GPU runs.
+## 10. English-Only Final Data
+The release run focuses on English samples.
+Reasons:
+- Reduce language drift in open-ended outputs.
+- Keep the model's assistant behavior aligned with the intended release language.
+- Make qualitative evaluation cleaner.
+- Avoid CJK continuation artifacts after missed EOS.
+The tradeoff is real: removing multilingual data can reduce access to some reasoning traces. For a public English assistant, language stability is worth that tradeoff.
+## 11. Targeted SFT After KD
+Online KD transferred capability, but raw KD is not a full assistant-alignment process.
+Targeted SFT was added after KD to improve:
+- identity grounding,
+- chat format stability,
+- practical assistant style,
+- repetition control,
+- response presentation.
+This created the final two-stage public model:
+```text
+Qwen3-1.7B-Base
+  -> online full-vocab KD from Qwen3-8B
+  -> targeted SFT
+  -> Quintus-1.7B
+```
+## 12. Release Verification
+The final release surface combines:
+- benchmark scoreboard,
+- architecture documentation,
+- evaluation methodology notes,
+- pipeline hardening notes,
+- weight audit,
+- model-card draft.
+The public docs focus on reusable methods, release results, and reproducible checks.

docs/huggingface_model_card.md ADDED

	@@ -0,0 +1,178 @@

+# Quintus-1.7B
+Quintus-1.7B is a compact instruction-following assistant derived from `Qwen/Qwen3-1.7B-Base`. It was trained with online full-vocabulary knowledge distillation from a larger Qwen3-8B teacher, followed by targeted SFT for assistant behavior and generation stability.
+## Model Details
+- Base architecture: Qwen3-1.7B
+- Base checkpoint: `Qwen/Qwen3-1.7B-Base`
+- Distillation teacher: `Qwen/Qwen3-8B`
+- Training method: Online full-vocabulary KD + targeted SFT
+- Context length used in training: 4096 tokens
+- Primary language focus: English
+- Release repository: `iamrahulreddy/Quintus`
+- Attention path: FlashAttention-2 when available
+- Training kernels: Liger kernels for compatible Qwen-family operators
+- Optimizer: fused AdamW
+## Intended Use
+Quintus is intended for:
+- General assistant use.
+- Reasoning and math prompts.
+- Lightweight coding assistance.
+- Local experimentation with compact LLMs.
+- Research into online KD and small-model alignment.
+It is not intended as a safety-critical decision system. Like other compact language models, it can hallucinate, and its outputs should be verified before use in high-stakes tasks.
+## Training Summary
+The training pipeline has two main stages:
+1. Online KD: The student learns from the teacher's dense full-vocabulary probability distribution. This avoids the sparse top-k ceiling encountered in earlier offline KD experiments.
+2. SFT: The distilled checkpoint is tuned on curated instruction/persona data to improve assistant-style behavior and reduce repetition or formatting drift.
+The KD loss combines assistant-token cross entropy and teacher-student KL divergence:
+$$
+\mathcal{L}_{\text{total}}
+= \alpha \mathcal{L}_{\text{CE}}
++ (1 - \alpha)\mathcal{L}_{\text{KD}}
+$$
+For the release run, $\alpha = 0.3$ and $T = 2.0$.
+`torch.compile` was kept disabled for the final KD path because this workload showed high Inductor memory overhead, dynamic-shape graph breaks, recompile overhead, and checkpoint portability risk from `_orig_mod.` state-dict prefixes when compiled modules are not unwrapped before saving.
+## Evaluation
+| Benchmark | Qwen3-1.7B-Base | Qwen3-1.7B-Instruct | Quintus-1.7B |
+| :--- | :---: | :---: | :---: |
+| HumanEval pass@1 | 67.1% | 70.7% | 67.7% |
+| MBPP pass@1 | 67.2% | 58.2% | 64.8% |
+| GSM8K, 10-shot flexible | 69.98% | 69.75% | 74.30% |
+| ARC-Challenge acc_norm | 55.72% | 52.99% | 58.36% |
+| WinoGrande, 5-shot | 65.67% | 61.01% | 66.38% |
+| PIQA acc_norm | 75.63% | 72.09% | 75.57% |
+## Strengths
+- Strong math and reasoning transfer for the 1.7B parameter scale.
+- Good commonsense and ARC-style benchmark performance.
+- Compact enough for lower-resource deployment compared with larger teachers.
+- Public weight audit indicates healthy structural divergence from the base checkpoint without collapse.
+## Limitations
+- The model can still produce confident factual errors.
+- Code generation can contradict stated complexity constraints.
+- It is smaller than the teacher and inherits capacity limits of the 1.7B scale.
+- Evaluation results depend on prompt format; raw and chat-template modes are not interchangeable.
+- Additional preference tuning would likely improve calibration and refusal behavior.
+## Example Usage
+```python
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer
+PUBLIC_REPO_ID = "iamrahulreddy/Quintus"
+print(f"Loading Quintus from {PUBLIC_REPO_ID}...")
+tokenizer = AutoTokenizer.from_pretrained(PUBLIC_REPO_ID, trust_remote_code=True)
+model = AutoModelForCausalLM.from_pretrained(
+    PUBLIC_REPO_ID,
+    device_map="auto",
+    dtype=torch.float16,
+    trust_remote_code=True,
+)
+# Stop on either end-of-text or end-of-turn, whichever the model emits first.
+stop_tokens = ["<|endoftext|>", "<|im_end|>"]
+eos_token_ids = [tokenizer.eos_token_id] if tokenizer.eos_token_id is not None else []
+for token in stop_tokens:
+    token_id = tokenizer.convert_tokens_to_ids(token)
+    if token_id is not None and token_id not in eos_token_ids:
+        eos_token_ids.append(token_id)
+# Stream tokens to stdout as they are generated, hiding the prompt and special tokens.
+streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
+conversation_history = [
+    {
+        "role": "system",
+        "content": (
+            "You are Quintus, a highly capable AI assistant created by "
+            "Muskula Rahul. You are helpful, precise, and logically sound."
+        ),
+    }
+]
+print()
+print("Quintus Chat (type 'quit' to exit)")
+print()
+while True:
+    try:
+        user_input = input("You: ").strip()
+        if user_input.lower() in ["quit", "exit"]:
+            print("\nGoodbye!")
+            break
+        if not user_input:
+            continue
+        conversation_history.append({"role": "user", "content": user_input})
+        prompt = tokenizer.apply_chat_template(
+            conversation_history,
+            tokenize=False,
+            add_generation_prompt=True,
+        )
+        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+        print("Quintus: ", end="", flush=True)
+        with torch.no_grad():
+            outputs = model.generate(
+                **inputs,
+                max_new_tokens=512,
+                temperature=0.7,
+                top_p=0.9,
+                do_sample=True,
+                streamer=streamer,
+                pad_token_id=tokenizer.eos_token_id,
+                eos_token_id=eos_token_ids,
+            )
+        generated_ids = outputs[0][inputs.input_ids.shape[-1]:]
+        assistant_response = tokenizer.decode(
+            generated_ids,
+            skip_special_tokens=True,
+        ).strip()
+        conversation_history.append({"role": "assistant", "content": assistant_response})
+        print()
+    except KeyboardInterrupt:
+        print("\n\nGoodbye!")
+        break
+```
+## Credits
+- [Qwen Team](https://qwenlm.github.io/) and the [Qwen Hugging Face organization](https://huggingface.co/Qwen) for the Qwen3 model family.
+- [`Qwen/Qwen3-8B`](https://huggingface.co/Qwen/Qwen3-8B), used as the distillation teacher.
+- [`Qwen/Qwen3-1.7B-Base`](https://huggingface.co/Qwen/Qwen3-1.7B-Base), used as the base student checkpoint.
+- [`Qwen/Qwen3-1.7B`](https://huggingface.co/Qwen/Qwen3-1.7B), used for the tokenizer and chat-template contract.
+- [Alibaba PAI](https://huggingface.co/alibaba-pai) for [`DistilQwen_100k`](https://huggingface.co/datasets/alibaba-pai/DistilQwen_100k), the primary instruction source after filtering.
+- [Hugging Face Transformers](https://github.com/huggingface/transformers), [vLLM](https://github.com/vllm-project/vllm), [EvalPlus](https://github.com/evalplus/evalplus), [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness), [FlashAttention](https://github.com/Dao-AILab/flash-attention), and [Liger Kernel](https://github.com/linkedin/Liger-Kernel) for training and evaluation infrastructure.
+## License And Author
+This software is distributed under the MIT License. Refer to the repository [LICENSE](../LICENSE) file for full text.
+Author: Muskula Rahul - [@iamrahulreddy](https://github.com/iamrahulreddy)
+## Citation
+If you use this model or code, cite the repository and the upstream Qwen3 models.

docs/index.md ADDED

	@@ -0,0 +1,42 @@

+# Quintus Documentation
+Quintus-1.7B is a compact assistant built from the Qwen3-1.7B-Base architecture. The project uses online full-vocabulary knowledge distillation from a Qwen3-8B teacher, followed by targeted SFT for instruction style, identity grounding, and generation stability.
+This documentation summarizes the public architecture, training decisions, evaluation controls, and release artifacts for the showcase branch.
+## Reading Order
+- [Architecture](architecture.md): End-to-end pipeline, modules, data flow, and training phases.
+- [Experiment Timeline](experiment_timeline.md): How the project moved from offline top-k KD to final online full-vocabulary KD.
+- [Training Playbook](training_playbook.md): Practical training choices, memory rules, packing, kernels, and checkpointing.
+- [Pipeline Hardening](pipeline_hardening.md): Silent-failure classes and the safeguards added around artifacts, provenance, and runtime.
+- [Evaluation Methodology](evaluation_methodology.md): Benchmark controls, parser traps, raw/chat comparisons, and qualitative evaluation rules.
+- [Engineering Insights](engineering_insights.md): Condensed technical lessons and design decisions.
+- [Benchmarks](benchmarks.md): Verified evaluation results and interpretation.
+- [Weight Audit](weight_audit.md): Structural checkpoint verification and what the audit means.
+- [Hugging Face Model Card](huggingface_model_card.md): Release-page text for the public model card.
+## Project Summary
+The core thesis is simple: a small base model can absorb useful reasoning behavior from a larger instruction model if the distillation signal is dense enough and the evaluation controls are fair.
+The project initially explored sparse offline top-k distillation, but that approach hit a ceiling because the student only saw a tiny fraction of the teacher vocabulary distribution. The final pipeline pivots to online KD, where teacher and student are run together and the student receives the teacher's full-vocabulary probability distribution during training.
+After KD, a small SFT stage teaches the model how to expose that knowledge in a conversational interface. This separation matters: KD transfers capability; SFT and later preference training improve behavior, style, and confidence calibration.
+## Repository Map
+```text
+configs/        Training configuration and DeepSpeed template.
+src/            Online KD, data loading, losses, checkpointing, and packing.
+sft/            Post-KD supervised fine-tuning, chat, and consolidated evaluation runner.
+weight_audit/   Checkpoint structure and weight-divergence audit.
+docs/           Public architecture, training, evaluation, and release notes.
+```
+## Main Public Artifact
+The final model weights are available at: [Quintus](https://huggingface.co/iamrahulreddy/Quintus)
+The Colab quickstart is available at: [Colab Quick Chat](https://colab.research.google.com/drive/1TdMSN5HzD1mToCFVf_qQoj10NGZLy2V0?usp=sharing)

docs/pipeline_hardening.md ADDED

	@@ -0,0 +1,208 @@

+# Pipeline Hardening
+![Pipeline Hardening Flow](../assets/pipeline_hardening_flow.svg)
+This page summarizes the correctness and reliability lessons that shaped the Quintus codebase. Most of these are silent-failure classes: the pipeline can appear to run while producing invalid or misleading artifacts.
+## Silent Serialization Bugs
+Teacher token IDs must be stored in a dtype that can represent the tokenizer vocabulary.
+An early offline-KD path stored top-k token IDs in a dtype that was too narrow. Qwen token IDs exceed the signed 16-bit range (maximum 32,767), so IDs could wrap negative and later be clamped into valid-looking but wrong positions. Training could continue, but the KL support was corrupted.
+Hardening rule:
+- Store token IDs as `int32` or wider.
+- Validate IDs on load.
+- Reject negative IDs.
+- Reject IDs outside the student vocabulary.
+- Treat dtype as part of the shard contract.
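The wraparound and the load-time checks can be illustrated with a few lines of standard-library Python. This is a sketch under stated assumptions: `wrap_int16` and `validate_token_ids` are hypothetical helpers, the vocabulary size is the 151,665 figure used elsewhere in these docs, and the example ID is arbitrary.

```python
VOCAB_SIZE = 151_665  # vocabulary size used in this documentation's arithmetic


def wrap_int16(value):
    """Simulate storing an integer in a signed 16-bit slot (two's complement)."""
    return (value + 2**15) % 2**16 - 2**15


def validate_token_ids(token_ids, vocab_size):
    """Raise on IDs that cannot be valid indices into the student vocabulary."""
    for position, token_id in enumerate(token_ids):
        if token_id < 0:
            raise ValueError(f"negative token id {token_id} at position {position}")
        if token_id >= vocab_size:
            raise ValueError(
                f"token id {token_id} at position {position} >= vocab size {vocab_size}"
            )
    return token_ids


original_id = 40_000                     # valid for the vocabulary, too large for int16
corrupted_id = wrap_int16(original_id)   # wraps to a negative value

try:
    validate_token_ids([corrupted_id], VOCAB_SIZE)
    caught = False
except ValueError:
    caught = True
```

Range checks alone cannot catch a wrap that lands back inside the valid range, which is why the dtype itself belongs in the shard contract.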
+## Row-Order Preservation
+Teacher-logit extraction often sorts samples by length for throughput. Training usually expects logits to match the original tokenized row order.
+If sorted extraction writes shards in sorted order without restoring original indices, the student receives teacher logits for the wrong sample. This is a model-poisoning bug, not a performance issue.
+Hardening rule:
+- Batch by sorted length if useful.
+- Preserve `original_idx`.
+- Write final shards in original dataset order.
+- Verify teacher-logit length against the tokenized row length at training time.
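The rule above can be sketched as length-sorted batching that writes results back by original index. This is an illustrative outline only: `extract_in_original_order` and the `run_teacher` callable are hypothetical stand-ins for the real extraction code, and the fake teacher simply returns one value per token.

```python
def extract_in_original_order(rows, run_teacher, batch_size=2):
    """Batch by sorted length, but return outputs in original dataset order."""
    # Sort indices (not rows) so every batch element keeps its original_idx.
    order = sorted(range(len(rows)), key=lambda i: len(rows[i]), reverse=True)
    results = [None] * len(rows)
    for start in range(0, len(order), batch_size):
        batch_idx = order[start:start + batch_size]
        outputs = run_teacher([rows[i] for i in batch_idx])
        for original_idx, output in zip(batch_idx, outputs):
            results[original_idx] = output
    # Verify each teacher output against its tokenized row length.
    for i, (row, output) in enumerate(zip(rows, results)):
        if output is None or len(output) != len(row):
            raise ValueError(f"row {i}: teacher output does not match row length")
    return results


def fake_teacher(batch):
    """Stand-in teacher: one 'logit' per token, derived from the token id."""
    return [[token * 10 for token in row] for row in batch]


rows = [[1], [2, 3, 4], [5, 6]]
aligned = extract_in_original_order(rows, fake_teacher)
```

The length check is cheap, but it only catches misalignment between rows of different lengths; preserving `original_idx` is what actually prevents the bug.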
+## Dataset Schema And Decoding
+Public instruction datasets do not share a single row schema. Some rows arrive as `messages`; others use Alpaca-style `instruction`, `input`, and `output` fields. Some content fields contain nested dict/list payloads that need structured coercion before templating.
+Dataset streaming can also fail late when a compression codec or file decoder is missing. That failure should remain visible instead of being replaced by a generic "zero samples" result.
+Hardening rule:
+- Detect Alpaca-style instruction/output rows before chat-message conversion.
+- Coerce nested dict/list content through structured serialization, then normalize to text.
+- Normalize common role aliases before applying a chat template.
+- Preserve the first real dataset exception when streaming fails.
+- Validate dataset decoding and schema mapping before large model downloads.
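A minimal sketch of the schema mapping, assuming rows are plain dicts. The function names, the alias table, and the prompt joining format are illustrative choices, not the repository's actual implementation.

```python
import json

# Hypothetical alias table; real datasets may need more entries.
ROLE_ALIASES = {"human": "user", "gpt": "assistant", "bot": "assistant"}


def to_text(content):
    """Coerce nested dict/list payloads to a stable string; pass text through."""
    if isinstance(content, (dict, list)):
        return json.dumps(content, ensure_ascii=False, sort_keys=True)
    return str(content)


def normalize_row(row):
    """Map an Alpaca-style or messages-style row to a list of chat messages."""
    # Check for Alpaca fields first, before any chat-message conversion.
    if "instruction" in row and "output" in row:
        prompt = to_text(row["instruction"])
        extra = to_text(row.get("input") or "")
        if extra:
            prompt = f"{prompt}\n\n{extra}"
        return [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": to_text(row["output"])},
        ]
    if "messages" in row:
        messages = []
        for message in row["messages"]:
            role = str(message.get("role", "")).lower()
            messages.append({
                "role": ROLE_ALIASES.get(role, role),
                "content": to_text(message.get("content", "")),
            })
        return messages
    raise ValueError(f"unrecognized row schema: {sorted(row)}")


alpaca = normalize_row({"instruction": "Add", "input": "1 2", "output": "3"})
chat = normalize_row({"messages": [{"role": "human", "content": {"a": 1}}]})
```

Unknown schemas raise instead of being skipped, so a mapping problem cannot quietly shrink the dataset to zero usable rows.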
+## Zero-Data And Data-Erasure Guards
+Data preparation should fail when no usable rows are produced. It should also distinguish "download only" from "tokenize and overwrite output".
+Hardening rule:
+- Abort if filtering retains zero samples.
+- Abort if tokenization writes zero rows.
+- Do not open tokenized output in write mode for asset-only setup.
+- Use explicit flags for model-only or data-only phases.
+## Missing Shards Must Fail
+Replacing missing teacher-logit shards with zero tensors makes the training loop look healthy while removing the KD signal.
+Hardening rule:
+- Missing shard means hard failure.
+- Stale shard directories are cleaned before extraction.
+- `_provenance.json` is required for KD.
+- Shard count, sample count, max sequence length, temperature, top-k, and schema version are checked before training.
+## Provenance Contracts
+Path equality is weak provenance because paths change across machines. Data identity should come from content and model contracts.
+Useful provenance fields:
+- schema version
+- dataset fingerprint or SHA-256
+- sample count
+- shard count
+- max sequence length
+- top-k or full-vocab mode
+- temperature
+- teacher model ID and revision
+- student model ID and revision
+- tokenizer sizes
+- tokenizer fingerprints
+- shard dtypes
+Tokenizer fingerprints can drift across library versions. Vocab size and schema compatibility should remain hard gates; fingerprint drift can be a warning when stronger invariants still match.
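One way to express the hard-gate versus warning split is sketched below. The field names, the `HARD_KEYS`/`SOFT_KEYS` split, and the fingerprint recipe are assumptions for illustration; the real `_provenance.json` schema is not reproduced here.

```python
import hashlib
import json

# Hypothetical field names for this sketch.
HARD_KEYS = ("schema_version", "dataset_sha256", "sample_count", "vocab_size")
SOFT_KEYS = ("tokenizer_fingerprint",)


def dataset_fingerprint(rows):
    """Content-based SHA-256 over rows, independent of file paths."""
    digest = hashlib.sha256()
    for row in rows:
        digest.update(json.dumps(row, sort_keys=True).encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def check_provenance(expected, found):
    """Raise on hard-gate mismatches; return soft mismatches as warnings."""
    for key in HARD_KEYS:
        if expected.get(key) != found.get(key):
            raise ValueError(
                f"provenance mismatch on {key}: {expected.get(key)!r} != {found.get(key)!r}"
            )
    return [key for key in SOFT_KEYS if expected.get(key) != found.get(key)]


rows = [{"input_ids": [1, 2, 3]}, {"input_ids": [4, 5]}]
expected = {
    "schema_version": 2,
    "dataset_sha256": dataset_fingerprint(rows),
    "sample_count": len(rows),
    "vocab_size": 151_665,
    "tokenizer_fingerprint": "abc",
}
drifted = dict(expected, tokenizer_fingerprint="xyz")
warnings = check_provenance(expected, drifted)
```

Because the fingerprint is computed from content, the same dataset copied to a different machine or path still passes the hard gate.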
+## Assistant-Only Loss Masks
+Supervising prompt and chat-template tokens can teach formatting before substance. It can also make chat-mode behavior fragile.
+Hardening rule:
+- Tokenized rows must include `loss_mask`.
+- Loss mask must be binary.
+- Rows with zero assistant targets are rejected.
+- User prompts, system prompts, separators, and padding are not targets.
+- Assistant response tokens are the supervised region.
+Prefix-stable mask derivation is useful when tokenizer-provided assistant masks are unavailable.
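The mask contract can be sketched as a validator plus a masked mean. Both helper names are hypothetical, and the per-token losses are made-up numbers standing in for CE or KD values.

```python
def validate_loss_mask(input_ids, loss_mask):
    """Enforce the mask contract: aligned, binary, and non-empty."""
    if len(input_ids) != len(loss_mask):
        raise ValueError("loss_mask length does not match input_ids")
    if any(m not in (0, 1) for m in loss_mask):
        raise ValueError("loss_mask must be binary")
    if sum(loss_mask) == 0:
        raise ValueError("row has no assistant target tokens")


def masked_mean(per_token_loss, loss_mask):
    """Average a per-token loss over assistant positions only."""
    total = sum(loss * m for loss, m in zip(per_token_loss, loss_mask))
    return total / sum(loss_mask)


input_ids = [11, 12, 13, 14]          # two prompt tokens, two assistant tokens
loss_mask = [0, 0, 1, 1]
validate_loss_mask(input_ids, loss_mask)
loss = masked_mean([5.0, 5.0, 1.0, 3.0], loss_mask)
```

The prompt positions carry a large loss in this toy row, yet they contribute nothing to the result because only masked-in positions are averaged.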
+## Gradient Accumulation Semantics
+DeepSpeed and non-DeepSpeed paths need different step-accounting logic.
+DeepSpeed accumulation is global across the full run, not local to each epoch. Epoch-end remainder branches should not create phantom optimizer steps.
+Non-DeepSpeed accumulation needs an explicit final flush when a leftover accumulation window exists. That flush must rescale gradients so the update represents the mean over the remainder, not a shrunken `remainder / grad_accum` update.
+Hardening rule:
+- Advance `global_step` only after a real optimizer update.
+- Align scheduler steps with real updates.
+- Log flush steps.
+- Include flush steps in training-loss CSVs.
+- Prefer validation split sizes that align with effective batch size.
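The final-flush rescale for the non-DeepSpeed path can be checked with scalar "gradients". This is a toy model of the accounting, not the training loop itself; `accumulate_updates` is a hypothetical name, and each number stands in for one micro-batch gradient.

```python
def accumulate_updates(micro_grads, grad_accum):
    """Return one value per real optimizer update, including a final flush."""
    updates = []
    buffer, count = 0.0, 0
    for grad in micro_grads:
        # Each micro-batch loss is pre-divided by grad_accum, as in a typical loop.
        buffer += grad / grad_accum
        count += 1
        if count == grad_accum:
            updates.append(buffer)
            buffer, count = 0.0, 0
    if count > 0:
        # Leftover window: replace the 1/grad_accum scale with 1/count so the
        # update is the mean over the remainder, not a shrunken fraction of it.
        updates.append(buffer * grad_accum / count)
    return updates


updates = accumulate_updates([1.0, 3.0, 2.0, 4.0, 6.0], grad_accum=4)
global_step = len(updates)  # advances only on real updates
```

Without the rescale, the last update in this example would be 1.5 instead of 6.0, because a single leftover micro-batch would still be divided by the full accumulation count.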
+## Checkpoint Semantics
+`init_from_checkpoint` and `resume_from_checkpoint` are different operations.
+- Initialization starts a new phase from an existing model.
+- Resume continues an interrupted phase from training state.
+Mixing the two can skip training, restart from the wrong model, or reuse stale state.
+Hardening rule:
+- Forbid simultaneous init and resume.
+- Save trainer state and scheduler state.
+- Search both `step_*` and `epoch_*` checkpoints for resume.
+- Store batch offset for mid-epoch resume.
+- Keep final model-loading checkpoints portable.
+## Compiler Portability
+Compiled PyTorch modules can save weights with `_orig_mod.` prefixes if not unwrapped. Standard Transformers and vLLM loaders do not expect those keys.
+Hardening rule:
+- Keep `torch.compile` opt-in.
+- Treat dynamic-shape recompile overhead as a throughput risk, not just a startup cost.
+- Unwrap compiled modules before saving.
+- Strip `_orig_mod.` only as a repair path, not as the normal release path.
+- Verify saved checkpoints load through standard APIs.
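The repair path mentioned above amounts to a key rename over the saved state dict. The sketch below uses plain dicts so it runs without PyTorch; `strip_compile_prefix` is a hypothetical helper, and the keys are invented examples.

```python
COMPILE_PREFIX = "_orig_mod."


def strip_compile_prefix(state_dict):
    """Repair path only: drop '_orig_mod.' segments from checkpoint keys."""
    cleaned = {}
    for key, value in state_dict.items():
        new_key = key.replace(COMPILE_PREFIX, "")
        if new_key in cleaned:
            # Two keys collapsing into one would silently drop a tensor.
            raise KeyError(f"key collision after stripping prefix: {new_key}")
        cleaned[new_key] = value
    return cleaned


# Toy checkpoint with placeholder values instead of tensors.
compiled_keys = {
    "_orig_mod.model.embed_tokens.weight": "w0",
    "_orig_mod.model.norm.weight": "w1",
}
repaired = strip_compile_prefix(compiled_keys)
```

The preferred release path is still to save from the unwrapped module so that the prefix never reaches disk; this rename is for checkpoints that were already written.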
+## Artifact Hygiene
+Stale outputs are a real ML correctness problem. Old result JSONs, old plots, or old sample logs can make a failed run look successful.
+Hardening rule:
+- Clean evaluation output directories before a new run.
+- Clean stale plots before rendering.
+- Select result files by clear recency rules.
+- Fail if expected task outputs are incomplete.
+- Fail if a requested checkpoint is missing; do not fall back to older local weights.
+- Include runtime versions in result summaries.
+## Environment Contracts
+Notebook and cloud images often contain mixed binary packages. Import success for `torch` alone does not prove the stack is healthy.
+Hardening rule:
+- Treat `torch`, `torchvision`, and `torchaudio` as one binary compatibility family.
+- Use staged dependency manifests instead of ad hoc installs.
+- Keep vLLM dependencies separate from HF-only evaluation dependencies.
+- Prefer clear preflight errors over late framework crashes.
+- Print exception chains, not only the outer error.
+## Remote Code And Revisions
+Model loading should be reproducible and explicit.
+Hardening rule:
+- Pin teacher, student, and tokenizer revisions when possible.
+- Default remote-code trust to false.
+- Provide an explicit override for models that need custom code.
+- Explain remote-code failures clearly.
+## Safe Logging
+Training logs should be rich enough for issue diagnosis without dumping config internals.
+Hardening rule:
+- Avoid logging authentication values or full config payloads.
+- Disable traceback local-variable dumps in rich tracebacks.
+- Strip ANSI sequences from file logs while keeping colored notebook output if desired.
+- Use UTF-8 file logs and replacement-safe console output for generated model text.
+- Log checkpoint save/upload intent, output size, duration, and destination path without sensitive values.
+## Public Release Rule
+A project can be release-ready without every possible production safeguard. The line is crossed when:
+- known silent corruption paths are removed,
+- remaining tradeoffs are documented,
+- artifacts are reproducible enough to audit,
+- public docs focus on decisions, methods, and release artifacts,
+- evaluation claims are tied to clear methodology.
+For Quintus, the release surface should describe the engineering decisions and results.

docs/training_playbook.md ADDED

	@@ -0,0 +1,199 @@

+# Training Playbook
+This page captures the practical training lessons behind Quintus. It focuses on the engineering decisions that made the final online-KD run stable, reproducible, and fast enough to complete on large single-GPU hardware.
+## Core Objective
+The training objective combines assistant-token cross entropy with teacher-student KL divergence:
+$$
+\mathcal{L}_{\text{total}}
+= \alpha \mathcal{L}_{\text{CE}}
++ (1 - \alpha)\mathcal{L}_{\text{KD}}
+$$
+For the final Qwen3 run:
+$$
+\alpha = 0.3,\quad
+T = 2.0,\quad
+C_{\text{KD}} = 2048,\quad
+S_{\max} = 4096
+$$
+In this codebase, $\alpha$ is the cross-entropy weight. Lower $\alpha$ gives the teacher distribution more influence. Higher $\alpha$ gives hard assistant targets more influence.
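For a single assistant token, the objective can be written out in a few lines. This is a sketch under assumptions that the text above does not confirm: the KD term is taken as forward KL from teacher to student at temperature $T$, and it is multiplied by $T^2$, a common convention. The function names and toy logits are illustrative.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Log-probabilities of logits / temperature, computed stably."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_norm for x in scaled]


def kd_total_loss(student_logits, teacher_logits, target_id, alpha=0.3, temperature=2.0):
    """alpha * CE + (1 - alpha) * KD for one token position."""
    # Hard-target cross entropy at temperature 1.
    ce = -log_softmax(student_logits)[target_id]
    # Soft-target KL(teacher || student) at temperature T.
    log_p = log_softmax(teacher_logits, temperature)
    log_q = log_softmax(student_logits, temperature)
    kd = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    kd *= temperature ** 2  # assumed T^2 scaling, not confirmed by the source
    return alpha * ce + (1 - alpha) * kd, ce, kd


total_same, ce_same, kd_same = kd_total_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0], target_id=2)
total_diff, ce_diff, kd_diff = kd_total_loss([1.0, 2.0, 3.0], [3.0, 2.0, 1.0], target_id=2)
```

When student and teacher logits agree, the KD term vanishes and only the weighted CE remains; with $\alpha = 0.3$, most of the gradient signal otherwise comes from the teacher distribution.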
+## Why Online KD Replaced Offline Top-K KD
+The early pipeline precomputed only a small top-k slice of the teacher distribution. That made storage and training cheaper, but it created a hard information ceiling.
+With a Qwen vocabulary around 151K tokens:
+$$
+\frac{k}{|V|}
+= \frac{8}{151{,}665}
+\approx 5.3 \times 10^{-5}
+= 0.0053\%
+$$
+That sparse signal was enough to disturb student weights, but not enough to reliably transfer deeper reasoning behavior. Several development probes changed $\alpha$, epochs, and student initialization; the same ceiling remained.
+The final online path removes that bottleneck. Teacher and student run together, and the KL term is computed from the live full-vocabulary teacher distribution.
+## Memory Shape To Respect
+Full-vocabulary KD is dominated by logits:
+$$
+\text{student\_logits},\ \text{teacher\_logits}
+\in \mathbb{R}^{B \times S \times |V|}
+$$
+At Qwen vocabulary scale, increasing micro-batch size by one can add many GiB of temporary memory pressure. Effective batch size is not the same as memory cost. Peak memory is mostly driven by micro-batch size, sequence length, vocabulary width, activation storage, and the backward pass.
+Useful rule:
+$$
+B_{\text{eff}} = B_{\mu} \times A
+$$
+Keeping $B_{\mu}$ lower and $A$ higher is often safer than a large micro-batch with the same effective batch size.
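The size of one logits tensor follows directly from its shape, so the trade-off can be checked with simple arithmetic. The byte widths below are assumptions (2 bytes for half-precision storage, 4 bytes for full-precision temporaries), and the real logits width depends on the model's embedding size, which may differ from the tokenizer vocabulary figure used here.

```python
def logits_gib(micro_batch, seq_len, vocab_size, bytes_per_element):
    """Size of one B x S x |V| logits tensor in GiB."""
    return micro_batch * seq_len * vocab_size * bytes_per_element / 2**30


def effective_batch(micro_batch, grad_accum):
    """B_eff = B_mu * A."""
    return micro_batch * grad_accum


SEQ_LEN = 4096     # S_max from this playbook
VOCAB = 151_665    # |V| used in this playbook's arithmetic

per_sample_2byte = logits_gib(1, SEQ_LEN, VOCAB, 2)  # assumed half-precision storage
per_sample_4byte = logits_gib(1, SEQ_LEN, VOCAB, 4)  # assumed full-precision temporaries

# Same effective batch, very different peak logits footprint.
same_effective = effective_batch(4, 2) == effective_batch(1, 8)
peak_ratio = logits_gib(4, SEQ_LEN, VOCAB, 2) / logits_gib(1, SEQ_LEN, VOCAB, 2)
```

Under these assumptions a single sample costs about 1.16 GiB per logits tensor at 2 bytes per element, and both the student and the teacher produce one, before any softmax or gradient temporaries are counted.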
+## Token Chunking
+A naive full-vocabulary KL implementation materializes too much temporary state. Quintus computes KD over token chunks:
+$$
+C_{\text{KD}} = 2048
+$$
+Larger chunks reduce loop overhead but increase temporary memory. Smaller chunks save memory but can add kernel-launch and Python overhead. The final value is a B200-oriented balance for the 8B -> 1.7B workload.
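The chunking idea can be shown with a loop that only ever touches `chunk_size` token positions at a time. This is a pure-Python sketch: `token_kl` and `chunked_kd_loss` are hypothetical names, the KL direction is assumed to be teacher to student, and the real implementation would operate on GPU tensors where the bounded quantity is the per-chunk softmax temporary.

```python
import math


def token_kl(teacher_logits, student_logits, temperature):
    """KL(teacher || student) for one token position at a given temperature."""
    def log_softmax(logits):
        scaled = [x / temperature for x in logits]
        peak = max(scaled)
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in scaled))
        return [x - log_norm for x in scaled]

    log_p, log_q = log_softmax(teacher_logits), log_softmax(student_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def chunked_kd_loss(teacher_rows, student_rows, loss_mask, chunk_size=2048, temperature=2.0):
    """Mean KL over masked-in tokens, processed chunk_size positions at a time."""
    total, count = 0.0, 0
    for start in range(0, len(loss_mask), chunk_size):
        stop = start + chunk_size
        # Only this slice needs full-vocabulary probabilities at once.
        chunk = zip(teacher_rows[start:stop], student_rows[start:stop], loss_mask[start:stop])
        for teacher_logits, student_logits, keep in chunk:
            if keep:
                total += token_kl(teacher_logits, student_logits, temperature)
                count += 1
    if count == 0:
        raise ValueError("no supervised tokens in this row")
    return total / count


teacher = [[2.0, 0.0, -1.0], [0.5, 0.5, 0.0], [1.0, 3.0, 0.0], [0.0, 0.0, 2.0], [1.0, 1.0, 1.0]]
student = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 2.0, 0.0], [0.0, 1.0, 1.0], [2.0, 0.0, 1.0]]
mask = [0, 1, 1, 1, 1]
small_chunks = chunked_kd_loss(teacher, student, mask, chunk_size=2)
one_chunk = chunked_kd_loss(teacher, student, mask, chunk_size=2048)
```

The chunk size changes peak temporary memory and loop overhead, not the loss value, which is why it can be tuned per hardware profile.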
+## Sequence Packing
+Sequence packing was the largest throughput win in development probes.
+The packing strategy:
+- Sort samples by length descending.
+- Pack samples with deterministic first-fit decreasing binning.
+- Insert EOS separators between samples.
+- Set separator `loss_mask = 0`.
+- Optionally mask the first token after each separator.
+- Build `attention_mask` from true packed length, not from token identity.
+The attention-mask detail matters because Qwen tokenizer configurations can assign the pad token the same ID as an EOS-like token. Deriving attention from `input_ids != pad_token_id` can accidentally mask real EOS separators inside packed rows.
+Packing probes showed an unpacked B200 online-KD baseline around the low-20K tokens/sec range. Packed training reached roughly the mid-40K tokens/sec range after warmup. The final Qwen3 profile uses the same design principle with a conservative 8B -> 1.7B batch shape.
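The packing strategy listed above can be sketched with first-fit decreasing over token lists. This is an illustrative version, not the repository's packer: the function name, the dict layout, and the toy IDs are assumptions, and the optional masking of the first token after each separator is omitted.

```python
def pack_first_fit_decreasing(samples, bin_size, eos_id, pad_id):
    """Pack (input_ids, loss_mask) pairs into fixed-size rows."""
    # Longest first; ties broken by original index for determinism.
    order = sorted(range(len(samples)), key=lambda i: (-len(samples[i][0]), i))
    bins = []
    for i in order:
        ids, mask = samples[i]
        if len(ids) > bin_size:
            raise ValueError(f"sample {i} is longer than bin_size")
        for current in bins:
            # +1 accounts for the EOS separator placed before the new sample.
            if len(current["input_ids"]) + 1 + len(ids) <= bin_size:
                current["input_ids"] += [eos_id] + list(ids)
                current["loss_mask"] += [0] + list(mask)  # separator is never a target
                break
        else:
            bins.append({"input_ids": list(ids), "loss_mask": list(mask)})
    packed = []
    for current in bins:
        true_len = len(current["input_ids"])
        pad = bin_size - true_len
        packed.append({
            "input_ids": current["input_ids"] + [pad_id] * pad,
            "loss_mask": current["loss_mask"] + [0] * pad,
            # Built from the true packed length, not from token identity.
            "attention_mask": [1] * true_len + [0] * pad,
        })
    return packed


EOS = PAD = 0  # toy setup where padding reuses the EOS id
samples = [
    ([1, 2, 3, 4, 5], [0, 0, 1, 1, 1]),
    ([6, 7, 8], [0, 1, 1]),
    ([9, 10], [0, 1]),
    ([11, 12, 13, 14], [0, 0, 1, 1]),
]
packed = pack_first_fit_decreasing(samples, bin_size=10, eos_id=EOS, pad_id=PAD)
naive_mask = [int(token != PAD) for token in packed[1]["input_ids"]]
```

In the second packed row, the naive mask marks the separator as padding while the length-based mask keeps it visible, which is the failure mode described above.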
+## B200-Oriented Final Shape
+The Qwen3 config is intentionally conservative:
+$$
+B_{\mu}=4,\quad
+A=2,\quad
+B_{\text{eff}}=8,\quad
+L_{\text{pack}}=4096
+$$
+Runtime choices:
+- `gradient_checkpointing = false`
+- `compile_model = false`
+- `fused_adamw = true`
+- `sequence_packing.enabled = true`
+- FlashAttention-2 when available
+- Liger kernels for compatible Qwen-family operators
+The main reason is the 8B teacher plus 1.7B student online-KD footprint. A smaller teacher/student pair can use larger micro-batches, but the release workload reserves more headroom.
+## Kernel Choices
+FlashAttention-2 is the preferred stable attention path when available.
+Liger kernels are useful for Qwen-family training, but KD places an important constraint on fusion:
+- Safe to fuse: RMSNorm, RoPE, SwiGLU.
+- Avoid for KD: fused linear cross entropy that hides raw student logits.
+The KD loss needs raw student logits to compute teacher-student KL. Any optimization that bypasses logits entirely can break the objective.
+## Why `torch.compile` Stayed Off
+`torch.compile` can be useful for some SFT paths, but it was not the production choice for final KD.
+Observed risks:
+- Large Inductor memory overhead.
+- Warmup cost on short-lived cloud instances.
+- Dynamic-shape graph breaks from variable sequence lengths.
+- Recompile overhead that reduced cumulative throughput in probes.
+- `_orig_mod.` prefixes in saved checkpoints if compiled modules are not unwrapped before saving.
+- Limited benefit after FlashAttention and Liger already fuse the major kernels.
+For this workload, stable eager execution with targeted kernels was more predictable than compiler-driven fusion.
+## DataLoader And Cloud Stability
+Large worker counts can improve throughput on local systems, but notebook and cloud environments can deadlock through multiprocessing queues, IPC limits, or shared-memory pressure.
+Practical policy:
+- Start with conservative worker and prefetch settings.
+- Treat a silent training hang as a DataLoader candidate, even when GPU utilization remains high.
+- For some cloud notebook runs, `dataloader_workers = 0` was the most stable choice.
+- For the release config, `dataloader_workers = 8` and `prefetch_factor = 2` are a controlled default, not a universal rule.
+## Checkpointing And Resume
+Cloud GPU instances can be preempted and notebook sessions can disappear. The training loop therefore treats checkpointing as a core training feature, not an afterthought.
+Important design points:
+- `best` is selected from validation loss where available.
+- `last` is saved for final-state inspection.
+- Step checkpoints can resume mid-epoch.
+- Scheduler state is saved.
+- Optimizer state may be intentionally omitted for very large runs to avoid massive checkpoint overhead.
+- Resume semantics distinguish initialization from a completed checkpoint and continuation from an interrupted checkpoint.
+This avoids the common trap where `resume_from_checkpoint` silently starts from the wrong phase or stale state.
+## Provenance Rules
+The pipeline is strict about artifact compatibility:
+- Tokenizer vocabulary sizes must match the model contract.
+- Teacher-logit metadata must match expected temperature, sample count, max sequence length, and tokenizer/model identity.
+- Dataset fingerprints are preferred over path equality because paths are machine-local.
+- Tokenizer fingerprints can drift across library versions, so hard checks should focus on vocab-size and schema invariants.
+The principle is simple: train only when artifacts prove they belong together.
+## Dataset Sampling
+Taking the first N valid streamed examples can bias a run if the upstream dataset is ordered by source, task, difficulty, or language. Later configs added stream shuffling before selection.
+The config uses a non-default seed:
+```text
+stream_shuffle_seed = 25
+split_seed = 25
+```
+The number is intentionally explicit. Reproducibility needs stable seeds; it does not require the overused value `42`.
+## Practical Watchpoints
+During a run, these signals matter more than a single loss number:
+- Loss stays finite from the first logging window.
+- CE and KD move in plausible ranges.
+- Rolling throughput remains stable after warmup.
+- GPU memory is high but not near an unpredictable OOM edge.
+- Validation loss is computed on the intended holdout.
+- Saved checkpoints load in standard Transformers and vLLM paths.
+- Downstream benchmark results agree with the training story.
+Held-out KD loss is useful, but it is not the release gate. Standardized benchmarks and qualitative checks must decide whether the checkpoint improved the target behavior.

docs/weight_audit.md ADDED

	@@ -0,0 +1,66 @@

+# Weight Audit
+The `weight_audit/` directory contains a structural audit script and a generated report comparing the final distilled checkpoint against `Qwen/Qwen3-1.7B-Base`.
+The audit is not a behavioral benchmark. It answers a narrower question: is the checkpoint structurally intact, same-architecture, and plausibly modified by training without signs of collapse?
+## What Was Checked
+The audit verifies:
+- Base and distilled checkpoint commits.
+- Architecture and config compatibility.
+- Parameter counts and tensor keys.
+- Weight tying between embeddings and LM head.
+- Per-tensor statistics.
+- Layer-type aggregate statistics.
+- Isotropy of 2D weight matrices.
+- Base-vs-distilled divergence for all shared tensors.
+- Sparsity, dead rows, low cosine similarity, and low SNR warnings.
+## Headline Result
+The final report shows:
+```text
+shared tensors                   : 311
+tensors changed vs base          : 277 / 311
+cosine similarity                : mean = 0.999991 | median = 0.999992
+relative error                   : mean = 0.001093 | median = 0.001293
+SNR dB                           : mean = 81.86 | median = 47.79
+high-sparsity layers (>10%)      : 0
+heavy-tail layers (|kurt_d|>5.0) : 0
+dead-row layers                  : 0
+low-cos layers (<0.95)           : 0
+low-SNR layers (<20 dB)          : 0
+```
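For readers who want to reproduce the kind of per-tensor numbers shown above, here is a minimal sketch on flattened tensors. The definitions are assumptions, since the audit script itself is not shown here: relative error is taken as the L2 norm of the difference over the L2 norm of the base tensor, and SNR as `20 * log10(||base|| / ||delta||)`. The function name `divergence_metrics` is hypothetical, and the script's actual formulas may differ.

```python
import math
from typing import Sequence


def divergence_metrics(base: Sequence[float], tuned: Sequence[float]) -> dict:
    """Compare two flattened tensors of equal length (assumed definitions)."""
    if len(base) != len(tuned):
        raise ValueError("tensors must have the same number of elements")

    dot = sum(b * t for b, t in zip(base, tuned))
    base_norm = math.sqrt(sum(b * b for b in base))
    tuned_norm = math.sqrt(sum(t * t for t in tuned))
    delta_norm = math.sqrt(sum((t - b) ** 2 for b, t in zip(base, tuned)))

    nan = float("nan")
    cosine = dot / (base_norm * tuned_norm) if base_norm > 0 and tuned_norm > 0 else nan
    relative_error = delta_norm / base_norm if base_norm > 0 else nan

    if delta_norm == 0:
        snr_db = math.inf  # identical tensors: no "noise" at all
    elif base_norm == 0:
        snr_db = nan  # undefined when the base tensor is all zeros
    else:
        snr_db = 20.0 * math.log10(base_norm / delta_norm)

    return {
        "changed": delta_norm > 0,
        "cosine": cosine,
        "relative_error": relative_error,
        "snr_db": snr_db,
    }


if __name__ == "__main__":
    print(divergence_metrics([3.0, 4.0], [3.0, 4.05]))
```

Under these definitions an unchanged tensor reports an infinite SNR, so any aggregate over all tensors has to decide how to treat unchanged entries.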
+## Interpretation
+This is a healthy pattern for light-touch distillation:
+- The architecture is unchanged.
+- Most tensors changed (277 of 311 shared tensors).
+- The changes are small relative to the original base weights.
+- Projection matrices, embeddings, and MLP/attention layers moved.
+- Some normalization tensors remained unchanged or changed only slightly.
+- No layer shows obvious structural collapse.
+The 34 unchanged tensors are primarily normalization-related weights. That is not concerning by itself. It suggests the main projection weights absorbed the training signal while the basic scaling structure stayed stable.
+## Why Isotropy Matters
+The report's global isotropy score is close to zero. Near-zero average pairwise row cosine means the weight rows are not collapsing into one shared direction.
+This is useful as a sanity check after KD. A collapsed model can sometimes load and produce text, but its internal geometry becomes degenerate. The audit does not show that pattern.
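The "average pairwise row cosine" described above can be sketched as follows. This assumes the score is the mean cosine over all distinct row pairs of a 2D matrix, which is how the text describes it. The audit script's exact definition is not shown here, and the function name is hypothetical. The sketch uses the identity that the sum of all pairwise dot products of unit rows equals the squared norm of their sum, so it avoids an explicit loop over pairs.

```python
import math
from typing import List, Sequence


def mean_pairwise_row_cosine(matrix: Sequence[Sequence[float]]) -> float:
    """Mean cosine similarity over all distinct pairs of non-zero rows."""
    unit_rows: List[List[float]] = []
    for row in matrix:
        norm = math.sqrt(sum(x * x for x in row))
        if norm > 0:  # all-zero rows have no direction, so skip them
            unit_rows.append([x / norm for x in row])

    n = len(unit_rows)
    if n < 2:
        return float("nan")  # a pairwise average needs at least two rows

    # Sum of unit rows; ||s||^2 equals the sum of u_i . u_j over all (i, j).
    width = len(unit_rows[0])
    s = [sum(row[k] for row in unit_rows) for k in range(width)]
    total = sum(x * x for x in s)

    # Remove the n diagonal terms (each equal to 1), then average off-diagonals.
    return (total - n) / (n * (n - 1))


if __name__ == "__main__":
    collapsed = [[1.0, 2.0, 3.0]] * 3        # every row points the same way
    orthogonal = [[1.0, 0.0], [0.0, 1.0]]    # rows share no direction
    print(mean_pairwise_row_cosine(collapsed))
    print(mean_pairwise_row_cosine(orthogonal))
```

A value near 1 indicates rows collapsing toward one shared direction, while a value near 0 is the healthy pattern the report describes.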
+## What The Audit Does Not Prove
+The weight audit does not prove that answers are correct, safe, or well calibrated. It should be read alongside:
+- Standard benchmarks.
+- Open-ended qualitative evaluations.
+- SFT evaluation outputs.
+- Manual regression prompts.
+The audit says the checkpoint is structurally ready for downstream evaluation and release packaging.

requirements-eval.txt ADDED

	@@ -0,0 +1,6 @@

+-r requirements.txt
+# Consolidated benchmark runner dependencies.
+evalplus>=0.3
+lm-eval>=0.4.8
+vllm>=0.8; platform_system == "Linux"

requirements-train.txt ADDED

	@@ -0,0 +1,12 @@

+-r requirements.txt
+# Optional but recommended for the documented high-throughput training path.
+# These packages are Linux/CUDA-oriented and may require matching compiler,
+# CUDA, and PyTorch builds.
+liger-kernel>=0.5
+flash-attn>=2.7; platform_system == "Linux"
+deepspeed>=0.16; platform_system == "Linux"
+# Optional SFT/QLoRA paths in sft/train_sft.py.
+peft>=0.14
+bitsandbytes>=0.45; platform_system == "Linux"

requirements.txt ADDED

	@@ -0,0 +1,13 @@

+# Core dependencies for downloading, training entry points, local chat, and
+# lightweight repository utilities. Install CUDA-specific PyTorch wheels from
+# the official PyTorch index when your environment requires a specific CUDA
+# build.
+torch>=2.6
+transformers>=4.52
+datasets>=2.19
+huggingface-hub>=0.31
+omegaconf>=2.3
+PyYAML>=6.0
+safetensors>=0.4
+accelerate>=1.0
+tqdm>=4.66

sft/chat.py ADDED

	@@ -0,0 +1,89 @@

+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer
+import sys
+import argparse
+def main():
+    parser = argparse.ArgumentParser(description="Quintus Interactive Chat")
+    parser.add_argument("--model_path", type=str, default="iamrahulreddy/Quintus", help="Model repo ID or local weights directory")
+    parser.add_argument("--trust_remote_code", action="store_true", help="Allow custom code from the model repository.")
+    args = parser.parse_args()
+    model_path = args.model_path
+    print(f"Loading Quintus from {model_path}...")
+    try:
+        tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=args.trust_remote_code)
+        model = AutoModelForCausalLM.from_pretrained(
+            model_path,
+            device_map="auto",
+            dtype=torch.float16,
+            trust_remote_code=args.trust_remote_code
+        )
+    except Exception as e:
+        print(f"Error loading model: {e}")
+        print(f"Ensure '{model_path}' exists and contains the model weights.")
+        sys.exit(1)
+    # Defining stopping criteria
+    stop_tokens = ["<|endoftext|>", "<|im_end|>"]
+    eos_token_ids = [tokenizer.eos_token_id] if tokenizer.eos_token_id is not None else []
+    for token in stop_tokens:
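+        t_id = tokenizer.convert_tokens_to_ids(token)
+        # Skip tokens missing from the vocab: some tokenizers map them to the unk id instead of None.
+        if t_id is not None and t_id != tokenizer.unk_token_id and t_id not in eos_token_ids: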
+            eos_token_ids.append(t_id)
+    streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
+    conversation_history = [
+        {"role": "system", "content": "You are Quintus, a highly capable AI assistant created by Muskula Rahul. You are helpful, precise, and logically sound."}
+    ]
+    print()
+    print("Quintus Chat (type 'quit' to exit)")
+    print()
+    while True:
+        try:
+            user_input = input("You: ").strip()
+            if user_input.lower() in ["quit", "exit"]:
+                print("\nGoodbye!")
+                break
+            if not user_input:
+                continue
+            conversation_history.append({"role": "user", "content": user_input})
+            prompt = tokenizer.apply_chat_template(
+                conversation_history,
+                tokenize=False,
+                add_generation_prompt=True
+            )
+            inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+            print("Quintus: ", end="", flush=True)
+            with torch.no_grad():
+                outputs = model.generate(
+                    **inputs,
+                    max_new_tokens=512,
+                    temperature=0.7,
+                    top_p=0.9,
+                    do_sample=True,
+                    streamer=streamer,
+                    pad_token_id=tokenizer.eos_token_id,
+                    eos_token_id=eos_token_ids
+                )
+            # Extract response for history
+            generated_ids = outputs[0][inputs.input_ids.shape[-1]:]
+            assistant_response = tokenizer.decode(generated_ids, skip_special_tokens=True).strip()
+            conversation_history.append({"role": "assistant", "content": assistant_response})
+            print()
+        except (KeyboardInterrupt, EOFError):  # Ctrl-C, or Ctrl-D / end of piped stdin
+            print("\n\nGoodbye!")
+            break
+if __name__ == "__main__":
+    main()

sft/evaluate.py ADDED

	@@ -0,0 +1,267 @@

+# Automated benchmark runner: EvalPlus (HumanEval, MBPP) plus lm-eval tasks
+# (GSM8K, WinoGrande, ARC-Challenge, BoolQ, PIQA).
+# Both runners use the vLLM backend; EvalPlus runs in greedy mode.
+import os
+import sys
+import subprocess
+import time
+import json
+import re
+from pathlib import Path
+from datetime import datetime
+from huggingface_hub import snapshot_download
+MODELS = [
+    {
+        "name": "Quintus-1.7B",
+        "id": "iamrahulreddy/Quintus",
+        "is_local": False
+    },
+    {
+        "name": "Qwen3-1.7B-Instruct",
+        "id": "Qwen/Qwen3-1.7B",
+        "is_local": False
+    },
+    {
+        "name": "Qwen3-1.7B-Base",
+        "id": "Qwen/Qwen3-1.7B-Base",
+        "is_local": False
+    }
+]
+DATASETS = [
+    "humaneval", "mbpp",          # EvalPlus benchmarks
+    "gsm8k", "winogrande",        # lm-eval fast benchmarks
+    "arc_challenge", "boolq", "piqa"
+]
+EVALPLUS_DATASETS = {"humaneval", "mbpp"}
+LM_EVAL_SHOTS = {
+    "gsm8k": "10",
+    "winogrande": "5",
+    "arc_challenge": "25",
+    "boolq": "0",
+    "piqa": "0"
+}
+HF_TOKEN = os.environ.get("HF_TOKEN")
+TRUST_REMOTE_CODE = os.environ.get("QUINTUS_TRUST_REMOTE_CODE", "").strip().lower() in {"1", "true", "yes", "on"}
+def extract_lm_eval_score(results_dir: Path, task: str) -> str:
+    """Finds and extracts the primary score from JSON files outputted by lm-evaluation-harness."""
+    for json_path in sorted(results_dir.rglob("*.json"), reverse=True):
+        try:
+            with open(json_path, encoding="utf-8") as fh:
+                data = json.load(fh)
+            task_results = data.get("results", {})
+            for candidate in (task, f"leaderboard_{task}"):
+                if candidate in task_results:
+                    task_data = task_results[candidate]
+                    # Try common metric names
+                    for metric in ["acc,none", "acc_norm,none", "exact_match,strict-match", "exact_match,none"]:
+                        if metric in task_data:
+                            return f"{task_data[metric]*100:.1f}"
+        except Exception:
+            continue
+    return "N/A"
+def is_noise(line: str) -> bool:
+    l = line.strip()
+    if not l:
+        return False
+    # Progress bar indicators & block characters
+    if any(c in l for c in ["█", "━", "╸", "•", "━━━━━━━━"]):
+        return True
+    # vLLM, ray, flash_attn, huggingface setup/warnings logs
+    noise_keywords = [
+        # "ERROR " is deliberately not listed so subprocess error lines stay visible.
+        "INFO ", "WARNING ", "DEBUG ", "(EngineCore",
+        "Loading safetensors", "Capturing CUDA graphs",
+        "Codegen:", "Downloading dataset", "downloading dataset",
+        "Initializing a decoder", "Unknown vLLM environment",
+        "world_size=", "Using V2 Model Runner", "Model loading took",
+        "Using FLASH_ATTN", "Using FlashAttention", "Kernel JIT monitor",
+        "autotuner.py", "autotuning", "Autotuning", "loading weights",
+        "Loading weights", "Failed to get device capability", "Sanitized code outputs",
+        "Raw outputs will be saved", "init engine", "Dynamo bytecode",
+        "Directly load the compiled graph", "Directly load AOT compilation", "torch.compile took"
+    ]
+    if any(k.lower() in l.lower() for k in noise_keywords):
+        return True
+    # TQDM lines (e.g. 100%|... [00:17<00:00, 9.45it/s])
+    if "%|" in l and ("it/s" in l or "s/it" in l):
+        return True
+    return False
+def main():
+    print("=" * 80)
+    print("              EVALPLUS BENCHMARK RUNNER (HUMANEVAL & MBPP)")
+    print("=" * 80)
+    print(f"Timestamp: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
+    print(f"Models to evaluate: {[m['name'] for m in MODELS]}")
+    print(f"Datasets: {DATASETS}")
+    print("=" * 80)
+    # Set optional HF token and runtime configuration.
+    if HF_TOKEN:
+        os.environ["HF_TOKEN"] = HF_TOKEN
+    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
+    os.environ["TOKENIZERS_PARALLELISM"] = "false"
+    os.environ["VLLM_MAX_MODEL_LEN"] = "4096"
+    # Step 1: Pre-download and prepare model caches
+    print("\n--- STAGE 1: WARMING UP MODEL WEIGHTS CACHE ---")
+    # Cache all models
+    for model in MODELS:
+        if model["is_local"]:
+            continue
+        print(f"\n[DOWNLOADING] Fetching cache for {model['name']} ({model['id']})...")
+        try:
+            snapshot_download(
+                repo_id=model["id"],
+                token=HF_TOKEN or None
+            )
+            print(f"[DOWNLOAD SUCCESS] {model['name']} is cached and ready.")
+        except Exception as e:
+            print(f"[DOWNLOAD WARNING] Could not pre-download model {model['name']} via snapshot_download: {e}")
+            print("The evaluation run will attempt to download it directly during execution.")
+    print("\n--- STAGE 2: SEQUENTIAL EVALPLUS EVALUATION ---")
+    results = []
+    # Run evaluations sequentially
+    for model in MODELS:
+        # Resolve path
+        model_path = str(Path(model["id"]).resolve()) if model["is_local"] else model["id"]
+        for dataset in DATASETS:
+            print(f"\n[STARTING] Evaluating {model['name']} on {dataset}...")
+            print("-" * 60)
+            if dataset in EVALPLUS_DATASETS:
+                cmd = [
+                    sys.executable, "-m", "evalplus.evaluate",
+                    "--model", model_path,
+                    "--dataset", dataset,
+                    "--backend", "vllm",
+                    "--greedy"
+                ]
+            else:
+                shots = LM_EVAL_SHOTS.get(dataset, "0")
+                out_dir = Path("eval_results") / model["name"] / dataset
+                out_dir.mkdir(parents=True, exist_ok=True)
+                model_args = (
+                    f"pretrained={model_path},dtype=bfloat16,"
+                    f"trust_remote_code={str(TRUST_REMOTE_CODE).lower()},"
+                    "gpu_memory_utilization=0.9,max_model_len=4096"
+                )
+                cmd = [
+                    sys.executable, "-m", "lm_eval",
+                    "--model", "vllm",
+                    "--model_args", model_args,
+                    "--tasks", dataset,
+                    "--num_fewshot", shots,
+                    "--batch_size", "auto",
+                    "--output_path", str(out_dir),
+                    "--log_samples"
+                ]
+                if dataset == "gsm8k":
+                    cmd.extend(["--gen_kwargs", "max_gen_toks=512"])
+            print(f"Running command: {' '.join(cmd)}")
+            start_time = time.time()
+            try:
+                # Run the command and stream output
+                process = subprocess.Popen(
+                    cmd,
+                    stdout=subprocess.PIPE,
+                    stderr=subprocess.STDOUT,
+                    text=True,
+                    bufsize=1
+                )
+                # Stream and capture output (filtering out vLLM and progress bar noise)
+                stdout_text = ""
+                for line in process.stdout:
+                    stdout_text += line
+                    if not is_noise(line):
+                        print(line, end="")
+                process.wait()
+                duration = time.time() - start_time
+                time.sleep(5)  # Let OS/driver fully reclaim GPU VRAM before starting next subprocess
+                score_str = "N/A"
+                if process.returncode == 0:
+                    print(f"[SUCCESS] Completed {model['name']} on {dataset} in {duration:.1f} seconds.")
+                    # Parse scores
+                    if dataset in EVALPLUS_DATASETS:
+                        # Find all pass@1 scores
+                        matches = re.findall(r"pass@1:\s+([0-9.]+)", stdout_text)
+                        if len(matches) >= 2:
+                            val0 = float(matches[0])
+                            val1 = float(matches[1])
+                            if val0 <= 1.0: val0 *= 100
+                            if val1 <= 1.0: val1 *= 100
+                            score_str = f"Base: {val0:.1f} | Plus: {val1:.1f}"
+                        elif len(matches) == 1:
+                            val0 = float(matches[0])
+                            if val0 <= 1.0: val0 *= 100
+                            score_str = f"Base: {val0:.1f}"
+                    else:
+                        score_str = extract_lm_eval_score(out_dir, dataset)
+                    results.append({
+                        "model": model["name"],
+                        "dataset": dataset,
+                        "status": "Success",
+                        "score": score_str,
+                        "duration": f"{duration/60:.1f} min"
+                    })
+                else:
+                    print(f"[ERROR] command failed with exit code {process.returncode}")
+                    results.append({
+                        "model": model["name"],
+                        "dataset": dataset,
+                        "status": f"Failed ({process.returncode})",
+                        "score": "ERROR",
+                        "duration": f"{duration/60:.1f} min"
+                    })
+            except Exception as e:
+                duration = time.time() - start_time
+                print(f"[ERROR] Failed to run benchmark: {e}")
+                results.append({
+                    "model": model["name"],
+                    "dataset": dataset,
+                    "status": "Error",
+                    "score": "ERROR",
+                    "duration": f"{duration/60:.1f} min"
+                })
+            print("-" * 60)
+    # Print and save summary report
+    report_lines = []
+    report_lines.append("\n" + "=" * 100)
+    report_lines.append("                       BENCHMARK RUN SUMMARY")
+    report_lines.append("=" * 100)
+    report_lines.append(f"| {'Model':<30} | {'Dataset':<15} | {'Score':<25} | {'Status':<10} | {'Time':<8} |")
+    report_lines.append(f"|{'-'*32}|{'-'*17}|{'-'*27}|{'-'*12}|{'-'*10}|")
+    for r in results:
+        report_lines.append(f"| {r['model']:<30} | {r['dataset']:<15} | {r['score']:<25} | {r['status']:<10} | {r['duration']:<8} |")
+    report_lines.append("=" * 100)
+    report_text = "\n".join(report_lines)
+    print(report_text)
+    print("\nNote: Results are saved in the default EvalPlus directory and eval_results/.")
+    # Save to file
+    with open("qwen_quintus_scores.txt", "w", encoding="utf-8") as f:
+        f.write(report_text + "\n")
+    print("\n[SUCCESS] Final score report saved to 'qwen_quintus_scores.txt'")
+if __name__ == "__main__":
+    main()

sft/train_sft.py ADDED

	@@ -0,0 +1,690 @@

+# SFT Training and Downstream Evaluation Pipeline
+from __future__ import annotations
+import argparse
+import gc
+import json
+import os
+import re
+import sys
+import time
+from datetime import datetime
+from pathlib import Path
+import yaml
+import torch
+import torch.nn.functional as F
+from torch.utils.data import DataLoader, Dataset
+from transformers import AutoModelForCausalLM, AutoTokenizer, get_cosine_schedule_with_warmup
+# Load Configuration
+def load_config() -> dict:
+    cfg_path = Path(__file__).resolve().parent / "config.yaml"
+    if not cfg_path.exists():
+        return {}
+    with open(cfg_path, "r", encoding="utf-8") as f:
+        return yaml.safe_load(f) or {}
+cfg = load_config()
+# PROMPTS (50 PROMPTS)
+EASY_PROMPTS = [
+    "What is the capital of Japan, and what is it known for?",
+    "What does the term 'CPU' stand for, and what is its role in a computer?",
+    "Name three mammals that live primarily in water.",
+    "What is the difference between a virus and a bacterium?",
+    "Convert 72 degrees Fahrenheit to Celsius.",
+    "What is the purpose of a hash function?",
+    "What does HTTP stand for and what is it used for?",
+    "In which continent is the Amazon rainforest located?",
+    "What is the difference between RAM and ROM?",
+    "Name two programming languages commonly used for data science.",
+    "What is the function of the mitochondria in a cell?",
+    "What is a palindrome? Give two examples.",
+    "What is the difference between a compiler and an interpreter?",
+    "What unit is used to measure electrical resistance?",
+    "Name the four blood types in the ABO system.",
+    "What is the primary purpose of DNS in networking?",
+    "What does it mean for a function to be 'pure' in programming?"
+]
+MEDIUM_PROMPTS = [
+    "Explain the difference between supervised and unsupervised learning with a concrete example of each.",
+    "Write a Python function that takes a list of integers and returns all pairs that sum to a given target.",
+    "Explain how TCP/IP ensures reliable data delivery over an unreliable network.",
+    "What are the trade-offs between using a relational database and a document store for a user profile system?",
+    "Describe how gradient descent works and explain the role of the learning rate.",
+    "Write a SQL query that returns the top 5 customers by total order value, including customers with no orders.",
+    "What is the CAP theorem and what does it imply for distributed system design?",
+    "Explain the difference between process and thread, including when you would prefer one over the other.",
+    "How does HTTPS prevent a man-in-the-middle attack? Walk through the handshake at a high level.",
+    "Write a regex that validates an email address and annotate each part of the pattern.",
+    "What is the difference between memoization and dynamic programming?",
+    "Describe three ways to handle class imbalance in a machine learning dataset.",
+    "Explain what a foreign key constraint does and give an example of why it matters.",
+    "What is the difference between horizontal and vertical scaling, and when would you choose each?",
+    "How does Python's garbage collector handle circular references?",
+    "Explain the intuition behind the attention mechanism in Transformer models.",
+    "What is a race condition? Write a minimal pseudocode example that demonstrates one."
+]
+TOUGH_PROMPTS = [
+    "Design a rate limiter for a public API that must handle 100k requests per second across multiple regions. Describe the data structures, algorithms, and infrastructure trade-offs involved.",
+    "Explain why training very deep neural networks with sigmoid activations suffers from vanishing gradients. How do residual connections and normalization layers address this, and what are their respective limitations?",
+    "A message queue is consuming events from an upstream producer faster than a downstream consumer can process them. The queue is filling up and the producer cannot be slowed down. Describe at least three architectural strategies to resolve this, with trade-offs.",
+    "Given an undirected weighted graph, write Python code to find the minimum spanning tree using Kruskal's algorithm. Include the union-find data structure. Analyze time and space complexity.",
+    "You are given two sorted arrays of size m and n. Find the median of the combined array in O(log(m+n)) time. Explain the approach before writing the code.",
+    "Explain the difference between Byzantine fault tolerance and crash fault tolerance. In what scenario does the distinction become critical, and how does a consensus protocol like PBFT address Byzantine failures?",
+    "A large language model fine-tuned on customer service data starts producing confident but factually wrong answers about product details. Propose a complete mitigation strategy covering training, inference, and deployment layers.",
+    "Explain the mechanism behind speculative execution in modern CPUs and how it led to the Spectre vulnerability. What classes of software-level mitigations exist and what performance cost do they carry?",
+    "Design a schema and indexing strategy for a social graph where you need to efficiently answer: (1) mutual friends between two users, (2) shortest path between two users, (3) top-k most influential accounts. Justify your choices.",
+    "Implement a thread-safe LRU cache in Python with O(1) get and put operations. Explain why your synchronization approach is correct and where contention bottlenecks might appear under high concurrency.",
+    "Explain the difference between weak, strong, and eventual consistency in distributed databases. Give a concrete example of a bug that arises when a developer assumes strong consistency but the system only guarantees eventual consistency.",
+    "You are designing the storage layer for a time-series database that ingests 1 million data points per second and must support range queries going back 2 years. Describe compression strategies, write amplification concerns, and compaction trade-offs.",
+    "Explain how LoRA (Low-Rank Adaptation) reduces the number of trainable parameters in fine-tuning. Derive why a weight update matrix can be approximated as a product of two low-rank matrices and discuss what is lost in this approximation.",
+    "A binary tree is given where each node has a value. Write an algorithm to find the maximum path sum between any two nodes (not necessarily leaf nodes). Prove the correctness of your recurrence relation.",
+    "Explain the economic concept of Goodhart's Law and give three examples of how it manifests in AI system evaluation.",
+    "Describe the full lifecycle of a memory allocation in a system using jemalloc or tcmalloc. How do thread-local caches, size classes, and slab allocation interact, and what are the implications for long-running server processes?"
+]
+ALL_PROMPTS = [
+    {"text": p, "difficulty": difficulty}
+    for difficulty, prompts in (
+        ("EASY", EASY_PROMPTS),
+        ("MEDIUM", MEDIUM_PROMPTS),
+        ("TOUGH", TOUGH_PROMPTS),
+    )
+    for p in prompts
+]
+# UTILITIES AND DATASET LOADERS
+class SFTDataset(Dataset):
+    def __init__(self, file_path: str, max_samples: int = -1):
+        self.samples = []
+        with open(file_path, "r", encoding="utf-8") as f:
+            for line in f:
+                if 0 < max_samples <= len(self.samples):
+                    break
+                if not line.strip():
+                    continue  # Skip blank lines instead of failing in json.loads
+                self.samples.append(json.loads(line))
+        print(f"Loaded {len(self.samples)} SFT samples from {file_path}")
+    def __len__(self) -> int:
+        return len(self.samples)
+    def __getitem__(self, idx: int) -> dict:
+        return self.samples[idx]
+def pack_sequences(samples: list[dict], pack_length: int, pad_token_id: int, eos_token_id: int) -> list[dict]:
+    """Sort and pack short samples into fixed-size bins (FFD packing) to accelerate training."""
+    print(f"Packing sequences into {pack_length}-token bins...")
+    # Sort samples by input_ids length descending
+    indexed_samples = sorted(
+        samples,
+        key=lambda x: len(x["input_ids"]),
+        reverse=True
+    )
+    bins: list[list[dict]] = []
+    bin_lengths: list[int] = []
+    for sample in indexed_samples:
+        s_len = len(sample["input_ids"])
+        if s_len > pack_length:
+            # Truncate on a shallow copy so the caller's sample dicts are not mutated.
+            sample = {
+                **sample,
+                "input_ids": sample["input_ids"][:pack_length],
+                "loss_mask": sample["loss_mask"][:pack_length],
+            }
+            s_len = pack_length
+        # Try to place sample into an existing bin
+        placed = False
+        for b_idx in range(len(bins)):
+            needed = s_len + (1 if len(bins[b_idx]) > 0 else 0)
+            if bin_lengths[b_idx] + needed <= pack_length:
+                bins[b_idx].append(sample)
+                bin_lengths[b_idx] += needed
+                placed = True
+                break
+        if not placed:
+            bins.append([sample])
+            bin_lengths.append(s_len)
+    # Convert packed bins to training formats
+    packed_samples = []
+    for b in bins:
+        input_ids = []
+        loss_mask = []
+        for i, sample in enumerate(b):
+            if i > 0:
+                input_ids.append(eos_token_id)
+                loss_mask.append(0)  # Mask out the EOS separator token
+            input_ids.extend(sample["input_ids"])
+            loss_mask.extend(sample["loss_mask"])
+        real_len = len(input_ids)
+        pad_len = pack_length - real_len
+        if pad_len > 0:
+            input_ids.extend([pad_token_id] * pad_len)
+            loss_mask.extend([0] * pad_len)
+        packed_samples.append({
+            "input_ids": torch.tensor(input_ids, dtype=torch.long),
+            "loss_mask": torch.tensor(loss_mask, dtype=torch.long),
+            "attention_mask": torch.cat([
+                torch.ones(real_len, dtype=torch.long),
+                torch.zeros(pad_len, dtype=torch.long)
+            ])
+        })
+    # Guard against an empty input list, which would otherwise divide by zero.
+    utilization = sum(bin_lengths) / (len(bins) * pack_length) if bins else 0.0
+    print(f"Packed {len(samples)} samples into {len(bins)} bins. Utilization: {utilization * 100:.2f}%")
+    return packed_samples
+def collate_sft(batch: list[dict], pad_token_id: int) -> dict:
+    """Collates batch for standard unpacked training, dynamically padding batch to max length."""
+    max_len = max(len(s["input_ids"]) for s in batch)
+    input_ids_list = []
+    attention_mask_list = []
+    labels_list = []
+    for s in batch:
+        ids = s["input_ids"]
+        mask = s["loss_mask"]
+        pad_len = max_len - len(ids)
+        padded_ids = ids + [pad_token_id] * pad_len
+        padded_labels = [ids[i] if mask[i] == 1 else -100 for i in range(len(ids))] + [-100] * pad_len
+        input_ids_list.append(torch.tensor(padded_ids, dtype=torch.long))
+        attention_mask_list.append(torch.tensor([1] * len(ids) + [0] * pad_len, dtype=torch.long))
+        labels_list.append(torch.tensor(padded_labels, dtype=torch.long))
+    return {
+        "input_ids": torch.stack(input_ids_list),
+        "attention_mask": torch.stack(attention_mask_list),
+        "labels": torch.stack(labels_list)
+    }
+def collate_packed(batch: list[dict]) -> dict:
+    """Collates pre-packed sequence bins by simple stacking."""
+    input_ids = torch.stack([item["input_ids"] for item in batch])
+    attention_mask = torch.stack([item["attention_mask"] for item in batch])
+    loss_mask = torch.stack([item["loss_mask"] for item in batch])
+    labels = input_ids.clone()
+    labels = labels.masked_fill(loss_mask == 0, -100)
+    return {
+        "input_ids": input_ids,
+        "attention_mask": attention_mask,
+        "labels": labels
+    }
+# PARSING AND MAIN LOGIC
+def parse_args() -> argparse.Namespace:
+    parser = argparse.ArgumentParser(description="Clean SFT training and evaluation suite")
+    parser.add_argument("--student_model", type=str, default=cfg.get("model", {}).get("student", "Qwen/Qwen3-1.7B-Base"))
+    parser.add_argument("--tokenizer_model", type=str, default=cfg.get("model", {}).get("tokenizer", "Qwen/Qwen3-1.7B"))
+    parser.add_argument("--data_repo", type=str, default=os.environ.get("QUINTUS_SFT_DATA_REPO"), help="HF dataset repo containing train_sft.jsonl. Optional when data/tokenized/train_sft.jsonl exists.")
+    parser.add_argument("--token", type=str, default=None)
+    parser.add_argument("--trust_remote_code", action="store_true", help="Allow custom code from model/tokenizer repositories.")
+    parser.add_argument("--num_epochs", type=int, default=1)
+    parser.add_argument("--learning_rate", type=float, default=2e-5)
+    parser.add_argument("--micro_batch_size", type=int, default=4)
+    parser.add_argument("--grad_accum_steps", type=int, default=2)
+    parser.add_argument("--max_seq_len", type=int, default=4096)
+    parser.add_argument("--sequence_packing", action="store_true", default=True)
+    parser.add_argument("--no_sequence_packing", action="store_false", dest="sequence_packing")
+    parser.add_argument("--output_dir", type=str, default="quintus_sft_output")
+    parser.add_argument("--run_prompt_suite", action="store_true", default=True)
+    parser.add_argument("--no_prompt_suite", action="store_false", dest="run_prompt_suite")
+    parser.add_argument("--run_gsm8k", action="store_true", default=True)
+    parser.add_argument("--no_gsm8k", action="store_false", dest="run_gsm8k")
+    parser.add_argument("--gsm8k_samples", type=int, default=100)
+    parser.add_argument("--optim", type=str, choices=["adamw", "adamw_8bit"], default="adamw")
+    parser.add_argument("--gradient_checkpointing", action="store_true", default=False)
+    parser.add_argument("--load_in_4bit", action="store_true", default=False)
+    parser.add_argument("--use_lora", action="store_true", default=False)
+    parser.add_argument("--lora_r", type=int, default=8)
+    parser.add_argument("--lora_alpha", type=int, default=16)
+    parser.add_argument("--push_to_hub", action="store_true", default=False, help="Automatically push fine-tuned model to Hugging Face Hub after training")
+    parser.add_argument("--hub_model_id", type=str, default="iamrahulreddy/Quintus", help="Target Hugging Face Hub repository ID")
+    return parser.parse_args()
+def download_hf_dataset(repo_id: str | None, token: str | None) -> str:
+    print("Checking for tokenized dataset in local folders...")
+    local_path = "data/tokenized/train_sft.jsonl"
+    if os.path.exists(local_path):
+        print(f"Found local dataset: {local_path}")
+        return local_path
+    if not repo_id:
+        raise ValueError(
+            "No local SFT dataset found at data/tokenized/train_sft.jsonl. "
+            "Pass --data_repo or set QUINTUS_SFT_DATA_REPO."
+        )
+    print(f"Local file not found. Pulling from Hugging Face: {repo_id}...")
+    from huggingface_hub import hf_hub_download
+    os.makedirs("data/tokenized", exist_ok=True)
+    downloaded = hf_hub_download(
+        repo_id=repo_id,
+        filename="train_sft.jsonl",
+        repo_type="dataset",
+        local_dir="data/tokenized",
+        token=token
+    )
+    # Ensure correct local path layout
+    if os.path.exists(downloaded) and os.path.abspath(downloaded) != os.path.abspath(local_path):
+        os.rename(downloaded, local_path)
+    print(f"Dataset downloaded to: {local_path}")
+    return local_path
+# DOWNSTREAM EVALUATION CODE
+def run_prompt_suite(model, tokenizer, device, output_dir: str):
+    print("\n" + "="*70)
+    print(f"RUNNING QUALITATIVE PROMPT SUITE ({len(ALL_PROMPTS)} Prompts)")
+    print("="*70)
+    # Compile stop token IDs
+    eos_token_ids = [tokenizer.eos_token_id]
+    for token in ["<|im_end|>", "<|endoftext|>", "<|im_start|>"]:
+        t_id = tokenizer.convert_tokens_to_ids(token)
+        if t_id is not None and t_id != tokenizer.unk_token_id:
+            eos_token_ids.append(t_id)
+    eos_token_ids = list(set(eos_token_ids))
+    # Initialize output file
+    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
+    out_path = os.path.join(output_dir, f"prompt_suite_eval_{timestamp}.txt")
+    os.makedirs(output_dir, exist_ok=True)
+    with open(out_path, "w", encoding="utf-8") as f:
+        f.write("QUINTUS SFT POST-TRAINING PROMPT SUITE\n")
+        f.write(f"Timestamp: {timestamp}\n")
+        f.write("="*72 + "\n\n")
+        f.flush()
+    # Set padding side to left for batch generation
+    orig_padding_side = tokenizer.padding_side
+    tokenizer.padding_side = "left"
+    batch_size = 16
+    for i in range(0, len(ALL_PROMPTS), batch_size):
+        batch_items = ALL_PROMPTS[i : i + batch_size]
+        # Format prompts
+        formatted_prompts = []
+        for item in batch_items:
+            prompt_text = item["text"]
+            if tokenizer.chat_template is not None:
+                prompt_str = tokenizer.apply_chat_template(
+                    [{"role": "user", "content": prompt_text}],
+                    tokenize=False, add_generation_prompt=True
+                )
+            else:
+                prompt_str = f"<|im_start|>user\n{prompt_text}<|im_end|>\n<|im_start|>assistant\n"
+            formatted_prompts.append(prompt_str)
+        # Tokenize with padding
+        inputs = tokenizer(formatted_prompts, padding=True, return_tensors="pt").to(device)
+        with torch.no_grad():
+            outputs = model.generate(
+                **inputs,
+                max_new_tokens=2048,
+                do_sample=False,  # Greedy for clean, reproducible comparison
+                pad_token_id=tokenizer.pad_token_id,
+                eos_token_id=eos_token_ids
+            )
+        # Decode and write results in real-time
+        for idx, item in enumerate(batch_items):
+            input_len = inputs["input_ids"][idx].shape[0]
+            gen_tokens = outputs[idx][input_len:]
+            # Slice at the first EOS token
+            eos_indices = []
+            for eos_id in eos_token_ids:
+                indices = (gen_tokens == eos_id).nonzero(as_tuple=True)[0]
+                if len(indices) > 0:
+                    eos_indices.append(indices[0].item())
+            if eos_indices:
+                gen_tokens = gen_tokens[:min(eos_indices)]
+            response = tokenizer.decode(gen_tokens, skip_special_tokens=True).strip()
+            # Log progress
+            global_idx = i + idx + 1
+            print(f"[{global_idx:02d}/{len(ALL_PROMPTS)}] ({item['difficulty']}) Q: {item['text'][:40]}... -> Answered ({len(gen_tokens)} tokens)")
+            # Append directly to output file
+            with open(out_path, "a", encoding="utf-8") as f:
+                f.write(f"[{global_idx:02d}/{len(ALL_PROMPTS)}]  {item['difficulty']}\n")
+                f.write(f"Q: {item['text']}\n\n")
+                f.write(f"Response:\n{response}\n")
+                f.write("\n" + "-"*72 + "\n\n")
+                f.flush()
+    # Restore original tokenizer settings
+    tokenizer.padding_side = orig_padding_side
+    print(f"\nPrompt suite evaluation complete. Saved report to: {out_path}\n")
+def extract_gsm8k_answer(text: str) -> str | None:
+    text = text.replace(",", "")
+    match = re.findall(r"The answer is\s*:?\s*(-?\d+)", text, re.IGNORECASE)
+    if match:
+        return match[-1]
+    match = re.findall(r"(-?\d+)", text)
+    if match:
+        return match[-1]
+    return None
+def run_gsm8k_eval(model, tokenizer, device, num_samples: int = 100):
+    print("\n" + "="*70)
+    print(f"RUNNING GSM8K MATH EVALUATION ({num_samples} Samples)")
+    print("="*70)
+    from datasets import load_dataset
+    try:
+        dataset = load_dataset("openai/gsm8k", "main", split="test")
+    except Exception as e:
+        print(f"Warning: Could not download GSM8K test set directly: {e}")
+        return
+    dataset = dataset.shuffle(seed=42).select(range(min(num_samples, len(dataset))))
+    correct = 0
+    total = 0
+    for idx, item in enumerate(dataset):
+        question = item["question"]
+        answer = item["answer"]
+        target_match = re.search(r"####\s*(-?\d+)", answer)
+        if not target_match:
+            continue
+        target_val = target_match.group(1)
+        if tokenizer.chat_template is not None:
+            prompt = tokenizer.apply_chat_template(
+                [{"role": "user", "content": question + "\nShow your work and conclude with 'The answer is: <number>'."}],
+                tokenize=False, add_generation_prompt=True
+            )
+        else:
+            prompt = f"<|im_start|>user\n{question}\nShow your work and conclude with 'The answer is: <number>'.<|im_end|>\n<|im_start|>assistant\n"
+        inputs = tokenizer(prompt, return_tensors="pt").to(device)
+        with torch.no_grad():
+            outputs = model.generate(
+                **inputs,
+                max_new_tokens=1024,
+                do_sample=False,
+                pad_token_id=tokenizer.pad_token_id,
+                eos_token_id=tokenizer.eos_token_id
+            )
+        gen_tokens = outputs[0][inputs.input_ids.shape[1]:]
+        generated_text = tokenizer.decode(gen_tokens, skip_special_tokens=True).strip()
+        pred_val = extract_gsm8k_answer(generated_text)
+        is_match = (pred_val == target_val)
+        if is_match:
+            correct += 1
+        total += 1
+        # Log sample output periodically
+        if idx % 10 == 0:
+            print(f"\n[GSM8K Sample {idx+1}]")
+            print(f"Q: {question[:80]}...")
+            print(f"A: {generated_text[:120]}... (Target: {target_val} | Pred: {pred_val})")
+            print(f"Match: {is_match}")
+    accuracy = (correct / total * 100) if total > 0 else 0
+    print("\n" + "="*70)
+    print(f"GSM8K EVALUATION SUMMARY: {correct}/{total} Correct -> Accuracy: {accuracy:.2f}%")
+    print("="*70 + "\n")
+# TRAINING PIPELINE
+def main() -> None:
+    args = parse_args()
+    # Propagate HF token to environment for auto-authentication of downstream hub calls
+    try:
+        import huggingface_hub
+        cached_token = huggingface_hub.get_token()
+    except Exception:
+        cached_token = None
+    # An explicitly passed --token takes precedence over the environment and the cached login
+    resolved_token = args.token or os.environ.get("HF_TOKEN") or cached_token
+    if resolved_token:
+        os.environ["HF_TOKEN"] = resolved_token
+        args.token = resolved_token
+    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+    print(f"SFT Environment initialized. Target device: {device}")
+    # 1. Pull dataset from HF
+    try:
+        dataset_file = download_hf_dataset(args.data_repo, args.token)
+    except ValueError as exc:
+        print(f"Error: {exc}")
+        sys.exit(1)
+    # 2. Setup Tokenizer and Model
+    print(f"Loading tokenizer: {args.tokenizer_model}")
+    tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_model, trust_remote_code=args.trust_remote_code)
+    if tokenizer.pad_token is None:
+        tokenizer.pad_token = tokenizer.eos_token
+    # 4-bit configuration if requested
+    bnb_config = None
+    if args.load_in_4bit:
+        from transformers import BitsAndBytesConfig
+        bnb_config = BitsAndBytesConfig(
+            load_in_4bit=True,
+            bnb_4bit_compute_dtype=torch.bfloat16 if device.type == "cuda" else torch.float32,
+            bnb_4bit_quant_type="nf4",
+            bnb_4bit_use_double_quant=True
+        )
+        print("Using 4-bit BitsAndBytes quantization.")
+    # Liger Kernel (skipped for 4-bit/PEFT as it can interfere with quantized layers)
+    if not args.load_in_4bit:
+        try:
+            from liger_kernel.transformers import apply_liger_kernel_to_qwen3
+            apply_liger_kernel_to_qwen3(
+                rope=True,
+                swiglu=True,
+                rms_norm=True,
+                cross_entropy=False,
+                fused_linear_cross_entropy=False,
+            )
+            print("Liger Kernel optimizations applied successfully.")
+        except ImportError:
+            print("Liger Kernel not installed, skipping optimizations.")
+    attn_impl = "sdpa"
+    if device.type == "cuda":
+        try:
+            import flash_attn
+            attn_impl = "flash_attention_2"
+            print("FlashAttention-2 enabled.")
+        except ImportError:
+            print("flash-attn not installed, falling back to SDPA.")
+    model = AutoModelForCausalLM.from_pretrained(
+        args.student_model,
+        quantization_config=bnb_config,
+        dtype=torch.bfloat16 if device.type == "cuda" else torch.float32,
+        trust_remote_code=args.trust_remote_code,
+        attn_implementation=attn_impl
+    )
+    if not args.load_in_4bit:
+        model = model.to(device)
+    model.config.use_cache = False
+    # Wrap with LoRA if requested or required for 4-bit training
+    if args.use_lora or args.load_in_4bit:
+        try:
+            from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
+            if args.load_in_4bit:
+                model = prepare_model_for_kbit_training(model)
+            peft_config = LoraConfig(
+                r=args.lora_r,
+                lora_alpha=args.lora_alpha,
+                target_modules=["q_proj", "v_proj", "k_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
+                lora_dropout=0.05,
+                bias="none",
+                task_type="CAUSAL_LM"
+            )
+            model = get_peft_model(model, peft_config)
+            print("LoRA adapters successfully attached to target modules.")
+            model.print_trainable_parameters()
+        except ImportError:
+            print("Error: peft not installed. Please run `!pip install -q peft` to use LoRA/QLoRA.")
+            sys.exit(1)
+    if args.gradient_checkpointing:
+        model.gradient_checkpointing_enable(gradient_checkpointing_kwargs={"use_reentrant": False})
+        print("Gradient checkpointing enabled.")
+    # 3. Prepare dataset
+    raw_dataset = SFTDataset(dataset_file)
+    if args.sequence_packing:
+        packed_samples = pack_sequences(
+            raw_dataset.samples,
+            pack_length=args.max_seq_len,
+            pad_token_id=tokenizer.pad_token_id,
+            eos_token_id=tokenizer.eos_token_id
+        )
+        train_dataloader = DataLoader(
+            packed_samples,
+            batch_size=args.micro_batch_size,
+            shuffle=True,
+            collate_fn=collate_packed
+        )
+    else:
+        train_dataloader = DataLoader(
+            raw_dataset,
+            batch_size=args.micro_batch_size,
+            shuffle=True,
+            collate_fn=lambda b: collate_sft(b, tokenizer.pad_token_id)
+        )
+    # 4. Optimizer and scheduler setup
+    # Pass only trainable parameters: under LoRA/QLoRA the frozen base weights need no optimizer state
+    trainable_params = [p for p in model.parameters() if p.requires_grad]
+    if args.optim == "adamw_8bit":
+        try:
+            import bitsandbytes as bnb
+            optimizer = bnb.optim.AdamW8bit(trainable_params, lr=args.learning_rate, weight_decay=0.1)
+            print("Using BitsAndBytes 8-bit AdamW optimizer.")
+        except ImportError:
+            print("Warning: bitsandbytes not installed. Falling back to standard AdamW.")
+            use_fused = (device.type == "cuda")
+            optimizer = torch.optim.AdamW(trainable_params, lr=args.learning_rate, weight_decay=0.1, fused=use_fused)
+    else:
+        use_fused = (device.type == "cuda")
+        optimizer = torch.optim.AdamW(trainable_params, lr=args.learning_rate, weight_decay=0.1, fused=use_fused)
+        print(f"Using standard AdamW optimizer (fused={use_fused}).")
+    steps_per_epoch = (len(train_dataloader) + args.grad_accum_steps - 1) // args.grad_accum_steps
+    total_steps = steps_per_epoch * args.num_epochs
+    warmup_steps = int(total_steps * 0.05)
+    scheduler = get_cosine_schedule_with_warmup(optimizer, warmup_steps, total_steps)
+    # 5. Training Loop
+    print("\n" + "="*70)
+    print(f"STARTING SFT TRAINING (Epochs: {args.num_epochs} | Steps: {total_steps})")
+    print("="*70)
+    model.train()
+    step = 0
+    total_tokens_processed = 0
+    t0 = time.time()
+    for epoch in range(args.num_epochs):
+        epoch_loss = 0.0
+        for batch_idx, batch in enumerate(train_dataloader):
+            input_ids = batch["input_ids"].to(device)
+            attention_mask = batch["attention_mask"].to(device)
+            labels = batch["labels"].to(device)
+            # Accumulate the number of active (non-padded) tokens processed
+            total_tokens_processed += attention_mask.sum().item()
+            outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
+            loss = outputs.loss / args.grad_accum_steps
+            loss.backward()
+            epoch_loss += loss.item() * args.grad_accum_steps
+            if (batch_idx + 1) % args.grad_accum_steps == 0 or (batch_idx + 1) == len(train_dataloader):
+                torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
+                optimizer.step()
+                scheduler.step()
+                optimizer.zero_grad()
+                step += 1
+                if step % 5 == 0 or step == total_steps:
+                    elapsed = time.time() - t0
+                    tokens_per_sec = total_tokens_processed / max(elapsed, 1e-5)
+                    print(
+                        f"Epoch {epoch+1}/{args.num_epochs} | "
+                        f"Step {step}/{total_steps} | "
+                        f"Loss: {loss.item() * args.grad_accum_steps:.4f} | "
+                        f"LR: {scheduler.get_last_lr()[0]:.2e} | "
+                        f"Tokens: {total_tokens_processed} | "
+                        f"Speed: {tokens_per_sec:.2f} tokens/s"
+                    )
+    # 6. Save model weights and tokenizer
+    print(f"\nTraining complete in {time.time() - t0:.1f}s. Saving weights to: {args.output_dir}")
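+    os.makedirs(args.output_dir, exist_ok=True)
+    # Re-enable the KV cache (disabled for training) so the saved config and post-training generation use it
+    model.config.use_cache = True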
+    if hasattr(model, "merge_and_unload") and not args.load_in_4bit:
+        print("Merging LoRA adapters into base weights...")
+        try:
+            merged_model = model.merge_and_unload()
+            merged_model.save_pretrained(args.output_dir)
+            print("Merged model weights saved successfully.")
+        except Exception as e:
+            print(f"Failed to merge and unload: {e}. Saving adapter weights only.")
+            model.save_pretrained(args.output_dir)
+    else:
+        model.save_pretrained(args.output_dir)
+    tokenizer.save_pretrained(args.output_dir)
+    print("Weights and configuration saved successfully.")
+    # 7. SFT Downstream Evaluations
+    model.eval()
+    if args.run_prompt_suite:
+        run_prompt_suite(model, tokenizer, device, args.output_dir)
+    if args.run_gsm8k:
+        run_gsm8k_eval(model, tokenizer, device, num_samples=args.gsm8k_samples)
+    if args.push_to_hub:
+        print(f"\nUploading fine-tuned model and tokenizer to Hugging Face Hub: {args.hub_model_id}...")
+        try:
+            from huggingface_hub import create_repo, HfApi
+            token_val = args.token or os.environ.get("HF_TOKEN")
+            create_repo(repo_id=args.hub_model_id, token=token_val, exist_ok=True)
+            api = HfApi()
+            api.upload_folder(
+                folder_path=args.output_dir,
+                repo_id=args.hub_model_id,
+                repo_type="model",
+                token=token_val
+            )
+            print("Successfully uploaded model and tokenizer to Hugging Face Hub!")
+        except Exception as hub_err:
+            print(f"Failed to push to Hub: {hub_err}")
+    print("Pipeline Execution Complete. Model is ready.")
+if __name__ == "__main__":
+    main()
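
The step accounting in the training loop above can be checked in isolation. The following is a minimal sketch, not part of the commit: the helper names and the batch counts are hypothetical, and only the ceiling-division formula, the step condition, and the 5% warmup fraction are taken from the script.

```python
def optimizer_steps_per_epoch(num_batches: int, grad_accum_steps: int) -> int:
    """Ceiling division, mirroring the script's steps_per_epoch formula."""
    return (num_batches + grad_accum_steps - 1) // grad_accum_steps


def simulate_epoch(num_batches: int, grad_accum_steps: int) -> int:
    """Count how often the loop's optimizer-step condition fires in one epoch."""
    steps = 0
    for batch_idx in range(num_batches):
        is_boundary = (batch_idx + 1) % grad_accum_steps == 0
        is_last_batch = (batch_idx + 1) == num_batches
        if is_boundary or is_last_batch:
            steps += 1
    return steps


# Hypothetical run: 1001 micro-batches per epoch, accumulation of 2, 3 epochs.
num_batches, grad_accum_steps, num_epochs = 1001, 2, 3
per_epoch = optimizer_steps_per_epoch(num_batches, grad_accum_steps)
total_steps = per_epoch * num_epochs
warmup_steps = int(total_steps * 0.05)  # same 5% warmup fraction as the script

print(per_epoch, simulate_epoch(num_batches, grad_accum_steps), total_steps, warmup_steps)
```

The formula and the loop condition agree because the final, possibly partial, accumulation group still triggers a step. Since the loss is always divided by `grad_accum_steps`, that last partial group contributes a proportionally smaller gradient.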

src/__init__.py ADDED

	@@ -0,0 +1 @@


+# src package

src/checkpoints.py ADDED

	@@ -0,0 +1,241 @@

+from __future__ import annotations
+import json
+import os
+import time
+from pathlib import Path
+import torch
+from configs import cfg
+def checkpoint_rank(path: str) -> tuple[int, int]:
+    name = os.path.basename(path)
+    prefix, _, raw_value = name.partition("_")
+    try:
+        value = int(raw_value)
+    except ValueError:
+        value = -1
+    if prefix == "epoch":
+        return (2, value)
+    if prefix == "step":
+        return (1, value)
+    return (0, value)
+def find_latest_training_checkpoint(output_dir: str) -> str | None:
+    candidates = []
+    for pattern in ("epoch_*", "step_*"):
+        candidates.extend(str(path) for path in Path(output_dir).glob(pattern) if path.is_dir())
+    if not candidates:
+        return None
+    return max(candidates, key=checkpoint_rank)
+def load_trainer_state(checkpoint_dir: str, log) -> dict:
+    state_path = os.path.join(checkpoint_dir, "trainer_state.json")
+    if os.path.exists(state_path):
+        try:
+            with open(state_path, "r", encoding="utf-8") as f:
+                state = json.load(f)
+            if isinstance(state, dict):
+                return state
+        except (OSError, json.JSONDecodeError) as exc:
+            log.warning(f"Could not read trainer_state.json from {checkpoint_dir}: {exc}")
+    name = os.path.basename(checkpoint_dir)
+    prefix, _, raw_value = name.partition("_")
+    try:
+        value = int(raw_value)
+    except ValueError:
+        value = 0
+    if prefix == "epoch":
+        return {
+            "checkpoint_type": "epoch",
+            "start_epoch": value,
+            "global_step": 0,
+            "micro_step_global": 0,
+            "next_batch_in_epoch": 0,
+        }
+    if prefix == "step":
+        return {
+            "checkpoint_type": "step",
+            "start_epoch": 0,
+            "global_step": value,
+            "micro_step_global": 0,
+            "next_batch_in_epoch": 0,
+        }
+    return {}
+def packing_checkpoint_metadata(enabled: bool, pack_length: int | None, max_seq_len: int) -> dict[str, int | bool | None]:
+    return {
+        "sequence_packing_enabled": bool(enabled),
+        "sequence_packing_pack_length": int(pack_length) if enabled and pack_length is not None else None,
+        "data_max_seq_len": int(max_seq_len),
+    }
+def validate_resume_packing_state(
+    trainer_state: dict,
+    *,
+    enabled: bool,
+    pack_length: int,
+    max_seq_len: int,
+    log,
+) -> None:
+    checkpoint_enabled = bool(trainer_state.get("sequence_packing_enabled", False))
+    if checkpoint_enabled != bool(enabled):
+        log.error(
+            "Checkpoint sequence-packing state does not match the current run: "
+            f"checkpoint={checkpoint_enabled}, current={bool(enabled)}."
+        )
+        raise SystemExit(1)
+    if checkpoint_enabled:
+        checkpoint_pack_length = trainer_state.get("sequence_packing_pack_length")
+        try:
+            checkpoint_pack_length = int(checkpoint_pack_length)
+        except (TypeError, ValueError):
+            log.error("Checkpoint is missing a valid sequence_packing_pack_length value.")
+            raise SystemExit(1)
+        if checkpoint_pack_length != int(pack_length):
+            log.error(
+                "Checkpoint pack length does not match the current run: "
+                f"checkpoint={checkpoint_pack_length}, current={int(pack_length)}."
+            )
+            raise SystemExit(1)
+    checkpoint_max_seq_len = trainer_state.get("data_max_seq_len")
+    if checkpoint_max_seq_len is not None:
+        try:
+            checkpoint_max_seq_len = int(checkpoint_max_seq_len)
+        except (TypeError, ValueError):
+            log.error("Checkpoint is missing a valid data_max_seq_len value.")
+            raise SystemExit(1)
+        if checkpoint_max_seq_len != int(max_seq_len):
+            log.error(
+                "Checkpoint max sequence length does not match the current run: "
+                f"checkpoint={checkpoint_max_seq_len}, current={int(max_seq_len)}."
+            )
+            raise SystemExit(1)
+def save_checkpoint(
+    model,
+    tokenizer,
+    output_dir: str,
+    tag: str,
+    logger,
+    *,
+    scheduler=None,
+    trainer_state: dict | None = None,
+) -> str:
+    save_dir = os.path.join(output_dir, tag)
+    os.makedirs(save_dir, exist_ok=True)
+    save_start = time.time()
+    logger.info(f"[CKPT] Saving {tag} -> {save_dir}/")
+    model_to_save = model.module if hasattr(model, "module") else model
+    if hasattr(model_to_save, "_orig_mod"):
+        model_to_save = model_to_save._orig_mod
+    model_to_save.config.save_pretrained(save_dir)
+    tokenizer.save_pretrained(save_dir)
+    try:
+        from safetensors.torch import save_file
+        state_dict = {k: v.contiguous().cpu() for k, v in model_to_save.state_dict().items()}
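+        # Tag the file as a PyTorch checkpoint so that transformers' loader accepts it
+        save_file(state_dict, os.path.join(save_dir, "model.safetensors"), metadata={"format": "pt"})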
+        logger.info("[CKPT] Saved via safetensors")
+    except ImportError:
+        torch.save(model_to_save.state_dict(), os.path.join(save_dir, "pytorch_model.bin"))
+        logger.info("[CKPT] Saved via torch.save")
+    if scheduler is not None:
+        torch.save(scheduler.state_dict(), os.path.join(save_dir, "scheduler.pt"))
+    if trainer_state is not None:
+        trainer_state = dict(trainer_state)
+        trainer_state.setdefault("tag", tag)
+        trainer_state.setdefault("saved_at", time.strftime("%Y-%m-%d %H:%M:%S %Z"))
+        with open(os.path.join(save_dir, "trainer_state.json"), "w", encoding="utf-8") as f:
+            json.dump(trainer_state, f, indent=2)
+    size_mb = sum(f.stat().st_size for f in Path(save_dir).rglob("*") if f.is_file()) / 1e6
+    save_elapsed = time.time() - save_start
+    logger.info(f"[CKPT] {tag} -> {save_dir}/ ({size_mb:.0f} MB, {save_elapsed:.1f}s)")
+    return save_dir
+def read_env_flag(name: str, default: bool = False) -> bool:
+    raw = os.environ.get(name)
+    if raw is None:
+        return default
+    return raw.strip().lower() in {"1", "true", "yes", "on"}
+def hub_upload_strict() -> bool:
+    strict = getattr(getattr(cfg, "hub", None), "hub_upload_strict", None)
+    if strict is None:
+        return read_env_flag("QUINTUS_HUB_UPLOAD_STRICT", False)
+    return bool(strict)
+def should_upload_checkpoint_tag(tag: str) -> bool:
+    upload_regular = getattr(getattr(cfg, "hub", None), "upload_kd_checkpoints", False) or read_env_flag("QUINTUS_UPLOAD_KD_CHECKPOINTS", False)
+    upload_steps = getattr(getattr(cfg, "hub", None), "upload_step_checkpoints", False) or read_env_flag("QUINTUS_UPLOAD_STEP_CHECKPOINTS", False)
+    upload_last = getattr(getattr(cfg, "hub", None), "upload_last_checkpoint", False) or read_env_flag("QUINTUS_UPLOAD_LAST_CHECKPOINT", False)
+    if tag.startswith("step_"):
+        return upload_steps
+    if tag.startswith("epoch_"):
+        return upload_regular
+    if tag == "best":
+        return upload_regular
+    if tag == "last":
+        return upload_last or upload_regular
+    return False
+def maybe_upload_checkpoint(checkpoint_dir: str, tag: str, logger) -> None:
+    if not should_upload_checkpoint_tag(tag):
+        return
+    token = os.environ.get("HF_TOKEN") or getattr(getattr(cfg, "hub", None), "token", None)
+    if not token:
+        msg = "HF checkpoint upload requested, but HF_TOKEN/cfg.hub.token is missing"
+        strict = hub_upload_strict()
+        if strict:
+            raise RuntimeError(msg)
+        logger.warning(f"[CKPT] {msg}; continuing without remote backup")
+        return
+    repo_id = getattr(getattr(cfg, "hub", None), "repo_id", None) or os.environ.get("QUINTUS_HUB_REPO_ID") or f"{cfg.hub.username}/{cfg.hub.repo_name}"
+    base_path = getattr(getattr(cfg, "hub", None), "ckpt_path_in_repo", None) or os.environ.get("KD_CKPT_PATH_IN_REPO", "models/online_kd_3b_05b_ep3_B200_20260601")
+    base_path = base_path.strip("/")
+    path_in_repo = f"{base_path}/{tag}"
+    commit_prefix = getattr(getattr(cfg, "hub", None), "commit_message_prefix", None) or os.environ.get(
+        "KD_COMMIT_MESSAGE_PREFIX",
+        "Online KD 8B->1.7B Run",
+    )
+    commit_message = os.environ.get("KD_COMMIT_MESSAGE") or f"{commit_prefix}: upload {tag}"
+    upload_start = time.time()
+    size_mb = sum(f.stat().st_size for f in Path(checkpoint_dir).rglob("*") if f.is_file()) / 1e6
+    strict = hub_upload_strict()
+    logger.info(
+        f"[CKPT] Uploading {tag} -> {repo_id}/{path_in_repo} "
+        f"({size_mb:.0f} MB, strict={strict})"
+    )
+    logger.info(f"[CKPT] Commit: {commit_message}")
+    try:
+        from huggingface_hub import HfApi
+        api = HfApi(token=token)
+        api.create_repo(repo_id=repo_id, repo_type="dataset", private=True, exist_ok=True)
+        api.upload_folder(
+            folder_path=checkpoint_dir,
+            repo_id=repo_id,
+            path_in_repo=path_in_repo,
+            repo_type="dataset",
+            commit_message=commit_message,
+            ignore_patterns=["*.tmp", "*.log", "__pycache__/*"],
+        )
+        upload_elapsed = time.time() - upload_start
+        logger.info(f"[CKPT] Uploaded {tag} to HF Hub in {upload_elapsed / 60:.1f}m")
+    except Exception as exc:
+        msg = f"HF checkpoint upload failed for {tag}: {exc}"
+        if hub_upload_strict():
+            raise RuntimeError(msg) from exc
+        logger.warning(f"[CKPT] {msg}; continuing because hub upload strict mode is disabled")
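
To see how `find_latest_training_checkpoint` picks a resume point, the ranking rule can be exercised on its own. This is a small sketch that restates `checkpoint_rank` from the file above; the directory names under `out/` are hypothetical.

```python
import os


def checkpoint_rank(path: str) -> tuple[int, int]:
    # Restated from src/checkpoints.py: epoch_* outranks step_*, then compare the numeric suffix.
    name = os.path.basename(path)
    prefix, _, raw_value = name.partition("_")
    try:
        value = int(raw_value)
    except ValueError:
        value = -1  # non-numeric suffixes sort below any numbered checkpoint of the same kind
    if prefix == "epoch":
        return (2, value)
    if prefix == "step":
        return (1, value)
    return (0, value)


# Hypothetical checkpoint directories inside one output folder.
candidates = ["out/step_200", "out/epoch_1", "out/step_1500", "out/step_final"]
ranked = sorted(candidates, key=checkpoint_rank)
latest = max(candidates, key=checkpoint_rank)

print(ranked)
print(latest)
```

Because the tuple is compared kind first, any `epoch_*` directory outranks every `step_*` directory, even a step checkpoint written later in the run. If mid-epoch step checkpoints should win when they are newer, the rule would need a different key, for example the `global_step` stored in `trainer_state.json`.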

src/download.py ADDED

	@@ -0,0 +1,574 @@

+from __future__ import annotations
+import argparse
+import json
+import os
+import platform
+import sys
+import time
+import warnings
+from pathlib import Path
+_REPO_ROOT = Path(__file__).resolve().parents[1]
+if str(_REPO_ROOT) not in sys.path:
+    sys.path.insert(0, str(_REPO_ROOT))
+os.environ.setdefault("HF_HUB_DISABLE_PROGRESS_BARS", "0")
+os.environ.setdefault("TOKENIZERS_PARALLELISM", "false")
+warnings.filterwarnings("ignore", category=FutureWarning)
+warnings.filterwarnings("ignore", category=UserWarning)
+import torch
+from datasets import load_dataset
+from huggingface_hub import snapshot_download
+from transformers import AutoTokenizer
+from configs import cfg, emit_log_spacing, setup_logger
+from src.transformers_compat import format_model_load_error
+_IGNORE_PATTERNS = ["*.msgpack", "*.h5", "*.bin", "optimizer.pt", "optimizer.safetensors"]
+_TOKENIZER_ALLOW_PATTERNS = [
+    "tokenizer.json",
+    "tokenizer.model",
+    "tokenizer_config.json",
+    "special_tokens_map.json",
+    "added_tokens.json",
+    "vocab.json",
+    "merges.txt",
+    "generation_config.json",
+]
+_MIN_TOKEN_LENGTH = 10
+_DATA_STATS_FILENAME = "_data_stats.json"
+_ASSISTANT_MASK_KEYS = ("assistant_masks", "assistant_mask", "assistant_tokens_mask")
+def _build_chat_template_error_types() -> tuple[type[BaseException], ...]:
+    error_types: list[type[BaseException]] = [
+        AttributeError,
+        IndexError,
+        KeyError,
+        RuntimeError,
+        TypeError,
+        ValueError,
+    ]
+    try:
+        from jinja2 import TemplateError
+        error_types.append(TemplateError)
+    except ImportError:
+        pass
+    return tuple(error_types)
+_CHAT_TEMPLATE_ERRORS = _build_chat_template_error_types()
+def _config_revision(value: str | None) -> str | None:
+    if value is None:
+        return None
+    stripped = value.strip()
+    return stripped or None
+def _download_tokenizer_artifacts(tokenizer_model: str, tokenizer_revision: str | None, tokenizer_dir: str, log) -> None:
+    log.info(f"Downloading tokenizer -> ./{tokenizer_dir}/")
+    t0 = time.time()
+    try:
+        snapshot_download(
+            repo_id=tokenizer_model,
+            local_dir=tokenizer_dir,
+            revision=tokenizer_revision,
+            allow_patterns=_TOKENIZER_ALLOW_PATTERNS,
+        )
+        size_mb = sum(f.stat().st_size for f in Path(tokenizer_dir).rglob("*") if f.is_file()) / 1e6
+        log.info(f"Tokenizer downloaded: {size_mb:.1f} MB in {time.time() - t0:.0f}s")
+    except Exception as exc:
+        log.error(f"Failed to download tokenizer: {exc}")
+        sys.exit(1)
+def write_system_info(output_path: str, logger) -> None:
+    output_dir = os.path.dirname(output_path)
+    if output_dir:
+        os.makedirs(output_dir, exist_ok=True)
+    info = {
+        "timestamp": time.strftime("%Y-%m-%d %H:%M:%S %Z"),
+        "platform": platform.platform(),
+        "python": sys.version.split()[0],
+        "torch": torch.__version__,
+        "cuda": torch.version.cuda if torch.cuda.is_available() else None,
+        "gpu": torch.cuda.get_device_name(0) if torch.cuda.is_available() else None,
+        "cpu_count": os.cpu_count(),
+    }
+    with open(output_path, "w", encoding="utf-8") as f:
+        json.dump(info, f, indent=2)
+    logger.info(f"System info -> {output_path}")
+def write_data_stats(output_path: str, stats: dict, dataset_id: str, config_name: str, target_samples: int, max_seq_len: int, logger) -> None:
+    output_dir = os.path.dirname(output_path)
+    if output_dir:
+        os.makedirs(output_dir, exist_ok=True)
+    meta = {
+        "dataset": dataset_id,
+        "config": config_name,
+        "target_samples": target_samples,
+        "max_seq_len": max_seq_len,
+        "stats": stats,
+    }
+    with open(output_path, "w", encoding="utf-8") as f:
+        json.dump(meta, f, indent=2)
+    logger.info(f"Dataset stats -> {output_path}")
+def _coerce_content_str(content) -> str | None:
+    if isinstance(content, str):
+        return content.strip() or None
+    if isinstance(content, dict):
+        for key in ("answer_content", "text", "value", "content", "think_content"):
+            value = content.get(key)
+            if isinstance(value, str) and value.strip():
+                return value.strip()
+        return None
+    if isinstance(content, list):
+        parts = []
+        for item in content:
+            if isinstance(item, str) and item.strip():
+                parts.append(item.strip())
+            elif isinstance(item, dict):
+                value = item.get("text", item.get("value", ""))
+                if isinstance(value, str) and value.strip():
+                    parts.append(value.strip())
+        joined = " ".join(parts).strip()
+        return joined or None
+    return None
+def _extract_clean_sample(row: dict) -> dict | None:
+    messages = row.get("messages") or row.get("conversations") or []
+    if not messages and "instruction" in row and "output" in row:
+        messages = [
+            {"role": "user", "content": row["instruction"]},
+            {"role": "assistant", "content": row["output"]},
+        ]
+    if isinstance(messages, str):
+        try:
+            messages = json.loads(messages)
+        except (json.JSONDecodeError, TypeError):
+            return None
+    if not isinstance(messages, list) or len(messages) < 2:
+        return None
+    clean_messages: list[dict] = []
+    for message in messages:
+        if not isinstance(message, dict):
+            return None
+        role = message.get("role", message.get("from", ""))
+        content = _coerce_content_str(message.get("content", message.get("value", "")))
+        if not isinstance(role, str) or not role.strip() or content is None:
+            return None
+        role = role.strip().lower()
+        if role == "human":
+            role = "user"
+        elif role == "gpt":
+            role = "assistant"
+        clean_messages.append({"role": role, "content": content})
+    source = str(row.get("source", "unknown"))
+    return {"messages": clean_messages, "source": source}
+def _coerce_token_ids(token_ids) -> list[int]:
+    if hasattr(token_ids, "input_ids"):
+        token_ids = token_ids.input_ids
+    if isinstance(token_ids, dict):
+        token_ids = token_ids.get("input_ids", token_ids)
+    if hasattr(token_ids, "tolist"):
+        token_ids = token_ids.tolist()
+    if isinstance(token_ids, tuple):
+        token_ids = list(token_ids)
+    if not isinstance(token_ids, list):
+        raise TypeError(f"Unexpected token id payload: {type(token_ids).__name__}")
+    if token_ids and isinstance(token_ids[0], list):
+        raise TypeError("Expected a single token sequence, not a batched payload")
+    return [int(token_id) for token_id in token_ids]
+def _coerce_binary_mask(mask_values, expected_len: int) -> list[int]:
+    if hasattr(mask_values, "tolist"):
+        mask_values = mask_values.tolist()
+    if isinstance(mask_values, tuple):
+        mask_values = list(mask_values)
+    if not isinstance(mask_values, list):
+        raise TypeError(f"Unexpected assistant mask payload: {type(mask_values).__name__}")
+    if mask_values and isinstance(mask_values[0], list):
+        raise TypeError("Expected a single assistant mask, not a batched payload")
+    mask = [1 if int(value) != 0 else 0 for value in mask_values]
+    if len(mask) != expected_len:
+        raise ValueError(f"Assistant mask length mismatch: got {len(mask)}, expected {expected_len}")
+    return mask
+def _try_builtin_assistant_mask(sample: dict, tokenizer: AutoTokenizer, input_ids: list[int]) -> list[int] | None:
+    try:
+        with warnings.catch_warnings():
+            warnings.filterwarnings("ignore", message="return_assistant_tokens_mask")
+            encoded = tokenizer.apply_chat_template(
+                sample["messages"],
+                tokenize=True,
+                add_generation_prompt=False,
+                return_dict=True,
+                return_assistant_tokens_mask=True,
+            )
+    except TypeError:
+        return None
+    except _CHAT_TEMPLATE_ERRORS:
+        return None
+    if not hasattr(encoded, "get"):
+        return None
+    try:
+        encoded_ids = _coerce_token_ids(encoded)
+    except _CHAT_TEMPLATE_ERRORS:
+        return None
+    if encoded_ids != input_ids:
+        return None
+    for key in _ASSISTANT_MASK_KEYS:
+        if key not in encoded:
+            continue
+        try:
+            mask = _coerce_binary_mask(encoded[key], expected_len=len(input_ids))
+        except (TypeError, ValueError):
+            return None
+        if any(mask):
+            return mask
+    return None
+def _build_assistant_mask_from_prefixes(sample: dict, tokenizer: AutoTokenizer, input_ids: list[int]) -> list[int]:
+    loss_mask = [0] * len(input_ids)
+    prefix_ids: list[int] = []
+    for turn_index, message in enumerate(sample["messages"], start=1):
+        role = str(message.get("role", "")).strip().lower()
+        # prefix_ids covers the conversation up to and including this turn; for an
+        # assistant turn that is prompt + assistant response content + eos.
+        prefix_ids = _coerce_token_ids(
+            tokenizer.apply_chat_template(
+                sample["messages"][:turn_index],
+                tokenize=True,
+                add_generation_prompt=False,
+            )
+        )
+        if role == "assistant":
+            # prompt_ids contains everything up to user message + assistant header (<|im_start|>assistant\n)
+            prompt_ids = _coerce_token_ids(
+                tokenizer.apply_chat_template(
+                    sample["messages"][:turn_index - 1],
+                    tokenize=True,
+                    add_generation_prompt=True,
+                )
+            )
+            if len(prefix_ids) < len(prompt_ids) or prefix_ids[:len(prompt_ids)] != prompt_ids:
+                raise ValueError("Chat template is not prefix-stable enough to derive assistant-only targets")
+            if len(prefix_ids) > len(input_ids):
+                # Guard the writes below: a longer prefix would index past loss_mask.
+                raise ValueError("Prefix tokenization mismatch while deriving assistant-only targets")
+            # Loss mask is 1 for the assistant's response tokens (everything after prompt_ids)
+            loss_mask[len(prompt_ids):len(prefix_ids)] = [1] * (len(prefix_ids) - len(prompt_ids))
+    if prefix_ids != input_ids:
+        raise ValueError("Prefix tokenization mismatch while deriving assistant-only targets")
+    if len(loss_mask) != len(input_ids):
+        raise ValueError(f"Assistant mask length mismatch: got {len(loss_mask)}, expected {len(input_ids)}")
+    return loss_mask
+def _build_assistant_loss_mask(sample: dict, tokenizer: AutoTokenizer, input_ids: list[int]) -> list[int]:
+    builtin_mask = _try_builtin_assistant_mask(sample, tokenizer, input_ids)
+    if builtin_mask is not None:
+        return builtin_mask
+    return _build_assistant_mask_from_prefixes(sample, tokenizer, input_ids)
+def write_tokenized_dataset(tokenizer, num_samples: int, out_file: str, log) -> dict:
+    if num_samples <= 0:
+        raise RuntimeError("num_samples must be positive when tokenization is enabled")
+    output_dir = os.path.dirname(out_file)
+    if output_dir:
+        os.makedirs(output_dir, exist_ok=True)
+    log.info(f"Streaming {cfg.data.dataset_path}...")
+    ds = load_dataset(cfg.data.dataset_path, split="train", streaming=True, token=cfg.hub.token)
+    buffer_size = int(getattr(cfg.data, "stream_shuffle_buffer_size", 0) or 0)
+    if buffer_size > 0 and hasattr(ds, "shuffle"):
+        seed = int(getattr(cfg.data, "stream_shuffle_seed", 42))
+        log.info(f"Shuffling stream with buffer_size={buffer_size:,} seed={seed}")
+        ds = ds.shuffle(seed=seed, buffer_size=buffer_size)
+    # Check if dataset is pre-tokenized
+    try:
+        first_sample = next(iter(ds))
+        is_pre_tokenized = "input_ids" in first_sample and "loss_mask" in first_sample
+    except StopIteration:
+        # Suppress the StopIteration context so the log shows only the actionable error.
+        raise RuntimeError("Loaded dataset is empty") from None
+    stats = {
+        "scanned": 0,
+        "written": 0,
+        "too_long_tokens": 0,
+        "too_short_tokens": 0,
+        "template_errors": 0,
+        "no_target_tokens": 0,
+        "invalid_messages": 0,
+        "total_tokens_written": 0,
+        "total_target_tokens_written": 0,
+        "min_tokens_written": 0,
+        "max_tokens_written": 0,
+        "min_target_tokens_written": 0,
+        "max_target_tokens_written": 0,
+    }
+    if is_pre_tokenized:
+        log.info(f"Auto-detected pre-tokenized dataset on HF Hub. Writing directly to {out_file}...")
+        # Re-initialize to avoid losing the first element consumed by next(iter())
+        ds = load_dataset(cfg.data.dataset_path, split="train", streaming=True, token=cfg.hub.token)
+        if buffer_size > 0 and hasattr(ds, "shuffle"):
+            ds = ds.shuffle(seed=seed, buffer_size=buffer_size)
+        with open(out_file, "w", encoding="utf-8") as f:
+            for row in ds:
+                stats["scanned"] += 1
+                input_ids = row["input_ids"]
+                loss_mask = row["loss_mask"]
+                token_len = len(input_ids)
+                target_tokens = sum(loss_mask)
+                out_row = {
+                    "input_ids": input_ids,
+                    "loss_mask": loss_mask,
+                    "length": token_len,
+                    "target_tokens": target_tokens,
+                    "source": row.get("source", "unknown"),
+                }
+                f.write(json.dumps(out_row) + "\n")
+                stats["written"] += 1
+                stats["total_tokens_written"] += token_len
+                stats["total_target_tokens_written"] += target_tokens
+                if stats["written"] == 1:
+                    stats["min_tokens_written"] = token_len
+                    stats["max_tokens_written"] = token_len
+                    stats["min_target_tokens_written"] = target_tokens
+                    stats["max_target_tokens_written"] = target_tokens
+                else:
+                    stats["min_tokens_written"] = min(stats["min_tokens_written"], token_len)
+                    stats["max_tokens_written"] = max(stats["max_tokens_written"], token_len)
+                    stats["min_target_tokens_written"] = min(stats["min_target_tokens_written"], target_tokens)
+                    stats["max_target_tokens_written"] = max(stats["max_target_tokens_written"], target_tokens)
+                if stats["written"] >= num_samples:
+                    break
+        return stats
+    log.info("Standard raw text dataset detected. Running tokenization locally...")
+    with open(out_file, "w", encoding="utf-8") as f:
+        for row in ds:
+            stats["scanned"] += 1
+            sample = _extract_clean_sample(row)
+            if sample is None:
+                stats["invalid_messages"] += 1
+                continue
+            try:
+                input_ids = _coerce_token_ids(
+                    tokenizer.apply_chat_template(
+                        sample["messages"],
+                        tokenize=True,
+                        add_generation_prompt=False,
+                    )
+                )
+                token_len = len(input_ids)
+                if token_len < _MIN_TOKEN_LENGTH:
+                    stats["too_short_tokens"] += 1
+                    continue
+                if token_len > cfg.data.max_seq_len:
+                    stats["too_long_tokens"] += 1
+                    continue
+                loss_mask = _build_assistant_loss_mask(sample, tokenizer, input_ids)
+            except _CHAT_TEMPLATE_ERRORS:
+                stats["template_errors"] += 1
+                continue
+            target_tokens = sum(loss_mask)
+            if target_tokens == 0:
+                stats["no_target_tokens"] += 1
+                continue
+            out_row = {
+                "input_ids": input_ids,
+                "loss_mask": loss_mask,
+                "length": token_len,
+                "target_tokens": target_tokens,
+                "source": sample.get("source", "unknown"),
+            }
+            f.write(json.dumps(out_row) + "\n")
+            stats["written"] += 1
+            stats["total_tokens_written"] += token_len
+            stats["total_target_tokens_written"] += target_tokens
+            if stats["written"] == 1:
+                stats["min_tokens_written"] = token_len
+                stats["max_tokens_written"] = token_len
+                stats["min_target_tokens_written"] = target_tokens
+                stats["max_target_tokens_written"] = target_tokens
+            else:
+                stats["min_tokens_written"] = min(stats["min_tokens_written"], token_len)
+                stats["max_tokens_written"] = max(stats["max_tokens_written"], token_len)
+                stats["min_target_tokens_written"] = min(stats["min_target_tokens_written"], target_tokens)
+                stats["max_target_tokens_written"] = max(stats["max_target_tokens_written"], target_tokens)
+            if stats["written"] >= num_samples:
+                break
+    return stats
+def main() -> None:
+    parser = argparse.ArgumentParser(description="Download models and tokenize data")
+    parser.add_argument("--num_samples", type=int, default=cfg.data.num_samples)
+    parser.add_argument("--skip_teacher", action="store_true", help="Skip teacher download.")
+    parser.add_argument("--skip_tokenization", action="store_true", help="Skip data tokenization.")
+    parser.add_argument("--tokenizer_only", action="store_true", help="Download tokenizer artifacts only")
+    args = parser.parse_args()
+    log = setup_logger("DOWNLOAD")
+    log.info("=" * 70)
+    log.info("Download and tokenize")
+    log.info("=" * 70)
+    write_system_info(cfg.paths.system_info, log)
+    log.info(f"  Teacher:      {cfg.model.teacher}")
+    log.info(f"  Teacher rev:  {_config_revision(cfg.model.teacher_revision) or 'unversioned'}")
+    tokenizer_model = getattr(cfg.model, "tokenizer", cfg.model.student)
+    tokenizer_revision = _config_revision(getattr(cfg.model, "tokenizer_revision", cfg.model.student_revision))
+    tokenizer_dir = getattr(cfg.paths, "tokenizer_dir", cfg.paths.student_dir)
+    log.info(f"  Student:      {cfg.model.student}")
+    log.info(f"  Student rev:  {_config_revision(cfg.model.student_revision) or 'unversioned'}")
+    log.info(f"  Tokenizer:    {tokenizer_model}")
+    log.info(f"  Tokenizer rev:{tokenizer_revision or 'unversioned'}")
+    log.info(f"  Student dir:  {cfg.paths.student_dir}")
+    log.info(f"  Tokenizer dir:{tokenizer_dir}")
+    log.info(f"  Remote code:  {cfg.model.allow_remote_code}")
+    log.info(f"  Dataset:      {cfg.data.dataset_path}")
+    log.info(f"  Num samples:  {args.num_samples:,}")
+    log.info(f"  Max seq len:  {cfg.data.max_seq_len}")
+    if torch.cuda.is_available():
+        log.info(f"  GPU:          {torch.cuda.get_device_name(0)}")
+    if not args.tokenizer_only:
+        emit_log_spacing(log)
+        log.info("-" * 70)
+        log.info(f"Downloading student -> ./{cfg.paths.student_dir}/")
+        t0 = time.time()
+        try:
+            snapshot_download(
+                repo_id=cfg.model.student,
+                local_dir=cfg.paths.student_dir,
+                revision=_config_revision(cfg.model.student_revision),
+                ignore_patterns=_IGNORE_PATTERNS,
+            )
+            size_gb = sum(f.stat().st_size for f in Path(cfg.paths.student_dir).rglob("*") if f.is_file()) / 1e9
+            log.info(f"Student downloaded: {size_gb:.1f} GB in {time.time() - t0:.0f}s")
+        except Exception as exc:
+            log.error(f"Failed to download student: {exc}")
+            sys.exit(1)
+    if args.tokenizer_only or Path(tokenizer_dir).resolve() != Path(cfg.paths.student_dir).resolve():
+        emit_log_spacing(log)
+        log.info("-" * 70)
+        _download_tokenizer_artifacts(tokenizer_model, tokenizer_revision, tokenizer_dir, log)
+    if not args.skip_tokenization:
+        emit_log_spacing(log)
+        log.info("-" * 70)
+        log.info(f"Preparing dataset: {cfg.data.dataset_path}")
+        try:
+            tokenizer = AutoTokenizer.from_pretrained(
+                tokenizer_dir,
+                trust_remote_code=cfg.model.allow_remote_code,
+            )
+        except Exception as exc:
+            log.error(format_model_load_error("Student tokenizer load", exc))
+            sys.exit(1)
+        if tokenizer.pad_token is None:
+            tokenizer.pad_token = tokenizer.eos_token
+        os.makedirs(cfg.paths.tokenized_dir, exist_ok=True)
+        out_file = os.path.join(cfg.paths.tokenized_dir, "train.jsonl")
+        stats_file = os.path.join(cfg.paths.tokenized_dir, _DATA_STATS_FILENAME)
+        log.info(
+            f"Streaming + tokenizing up to {args.num_samples:,} samples "
+            f"(max_seq_len={cfg.data.max_seq_len}, strict token limit, no truncation)"
+        )
+        t0 = time.time()
+        try:
+            token_stats = write_tokenized_dataset(
+                tokenizer=tokenizer,
+                num_samples=args.num_samples,
+                out_file=out_file,
+                log=log,
+            )
+        except RuntimeError as exc:
+            log.error(str(exc))
+            sys.exit(1)
+        write_data_stats(
+            output_path=stats_file,
+            stats=token_stats,
+            dataset_id=cfg.data.dataset_path,
+            config_name="default",
+            target_samples=args.num_samples,
+            max_seq_len=cfg.data.max_seq_len,
+            logger=log,
+        )
+        n_written = token_stats["written"]
+        if n_written == 0:
+            log.error("Tokenization produced 0 usable rows - aborting.")
+            sys.exit(1)
+        log.info(f"Pretokenization complete: {n_written:,} samples -> {out_file} in {time.time() - t0:.0f}s")
+    else:
+        log.info("Skipping dataset tokenization (--skip_tokenization)")
+    if not args.skip_teacher:
+        emit_log_spacing(log)
+        log.info("-" * 70)
+        log.info(f"Downloading teacher -> ./{cfg.paths.teacher_dir}/")
+        t0 = time.time()
+        try:
+            snapshot_download(
+                repo_id=cfg.model.teacher,
+                local_dir=cfg.paths.teacher_dir,
+                revision=_config_revision(cfg.model.teacher_revision),
+                ignore_patterns=_IGNORE_PATTERNS,
+            )
+            size_gb = sum(f.stat().st_size for f in Path(cfg.paths.teacher_dir).rglob("*") if f.is_file()) / 1e9
+            log.info(f"Teacher downloaded: {size_gb:.1f} GB in {time.time() - t0:.0f}s")
+        except Exception as exc:
+            log.error(f"Failed to download teacher: {exc}")
+            sys.exit(1)
+    else:
+        log.info("Skipping teacher download (--skip_teacher)")
+    emit_log_spacing(log)
+    log.info("-" * 70)
+    log.info("Download complete")
+    log.info("-" * 70)
+if __name__ == "__main__":
+    main()
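
The prefix-diff idea behind `_build_assistant_mask_from_prefixes` above can be illustrated without a real tokenizer. The following is a minimal sketch, not the repository's implementation: `render` is a hypothetical stand-in for `tokenizer.apply_chat_template`, and its toy template (a role tag, whitespace-split words, then `<end>`) is an assumption made only so the example runs on its own.

```python
# Hypothetical toy template; NOT the transformers API or any real model's format.
def render(messages, add_generation_prompt=False):
    tokens = []
    for m in messages:
        tokens += [f"<{m['role']}>"] + m["content"].split() + ["<end>"]
    if add_generation_prompt:
        # The generation prompt is just the assistant header in this toy template.
        tokens.append("<assistant>")
    return tokens


def assistant_mask(messages):
    full = render(messages)
    mask = [0] * len(full)
    for i, m in enumerate(messages, start=1):
        if m["role"] != "assistant":
            continue
        # Everything before this turn plus the assistant header.
        prompt = render(messages[: i - 1], add_generation_prompt=True)
        # The conversation up to and including this assistant turn.
        upto = render(messages[:i])
        if upto[: len(prompt)] != prompt:
            raise ValueError("template is not prefix-stable")
        # Only the tokens the assistant produced (content + end marker) are targets.
        for j in range(len(prompt), len(upto)):
            mask[j] = 1
    return mask


messages = [
    {"role": "user", "content": "hi there"},
    {"role": "assistant", "content": "hello"},
]
tokens = render(messages)
mask = assistant_mask(messages)
```

Here `tokens` has seven entries and only the last two (`hello`, `<end>`) are marked as loss targets; the assistant header itself stays masked out because it belongs to the prompt.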

src/kd_contracts.py ADDED

	@@ -0,0 +1,95 @@

+from __future__ import annotations
+import hashlib
+import json
+from pathlib import Path
+from typing import Any
+PROVENANCE_SCHEMA_VERSION = 4
+_SHARD_SCHEMA = {
+    "support": "teacher_topk_plus_other_bucket",
+    "layout": "chunked_sample_lists",
+    "logprobs_dtype": "float16",
+    "ids_dtype": "int32",
+    "other_logprob_dtype": "float16",
+}
+_SPECIAL_TOKEN_ID_FIELDS = (
+    "bos_token_id",
+    "eos_token_id",
+    "pad_token_id",
+    "unk_token_id",
+    "cls_token_id",
+    "sep_token_id",
+    "mask_token_id",
+    "additional_special_tokens_ids",
+)
+def normalize_config_revision(value: str | None) -> str | None:
+    if value is None:
+        return None
+    stripped = value.strip()
+    return stripped or None
+def canonical_revision(value: str | None) -> str:
+    return normalize_config_revision(value) or "unversioned"
+def sha256_file(path: str | Path, chunk_size: int = 1 << 20) -> str:
+    digest = hashlib.sha256()
+    with open(path, "rb") as handle:
+        while True:
+            chunk = handle.read(chunk_size)
+            if not chunk:
+                break
+            digest.update(chunk)
+    return digest.hexdigest()
+def _special_token_ids(tokenizer) -> dict[str, Any]:
+    snapshot: dict[str, Any] = {}
+    for field in _SPECIAL_TOKEN_ID_FIELDS:
+        value = getattr(tokenizer, field, None)
+        if isinstance(value, tuple):
+            value = list(value)
+        snapshot[field] = value
+    return snapshot
+def build_tokenizer_contract(tokenizer) -> dict[str, Any]:
+    canonical = {
+        "tokenizer_class": tokenizer.__class__.__name__,
+        "full_vocab_size": len(tokenizer),
+        "special_token_ids": _special_token_ids(tokenizer),
+        "vocab": dict(sorted(tokenizer.get_vocab().items())),
+    }
+    encoded = json.dumps(
+        canonical,
+        sort_keys=True,
+        separators=(",", ":"),
+        ensure_ascii=True,
+    ).encode("utf-8")
+    return {
+        "tokenizer_class": canonical["tokenizer_class"],
+        "full_vocab_size": canonical["full_vocab_size"],
+        "special_token_ids": canonical["special_token_ids"],
+        "fingerprint": hashlib.sha256(encoded).hexdigest(),
+    }
+def build_shard_schema() -> dict[str, str]:
+    return dict(_SHARD_SCHEMA)
+def collect_model_vocab_sizes(model) -> dict[str, int]:
+    sizes: dict[str, int] = {}
+    config_size = getattr(getattr(model, "config", None), "vocab_size", None)
+    if isinstance(config_size, int):
+        sizes["config"] = config_size
+    input_embeddings = model.get_input_embeddings()
+    if input_embeddings is not None and getattr(input_embeddings, "weight", None) is not None:
+        sizes["input_embeddings"] = int(input_embeddings.weight.shape[0])
+    output_embeddings = model.get_output_embeddings()
+    if output_embeddings is not None and getattr(output_embeddings, "weight", None) is not None:
+        sizes["output_embeddings"] = int(output_embeddings.weight.shape[0])
+    return sizes
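
The fingerprint in `build_tokenizer_contract` above relies on canonical JSON serialization, so two tokenizers with the same content hash identically regardless of dictionary ordering. The following is a minimal sketch of that idea using only the standard library; the `fingerprint` helper and the tiny payloads are hypothetical and chosen for illustration.

```python
import hashlib
import json


def fingerprint(payload: dict) -> str:
    # sort_keys + fixed separators make the byte stream independent of insertion order.
    encoded = json.dumps(
        payload,
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=True,
    ).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


# Same content, different key order.
a = {"vocab": {"hello": 0, "world": 1}, "full_vocab_size": 2}
b = {"full_vocab_size": 2, "vocab": {"world": 1, "hello": 0}}
# Same tokens, but swapped ids: a genuinely different vocabulary.
c = {"full_vocab_size": 2, "vocab": {"world": 0, "hello": 1}}

same = fingerprint(a) == fingerprint(b)
different = fingerprint(a) != fingerprint(c)
```

Hashing the full vocabulary while returning only the digest keeps the contract small enough to store next to the shards, yet any change in a token id still changes the fingerprint.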

src/losses.py ADDED

	@@ -0,0 +1,180 @@

+from __future__ import annotations
+import torch
+import torch.nn.functional as F
+PROB_EPS = 1.0e-12
+def _normalize_support_logprobs(
+    topk_logprobs: torch.Tensor,
+    other_logprob: torch.Tensor,
+) -> tuple[torch.Tensor, torch.Tensor]:
+    topk_probs = topk_logprobs.float().exp()
+    other_prob = other_logprob.float().exp().unsqueeze(-1)
+    support_probs = torch.cat([topk_probs, other_prob], dim=-1).clamp_min(PROB_EPS)
+    support_probs = support_probs / support_probs.sum(dim=-1, keepdim=True).clamp_min(PROB_EPS)
+    return support_probs, support_probs.log()
+def _masked_mean(values: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
+    mask = mask.float()
+    return (values * mask).sum() / mask.sum().clamp(min=1.0)
+def compute_sft_ce(logits: torch.Tensor, labels: torch.Tensor, loss_mask: torch.Tensor) -> torch.Tensor:
+    # Masked next-token cross-entropy: logits at position t are scored against labels at t + 1.
+    batch_size = logits.size(0)
+    shift_labels = labels[:, 1:].contiguous()
+    shift_loss_mask = ((loss_mask[:, 1:] > 0) & shift_labels.ne(-100)).contiguous().float()
+    # Accumulate in float32: bfloat16 keeps 8 significant bits and cannot represent every token count above 256.
+    total_loss = torch.tensor(0.0, device=logits.device, dtype=torch.float32)
+    total_weight = torch.tensor(0.0, device=logits.device, dtype=torch.float32)
+    for i in range(batch_size):
+        b_logits = logits[i, :-1, :]
+        b_labels = shift_labels[i]
+        b_mask = shift_loss_mask[i]
+        ce = F.cross_entropy(
+            b_logits,
+            b_labels,
+            ignore_index=-100,
+            reduction="none",
+        )
+        total_loss += (ce * b_mask).sum()
+        total_weight += b_mask.sum()
+    return total_loss / total_weight.clamp(min=1.0)
+def _compute_masked_ce_with_logits(logits, labels, loss_mask):
+    loss_ce = compute_sft_ce(logits, labels, loss_mask)
+    shift_logits = logits[:, :-1, :]
+    shift_labels = labels[:, 1:].contiguous()
+    shift_loss_mask = ((loss_mask[:, 1:] > 0) & shift_labels.ne(-100)).contiguous().float()
+    return loss_ce, shift_logits, shift_loss_mask
+def compute_distillation_loss(
+    student_logits: torch.Tensor,
+    labels: torch.Tensor,
+    teacher_logprobs: torch.Tensor,
+    teacher_ids: torch.Tensor,
+    teacher_other_logprob: torch.Tensor,
+    loss_mask: torch.Tensor,
+    alpha: float,
+    temperature: float,
+) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
+    vocab_size = student_logits.size(-1)
+    loss_ce, shift_logits, shift_loss_mask = _compute_masked_ce_with_logits(student_logits, labels, loss_mask)
+    shift_teacher_logprobs = teacher_logprobs[:, :-1, :].contiguous()
+    shift_teacher_ids = teacher_ids[:, :-1, :].contiguous()
+    shift_teacher_other_logprob = teacher_other_logprob[:, :-1].contiguous()
+    shift_student = shift_logits
+    # Shards store ids as int32 (see _SHARD_SCHEMA); torch.gather needs an int64 index.
+    topk_ids_clamped = shift_teacher_ids.long().clamp(0, vocab_size - 1)
+    student_log_z = torch.logsumexp(shift_student / temperature, dim=-1, keepdim=True).float()
+    student_topk_logprobs = shift_student.gather(-1, topk_ids_clamped).float() / temperature - student_log_z
+    student_topk_probs = student_topk_logprobs.float().exp()
+    student_other_prob = (1.0 - student_topk_probs.sum(dim=-1)).clamp_min(PROB_EPS)
+    student_other_logprob = student_other_prob.log()
+    teacher_support_probs, teacher_support_logprobs = _normalize_support_logprobs(
+        shift_teacher_logprobs,
+        shift_teacher_other_logprob,
+    )
+    _, student_support_logprobs = _normalize_support_logprobs(
+        student_topk_logprobs,
+        student_other_logprob,
+    )
+    positive_teacher = teacher_support_probs > 0
+    kl_terms = torch.where(
+        positive_teacher,
+        teacher_support_probs * (teacher_support_logprobs - student_support_logprobs),
+        torch.zeros_like(teacher_support_probs),
+    )
+    kl_per_token = kl_terms.sum(-1)
+    loss_kd = _masked_mean(kl_per_token, shift_loss_mask) * (temperature**2)
+    loss_total = alpha * loss_ce + (1.0 - alpha) * loss_kd
+    return loss_total, loss_ce.detach(), loss_kd.detach()
+def compute_online_kd_loss(
+    student_logits: torch.Tensor,
+    teacher_logits: torch.Tensor,
+    labels: torch.Tensor,
+    loss_mask: torch.Tensor,
+    alpha: float,
+    temperature: float,
+    token_chunk_size: int = 2048,
+) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
+    loss_ce = compute_sft_ce(student_logits, labels, loss_mask)
+    shift_labels = labels[:, 1:].contiguous()
+    shift_loss_mask = (
+        (loss_mask[:, 1:] > 0) & shift_labels.ne(-100)
+    ).contiguous().float()
+    s_shifted = student_logits[:, :-1, :]
+    t_shifted = teacher_logits[:, :-1, :]
+    seq_len = s_shifted.size(1)
+    total_kl = torch.tensor(0.0, device=student_logits.device, dtype=torch.float32)
+    total_weight = torch.tensor(0.0, device=student_logits.device, dtype=torch.float32)
+    for tok_start in range(0, seq_len, token_chunk_size):
+        tok_end = min(tok_start + token_chunk_size, seq_len)
+        s_chunk = s_shifted[:, tok_start:tok_end, :].float()
+        t_chunk = t_shifted[:, tok_start:tok_end, :].float()
+        mask_chunk = shift_loss_mask[:, tok_start:tok_end]
+        chunk_weight = mask_chunk.sum()
+        t_probs = F.softmax(t_chunk / temperature, dim=-1)
+        s_log_probs = F.log_softmax(s_chunk / temperature, dim=-1)
+        kl_tokens = F.kl_div(
+            s_log_probs, t_probs, log_target=False, reduction="none"
+        ).sum(dim=-1)
+        total_kl += (kl_tokens * mask_chunk).sum()
+        total_weight += chunk_weight
+        del s_chunk, t_chunk, t_probs, s_log_probs, kl_tokens, mask_chunk
+    loss_kd = (total_kl / total_weight.clamp(min=1.0)) * (temperature ** 2)
+    loss_kd = loss_kd.to(dtype=student_logits.dtype)
+    loss_total = alpha * loss_ce + (1.0 - alpha) * loss_kd
+    return loss_total, loss_ce.detach(), loss_kd.detach()
+def compute_loss_for_phase(
+    phase: str,
+    logits: torch.Tensor,
+    labels: torch.Tensor,
+    loss_mask: torch.Tensor,
+    batch: dict,
+    alpha: float,
+    temperature: float,
+    teacher_logits: torch.Tensor | None = None,
+    online_kd_token_chunk_size: int = 2048,
+) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
+    if phase == "sft":
+        loss_ce = compute_sft_ce(logits, labels, loss_mask)
+        return loss_ce, loss_ce.detach(), torch.tensor(0.0, device=logits.device)
+    if phase == "online_kd":
+        if teacher_logits is None:
+            raise ValueError("phase 'online_kd' requires teacher_logits")
+        return compute_online_kd_loss(
+            logits,
+            teacher_logits,
+            labels,
+            loss_mask,
+            alpha,
+            temperature,
+            token_chunk_size=online_kd_token_chunk_size,
+        )
+    return compute_distillation_loss(
+        logits,
+        labels,
+        batch["teacher_logprobs"],
+        batch["teacher_ids"],
+        batch["teacher_other_logprob"],
+        loss_mask,
+        alpha,
+        temperature,
+    )
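
The sparse-support KL in `compute_distillation_loss` above compares teacher and student over `k + 1` entries: the teacher's top-k tokens plus a single "other" bucket holding the remaining probability mass. The following is a minimal pure-Python sketch of that computation for one token position. It assumes temperature 1 and omits masking and batching; the helper names and toy numbers are hypothetical.

```python
import math

PROB_EPS = 1e-12


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def normalize(probs):
    # Clamp, then renormalize so the k + 1 entries sum to 1.
    clamped = [max(p, PROB_EPS) for p in probs]
    total = sum(clamped)
    return [p / total for p in clamped]


def support_kl(teacher_topk_logprobs, teacher_other_logprob, student_logits, topk_ids):
    student_probs = softmax(student_logits)
    # Student mass on the teacher's top-k ids; the rest goes into the "other" bucket.
    student_topk = [student_probs[i] for i in topk_ids]
    student_other = max(1.0 - sum(student_topk), PROB_EPS)
    teacher = normalize(
        [math.exp(lp) for lp in teacher_topk_logprobs] + [math.exp(teacher_other_logprob)]
    )
    student = normalize(student_topk + [student_other])
    return sum(t * (math.log(t) - math.log(s)) for t, s in zip(teacher, student))


student_logits = [2.0, 1.0, 0.0, -1.0]
topk_ids = [0, 1]
p = softmax(student_logits)

# Teacher identical to the student: the divergence should vanish.
matched = support_kl([math.log(p[0]), math.log(p[1])], math.log(p[2] + p[3]), student_logits, topk_ids)
# Teacher spreads mass evenly over a 4-token vocabulary: the divergence is positive.
uniform = support_kl([math.log(0.25)] * 2, math.log(0.5), student_logits, topk_ids)
```

Folding the tail into one bucket keeps storage at `k + 1` values per position while still penalizing a student that puts too much or too little mass outside the teacher's top-k.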

src/optim.py ADDED

	@@ -0,0 +1,44 @@

+from __future__ import annotations
+import torch
+from configs import cfg
+def fused_adamw_preflight(logger) -> bool:
+    if not torch.cuda.is_available():
+        logger.info("  Optimizer:  fused AdamW requested but CUDA is unavailable; using standard AdamW")
+        return False
+    try:
+        probe = torch.nn.Parameter(torch.ones(8, device="cuda", dtype=torch.bfloat16))
+        probe_optim = torch.optim.AdamW([probe], lr=1.0e-4, fused=True)
+        loss = probe.float().square().sum()
+        loss.backward()
+        probe_optim.step()
+        probe_optim.zero_grad(set_to_none=True)
+        del loss, probe_optim, probe
+        return True
+    except Exception as exc:
+        logger.warning(f"  Optimizer:  fused AdamW preflight failed ({exc}); using standard AdamW")
+        return False
+def build_adamw_optimizer(params: list[torch.nn.Parameter], logger, allow_fused: bool) -> torch.optim.Optimizer:
+    kwargs = {
+        "lr": cfg.training.learning_rate,
+        "weight_decay": cfg.training.weight_decay,
+        "betas": (0.9, 0.999),
+    }
+    fused_requested = bool(getattr(cfg.training, "fused_adamw", False)) and allow_fused
+    if fused_requested and fused_adamw_preflight(logger):
+        try:
+            optimizer = torch.optim.AdamW(params, **kwargs, fused=True)
+            logger.info("  Optimizer:  AdamW fused=True")
+            return optimizer
+        except Exception as exc:
+            logger.warning(f"  Optimizer:  fused AdamW construction failed ({exc}); using standard AdamW")
+    elif bool(getattr(cfg.training, "fused_adamw", False)) and not allow_fused:
+        logger.info("  Optimizer:  fused AdamW disabled for DeepSpeed")
+    optimizer = torch.optim.AdamW(params, **kwargs)
+    logger.info("  Optimizer:  AdamW standard")
+    return optimizer

src/provenance.py ADDED

	@@ -0,0 +1,173 @@

+from __future__ import annotations
+import json
+import os
+from pathlib import Path
+from configs import cfg
+from src.kd_contracts import (
+    PROVENANCE_SCHEMA_VERSION,
+    build_shard_schema,
+    canonical_revision,
+    collect_model_vocab_sizes,
+    sha256_file,
+)
+def resolve_model_vocab_size(model, tokenizer, label: str, log) -> int:
+    model_sizes = collect_model_vocab_sizes(model)
+    if not model_sizes:
+        log.error(f"{label} model does not expose a usable vocab size view.")
+        raise SystemExit(1)
+    unique_sizes = sorted(set(model_sizes.values()))
+    if len(unique_sizes) != 1:
+        details = ", ".join(f"{name}={size:,}" for name, size in sorted(model_sizes.items()))
+        log.error(f"{label} vocab mismatch across checkpoint artifacts: {details}")
+        raise SystemExit(1)
+    model_vocab_size = unique_sizes[0]
+    tokenizer_vocab_size = len(tokenizer)
+    if model_vocab_size < tokenizer_vocab_size:
+        log.error(
+            f"{label} tokenizer length ({tokenizer_vocab_size:,}) exceeds "
+            f"the model vocab size ({model_vocab_size:,})."
+        )
+        log.error("Repair or regenerate the checkpoint before using it for distillation.")
+        raise SystemExit(1)
+    if model_vocab_size > tokenizer_vocab_size:
+        log.info(
+            f"  {label} model vocab is padded beyond the tokenizer range: "
+            f"tokenizer={tokenizer_vocab_size:,}, model={model_vocab_size:,}"
+        )
+    return model_vocab_size
+def validate_provenance(
+    prov_path: str,
+    data_path: str,
+    dataset,
+    teacher_tokenizer_contract: dict,
+    student_tokenizer_contract: dict,
+    log,
+) -> None:
+    if not os.path.exists(prov_path):
+        log.error("Missing _provenance.json in the logits directory.")
+        log.error("Regenerate the current teacher-logit shard metadata.")
+        raise SystemExit(1)
+    with open(prov_path, "r", encoding="utf-8") as f:
+        prov = json.load(f)
+    schema_version = prov.get("schema_version")
+    if schema_version != PROVENANCE_SCHEMA_VERSION:
+        log.error(
+            f"Unsupported shard provenance schema: {schema_version!r}. "
+            f"Expected {PROVENANCE_SCHEMA_VERSION}."
+        )
+        log.error("Regenerate the teacher-logit shards.")
+        raise SystemExit(1)
+    teacher_meta = prov.get("teacher", {})
+    student_meta = prov.get("student", {})
+    current_data_sha = sha256_file(data_path)
+    actual_shard_count = sum(1 for _ in Path(prov_path).parent.glob("shard_*.pt"))
+    provenance_num_samples = prov.get("num_samples")
+    try:
+        provenance_num_samples_int = int(provenance_num_samples)
+    except (TypeError, ValueError):
+        log.error(f"PROVENANCE MISMATCH: num_samples is {provenance_num_samples!r}, expected an integer.")
+        log.error("Regenerate compatible teacher-logit shards.")
+        raise SystemExit(1)
+    if provenance_num_samples_int < len(dataset):
+        log.error(
+            f"PROVENANCE MISMATCH: num_samples is {provenance_num_samples_int}, "
+            f"but the requested dataset has {len(dataset)} samples."
+        )
+        log.error("Regenerate compatible teacher-logit shards.")
+        raise SystemExit(1)
+    if provenance_num_samples_int > len(dataset):
+        log.warning(
+            f"  Provenance contains {provenance_num_samples_int:,} samples; "
+            f"training is using the first {len(dataset):,}. This is expected for smoke tests."
+        )
+    expected_pairs = [
+        ("shard_count", prov.get("shard_count"), actual_shard_count),
+        ("samples_per_shard", prov.get("samples_per_shard"), dataset.samples_per_shard),
+        ("data_sha256", prov.get("data_sha256"), current_data_sha),
+        ("max_seq_len", prov.get("max_seq_len"), cfg.data.max_seq_len),
+        ("top_k", prov.get("top_k"), cfg.training.top_k),
+        ("temperature", prov.get("temperature"), float(cfg.training.temperature)),
+        ("teacher.model", teacher_meta.get("model"), cfg.model.teacher),
+        (
+            "teacher.revision",
+            teacher_meta.get("revision"),
+            canonical_revision(cfg.model.teacher_revision),
+        ),
+        (
+            "teacher.tokenizer_size",
+            teacher_meta.get("tokenizer_size"),
+            teacher_tokenizer_contract["full_vocab_size"],
+        ),
+        (
+            "teacher.tokenizer_fingerprint",
+            teacher_meta.get("tokenizer_fingerprint"),
+            teacher_tokenizer_contract["fingerprint"],
+        ),
+        ("student.model", student_meta.get("model"), getattr(cfg.model, "tokenizer", cfg.model.student)),
+        (
+            "student.revision",
+            student_meta.get("revision"),
+            canonical_revision(getattr(cfg.model, "tokenizer_revision", cfg.model.student_revision)),
+        ),
+        (
+            "student.tokenizer_size",
+            student_meta.get("tokenizer_size"),
+            student_tokenizer_contract["full_vocab_size"],
+        ),
+        (
+            "student.tokenizer_fingerprint",
+            student_meta.get("tokenizer_fingerprint"),
+            student_tokenizer_contract["fingerprint"],
+        ),
+    ]
+    warn_only_fields = {
+        "teacher.tokenizer_fingerprint",
+        "student.tokenizer_fingerprint",
+    }
+    for field_name, found, expected in expected_pairs:
+        if found != expected:
+            if field_name in warn_only_fields:
+                log.warning(
+                    f"  Provenance WARNING (non-fatal): {field_name} is {found!r}, "
+                    f"expected {expected!r}. This is likely due to a transformers "
+                    f"library version change. Continuing because vocab sizes match."
+                )
+            else:
+                log.error(
+                    f"PROVENANCE MISMATCH: {field_name} is {found!r}, expected {expected!r}."
+                )
+                log.error("Regenerate compatible teacher-logit shards.")
+                raise SystemExit(1)
+    provenance_data_path = prov.get("data_path")
+    current_data_path = str(Path(data_path).resolve())
+    if provenance_data_path != current_data_path:
+        log.warning(
+            "  Provenance data_path differs because logits were likely generated on another machine: "
+            f"{provenance_data_path!r} vs {current_data_path!r}. "
+            "Continuing because data_sha256 matches."
+        )
+    shard_schema = prov.get("shard_schema")
+    expected_shard_schema = build_shard_schema()
+    if shard_schema != expected_shard_schema:
+        log.error(
+            f"PROVENANCE MISMATCH: shard_schema is {shard_schema!r}, "
+            f"expected {expected_shard_schema!r}."
+        )
+        log.error("Regenerate compatible teacher-logit shards.")
+        raise SystemExit(1)
+    log.info("  Provenance:  PASS (teacher shards match the current config and dataset)")
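
Reviewer note: the field comparison at the end of `validate_provenance` can be illustrated in isolation. The block below is a minimal sketch, not the code in this diff. `compare_fields` is a hypothetical helper, the sample values are made up, and it collects every mismatch, whereas the diff exits on the first fatal one.

```python
def compare_fields(pairs, warn_only):
    """Split mismatches into fatal errors and non-fatal warnings.

    pairs: iterable of (field_name, found, expected) tuples.
    warn_only: field names whose mismatch should not stop the run.
    """
    errors, warnings = [], []
    for name, found, expected in pairs:
        if found == expected:
            continue
        message = f"{name} is {found!r}, expected {expected!r}"
        # Fingerprint-style fields are downgraded to warnings; the rest are fatal.
        (warnings if name in warn_only else errors).append(message)
    return errors, warnings


# Illustrative values only; real ones come from _provenance.json and the config.
pairs = [
    ("top_k", 64, 64),
    ("temperature", 2.0, 1.0),
    ("teacher.tokenizer_fingerprint", "abc", "def"),
]
errors, warnings = compare_fields(pairs, {"teacher.tokenizer_fingerprint"})
print(errors)
print(warnings)
```

Collecting all mismatches before failing can make a single run of the check more informative. Exiting on the first fatal mismatch, as the diff does, keeps the log shorter.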

src/sequence_packing.py ADDED

	@@ -0,0 +1,183 @@

+from __future__ import annotations
+from bisect import bisect_left, insort
+import torch
+import torch.nn.functional as F
+from torch.utils.data import Dataset
+from src.training_data import DistillationDataset
+class SequencePackedDataset(Dataset):
+    def __init__(
+        self,
+        source: DistillationDataset,
+        source_indices: list[int],
+        pack_length: int,
+        eos_token_id: int,
+        pad_token_id: int,
+        mask_first_after_separator: bool = True,
+    ):
+        if pack_length <= 0:
+            raise ValueError(f"pack_length must be positive, got {pack_length}.")
+        if not hasattr(source, "sample_lengths"):
+            raise ValueError("Packed training requires a source dataset with sample_lengths metadata.")
+        if not source_indices:
+            raise ValueError("Packed training requires at least one source row.")
+        self.source = source
+        self.source_indices = [int(index) for index in source_indices]
+        self.source_index_set = set(self.source_indices)
+        if len(self.source_index_set) != len(self.source_indices):
+            raise ValueError("Packed training source indices contain duplicates.")
+        self.pack_length = int(pack_length)
+        self.eos_token_id = int(eos_token_id)
+        self.pad_token_id = int(pad_token_id)
+        self.mask_first_after_separator = bool(mask_first_after_separator)
+        self._length_by_index: dict[int, int] = {}
+        self.plan: list[list[int]] = []
+        for source_index in self.source_indices:
+            try:
+                length = int(source.sample_lengths[source_index])
+            except IndexError as exc:
+                raise IndexError(f"Source index {source_index} is outside the tokenized dataset.") from exc
+            if length > self.pack_length:
+                raise ValueError(
+                    f"Tokenized sample #{source_index} has length {length}, "
+                    f"which exceeds pack_length={self.pack_length}."
+                )
+            self._length_by_index[source_index] = length
+        self._build_plan()
+        self._validate_plan()
+        self.source_sample_count = len(self.source_indices)
+        self.bin_count = len(self.plan)
+        self.original_token_count = sum(self._length_by_index.values())
+        self.separator_token_count = sum(max(0, len(bin_indices) - 1) for bin_indices in self.plan)
+        self.packed_token_count = self.original_token_count + self.separator_token_count
+        self.total_capacity = self.bin_count * self.pack_length
+        self.pad_token_count = self.total_capacity - self.packed_token_count
+        self.average_samples_per_bin = self.source_sample_count / max(self.bin_count, 1)
+        self.utilization = self.packed_token_count / max(self.total_capacity, 1)
+    def _build_plan(self) -> None:
+        items = sorted(
+            ((self._length_by_index[source_index], source_index) for source_index in self.source_indices),
+            key=lambda item: (-item[0], item[1]),
+        )
+        available: list[tuple[int, int]] = []
+        for length, source_index in items:
+            required_existing = length + 1
+            insert_at = bisect_left(available, (required_existing, -1))
+            if insert_at == len(available):
+                bin_id = len(self.plan)
+                self.plan.append([source_index])
+                remaining = self.pack_length - length
+                insort(available, (remaining, bin_id))
+                continue
+            remaining, bin_id = available.pop(insert_at)
+            next_remaining = remaining - required_existing
+            if next_remaining < 0:
+                raise ValueError("Internal packing error: bin capacity became negative.")
+            self.plan[bin_id].append(source_index)
+            insort(available, (next_remaining, bin_id))
+    def _validate_plan(self) -> None:
+        seen: set[int] = set()
+        for bin_id, bin_indices in enumerate(self.plan):
+            if not bin_indices:
+                raise ValueError(f"Packed bin #{bin_id} is empty.")
+            real_length = sum(self._length_by_index[source_index] for source_index in bin_indices)
+            real_length += max(0, len(bin_indices) - 1)
+            if real_length > self.pack_length:
+                raise ValueError(
+                    f"Packed bin #{bin_id} has real_length={real_length}, "
+                    f"which exceeds pack_length={self.pack_length}."
+                )
+            for source_index in bin_indices:
+                if source_index in seen:
+                    raise ValueError(f"Source sample #{source_index} appears in more than one packed bin.")
+                seen.add(source_index)
+        missing = self.source_index_set - seen
+        if missing:
+            first_missing = min(missing)
+            raise ValueError(f"Source sample #{first_missing} was not assigned to a packed bin.")
+    def __len__(self) -> int:
+        return len(self.plan)
+    def __getitem__(self, bin_idx: int) -> dict[str, torch.Tensor]:
+        bin_indices = self.plan[bin_idx]
+        input_parts: list[torch.Tensor] = []
+        mask_parts: list[torch.Tensor] = []
+        original_tokens = 0
+        separator_tokens = 0
+        for sample_offset, source_index in enumerate(bin_indices):
+            item = self.source[source_index]
+            input_ids = item["input_ids"].long()
+            loss_mask = item["loss_mask"].long()
+            original_tokens += int(input_ids.size(0))
+            if sample_offset > 0:
+                input_parts.append(torch.tensor([self.eos_token_id], dtype=torch.long))
+                mask_parts.append(torch.zeros(1, dtype=torch.long))
+                separator_tokens += 1
+                if self.mask_first_after_separator and loss_mask.numel() > 0:
+                    loss_mask = loss_mask.clone()
+                    loss_mask[0] = 0
+            input_parts.append(input_ids)
+            mask_parts.append(loss_mask)
+        input_ids = torch.cat(input_parts)
+        loss_mask = torch.cat(mask_parts)
+        real_length = int(input_ids.size(0))
+        if real_length > self.pack_length:
+            raise ValueError(
+                f"Packed bin #{bin_idx} has real_length={real_length}, "
+                f"which exceeds pack_length={self.pack_length}."
+            )
+        pad_len = self.pack_length - real_length
+        if pad_len:
+            input_ids = F.pad(input_ids, (0, pad_len), value=self.pad_token_id)
+            loss_mask = F.pad(loss_mask, (0, pad_len), value=0)
+        return {
+            "input_ids": input_ids,
+            "loss_mask": loss_mask,
+            "real_length": torch.tensor(real_length, dtype=torch.long),
+            "source_samples": torch.tensor(len(bin_indices), dtype=torch.long),
+            "original_tokens": torch.tensor(original_tokens, dtype=torch.long),
+            "separator_tokens": torch.tensor(separator_tokens, dtype=torch.long),
+        }
+def collate_packed_fn(batch: list[dict], pad_token_id: int) -> dict:
+    del pad_token_id
+    input_ids = torch.stack([item["input_ids"] for item in batch])
+    loss_mask = torch.stack([item["loss_mask"] for item in batch]).long()
+    real_lengths = torch.stack([item["real_length"] for item in batch]).long()
+    seq_len = input_ids.size(1)
+    positions = torch.arange(seq_len, dtype=torch.long).unsqueeze(0)
+    attention_mask = (positions < real_lengths.unsqueeze(1)).long()
+    labels = input_ids.clone()
+    labels = labels.masked_fill(loss_mask == 0, -100)
+    return {
+        "input_ids": input_ids,
+        "attention_mask": attention_mask,
+        "loss_mask": loss_mask,
+        "labels": labels,
+        "real_length": real_lengths,
+        "source_samples": torch.stack([item["source_samples"] for item in batch]).long(),
+        "original_tokens": torch.stack([item["original_tokens"] for item in batch]).long(),
+        "separator_tokens": torch.stack([item["separator_tokens"] for item in batch]).long(),
+    }
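
Reviewer note: `_build_plan` appears to be a best-fit-decreasing bin packer that reserves one separator token whenever a sample joins an existing bin. The block below is a standalone sketch of that idea on plain length lists. `plan_bins` is a hypothetical name, and the sketch leaves out the dataset plumbing and the validation pass that the class performs.

```python
from bisect import bisect_left, insort


def plan_bins(lengths: list[int], pack_length: int) -> list[list[int]]:
    """Assign sample indices to bins, longest first, tightest fitting bin first."""
    order = sorted(range(len(lengths)), key=lambda i: (-lengths[i], i))
    plan: list[list[int]] = []
    # Sorted list of (remaining capacity, bin id) for every open bin.
    available: list[tuple[int, int]] = []
    for i in order:
        length = lengths[i]
        if length > pack_length:
            raise ValueError(f"sample {i} has length {length} > pack_length={pack_length}")
        needed = length + 1  # joining an existing bin costs one separator token
        pos = bisect_left(available, (needed, -1))
        if pos == len(available):
            # No open bin has enough room: start a new one (no separator needed).
            plan.append([i])
            insort(available, (pack_length - length, len(plan) - 1))
        else:
            # First entry with remaining >= needed is the tightest bin that fits.
            remaining, bin_id = available.pop(pos)
            plan[bin_id].append(i)
            insort(available, (remaining - needed, bin_id))
    return plan


print(plan_bins([5, 3, 3, 2, 1], pack_length=8))  # [[0, 3], [1, 2], [4]]
```

Because `available` stays sorted by remaining capacity, `bisect_left` locates the tightest fitting bin with a binary search, although `insort` and `pop` on a Python list are still linear in the number of bins.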

src/train.py ADDED

	@@ -0,0 +1,1219 @@

+from __future__ import annotations
+import argparse
+import csv
+import json
+import math
+import os
+import sys
+import time
+from functools import partial
+from pathlib import Path
+_REPO_ROOT = Path(__file__).resolve().parents[1]
+if str(_REPO_ROOT) not in sys.path:
+    sys.path.insert(0, str(_REPO_ROOT))
+import torch
+from torch.utils.data import DataLoader, Dataset, Subset
+from transformers import AutoModelForCausalLM, AutoTokenizer, get_cosine_schedule_with_warmup
+from configs import cfg, emit_log_spacing, setup_logger
+from src.checkpoints import (
+    find_latest_training_checkpoint,
+    load_trainer_state,
+    maybe_upload_checkpoint,
+    packing_checkpoint_metadata,
+    read_env_flag,
+    save_checkpoint,
+    validate_resume_packing_state,
+)
+from src.kd_contracts import build_tokenizer_contract
+from src.losses import compute_loss_for_phase
+from src.optim import build_adamw_optimizer
+from src.provenance import resolve_model_vocab_size, validate_provenance
+from src.sequence_packing import SequencePackedDataset, collate_packed_fn
+from src.training_data import (
+    DistillationDataset,
+    collate_fn,
+    extract_shard_id_range,
+    move_batch_to_device,
+    resolve_dataloader_runtime,
+    torch_load_cpu,
+)
+from src.training_schedule import (
+    build_train_validation_subsets,
+    compute_training_schedule,
+    load_deepspeed_runtime_config,
+)
+from src.transformers_compat import format_model_load_error, resolve_attention_backend
+from src.validation import evaluate_validation_loss
+def _log_gpu(logger) -> None:
+    if torch.cuda.is_available():
+        device = torch.cuda.current_device()
+        alloc = torch.cuda.max_memory_allocated(device) / (1024**3)
+        reserved = torch.cuda.max_memory_reserved(device) / (1024**3)
+        total = torch.cuda.get_device_properties(device).total_memory / (1024**3)
+        pct = alloc / total * 100
+        logger.info(f"[GPU] {alloc:.1f}/{total:.0f} GiB ({pct:.0f}%) peak alloc, {reserved:.1f} GiB peak reserved")
+def main() -> None:
+    parser = argparse.ArgumentParser(description="Quintus training (SFT / KD)")
+    packing_cfg = getattr(cfg.training, "sequence_packing", None)
+    sequence_packing_default = bool(getattr(packing_cfg, "enabled", False))
+    pack_length_default = int(getattr(packing_cfg, "pack_length", cfg.data.max_seq_len))
+    mask_first_after_separator = bool(getattr(packing_cfg, "mask_first_token_after_separator", True))
+    parser.add_argument("--num_samples", type=int, default=cfg.data.num_samples)
+    parser.add_argument("--phase", type=str, choices=["sft", "kd", "online_kd"], default="online_kd", help="Training phase")
+    parser.add_argument("--resume_from_checkpoint", action="store_true", help="Resume from the latest epoch_* or step_* checkpoint in the current output directory")
+    parser.add_argument("--init_from_checkpoint", type=str, default=None, help="Initialize weights from a specific path before training")
+    parser.add_argument(
+        "--compile_model",
+        action="store_true",
+        default=bool(getattr(cfg.training, "compile_model", False)),
+        help="Enable torch.compile after checkpoint loading. Off by default for KD memory safety.",
+    )
+    parser.add_argument("--local_rank", type=int, default=-1, help=argparse.SUPPRESS)
+    parser.add_argument("--deepspeed", type=str, default=None, help="Enable DeepSpeed with the given config path.")
+    parser.add_argument("--no_deepspeed", action="store_true", help="Run without DeepSpeed.")
+    parser.add_argument(
+        "--allow_partial_final_window",
+        action="store_true",
+        help="Allow DeepSpeed to drop a final incomplete accumulation window during smoke tests.",
+    )
+    parser.add_argument("--teacher_model", type=str, default=cfg.model.teacher)
+    parser.add_argument("--teacher_revision", type=str, default=cfg.model.teacher_revision)
+    parser.add_argument("--student_model", type=str, default=cfg.model.student)
+    parser.add_argument("--student_revision", type=str, default=cfg.model.student_revision)
+    parser.add_argument("--tokenizer_model", type=str, default=getattr(cfg.model, "tokenizer", cfg.model.student))
+    parser.add_argument("--tokenizer_revision", type=str, default=getattr(cfg.model, "tokenizer_revision", cfg.model.student_revision))
+    parser.add_argument("--student_dir", type=str, default=cfg.paths.student_dir)
+    parser.add_argument("--tokenizer_dir", type=str, default=getattr(cfg.paths, "tokenizer_dir", cfg.paths.student_dir))
+    parser.add_argument("--distilled_dir", type=str, default=cfg.paths.distilled_dir)
+    parser.add_argument("--num_epochs", type=int, default=cfg.training.num_epochs)
+    parser.add_argument("--max_steps", type=int, default=-1, help="Stop after this many optimizer steps. -1 = no limit.")
+    parser.add_argument("--learning_rate", type=float, default=float(cfg.training.learning_rate))
+    parser.add_argument("--alpha", type=float, default=cfg.training.alpha)
+    parser.add_argument("--temperature", type=float, default=cfg.training.temperature)
+    parser.add_argument(
+        "--online_kd_token_chunk_size",
+        type=int,
+        default=int(getattr(cfg.training, "online_kd_token_chunk_size", 2048)),
+        help="Token chunk size for full-vocabulary online KD loss.",
+    )
+    parser.add_argument("--micro_batch_size", type=int, default=cfg.training.micro_batch_size)
+    parser.add_argument("--grad_accum_steps", type=int, default=cfg.training.grad_accum_steps)
+    parser.add_argument("--sequence_packing", action="store_true", default=False, help="Enable sequence packing for online_kd.")
+    parser.add_argument("--no_sequence_packing", action="store_true", default=False, help="Disable sequence packing.")
+    parser.add_argument("--pack_length", type=int, default=None, help="Packed sequence length.")
+    parser.add_argument("--disable_checkpointing", action="store_true", default=False, help="Disable intermediate epoch/step/best checkpoint saves.")
+    parser.add_argument("--gradient_checkpointing", action="store_true", default=bool(cfg.training.gradient_checkpointing), help="Enable gradient checkpointing (activation checkpointing).")
+    parser.add_argument("--upload_kd_checkpoints", action="store_true", default=False)
+    parser.add_argument("--upload_step_checkpoints", action="store_true", default=False)
+    parser.add_argument(
+        "--upload_last_checkpoint",
+        action="store_true",
+        default=False,
+        help="Upload the final 'last' checkpoint to the Hub. Off by default.",
+    )
+    parser.add_argument(
+        "--hub_upload_strict",
+        action="store_true",
+        default=read_env_flag("QUINTUS_HUB_UPLOAD_STRICT", False),
+        help="Fail training if a requested Hub checkpoint upload fails.",
+    )
+    parser.add_argument("--hub_repo_id", type=str, default=f"{cfg.hub.username}/{cfg.hub.repo_name}")
+    parser.add_argument("--ckpt_path_in_repo", type=str, default="models/online_kd_8b_17b_ep1_B200_20260608_alpha0.3")
+    parser.add_argument("--commit_message_prefix", type=str, default="Online KD 8B->1.7B B200 Run (alpha=0.3)")
+    args = parser.parse_args()
+    if args.sequence_packing and args.no_sequence_packing:
+        parser.error("Use either --sequence_packing or --no_sequence_packing, not both.")
+    sequence_packing_enabled = sequence_packing_default
+    if args.sequence_packing:
+        sequence_packing_enabled = True
+    elif args.no_sequence_packing:
+        sequence_packing_enabled = False
+    pack_length = int(args.pack_length if args.pack_length is not None else pack_length_default)
+    if pack_length <= 0:
+        parser.error(f"--pack_length must be positive, got {pack_length}.")
+    if pack_length > int(cfg.data.max_seq_len):
+        parser.error(f"--pack_length must be <= data.max_seq_len ({int(cfg.data.max_seq_len)}), got {pack_length}.")
+    if sequence_packing_enabled and args.phase != "online_kd":
+        parser.error("--sequence_packing is supported only with --phase online_kd.")
+    if args.online_kd_token_chunk_size <= 0:
+        parser.error(
+            f"--online_kd_token_chunk_size must be positive, got {args.online_kd_token_chunk_size}."
+        )
+    cfg.model.teacher = args.teacher_model
+    cfg.model.teacher_revision = args.teacher_revision
+    cfg.model.student = args.student_model
+    cfg.model.student_revision = args.student_revision
+    cfg.model.tokenizer = args.tokenizer_model
+    cfg.model.tokenizer_revision = args.tokenizer_revision
+    cfg.paths.student_dir = args.student_dir
+    cfg.paths.tokenizer_dir = args.tokenizer_dir
+    cfg.paths.distilled_dir = args.distilled_dir
+    cfg.training.num_epochs = args.num_epochs
+    cfg.training.learning_rate = args.learning_rate
+    cfg.training.alpha = args.alpha
+    cfg.training.temperature = args.temperature
+    cfg.training.online_kd_token_chunk_size = int(args.online_kd_token_chunk_size)
+    cfg.training.micro_batch_size = args.micro_batch_size
+    cfg.training.grad_accum_steps = args.grad_accum_steps
+    cfg.training.gradient_checkpointing = args.gradient_checkpointing
+    cfg.training.disable_checkpointing = args.disable_checkpointing
+    cfg.training.sequence_packing.enabled = sequence_packing_enabled
+    cfg.training.sequence_packing.pack_length = pack_length
+    cfg.training.sequence_packing.mask_first_token_after_separator = mask_first_after_separator
+    cfg.data.num_samples = args.num_samples
+    from omegaconf import OmegaConf
+    if not hasattr(cfg, "hub"):
+        cfg.hub = OmegaConf.create()
+    cfg.hub.upload_kd_checkpoints = args.upload_kd_checkpoints
+    cfg.hub.upload_step_checkpoints = args.upload_step_checkpoints
+    cfg.hub.upload_last_checkpoint = args.upload_last_checkpoint
+    cfg.hub.hub_upload_strict = args.hub_upload_strict
+    cfg.hub.repo_id = args.hub_repo_id
+    cfg.hub.ckpt_path_in_repo = args.ckpt_path_in_repo
+    cfg.hub.commit_message_prefix = args.commit_message_prefix
+    rank = int(os.environ.get("LOCAL_RANK", args.local_rank))
+    log = setup_logger("TRAIN", rank=rank)
+    log.info("=" * 70)
+    log.info("Quintus Training")
+    log.info("=" * 70)
+    tokenizer_dir = getattr(cfg.paths, "tokenizer_dir", cfg.paths.student_dir)
+    tokenizer_model = getattr(cfg.model, "tokenizer", cfg.model.student)
+    log.info(f"  Student:     {cfg.paths.student_dir}")
+    log.info(f"  Student id:  {cfg.model.student}")
+    log.info(f"  Tokenizer:   {tokenizer_dir}")
+    log.info(f"  Tokenizer id:{tokenizer_model}")
+    log.info(f"  Num samples: {args.num_samples:,}")
+    log.info(f"  Epochs:      {cfg.training.num_epochs}")
+    log.info(f"  LR:          {cfg.training.learning_rate}")
+    log.info(f"  Phase:       {args.phase}")
+    if args.phase in ("kd", "online_kd"):
+        log.info(f"  CE weight:   {cfg.training.alpha}")
+        log.info(f"  Temperature: {cfg.training.temperature}")
+    if args.phase == "online_kd":
+        log.info(f"  KD chunk:    {cfg.training.online_kd_token_chunk_size} tokens")
+    log.info(f"  Micro batch: {cfg.training.micro_batch_size}")
+    log.info(f"  Grad accum:  {cfg.training.grad_accum_steps}")
+    log.info(f"  Eff. batch:  {cfg.training.micro_batch_size * cfg.training.grad_accum_steps}")
+    log.info(f"  Val ratio:   {cfg.training.validation_ratio:.2%}")
+    log.info(f"  Remote code: {cfg.model.allow_remote_code}")
+    log.info(f"  Output dir:  {cfg.paths.distilled_dir}")
+    log.info(f"  Log file:    {cfg.paths.log_file}")
+    log.info(f"  Fused AdamW: {bool(getattr(cfg.training, 'fused_adamw', False))}")
+    log.info(
+        f"  HF upload:   regular={cfg.hub.upload_kd_checkpoints} "
+        f"steps={cfg.hub.upload_step_checkpoints} "
+        f"last={cfg.hub.upload_last_checkpoint} "
+        f"strict={cfg.hub.hub_upload_strict}"
+    )
+    log.info(
+        f"  HF target:   {cfg.hub.repo_id}/"
+        f"{cfg.hub.ckpt_path_in_repo}"
+    )
+    if torch.cuda.is_available():
+        log.info(f"  GPU:         {torch.cuda.get_device_name(0)}")
+    try:
+        t_dir = tokenizer_dir
+        if not os.path.exists(t_dir):
+            log.warning(f"Tokenizer directory '{t_dir}' not found. Falling back to downloading '{tokenizer_model}' from HF Hub.")
+            t_dir = tokenizer_model
+        tokenizer = AutoTokenizer.from_pretrained(
+            t_dir,
+            trust_remote_code=cfg.model.allow_remote_code,
+        )
+    except Exception as exc:
+        log.error(format_model_load_error("Student tokenizer load", exc))
+        sys.exit(1)
+    if tokenizer.pad_token is None:
+        tokenizer.pad_token = tokenizer.eos_token
+    if sequence_packing_enabled:
+        if tokenizer.eos_token_id is None:
+            log.error("Sequence packing requires tokenizer.eos_token_id.")
+            sys.exit(1)
+        if tokenizer.pad_token_id is None:
+            log.error("Sequence packing requires tokenizer.pad_token_id.")
+            sys.exit(1)
+    student_tokenizer_contract = build_tokenizer_contract(tokenizer)
+    student_tokenizer_vocab_size = student_tokenizer_contract["full_vocab_size"]
+    if args.phase == "kd":
+        _prov_path_for_teacher = os.path.join(cfg.paths.logits_dir, "_provenance.json")
+        if os.path.exists(_prov_path_for_teacher):
+            with open(_prov_path_for_teacher, "r", encoding="utf-8") as _pf:
+                _prov_data = json.load(_pf)
+            _teacher_prov = _prov_data.get("teacher", {})
+            teacher_tokenizer_contract = {
+                "full_vocab_size": _teacher_prov.get("tokenizer_size"),
+                "fingerprint": _teacher_prov.get("tokenizer_fingerprint"),
+            }
+            log.info(
+                f"  Teacher contract read from provenance: "
+                f"vocab={teacher_tokenizer_contract['full_vocab_size']}, "
+                f"fingerprint={str(teacher_tokenizer_contract['fingerprint'])[:12]}..."
+            )
+        else:
+            try:
+                teacher_tokenizer = AutoTokenizer.from_pretrained(
+                    cfg.paths.teacher_dir if os.path.exists(cfg.paths.teacher_dir) else cfg.model.teacher,
+                    trust_remote_code=cfg.model.allow_remote_code,
+                )
+            except Exception as exc:
+                log.error(format_model_load_error("Teacher tokenizer load", exc))
+                sys.exit(1)
+            teacher_tokenizer_contract = build_tokenizer_contract(teacher_tokenizer)
+            del teacher_tokenizer
+    else:
+        teacher_tokenizer_contract = None
+    attn_impl = resolve_attention_backend(log)
+    log.info(f"  Attention:   {attn_impl}")
+    try:
+        from liger_kernel.transformers import apply_liger_kernel_to_qwen3
+        apply_liger_kernel_to_qwen3(
+            rope=True,
+            swiglu=True,
+            rms_norm=True,
+            cross_entropy=False,
+            fused_linear_cross_entropy=False,
+        )
+        log.info("  Liger:       enabled")
+    except ImportError:
+        if cfg.training.micro_batch_size >= 6:
+            log.error("  Liger:       missing; install liger-kernel or lower micro_batch_size.")
+            raise RuntimeError("liger_kernel is required for micro_batch_size >= 6.")
+        else:
+            log.warning("  Liger:       not installed")
+    try:
+        s_dir = cfg.paths.student_dir
+        if not os.path.exists(s_dir):
+            log.warning(f"Student model directory '{s_dir}' not found. Falling back to downloading '{cfg.model.student}' from HF Hub.")
+            s_dir = cfg.model.student
+        model = AutoModelForCausalLM.from_pretrained(
+            s_dir,
+            dtype=torch.bfloat16,
+            low_cpu_mem_usage=True,
+            trust_remote_code=cfg.model.allow_remote_code,
+            attn_implementation=attn_impl,
+        )
+    except Exception as exc:
+        log.error(format_model_load_error("Student model load", exc))
+        sys.exit(1)
+    model.config.use_cache = False
+    if getattr(cfg.training, "gradient_checkpointing", False):
+        model.gradient_checkpointing_enable(gradient_checkpointing_kwargs={"use_reentrant": False})
+        log.info("  Grad ckpt:   enabled")
+    else:
+        log.info("  Grad ckpt:   disabled")
+    start_epoch = 0
+    resume_state: dict = {}
+    if args.resume_from_checkpoint and args.init_from_checkpoint:
+        log.error("Use either --init_from_checkpoint or --resume_from_checkpoint, not both.")
+        sys.exit(1)
+    checkpoint_to_load = args.init_from_checkpoint
+    if args.resume_from_checkpoint:
+        latest_ckpt = find_latest_training_checkpoint(cfg.paths.distilled_dir)
+        if latest_ckpt is None:
+            log.error(
+                f"--resume_from_checkpoint was set, but no epoch_* or step_* checkpoints were found in "
+                f"{cfg.paths.distilled_dir}. Use --init_from_checkpoint for the first KD run."
+            )
+            sys.exit(1)
+        checkpoint_to_load = latest_ckpt
+        resume_state = load_trainer_state(latest_ckpt, log)
+        checkpoint_type = resume_state.get("checkpoint_type", os.path.basename(latest_ckpt).split("_")[0])
+        start_epoch = int(resume_state.get("start_epoch", 0) or 0)
+        if checkpoint_type == "epoch":
+            log.info(f"Interrupted run detected. Resuming after completed epoch {start_epoch}")
+        else:
+            log.info(
+                f"Interrupted run detected. Resuming from {os.path.basename(latest_ckpt)} "
+                f"at epoch_index={start_epoch}, next_batch_in_epoch="
+                f"{int(resume_state.get('next_batch_in_epoch', 0) or 0)}"
+            )
+        validate_resume_packing_state(
+            resume_state,
+            enabled=sequence_packing_enabled,
+            pack_length=pack_length,
+            max_seq_len=int(cfg.data.max_seq_len),
+            log=log,
+        )
+    if checkpoint_to_load:
+        log.info(f"Loading weights from: {checkpoint_to_load}")
+        try:
+            from safetensors.torch import load_file
+            ckpt_file = os.path.join(checkpoint_to_load, "model.safetensors")
+            if not os.path.exists(ckpt_file):
+                ckpt_file = os.path.join(checkpoint_to_load, "pytorch_model.bin")
+            if ckpt_file.endswith(".safetensors"):
+                state_dict = load_file(ckpt_file)
+            else:
+                state_dict = torch.load(ckpt_file, map_location="cpu", weights_only=True)
+            new_state_dict = {}
+            for k, v in state_dict.items():
+                if k.startswith("_orig_mod."):
+                    new_state_dict[k[len("_orig_mod."):]] = v
+                else:
+                    new_state_dict[k] = v
+            model.load_state_dict(new_state_dict)
+            log.info("Weights loaded.")
+        except Exception as e:
+            log.error(f"Failed to load weights: {e}")
+            sys.exit(1)
+    model.train()
+    if args.compile_model:
+        log.info("  Compile:     enabled")
+        model = torch.compile(model, dynamic=True)
+    else:
+        log.info("  Compile:     disabled")
+    torch.set_float32_matmul_precision("high")
+    student_model_vocab_size = resolve_model_vocab_size(model, tokenizer, "Student", log)
+    log.info(
+        f"  Student V:   tokenizer={student_tokenizer_vocab_size:,}  "
+        f"model={student_model_vocab_size:,}"
+    )
+    _log_gpu(log)
+    if args.phase == "kd":
+        shard0 = os.path.join(cfg.paths.logits_dir, "shard_000000.pt")
+        if os.path.exists(shard0):
+            test_shard = torch_load_cpu(shard0)
+            try:
+                min_id, max_id = extract_shard_id_range(test_shard, shard0)
+            except (KeyError, ValueError) as exc:
+                log.error(str(exc))
+                sys.exit(1)
+            if min_id < 0:
+                log.error(f"  Negative token IDs in shard (min={min_id}); likely int16 overflow.")
+                log.error("  Regenerate the teacher logit shards.")
+                sys.exit(1)
+            if max_id >= student_tokenizer_vocab_size:
+                log.error(
+                    f"VOCAB MISMATCH: shard max_id={max_id} >= "
+                    f"student tokenizer vocab={student_tokenizer_vocab_size}"
+                )
+                sys.exit(1)
+            log.info(
+                f"  Vocab check: PASS (ids in [{min_id}, {max_id}], "
+                f"reachable tokenizer V={student_tokenizer_vocab_size})"
+            )
+        else:
+            log.warning(f"  Shard {shard0} not found; skipping vocab check")
+    data_path = os.path.join(cfg.paths.tokenized_dir, "train.jsonl")
+    dataset = DistillationDataset(data_path, cfg.paths.logits_dir, cfg.data.max_seq_len, args.num_samples, args.phase)
+    log.info(f"  Dataset:     {len(dataset):,} samples")
+    if args.phase == "kd":
+        prov_path = os.path.join(cfg.paths.logits_dir, "_provenance.json")
+        validate_provenance(
+            prov_path=prov_path,
+            data_path=data_path,
+            dataset=dataset,
+            teacher_tokenizer_contract=teacher_tokenizer_contract,
+            student_tokenizer_contract=student_tokenizer_contract,
+            log=log,
+        )
+    pad_id = tokenizer.pad_token_id
+    if args.no_deepspeed:
+        args.deepspeed = None
+    use_ds = args.deepspeed is not None
+    world_size = int(os.environ.get("WORLD_SIZE", 1))
+    if world_size != 1:
+        log.error("This training path is single-GPU only. Re-run with NUM_GPUS=1.")
+        sys.exit(1)
+    is_main = rank in (-1, 0)
+    ds_runtime_config = None
+    if use_ds:
+        try:
+            ds_runtime_config = load_deepspeed_runtime_config(
+                args.deepspeed,
+                micro_batch_size=cfg.training.micro_batch_size,
+                grad_accum=cfg.training.grad_accum_steps,
+            )
+        except (OSError, ValueError, json.JSONDecodeError) as exc:
+            log.error(str(exc))
+            sys.exit(1)
+    train_dataset, val_dataset, split_meta = build_train_validation_subsets(
+        dataset=dataset,
+        validation_ratio=float(cfg.training.validation_ratio),
+        split_seed=int(cfg.training.split_seed),
+        micro_batch_size=cfg.training.micro_batch_size,
+        grad_accum=cfg.training.grad_accum_steps,
+        num_epochs=cfg.training.num_epochs,
+        use_ds=use_ds,
+    )
+    log.info(
+        f"  Train split: {len(train_dataset):,} samples | "
+        f"Val split: {int(split_meta['validation_size']):,} samples"
+    )
+    if bool(split_meta["accumulation_aligned"]):
+        log.info(
+            "  Accum align: train split is divisible by effective batch "
+            f"{int(split_meta['effective_batch_size']):,}"
+        )
+    else:
+        if use_ds:
+            log.warning(
+                "  Accum align: train split leaves "
+                f"{int(split_meta['train_remainder_batches'])} partial accumulation batches per epoch; "
+                "DeepSpeed will carry partial accumulation across epoch boundaries"
+            )
+        else:
+            log.warning(
+                "  Accum align: train split leaves "
+                f"{int(split_meta['train_remainder_batches'])} partial accumulation batches per epoch; "
+                "the fallback flush path will rescale gradients correctly"
+            )
+    if bool(split_meta["adjusted"]):
+        log.info(
+            f"  Val align:   requested {int(split_meta['requested_validation_size']):,} "
+            f"({float(split_meta['requested_validation_ratio']) * 100:.2f}%), "
+            f"using {int(split_meta['validation_size']):,} "
+            f"({float(split_meta['actual_validation_ratio']) * 100:.2f}%) "
+            "to preserve the training schedule"
+        )
+    elif val_dataset is not None:
+        log.info(
+            f"  Val split:   using {float(split_meta['actual_validation_ratio']) * 100:.2f}% "
+            f"held out with split_seed={cfg.training.split_seed}"
+        )
+    else:
+        log.warning("  Validation disabled; tracking training loss.")
+    effective_train_dataset: Dataset = train_dataset
+    train_collate = partial(collate_fn, pad_token_id=pad_id)
+    val_collate = partial(collate_fn, pad_token_id=pad_id)
+    if sequence_packing_enabled:
+        if isinstance(train_dataset, Subset):
+            source_dataset = train_dataset.dataset
+            train_source_indices = [int(index) for index in train_dataset.indices]
+        else:
+            source_dataset = train_dataset
+            train_source_indices = list(range(len(train_dataset)))
+        if not isinstance(source_dataset, DistillationDataset):
+            log.error("Sequence packing requires DistillationDataset as the split source.")
+            sys.exit(1)
+        val_source_indices: set[int] = set()
+        if isinstance(val_dataset, Subset) and val_dataset.dataset is source_dataset:
+            val_source_indices = {int(index) for index in val_dataset.indices}
+        try:
+            packed_train_dataset = SequencePackedDataset(
+                source=source_dataset,
+                source_indices=train_source_indices,
+                pack_length=pack_length,
+                eos_token_id=int(tokenizer.eos_token_id),
+                pad_token_id=int(tokenizer.pad_token_id),
+                mask_first_after_separator=mask_first_after_separator,
+            )
+        except (IndexError, ValueError) as exc:
+            log.error(str(exc))
+            sys.exit(1)
+        overlap = packed_train_dataset.source_index_set.intersection(val_source_indices)
+        if overlap:
+            first_overlap = min(overlap)
+            log.error(f"Sequence packing split error: validation sample #{first_overlap} appears in training bins.")
+            sys.exit(1)
+        effective_train_dataset = packed_train_dataset
+        train_collate = partial(collate_packed_fn, pad_token_id=pad_id)
+        log.info("  Packing:     enabled")
+        log.info(f"  Pack length: {packed_train_dataset.pack_length:,}")
+        log.info(f"  Train bins:  {packed_train_dataset.bin_count:,}")
+        log.info(f"  Train rows:  {packed_train_dataset.source_sample_count:,}")
+        log.info(f"  Avg samples: {packed_train_dataset.average_samples_per_bin:.2f} per bin")
+        log.info(f"  Original tokens: {packed_train_dataset.original_token_count:,}")
+        log.info(f"  Separator tokens: {packed_train_dataset.separator_token_count:,}")
+        log.info(f"  Pad tokens:  {packed_train_dataset.pad_token_count:,}")
+        log.info(f"  Utilization: {packed_train_dataset.utilization * 100:.1f}%")
+    else:
+        log.info("  Packing:     disabled")
+    dataloader_runtime = resolve_dataloader_runtime()
+    log.info(
+        "  DataLoader:  "
+        f"workers={int(dataloader_runtime['num_workers'])} "
+        f"pin_memory={bool(dataloader_runtime['pin_memory'])} "
+        f"persistent={bool(dataloader_runtime.get('persistent_workers', False))}"
+    )
+    dataloader = DataLoader(
+        effective_train_dataset,
+        batch_size=cfg.training.micro_batch_size,
+        shuffle=(args.phase != "kd"),
+        collate_fn=train_collate,
+        drop_last=True,
+        **dataloader_runtime,
+    )
+    if args.phase == "kd":
+        log.info("  KD sampler:  sequential shard-local order (split membership remains randomized)")
+    val_dataloader = None
+    if val_dataset is not None:
+        val_dataloader = DataLoader(
+            val_dataset,
+            batch_size=cfg.training.micro_batch_size,
+            shuffle=False,
+            collate_fn=val_collate,
+            drop_last=False,
+            **dataloader_runtime,
+        )
+    grad_accum = cfg.training.grad_accum_steps
+    schedule = compute_training_schedule(
+        dataset_size=len(effective_train_dataset),
+        micro_batch_size=cfg.training.micro_batch_size,
+        grad_accum=grad_accum,
+        num_epochs=cfg.training.num_epochs,
+        use_ds=use_ds,
+        drop_last=True,
+    )
+    batches_per_epoch = int(schedule["batches_per_epoch"])
+    remainder_batches = int(schedule["remainder_batches"])
+    has_remainder = bool(schedule["has_remainder"])
+    total_micro_batches = int(schedule["total_micro_batches"])
+    steps_per_epoch = int(schedule["steps_per_epoch"])
+    total_steps = int(schedule["total_steps"])
+    final_remainder = int(schedule["final_remainder"])
+    if batches_per_epoch == 0:
+        schedule_unit = "packed bins" if sequence_packing_enabled else "samples"
+        log.error(
+            f"Dataset too small for micro_batch_size={cfg.training.micro_batch_size}. "
+            f"Train split has {len(effective_train_dataset)} {schedule_unit} and drop_last=True would produce 0 batches."
+        )
+        sys.exit(1)
+    dropped_samples_per_epoch = int(schedule["dropped_samples_per_epoch"])
+    if dropped_samples_per_epoch:
+        schedule_unit = "packed bins" if sequence_packing_enabled else "samples"
+        log.warning(
+            f"  drop_last=True will discard {dropped_samples_per_epoch} {schedule_unit} per epoch "
+            "before gradient accumulation begins"
+        )
+    if use_ds and final_remainder:
+        dropped_total = int(schedule["dropped_samples_total"])
+        schedule_unit = "packed bins" if sequence_packing_enabled else "samples"
+        message = (
+            f"DeepSpeed would drop the final {final_remainder} micro-batches "
+            f"({dropped_total} {schedule_unit} total) because {batches_per_epoch} batches per epoch "
+            f"across {cfg.training.num_epochs} epochs yields {total_micro_batches} micro-batches, "
+            f"which is not divisible by grad_accum={grad_accum}."
+        )
+        if not args.allow_partial_final_window:
+            log.error(message)
+            log.error(
+                "Adjust num_samples, micro_batch_size, grad_accum_steps, or num_epochs "
+                "so total micro-batches is divisible by grad_accum, or rerun with "
+                "--allow_partial_final_window for a smoke test."
+            )
+            sys.exit(1)
+        log.warning(message)
+        log.warning("Proceeding because --allow_partial_final_window was set.")
+    warmup_steps = int(total_steps * cfg.training.warmup_ratio)
+    if has_remainder:
+        if use_ds:
+            log.info(
+                f"  NOTE: {batches_per_epoch} batches are not divisible by grad_accum={grad_accum}; "
+                f"DeepSpeed carries {remainder_batches} leftover micro-batches across epoch boundaries"
+            )
+            if final_remainder and args.allow_partial_final_window:
+                log.info(
+                    f"  NOTE: only the final {final_remainder} micro-batches of "
+                    "the last epoch are dropped because they never reach a full accumulation window"
+                )
+        else:
+            log.info(
+                f"  NOTE: {batches_per_epoch} batches are not divisible by grad_accum={grad_accum}; "
+                f"the training loop will flush {remainder_batches} leftover micro-batches each epoch"
+            )
+    if not use_ds:
+        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+        model.to(device)
+    optimizer = build_adamw_optimizer(list(model.parameters()), log, allow_fused=not use_ds)
+    scheduler = get_cosine_schedule_with_warmup(optimizer, warmup_steps, total_steps)
+    resume_global_step = int(resume_state.get("global_step", 0) or 0) if args.resume_from_checkpoint else 0
+    saved_run_epochs = int(resume_state.get("num_epochs", cfg.training.num_epochs) or cfg.training.num_epochs)
+    extending_completed_run = (
+        args.resume_from_checkpoint
+        and saved_run_epochs < cfg.training.num_epochs
+        and start_epoch >= saved_run_epochs
+    )
+    scheduler_state_path = os.path.join(checkpoint_to_load, "scheduler.pt") if checkpoint_to_load else None
+    if (
+        extending_completed_run
+        and read_env_flag("QUINTUS_FRESH_SCHEDULER_ON_EXTEND", True)
+    ):
+        remaining_steps = max(1, total_steps - resume_global_step)
+        extension_warmup_steps = int(remaining_steps * cfg.training.warmup_ratio)
+        scheduler = get_cosine_schedule_with_warmup(optimizer, extension_warmup_steps, remaining_steps)
+        log.info(
+            "  Scheduler:   fresh extension schedule "
+            f"({remaining_steps:,} remaining steps, {extension_warmup_steps:,} warmup); "
+            f"checkpoint was saved for {saved_run_epochs} epochs"
+        )
+    elif args.resume_from_checkpoint and scheduler_state_path and os.path.exists(scheduler_state_path):
+        try:
+            scheduler.load_state_dict(torch.load(scheduler_state_path, map_location="cpu"))
+            for param_group, lr in zip(optimizer.param_groups, scheduler.get_last_lr()):
+                param_group["lr"] = lr
+            log.info(f"  Scheduler:   restored from {scheduler_state_path}")
+        except Exception as exc:
+            log.warning(f"  Scheduler restore failed ({exc}); continuing with a fresh schedule")
+    log.info(f"  Batches/ep:  {batches_per_epoch:,}")
+    step_label = "Steps/ep"
+    step_note = ""
+    if has_remainder:
+        if use_ds:
+            step_label = "Steps/ep*"
+            step_note = "  (floor; cross-epoch carry shifts exact epoch boundaries)"
+        else:
+            step_note = "  (includes remainder flush)"
+    log.info(f"  {step_label}:    {steps_per_epoch:,}{step_note}")
+    log.info(f"  Steps total: {total_steps:,}  ({warmup_steps:,} warmup)")
+    log.info(
+        "  Best ckpt:   held-out validation loss"
+        if val_dataloader is not None
+        else "  Best ckpt:   training loss (validation disabled)"
+    )
+    if use_ds:
+        import deepspeed
+        model, optimizer, _, scheduler = deepspeed.initialize(
+            model=model,
+            optimizer=optimizer,
+            lr_scheduler=scheduler,
+            config=ds_runtime_config,
+        )
+        device = model.device
+        log.info("[DS] DeepSpeed ZeRO-2 initialized")
+        log.info(f"[DS] DeepSpeed will accumulate over {grad_accum} micro-batches internally")
+    else:
+        log.info(f"  Device:      {device}")
+    _log_gpu(log)
+    teacher_model = None
+    if args.phase == "online_kd":
+        teacher_source = cfg.paths.teacher_dir if os.path.exists(cfg.paths.teacher_dir) else cfg.model.teacher
+        if teacher_source != cfg.model.teacher:
+            log.info(f"Loading frozen teacher model from local directory '{teacher_source}' on device {device}...")
+        else:
+            log.info(f"Loading frozen teacher model '{teacher_source}' on device {device}...")
+        try:
+            teacher_model = AutoModelForCausalLM.from_pretrained(
+                teacher_source,
+                dtype=torch.bfloat16,
+                low_cpu_mem_usage=True,
+                trust_remote_code=cfg.model.allow_remote_code,
+                attn_implementation=attn_impl,
+            ).to(device)
+            for p in teacher_model.parameters():
+                p.requires_grad = False
+            teacher_model.eval()
+            log.info(f"Teacher model '{teacher_source}' loaded and frozen.")
+        except Exception as exc:
+            log.error(f"Failed to load teacher model: {exc}")
+            sys.exit(1)
+    checkpoint_packing_metadata = packing_checkpoint_metadata(
+        enabled=sequence_packing_enabled,
+        pack_length=pack_length,
+        max_seq_len=int(cfg.data.max_seq_len),
+    )
+    os.makedirs(cfg.paths.distilled_dir, exist_ok=True)
+    loss_log: list[dict] = []
+    global_step = resume_global_step
+    micro_step_global = int(resume_state.get("micro_step_global", 0) or 0) if args.resume_from_checkpoint else 0
+    best_metric_name = "validation loss" if val_dataloader is not None else "training loss"
+    best_selection_loss = float("inf")
+    if args.resume_from_checkpoint and "best_selection_loss" in resume_state:
+        try:
+            best_selection_loss = float(resume_state["best_selection_loss"])
+            log.info(f"  Best resume: restored prior best {best_metric_name}={best_selection_loss:.4f}")
+        except (TypeError, ValueError):
+            log.warning("  Best resume: prior best_selection_loss was unreadable; recomputing from this run")
+    best_checkpoint_tag = resume_state.get("best_checkpoint_tag")
+    best_ckpt_path = os.path.join(cfg.paths.distilled_dir, "best")
+    if not os.path.isdir(best_ckpt_path):
+        best_ckpt_path = None
+        if best_checkpoint_tag:
+            candidate_best_path = os.path.join(cfg.paths.distilled_dir, str(best_checkpoint_tag))
+            if os.path.isdir(candidate_best_path):
+                best_ckpt_path = candidate_best_path
+                log.info(f"  Best resume: using {best_checkpoint_tag} as the current best checkpoint")
+    t_start = time.time()
+    alpha = cfg.training.alpha
+    temperature = cfg.training.temperature
+    log_every = max(1, min(50, total_steps // 20))
+    checkpoint_every_steps = max(0, int(os.environ.get("TRAIN_CHECKPOINT_EVERY_STEPS", "2000")))
+    if getattr(cfg.training, "disable_checkpointing", False):
+        checkpoint_every_steps = 0
+    running_loss = 0.0
+    running_ce = 0.0
+    running_kd = 0.0
+    running_count = 0
+    emit_log_spacing(log)
+    log.info("-" * 70)
+    log.info("Training Start")
+    if checkpoint_every_steps:
+        log.info(f"  Mid-epoch checkpoint interval: every {checkpoint_every_steps:,} optimizer steps")
+    else:
+        log.info("  Mid-epoch checkpoints disabled")
+    log.info("-" * 70)
+    window_tokens = 0
+    window_t_start = time.time()
+    _gpu_loss_accum = torch.zeros(1, device=device)
+    _gpu_ce_accum = torch.zeros(1, device=device)
+    _gpu_kd_accum = torch.zeros(1, device=device)
+    _gpu_tokens_accum = torch.zeros(1, dtype=torch.long, device=device)
+    training_complete = False
+    for epoch in range(start_epoch, cfg.training.num_epochs):
+        if training_complete:
+            break
+        t_epoch = time.time()
+        epoch_loss = 0.0
+        epoch_ce = 0.0
+        epoch_kd = 0.0
+        epoch_steps = 0
+        epoch_tokens = 0
+        micro_in_epoch = 0
+        resume_batch_offset = 0
+        if args.resume_from_checkpoint and epoch == start_epoch:
+            resume_batch_offset = int(resume_state.get("next_batch_in_epoch", 0) or 0)
+            if resume_batch_offset:
+                log.info(f"  Resume: skipping {resume_batch_offset:,} already-processed batches in epoch {epoch + 1}")
+        for batch_idx, batch in enumerate(dataloader):
+            # On resume, skipped batches are still fetched and collated by the DataLoader;
+            # only the forward and backward passes are skipped.
+            if resume_batch_offset and batch_idx < resume_batch_offset:
+                continue
+            batch = move_batch_to_device(batch, device)
+            input_ids = batch["input_ids"]
+            attention_mask = batch["attention_mask"]
+            labels = batch["labels"]
+            loss_mask = batch["loss_mask"]
+            logits = model(input_ids=input_ids, attention_mask=attention_mask).logits
+            if args.phase == "online_kd" and teacher_model is not None:
+                with torch.no_grad():
+                    teacher_logits = teacher_model(input_ids=input_ids, attention_mask=attention_mask).logits
+            else:
+                teacher_logits = None
+            loss, ce, kd = compute_loss_for_phase(
+                args.phase,
+                logits,
+                labels,
+                loss_mask,
+                batch,
+                alpha,
+                temperature,
+                teacher_logits=teacher_logits,
+                online_kd_token_chunk_size=int(cfg.training.online_kd_token_chunk_size),
+            )
+            if not torch.isfinite(loss):
+                log.error(
+                    f"Non-finite loss in phase={args.phase}: "
+                    f"loss={loss.item()} ce={ce.item()} kd={kd.item()}"
+                )
+                if args.phase == "kd":
+                    log.error("Action: regenerate teacher logits.")
+                else:
+                    log.error("Action: check dataset / reduce LR.")
+                sys.exit(1)
+            micro_in_epoch += 1
+            micro_step_global += 1
+            # Accumulate metrics in device tensors so .item() (a host sync) runs only on optimizer steps.
+            _gpu_loss_accum += loss.detach()
+            _gpu_ce_accum += ce.detach()
+            _gpu_kd_accum += kd.detach()
+            _gpu_tokens_accum += attention_mask.sum()
+            if use_ds:
+                model.backward(loss)
+                model.step()
+            else:
+                scaled = loss / grad_accum
+                scaled.backward()
+                if micro_in_epoch % grad_accum == 0:
+                    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
+                    optimizer.step()
+                    scheduler.step()
+                    optimizer.zero_grad(set_to_none=True)
+            is_optim_step = (
+                (micro_step_global % grad_accum == 0) if use_ds else (micro_in_epoch % grad_accum == 0)
+            )
+            if is_optim_step:
+                global_step += 1
+                epoch_steps += 1
+                running_count += 1
+                step_loss = _gpu_loss_accum.item() / grad_accum
+                step_ce = _gpu_ce_accum.item() / grad_accum
+                step_kd = _gpu_kd_accum.item() / grad_accum
+                step_tokens = _gpu_tokens_accum.item()
+                _gpu_loss_accum.zero_()
+                _gpu_ce_accum.zero_()
+                _gpu_kd_accum.zero_()
+                _gpu_tokens_accum.zero_()
+                epoch_tokens += step_tokens
+                window_tokens += step_tokens
+                epoch_loss += step_loss
+                epoch_ce += step_ce
+                epoch_kd += step_kd
+                running_loss += step_loss
+                running_ce += step_ce
+                running_kd += step_kd
+                if global_step % log_every == 0 or global_step == total_steps:
+                    avg_loss = running_loss / max(running_count, 1)
+                    avg_ce = running_ce / max(running_count, 1)
+                    avg_kd = running_kd / max(running_count, 1)
+                    try:
+                        lr = scheduler.get_last_lr()[0]
+                    except Exception:
+                        lr = cfg.training.learning_rate
+                    window_elapsed = max(time.time() - window_t_start, 0.1)
+                    rolling_tok_s = window_tokens / window_elapsed
+                    # ETA = seconds per optimizer step in this window * remaining optimizer steps.
+                    rolling_eta_s = (window_elapsed / max(running_count, 1)) * (total_steps - global_step)
+                    cum_tok_s = epoch_tokens / max(time.time() - t_epoch, 1)
+                    log.info(
+                        f"  E{epoch + 1}/{cfg.training.num_epochs} "
+                        f"S{global_step:>4}/{total_steps} | "
+                        f"loss={avg_loss:.4f}  ce={avg_ce:.4f}  kd={avg_kd:.4f} | "
+                        f"lr={lr:.2e} | {rolling_tok_s:,.0f} tok/s (avg {cum_tok_s:,.0f}) | ETA {rolling_eta_s / 60:.1f}m"
+                    )
+                    loss_log.append(
+                        {
+                            "step": global_step,
+                            "epoch": epoch + 1,
+                            "loss_total": round(avg_loss, 5),
+                            "loss_ce": round(avg_ce, 5),
+                            "loss_kd": round(avg_kd, 5),
+                            "lr": lr,
+                            "tok_per_sec": round(rolling_tok_s, 0),
+                            "tok_per_sec_cumulative": round(cum_tok_s, 0),
+                        }
+                    )
+                    window_tokens = 0
+                    window_t_start = time.time()
+                    running_loss = 0.0
+                    running_ce = 0.0
+                    running_kd = 0.0
+                    running_count = 0
+                if checkpoint_every_steps and global_step % checkpoint_every_steps == 0 and is_main:
+                    log.info(f"  Saving mid-epoch checkpoint at step {global_step}...")
+                    step_tag = f"step_{global_step}"
+                    step_ckpt_path = save_checkpoint(
+                        model,
+                        tokenizer,
+                        cfg.paths.distilled_dir,
+                        step_tag,
+                        log,
+                        scheduler=scheduler,
+                        trainer_state={
+                            **checkpoint_packing_metadata,
+                            "checkpoint_type": "step",
+                            "phase": args.phase,
+                            "epoch_index": epoch,
+                            "start_epoch": epoch,
+                            "global_step": global_step,
+                            "micro_step_global": micro_step_global,
+                            # Absolute dataloader position: micro_in_epoch restarts at 0 after a
+                            # mid-epoch resume and would under-count the batches already consumed.
+                            "next_batch_in_epoch": batch_idx + 1,
+                            "num_epochs": cfg.training.num_epochs,
+                            "micro_batch_size": cfg.training.micro_batch_size,
+                            "grad_accum_steps": grad_accum,
+                        },
+                    )
+                    maybe_upload_checkpoint(step_ckpt_path, step_tag, log)
+                if args.max_steps > 0 and global_step >= args.max_steps:
+                    log.info(f"Reached max_steps={args.max_steps}. Stopping training.")
+                    training_complete = True
+                    break
+        if training_complete:
+            break
+        if not use_ds:
+            remainder = micro_in_epoch % grad_accum
+            if remainder != 0:
+                # Gradients were accumulated as loss / grad_accum over only `remainder` micro-batches;
+                # rescale by grad_accum / remainder so the step uses their mean.
+                flush_scale = grad_accum / remainder
+                for parameter in model.parameters():
+                    if parameter.grad is not None:
+                        parameter.grad.mul_(flush_scale)
+                torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
+                optimizer.step()
+                scheduler.step()
+                optimizer.zero_grad(set_to_none=True)
+                global_step += 1
+                epoch_steps += 1
+                step_loss = _gpu_loss_accum.item() / remainder
+                step_ce = _gpu_ce_accum.item() / remainder
+                step_kd = _gpu_kd_accum.item() / remainder
+                step_tokens = _gpu_tokens_accum.item()
+                _gpu_loss_accum.zero_()
+                _gpu_ce_accum.zero_()
+                _gpu_kd_accum.zero_()
+                _gpu_tokens_accum.zero_()
+                epoch_tokens += step_tokens
+                window_tokens += step_tokens
+                running_loss += step_loss
+                running_ce += step_ce
+                running_kd += step_kd
+                running_count += 1
+                avg_loss = running_loss / max(running_count, 1)
+                avg_ce = running_ce / max(running_count, 1)
+                avg_kd = running_kd / max(running_count, 1)
+                epoch_loss += step_loss
+                epoch_ce += step_ce
+                epoch_kd += step_kd
+                running_loss = 0.0
+                running_ce = 0.0
+                running_kd = 0.0
+                running_count = 0
+                elapsed = time.time() - t_start
+                try:
+                    lr = scheduler.get_last_lr()[0]
+                except Exception:
+                    lr = cfg.training.learning_rate
+                tok_s = epoch_tokens / max(time.time() - t_epoch, 1)
+                eta_s = (elapsed / max(global_step, 1)) * (total_steps - global_step)
+                log.info(
+                    f"  E{epoch + 1}/{cfg.training.num_epochs} "
+                    f"S{global_step:>4}/{total_steps} | "
+                    f"loss={avg_loss:.4f}  ce={avg_ce:.4f}  kd={avg_kd:.4f} | "
+                    f"lr={lr:.2e} | {tok_s:,.0f} tok/s | ETA {eta_s / 60:.1f}m  [flush]"
+                )
+                loss_log.append(
+                    {
+                        "step": global_step,
+                        "epoch": epoch + 1,
+                        "loss_total": round(avg_loss, 5),
+                        "loss_ce": round(avg_ce, 5),
+                        "loss_kd": round(avg_kd, 5),
+                        "lr": lr,
+                        "tok_per_sec": round(tok_s, 0),
+                    }
+                )
+                window_tokens = 0
+                window_t_start = time.time()
+                log.info(f"  Epoch {epoch + 1}: flushed {remainder} leftover micro-batches")
+            else:
+                optimizer.zero_grad(set_to_none=True)
+        elif (micro_step_global % grad_accum) != 0 and epoch < cfg.training.num_epochs - 1:
+            carry = micro_step_global % grad_accum
+            log.info(f"  Epoch {epoch + 1}: carrying {carry} micro-batches into the next epoch")
+        avg_epoch_loss = epoch_loss / max(epoch_steps, 1)
+        avg_epoch_ce = epoch_ce / max(epoch_steps, 1)
+        avg_epoch_kd = epoch_kd / max(epoch_steps, 1)
+        epoch_elapsed = time.time() - t_epoch
+        log.info(
+            f"  Epoch {epoch + 1} done | "
+            f"avg_loss={avg_epoch_loss:.4f} ce={avg_epoch_ce:.4f} kd={avg_epoch_kd:.4f} | "
+            f"{epoch_tokens:,} tok | {epoch_elapsed / 60:.1f}m"
+        )
+        _log_gpu(log)
+        val_metrics = None
+        if val_dataloader is not None:
+            val_start = time.time()
+            val_limit = min(20, len(val_dataloader)) if args.max_steps > 0 else -1
+            if val_limit > 0:
+                log.info(f"  Validation start | capping at {val_limit} batches for dry run (total {len(val_dataloader)} batches)")
+            else:
+                log.info(f"  Validation start | {len(val_dataloader):,} batches")
+            val_metrics = evaluate_validation_loss(
+                phase=args.phase,
+                model=model,
+                dataloader=val_dataloader,
+                device=device,
+                alpha=alpha,
+                temperature=temperature,
+                online_kd_token_chunk_size=int(cfg.training.online_kd_token_chunk_size),
+                teacher_model=teacher_model,
+                max_batches=val_limit,
+            )
+            log.info(
+                f"  Validation | loss={val_metrics['loss']:.4f} ce={val_metrics['ce']:.4f} "
+                f"kd={val_metrics['kd']:.4f} | {int(val_metrics['batches'])} batches | "
+                f"{(time.time() - val_start) / 60:.1f}m"
+            )
+        if is_main:
+            selection_loss = val_metrics["loss"] if val_metrics is not None else avg_epoch_loss
+            is_new_best = selection_loss < best_selection_loss
+            epoch_tag = f"epoch_{epoch + 1}"
+            if is_new_best:
+                best_selection_loss = selection_loss
+                best_checkpoint_tag = epoch_tag
+                log.info(f"  Best update: {best_metric_name}={best_selection_loss:.4f} from {epoch_tag}")
+            else:
+                log.info(
+                    f"  Best unchanged: current {best_metric_name}={selection_loss:.4f}; "
+                    f"best={best_selection_loss:.4f} from {best_checkpoint_tag}"
+                )
+            epoch_state = {
+                **checkpoint_packing_metadata,
+                "checkpoint_type": "epoch",
+                "phase": args.phase,
+                "epoch_index": epoch,
+                "start_epoch": epoch + 1,
+                "global_step": global_step,
+                "micro_step_global": micro_step_global,
+                "next_batch_in_epoch": 0,
+                "num_epochs": cfg.training.num_epochs,
+                "micro_batch_size": cfg.training.micro_batch_size,
+                "grad_accum_steps": grad_accum,
+                "selection_loss": float(selection_loss),
+                "best_selection_loss": float(best_selection_loss),
+                "best_metric_name": best_metric_name,
+                "best_checkpoint_tag": best_checkpoint_tag,
+            }
+            if read_env_flag("QUINTUS_SAVE_EPOCH_CHECKPOINTS", True) and not getattr(cfg.training, "disable_checkpointing", False):
+                epoch_ckpt_path = save_checkpoint(
+                    model,
+                    tokenizer,
+                    cfg.paths.distilled_dir,
+                    epoch_tag,
+                    log,
+                    scheduler=scheduler,
+                    trainer_state=epoch_state,
+                )
+                maybe_upload_checkpoint(epoch_ckpt_path, epoch_tag, log)
+            else:
+                log.info(f"  Skipping intermediate {epoch_tag} save")
+            if is_new_best and not getattr(cfg.training, "disable_checkpointing", False):
+                best_ckpt_path = save_checkpoint(
+                    model,
+                    tokenizer,
+                    cfg.paths.distilled_dir,
+                    "best",
+                    log,
+                    scheduler=scheduler,
+                    trainer_state=dict(epoch_state, checkpoint_type="best"),
+                )
+    if use_ds and final_remainder:
+        model.zero_grad()
+        running_loss = 0.0
+        running_ce = 0.0
+        running_kd = 0.0
+        running_count = 0
+        log.warning(f"  Training end: dropped final {final_remainder} leftover micro-batches")
+    if is_main:
+        if best_ckpt_path and os.path.isdir(best_ckpt_path) and not getattr(cfg.training, "disable_checkpointing", False):
+            maybe_upload_checkpoint(best_ckpt_path, "best", log)
+        last_ckpt_path = save_checkpoint(
+            model,
+            tokenizer,
+            cfg.paths.distilled_dir,
+            "last",
+            log,
+            scheduler=scheduler,
+            trainer_state={
+                **checkpoint_packing_metadata,
+                "checkpoint_type": "last",
+                "phase": args.phase,
+                "start_epoch": cfg.training.num_epochs,
+                "global_step": global_step,
+                "micro_step_global": micro_step_global,
+                "next_batch_in_epoch": 0,
+                "num_epochs": cfg.training.num_epochs,
+                "micro_batch_size": cfg.training.micro_batch_size,
+                "grad_accum_steps": grad_accum,
+                "best_selection_loss": float(best_selection_loss) if math.isfinite(best_selection_loss) else None,
+                "best_metric_name": best_metric_name,
+                "best_checkpoint_tag": best_checkpoint_tag,
+            },
+        )
+        maybe_upload_checkpoint(last_ckpt_path, "last", log)
+    csv_path = os.path.join(cfg.paths.distilled_dir, cfg.paths.loss_csv)
+    if loss_log and is_main:
+        with open(csv_path, "w", newline="", encoding="utf-8") as f:
+            writer = csv.DictWriter(f, fieldnames=loss_log[0].keys())
+            writer.writeheader()
+            writer.writerows(loss_log)
+        log.info(f"Loss CSV -> {csv_path}")
+    total_elapsed = time.time() - t_start
+    emit_log_spacing(log)
+    log.info("=" * 70)
+    log.info("Training complete")
+    log.info(f"  Wall time:    {total_elapsed / 3600:.2f}h ({total_elapsed / 60:.1f}m)")
+    log.info(f"  Optim steps:  {global_step}")
+    log.info(f"  Micro steps:  {micro_step_global}")
+    log.info(f"  Best {best_metric_name}: {best_selection_loss:.4f}")
+    log.info(f"  Best ckpt:    {best_ckpt_path}")
+    log.info(f"  Output dir:   {cfg.paths.distilled_dir}/")
+    log.info("=" * 70)
+if __name__ == "__main__":
+    try:
+        main()
+    except Exception:
+        try:
+            setup_logger("TRAIN").exception("Uncaught training failure")
+        except Exception:
+            pass
+        raise
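
The best-checkpoint bookkeeping in the epoch loop above can be sketched in isolation. This is a minimal illustration, not the training script itself: `pick_best_epoch` and its list arguments are hypothetical names introduced here, and the sketch assumes one loss value per epoch.

```python
import math


def pick_best_epoch(train_losses, val_losses=None):
    """Return (tag, loss) of the best epoch.

    Hypothetical helper: it mirrors the rule in the loop above, where the
    validation loss is used for selection when it exists and the average
    training loss is used otherwise.
    """
    best_loss = math.inf
    best_tag = None
    for epoch, train_loss in enumerate(train_losses):
        val_loss = None if val_losses is None else val_losses[epoch]
        selection_loss = val_loss if val_loss is not None else train_loss
        # Strict "<" keeps the earliest epoch when two epochs tie.
        if selection_loss < best_loss:
            best_loss = selection_loss
            best_tag = f"epoch_{epoch + 1}"
    return best_tag, best_loss


print(pick_best_epoch([2.0, 1.5, 1.2], [1.9, 1.6, 1.7]))  # ('epoch_2', 1.6)
print(pick_best_epoch([2.0, 1.5, 1.2]))                   # ('epoch_3', 1.2)
```

With validation data the second epoch wins even though the training loss keeps falling, which is the case the validation split is meant to catch.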

src/training_data.py ADDED

	@@ -0,0 +1,375 @@

+from __future__ import annotations
+import json
+import os
+import torch
+import torch.nn.functional as F
+from torch.utils.data import Dataset
+from configs import cfg
+PAD_MULTIPLE = 128
+def torch_load_cpu(path: str) -> dict:
+    try:
+        return torch.load(path, map_location="cpu", weights_only=True)
+    except TypeError:  # older PyTorch releases do not accept weights_only
+        return torch.load(path, map_location="cpu")
+def extract_shard_id_range(shard_payload: dict, shard_path: str) -> tuple[int, int]:
+    try:
+        ids_payload = shard_payload["ids"]
+    except KeyError as exc:
+        raise KeyError(
+            f"Teacher shard {shard_path} is missing 'ids'. Regenerate the teacher-logit shards."
+        ) from exc
+    if torch.is_tensor(ids_payload):
+        if ids_payload.numel() == 0:
+            raise ValueError(
+                f"Teacher shard {shard_path} has an empty ids tensor. Regenerate the teacher-logit shards."
+            )
+        return int(ids_payload.min().item()), int(ids_payload.max().item())
+    if not isinstance(ids_payload, list) or not ids_payload:
+        raise ValueError(
+            f"Teacher shard {shard_path} has an incompatible ids payload. "
+            "Regenerate the teacher-logit shards."
+        )
+    min_id: int | None = None
+    max_id: int | None = None
+    for sample_idx, ids_tensor in enumerate(ids_payload):
+        if not torch.is_tensor(ids_tensor):
+            raise ValueError(
+                f"Teacher shard {shard_path} sample #{sample_idx} has a non-tensor ids payload. "
+                "Regenerate the teacher-logit shards."
+            )
+        if ids_tensor.numel() == 0:
+            continue
+        sample_min = int(ids_tensor.min().item())
+        sample_max = int(ids_tensor.max().item())
+        min_id = sample_min if min_id is None else min(min_id, sample_min)
+        max_id = sample_max if max_id is None else max(max_id, sample_max)
+    if min_id is None or max_id is None:
+        raise ValueError(
+            f"Teacher shard {shard_path} only contains empty ids tensors. "
+            "Regenerate the teacher-logit shards."
+        )
+    return min_id, max_id
+class DistillationDataset(Dataset):
+    def __init__(self, data_path: str, logits_dir: str, max_seq_len: int, num_samples: int = -1, phase: str = "kd"):
+        self.phase = phase
+        self.data_path = data_path
+        self.logits_dir = logits_dir
+        self.max_seq_len = max_seq_len
+        self.samples_per_shard = self._resolve_samples_per_shard()
+        self.sample_offsets: list[int] = []
+        self.sample_lengths: list[int] = []
+        self.sample_target_counts: list[int] = []
+        self._data_handle = None
+        self._cached_shard_idx: int | None = None
+        self._cached_shard_path: str | None = None
+        self._cached_shard_payload: dict | None = None
+        with open(data_path, "r", encoding="utf-8") as f:
+            while True:
+                if 0 < num_samples <= len(self.sample_offsets):
+                    break
+                offset = f.tell()
+                line = f.readline()
+                if not line:
+                    break
+                i = len(self.sample_offsets)
+                raw_sample = json.loads(line)
+                input_ids_list, loss_mask_list = self._coerce_tokenized_row(raw_sample, i)
+                self.sample_offsets.append(offset)
+                self.sample_lengths.append(len(input_ids_list))
+                self.sample_target_counts.append(sum(loss_mask_list))
+    def __len__(self) -> int:
+        return len(self.sample_offsets)
+    def __getstate__(self) -> dict:
+        state = self.__dict__.copy()
+        state["_data_handle"] = None
+        state["_cached_shard_idx"] = None
+        state["_cached_shard_path"] = None
+        state["_cached_shard_payload"] = None
+        return state
+    def __del__(self) -> None:
+        data_handle = getattr(self, "_data_handle", None)
+        if data_handle is not None:
+            try:
+                data_handle.close()
+            except Exception:
+                pass
+    def _resolve_samples_per_shard(self) -> int:
+        prov_path = os.path.join(self.logits_dir, "_provenance.json")
+        if not os.path.exists(prov_path):
+            return 1
+        try:
+            with open(prov_path, "r", encoding="utf-8") as f:
+                prov = json.load(f)
+        except (OSError, json.JSONDecodeError):
+            return 1
+        shard_schema = prov.get("shard_schema", {})
+        if shard_schema.get("layout") != "chunked_sample_lists":
+            return 1
+        raw_value = prov.get("samples_per_shard", 1)
+        try:
+            value = int(raw_value)
+        except (TypeError, ValueError):
+            return 1
+        return max(value, 1)
+    def _coerce_tokenized_row(self, raw_sample: dict, idx: int) -> tuple[list[int], list[int]]:
+        try:
+            input_ids = raw_sample["input_ids"][: self.max_seq_len]
+        except KeyError as exc:
+            raise KeyError(
+                f"Tokenized sample #{idx} is missing 'input_ids'. "
+                "Re-run download.py to regenerate the tokenized dataset."
+            ) from exc
+        try:
+            loss_mask = raw_sample["loss_mask"][: len(input_ids)]
+        except KeyError as exc:
+            raise KeyError(
+                f"Tokenized sample #{idx} is missing 'loss_mask'. Re-run download.py to regenerate "
+                "assistant-only training targets before distilling."
+            ) from exc
+        if not isinstance(input_ids, list) or len(input_ids) == 0:
+            raise ValueError(
+                f"Tokenized sample #{idx} has incompatible input_ids payload. "
+                "Re-run download.py to regenerate."
+            )
+        if not isinstance(loss_mask, list) or len(loss_mask) != len(input_ids):
+            raise ValueError(
+                f"Tokenized sample #{idx} has incompatible loss_mask length {len(loss_mask)}. "
+                "Re-run download.py to regenerate assistant-only targets."
+            )
+        normalized_mask = [int(value) for value in loss_mask]
+        if any(value not in (0, 1) for value in normalized_mask):
+            raise ValueError(
+                f"Tokenized sample #{idx} has non-binary loss_mask values. "
+                "Re-run download.py to regenerate assistant-only targets."
+            )
+        if sum(normalized_mask) == 0:
+            raise ValueError(
+                f"Tokenized sample #{idx} has no assistant target tokens. "
+                "Re-run download.py to filter invalid conversations."
+            )
+        return [int(token_id) for token_id in input_ids], normalized_mask
+    def _data_file(self):
+        if self._data_handle is None:
+            self._data_handle = open(self.data_path, "r", encoding="utf-8")
+        return self._data_handle
+    def _load_raw_sample(self, idx: int) -> dict:
+        data_file = self._data_file()
+        data_file.seek(self.sample_offsets[idx])
+        line = data_file.readline()
+        if not line:
+            raise IndexError(f"Tokenized sample #{idx} could not be read from {self.data_path}.")
+        return json.loads(line)
+    def _load_shard_payload(self, shard_idx: int) -> tuple[str, dict]:
+        if self._cached_shard_idx == shard_idx and self._cached_shard_payload is not None and self._cached_shard_path is not None:
+            return self._cached_shard_path, self._cached_shard_payload
+        shard_path = os.path.join(self.logits_dir, f"shard_{shard_idx:06d}.pt")
+        if not os.path.exists(shard_path):
+            raise FileNotFoundError(
+                f"Missing teacher logit shard: {shard_path}. "
+                "Regenerate the teacher-logit shards."
+            )
+        payload = torch_load_cpu(shard_path)
+        self._cached_shard_idx = shard_idx
+        self._cached_shard_path = shard_path
+        self._cached_shard_payload = payload
+        return shard_path, payload
+    def _load_teacher_tensors(self, idx: int, seq_len: int) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
+        if self.samples_per_shard <= 1:
+            shard_path, shard = self._load_shard_payload(idx)
+            try:
+                teacher_logprobs = shard["logprobs"][:seq_len]
+                teacher_ids = shard["ids"][:seq_len]
+                teacher_other_logprob = shard["other_logprob"][:seq_len]
+            except KeyError as exc:
+                missing = exc.args[0]
+                raise KeyError(
+                    f"Shard {shard_path} is missing {missing!r}. "
+                    "Regenerate the current teacher-logit shards."
+                ) from exc
+            return teacher_logprobs, teacher_ids, teacher_other_logprob
+        shard_idx = idx // self.samples_per_shard
+        sample_offset = idx % self.samples_per_shard
+        shard_path, shard = self._load_shard_payload(shard_idx)
+        try:
+            count = int(shard["count"])
+            start_idx = int(shard["start_idx"])
+            logprobs_list = shard["logprobs"]
+            ids_list = shard["ids"]
+            other_list = shard["other_logprob"]
+        except KeyError as exc:
+            missing = exc.args[0]
+            raise KeyError(
+                f"Grouped shard {shard_path} is missing {missing!r}. "
+                "Regenerate the current teacher-logit shards."
+            ) from exc
+        expected_start_idx = shard_idx * self.samples_per_shard
+        if start_idx != expected_start_idx:
+            raise ValueError(
+                f"Grouped shard {shard_path} starts at sample {start_idx}, "
+                f"expected {expected_start_idx}. Regenerate the teacher-logit shards."
+            )
+        if not (len(logprobs_list) == len(ids_list) == len(other_list) == count):
+            raise ValueError(
+                f"Grouped shard {shard_path} has inconsistent sample counts. "
+                "Regenerate the current teacher-logit shards."
+            )
+        if sample_offset >= count:
+            raise FileNotFoundError(
+                f"Grouped shard {shard_path} does not contain sample #{idx} "
+                f"(start_idx={start_idx}, count={count}). Regenerate the teacher-logit shards."
+            )
+        try:
+            teacher_logprobs = logprobs_list[sample_offset][:seq_len]
+            teacher_ids = ids_list[sample_offset][:seq_len]
+            teacher_other_logprob = other_list[sample_offset][:seq_len]
+        except (IndexError, TypeError) as exc:
+            raise ValueError(
+                f"Grouped shard {shard_path} has an incompatible payload layout. "
+                "Regenerate the current teacher-logit shards."
+            ) from exc
+        return teacher_logprobs, teacher_ids, teacher_other_logprob
+    def __getitem__(self, idx: int) -> dict:
+        raw_sample = self._load_raw_sample(idx)
+        input_ids_list, loss_mask_list = self._coerce_tokenized_row(raw_sample, idx)
+        input_ids = torch.tensor(input_ids_list, dtype=torch.long)
+        loss_mask = torch.tensor(loss_mask_list, dtype=torch.long)
+        seq_len = int(input_ids.size(0))
+        if self.phase in ("sft", "online_kd"):
+            return {"input_ids": input_ids, "loss_mask": loss_mask}
+        teacher_logprobs, teacher_ids, teacher_other_logprob = self._load_teacher_tensors(idx, seq_len)
+        if teacher_logprobs.shape[0] != seq_len:
+            raise ValueError(
+                f"Teacher shard for sample #{idx} has length {teacher_logprobs.shape[0]}, "
+                f"but the tokenized row has length {seq_len}. Regenerate the teacher-logit shards; "
+                "teacher shards must be in original JSONL row order."
+            )
+        if teacher_logprobs.ndim != 2 or teacher_ids.shape != teacher_logprobs.shape:
+            raise ValueError(
+                f"Teacher shard for sample #{idx} has incompatible top-k tensor shapes: "
+                f"logprobs={tuple(teacher_logprobs.shape)}, ids={tuple(teacher_ids.shape)}. "
+                "Regenerate the current teacher-logit shards."
+            )
+        if teacher_other_logprob.ndim != 1 or teacher_other_logprob.shape[0] != teacher_logprobs.shape[0]:
+            raise ValueError(
+                f"Teacher shard for sample #{idx} has incompatible other-bucket shape: "
+                f"other_logprob={tuple(teacher_other_logprob.shape)}, "
+                f"expected ({teacher_logprobs.shape[0]},). "
+                "Regenerate the current teacher-logit shards."
+            )
+        if teacher_logprobs.shape[1] != cfg.training.top_k:
+            raise ValueError(
+                f"Teacher shard for sample #{idx} stores top_k={teacher_logprobs.shape[1]}, "
+                f"expected {cfg.training.top_k}. "
+                "Regenerate compatible teacher-logit shards."
+            )
+        return {
+            "input_ids": input_ids,
+            "loss_mask": loss_mask,
+            "teacher_logprobs": teacher_logprobs,
+            "teacher_ids": teacher_ids.long(),
+            "teacher_other_logprob": teacher_other_logprob,
+        }
+def collate_fn(batch: list[dict], pad_token_id: int) -> dict:
+    raw_max = max(item["input_ids"].size(0) for item in batch)
+    # Round the longest sequence up to the next multiple of PAD_MULTIPLE.
+    max_len = ((raw_max + PAD_MULTIPLE - 1) // PAD_MULTIPLE) * PAD_MULTIPLE
+    input_ids_list, mask_list, loss_mask_list, labels_list = [], [], [], []
+    teacher_logprobs_list, teacher_ids_list, teacher_other_logprob_list = [], [], []
+    for item in batch:
+        seq_len = item["input_ids"].size(0)
+        pad_len = max_len - seq_len
+        padded_loss_mask = F.pad(item["loss_mask"], (0, pad_len), value=0)
+        padded_labels = F.pad(item["input_ids"].clone(), (0, pad_len), value=pad_token_id)
+        # Positions outside the loss mask (prompt tokens and padding) get label -100.
+        padded_labels = padded_labels.masked_fill(padded_loss_mask == 0, -100)
+        input_ids_list.append(F.pad(item["input_ids"], (0, pad_len), value=pad_token_id))
+        mask_list.append(
+            torch.cat(
+                [
+                    torch.ones(seq_len, dtype=torch.long),
+                    torch.zeros(pad_len, dtype=torch.long),
+                ]
+            )
+        )
+        loss_mask_list.append(padded_loss_mask)
+        labels_list.append(padded_labels)
+        if "teacher_logprobs" in item:
+            teacher_seq_len = item["teacher_logprobs"].size(0)
+            teacher_pad_len = max_len - teacher_seq_len
+            teacher_logprobs_list.append(
+                F.pad(item["teacher_logprobs"], (0, 0, 0, teacher_pad_len), value=float("-inf"))
+            )
+            teacher_ids_list.append(F.pad(item["teacher_ids"], (0, 0, 0, teacher_pad_len), value=0))
+            teacher_other_logprob_list.append(
+                F.pad(item["teacher_other_logprob"], (0, teacher_pad_len), value=float("-inf"))
+            )
+    result = {
+        "input_ids": torch.stack(input_ids_list),
+        "attention_mask": torch.stack(mask_list),
+        "loss_mask": torch.stack(loss_mask_list).long(),
+        "labels": torch.stack(labels_list),
+    }
+    if teacher_logprobs_list:
+        result["teacher_logprobs"] = torch.stack(teacher_logprobs_list)
+        result["teacher_ids"] = torch.stack(teacher_ids_list)
+        result["teacher_other_logprob"] = torch.stack(teacher_other_logprob_list)
+    return result
+def resolve_dataloader_runtime() -> dict[str, int | bool]:
+    cpu_count = max(1, os.cpu_count() or 1)
+    configured_workers = int(getattr(cfg.training, "dataloader_workers", 4))
+    num_workers = max(0, min(configured_workers, cpu_count))
+    runtime: dict[str, int | bool] = {
+        "num_workers": num_workers,
+        "pin_memory": torch.cuda.is_available(),
+    }
+    if num_workers > 0:
+        runtime["persistent_workers"] = True
+        runtime["prefetch_factor"] = max(1, int(getattr(cfg.training, "prefetch_factor", 2)))
+    return runtime
+def move_batch_to_device(batch: dict[str, torch.Tensor], device: torch.device) -> dict[str, torch.Tensor]:
+    non_blocking = device.type == "cuda"
+    return {
+        name: tensor.to(device, non_blocking=non_blocking)
+        for name, tensor in batch.items()
+    }
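
Two pieces of index arithmetic in this file are easy to check by hand: how a sample index maps to a shard file and an offset inside it, and how `collate_fn` rounds the batch length up to `PAD_MULTIPLE`. The sketch below restates both in plain Python. `locate_sample` and `padded_length` are hypothetical helper names that do not exist in the module.

```python
from __future__ import annotations

PAD_MULTIPLE = 128  # same value as the module constant above


def locate_sample(idx: int, samples_per_shard: int) -> tuple[str, int]:
    """Map a dataset index to (shard file name, offset inside the shard)."""
    if samples_per_shard <= 1:
        # One sample per shard: the shard index is the sample index.
        return f"shard_{idx:06d}.pt", 0
    shard_idx = idx // samples_per_shard
    return f"shard_{shard_idx:06d}.pt", idx % samples_per_shard


def padded_length(raw_max: int, multiple: int = PAD_MULTIPLE) -> int:
    """Round raw_max up to the next multiple (ceiling division, then scale)."""
    return ((raw_max + multiple - 1) // multiple) * multiple


print(locate_sample(1234, 500))  # ('shard_000002.pt', 234)
print(locate_sample(7, 1))       # ('shard_000007.pt', 0)
print(padded_length(130))        # 256
print(padded_length(128))        # 128
```

A sequence that is already a multiple of 128 is left at its length, so padding is never more than 127 tokens per batch.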

src/training_schedule.py ADDED

	@@ -0,0 +1,165 @@

+from __future__ import annotations
+import json
+import math
+import torch
+from torch.utils.data import Dataset, Subset
+def compute_training_schedule(
+    dataset_size: int,
+    micro_batch_size: int,
+    grad_accum: int,
+    num_epochs: int,
+    use_ds: bool,
+    drop_last: bool = True,
+) -> dict[str, int | bool]:
+    if dataset_size < 0:
+        raise ValueError("dataset_size must be >= 0")
+    if micro_batch_size <= 0 or grad_accum <= 0 or num_epochs <= 0:
+        raise ValueError("micro_batch_size, grad_accum, and num_epochs must all be positive")
+    if drop_last:
+        batches_per_epoch = dataset_size // micro_batch_size
+        used_samples_per_epoch = batches_per_epoch * micro_batch_size
+        dropped_samples_per_epoch = dataset_size - used_samples_per_epoch
+    else:
+        batches_per_epoch = math.ceil(dataset_size / micro_batch_size) if dataset_size else 0
+        used_samples_per_epoch = dataset_size
+        dropped_samples_per_epoch = 0
+    total_micro_batches = batches_per_epoch * num_epochs
+    remainder_batches = batches_per_epoch % grad_accum if batches_per_epoch else 0
+    has_remainder = remainder_batches != 0
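+    # DeepSpeed path: accumulation carries across epochs, so only the run's final remainder is left over.
+    if use_ds: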
+        steps_per_epoch = batches_per_epoch // grad_accum
+        total_steps = total_micro_batches // grad_accum
+        final_remainder = total_micro_batches % grad_accum
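+    # Non-DeepSpeed path: a partial accumulation window at the end of an epoch counts as one extra step.
+    else: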
+        steps_per_epoch = batches_per_epoch // grad_accum + (1 if has_remainder and batches_per_epoch else 0)
+        total_steps = steps_per_epoch * num_epochs
+        final_remainder = 0
+    return {
+        "batches_per_epoch": batches_per_epoch,
+        "used_samples_per_epoch": used_samples_per_epoch,
+        "dropped_samples_per_epoch": dropped_samples_per_epoch,
+        "remainder_batches": remainder_batches,
+        "has_remainder": has_remainder,
+        "total_micro_batches": total_micro_batches,
+        "steps_per_epoch": steps_per_epoch,
+        "total_steps": total_steps,
+        "final_remainder": final_remainder,
+        "dropped_samples_total": final_remainder * micro_batch_size if use_ds else 0,
+    }
+def choose_validation_size(
+    dataset_size: int,
+    validation_ratio: float,
+    micro_batch_size: int,
+    grad_accum: int,
+    num_epochs: int,
+    use_ds: bool,
+) -> int:
+    if not 0.0 <= validation_ratio < 1.0:
+        raise ValueError(f"validation_ratio must be in [0, 1), got {validation_ratio}")
+    if dataset_size < 2 or validation_ratio <= 0:
+        return 0
+    desired_val_size = max(1, int(round(dataset_size * validation_ratio)))
+    aligned_candidates: list[tuple[int, int]] = []
+    fallback_candidates: list[tuple[int, int]] = []
+    for val_size in range(1, dataset_size):
+        train_size = dataset_size - val_size
+        schedule = compute_training_schedule(
+            dataset_size=train_size,
+            micro_batch_size=micro_batch_size,
+            grad_accum=grad_accum,
+            num_epochs=num_epochs,
+            use_ds=use_ds,
+            drop_last=True,
+        )
+        if int(schedule["batches_per_epoch"]) == 0:
+            continue
+        if int(schedule["dropped_samples_per_epoch"]) != 0:
+            continue
+        candidate = (abs(val_size - desired_val_size), val_size)
+        if int(schedule["remainder_batches"]) == 0 and int(schedule["final_remainder"]) == 0:
+            aligned_candidates.append(candidate)
+        else:
+            fallback_candidates.append(candidate)
+    if aligned_candidates:
+        return min(aligned_candidates)[1]
+    if fallback_candidates:
+        return min(fallback_candidates)[1]
+    return min(desired_val_size, dataset_size - 1)
+def build_train_validation_subsets(
+    dataset: Dataset,
+    validation_ratio: float,
+    split_seed: int,
+    micro_batch_size: int,
+    grad_accum: int,
+    num_epochs: int,
+    use_ds: bool,
+) -> tuple[Dataset, Dataset | None, dict[str, float | int | bool]]:
+    dataset_size = len(dataset)
+    validation_size = choose_validation_size(
+        dataset_size=dataset_size,
+        validation_ratio=validation_ratio,
+        micro_batch_size=micro_batch_size,
+        grad_accum=grad_accum,
+        num_epochs=num_epochs,
+        use_ds=use_ds,
+    )
+    requested_validation_size = max(1, int(round(dataset_size * validation_ratio))) if validation_ratio > 0 else 0
+    metadata: dict[str, float | int | bool] = {
+        "dataset_size": dataset_size,
+        "requested_validation_size": requested_validation_size,
+        "validation_size": validation_size,
+        "train_size": dataset_size - validation_size,
+        "requested_validation_ratio": validation_ratio,
+        "actual_validation_ratio": (validation_size / dataset_size) if dataset_size else 0.0,
+        "adjusted": validation_size != requested_validation_size,
+    }
+    train_schedule = compute_training_schedule(
+        dataset_size=dataset_size - validation_size,
+        micro_batch_size=micro_batch_size,
+        grad_accum=grad_accum,
+        num_epochs=num_epochs,
+        use_ds=use_ds,
+        drop_last=True,
+    )
+    metadata.update(
+        {
+            "effective_batch_size": micro_batch_size * grad_accum,
+            "train_batches_per_epoch": int(train_schedule["batches_per_epoch"]),
+            "train_remainder_batches": int(train_schedule["remainder_batches"]),
+            "train_dropped_samples_per_epoch": int(train_schedule["dropped_samples_per_epoch"]),
+            "accumulation_aligned": int(train_schedule["remainder_batches"]) == 0
+            and int(train_schedule["final_remainder"]) == 0,
+        }
+    )
+    if validation_size == 0:
+        return dataset, None, metadata
+    generator = torch.Generator().manual_seed(split_seed)
+    permutation = torch.randperm(dataset_size, generator=generator).tolist()
+    val_indices = sorted(permutation[:validation_size])
+    train_indices = sorted(permutation[validation_size:])
+    return Subset(dataset, train_indices), Subset(dataset, val_indices), metadata
+def load_deepspeed_runtime_config(config_path: str, micro_batch_size: int, grad_accum: int) -> dict:
+    with open(config_path, "r", encoding="utf-8") as f:
+        ds_cfg = json.load(f)
+    if not isinstance(ds_cfg, dict):
+        raise ValueError(f"DeepSpeed config in {config_path} must be a JSON object.")
+    runtime_cfg = dict(ds_cfg)
+    runtime_cfg["train_micro_batch_size_per_gpu"] = micro_batch_size
+    runtime_cfg["gradient_accumulation_steps"] = grad_accum
+    return runtime_cfg
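
A worked example makes the two step-counting branches of `compute_training_schedule` concrete. The function below is a condensed restatement of that arithmetic for the `drop_last=True` case only. `schedule_steps` is a hypothetical name, and the input numbers are made up for illustration.

```python
def schedule_steps(dataset_size, micro_batch_size, grad_accum, num_epochs, use_ds):
    """Condensed sketch of the step arithmetic above (drop_last=True only)."""
    batches_per_epoch = dataset_size // micro_batch_size
    total_micro_batches = batches_per_epoch * num_epochs
    remainder_batches = batches_per_epoch % grad_accum
    if use_ds:
        # Accumulation windows span epoch boundaries; leftovers exist only at the very end.
        steps_per_epoch = batches_per_epoch // grad_accum
        total_steps = total_micro_batches // grad_accum
        final_remainder = total_micro_batches % grad_accum
    else:
        # A partial window at the end of each epoch counts as one extra optimizer step.
        steps_per_epoch = batches_per_epoch // grad_accum + (1 if remainder_batches else 0)
        total_steps = steps_per_epoch * num_epochs
        final_remainder = 0
    return {
        "batches_per_epoch": batches_per_epoch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "final_remainder": final_remainder,
    }


# 1000 samples, micro-batch 8, accumulation 4, 3 epochs -> 125 micro-batches per epoch.
print(schedule_steps(1000, 8, 4, 3, use_ds=True))
# {'batches_per_epoch': 125, 'steps_per_epoch': 31, 'total_steps': 93, 'final_remainder': 3}
print(schedule_steps(1000, 8, 4, 3, use_ds=False))
# {'batches_per_epoch': 125, 'steps_per_epoch': 32, 'total_steps': 96, 'final_remainder': 0}
```

In the DeepSpeed case 375 micro-batches give 93 full steps with 3 left over, which is the quantity the training script reports as dropped at the end of the run.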

src/transformers_compat.py ADDED

	@@ -0,0 +1,110 @@

+from __future__ import annotations
+import importlib
+import importlib.util
+import os
+from configs import cfg
+def _false() -> bool:
+    return False
+def describe_exception_chain(exc: Exception) -> str:
+    messages: list[str] = []
+    seen: set[int] = set()
+    current: BaseException | None = exc
+    while current is not None and id(current) not in seen:
+        seen.add(id(current))
+        message = f"{type(current).__name__}: {current}"
+        if message not in messages:
+            messages.append(message)
+        current = current.__cause__ or current.__context__
+    return " -> ".join(messages)
+def disable_flash_attn_for_transformers() -> None:
+    try:
+        import transformers.utils as tf_utils
+        tf_utils.is_flash_attn_2_available = _false
+        if hasattr(tf_utils, "is_flash_attn_3_available"):
+            tf_utils.is_flash_attn_3_available = _false
+    except Exception:
+        pass
+    try:
+        from transformers.utils import import_utils as tf_import_utils
+        tf_import_utils.is_flash_attn_2_available = _false
+        if hasattr(tf_import_utils, "is_flash_attn_3_available"):
+            tf_import_utils.is_flash_attn_3_available = _false
+    except Exception:
+        pass
+    try:
+        import transformers.modeling_utils as modeling_utils
+        if hasattr(modeling_utils, "is_flash_attn_2_available"):
+            modeling_utils.is_flash_attn_2_available = _false
+        if hasattr(modeling_utils, "is_flash_attn_3_available"):
+            modeling_utils.is_flash_attn_3_available = _false
+    except Exception:
+        pass
+    try:
+        import transformers.modeling_flash_attention_utils as flash_utils
+        flash_utils.is_flash_attn_2_available = _false
+        if hasattr(flash_utils, "is_flash_attn_3_available"):
+            flash_utils.is_flash_attn_3_available = _false
+    except Exception:
+        pass
+def resolve_attention_backend(logger) -> str:
+    forced_backend = os.environ.get("QUINTUS_ATTENTION_BACKEND")
+    if forced_backend:
+        logger.info(f"  [ATTENTION] Forced backend via QUINTUS_ATTENTION_BACKEND={forced_backend!r}.")
+        return forced_backend
+    try:
+        from transformers.utils import is_flash_attn_3_available
+        if is_flash_attn_3_available():
+            logger.info("  [ATTENTION] Using flash_attention_3.")
+            return "flash_attention_3"
+    except Exception:
+        pass
+    try:
+        importlib.import_module("flash_attn")
+        logger.info("  [ATTENTION] Using flash_attention_2.")
+        return "flash_attention_2"
+    except Exception as exc:
+        if importlib.util.find_spec("flash_attn") is not None:
+            disable_flash_attn_for_transformers()
+            logger.warning(
+                "  [ATTENTION] flash-attn appears installed but failed to import (%s: %s); "
+                "masking flash-attn from Transformers and falling back to sdpa.",
+                type(exc).__name__,
+                exc,
+            )
+        else:
+            logger.info("  [ATTENTION] Using PyTorch SDPA.")
+        return "sdpa"
+def _requires_remote_code_opt_in(exc: Exception) -> bool:
+    message = str(exc).lower()
+    return (
+        "trust_remote_code" in message
+        or "requires you to execute the configuration file" in message
+        or "requires remote code" in message
+    )
+def format_model_load_error(subject: str, exc: Exception) -> str:
+    if not cfg.model.allow_remote_code and _requires_remote_code_opt_in(exc):
+        return (
+            f"{subject} failed because the selected model/tokenizer requires remote code, "
+            "but Quintus is configured with allow_remote_code=false. Review the upstream "
+            "repository and rerun with QUINTUS_ALLOW_REMOTE_CODE=1 only if you explicitly "
+            "trust that code."
+        )
+    return f"{subject} failed: {describe_exception_chain(exc)}"
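
The output format of `describe_exception_chain` is easiest to see on a small chained exception. The block below copies the function from the diff above so that it runs on its own. The two exceptions and their messages are invented for the example.

```python
from __future__ import annotations


def describe_exception_chain(exc: Exception) -> str:
    # Copied from the diff above: walk __cause__/__context__ and join unique messages.
    messages: list[str] = []
    seen: set[int] = set()
    current: BaseException | None = exc
    while current is not None and id(current) not in seen:
        seen.add(id(current))
        message = f"{type(current).__name__}: {current}"
        if message not in messages:
            messages.append(message)
        current = current.__cause__ or current.__context__
    return " -> ".join(messages)


try:
    try:
        raise ValueError("bad config")
    except ValueError as inner:
        # "raise ... from" sets __cause__ on the outer exception.
        raise RuntimeError("load failed") from inner
except RuntimeError as outer:
    summary = describe_exception_chain(outer)

print(summary)  # RuntimeError: load failed -> ValueError: bad config
```

The `seen` set guards against cyclic chains, and the duplicate check collapses links that would print the same text twice.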

src/validation.py ADDED

	@@ -0,0 +1,70 @@

+from __future__ import annotations
+import torch
+from torch.utils.data import DataLoader
+from src.losses import compute_loss_for_phase
+from src.training_data import move_batch_to_device
+def evaluate_validation_loss(
+    phase: str,
+    model,
+    dataloader: DataLoader,
+    device: torch.device,
+    alpha: float,
+    temperature: float,
+    online_kd_token_chunk_size: int = 2048,
+    teacher_model=None,
+    max_batches: int = -1,
+) -> dict[str, float | int]:
+    """Average loss, CE, and KD over the validation batches.
+
+    At most `max_batches` batches are evaluated; a value <= 0 means no cap.
+    Runs under `torch.inference_mode()` and restores the model's previous
+    train/eval mode before returning.
+    """
+    was_training = model.training
+    model.eval()
+    total_loss = 0.0
+    total_ce = 0.0
+    total_kd = 0.0
+    batches = 0
+    with torch.inference_mode():
+        for batch in dataloader:
+            if max_batches > 0 and batches >= max_batches:
+                break
+            batch = move_batch_to_device(batch, device)
+            input_ids = batch["input_ids"]
+            attention_mask = batch["attention_mask"]
+            labels = batch["labels"]
+            loss_mask = batch["loss_mask"]
+            logits = model(input_ids=input_ids, attention_mask=attention_mask).logits
+            if phase == "online_kd" and teacher_model is not None:
+                teacher_logits = teacher_model(input_ids=input_ids, attention_mask=attention_mask).logits
+            else:
+                teacher_logits = None
+            loss, ce, kd = compute_loss_for_phase(
+                phase,
+                logits,
+                labels,
+                loss_mask,
+                batch,
+                alpha,
+                temperature,
+                teacher_logits=teacher_logits,
+                online_kd_token_chunk_size=online_kd_token_chunk_size,
+            )
+            total_loss += float(loss.detach().item())
+            total_ce += float(ce.detach().item())
+            total_kd += float(kd.detach().item())
+            batches += 1
+    if was_training:
+        model.train()
+    denom = max(batches, 1)
+    return {
+        "loss": total_loss / denom,
+        "ce": total_ce / denom,
+        "kd": total_kd / denom,
+        "batches": batches,
+    }
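
The batch cap and the zero-batch guard in `evaluate_validation_loss` can be checked in isolation. Below is a minimal sketch of just the averaging logic, assuming the per-batch `(loss, ce, kd)` values are already plain floats; `average_losses` is a hypothetical name and does no model or tensor work.

```python
from __future__ import annotations

from typing import Iterable


def average_losses(
    batch_losses: Iterable[tuple[float, float, float]],
    max_batches: int = -1,
) -> dict[str, float | int]:
    """Average (loss, ce, kd) triples over at most `max_batches` batches.

    A non-positive `max_batches` means "use every batch", mirroring the
    `max_batches > 0` check in the function above.
    """
    total_loss = total_ce = total_kd = 0.0
    batches = 0
    for loss, ce, kd in batch_losses:
        if max_batches > 0 and batches >= max_batches:
            break
        total_loss += loss
        total_ce += ce
        total_kd += kd
        batches += 1
    # Guard against an empty dataloader: divide by 1 and report batches == 0.
    denom = max(batches, 1)
    return {
        "loss": total_loss / denom,
        "ce": total_ce / denom,
        "kd": total_kd / denom,
        "batches": batches,
    }


capped = average_losses([(2.0, 1.0, 3.0), (4.0, 3.0, 5.0), (100.0, 100.0, 100.0)], max_batches=2)
print(capped)
```

Note that an empty input yields zeros rather than NaN, so callers that care about the difference should inspect the `batches` field.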

weight_audit/quintus_weight_audit.py ADDED

	@@ -0,0 +1,818 @@

+"""
+Usage: python quintus_weight_audit.py \
+        --base_model Qwen/Qwen3-1.7B-Base \
+        --distilled_model iamrahulreddy/Quintus \
+        --output_file weight_audit_report.txt \
+        --alpha 0.3
+"""
+import argparse
+import collections
+import math
+import sys
+import time
+from datetime import datetime, timezone
+from pathlib import Path
+import torch
+import torch.nn.functional as F
+from huggingface_hub import snapshot_download
+from transformers import AutoConfig, AutoModelForCausalLM
+# Formatting utilities
+def fmt_num(n: int) -> str:
+    if n >= 1_000_000_000:
+        return f"{n:,}  ({n / 1e9:.6f} B)"
+    if n >= 1_000_000:
+        return f"{n:,}  ({n / 1e6:.6f} M)"
+    return f"{n:,}"
+def fmt_size(b: int) -> str:
+    if b >= 1 << 30:
+        return f"{b / (1 << 30):.3f} GiB"
+    if b >= 1 << 20:
+        return f"{b / (1 << 20):.3f} MiB"
+    if b >= 1 << 10:
+        return f"{b / (1 << 10):.3f} KiB"
+    return f"{b} B"
+def divider(char: str = "-", width: int = 88) -> str:
+    return char * width
+def section_header(index: int, title: str) -> str:
+    return f"\n[{index:02d}] {title}"
+def sub_header(title: str) -> str:
+    return f"\n  -- {title}"
+# Layer classification
+LAYER_TYPE_MAP = {
+    "embed_tokens":              "embedding",
+    "lm_head":                   "lm_head",
+    "self_attn.q_proj":          "attn_q",
+    "self_attn.k_proj":          "attn_k",
+    "self_attn.v_proj":          "attn_v",
+    "self_attn.o_proj":          "attn_o",
+    "self_attn.q_norm":          "attn_qnorm",
+    "self_attn.k_norm":          "attn_knorm",
+    "mlp.gate_proj":             "mlp_gate",
+    "mlp.up_proj":               "mlp_up",
+    "mlp.down_proj":             "mlp_down",
+    "input_layernorm":           "layernorm",
+    "post_attention_layernorm":  "layernorm",
+    "model.norm":                "final_norm",
+}
+def classify_layer(name: str) -> str:
+    for pattern, label in LAYER_TYPE_MAP.items():
+        if pattern in name:
+            return label
+    return "other"
+# Tensor statistics
+def tensor_stats(t: torch.Tensor) -> dict:
+    tf   = t.float()
+    flat = tf.view(-1)
+    mean = flat.mean().item()
+    std  = flat.std().item()
+    sparsity   = (flat.abs() < 1e-6).float().mean().item()
+    sat_thresh = flat.abs().max().item() * 0.99
+    saturation = (flat.abs() >= sat_thresh).float().mean().item()
+    kurtosis   = (((flat - mean) / std) ** 4).mean().item() - 3.0 if std > 1e-10 else 0.0
+    outlier_r  = (flat.abs() > (flat.abs().mean() + 3.0 * std)).float().mean().item()
+    row_l2_stats = {}
+    if tf.ndim == 2:
+        row_norms = tf.norm(2, dim=1)
+        row_l2_stats = {
+            "row_l2_mean": row_norms.mean().item(),
+            "row_l2_std":  row_norms.std().item(),
+            "row_l2_min":  row_norms.min().item(),
+            "row_l2_max":  row_norms.max().item(),
+            "dead_rows":   int((row_norms < 1e-6).sum().item()),
+        }
+    return {
+        "shape":         list(tf.shape),
+        "numel":         flat.numel(),
+        "dtype":         str(t.dtype),
+        "mean":          mean,
+        "std":           std,
+        "min":           flat.min().item(),
+        "max":           flat.max().item(),
+        "abs_mean":      flat.abs().mean().item(),
+        "l2_norm":       flat.norm(2).item(),
+        "l1_norm":       flat.norm(1).item(),
+        "sparsity":      sparsity,
+        "saturation":    saturation,
+        "kurtosis":      kurtosis,
+        "outlier_ratio": outlier_r,
+        **row_l2_stats,
+    }
+# Divergence between two tensors
+def tensor_divergence(t_base: torch.Tensor, t_dist: torch.Tensor, chunk_size: int = 10_000_000) -> dict:
+    a_flat = t_base.detach().view(-1)
+    b_flat = t_dist.detach().view(-1)
+    n_elements = a_flat.numel()
+    # Accumulate in Python floats (float64) so no full-size float64 copy of either tensor is materialized
+    dot_prod = 0.0
+    a_sq_sum = 0.0
+    b_sq_sum = 0.0
+    # Delta statistics
+    max_delta = 0.0
+    sum_delta = 0.0
+    l2_delta_sq = 0.0
+    sum_abs_a = 0.0
+    # Process in chunks to bound memory: each float64 chunk of 10M elements is ~80 MB, and a few are alive at once
+    for i in range(0, n_elements, chunk_size):
+        a_chunk = a_flat[i : i + chunk_size].to(torch.float64)
+        b_chunk = b_flat[i : i + chunk_size].to(torch.float64)
+        # Accumulate dot product and norms
+        dot_prod += torch.dot(a_chunk, b_chunk).item()
+        a_sq_sum += torch.dot(a_chunk, a_chunk).item()
+        b_sq_sum += torch.dot(b_chunk, b_chunk).item()
+        # Accumulate delta stats
+        delta_chunk = (b_chunk - a_chunk).abs()
+        max_delta = max(max_delta, delta_chunk.max().item())
+        sum_delta += delta_chunk.sum().item()
+        l2_delta_sq += torch.dot(delta_chunk, delta_chunk).item()
+        sum_abs_a += a_chunk.abs().sum().item()
+    # Final metrics
+    a_norm = math.sqrt(a_sq_sum)
+    b_norm = math.sqrt(b_sq_sum)
+    if a_norm > 0 and b_norm > 0:
+        cos_sim_raw = dot_prod / (a_norm * b_norm)
+    else:
+        cos_sim_raw = 0.0
+    cos_sim = max(-1.0, min(1.0, cos_sim_raw))
+    rel_err = sum_delta / (sum_abs_a + 1e-12)
+    base_l2 = a_norm
+    delta_l2 = math.sqrt(l2_delta_sq)
+    snr_db = 20.0 * math.log10(base_l2 / (delta_l2 + 1e-12)) if base_l2 > 0 else 0.0
+    # Standard deviation of the absolute delta |b - a| (mean_delta is the mean absolute delta)
+    mean_delta = sum_delta / n_elements
+    mean_delta_sq = l2_delta_sq / n_elements
+    var_delta = max(0.0, mean_delta_sq - mean_delta**2)
+    std_delta = math.sqrt(var_delta)
+    return {
+        "max_delta":  max_delta,
+        "mean_delta": mean_delta,
+        "std_delta":  std_delta,
+        "l2_delta":   delta_l2,
+        "cos_sim":    cos_sim,
+        "cos_sim_raw": cos_sim_raw,
+        "rel_err":    rel_err,
+        "snr_db":     snr_db,
+        "changed":    max_delta > 1e-7,
+    }
+# Isotropy
+def isotropy_score(t: torch.Tensor, n_samples: int = 2048) -> float:
+    """
+    Average pairwise cosine similarity of randomly sampled row vectors.
+    Near 0 = isotropic (healthy). Near 1 = collapsed representations.
+    Only valid for 2D tensors with >= 2 rows.
+    """
+    if t.ndim != 2 or t.shape[0] < 2:
+        return float("nan")
+    tf    = t.float()
+    n     = min(t.shape[0], n_samples)
+    # Fixed seed so the sampled rows (and therefore the score) are reproducible across runs
+    gen   = torch.Generator().manual_seed(42)
+    idx   = torch.randperm(t.shape[0], generator=gen)[:n].to(t.device)
+    rows  = tf[idx]
+    norms = rows.norm(2, dim=1, keepdim=True).clamp(min=1e-12)
+    normed = rows / norms
+    sim    = normed @ normed.T
+    mask   = ~torch.eye(n, dtype=torch.bool)
+    return sim[mask].mean().item()
+# Config helpers
+def config_architecture_lines(config, label: str, model_id: str) -> list[str]:
+    cfg      = config.to_dict()
+    n_q      = cfg.get("num_attention_heads", 1)
+    n_kv     = cfg.get("num_key_value_heads", n_q)
+    h        = cfg.get("hidden_size", 0)
+    head_dim = cfg.get("head_dim") or (h // n_q if n_q else 0)
+    gqa      = n_q // n_kv if n_kv else 1
+    return [
+        f"  label                    : {label}  ({model_id})",
+        f"  model_type               : {cfg.get('model_type', 'unknown')}",
+        f"  architecture             : {cfg.get('architectures', ['unknown'])[0]}",
+        "",
+        "  Vocabulary",
+        f"    vocab_size             : {cfg.get('vocab_size', 'N/A'):,}",
+        f"    bos / eos / pad        : {cfg.get('bos_token_id')} / {cfg.get('eos_token_id')} / {cfg.get('pad_token_id')}",
+        "",
+        "  Positional encoding",
+        f"    max_position_embeddings: {cfg.get('max_position_embeddings', 'N/A'):,}",
+        f"    rope_theta             : {cfg.get('rope_theta', 'N/A')}",
+        f"    rope_scaling           : {cfg.get('rope_scaling', 'None')}",
+        "",
+        "  Transformer dimensions",
+        f"    hidden_size            : {h}",
+        f"    num_hidden_layers      : {cfg.get('num_hidden_layers', 'N/A')}",
+        f"    intermediate_size      : {cfg.get('intermediate_size', 'N/A')}",
+        "",
+        "  Attention",
+        f"    num_attention_heads    : {n_q}",
+        f"    num_key_value_heads    : {n_kv}",
+        f"    head_dim               : {head_dim}",
+        f"    GQA ratio              : {gqa}:1",
+        f"    attention_bias         : {cfg.get('attention_bias', False)}",
+        f"    use_qk_norm            : {cfg.get('use_qk_norm', False) or 'qwen3' in model_id.lower() or 'qwen3' in cfg.get('model_type', '').lower()}",
+        f"    sliding_window         : {cfg.get('sliding_window', 'None')}",
+        "",
+        "  Feed-forward",
+        f"    hidden_act             : {cfg.get('hidden_act', 'silu')}",
+        f"    mlp_bias               : {cfg.get('mlp_bias', False)}",
+        "",
+        "  Misc",
+        f"    rms_norm_eps           : {cfg.get('rms_norm_eps', 1e-6)}",
+        f"    tie_word_embeddings    : {cfg.get('tie_word_embeddings', True)}",
+        f"    use_cache              : {cfg.get('use_cache', True)}",
+        f"    torch_dtype            : {cfg.get('torch_dtype', 'float32')}",
+        f"    initializer_range      : {cfg.get('initializer_range', 'N/A')}",
+    ]
+def get_params_info(config, model_id: str = "") -> dict:
+    h     = config.hidden_size
+    l     = config.num_hidden_layers
+    v     = config.vocab_size
+    embed = v * h
+    tie   = getattr(config, "tie_word_embeddings", True)
+    n_q   = config.num_attention_heads
+    n_kv  = getattr(config, "num_key_value_heads", n_q)
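+    # Prefer an explicit head_dim (Qwen3 configs define one); fall back to hidden_size / heads.
+    head_dim  = getattr(config, "head_dim", None) or h // n_q
+    qkv_proj  = (n_q + 2 * n_kv) * head_dim * h
+    o_proj    = n_q * head_dim * h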
+    use_qk_norm = (
+        getattr(config, "use_qk_norm", False) or
+        "qwen3" in model_id.lower() or
+        "qwen3" in getattr(config, "model_type", "").lower()
+    )
+    qk_norm   = 2 * head_dim if use_qk_norm else 0
+    mlp       = 3 * h * config.intermediate_size
+    norms     = 2 * h
+    per_layer = qkv_proj + o_proj + qk_norm + mlp + norms
+    total_layers = l * per_layer
+    lm_head  = 0 if tie else embed
+    unique   = embed + lm_head + total_layers + h  # +h for final norm
+    return {
+        "raw":       unique + (embed if tie else 0),
+        "embed":     embed,
+        "lm_head":   embed,
+        "tied":      tie,
+        "unique":    unique,
+        "non_embed": total_layers + h,
+        "per_layer": per_layer,
+    }
+def param_lines(config, p: dict, label: str) -> list[str]:
+    return [
+        f"  {label}",
+        f"    raw (all named)        : {fmt_num(p['raw'])}",
+        f"    embedding              : {fmt_num(p['embed'])}",
+        f"    lm_head                : {fmt_num(p['lm_head'])}",
+        f"    tied                   : {p['tied']}",
+        f"    unique (deduped)       : {fmt_num(p['unique'])}",
+        f"    non-embedding          : {fmt_num(p['non_embed'])}",
+        f"    per layer (approx)     : {p['per_layer']:,}",
+    ]
+# Main
+def main():
+    parser = argparse.ArgumentParser(description="Quintus Deep Weight Audit")
+    parser.add_argument("--base_model",       type=str,   default="Qwen/Qwen3-1.7B-Base")
+    parser.add_argument("--distilled_model",  type=str,   default="iamrahulreddy/Quintus")
+    parser.add_argument("--output_file",      type=str,   default="weight_audit_report.txt")
+    parser.add_argument("--alpha",            type=float, default=0.3)
+    parser.add_argument("--isotropy_samples", type=int,   default=2048)
+    parser.add_argument("--trust_remote_code", action="store_true", help="Allow custom code from model repositories.")
+    args = parser.parse_args()
+    # Determine compute device
+    device = "cuda" if torch.cuda.is_available() else "cpu"
+    dtype  = torch.bfloat16 if torch.cuda.is_available() else torch.float32
+    utc_ts  = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S UTC")
+    loc_ts  = datetime.now().strftime("%Y-%m-%d %H:%M:%S local")
+    R: list[str] = []
+    def log(line: str = ""):
+        print(line)
+        R.append(line)
+    def loglines(lines: list[str]):
+        for ln in lines:
+            log(ln)
+    # Header
+    loglines([
+        divider("="),
+        "  QUINTUS WEIGHT AUDIT",
+        divider("="),
+        f"  {utc_ts}  ({loc_ts})",
+        f"  base model      : {args.base_model}",
+        f"  distilled model : {args.distilled_model}",
+        f"  alpha           : {args.alpha}",
+        f"  device          : {device}  |  dtype: {dtype}",
+        f"  python          : {sys.version.split()[0]}  |  torch: {torch.__version__}",
+        divider("="),
+    ])
+    # [01] Resolve checkpoints
+    log(section_header(1, "Resolve checkpoints"))
+    # Resolve base model commit hash (pin and report base commit)
+    base_commit = "local"
+    if not Path(args.base_model).exists():
+        try:
+            base_local_dir = Path(snapshot_download(repo_id=args.base_model))
+            base_commit = base_local_dir.name
+        except Exception:
+            base_commit = "unknown"
+    dist_commit = "local"
+    if not Path(args.distilled_model).exists():
+        log(f"  Downloading '{args.distilled_model}' from Hugging Face Hub...")
+        t0 = time.time()
+        try:
+            local_dir = snapshot_download(repo_id=args.distilled_model)
+            distilled_path = Path(local_dir)
+            dist_commit = distilled_path.name
+        except Exception as e:
+            log(f"  ERROR: {e}")
+            sys.exit(1)
+        log(f"  Done in {time.time() - t0:.1f}s")
+    else:
+        distilled_path = Path(args.distilled_model)
+        if "snapshots" in distilled_path.parts:
+            dist_commit = distilled_path.name
+    # Redact absolute local HF cache paths for sharing
+    redacted_root = "<HF_CACHE_DIR>/snapshots"
+    log(f"  base model commit  : {base_commit}")
+    log(f"  distilled commit   : {dist_commit}")
+    log(f"  snapshot root      : {redacted_root}")
+    if not (distilled_path / "config.json").exists():
+        log("  ERROR: config.json missing from checkpoint directory.")
+        sys.exit(1)
+    files = sorted(f for f in distilled_path.iterdir() if f.is_file())
+    total_ckpt_bytes = sum(f.stat().st_size for f in files)
+    log("")
+    log(f"  {'Filename':<52} {'Size':>12}  Modified")
+    for f in files:
+        mtime = datetime.fromtimestamp(f.stat().st_mtime).strftime("%Y-%m-%d %H:%M")
+        log(f"  {f.name:<52} {fmt_size(f.stat().st_size):>12}  {mtime}")
+    log(f"  {'total':<52} {fmt_size(total_ckpt_bytes):>12}")
+    # [02] Architecture configuration
+    log(section_header(2, "Architecture configuration"))
+    log("  Loading base config...")
+    try:
+        base_config = AutoConfig.from_pretrained(args.base_model, trust_remote_code=args.trust_remote_code)
+    except Exception as e:
+        log(f"  ERROR: {e}"); sys.exit(1)
+    log("  Loading distilled config...")
+    try:
+        distilled_config = AutoConfig.from_pretrained(str(distilled_path), trust_remote_code=args.trust_remote_code)
+    except Exception as e:
+        log(f"  ERROR: {e}"); sys.exit(1)
+    log(sub_header("Base"))
+    loglines(config_architecture_lines(base_config, "base", args.base_model))
+    log(sub_header("Distilled"))
+    loglines(config_architecture_lines(distilled_config, "distilled", args.distilled_model))
+    log(sub_header("Config diff  (ignoring: _name_or_path, transformers_version)"))
+    ignore_keys = {"_name_or_path", "transformers_version"}
+    base_dict   = base_config.to_dict()
+    dist_dict   = distilled_config.to_dict()
+    config_diffs = [
+        (k, base_dict.get(k), dist_dict.get(k))
+        for k in sorted(set(base_dict) | set(dist_dict))
+        if k not in ignore_keys and base_dict.get(k) != dist_dict.get(k)
+    ]
+    if not config_diffs:
+        log("  No differences — configs identical (expected for same-architecture KD).")
+    else:
+        log(f"  {'Key':<40} {'Base':>28}  Distilled")
+        for k, vb, vd in config_diffs:
+            log(f"  {k:<40} {str(vb):>28}  {vd}")
+    # [03] Parameter accounting
+    log(section_header(3, "Parameter accounting"))
+    base_params = get_params_info(base_config, args.base_model)
+    dist_params = get_params_info(distilled_config, args.distilled_model)
+    log(sub_header("Base"))
+    loglines(param_lines(base_config, base_params, "base"))
+    log(sub_header("Distilled"))
+    loglines(param_lines(distilled_config, dist_params, "distilled"))
+    log(sub_header("Delta"))
+    du = dist_params["unique"] - base_params["unique"]
+    log(f"  unique param delta     : {du:+,}  ({du / base_params['unique'] * 100:+.4f} %)")
+    log(f"  non-embed param delta  : {dist_params['non_embed'] - base_params['non_embed']:+,}")
+    # [04] Load weights onto the selected device (GPU if available, else CPU)
+    log(section_header(4, "Load weights"))
+    log(f"  device: {device}  |  dtype: {dtype}")
+    load_kwargs = dict(dtype=dtype, device_map=device, trust_remote_code=args.trust_remote_code)
+    log(f"  Loading base model   : {args.base_model}")
+    t0 = time.time()
+    base_model = AutoModelForCausalLM.from_pretrained(args.base_model, **load_kwargs)
+    log(f"  Done in {time.time() - t0:.1f}s")
+    log(f"  Loading distilled    : {args.distilled_model}")
+    t0 = time.time()
+    distilled_model = AutoModelForCausalLM.from_pretrained(str(distilled_path), **load_kwargs)
+    log(f"  Done in {time.time() - t0:.1f}s")
+    base_sd = base_model.state_dict()
+    dist_sd = distilled_model.state_dict()
+    log(f"  base tensors         : {len(base_sd)}")
+    log(f"  distilled tensors    : {len(dist_sd)}")
+    only_base = set(base_sd) - set(dist_sd)
+    only_dist = set(dist_sd) - set(base_sd)
+    if only_base:
+        log(f"  keys only in base    : {sorted(only_base)[:5]} ...")
+    if only_dist:
+        log(f"  keys only in distilled: {sorted(only_dist)[:5]} ...")
+    tied = torch.equal(
+        base_sd["model.embed_tokens.weight"],
+        base_sd.get("lm_head.weight", base_sd["model.embed_tokens.weight"]),
+    )
+    log(f"  weight tying confirmed (embed == lm_head): {tied}")
+    def sd_bytes(sd):
+        return sum(t.numel() * t.element_size() for t in sd.values())
+    log(f"  base weight memory   : {fmt_size(sd_bytes(base_sd))}")
+    log(f"  distilled memory     : {fmt_size(sd_bytes(dist_sd))}")
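+    # Statistics below copy one tensor at a time to CPU and upcast to float32;
+    # the state dicts themselves stay on `device` in `dtype`. The divergence pass
+    # in [09] instead upcasts chunks to float64 on the tensors' own device.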
+    all_names = list(dist_sd.keys())
+    # [05] Full per-tensor statistics (distilled)
+    log(section_header(5, "Per-tensor weight statistics  (distilled)"))
+    col = (
+        f"  {'Layer':<68} {'Shape':<22} {'Mean':>8} {'Std':>8} "
+        f"{'Min':>8} {'Max':>8} {'Sparse':>7} {'KurtD':>7} "
+        f"{'OutlR':>7} {'RowL2':>8} {'DeadR':>6}"
+    )
+    log(col)
+    log(f"  {divider('-', 170)}")
+    # Per-tensor stats (including kurtosis delta vs. base) and tensor names grouped by layer type
+    all_stats: dict[str, dict] = {}
+    type_buckets: dict[str, list[str]] = collections.defaultdict(list)
+    for name in all_names:
+        # Move to CPU float32 for stats only
+        t  = dist_sd[name].cpu()
+        st = tensor_stats(t)
+        # Calculate base model kurtosis if present
+        if name in base_sd:
+            t_base = base_sd[name].cpu()
+            st_base = tensor_stats(t_base)
+            kurt_base = st_base["kurtosis"]
+        else:
+            kurt_base = 0.0
+        st["kurtosis_base"] = kurt_base
+        st["kurtosis_delta"] = st["kurtosis"] - kurt_base
+        all_stats[name] = st
+        type_buckets[classify_layer(name)].append(name)
+        rl2  = st.get("row_l2_mean", float("nan"))
+        dead = st.get("dead_rows",    float("nan"))
+        log(
+            f"  {name:<68} {str(st['shape']):<22} "
+            f"{st['mean']:8.4f} {st['std']:8.4f} "
+            f"{st['min']:8.4f} {st['max']:8.4f} "
+            f"{st['sparsity']:7.4f} {st['kurtosis_delta']:7.2f} "
+            f"{st['outlier_ratio']:7.4f} "
+            f"{rl2:8.4f} "
+            f"{str(int(dead)) if not math.isnan(dead) else 'N/A':>6}"
+        )
+    # [06] Layer-type aggregation (distilled)
+    log(section_header(6, "Layer-type aggregated statistics  (distilled)"))
+    log(f"  {'Type':<18} {'Count':>5} {'Params':>16} {'AvgMean':>9} {'AvgStd':>9} {'AvgSparse':>10} {'AvgKurtD':>9}")
+    log(f"  {divider('-', 82)}")
+    for ltype in sorted(type_buckets):
+        names  = type_buckets[ltype]
+        n      = len(names)
+        params = sum(all_stats[x]["numel"] for x in names)
+        log(
+            f"  {ltype:<18} {n:>5} {params:>16,} "
+            f"{sum(all_stats[x]['mean'] for x in names)/n:>9.5f} "
+            f"{sum(all_stats[x]['std']  for x in names)/n:>9.5f} "
+            f"{sum(all_stats[x]['sparsity'] for x in names)/n:>10.5f} "
+            f"{sum(all_stats[x]['kurtosis_delta'] for x in names)/n:>9.3f}"
+        )
+    # [07] Per-transformer-block breakdown (distilled)
+    log(section_header(7, "Per-transformer-block breakdown  (distilled)"))
+    n_layers = distilled_config.num_hidden_layers
+    sublayer_order = [
+        "input_layernorm", "self_attn.q_proj", "self_attn.k_proj",
+        "self_attn.v_proj", "self_attn.o_proj", "self_attn.q_norm",
+        "self_attn.k_norm", "post_attention_layernorm",
+        "mlp.gate_proj", "mlp.up_proj", "mlp.down_proj",
+    ]
+    log(f"  {'Blk':>4}  {'Sublayer':<35} {'Shape':<22} {'L2':>9} {'AbsMn':>9} {'Std':>9} {'Sparse':>8} {'RowL2':>9}")
+    log(f"  {divider('-', 115)}")
+    for blk in range(n_layers):
+        prefix = f"model.layers.{blk}."
+        for sub in sublayer_order:
+            nm = prefix + sub + ".weight"
+            if nm not in dist_sd:
+                continue
+            st  = all_stats[nm]
+            rl2 = st.get("row_l2_mean", float("nan"))
+            log(
+                f"  {blk:>4}  {sub:<35} {str(st['shape']):<22} "
+                f"{st['l2_norm']:>9.3f} {st['abs_mean']:>9.5f} "
+                f"{st['std']:>9.5f} {st['sparsity']:>8.5f} {rl2:>9.5f}"
+            )
+        log("")
+    # [08] Isotropy analysis (distilled)
+    log(section_header(8, "Isotropy analysis  (distilled, 2D tensors only)"))
+    log(f"  Sampling up to {args.isotropy_samples} rows per layer.")
+    log(f"  Score near 0 = isotropic (healthy).  Score near 1 = representation collapse.")
+    log("")
+    log(f"  {'Layer':<68} {'Shape':<20} {'Score':>10}")
+    log(f"  {divider('-', 102)}")
+    iso_scores: dict[str, float] = {}
+    for name in all_names:
+        t   = dist_sd[name].cpu()
+        iso = isotropy_score(t, n_samples=args.isotropy_samples)
+        iso_scores[name] = iso
+        if not math.isnan(iso):
+            log(f"  {name:<68} {str(all_stats[name]['shape']):<20} {iso:>10.6f}")
+    valid_iso = [v for v in iso_scores.values() if not math.isnan(v)]
+    if valid_iso:
+        log("")
+        log(f"  Global (across {len(valid_iso)} 2D layers)")
+        log(f"    mean : {sum(valid_iso)/len(valid_iso):.6f}")
+        log(f"    min  : {min(valid_iso):.6f}")
+        log(f"    max  : {max(valid_iso):.6f}")
+    # [09] Base vs distilled divergence — all shared layers
+    log(section_header(9, "Base vs distilled divergence  (all shared layers)"))
+    shared  = sorted(set(base_sd) & set(dist_sd))
+    all_div: dict[str, dict] = {}
+    changed   = []
+    unchanged = []
+    log(f"  Shared tensors: {len(shared)}")
+    log("")
+    log(
+        f"  {'Layer':<68} {'MaxDelta':>9} {'MeanDelta':>10} "
+        f"{'L2Delta':>9} {'CosSim':>8} {'RelErr':>8} {'SNR_dB':>7} {'Chg':>4}"
+    )
+    log(f"  {divider('-', 135)}")
+    for name in shared:
+        b  = base_sd[name]
+        d  = dist_sd[name]
+        dv = tensor_divergence(b, d)
+        all_div[name] = dv
+        (changed if dv["changed"] else unchanged).append(name)
+        log(
+            f"  {name:<68} "
+            f"{dv['max_delta']:>9.5f} {dv['mean_delta']:>10.6f} "
+            f"{dv['l2_delta']:>9.4f} {dv['cos_sim']:>8.5f} "
+            f"{dv['rel_err']:>8.5f} {dv['snr_db']:>7.2f} "
+            f"{'Y' if dv['changed'] else 'N':>4}"
+        )
+    log("")
+    log(f"  Changed  : {len(changed)} / {len(shared)}")
+    log(f"  Unchanged: {len(unchanged)} / {len(shared)}")
+    if unchanged:
+        log(f"  Unchanged (first 10): {unchanged[:10]}")
+        log("\n  Note: Unchanged tensors are typically normalization weights (input_layernorm, q_norm, k_norm, model.norm);")
+        log("        check the list above. If so, the SFT/KD process changed the attention and MLP projection")
+        log("        weights while leaving the per-layer scaling parameters at their base values.")
+    # [10] Cosine similarity distribution histogram
+    log(section_header(10, "Cosine similarity distribution histogram"))
+    cos_vals = [all_div[n]["cos_sim_raw"] for n in shared]
+    bins     = [
+        (float('-inf'), 0.900),
+        (0.900, 0.990),
+        (0.990, 0.999),
+        (0.999, 0.9999),
+        (0.9999, 0.99999),
+        (0.99999, 1.00001),
+        (1.00001, 1.001),
+        (1.001, float('inf'))
+    ]
+    def fmt_bnd(v: float) -> str:
+        if v == float('-inf'):
+            return "-inf"
+        if v == float('inf'):
+            return "inf"
+        return f"{v:7.5f}"
+    counts = []
+    for lo, hi in bins:
+        cnt = sum(1 for v in cos_vals if lo <= v < hi)
+        counts.append(cnt)
+    max_cnt = max(counts) if counts else 0
+    max_bar_width = 40
+    log(f"  {'Range':<22} {'Count':>6}  Histogram")
+    for (lo, hi), cnt in zip(bins, counts):
+        bar_len = int(round((cnt / max_cnt) * max_bar_width)) if max_cnt > 0 and cnt > 0 else 0
+        label = f"[{fmt_bnd(lo):>8},  {fmt_bnd(hi):>8})"
+        log(f"  {label:<22} {cnt:>6}  {'#' * bar_len}")
+    # [11] Attention geometry per block
+    log(section_header(11, "Attention geometry per transformer block"))
+    n_q      = distilled_config.num_attention_heads
+    n_kv     = getattr(distilled_config, "num_key_value_heads", n_q)
+    head_dim = getattr(distilled_config, "head_dim", None) or distilled_config.hidden_size // n_q
+    log(f"  Query heads: {n_q}  |  KV heads: {n_kv}  |  head_dim: {head_dim}  |  GQA: {n_q//n_kv}:1")
+    log("")
+    log(
+        f"  {'Blk':>4}  {'Q shape':<20} {'K shape':<20} {'V shape':<20} {'O shape':<20} "
+        f"{'Q L2':>8} {'K L2':>8} {'V L2':>8} {'O L2':>8}"
+    )
+    log(f"  {divider('-', 130)}")
+    for blk in range(n_layers):
+        p = f"model.layers.{blk}.self_attn."
+        def attn(key):
+            nm = p + key + ".weight"
+            if nm in dist_sd:
+                st = all_stats[nm]
+                return str(st["shape"]), st["l2_norm"]
+            return "N/A", float("nan")
+        qs, ql = attn("q_proj")
+        ks, kl = attn("k_proj")
+        vs, vl = attn("v_proj")
+        os_, ol = attn("o_proj")
+        log(
+            f"  {blk:>4}  {qs:<20} {ks:<20} {vs:<20} {os_:<20} "
+            f"{ql:>8.3f} {kl:>8.3f} {vl:>8.3f} {ol:>8.3f}"
+        )
+    # [12] MLP geometry per block
+    log(section_header(12, "MLP feed-forward geometry per transformer block"))
+    log(f"  intermediate_size: {distilled_config.intermediate_size}  |  activation: {getattr(distilled_config, 'hidden_act', 'silu')}")
+    log("")
+    log(
+        f"  {'Blk':>4}  {'Gate shape':<22} {'Up shape':<22} {'Down shape':<22} "
+        f"{'Gate L2':>8} {'Up L2':>8} {'Down L2':>9} "
+        f"{'GateSp':>8} {'UpSp':>8} {'DnSp':>8}"
+    )
+    log(f"  {divider('-', 135)}")
+    for blk in range(n_layers):
+        p = f"model.layers.{blk}.mlp."
+        def mlp(key):
+            nm = p + key + ".weight"
+            if nm in dist_sd:
+                st = all_stats[nm]
+                return str(st["shape"]), st["l2_norm"], st["sparsity"]
+            return "N/A", float("nan"), float("nan")
+        gs, gl, gsp = mlp("gate_proj")
+        us, ul, usp = mlp("up_proj")
+        ds, dl, dsp = mlp("down_proj")
+        log(
+            f"  {blk:>4}  {gs:<22} {us:<22} {ds:<22} "
+            f"{gl:>8.3f} {ul:>8.3f} {dl:>9.3f} "
+            f"{gsp:>8.5f} {usp:>8.5f} {dsp:>8.5f}"
+        )
+    # [13] Health diagnostics
+    log(section_header(13, "Weight health diagnostics"))
+    high_sparsity   = [(n, all_stats[n]["sparsity"])      for n in all_names if all_stats[n]["sparsity"] > 0.10]
+    high_kurtosis   = [(n, all_stats[n]["kurtosis_delta"]) for n in all_names if abs(all_stats[n]["kurtosis_delta"]) > 5.0]
+    high_outlier    = [(n, all_stats[n]["outlier_ratio"]) for n in all_names if all_stats[n]["outlier_ratio"] > 0.01]
+    dead_rows       = [(n, int(all_stats[n].get("dead_rows", 0))) for n in all_names
+                       if not math.isnan(all_stats[n].get("dead_rows", float("nan")))
+                       and all_stats[n].get("dead_rows", 0) > 0]
+    low_cos         = [(n, all_div[n]["cos_sim"]) for n in shared if all_div[n]["cos_sim"] < 0.95]
+    low_snr         = [(n, all_div[n]["snr_db"])  for n in shared if all_div[n]["snr_db"]  < 20.0]
+    def diag_block(title: str, rows: list, fmt):
+        log(f"\n  {title}")
+        if not rows:
+            log("    none")
+        else:
+            for n, v in rows:
+                log(f"    {n:<70}  {fmt(v)}")
+    def get_percentiles(vals: list[float]) -> dict:
+        if not vals:
+            return {"mean": 0.0, "median": 0.0, "p10": 0.0, "p90": 0.0}
+        t = torch.tensor(vals, dtype=torch.float64)
+        return {
+            "mean":   t.mean().item(),
+            "median": t.median().item(),
+            "p10":    torch.quantile(t, 0.10).item(),
+            "p90":    torch.quantile(t, 0.90).item(),
+        }
+    diag_block("Sparsity > 10%",           high_sparsity, lambda v: f"sparsity={v:.5f}")
+    diag_block("|Kurtosis Delta| > 5.0",   high_kurtosis, lambda v: f"kurt_delta={v:+.3f}")
+    diag_block("Outlier ratio > 1%",       high_outlier,  lambda v: f"outlier_ratio={v:.5f}")
+    diag_block("Dead rows (L2 < 1e-6)",    dead_rows,     lambda v: f"dead_rows={v}")
+    diag_block("Low cosine sim vs base (<0.95)", low_cos, lambda v: f"cos_sim={v:.6f}")
+    diag_block("Low SNR vs base (< 20 dB)", low_snr,     lambda v: f"snr_db={v:.2f}")
+    log("\n  Note on kurtosis delta: Kurtosis values are reported as the difference (delta) compared to the base model.")
+    log("  A high kurtosis delta on tiny vectors (like norm/q-k-norm vectors of size 128) is statistically expected")
+    log("  due to small sample sizes and does not indicate a model health or representation collapse issue.")
+    # [14] Executive summary
+    log(section_header(14, "Executive summary"))
+    all_cos  = [all_div[n]["cos_sim"] for n in shared]
+    all_snr  = [all_div[n]["snr_db"]  for n in shared]
+    all_rel  = [all_div[n]["rel_err"] for n in shared]
+    cos_stats = get_percentiles(all_cos)
+    snr_stats = get_percentiles(all_snr)
+    rel_stats = get_percentiles(all_rel)
+    log(f"  shared tensors                   : {len(shared)}")
+    log(f"  tensors changed vs base          : {len(changed)} / {len(shared)}")
+    log(f"  cosine similarity                : mean = {cos_stats['mean']:.6f} | median = {cos_stats['median']:.6f} | p10 = {cos_stats['p10']:.6f} | p90 = {cos_stats['p90']:.6f}")
+    log(f"  relative error                   : mean = {rel_stats['mean']:.6f} | median = {rel_stats['median']:.6f} | p10 = {rel_stats['p10']:.6f} | p90 = {rel_stats['p90']:.6f}")
+    log(f"  SNR dB                           : mean = {snr_stats['mean']:.2f} | median = {snr_stats['median']:.2f} | p10 = {snr_stats['p10']:.2f} | p90 = {snr_stats['p90']:.2f}")
+    log(f"  high-sparsity layers (>10%)      : {len(high_sparsity)}")
+    log(f"  kurtosis-shift layers (|d|>5.0)  : {len(high_kurtosis)}")
+    log(f"  dead-row layers                  : {len(dead_rows)}")
+    log(f"  low-cos layers (<0.95)           : {len(low_cos)}")
+    log(f"  low-SNR layers (<20 dB)          : {len(low_snr)}")
+    log(f"  distillation alpha               : {args.alpha}")
+    log("")
+    log(f"  checkpoint size on disk          : {fmt_size(total_ckpt_bytes)}")
+    log(f"  base weights in memory           : {fmt_size(sd_bytes(base_sd))}")
+    log(f"  distilled weights in memory      : {fmt_size(sd_bytes(dist_sd))}")
+    log("")
+    log(divider("="))
+    log("  END OF REPORT")
+    log(divider("="))
+    # Write to file
+    out = Path(args.output_file)
+    out.parent.mkdir(parents=True, exist_ok=True)  # create the output directory if it is missing
+    out.write_text("\n".join(R) + "\n", encoding="utf-8")
+    print(f"\nReport written to: {out.resolve()}")
+if __name__ == "__main__":
+    main()
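
The health diagnostics above flag tensors with `cos_sim < 0.95` or `snr_db < 20.0`, but the code that computes those values is not part of this hunk. The sketch below shows one common way such metrics could be defined, in plain Python so it runs without torch. The helper names `divergence_metrics` and `flag_tensor` and the exact formulas are assumptions for illustration, not the script's own implementation; only the two thresholds come from the code above.

```python
import math


def divergence_metrics(base, tuned, eps=1e-12):
    """Return cos_sim, rel_err and snr_db for two equal-length weight vectors.

    Assumed definitions (not taken from the audited script):
      cos_sim = <b, t> / (||b|| * ||t||)
      rel_err = ||t - b|| / ||b||
      snr_db  = 10 * log10(||b||^2 / ||t - b||^2)
    """
    if len(base) != len(tuned):
        raise ValueError("vectors must have the same length")
    dot = sum(b * t for b, t in zip(base, tuned))
    base_sq = sum(b * b for b in base)
    tuned_sq = sum(t * t for t in tuned)
    noise_sq = sum((t - b) ** 2 for b, t in zip(base, tuned))

    # eps guards against division by zero for all-zero vectors.
    cos_sim = dot / (math.sqrt(base_sq) * math.sqrt(tuned_sq) + eps)
    rel_err = math.sqrt(noise_sq) / (math.sqrt(base_sq) + eps)
    if noise_sq == 0.0:
        snr_db = math.inf       # identical tensors: no "noise" at all
    elif base_sq == 0.0:
        snr_db = -math.inf      # zero signal, nonzero change
    else:
        snr_db = 10.0 * math.log10(base_sq / noise_sq)
    return {"cos_sim": cos_sim, "rel_err": rel_err, "snr_db": snr_db}


def flag_tensor(metrics, cos_floor=0.95, snr_floor_db=20.0):
    """Apply the report's two divergence thresholds to one tensor's metrics."""
    return {
        "low_cos": metrics["cos_sim"] < cos_floor,
        "low_snr": metrics["snr_db"] < snr_floor_db,
    }


if __name__ == "__main__":
    m = divergence_metrics([1.0, 0.0, -2.0], [1.0, 0.5, -2.0])
    print(m, flag_tensor(m))
```

Under these assumed definitions, `snr_db` equals `-20 * log10(rel_err)`, so the 20 dB floor corresponds to a relative error of about 0.1. If the script defines SNR or relative error differently, that correspondence does not hold.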

weight_audit/weight_audit_report.txt ADDED

The diff for this file is too large to render.