Quintus


Quintus-1.7B is a compact English-focused assistant built from Qwen/Qwen3-1.7B-Base. The project uses online full-vocabulary knowledge distillation from a Qwen/Qwen3-8B teacher, followed by a targeted SFT stage for assistant behavior, identity grounding, and generation stability.

Final model weights: iamrahulreddy/Quintus

Core Technical Points

  • Dense KD signal: the final training path streams the teacher's full vocabulary distribution live instead of relying on sparse cached top-k logits.
  • Base-student strategy: the student starts from Qwen/Qwen3-1.7B-Base, leaving more room for distillation before assistant-format tuning.
  • Assistant-only supervision: prompt text, chat headers, separators, and padding are masked out of the supervised target region.
  • Sequence packing: deterministic first-fit decreasing packing improves useful-token throughput at 4096-token context length.
  • Public benchmark controls: raw/chat prompt format, metric extraction, generation budget, and artifact hygiene are documented explicitly.
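The sequence-packing bullet above can be illustrated with a minimal first-fit decreasing sketch. It is not the repository's packing code: the function name, the tie-breaking rule (by sample index, to keep the result deterministic), and the toy lengths are assumptions made for illustration.

```python
def pack_first_fit_decreasing(lengths, pack_length=4096):
    """Group sample indices into bins holding at most pack_length tokens."""
    # Longest samples first; ties broken by index so the output is deterministic.
    order = sorted(range(len(lengths)), key=lambda i: (-lengths[i], i))
    bins = []       # each bin is a list of sample indices
    remaining = []  # free token slots left in each bin
    for i in order:
        n = lengths[i]
        if n > pack_length:
            raise ValueError(f"sample {i} has {n} tokens, more than {pack_length}")
        for b, free in enumerate(remaining):
            if n <= free:  # first bin with enough room wins
                bins[b].append(i)
                remaining[b] -= n
                break
        else:  # no existing bin fits, so open a new one
            bins.append([i])
            remaining.append(pack_length - n)
    return bins


# Hypothetical token counts for six samples.
lengths = [3000, 1000, 2500, 1500, 900, 96]
bins = pack_first_fit_decreasing(lengths)
used = sum(lengths)
capacity = len(bins) * 4096
print(bins)                        # [[0, 1, 5], [2, 3], [4]]
print(round(used / capacity, 3))   # 0.732
```

Without packing, the same six samples would occupy six padded rows of 4096 tokens, so the useful-token fraction would be 8996 / 24576, roughly 0.366.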

Training Summary

The release training path is a two-stage pipeline:

  1. Online KD: train the 1.7B base student against live teacher logits from a Qwen3-8B teacher.
  2. Targeted SFT: tune the distilled checkpoint for assistant-style interaction, persona consistency, and repetition control.

Reuse As A KD Framework

Quintus is released as a trained 1.7B assistant, but the repository is also a reusable reference pipeline for compact-model distillation. The same structure can be adapted to other teacher/student pairs with changes to the model IDs, tokenizer, dataset source, local paths, sequence length, batch schedule, and hardware-specific memory settings in configs/config.yaml.

The reusable pieces are split across the codebase: assistant-only masking, sequence packing, online full-vocabulary KD loss, checkpoint/resume metadata, validation, provenance checks, SFT, and evaluation. The final pattern is:

  1. Distill a smaller base student from a stronger teacher with online KD.
  2. Apply targeted SFT to recover assistant behavior, formatting, identity, and generation stability.
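Of the reusable pieces listed above, assistant-only masking is the easiest to show in isolation. The sketch below assumes the common convention of marking ignored positions with -100 (the default ignore_index of torch.nn.CrossEntropyLoss); the helper name, the segment format, and the token ids are hypothetical and not taken from the repository.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_labels(segments, pad_to=None, pad_id=0):
    """Build (input_ids, labels) where only assistant tokens are supervised.

    segments: list of (is_assistant, token_ids) pairs in conversation order.
    """
    input_ids, labels = [], []
    for is_assistant, token_ids in segments:
        input_ids.extend(token_ids)
        if is_assistant:
            labels.extend(token_ids)  # supervised target region
        else:
            labels.extend([IGNORE_INDEX] * len(token_ids))  # prompt, headers, separators
    if pad_to is not None:
        pad = max(0, pad_to - len(input_ids))
        input_ids.extend([pad_id] * pad)
        labels.extend([IGNORE_INDEX] * pad)  # padding never contributes to the loss
    return input_ids, labels


# Made-up token ids: a prompt segment, an assistant reply, and a trailing separator.
segments = [(False, [1, 2, 3]), (True, [10, 11, 12]), (False, [4])]
input_ids, labels = build_labels(segments, pad_to=9)
print(input_ids)  # [1, 2, 3, 10, 11, 12, 4, 0, 0]
print(labels)     # [-100, -100, -100, 10, 11, 12, -100, -100, -100]
```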

Quintus Architecture

Core KD objective:

$$\mathcal{L}_{\text{total}} = \alpha \mathcal{L}_{\text{CE}} + (1 - \alpha)\mathcal{L}_{\text{KD}}$$

For the final run,

$$\alpha = 0.3,\quad T = 2.0$$
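To make the objective concrete, here is a single-position sketch in plain Python. The weighting and the values of alpha and T follow the text above; the exact form of the KD term is an assumption, since this document does not spell it out. The sketch uses forward KL from the temperature-softened teacher to the temperature-softened student and multiplies by T squared, which is a common convention rather than a confirmed detail of the release run.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    lse = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - lse for x in scaled]


def kd_total_loss(student_logits, teacher_logits, target_id, alpha=0.3, temperature=2.0):
    """Return (total, ce, kd) for one token position."""
    # Cross-entropy against the ground-truth token at temperature 1.
    ce = -log_softmax(student_logits)[target_id]
    # KL(teacher || student) over the full vocabulary at temperature T.
    log_p_t = log_softmax(teacher_logits, temperature)
    log_p_s = log_softmax(student_logits, temperature)
    kl = sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_p_t, log_p_s))
    kd = temperature ** 2 * kl  # assumption: T^2 scaling keeps gradient scale comparable
    return alpha * ce + (1 - alpha) * kd, ce, kd


# Toy three-token vocabulary; the logits are made up.
teacher = [2.0, 0.5, -1.0]
same_total, same_ce, same_kd = kd_total_loss(teacher, teacher, target_id=0)
flat_total, flat_ce, flat_kd = kd_total_loss([0.0, 0.0, 0.0], teacher, target_id=0)
print(same_kd)      # 0.0 when the student matches the teacher exactly
print(flat_kd > 0)  # True when the distributions differ
```

With alpha = 0.3, 70% of the weight sits on the KD term, so the teacher distribution rather than the hard label drives most of the gradient.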

Configuration snapshot:

| Setting | Value |
|---|---|
| Teacher | Qwen/Qwen3-8B |
| Student | Qwen/Qwen3-1.7B-Base |
| Tokenizer | Qwen/Qwen3-1.7B |
| Data | ~90K English-only samples from DistilQwen_100k |
| Max sequence length | 4096 |
| Epochs | 1 |
| Learning rate | 5.0e-6 |
| Weight decay | 0.1 |
| Warmup ratio | 0.05 |
| Online KD token chunk | 2048 |
| Micro batch | 4 |
| Gradient accumulation | 2 |
| Sequence packing | enabled, pack_length = 4096 |
| Attention | FlashAttention-2 when available |
| Liger kernels | enabled for compatible Qwen-family ops |
| Optimizer | fused AdamW |
| torch.compile | disabled |
| Gradient checkpointing | disabled |
| Seed | 25 |

FlashAttention-2, Liger kernels, and fused AdamW are acceleration paths. Keep the baseline load path compatible with standard Transformers and vLLM APIs before publishing checkpoints. torch.compile stayed disabled because this KD shape showed high Inductor memory overhead, dynamic-shape graph breaks, and recompile overhead. It also carries a checkpoint portability risk: if compiled modules are not unwrapped before saving, the state dict keys pick up an "_orig_mod." prefix that a plain model load will not recognize.
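For readers who do enable torch.compile in their own runs, the prefix issue mentioned above can be handled at save time. The helper below is a hypothetical sketch, not part of this repository; it only strips a leading "_orig_mod." from each key and works on any mapping of names to values.

```python
def strip_compile_prefix(state_dict, prefix="_orig_mod."):
    """Return a copy of state_dict with a leading compile-wrapper prefix removed."""
    cleaned = {}
    for key, value in state_dict.items():
        # Only a leading prefix is handled; keys without it pass through unchanged.
        new_key = key[len(prefix):] if key.startswith(prefix) else key
        cleaned[new_key] = value
    return cleaned


# Stand-in values instead of tensors, to keep the example dependency-free.
compiled_keys = {
    "_orig_mod.model.embed_tokens.weight": "tensor-a",
    "_orig_mod.lm_head.weight": "tensor-b",
    "already.clean.bias": "tensor-c",
}
portable = strip_compile_prefix(compiled_keys)
print(sorted(portable))
# ['already.clean.bias', 'lm_head.weight', 'model.embed_tokens.weight']
```

Saving the original, unwrapped module avoids the problem entirely; the helper is a fallback for checkpoints that were already written with prefixed keys.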

The B200-oriented defaults are conservative for the 8B-teacher, 1.7B-student workload. Smaller teacher/student pairs may tolerate larger micro-batches, but the logit memory of full-vocabulary KD grows in proportion to tokens per step times vocabulary size, so a wide vocabulary consumes the headroom quickly.

The editable run configuration lives in configs/config.yaml. Paths and Hub destinations are left as placeholders so each runner can set local directories and repository names directly.

Why Online KD Replaced Offline Top-K KD

Earlier experiments cached only the teacher's top-k logits. That made storage smaller, but with a Qwen vocabulary around 151K tokens, $k = 8$ exposes only:

$$\frac{k}{|V|} = \frac{8}{151{,}665} \approx 5.3 \times 10^{-5} = 0.0053\%$$

of the vocabulary support at each position. The sparse signal could perturb the student, but it did not consistently transfer deeper reasoning behavior.

The final online path keeps the teacher and student in memory together and computes KL divergence against the teacher's full-vocabulary distribution. Token chunking keeps that dense objective feasible without materializing a single large KL workspace.
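A rough size estimate shows why chunking matters. The arithmetic below assumes fp32 logits, a fully packed micro-batch of 4 sequences at 4096 tokens, and the 151,665-entry vocabulary used earlier; it counts a single logits tensor only, so real peak memory (teacher logits, student logits, softmax intermediates) would be higher.

```python
VOCAB_SIZE = 151_665   # vocabulary width used in the top-k comparison above
BYTES_PER_FP32 = 4     # assumption: logits held in fp32 for the KL computation
GIB = 1024 ** 3


def logits_bytes(num_tokens, vocab_size=VOCAB_SIZE, bytes_per_element=BYTES_PER_FP32):
    """Bytes needed for one [num_tokens, vocab_size] logits tensor."""
    return num_tokens * vocab_size * bytes_per_element


full_batch = logits_bytes(4 * 4096)  # micro batch 4 x max sequence length 4096
one_chunk = logits_bytes(2048)       # online KD token chunk from the config table

print(round(full_batch / GIB, 2))  # 9.26
print(round(one_chunk / GIB, 2))   # 1.16
print(full_batch // one_chunk)     # 8
```

Under these assumptions each chunk needs about one eighth of the workspace that the whole micro-batch would need at once.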

Benchmark Scoreboard

The final public scoreboard compares Qwen/Qwen3-1.7B-Base, Qwen/Qwen3-1.7B-Instruct, and Quintus-1.7B.

[Figure: Model Evaluation Scoreboard]

The strongest signal is the reasoning crossover: Quintus beats both the base and official 1.7B instruct model on GSM8K, ARC-Challenge, and WinoGrande while remaining at the same parameter scale.

See docs/benchmarks.md for the numeric table and interpretation. See docs/evaluation_methodology.md for benchmark controls.

Evaluation Notes

Evaluation combines EvalPlus with lm-evaluation-harness benchmarks run on a vLLM backend. The repository documents evaluation methodology separately because prompt format alone can change the result:

  • Raw completion comparisons are used for base capability.
  • Chat-template comparisons are used for assistant-format behavior.
  • Log-likelihood tasks such as ARC-Challenge and PIQA should usually stay raw.
  • GSM8K can differ between strict #### parsing and flexible number extraction.
  • Metric extraction must read the intended metric and filter key, skipping standard-error (stderr) entries and alias fields.
  • Runtime versions, checkpoint identity, generation budget, and stale output cleanup are part of the evaluation contract.
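The GSM8K note above is easy to see with two small extractors. These regular expressions are simplified illustrations written for this example, not the exact filters used by lm-evaluation-harness or by sft/evaluate.py.

```python
import re

STRICT_PATTERN = re.compile(r"#### (-?[0-9.,]+)")        # requires the "#### <number>" marker
FLEXIBLE_PATTERN = re.compile(r"-?\d[\d,]*(?:\.\d+)?")   # any number-like span


def normalize(number_text):
    """Drop thousands separators and any trailing period or comma."""
    return number_text.replace(",", "").rstrip(".")


def strict_answer(text):
    """Return the number after '#### ', or None if the marker is missing."""
    match = STRICT_PATTERN.search(text)
    return normalize(match.group(1)) if match else None


def flexible_answer(text):
    """Return the last number-like span in the text, or None."""
    matches = FLEXIBLE_PATTERN.findall(text)
    return normalize(matches[-1]) if matches else None


chatty = "She has 3 boxes of 12, so 3 * 12 = 36. The answer is 36."
formatted = "3 * 12 = 36\n#### 36"

print(strict_answer(chatty), flexible_answer(chatty))        # None 36
print(strict_answer(formatted), flexible_answer(formatted))  # 36 36
```

A model that reasons correctly but omits the "####" marker scores zero under strict parsing and full credit under flexible extraction, which is why the two numbers should be reported separately.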

The active benchmark runner is sft/evaluate.py. It covers EvalPlus code tasks and lm-evaluation-harness/vLLM tasks, including GSM8K 10-shot evaluation with an extended generation budget.
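The metric-extraction rule from the notes above can be sketched as an exact key lookup. The result layout here is an assumption modeled on recent lm-evaluation-harness output, where each task maps "metric,filter" keys to values alongside "alias" and "_stderr" entries; the numbers are placeholders, not Quintus results, and the function is hypothetical rather than taken from sft/evaluate.py.

```python
def extract_metric(task_results, metric, filter_key):
    """Return the value stored under the exact 'metric,filter' key."""
    key = f"{metric},{filter_key}"
    if key not in task_results:
        # Failing loudly is safer than falling back to a different filter.
        raise KeyError(f"missing {key!r}; available keys: {sorted(task_results)}")
    value = task_results[key]
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise TypeError(f"{key!r} is not numeric: {value!r}")
    return float(value)


# Placeholder values for illustration only.
gsm8k_results = {
    "alias": "gsm8k",
    "exact_match,strict-match": 0.25,
    "exact_match_stderr,strict-match": 0.01,
    "exact_match,flexible-extract": 0.5,
    "exact_match_stderr,flexible-extract": 0.02,
}

strict_score = extract_metric(gsm8k_results, "exact_match", "strict-match")
flexible_score = extract_metric(gsm8k_results, "exact_match", "flexible-extract")
print(strict_score, flexible_score)  # 0.25 0.5
```

Because the lookup is exact, a substring search that might match "exact_match_stderr" or the alias field by accident cannot happen.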

Repository Map

configs/        Public run profile and DeepSpeed ZeRO-2 template.
src/            Data prep, online KD, losses, packing, checkpoints, provenance.
sft/            Post-KD SFT, local chat, and consolidated evaluation runner.
docs/           Public architecture, training, evaluation, and release notes.
weight_audit/   Checkpoint structure and weight-divergence audit material.

Key files:

Commands

Install the base dependencies:

pip install -r requirements.txt

For training and benchmark runs, install the matching extras:

pip install -r requirements-train.txt
pip install -r requirements-eval.txt

Inspect or prepare data/model assets:

python -m src.download --help

Run the final KD path after editing configs/config.yaml for local paths and hardware:

python -m src.train --phase online_kd

Hub checkpoint uploads are off by default for local runs. Pass --upload_last_checkpoint or the step/epoch upload flags only after setting the target repository and HF_TOKEN.

Run the consolidated benchmark suite:

python sft/evaluate.py

Start local chat with a downloaded or local checkpoint:

python sft/chat.py --model_path path/to/quintus/checkpoint

Interactive Chat

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

PUBLIC_REPO_ID = "iamrahulreddy/Quintus"

print(f"Loading Quintus from {PUBLIC_REPO_ID}...")
tokenizer = AutoTokenizer.from_pretrained(PUBLIC_REPO_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    PUBLIC_REPO_ID,
    device_map="auto",
    dtype=torch.float16,
    trust_remote_code=True,
)

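# Stop on either end-of-text or end-of-turn; collect the ids without duplicates.
stop_tokens = ["<|endoftext|>", "<|im_end|>"]
eos_token_ids = [tokenizer.eos_token_id] if tokenizer.eos_token_id is not None else []
for token in stop_tokens:
    token_id = tokenizer.convert_tokens_to_ids(token)
    # Skip tokens the tokenizer does not know and ids already collected.
    if token_id is not None and token_id not in eos_token_ids:
        eos_token_ids.append(token_id)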

streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

conversation_history = [
    {
        "role": "system",
        "content": (
            "You are Quintus, a highly capable AI assistant created by "
            "Muskula Rahul. You are helpful, precise, and logically sound."
        ),
    }
]

print()
print("Quintus Chat (type 'quit' to exit)")
print()

while True:
    try:
        user_input = input("You: ").strip()
        if user_input.lower() in ["quit", "exit"]:
            print("\nGoodbye!")
            break
        if not user_input:
            continue

        conversation_history.append({"role": "user", "content": user_input})

        prompt = tokenizer.apply_chat_template(
            conversation_history,
            tokenize=False,
            add_generation_prompt=True,
        )

        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

        print("Quintus: ", end="", flush=True)

        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=512,
                temperature=0.7,
                top_p=0.9,
                do_sample=True,
                streamer=streamer,
                pad_token_id=tokenizer.eos_token_id,
                eos_token_id=eos_token_ids,
            )

        generated_ids = outputs[0][inputs.input_ids.shape[-1]:]
        assistant_response = tokenizer.decode(
            generated_ids,
            skip_special_tokens=True,
        ).strip()
        conversation_history.append({"role": "assistant", "content": assistant_response})
        print()

    except KeyboardInterrupt:
        print("\n\nGoodbye!")
        break

Documentation

Limitations

  • Quintus is still a 1.7B model and inherits compact-model capacity limits.
  • Factual answers can be confidently wrong and should be verified.
  • Code generation may still contradict stated complexity or edge-case requirements.
  • Raw and chat-template results are not interchangeable.
  • Additional preference tuning or DPO would likely improve calibration, refusal behavior, and open-ended assistant polish.

Credits

Quintus builds on open model, dataset, and tooling work from the broader LLM community:

License And Author

This software is distributed under the MIT License. Refer to the LICENSE file for full text.

Author: Muskula Rahul - @iamrahulreddy

Citation

If this model, codebase, or training pipeline is useful in your work, please cite this repository and acknowledge the upstream Qwen3 models.
