Qwen3.5-4B-SFT-Claude-Opus-Reasoning-Unsloth

This model is a fine-tuned version of Qwen3.5-4B for chain-of-thought reasoning, trained by distillation with Unsloth (advertised as 2x faster training with 60% less VRAM).

Trained on the claude-reasoning-distillation dataset, which contains 10,477 samples of Claude's reasoning traces with <think> blocks for chain-of-thought learning.

Overview

Property Value
Developed by ermiaazarkhalili
License APACHE-2.0
Language English
Base Model Qwen3.5-4B
Model Size 4B parameters
Training Framework Unsloth + TRL
Training Method SFT with QLoRA (4-bit)
Context Length 2,048 tokens
GGUF Available Qwen3.5-4B-SFT-Claude-Opus-Reasoning-Unsloth-GGUF

Training Configuration

SFT + LoRA Settings

Parameter Value
Unsloth Class FastLanguageModel
Chat Template built-in Qwen3.5
Learning Rate 2e-4
Batch Size 2 per device
Gradient Accumulation 4 steps
Effective Batch Size 8
Training Length 1 epoch (full dataset)
Optimizer AdamW 8-bit
LR Scheduler Linear
Warmup Steps 5
Precision Auto (BF16/FP16)
Gradient Checkpointing Enabled (Unsloth optimized)
Seed 3407
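
The batch settings above imply a step count for the single epoch. The sketch below reproduces that arithmetic. It assumes a single device (consistent with the effective batch size of 8) and that the final partial batch still counts as one optimizer step. The function name is illustrative and not part of the training code.

```python
import math

def steps_per_epoch(num_samples, per_device_batch, grad_accum, num_devices=1):
    """Return (effective batch size, optimizer steps for one epoch)."""
    effective = per_device_batch * grad_accum * num_devices
    # Assumption: a trailing partial batch is rounded up to a full step.
    return effective, math.ceil(num_samples / effective)

# Values taken from the tables in this card: 10,477 samples, batch 2, accumulation 4.
effective_batch, optimizer_steps = steps_per_epoch(10_477, 2, 4)
print(effective_batch, optimizer_steps)
```

The exact count reported by a trainer may differ by one, depending on how it handles the last incomplete accumulation window.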

LoRA Configuration

Parameter Value
LoRA Rank (r) 64
LoRA Alpha 64
LoRA Dropout 0
Quantization 4-bit QLoRA
Target Modules q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
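
To make the rank and alpha values concrete, here is a minimal sketch of two quantities they determine. It assumes the standard LoRA scaling of alpha / r (not rank-stabilized LoRA). The 1024 x 1024 layer shape is a hypothetical placeholder, not a dimension of Qwen3.5-4B.

```python
def lora_scaling(alpha, rank):
    """Standard LoRA scaling factor applied to the low-rank update."""
    return alpha / rank

def lora_params_per_linear(d_in, d_out, rank):
    """Trainable parameters added to one linear layer: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)

# r = 64 and alpha = 64 come from the table above.
scale = lora_scaling(alpha=64, rank=64)

# Hypothetical layer shape, used only to illustrate the formula.
example_params = lora_params_per_linear(d_in=1024, d_out=1024, rank=64)
print(scale, example_params)
```

With alpha equal to the rank, the adapter update is applied without extra rescaling under this convention.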

Dataset

Property Value
Dataset Claude Reasoning Distillation
Training Samples 10,477
Format Messages with thinking field for chain-of-thought
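
As an illustration of the format described above, the sketch below folds a separate thinking field into the assistant turn as a `<think>` block. The field names (`messages`, `thinking`, `content`) and the exact whitespace are assumptions for illustration. The actual dataset schema and preprocessing may differ.

```python
def render_assistant_turn(thinking, answer):
    """Wrap the reasoning trace in <think> tags ahead of the final answer."""
    return f"<think>\n{thinking.strip()}\n</think>\n\n{answer.strip()}"

def to_training_messages(sample):
    """Convert a sample with a 'thinking' field into plain role/content messages."""
    converted = []
    for message in sample["messages"]:
        if message["role"] == "assistant" and message.get("thinking"):
            content = render_assistant_turn(message["thinking"], message["content"])
        else:
            content = message["content"]
        converted.append({"role": message["role"], "content": content})
    return converted

# Hypothetical sample, not drawn from the dataset.
sample = {
    "messages": [
        {"role": "user", "content": "What is 2 + 3?"},
        {"role": "assistant", "thinking": "Add the two numbers: 2 + 3 = 5.", "content": "5"},
    ]
}
rendered = to_training_messages(sample)
print(rendered[1]["content"])
```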

Hardware

Property Value
GPU NVIDIA H100 80GB HBM3 (MIG 3g.40gb slice)
Cluster DRAC Fir (Compute Canada)
Execution Papermill on SLURM

Training Outcome

Metric Value
SLURM Job ID 37204026
Runtime 1h 30m 35s (5435s)
Final Training Loss 0.918
Peak VRAM 26.64 GB
GPU H100 80GB HBM3 (MIG 3g.40gb)
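
The runtime and sample count give a rough throughput figure. The sketch below checks the seconds conversion and divides the two. It assumes the reported runtime covers the whole job, so any model loading or saving time is included and the true training throughput may be somewhat higher.

```python
def parse_runtime(text):
    """Convert a string such as '1h 30m 35s' into seconds."""
    unit_seconds = {"h": 3600, "m": 60, "s": 1}
    return sum(int(part[:-1]) * unit_seconds[part[-1]] for part in text.split())

runtime_s = parse_runtime("1h 30m 35s")

# 10,477 training samples over one epoch, as listed in the Dataset table.
samples_per_second = 10_477 / runtime_s
print(runtime_s, round(samples_per_second, 2))
```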

Usage

Quick Start (Transformers)

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "ermiaazarkhalili/Qwen3.5-4B-SFT-Claude-Opus-Reasoning-Unsloth"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Solve step by step: What is the sum of the first 10 prime numbers?"}
]

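# Render the conversation with the model's chat template and open the assistant turn.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Sample a completion; raise max_new_tokens if the reasoning trace gets cut off.
outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.7, do_sample=True)
# Decode only the newly generated tokens, skipping the prompt.
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)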
print(response)
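
Because the model is trained to emit its reasoning in a `<think>` block, it can be useful to separate the reasoning from the final answer. The helper below is a minimal sketch, assuming the decoded text contains a closing `</think>` tag. The opening tag may be absent if the chat template already inserted it in the prompt. `split_reasoning` is a hypothetical helper, not part of Transformers.

```python
def split_reasoning(response):
    """Return (reasoning, answer) from a decoded response string."""
    head, closing_tag, tail = response.partition("</think>")
    if not closing_tag:
        # No reasoning block found: treat the whole response as the answer.
        return "", response.strip()
    reasoning = head.replace("<think>", "", 1).strip()
    return reasoning, tail.strip()

reasoning, answer = split_reasoning("<think>\n2 + 3 = 5\n</think>\n\nThe answer is 5.")
print(answer)
```

If generation stops before `</think>` is produced, for example because `max_new_tokens` is too small, the helper returns the truncated text as the answer.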

Using with Unsloth (Fastest)

from unsloth import FastLanguageModel

model, processor = FastLanguageModel.from_pretrained(
    "ermiaazarkhalili/Qwen3.5-4B-SFT-Claude-Opus-Reasoning-Unsloth",
    max_seq_length=2048,
    load_in_4bit=True,
)
tokenizer = processor.tokenizer  # Extract text tokenizer from processor

4-bit Quantized Inference

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)

model = AutoModelForCausalLM.from_pretrained(
    "ermiaazarkhalili/Qwen3.5-4B-SFT-Claude-Opus-Reasoning-Unsloth",
    quantization_config=quantization_config,
    device_map="auto",
)

GGUF Versions

Quantized GGUF versions for CPU and edge inference are available at: Qwen3.5-4B-SFT-Claude-Opus-Reasoning-Unsloth-GGUF

Format Description
Q4_K_M Recommended — good balance of quality and size
Q5_K_M Higher quality, slightly larger
Q8_0 Near-lossless, largest GGUF size

Using with Ollama

ollama pull hf.co/ermiaazarkhalili/Qwen3.5-4B-SFT-Claude-Opus-Reasoning-Unsloth-GGUF:Q4_K_M
ollama run hf.co/ermiaazarkhalili/Qwen3.5-4B-SFT-Claude-Opus-Reasoning-Unsloth-GGUF:Q4_K_M "Solve step by step: What is the sum of the first 10 prime numbers?"

Using with llama.cpp

./llama-cli -m Qwen3.5-4B-SFT-Claude-Opus-Reasoning-Unsloth-Q4_K_M.gguf -p "Solve step by step: What is the sum of the first 10 prime numbers?" -n 512

Limitations

  • Language: Primarily trained on English data
  • Knowledge Cutoff: Limited to base model's training data cutoff
  • Hallucinations: May generate plausible-sounding but incorrect information
  • Context Length: Fine-tuned with 2,048 token context window
  • Safety: Not extensively safety-tuned; use with appropriate guardrails
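
One practical way to respect the context-length limitation is to check the token budget before generating. The sketch below assumes the prompt token count is already known (for example from `inputs["input_ids"].shape[1]`). The helper names are hypothetical. The base model may accept longer inputs, so this reflects only the length seen during fine-tuning.

```python
TRAINING_CONTEXT = 2048  # tokens, from the Overview table

def fits_training_context(prompt_tokens, max_new_tokens, context=TRAINING_CONTEXT):
    """True if prompt plus generation stays within the fine-tuning context length."""
    return prompt_tokens + max_new_tokens <= context

def max_new_tokens_budget(prompt_tokens, context=TRAINING_CONTEXT):
    """Largest generation length that keeps the sequence within the context."""
    return max(0, context - prompt_tokens)

print(fits_training_context(1000, 512), max_new_tokens_budget(1800))
```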

Training Framework Versions

Package Version
Unsloth 2026.4.4
TRL 0.24.0
Transformers 5.5.0
PyTorch 2.9.0
Datasets 4.3.0
PEFT 0.18.1
BitsAndBytes 0.49.2

Citation

@misc{ermiaazarkhalili_qwen35_4b_sft_claude_opus_reasoning_unsloth,
    author = {ermiaazarkhalili},
    title = {Qwen3.5-4B-SFT-Claude-Opus-Reasoning-Unsloth: Fine-tuned Qwen3.5-4B with Unsloth},
    year = {2026},
    publisher = {Hugging Face},
    howpublished = {\url{https://huggingface.co/ermiaazarkhalili/Qwen3.5-4B-SFT-Claude-Opus-Reasoning-Unsloth}}
}

Acknowledgments
