Lab 22 — DPO-Aligned Qwen2.5-3B (VN)

A Vietnamese-aligned model produced in the VinUniversity AICB Day-22 lab (Track 3: DPO/ORPO Alignment).

Pipeline: SFT (1k Vietnamese Alpaca examples) → DPO (2k UltraFeedback preference pairs, β=0.1, lr=5e-7), both on top of unsloth/Qwen2.5-3B-bnb-4bit.

Files

  • adapter_config.json + adapter_model.safetensors: the DPO LoRA adapter (rank=16, α=32). Apply it on top of the base model with the SFT-mini adapter already applied.
  • lab22-dpo-Q4_K_M.gguf: merged and quantized GGUF for deployment with llama.cpp / llama-cpp-python (~1.9 GB).
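
As a rough sanity check on the ~1.9 GB figure, the GGUF size can be estimated from the parameter count and the average bits per weight. This is only a back-of-the-envelope sketch: both the parameter count and the bits-per-weight value below are assumptions, not numbers taken from this card.

```python
# Back-of-the-envelope GGUF size estimate (all inputs are assumptions).
params = 3.09e9          # assumption: approximate parameter count of Qwen2.5-3B
bits_per_weight = 4.85   # assumption: rough average for Q4_K_M mixed quantization

size_gb = params * bits_per_weight / 8 / 1e9  # bits -> bytes -> GB (10^9 bytes)
print(f"estimated size: {size_gb:.2f} GB")
```

Under these assumptions the estimate lands close to the listed file size; the real file also contains metadata and some tensors kept at higher precision.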

Quick start

Inference via llama-cpp-python

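from llama_cpp import Llama

# Load the merged, quantized GGUF; n_ctx=512 matches the training max sequence length.
llm = Llama(model_path="lab22-dpo-Q4_K_M.gguf", n_ctx=512)
response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Giải thích quicksort."}]  # "Explain quicksort."
)
print(response["choices"][0]["message"]["content"])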

Inference via transformers + PEFT

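from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B")
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-3B", torch_dtype="bfloat16", device_map="cuda")
# Per the Files section, this adapter is meant to sit on base + SFT-mini adapter; apply that adapter first if you have it.
model = PeftModel.from_pretrained(base, "hiepphambk/lab22-dpo-vn")  # this DPO adapter
inputs = tokenizer("Giải thích quicksort.", return_tensors="pt").to("cuda")  # "Explain quicksort."
output_ids = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))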

Training details

| Hyperparameter | Value |
|---|---|
| Base | unsloth/Qwen2.5-3B-bnb-4bit (4-bit NF4) |
| LoRA rank · α | 16 · 32 |
| LoRA target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| DPO β | 0.1 |
| Learning rate | 5e-7 |
| Optimizer | adamw_8bit |
| Effective batch | 1 × 8 grad-accum = 8 |
| Epochs | 1 |
| Max seq length | 512 |
| Compute | RTX 5070 12 GB (Blackwell sm_120, CUDA 12.8) |
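
The hyperparameters above imply a small number of optimizer updates. The following sketch works through that arithmetic; it assumes a single GPU, that "2k" means exactly 2,000 preference pairs, and the standard LoRA scaling of α / rank (none of which this card states explicitly).

```python
# Values from the hyperparameter table, plus one labeled assumption.
per_device_batch = 1
grad_accum_steps = 8
epochs = 1
lora_rank, lora_alpha = 16, 32
num_pairs = 2000  # assumption: "2k UltraFeedback" taken as exactly 2,000 pairs

effective_batch = per_device_batch * grad_accum_steps      # pairs per optimizer step
optimizer_steps = (num_pairs // effective_batch) * epochs  # updates in the DPO stage
lora_scale = lora_alpha / lora_rank                        # standard LoRA scaling factor

print(effective_batch, optimizer_steps, lora_scale)  # → 8 250 2.0
```

With roughly 250 updates at a learning rate of 5e-7, the policy moves only slightly from the SFT checkpoint, which is consistent with the small reward gap reported in the evaluation.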

Evaluation results

| Metric | SFT-only | SFT+DPO |
|---|---|---|
| Reward gap (chosen − rejected, end of training) | n/a | +0.114 |
| Final DPO loss | n/a | 0.798 |
| AlpacaEval-lite (50 prompts, gpt-4o-mini judge) | 0.50 | 0.47 |
| Manual eval (8 VN prompts, gpt-4o-mini judge) | 2/8 wins | 5/8 wins (62.5%) |
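
To put the AlpacaEval-lite difference in context, here is a minimal sketch of the sampling noise on a 50-prompt score. It assumes the two scores are win rates (proportions) over 50 independent prompts, which this card does not confirm, and uses the standard binomial standard error.

```python
import math

def proportion_stderr(p: float, n: int) -> float:
    """Binomial standard error of a proportion p estimated from n trials."""
    return math.sqrt(p * (1 - p) / n)

n_prompts = 50
sft_score, dpo_score = 0.50, 0.47   # from the results table

stderr = proportion_stderr(sft_score, n_prompts)
diff = sft_score - dpo_score

print(f"difference = {diff:.2f}, one standard error ~ {stderr:.3f}")
```

Under that assumption the 0.03 drop is smaller than one standard error (about 0.07), so it should be read as a hint of an alignment tax rather than a measured one. The same caution applies even more strongly to the 8-prompt manual eval.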

See the full report in the lab repo, which includes the reward-curve analysis, the alignment-tax interpretation, and the W&B run link.

Caveats

  • The DPO stage used English UltraFeedback preference data, so Vietnamese behavior improves only through cross-lingual transfer; a native Vietnamese preference dataset would be better (deck §5.4).
  • 3B parameters with 1k SFT examples and 2k DPO pairs is demonstration scale, not production-ready. For production, use a base model of at least 7B and at least 50k preference pairs.
  • Likelihood displacement was observed (deck §3.4): both the chosen and rejected rewards decrease, and the gap widens because the rejected reward falls faster.
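
The likelihood-displacement caveat can be illustrated with the standard DPO quantities: the implicit reward is β times the policy-minus-reference log-probability, and the loss is −log σ(chosen reward − rejected reward). The sketch below uses hypothetical log-probabilities chosen only for illustration; they are not values from this training run.

```python
import math

def dpo_terms(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Return (chosen_reward, rejected_reward, loss) for one preference pair.

    Inputs are summed sequence log-probabilities under the policy (pi_*)
    and the frozen reference model (ref_*).
    """
    chosen_reward = beta * (pi_chosen - ref_chosen)
    rejected_reward = beta * (pi_rejected - ref_rejected)
    margin = chosen_reward - rejected_reward
    loss = math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))
    return chosen_reward, rejected_reward, loss

# Hypothetical numbers. At the start, policy == reference, so both rewards are 0
# and the loss is ln 2 (about 0.693).
start = dpo_terms(-120.0, -130.0, -120.0, -130.0)

# Later, both responses have become LESS likely under the policy, but the
# rejected one dropped more, so the gap is positive and the loss goes down.
later = dpo_terms(-123.0, -135.0, -120.0, -130.0)

for name, (rc, rr, loss) in {"start": start, "later": later}.items():
    print(f"{name}: chosen={rc:+.2f} rejected={rr:+.2f} gap={rc - rr:+.2f} loss={loss:.3f}")
```

In the "later" case both rewards are negative while the gap is positive, which is the pattern described in the last caveat: the objective only constrains the difference, not the absolute likelihood of the chosen response.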

Citation / acknowledgements

  • Lab template: VinUniversity AICB Day-22 (Track 3, A20 cohort 2026).
  • Stack: Unsloth, TRL, PEFT, bitsandbytes, llama.cpp, lm-eval-harness.

Trained by Phạm Hữu Hoàng Hiệp (MSSV 2A202600415).
