---
language:
- vi
license: apache-2.0
base_model: Qwen/Qwen2.5-3B
tags:
- dpo
- preference-learning
- vietnamese
- qlora
- lora
- unsloth
- trl
datasets:
- bkai-foundation-models/vi-alpaca
- argilla/ultrafeedback-binarized-preferences-cleaned
pipeline_tag: text-generation
---

# Lab 22: DPO-Aligned Qwen2.5-3B (VN)

A Vietnamese-aligned model produced in the VinUniversity AICB Day-22 lab (Track 3: DPO/ORPO Alignment).

**Pipeline:** SFT (1k VN Alpaca) → DPO (2k UltraFeedback, β=0.1, lr=5e-7) on top of `unsloth/Qwen2.5-3B-bnb-4bit`.

## Files

- `adapter_config.json` + `adapter_model.safetensors`: the DPO LoRA adapter (rank=16, α=32). Apply it on top of the base model with the SFT-mini adapter.
- `lab22-dpo-Q4_K_M.gguf`: merged and quantized GGUF for llama.cpp / llama-cpp-python deployment (~1.9 GB).

## Quick start

### Inference via llama-cpp-python

```python
from llama_cpp import Llama

llm = Llama(model_path="lab22-dpo-Q4_K_M.gguf", n_ctx=512)
response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Giải thích quicksort."}]  # "Explain quicksort."
)
print(response["choices"][0]["message"]["content"])
```

### Inference via transformers + PEFT

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B")
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-3B", torch_dtype="bfloat16", device_map="cuda")
# Per the Files section, load the SFT-mini adapter onto `base` before this step if you have it.
model = PeftModel.from_pretrained(base, "hiepphambk/lab22-dpo-vn")  # this DPO adapter

inputs = tokenizer("Giải thích quicksort.", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

## Training details

| Hyperparameter | Value |
|---|---|
| Base | `unsloth/Qwen2.5-3B-bnb-4bit` (4-bit NF4) |
| LoRA rank · α | 16 · 32 |
| LoRA target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| DPO β | 0.1 |
| Learning rate | 5e-7 |
| Optimizer | adamw_8bit |
| Effective batch | 1 × 8 grad-accum = 8 |
| Epochs | 1 |
| Max seq length | 512 |
| Compute | RTX 5070 12 GB (Blackwell sm_120, CUDA 12.8) |

## Evaluation results

| Metric | SFT-only | SFT+DPO |
|---|---:|---:|
| Reward gap (chosen − rejected, end of training) | n/a | **+0.114** |
| Final DPO loss | n/a | 0.798 |
| AlpacaEval-lite (50 prompts, gpt-4o-mini judge) | 0.50 | 0.47 |
| Manual eval (8 VN prompts, gpt-4o-mini judge) | 2/8 wins | **5/8 wins (62.5%)** |

See the full report (including reward-curve analysis, alignment-tax interpretation, and the W&B run link) in the [lab repo](https://github.com/hiepphambk/Day22-Track3-DPO-Alignment-Lab).

## Caveats

- Trained on English UltraFeedback preference data, so Vietnamese behavior improves only via cross-lingual transfer; a native Vietnamese preference dataset would be better (deck §5.4).
- 3B + 1k SFT + 2k DPO is *demonstrative scale*, not production-ready. For production, use a ≥ 7B base and ≥ 50k preference pairs.
- Likelihood displacement observed (deck §3.4): both the chosen and rejected rewards decrease, and the gap widens because the rejected reward falls faster.

## Citation / acknowledgements

- Lab template: VinUniversity AICB Day-22 (Track 3, A20 cohort 2026).
- Stack: Unsloth, TRL, PEFT, bitsandbytes, llama.cpp, lm-eval-harness.

Trained by Phạm Hữu Hoàng Hiệp (student ID 2A202600415).
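
## Appendix: reward gap sketch

The reward gap in the evaluation table and the likelihood-displacement caveat both rest on DPO's implicit reward, which scales the difference between policy and reference log-probabilities by β. The snippet below is a minimal sketch of that arithmetic for a single preference pair, assuming the standard DPO formulation. The function name `dpo_terms` and all log-probability values are hypothetical and chosen only for illustration; they are not taken from this training run.

```python
import math

def dpo_terms(policy_chosen_logp, policy_rejected_logp,
              ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Return (chosen_reward, rejected_reward, gap, loss) for one preference pair."""
    # Implicit reward: beta times the log-ratio of policy to reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    gap = chosen_reward - rejected_reward
    # DPO loss is -log(sigmoid(gap)), written here as log(1 + exp(-gap)).
    loss = math.log1p(math.exp(-gap))
    return chosen_reward, rejected_reward, gap, loss

# Hypothetical sequence log-probabilities (not from the lab's logs).
# The policy assigns lower log-probability than the reference to BOTH answers,
# but the rejected answer drops further.
chosen_r, rejected_r, gap, loss = dpo_terms(
    policy_chosen_logp=-52.0, policy_rejected_logp=-61.0,
    ref_chosen_logp=-50.0, ref_rejected_logp=-57.0,
    beta=0.1,
)
print(f"chosen={chosen_r:+.3f} rejected={rejected_r:+.3f} gap={gap:+.3f} loss={loss:.3f}")
```

In this toy case both rewards are negative while the gap is positive, which is the pattern the likelihood-displacement caveat describes: the margin grows even though the chosen response itself becomes less likely under the policy.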