--- language: - fr - en library_name: transformers tags: - dpo - post-training - french - alignment - model-merging - qwen3 - chocolatine - comparia license: apache-2.0 base_model: Qwen/Qwen3-4B-Instruct-2507 datasets: - jpacifico/comparia-dpo-pairs-bt-6k - jpacifico/french-orca-dpo-pairs-revised --- # Chocolatine-2-4B-Instruct-DPO-v2.1 **Chocolatine-2-4B-Instruct-DPO-v2.1** is a post-trained version of [Qwen/Qwen3-4B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507), designed to improve instruction-following, reasoning, and overall performance on French benchmarks, while preserving strong multilingual capabilities. Although the post-training pipeline focuses on French preference data, improvements are also observed on English tasks, suggesting positive cross-lingual transfer. Optimized variants (MLX, GGUF) are also available, making the model particularly suitable for local inference and agent-oriented systems (e.g. OpenClaw, NemoClaw). ## Model Overview - **Base model:** Qwen/Qwen3-4B-Instruct-2507 (non-thinking) - **Parameters:** 4.0B - **Context Length:** 262,144 natively - **Post training methods:** DPO + Model Merging Note: This model supports only non-thinking mode and does not generate `` blocks in its outputs. The choice of Qwen3 over Qwen3.5 is intentional: Qwen3 provides a more direct text-oriented generation style, better aligned with the objectives of this post-training. Empirically, my post-training on Qwen3 yielded *more consistent improvements across evaluated benchmarks* compared to Qwen3.5. For use cases requiring explicit reasoning traces or structured thinking outputs, Qwen3.5 (thinking mode) is recommended. Chocolatine-2-4B-Instruct-DPO-v2.1 is therefore optimized for direct instruction-following and practical deployment, rather than explicit CoT generation. **Model Variants** - Chocolatine-2-4B-Instruct-DPO-v2.1 (this repo): Contains the retrainable weights in BF16 format - Quantized GGUF versions : [Q4_K_M](https://huggingface.co/jpacifico/Chocolatine-2-4B-Instruct-DPO-v2.1-Q4_K_M-GGUF) / [Q8_0](https://huggingface.co/jpacifico/Chocolatine-2-4B-Instruct-DPO-v2.1-Q8_0-GGUF) - MLX (optimized for Apple silicon): [4Bit](https://huggingface.co/jpacifico/Chocolatine-2-4B-Instruct-DPO-v2.1-mlx-4Bit) / [6Bit](https://huggingface.co/jpacifico/Chocolatine-2-4B-Instruct-DPO-v2.1-mlx-8Bit) ## Benchmarks Performance improves consistently across all tested FR benchmarks : | Benchmark | Qwen3-4B-Thinking-2507 (base) | Chocolatine-2-4B-Instruct-DPO-v2.1 | |---|---:|---:| | french_bench_arc_challenge | 47.13 | **49.79** | | french_bench_grammar | 70.59 | **72.27** | | french_bench_boolqa | 88.76 | **89.89** | | french_bench_hellaswag | 56.99 | **58.03** | | global_mmlu_fr | 63.75 | **64.75** | | fr_mt_bench | 6.22 | **6.44** | *FR-MT-Bench* evaluation is performed on [MT-Bench-French](https://huggingface.co/datasets/bofenghuang/mt-bench-french), using [multilingual-mt-bench](https://github.com/jpacifico/multilingual_mt_bench) with OpenAI/GPT-5 as the LLM judge. All other benchmark results were obtained using [EleutherAI LM Eval Harness](https://github.com/EleutherAI/lm-evaluation-harness) in a **0-shot** evaluation setting. ## Training & Alignment Pipeline Chocolatine-2-4B-Instruct-DPO-v2.1 is derived from [Qwen/Qwen3-4B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507) using a multi-step post-training pipeline: **Stage 1 – DPO (Compar:IA adaptation)** Direct Preference Optimization (DPO) on a DPO-adapted version of **[Compar:IA](https://comparia.beta.gouv.fr/datasets)** data, derived from the preference dataset [comparia-votes](https://huggingface.co/datasets/ministere-culture/comparia-votes), part of a public initiative led by the Ministry of Culture (French gov). Previous iterations of the Chocolatine model series also were selected as part of this initiative. I constructed an original DPO dataset from these votes by transforming them into preference pairs (chosen / rejected), with additional filtering and formatting steps to make them suitable for DPO fine-tuning. Two dataset variants were created ([6k](https://huggingface.co/datasets/jpacifico/comparia-dpo-pairs-bt-6k) and [13k](https://huggingface.co/datasets/jpacifico/comparia-dpo-pairs-bt-13k) preference pairs). The **6k variant** was used for the DPO training reported in this release. **Stage 2 – DPO (French-ORCA pairs)** A second DPO stage using a french-version of ORCA preference pairs, based on the dataset **[jpacifico/french-orca-dpo-pairs-revised](https://huggingface.co/datasets/jpacifico/french-orca-dpo-pairs-revised)**, commonly used in the Chocolatine training pipeline. This stage further improves : general instruction alignment, robustness across tasks, cross-lingual capabilities. **Stage 3 – Model Merging (MergeKit + TIES)** The resulting checkpoints were merged using **MergeKit** with the TIES method. TIES merging: selects task-relevant parameter updates, reduces destructive interference between models and preserves base model stability. MergeKit configuration: ```yaml # ties2 recipe models: - model: jpacifico/Qwen3-4B-Instruct-DPO-test2 parameters: density: 0.5 weight: 0.5 - model: jpacifico/Qwen3-4B-Instruct-DPO-test-b3 parameters: density: 0.5 weight: 0.5 merge_method: ties base_model: Qwen/Qwen3-4B-Instruct-2507 parameters: normalize: false int8_mask: true dtype: bfloat16 ``` ## Limitations The Chocolatine-2 model series is a quick demonstration that a base model can be easily fine-tuned to achieve compelling performance. It does not have any moderation mechanism. Developed by: Jonathan Pacifico, 2026 Model type: LLM Language(s) (NLP): French, English License: Apache-2.0 Made with ❤️ in France