---
language:
- fr
- en
library_name: transformers
tags:
- dpo
- post-training
- french
- alignment
- model-merging
- qwen3
- chocolatine
- comparia
license: apache-2.0
base_model: Qwen/Qwen3-4B-Instruct-2507
datasets:
- jpacifico/comparia-dpo-pairs-bt-6k
- jpacifico/french-orca-dpo-pairs-revised
---

# Chocolatine-2-4B-Instruct-DPO-v2.1

**Chocolatine-2-4B-Instruct-DPO-v2.1** is a post-trained version of [Qwen/Qwen3-4B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507), designed to improve instruction-following, reasoning, and overall performance in French, while preserving strong multilingual capabilities.  
In my evaluation setup, it delivers consistent gains across the tested French benchmarks, pointing to a broad improvement in French capabilities.  
Although the post-training pipeline focuses on French preference data, no degradation is observed on English tasks, and slight improvements are sometimes seen, suggesting positive cross-lingual transfer.  
Optimized variants (MLX, GGUF) are also available, making the model particularly suitable for local inference.  


## Model Overview

- **Base model:** Qwen/Qwen3-4B-Instruct-2507
- **Parameters:** 4.0B
- **Context Length:** 262,144 tokens natively
- **Post-training methods:** DPO + Model Merging

Note: This model supports only non-thinking mode and does not generate `<think></think>` blocks in its outputs.  
This design is consistent with the goals of the post-training setup, which favors a compact dense instruct model focused on direct generation efficiency and practical downstream use.    
For use cases requiring explicit reasoning traces or structured thinking outputs, Qwen/Qwen3.5-4B (thinking mode) is recommended.

**Model Variants**

- Chocolatine-2-4B-Instruct-DPO-v2.1 (this repo): Contains the retrainable weights in BF16 format
- Quantized GGUF versions: [Q4_K_M](https://huggingface.co/jpacifico/Chocolatine-2-4B-Instruct-DPO-v2.1-Q4_K_M-GGUF) / [Q8_0](https://huggingface.co/jpacifico/Chocolatine-2-4B-Instruct-DPO-v2.1-Q8_0-GGUF), and more from mradermacher [here](https://huggingface.co/mradermacher/Chocolatine-2-4B-Instruct-DPO-v2.1-GGUF)
- MLX (optimized for Apple silicon): [4Bit](https://huggingface.co/jpacifico/Chocolatine-2-4B-Instruct-DPO-v2.1-mlx-4Bit) / [8Bit](https://huggingface.co/jpacifico/Chocolatine-2-4B-Instruct-DPO-v2.1-mlx-8Bit)

**Ollama**: In addition to the Hugging Face release, quantized 4-bit and 8-bit variants are also available [here](https://ollama.com/jpacifico/chocolatine-2.1) on Ollama for convenient local inference.

## Benchmarks

The results show an improvement on every tested French benchmark, across several capability types. This suggests a broad gain in French performance, while English results remain broadly stable.

| Benchmark fr | Qwen3-4B-Instruct-2507 (base) | Chocolatine-2-4B-Instruct-DPO-v2.1 |
|---|---:|---:|
| gpqa-fr:diamond | 28.93 | **32.49** |
| french_bench_arc_challenge | 47.13 | **49.79** |
| french_bench_grammar | 70.59 | **72.27** |
| french_bench_boolqa | 88.76 | **89.89** |
| french_bench_hellaswag | 56.99 | **58.03** |
| global_mmlu_fr | 63.75 | **64.75** |
| xwinograd_fr | 66.27 | **67.47** |
| fr_mt_bench | 6.22 | **6.44** |

*FR-MT-Bench* evaluation is performed on [MT-Bench-French](https://huggingface.co/datasets/bofenghuang/mt-bench-french), using [multilingual-mt-bench](https://github.com/jpacifico/multilingual_mt_bench) with OpenAI/GPT-5 as the LLM judge.  
*global_mmlu_fr*, *xwinograd_fr* and *french_bench* results were obtained using [EleutherAI LM Eval Harness](https://github.com/EleutherAI/lm-evaluation-harness) in a **0-shot** evaluation setting.  
*gpqa-fr:diamond* results were obtained using LightEval/vLLM via the [kurakurai/Luth](https://github.com/kurakurai/Luth.git) evaluation process.

| Benchmark eng | Qwen3-4B-Instruct-2507 (base) | Chocolatine-2-4B-Instruct-DPO-v2.1 |
|---|---:|---:|
| arc_challenge | **58.79** | 58.45 |
| hellaswag | 69.08 | **70.16** |
| boolq | 84.80 | **85.32** |
| gpqa_diamond_zeroshot | **38.89** | 38.38 |

English benchmark results were obtained using [EleutherAI LM Eval Harness](https://github.com/EleutherAI/lm-evaluation-harness) in a **0-shot** evaluation setting.

## Training & Alignment Pipeline  

Chocolatine-2-4B-Instruct-DPO-v2.1 is derived from [Qwen/Qwen3-4B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507) using a multi-step post-training pipeline:

**Stage 1 – DPO (Compar:IA adaptation)**

This stage applies Direct Preference Optimization (DPO) to a DPO-adapted version of **[Compar:IA](https://comparia.beta.gouv.fr/datasets)** data, derived from the preference dataset [comparia-votes](https://huggingface.co/datasets/ministere-culture/comparia-votes), part of a public initiative led by the French Ministry of Culture. Previous iterations of the Chocolatine model series were also selected as part of this initiative.  
I constructed an original DPO dataset from these votes by transforming them into preference pairs (chosen / rejected), with additional filtering and formatting steps to make them suitable for DPO fine-tuning.  
Two dataset variants were created ([6k](https://huggingface.co/datasets/jpacifico/comparia-dpo-pairs-bt-6k) and [13k](https://huggingface.co/datasets/jpacifico/comparia-dpo-pairs-bt-13k) preference pairs).  
The **6k variant** was used for the DPO training reported in this release.
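
As an illustration only, the vote-to-pair transformation described above could look like the sketch below. The field names (`prompt`, `response_a`, `response_b`, `winner`) and the two filters are hypothetical: they do not reflect the actual comparia-votes schema or the exact filtering used to build the released datasets.

```python
from typing import Iterable


def votes_to_dpo_pairs(votes: Iterable[dict]) -> list[dict]:
    """Turn pairwise votes into DPO records (prompt / chosen / rejected).

    All field names are hypothetical, not the real comparia-votes schema.
    """
    pairs = []
    for vote in votes:
        winner = vote.get("winner")  # hypothetical: "a", "b", or None for a tie
        if winner not in ("a", "b"):
            continue  # ties and missing votes carry no preference signal
        chosen = vote["response_a"] if winner == "a" else vote["response_b"]
        rejected = vote["response_b"] if winner == "a" else vote["response_a"]
        # Assumed filter: drop empty answers and pairs with identical answers.
        if not chosen.strip() or not rejected.strip() or chosen == rejected:
            continue
        pairs.append({"prompt": vote["prompt"], "chosen": chosen, "rejected": rejected})
    return pairs


example_votes = [
    {"prompt": "Quelle est la capitale de la France ?",
     "response_a": "Paris.", "response_b": "Lyon.", "winner": "a"},
    {"prompt": "Bonjour", "response_a": "Salut !", "response_b": "Salut !", "winner": "b"},
    {"prompt": "2 + 2 ?", "response_a": "5", "response_b": "4", "winner": None},
]
pairs = votes_to_dpo_pairs(example_votes)
print(pairs)
```

Only the first vote survives here: the second has identical answers and the third is a tie, so neither yields a usable preference pair.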

**Stage 2 – DPO (French-ORCA pairs)**

A second DPO stage uses a French version of the ORCA preference pairs, the dataset **[jpacifico/french-orca-dpo-pairs-revised](https://huggingface.co/datasets/jpacifico/french-orca-dpo-pairs-revised)**, which is commonly used in the Chocolatine training pipeline.  
This stage further improves general instruction alignment, robustness across tasks, and cross-lingual capabilities.
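
For readers unfamiliar with DPO, the sketch below computes the standard DPO loss for a single preference pair from sequence log-probabilities. It is a minimal illustration, not the training code used for this model: the function name is hypothetical, and `beta=0.1` is an assumed value, since the hyperparameters of the two DPO stages are not stated here.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are summed log-probabilities of each full response under the
    trained policy and under the frozen reference model.
    beta is an assumed value, not the one used for this release.
    """
    # Implicit rewards: how much more likely the policy makes each response
    # compared with the reference model.
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_reward - rejected_reward)
    # -log(sigmoid(margin)), written in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# The policy has shifted probability toward the chosen answer,
# so the loss falls below log(2), the value for an unchanged policy.
print(dpo_loss(-1.0, -5.0, -2.0, -3.0))
```

The loss equals log(2) when the policy matches the reference, and decreases as the policy favors the chosen response over the rejected one more strongly than the reference does.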

**Stage 3 – Model Merging (MergeKit + TIES)**

The resulting checkpoints were merged using **MergeKit** with the TIES method.

TIES merging selects task-relevant parameter updates, reduces destructive interference between models, and preserves base model stability.

MergeKit configuration:

```yaml
# ties2 recipe
models:
  - model: jpacifico/Qwen3-4B-Instruct-DPO-test2
    parameters:
      density: 0.5
      weight: 0.5
  - model: jpacifico/Qwen3-4B-Instruct-DPO-test-b3
    parameters:
      density: 0.5
      weight: 0.5

merge_method: ties
base_model: Qwen/Qwen3-4B-Instruct-2507

parameters:
  normalize: false
  int8_mask: true

dtype: bfloat16
```
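
To make the three TIES steps (trim, elect sign, merge agreeing updates) concrete, here is a toy sketch on flat lists of floats, using the same `density: 0.5`, `weight: 0.5` and `normalize: false` values as the configuration above. It is a simplified reading of the method, not MergeKit's implementation: the function names are hypothetical, the four-element "models" are invented, and real merges operate tensor by tensor.

```python
def trim(task_vector, density):
    """Keep the `density` fraction of entries with the largest magnitude."""
    k = int(density * len(task_vector))
    ranked = sorted(range(len(task_vector)),
                    key=lambda i: abs(task_vector[i]), reverse=True)
    keep = set(ranked[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(task_vector)]


def ties_merge(base, finetuned, density=0.5, weight=0.5, normalize=False):
    """Simplified TIES merge over flat lists of floats (illustrative only)."""
    # Task vectors: what each fine-tuned checkpoint changed relative to the base.
    deltas = [[f - b for f, b in zip(model, base)] for model in finetuned]
    # Trim small updates, then apply the per-model weight.
    weighted = [[weight * v for v in trim(d, density)] for d in deltas]

    merged = []
    for i, b in enumerate(base):
        column = [w[i] for w in weighted]
        # Elect one sign per parameter from the summed weighted updates.
        sign = 1.0 if sum(column) >= 0 else -1.0
        # Keep only the updates that agree with the elected sign.
        agreeing = [v for v in column if v * sign > 0]
        delta = sum(agreeing)
        if normalize and agreeing:
            delta /= weight * len(agreeing)
        merged.append(b + delta)
    return merged


base = [0.0, 0.0, 0.0, 0.0]
model_a = [0.4, -0.2, 0.1, 0.0]   # toy stand-in for the first DPO checkpoint
model_b = [0.2, 0.6, -0.3, 0.05]  # toy stand-in for the second DPO checkpoint
merged = ties_merge(base, [model_a, model_b])
print(merged)
```

In this toy run the second parameter receives conflicting updates (-0.2 and 0.6): the positive sign is elected and the negative update is discarded rather than averaged in, which is how TIES limits interference between checkpoints.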

## Usage

The following code snippet illustrates how to use the model to generate content from a given input.  
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "jpacifico/Chocolatine-2-4B-Instruct-DPO-v2.1"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# prepare the model input
prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=16384
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist() 

content = tokenizer.decode(output_ids, skip_special_tokens=True)

print("content:", content)
```

## Limitations  

The Chocolatine-2 model series is a quick demonstration that a base model can be easily fine-tuned to achieve compelling performance.
It does not have any moderation mechanism.  

Developed by: Jonathan Pacifico, 2026   
Model type: LLM  
Language(s) (NLP): French, English  
License: Apache-2.0  

Made with ❤️ in France