---
language:
- nl
license: apache-2.0
library_name: transformers
tags:
- automatic-speech-recognition
- speech
- qwen3-asr
- qwen
- dutch
- fine-tuned
- common-voice
datasets:
- fsicoli/common_voice_22_0
- fixie-ai/common_voice_17_0
- yuriyvnv/synthetic_transcript_nl
base_model: Qwen/Qwen3-ASR-0.6B
pipeline_tag: automatic-speech-recognition
model-index:
- name: Qwen3-ASR-0.6B-NL
  results:
  - task:
      type: automatic-speech-recognition
    dataset:
      name: Common Voice 17 (test)
      type: fixie-ai/common_voice_17_0
      config: nl
      split: test
    metrics:
    - type: wer
      value: 8.11
      name: WER
    - type: cer
      value: 2.68
      name: CER
  - task:
      type: automatic-speech-recognition
    dataset:
      name: Common Voice 22 (test)
      type: fsicoli/common_voice_22_0
      config: nl
      split: test
    metrics:
    - type: wer
      value: 8.31
      name: WER
    - type: cer
      value: 2.76
      name: CER
---
# 🎙️ Qwen3-ASR-0.6B-NL: Dutch Speech Recognition
A Dutch-specialised automatic speech recognition (ASR) model,
fine-tuned from [Qwen/Qwen3-ASR-0.6B](https://huggingface.co/Qwen/Qwen3-ASR-0.6B). It outputs
cased, punctuated Dutch text and works as a drop-in replacement for
the base model.
> **Update.** v2 release. The v1 run (mixed_nl: synthetic_nl + CV22-nl train, 5 epochs) reached 9.06% WER on CV17-nl test and 8.95% WER on CV22-nl test. For v2 we fold CV22-nl train + validation together with the full synthetic_transcript_nl corpus (~34.9k clips) into one training set, train with a 3-epoch budget, and validate only on the held-out CV17-nl and CV22-nl test sets. The run was early-stopped at the start of epoch 2 because eval_loss bottomed out at step 800 (epoch 1.69, eval_loss 0.1249) and started rising at step 1000; the checkpoint published here is step 800.

On **Common Voice 17 (test)** it reaches **8.11% WER** (down from 12.42% zero-shot, **-34.7%** relative).

---
## Results
WER and CER on held-out Common Voice test sets: same samples, same protocol,
no test-time tricks. "Zero-shot" is the base
[Qwen/Qwen3-ASR-0.6B](https://huggingface.co/Qwen/Qwen3-ASR-0.6B) called with
`language="Dutch"`. The fine-tuned numbers are **bold**.

| Test set | Samples | Zero-shot WER (%) | **Fine-tuned WER (%)** | Δ WER | CER (%), zero-shot → fine-tuned |
|---|---:|---:|---:|---:|:--:|
| Common Voice 17 (test) | 11,266 | 12.42 | **8.11** | -4.31 (-34.7%) | 4.06 → **2.68** |
| Common Voice 22 (test) | 12,033 | 12.46 | **8.31** | -4.15 (-33.3%) | 4.11 → **2.76** |

Lower is better. Both held-out test sets see roughly a **one-third relative
reduction in word error rate** versus the already-strong base model.
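
The absolute and relative changes quoted for WER follow directly from the zero-shot and fine-tuned values. A minimal sketch of that arithmetic, using only the numbers reported on this card (the helper name `relative_change` is ours, for illustration):

```python
def relative_change(baseline: float, new: float) -> float:
    """Relative change of `new` versus `baseline`, in percent (negative = improvement)."""
    return (new - baseline) / baseline * 100.0


# (zero-shot WER, fine-tuned WER) in percent, as reported on this card.
results = {
    "Common Voice 17 (test)": (12.42, 8.11),
    "Common Voice 22 (test)": (12.46, 8.31),
}

for name, (zero_shot, fine_tuned) in results.items():
    delta = fine_tuned - zero_shot
    rel = relative_change(zero_shot, fine_tuned)
    print(f"{name}: {delta:+.2f} absolute, {rel:+.1f}% relative")
```

This prints -4.31 / -34.7% for Common Voice 17 and -4.15 / -33.3% for Common Voice 22.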
> 🔬 **Reproducibility note.** Both the zero-shot baseline and the fine-tuned
> numbers above were measured with the *same* evaluation function
> (`train_qwen3_asr.evaluate_model`), the *same* greedy decoding settings, and
> the *same* reference normalisation (see next section). This is an
> apples-to-apples comparison.
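
For readers new to the metrics, WER and CER are both edit-distance rates: the number of substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the reference length in words or characters. The sketch below is a plain-Python illustration of that definition. It is not the project's `train_qwen3_asr.evaluate_model`; the function names, whitespace tokenisation, and counting spaces as characters for CER are assumptions made for the example.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences (row-by-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def error_rate(refs, hyps, unit: str = "word") -> float:
    """Corpus-level error rate in percent: total edits / total reference units."""
    split = str.split if unit == "word" else list
    edits = total = 0
    for ref, hyp in zip(refs, hyps):
        r, h = split(ref), split(hyp)
        edits += edit_distance(r, h)
        total += len(r)
    return 100.0 * edits / total


refs = ["De kat slaapt.", "Hoe gaat het?"]
hyps = ["De kat slaapt.", "Hoe gaat het"]
print(f"WER: {error_rate(refs, hyps, 'word'):.2f}%")
print(f"CER: {error_rate(refs, hyps, 'char'):.2f}%")
```

Note how a single missing `?` costs one whole word in WER but only one character in CER, which is why the reference normalisation matters for a fair comparison.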
## 🧹 Reference / target normalisation
Common Voice transcripts are crowd-sourced and inconsistent in casing and
trailing punctuation. To give the model a clean, predictable target
distribution we apply a small, deterministic **written-form normalisation**
to every reference at load time, both during training and during evaluation:
1. **Capitalise the first letter** if it is lowercase.
2. **Collapse trailing dots**: any run of `.` or `…` characters at the
end (for example `..`, `...`, `…`) is replaced with a single `.`.
3. **Append a terminal period** if the sentence does not already end in
terminal punctuation (`. ! ? …`) or a closing bracket / quote
(`) ] } " '` etc.).

The exact function lives in `src/evaluation/score_written_form.py` of the
project repository. Concretely (the rules are language-independent; these
examples happen to be Portuguese):

| Raw reference | Normalised |
|--------------------------------|----------------------------------|
| `bom dia` | `Bom dia.` |
| `o gato dorme...` | `O gato dorme.` |
| `como estás?` | `Como estás?` *(already ends in `?`, so no `.` is added)* |
| `"oi"` | `"Oi"` *(closing quote, so no `.` is added)* |

Because the **same** normalisation is applied to references used for the
zero-shot baseline above, the gain reported in the results table reflects the
fine-tune itself, **not** a metric quirk caused by mismatched references.
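
The three normalisation rules can be sketched in a few lines of Python. This is an illustrative re-implementation, not the project's `score_written_form.py`: the name `normalise_written_form`, the exact set of closing characters, and the choice to capitalise the first alphabetic character (so that a leading quote is skipped) are assumptions.

```python
import re

# One or more "." or "…" (U+2026) characters at the very end of the string.
_TRAILING_DOTS = re.compile(r"[.\u2026]+$")

# Characters after which no period is appended (assumed set):
# terminal punctuation plus common closing brackets and quotes.
_NO_PERIOD_AFTER = ".!?\u2026)]}\"'\u201d\u2019\u00bb"


def normalise_written_form(text: str) -> str:
    """Apply the three written-form rules to one reference transcript."""
    text = text.strip()
    if not text:
        return text

    # Rule 1: capitalise the first letter if it is lowercase.
    for i, ch in enumerate(text):
        if ch.isalpha():
            if ch.islower():
                text = text[:i] + ch.upper() + text[i + 1:]
            break

    # Rule 2: collapse any run of trailing dots / ellipses into a single ".".
    text = _TRAILING_DOTS.sub(".", text)

    # Rule 3: append a period unless the text already ends in terminal
    # punctuation or a closing bracket / quote.
    if text[-1] not in _NO_PERIOD_AFTER:
        text += "."
    return text


for raw in ["goedemorgen", "de kat slaapt...", "hoe gaat het?", '"hoi"']:
    print(f"{raw!r} -> {normalise_written_form(raw)!r}")
```

Because the function is deterministic and has no language-specific logic, applying it to both training targets and evaluation references keeps the two in the same written form.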
## How to use
Install the official `qwen-asr` package, then load this model exactly the
same way you would load the base Qwen3-ASR:
```bash
pip install qwen-asr
```
```python
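import torch
from qwen_asr import Qwen3ASRModel

# Load the fine-tuned checkpoint in bfloat16 on the first GPU.
model = Qwen3ASRModel.from_pretrained(
    "yuriyvnv/Qwen3-ASR-0.6B-NL",
    dtype=torch.bfloat16,
    device_map="cuda:0",
)

# transcribe() returns a list of results; print the text of the first one.
result = model.transcribe(audio="audio.wav", language="Dutch")
print(result[0].text)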
```
Batch inference, automatic language detection, streaming, and vLLM serving
all work identically to the base model; see the
[upstream Qwen3-ASR documentation](https://github.com/QwenLM/Qwen3-ASR) for
details.
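
If you only need to transcribe a handful of files, one simple option is to call the single-file API in a loop. The helper below is a hypothetical convenience wrapper, not part of `qwen-asr`; it assumes only what the snippet in this section shows, namely that `model.transcribe(audio=..., language=...)` returns a list whose first element has a `.text` attribute. The package's own batch interface is likely to be faster for large jobs.

```python
from pathlib import Path
from typing import Dict, Iterable


def transcribe_files(model, paths: Iterable[str], language: str = "Dutch") -> Dict[str, str]:
    """Transcribe files one at a time and return {file name: transcript}.

    `model` is any object exposing transcribe(audio=..., language=...) that
    returns a list whose first element has a `.text` attribute.
    """
    transcripts: Dict[str, str] = {}
    for path in paths:
        result = model.transcribe(audio=str(path), language=language)
        transcripts[Path(path).name] = result[0].text
    return transcripts
```

With a model loaded via `Qwen3ASRModel.from_pretrained(...)`, usage would look like `transcribe_files(model, ["a.wav", "b.wav"])`.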
## 🛠️ Training
**Dataset:** [yuriyvnv/synthetic_transcript_nl + fsicoli/common_voice_22_0 (nl)](https://huggingface.co/datasets/yuriyvnv/synthetic_transcript_nl). Common Voice 22 Dutch train + validation are combined with the full synthetic OpenAI-TTS Dutch corpus (~34.9k clips) and shuffled with the run seed.

After duration filtering and transcript-length filtering: **57,000**
training samples and **12,033** validation samples.

**Recipe:** follows the
[official QwenLM SFT recipe](https://github.com/QwenLM/Qwen3-ASR/tree/main/finetuning)
with our local hyperparameters:

| Parameter | Value |
|---|---|
| Learning rate | 2e-5 |
| Scheduler | linear |
| Warmup ratio | 0.02 |
| Per-device batch size | 92 |
| Gradient accumulation | 2 |
| Effective batch size | 184 |
| Epochs | 2 (early-stopped from 3; eval_loss best at step 800 / epoch 1.69) |
| Precision | bf16 mixed |
| Gradient checkpointing | enabled |
| Optimizer | AdamW (fused) |

Trained on a single H100. The best checkpoint was selected by validation
loss.
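
To make the batch-size and scheduler settings concrete, the sketch below derives the effective batch size and evaluates a linear warmup / linear decay schedule of the kind implemented by `get_linear_schedule_with_warmup` in `transformers`. The total step count is a hypothetical placeholder (this card does not state the planned total), and whether the SFT script rounds the warmup steps exactly this way is an assumption.

```python
import math

PEAK_LR = 2e-5
WARMUP_RATIO = 0.02
PER_DEVICE_BATCH = 92
GRAD_ACCUM = 2
NUM_DEVICES = 1  # single H100

# Samples consumed per optimizer step.
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM * NUM_DEVICES
print("effective batch size:", effective_batch)


def linear_lr(step: int, total_steps: int) -> float:
    """Linear warmup from 0 to PEAK_LR, then linear decay to 0 at total_steps."""
    warmup_steps = math.ceil(total_steps * WARMUP_RATIO)
    if step < warmup_steps:
        return PEAK_LR * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return PEAK_LR * remaining / max(1, total_steps - warmup_steps)


TOTAL_STEPS = 1000  # hypothetical value, for illustration only
for step in (0, 10, 20, 510, 1000):
    print(step, f"{linear_lr(step, TOTAL_STEPS):.2e}")
```

With a warmup ratio of 0.02 the learning rate reaches its peak after only 2% of the planned steps and then decays for the rest of the run.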
## ⚠️ Limitations
- Trained on Common Voice and synthetic TTS audio, which is predominantly
read speech. Conversational, overlapping-speaker, far-field, or strongly
accented audio may degrade accuracy.
- Outputs Dutch text. Cross-lingual or code-switched audio is not
targeted.
- Punctuation and casing are best-effort and inherit the inconsistencies of
the Common Voice reference transcripts (mitigated, but not eliminated, by
the normalisation step above).
## Acknowledgements
This model would not exist without the work of others. Thank you to:
- **The Qwen team at Alibaba Cloud** for releasing
[Qwen3-ASR-0.6B](https://huggingface.co/Qwen/Qwen3-ASR-0.6B), the backbone of
this fine-tune, together with a clean, reproducible
[SFT recipe](https://github.com/QwenLM/Qwen3-ASR/tree/main/finetuning) and
the [Qwen3-ASR Technical Report](https://arxiv.org/abs/2601.21337).
- **The Mozilla Common Voice community** for collecting and releasing the
Dutch speech corpus used for training and evaluation
([Common Voice 22](https://huggingface.co/datasets/fsicoli/common_voice_22_0),
[Common Voice 17 mirror](https://huggingface.co/datasets/fixie-ai/common_voice_17_0)).
- **Every contributor** who recorded, validated, or transcribed a clip in
Common Voice. This model is, very literally, your voices.
## Citation
If this model is useful in your work, please cite the base Qwen3-ASR report:
```bibtex
@article{qwen3asr2026,
  title  = {Qwen3-ASR Technical Report},
  author = {Qwen Team},
  year   = {2026},
  url    = {https://arxiv.org/abs/2601.21337}
}
```
And, if relevant, this Dutch fine-tune:
```
yuriyvnv/Qwen3-ASR-0.6B-NL: Dutch fine-tune of Qwen3-ASR-0.6B,
trained on Common-Voice-derived data with WAVe-based
quality filtering.
```