---
language:
  - nl
license: apache-2.0
library_name: transformers
tags:
  - automatic-speech-recognition
  - speech
  - qwen3-asr
  - qwen
  - dutch
  - fine-tuned
  - common-voice
datasets:
  - fsicoli/common_voice_22_0
  - fixie-ai/common_voice_17_0
  - yuriyvnv/synthetic_transcript_nl
base_model: Qwen/Qwen3-ASR-0.6B
pipeline_tag: automatic-speech-recognition
model-index:
  - name: Qwen3-ASR-0.6B-NL
    results:
      - task:
          type: automatic-speech-recognition
        dataset:
          name: Common Voice 17 (test)
          type: fixie-ai/common_voice_17_0
          config: nl
          split: test
        metrics:
          - type: wer
            value: 8.11
            name: WER
          - type: cer
            value: 2.68
            name: CER
      - task:
          type: automatic-speech-recognition
        dataset:
          name: Common Voice 22 (test)
          type: fsicoli/common_voice_22_0
          config: nl
          split: test
        metrics:
          - type: wer
            value: 8.31
            name: WER
          - type: cer
            value: 2.76
            name: CER
---

# 🎙️ Qwen3-ASR-0.6B-NL — Dutch Speech Recognition

<div align="center">
  <img src="https://img.shields.io/badge/Parameters-0.6B-red" alt="0.6B Parameters">
  <img src="https://img.shields.io/badge/Modality-Speech%20%E2%86%92%20Text-purple" alt="Speech to Text">
  <img src="https://img.shields.io/badge/Language-Dutch-green" alt="Dutch">
  <img src="https://img.shields.io/badge/Task-ASR-blue" alt="Automatic Speech Recognition">
  <img src="https://img.shields.io/badge/Base-Qwen3--ASR--0.6B-orange" alt="Base model">
  <img src="https://img.shields.io/badge/Precision-bf16-lightgrey" alt="bf16">
  <img src="https://img.shields.io/badge/License-Apache--2.0-yellow" alt="Apache-2.0">
</div>

<br/>

A Dutch-specialised automatic speech recognition (ASR) model,
fine-tuned from [Qwen/Qwen3-ASR-0.6B](https://huggingface.co/Qwen/Qwen3-ASR-0.6B). It outputs
cased, punctuated Dutch text and works as a drop-in replacement for
the base model.

> **Update (v2 release).** The v1 run (`mixed_nl`: synthetic_nl + CV22-nl train, 5 epochs) reached 9.06% WER on CV17-nl test and 8.95% WER on CV22-nl test. For v2, CV22-nl train + validation and the full synthetic_transcript_nl corpus (~34.9k clips) are folded into one training set, trained with a 3-epoch budget, and validated only on the held-out CV17-nl and CV22-nl test sets. The run was stopped early at the start of epoch 2: eval_loss bottomed out at step 800 (epoch 1.69, eval_loss 0.1249) and had started rising by step 1000. The checkpoint published here is step 800.

On **Common Voice 17 (test)** it reaches **8.11% WER** (down from 12.42% zero-shot, **-34.7%** relative).

---

## 📊 Results

WER and CER on held-out Common Voice test sets — same samples, same protocol,
no test-time tricks. "Zero-shot" is the base
[Qwen/Qwen3-ASR-0.6B](https://huggingface.co/Qwen/Qwen3-ASR-0.6B) called with
`language="Dutch"`. The fine-tuned numbers are **bold**.

| Test set | Samples | Zero-shot WER | **Fine-tuned WER** | Δ WER | CER (zero-shot → fine-tuned) |
|---|---:|---:|---:|---:|:--:|
| Common Voice 17 (test) | 11,266 | 12.42 | **8.11** | -4.31 (-34.7%) | 4.06 → **2.68** |
| Common Voice 22 (test) | 12,033 | 12.46 | **8.31** | -4.15 (-33.3%) | 4.11 → **2.76** |

Lower is better. Both held-out test sets see roughly a **one-third relative
reduction in word error rate** versus the already-strong base model.
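
The Δ WER column can be re-derived from the table values. The short check below is only an arithmetic sketch; the helper name `relative_reduction` is introduced here for illustration and is not part of the project code.

```python
def relative_reduction(baseline: float, finetuned: float) -> float:
    """Relative error reduction in percent (positive means fewer errors)."""
    return (baseline - finetuned) / baseline * 100.0


# (zero-shot WER, fine-tuned WER), copied from the results table above
wer_results = {
    "Common Voice 17 (test)": (12.42, 8.11),
    "Common Voice 22 (test)": (12.46, 8.31),
}

for name, (zero_shot, fine_tuned) in wer_results.items():
    absolute = zero_shot - fine_tuned
    relative = relative_reduction(zero_shot, fine_tuned)
    print(f"{name}: -{absolute:.2f} WER points, -{relative:.1f}% relative")
```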

> 🔬 **Reproducibility note.** Both the zero-shot baseline and the fine-tuned
> numbers above were measured with the *same* evaluation function
> (`train_qwen3_asr.evaluate_model`), the *same* greedy decoding settings, and
> the *same* reference normalisation (see next section). This is an
> apples-to-apples comparison.
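
For readers unfamiliar with the metrics, here is a minimal pure-Python sketch of corpus-level WER and CER based on Levenshtein distance. It is not the project's `evaluate_model`; the function names are hypothetical, and it assumes errors are pooled over the whole test set rather than averaged per utterance, which may differ from the actual evaluation code.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def corpus_error_rate(references, hypotheses, unit="word"):
    """Error rate in percent: total edits / total reference length.

    unit="word" gives WER, unit="char" gives CER.
    """
    total_edits = 0
    total_length = 0
    for ref, hyp in zip(references, hypotheses):
        ref_units = ref.split() if unit == "word" else list(ref)
        hyp_units = hyp.split() if unit == "word" else list(hyp)
        total_edits += edit_distance(ref_units, hyp_units)
        total_length += len(ref_units)
    return 100.0 * total_edits / total_length


refs = ["De kat slaapt."]
hyps = ["De kat slaap."]
print(corpus_error_rate(refs, hyps, unit="word"))  # one wrong word out of three
print(corpus_error_rate(refs, hyps, unit="char"))  # one missing character out of fourteen
```

Because casing and punctuation are part of the target text, a hypothesis that differs only in a capital letter or a final period still counts as an error under this scoring.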

## 🧹 Reference / target normalisation

Common Voice transcripts are crowd-sourced and inconsistent in casing and
trailing punctuation. To give the model a clean, predictable target
distribution we apply a small, deterministic **written-form normalisation**
to every reference at load time, both during training and during evaluation:

1. **Capitalise the first letter** if it is lowercase.
2. **Collapse trailing dots**: any trailing run of `.` or `…` characters
   (for example `..`, `...`, `…`) is replaced with a single `.`.
3. **Append a terminal period** if the sentence does not already end in
   terminal punctuation (`. ! ? …`) or a closing bracket / quote
   (`) ] } " '` etc.).

The exact function lives in `src/evaluation/score_written_form.py` of the
project repository. Concretely:

| Raw reference                  | Normalised                                         |
|--------------------------------|----------------------------------------------------|
| `goedemorgen`                  | `Goedemorgen.`                                     |
| `de kat slaapt...`             | `De kat slaapt.`                                   |
| `hoe gaat het?`                | `Hoe gaat het?` *(capitalised; punctuation kept)*  |
| `"hoi"`                        | `"Hoi"` *(closing quote → no `.`)*                 |

Because the **same** normalisation is applied to references used for the
zero-shot baseline above, the gain reported in the results table reflects the
fine-tune itself — **not** a metric quirk caused by mismatched references.
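
The three rules in this section can be sketched as a small function. This is an illustrative reconstruction, not the contents of `src/evaluation/score_written_form.py`: the name `normalise_written_form`, the exact set of closing characters, and the choice to skip leading punctuation when looking for the first letter are all assumptions.

```python
import re

TERMINAL_PUNCTUATION = ".!?…"
# Assumed set of closing brackets and quotes; the real list may differ.
CLOSING_CHARS = ")]}\"'»”’"


def normalise_written_form(text: str) -> str:
    """Apply the three written-form rules to one reference transcript."""
    text = text.strip()
    if not text:
        return text

    # Rule 1: capitalise the first letter if it is lowercase.
    # Leading punctuation such as an opening quote is skipped (assumption).
    for i, ch in enumerate(text):
        if ch.isalnum():
            if ch.islower():
                text = text[:i] + ch.upper() + text[i + 1:]
            break

    # Rule 2: collapse any trailing run of "." or "…" into a single ".".
    text = re.sub(r"[.…]+$", ".", text)

    # Rule 3: append a period unless the text already ends in terminal
    # punctuation or a closing bracket / quote.
    if text[-1] not in TERMINAL_PUNCTUATION + CLOSING_CHARS:
        text += "."
    return text


for raw in ["goedemorgen", "de kat slaapt...", "hoe gaat het?", '"hoi"']:
    print(f"{raw!r} -> {normalise_written_form(raw)!r}")
```

The function is deterministic and idempotent on its own output, which is what makes it safe to apply to the same references in both training and evaluation.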

## 🚀 How to use

Install the official `qwen-asr` package, then load this model exactly the
same way you would load the base Qwen3-ASR:

```bash
pip install qwen-asr
```

```python
import torch
from qwen_asr import Qwen3ASRModel

model = Qwen3ASRModel.from_pretrained(
    "yuriyvnv/Qwen3-ASR-0.6B-NL",
    dtype=torch.bfloat16,
    device_map="cuda:0",
)

result = model.transcribe(audio="audio.wav", language="Dutch")
print(result[0].text)
```

Batch inference, automatic language detection, streaming, and vLLM serving
all work identically to the base model — see the
[upstream Qwen3-ASR documentation](https://github.com/QwenLM/Qwen3-ASR) for
details.
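
As a small convenience, the single-file call shown earlier can be wrapped to transcribe a folder of recordings one file at a time. The helper below is a hypothetical sketch: `transcribe_folder` is not part of `qwen-asr`, and it relies only on the `transcribe(audio=..., language=...)` call and the `result[0].text` attribute used in the snippet above.

```python
from pathlib import Path


def transcribe_folder(model, folder, language="Dutch", pattern="*.wav"):
    """Transcribe every matching file in `folder`, one call per file.

    `model` is any object exposing transcribe(audio=..., language=...)
    that returns a list whose first item has a `.text` attribute.
    Returns a dict mapping file name to transcript.
    """
    transcripts = {}
    for path in sorted(Path(folder).glob(pattern)):
        result = model.transcribe(audio=str(path), language=language)
        transcripts[path.name] = result[0].text
    return transcripts
```

With the model loaded as shown above, `transcribe_folder(model, "recordings/")` would return one transcript per `.wav` file; the folder name here is a placeholder. For large workloads, the batch and vLLM paths described in the upstream documentation are likely to be faster than this per-file loop.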

## 🛠️ Training

**Dataset:** [yuriyvnv/synthetic_transcript_nl](https://huggingface.co/datasets/yuriyvnv/synthetic_transcript_nl) + [fsicoli/common_voice_22_0](https://huggingface.co/datasets/fsicoli/common_voice_22_0) (nl): the Common Voice 22 Dutch train and validation splits combined with the full synthetic OpenAI-TTS Dutch corpus (~34.9k clips), shuffled with the run seed.
After duration and transcript-length filtering: **57,000**
training samples and **12,033** validation samples.

**Recipe:** follows the
[official QwenLM SFT recipe](https://github.com/QwenLM/Qwen3-ASR/tree/main/finetuning)
with our local hyperparameters:

| Parameter | Value |
|---|---|
| Learning rate | 2e-5 |
| Scheduler | linear |
| Warmup ratio | 0.02 |
| Per-device batch size | 92 |
| Gradient accumulation | 2 |
| Effective batch size | 184 |
| Epochs | 2 (early-stopped from 3; eval_loss best at step 800 / epoch 1.69) |
| Precision | bf16 mixed |
| Gradient checkpointing | enabled |
| Optimizer | AdamW (fused) |

Trained on a single H100. The best checkpoint was selected by validation
loss.
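
For anyone reproducing the run, the table can be written down as a plain configuration dictionary. The keys below follow the field names of `transformers.TrainingArguments`; that mapping is an assumption on our part, since the card does not show the actual training script arguments.

```python
# Hyperparameters from the table above. Key names mirror
# transformers.TrainingArguments fields (assumed mapping, not the
# project's actual launch configuration).
training_config = {
    "learning_rate": 2e-5,
    "lr_scheduler_type": "linear",
    "warmup_ratio": 0.02,
    "per_device_train_batch_size": 92,
    "gradient_accumulation_steps": 2,
    "num_train_epochs": 3,          # budget; the run was stopped early
    "bf16": True,
    "gradient_checkpointing": True,
    "optim": "adamw_torch_fused",
}
num_gpus = 1  # single H100

# Effective batch size = per-device batch * accumulation steps * devices
effective_batch_size = (
    training_config["per_device_train_batch_size"]
    * training_config["gradient_accumulation_steps"]
    * num_gpus
)
print(effective_batch_size)  # 184
```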

## ⚠️ Limitations

- Trained on read-speech-dominated Common Voice data plus synthetic TTS
  clips. Accuracy may degrade on conversational, overlapping-speaker,
  far-field, or strongly accented audio.
- Outputs Dutch text. Cross-lingual or code-switched audio is not
  targeted.
- Punctuation and casing are best-effort and inherit the inconsistencies of
  the Common Voice reference transcripts (mitigated, but not eliminated, by
  the normalisation step above).

## 🙏 Acknowledgements

This model would not exist without the work of others. Thank you to:

- **The Qwen team at Alibaba Cloud** for releasing
  [Qwen3-ASR-0.6B](https://huggingface.co/Qwen/Qwen3-ASR-0.6B) — the backbone of
  this fine-tune — together with a clean, reproducible
  [SFT recipe](https://github.com/QwenLM/Qwen3-ASR/tree/main/finetuning) and
  the [Qwen3-ASR Technical Report](https://arxiv.org/abs/2601.21337).
- **The Mozilla Common Voice community** for collecting and releasing the
  Dutch speech corpus used for training and evaluation
  ([Common Voice 22](https://huggingface.co/datasets/fsicoli/common_voice_22_0),
  [Common Voice 17 mirror](https://huggingface.co/datasets/fixie-ai/common_voice_17_0)).
- **Every contributor** who recorded, validated, or transcribed a clip in
  Common Voice. This model is, very literally, your voices.

## 📚 Citation

If this model is useful in your work, please cite the base Qwen3-ASR report:

```bibtex
@article{qwen3asr2026,
  title  = {Qwen3-ASR Technical Report},
  author = {Qwen Team},
  year   = {2026},
  url    = {https://arxiv.org/abs/2601.21337}
}
```

And, if relevant, this Dutch fine-tune:

```
yuriyvnv/Qwen3-ASR-0.6B-NL: Dutch fine-tune of Qwen3-ASR-0.6B,
trained on Common-Voice-derived data with WAVe-based quality filtering.
```