🎙️ Qwen3-ASR-0.6B-NL — Dutch Speech Recognition

A Dutch-specialised automatic speech recognition (ASR) model, fine-tuned from Qwen/Qwen3-ASR-0.6B. It outputs cased, punctuated Dutch text and works as a drop-in replacement for the base model.

Update: v2 release. The v1 run (mixed_nl: synthetic_nl + CV22-nl train, 5 epochs) reached 9.06% WER on CV17-nl test and 8.95% WER on CV22-nl test. For v2 we fold CV22-nl train + validation together with the full synthetic_transcript_nl corpus (~34.9k clips) into one training set, train with a 3-epoch budget, and validate only on the held-out CV17-nl and CV22-nl test sets. The run was early-stopped after roughly two of the three planned epochs: eval_loss bottomed out at step 800 (epoch 1.69, eval_loss 0.1249) and had started rising by step 1000. The checkpoint published here is step 800.

On Common Voice 17 (test) it reaches 8.11% WER (down from 12.42% zero-shot, -34.7% relative).

📊 Results

WER and CER on held-out Common Voice test sets — same samples, same protocol, no test-time tricks. "Zero-shot" is the base Qwen/Qwen3-ASR-0.6B called with language="Dutch". The fine-tuned numbers are bold.

Test set	Samples	Zero-shot WER (%)	Fine-tuned WER (%)	Δ WER (absolute, relative)	CER (%) (zero-shot → fine-tuned)
Common Voice 17 (test)	11,266	12.42	8.11	-4.31 (-34.7%)	4.06 → 2.68
Common Voice 22 (test)	12,033	12.46	8.31	-4.15 (-33.3%)	4.11 → 2.76

Lower is better. Both held-out test sets see roughly a one-third relative reduction in word error rate versus the already-strong base model.
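The Δ WER column can be re-derived from the two WER columns. The snippet below is a minimal sketch of that arithmetic; the helper name `relative_reduction` is ours for illustration and is not part of the project code.

```python
def relative_reduction(baseline: float, finetuned: float) -> float:
    """Relative change in percent; negative means the error rate went down."""
    return (finetuned - baseline) / baseline * 100.0


# (zero-shot WER, fine-tuned WER) pairs copied from the results table.
results = {
    "Common Voice 17 (test)": (12.42, 8.11),
    "Common Voice 22 (test)": (12.46, 8.31),
}

for name, (zero_shot, fine_tuned) in results.items():
    delta = fine_tuned - zero_shot
    rel = relative_reduction(zero_shot, fine_tuned)
    print(f"{name}: {delta:+.2f} absolute, {rel:+.1f}% relative")
# Common Voice 17 (test): -4.31 absolute, -34.7% relative
# Common Voice 22 (test): -4.15 absolute, -33.3% relative
```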

🔬 Reproducibility note. Both the zero-shot baseline and the fine-tuned numbers above were measured with the same evaluation function (train_qwen3_asr.evaluate_model), the same greedy decoding settings, and the same reference normalisation (see next section). This is an apples-to-apples comparison.
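For readers unfamiliar with the metrics, here is a minimal, dependency-free sketch of how WER and CER are commonly computed. It is not the project's `train_qwen3_asr.evaluate_model`, whose internals are not shown in this card. The function names, the corpus-level aggregation (total edits divided by total reference units), and the choice to count spaces as characters for CER are all assumptions made for illustration.

```python
def edit_distance(ref: list, hyp: list) -> int:
    """Levenshtein distance between two token sequences (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(refs: list, hyps: list, unit: str = "word") -> float:
    """Corpus-level error rate in percent; unit is "word" (WER) or "char" (CER)."""
    split = str.split if unit == "word" else list
    errors = total = 0
    for ref, hyp in zip(refs, hyps):
        r, h = split(ref), split(hyp)
        errors += edit_distance(r, h)
        total += len(r)
    if total == 0:
        raise ValueError("references contain no tokens")
    return 100.0 * errors / total


refs = ["De kat slaapt."]
hyps = ["De kat slaap."]
print(f"WER {error_rate(refs, hyps, 'word'):.2f}%")  # 1 of 3 words wrong
print(f"CER {error_rate(refs, hyps, 'char'):.2f}%")  # 1 of 14 characters wrong
```

Because punctuation and casing are part of the tokens in this sketch, a missing final period counts as a word error, which is why the reference normalisation matters for the comparison.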

🧹 Reference / target normalisation

Common Voice transcripts are crowd-sourced and inconsistent in casing and trailing punctuation. To give the model a clean, predictable target distribution we apply a small, deterministic written-form normalisation to every reference at load time, both during training and during evaluation:

Capitalise the first letter if it is lowercase.
Collapse trailing dots: any run of `.` or `…` at the end (for example `..`, `...`, `…`) is replaced with a single `.`.
Append a terminal period if the sentence does not already end in terminal punctuation (`.`, `!`, `?`, `…`) or a closing bracket or quote (`)`, `]`, `}`, `"`, `'`, etc.).

The exact function lives in src/evaluation/score_written_form.py of the project repository. Concretely:

Raw reference	Normalised
`bom dia`	`Bom dia.`
`o gato dorme...`	`O gato dorme.`
`como estás?`	`Como estás?` (punctuation unchanged)
`"oi"`	`"Oi"` (closing quote → no `.`)

Because the same normalisation is applied to references used for the zero-shot baseline above, the gain reported in the results table reflects the fine-tune itself — not a metric quirk caused by mismatched references.
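The three rules in this section can be sketched in a few lines of Python. This is an illustrative reimplementation, not the function in src/evaluation/score_written_form.py. The name `normalise_reference`, the whitespace stripping, the exact set of closing characters (the list above ends in "etc."), and the choice to skip leading non-alphanumeric characters when looking for the first letter are assumptions; the last one is inferred from the `"oi"` example in the table.

```python
import re

TERMINAL = ".!?…"    # terminal punctuation named in rule 3
CLOSERS = ")]}\"'"   # assumed set of closing brackets and quotes


def normalise_reference(text: str) -> str:
    text = text.strip()  # assumption: surrounding whitespace is dropped
    if not text:
        return text
    # Rule 1: upper-case the first alphanumeric character if it is a lowercase letter.
    for i, ch in enumerate(text):
        if ch.isalnum():
            if ch.islower():
                text = text[:i] + ch.upper() + text[i + 1:]
            break
    # Rule 2: collapse any trailing run of "." or "…" into a single ".".
    text = re.sub(r"[.…]+$", ".", text)
    # Rule 3: append "." unless the text ends in terminal punctuation or a closer.
    if text[-1] not in TERMINAL + CLOSERS:
        text += "."
    return text


# The raw references from the table above.
for raw in ["bom dia", "o gato dorme...", "como estás?", '"oi"']:
    print(repr(raw), "->", repr(normalise_reference(raw)))
# 'bom dia' -> 'Bom dia.'
# 'o gato dorme...' -> 'O gato dorme.'
# 'como estás?' -> 'Como estás?'
# '"oi"' -> '"Oi"'
```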

🚀 How to use

Install the official qwen-asr package, then load this model exactly the same way you would load the base Qwen3-ASR:

pip install qwen-asr

import torch
from qwen_asr import Qwen3ASRModel

model = Qwen3ASRModel.from_pretrained(
    "yuriyvnv/Qwen3-ASR-0.6B-NL",
    dtype=torch.bfloat16,
    device_map="cuda:0",
)

result = model.transcribe(audio="audio.wav", language="Dutch")
print(result[0].text)

Batch inference, automatic language detection, streaming, and vLLM serving all work identically to the base model — see the upstream Qwen3-ASR documentation for details.

🛠️ Training

Dataset: yuriyvnv/synthetic_transcript_nl + fsicoli/common_voice_22_0 (nl) — Common Voice 22 Dutch train + validation combined with the full synthetic OpenAI-TTS Dutch corpus (~34.9k clips), shuffled with the run seed. After duration filtering and transcript-length filtering: 57,000 training samples and 12,033 validation samples.

Recipe: follows the official QwenLM SFT recipe with our local hyperparameters:

Parameter	Value
Learning rate	2e-5
Scheduler	linear
Warmup ratio	0.02
Per-device batch size	92
Gradient accumulation	2
Effective batch size	184
Epochs	2 (early-stopped from 3; eval_loss best at step 800 / epoch 1.69)
Precision	bf16 mixed
Gradient checkpointing	enabled
Optimizer	AdamW (fused)

Trained on a single H100. The best checkpoint was selected by validation loss.
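Two entries in the hyperparameter table follow from the others, and the sketch below shows how. The effective batch size is the per-device batch size times the gradient accumulation steps times the number of devices. The learning-rate function is a generic linear warmup followed by linear decay; whether the training script computes the schedule exactly this way is an assumption, and `total_steps=1000` is a hypothetical value chosen only to make the example concrete.

```python
import math


def effective_batch_size(per_device: int, grad_accum: int, num_devices: int = 1) -> int:
    """Number of samples contributing to each optimizer step."""
    return per_device * grad_accum * num_devices


def linear_schedule_lr(step: int, total_steps: int,
                       base_lr: float = 2e-5, warmup_ratio: float = 0.02) -> float:
    """Linear warmup from 0 to base_lr, then linear decay to 0 at total_steps."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


print(effective_batch_size(per_device=92, grad_accum=2, num_devices=1))  # 184

total_steps = 1000  # hypothetical, for illustration only
for step in (0, 10, 20, 510, 1000):
    print(step, f"{linear_schedule_lr(step, total_steps):.2e}")
```

With a warmup ratio of 0.02 the warmup phase is short, so the model trains at close to the peak learning rate of 2e-5 almost from the start.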

⚠️ Limitations

Trained on Common Voice and synthetic TTS clips, both dominated by read speech. Conversational, overlapping-speaker, far-field, or strongly accented audio may degrade accuracy.
Outputs Dutch text. Cross-lingual or code-switched audio is not targeted.
Punctuation and casing are best-effort and inherit the inconsistencies of the Common Voice reference transcripts (mitigated, but not eliminated, by the normalisation step above).

🙏 Acknowledgements

This model would not exist without the work of others. Thank you to:

The Qwen team at Alibaba Cloud for releasing Qwen3-ASR-0.6B — the backbone of this fine-tune — together with a clean, reproducible SFT recipe and the Qwen3-ASR Technical Report.
The Mozilla Common Voice community for collecting and releasing the Dutch speech corpus used for training and evaluation (Common Voice 22, Common Voice 17 mirror).
Every contributor who recorded, validated, or transcribed a clip in Common Voice. This model is, very literally, your voices.

📚 Citation

If this model is useful in your work, please cite the base Qwen3-ASR report:

@article{qwen3asr2025,
  title  = {Qwen3-ASR Technical Report},
  author = {Qwen Team},
  year   = {2025},
  url    = {https://arxiv.org/abs/2601.21337}
}

And, if relevant, this Dutch fine-tune:

yuriyvnv/Qwen3-ASR-0.6B-NL: Dutch fine-tune of Qwen3-ASR-0.6B, trained on Common-Voice-derived data with WAVe-based quality filtering.