🎙️ Qwen3-ASR-0.6B-NL — Dutch Speech Recognition

0.6B Parameters Speech to Text Dutch Automatic Speech Recognition Base model bf16 Apache-2.0

A Dutch-specialised automatic speech recognition (ASR) model, fine-tuned from Qwen/Qwen3-ASR-0.6B. It outputs cased, punctuated Dutch text and works as a drop-in replacement for the base model.

Update: v2 release. The v1 run (mixed_nl: synthetic_nl + CV22-nl train, 5 epochs) reached 9.06% WER on CV17-nl test and 8.95% WER on CV22-nl test. For v2 we merge CV22-nl train + validation with the full synthetic_transcript_nl corpus (~34.9k clips) into one training set, train with a 3-epoch budget, and validate only on the held-out CV17-nl and CV22-nl test sets. The run was stopped early, before the 3-epoch budget was used up: eval_loss bottomed out at step 800 (epoch 1.69, eval_loss 0.1249) and started rising at step 1000. The checkpoint published here is step 800.

On the Common Voice 17 test set, the published v2 checkpoint reaches 8.11% WER, down from 12.42% zero-shot (a 34.7% relative reduction).


📊 Results

WER and CER on held-out Common Voice test sets — same samples, same protocol, no test-time tricks. "Zero-shot" is the base Qwen/Qwen3-ASR-0.6B called with language="Dutch". The fine-tuned numbers are bold.

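| Test set | Samples | Zero-shot WER (%) | Fine-tuned WER (%) | Δ WER, absolute (relative) | CER (%), zero-shot → fine-tuned |
|---|---|---|---|---|---|
| Common Voice 17 (test) | 11,266 | 12.42 | **8.11** | -4.31 (-34.7%) | 4.06 → **2.68** |
| Common Voice 22 (test) | 12,033 | 12.46 | **8.31** | -4.15 (-33.3%) | 4.11 → **2.76** |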

Lower is better. Both held-out test sets see roughly a one-third relative reduction in word error rate versus the already-strong base model.

🔬 Reproducibility note. Both the zero-shot baseline and the fine-tuned numbers above were measured with the same evaluation function (train_qwen3_asr.evaluate_model), the same greedy decoding settings, and the same reference normalisation (see next section). This is an apples-to-apples comparison.
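
As a rough illustration of how the figures in the results table are derived, the sketch below computes a corpus-level WER from word-level edit distance, plus the relative change between two scores. It is not the project's train_qwen3_asr.evaluate_model: the function names (edit_distance, wer, relative_change) and the choice to pool errors over the whole corpus are assumptions made for this example.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(references: list[str], hypotheses: list[str]) -> float:
    """Corpus-level word error rate in percent (assumed pooling scheme)."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    errors = sum(
        edit_distance(r.split(), h.split())
        for r, h in zip(references, hypotheses)
    )
    words = sum(len(r.split()) for r in references)
    if words == 0:
        raise ValueError("references contain no words")
    return 100.0 * errors / words


def relative_change(baseline: float, new: float) -> float:
    """Relative change in percent; negative means the new score is lower."""
    return 100.0 * (new - baseline) / baseline
```

Applied to the table, relative_change(12.42, 8.11) gives about -34.7 and relative_change(12.46, 8.31) about -33.3, matching the Δ WER column.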

🧹 Reference / target normalisation

Common Voice transcripts are crowd-sourced and inconsistent in casing and trailing punctuation. To give the model a clean, predictable target distribution we apply a small, deterministic written-form normalisation to every reference at load time, both during training and during evaluation:

  1. Capitalise the first letter if it is lowercase.
  2. Collapse trailing dots: any sequence of dots at the end (., .., ...) is replaced with a single period.
  3. Append a terminal period if the sentence does not already end in terminal punctuation (. ! ? …) or in a closing bracket or quote, such as ) ] } " or '.

The exact function lives in src/evaluation/score_written_form.py of the project repository. Concretely:

| Raw reference | Normalised |
|---|---|
| bom dia | Bom dia. |
| o gato dorme... | O gato dorme. |
| como estás? | Como estás? (terminal punctuation unchanged) |
| "oi" | "Oi" (ends in a closing quote, so no period is appended) |

Because the same normalisation is applied to references used for the zero-shot baseline above, the gain reported in the results table reflects the fine-tune itself — not a metric quirk caused by mismatched references.
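
The three rules can be sketched in a few lines of Python. This is a minimal reading of the description above, not the function in src/evaluation/score_written_form.py: the name normalise_reference, the exact set of closing characters, and the decision to capitalise the first alphabetic character (so that a leading quote is skipped, as in the last table row) are assumptions.

```python
import re

TERMINAL_PUNCTUATION = ".!?…"
# Assumed set of closing brackets and quotes; the real list may be longer.
CLOSING_CHARACTERS = ")]}\"'"


def normalise_reference(text: str) -> str:
    """Apply the written-form normalisation to one reference transcript."""
    text = text.strip()
    if not text:
        return text

    # Rule 1: capitalise the first letter if it is lowercase.
    for i, ch in enumerate(text):
        if ch.isalpha():
            if ch.islower():
                text = text[:i] + ch.upper() + text[i + 1:]
            break

    # Rule 2: collapse a trailing run of dots into a single period.
    text = re.sub(r"\.{2,}$", ".", text)

    # Rule 3: append a period unless the text already ends in terminal
    # punctuation or a closing bracket / quote.
    if text[-1] not in TERMINAL_PUNCTUATION + CLOSING_CHARACTERS:
        text += "."
    return text
```

Run on the raw references in the table, this sketch reproduces the normalised column.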

🚀 How to use

Install the official qwen-asr package, then load this model exactly the same way you would load the base Qwen3-ASR:

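```bash
pip install qwen-asr
```

```python
import torch
from qwen_asr import Qwen3ASRModel

# Load the fine-tuned checkpoint in bf16 on the first GPU.
model = Qwen3ASRModel.from_pretrained(
    "yuriyvnv/Qwen3-ASR-0.6B-NL",
    dtype=torch.bfloat16,
    device_map="cuda:0",
)

# Transcribe a single file with the language set explicitly to Dutch.
result = model.transcribe(audio="audio.wav", language="Dutch")
print(result[0].text)
```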

Batch inference, automatic language detection, streaming, and vLLM serving all work identically to the base model — see the upstream Qwen3-ASR documentation for details.

🛠️ Training

Dataset: yuriyvnv/synthetic_transcript_nl + fsicoli/common_voice_22_0 (nl) — Common Voice 22 Dutch train + validation combined with the full synthetic OpenAI-TTS Dutch corpus (~34.9k clips), shuffled with the run seed. After duration filtering and transcript-length filtering: 57,000 training samples and 12,033 validation samples.

Recipe: follows the official QwenLM SFT recipe with our local hyperparameters:

| Parameter | Value |
|---|---|
| Learning rate | 2e-5 |
| Scheduler | linear |
| Warmup ratio | 0.02 |
| Per-device batch size | 92 |
| Gradient accumulation | 2 |
| Effective batch size | 184 |
| Epochs | 2 (early-stopped from 3; eval_loss best at step 800 / epoch 1.69) |
| Precision | bf16 mixed |
| Gradient checkpointing | enabled |
| Optimizer | AdamW (fused) |

Trained on a single H100. The best checkpoint was selected by validation loss.
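
The effective batch size in the table is the per-device batch size times the gradient-accumulation steps times the number of devices (92 × 2 × 1 = 184). The sketch below shows that arithmetic together with one common reading of a linear scheduler with warmup: a linear ramp from 0 to the peak learning rate over the first 2% of steps, followed by a linear decay to 0. The helper names and the total_steps value in the demo are hypothetical; the total step count the run was configured with is not stated here, and the actual scheduler implementation in the training script may differ.

```python
import math


def effective_batch_size(per_device: int, grad_accum: int, num_devices: int = 1) -> int:
    """Number of samples contributing to one optimizer step."""
    return per_device * grad_accum * num_devices


def linear_schedule_lr(
    step: int,
    total_steps: int,
    base_lr: float = 2e-5,
    warmup_ratio: float = 0.02,
) -> float:
    """Learning rate at a given optimizer step: linear warmup, then linear decay."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Ramp up from 0 to base_lr during warmup.
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr to 0 over the remaining steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


if __name__ == "__main__":
    print(effective_batch_size(per_device=92, grad_accum=2))
    # total_steps=1000 is a made-up value used only to show the curve's shape.
    for step in (0, 10, 20, 510, 1000):
        print(step, linear_schedule_lr(step, total_steps=1000))
```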

⚠️ Limitations

  • Trained on Common Voice — read-speech dominated. Conversational, overlapping-speaker, far-field, or strongly accented audio may degrade accuracy.
  • Outputs Dutch text. Cross-lingual or code-switched audio is not targeted.
  • Punctuation and casing are best-effort and inherit the inconsistencies of the Common Voice reference transcripts (mitigated, but not eliminated, by the normalisation step above).

🙏 Acknowledgements

This model would not exist without the work of others. Thank you to:

  • The Qwen team at Alibaba Cloud for releasing Qwen3-ASR-0.6B — the backbone of this fine-tune — together with a clean, reproducible SFT recipe and the Qwen3-ASR Technical Report.
  • The Mozilla Common Voice community for collecting and releasing the Dutch speech corpus used for training and evaluation (Common Voice 22, Common Voice 17 mirror).
  • Every contributor who recorded, validated, or transcribed a clip in Common Voice. This model is, very literally, your voices.

📚 Citation

If this model is useful in your work, please cite the base Qwen3-ASR report:

```bibtex
@article{qwen3asr2025,
  title  = {Qwen3-ASR Technical Report},
  author = {Qwen Team},
  year   = {2026},
  url    = {https://arxiv.org/abs/2601.21337}
}
```

And, if relevant, this Dutch fine-tune:

yuriyvnv/Qwen3-ASR-0.6B-NL: Dutch fine-tune of Qwen3-ASR-0.6B, trained on Common-Voice-derived data with WAVe-based quality filtering.