---
language:
- nl
license: apache-2.0
library_name: transformers
tags:
- automatic-speech-recognition
- speech
- qwen3-asr
- qwen
- dutch
- fine-tuned
- common-voice
datasets:
- fsicoli/common_voice_22_0
- fixie-ai/common_voice_17_0
- yuriyvnv/synthetic_transcript_nl
base_model: Qwen/Qwen3-ASR-0.6B
pipeline_tag: automatic-speech-recognition
model-index:
- name: Qwen3-ASR-0.6B-NL
  results:
  - task:
      type: automatic-speech-recognition
    dataset:
      name: Common Voice 17 (test)
      type: fixie-ai/common_voice_17_0
      config: nl
      split: test
    metrics:
    - type: wer
      value: 8.11
      name: WER
    - type: cer
      value: 2.68
      name: CER
  - task:
      type: automatic-speech-recognition
    dataset:
      name: Common Voice 22 (test)
      type: fsicoli/common_voice_22_0
      config: nl
      split: test
    metrics:
    - type: wer
      value: 8.31
      name: WER
    - type: cer
      value: 2.76
      name: CER
---

# 🎙️ Qwen3-ASR-0.6B-NL: Dutch Speech Recognition

A Dutch-specialised automatic speech recognition (ASR) model, fine-tuned from [Qwen/Qwen3-ASR-0.6B](https://huggingface.co/Qwen/Qwen3-ASR-0.6B). It outputs cased, punctuated Dutch text and works as a drop-in replacement for the base model.

> **Update.** v2 release. The v1 run (mixed_nl: synthetic_nl + CV22-nl train, 5 epochs) reached 9.06% WER on CV17-nl test and 8.95% WER on CV22-nl test. For v2 we fold CV22-nl train + validation together with the full synthetic_transcript_nl corpus (~34.9k clips) into one training set, train with a 3-epoch budget, and validate only on the held-out CV17-nl and CV22-nl test sets. The run was early-stopped at the start of epoch 2 because eval_loss bottomed at step 800 (epoch 1.69, eval_loss 0.1249) and started rising at step 1000; the checkpoint published here is step 800.

On **Common Voice 17 (test)** it reaches **8.11% WER** (down from 12.42% zero-shot, **-34.7%** relative).

---

## 📊 Results

WER and CER on held-out Common Voice test sets: same samples, same protocol, no test-time tricks. "Zero-shot" is the base [Qwen/Qwen3-ASR-0.6B](https://huggingface.co/Qwen/Qwen3-ASR-0.6B) called with `language="Dutch"`. The fine-tuned WER and the WER change are **bold**. All WER and CER values are percentages.

| Test set | Samples | Zero-shot WER (%) | **Fine-tuned WER (%)** | Δ WER | CER (%) (zero-shot → fine-tuned) |
|---|---:|---:|---:|---:|:--:|
| Common Voice 17 (test) | 11,266 | 12.42 | **8.11** | **-4.31** (-34.7%) | 4.06 → 2.68 |
| Common Voice 22 (test) | 12,033 | 12.46 | **8.31** | **-4.15** (-33.3%) | 4.11 → 2.76 |

Lower is better. Both held-out test sets see roughly a **one-third relative reduction in word error rate** versus the already-strong base model.

> 🔬 **Reproducibility note.** Both the zero-shot baseline and the fine-tuned numbers above were measured with the *same* evaluation function (`train_qwen3_asr.evaluate_model`), the *same* greedy decoding settings, and the *same* reference normalisation (see next section). This is an apples-to-apples comparison.
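The relative reductions in the results table follow directly from the absolute WER figures. The snippet below is a minimal plain-Python sketch of that arithmetic, together with a sentence-level word error rate based on word-level edit distance. It is an illustration only and not the project's `train_qwen3_asr.evaluate_model`; the function names `word_error_rate` and `relative_reduction` are hypothetical, and the way the project aggregates errors over a whole test set is not specified in this card.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, fine_tuned: float) -> float:
    """Relative improvement over the baseline, in percent."""
    return (baseline - fine_tuned) / baseline * 100.0


# Figures taken from the results table above.
print(round(relative_reduction(12.42, 8.11), 1))  # 34.7 (Common Voice 17)
print(round(relative_reduction(12.46, 8.31), 1))  # 33.3 (Common Voice 22)

# One substituted word out of three reference words.
print(word_error_rate("De kat slaapt.", "De kat slaap."))
```

Note that punctuation stays attached to the words here, so a missing final period counts as a word error. This is why the reference normalisation has to be identical for the baseline and the fine-tuned model.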
## 🧹 Reference / target normalisation

Common Voice transcripts are crowd-sourced and inconsistent in casing and trailing punctuation. To give the model a clean, predictable target distribution we apply a small, deterministic **written-form normalisation** to every reference at load time, both during training and during evaluation:

1. **Capitalise the first letter** if it is lowercase.
2. **Collapse trailing dots**: any trailing run of `.` or `…` (for example `..`, `...`, `…`) is replaced with a single `.`.
3. **Append a terminal period** if the sentence does not already end in terminal punctuation (`. ! ? …`) or a closing bracket / quote (`) ] } " '` etc.).

The exact function lives in `src/evaluation/score_written_form.py` of the project repository. Concretely (the rules are language-agnostic; the examples below happen to be Portuguese):

| Raw reference | Normalised |
|---|---|
| `bom dia` | `Bom dia.` |
| `o gato dorme...` | `O gato dorme.` |
| `como estás?` | `Como estás?` *(unchanged apart from casing)* |
| `"oi"` | `"Oi"` *(closing quote → no `.`)* |

Because the **same** normalisation is applied to the references used for the zero-shot baseline above, the gain reported in the results table reflects the fine-tune itself, **not** a metric quirk caused by mismatched references.

## 🚀 How to use

Install the official `qwen-asr` package, then load this model exactly the same way you would load the base Qwen3-ASR:

```bash
pip install qwen-asr
```

```python
import torch
from qwen_asr import Qwen3ASRModel

model = Qwen3ASRModel.from_pretrained(
    "yuriyvnv/Qwen3-ASR-0.6B-NL",
    dtype=torch.bfloat16,
    device_map="cuda:0",
)

result = model.transcribe(audio="audio.wav", language="Dutch")
print(result[0].text)
```

Batch inference, automatic language detection, streaming, and vLLM serving all work identically to the base model; see the [upstream Qwen3-ASR documentation](https://github.com/QwenLM/Qwen3-ASR) for details.
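If you want to score your own transcriptions against references prepared the same way as in this card, the three written-form normalisation rules described earlier can be approximated as follows. This is a minimal sketch written from the prose description, not a copy of `src/evaluation/score_written_form.py`. The function name `normalise_written_form`, the exact set of closing brackets and quotes (the card only says "etc."), and the choice to skip leading punctuation when looking for the first letter are all assumptions.

```python
import re

# Assumption: the card lists `. ! ? …` as terminal punctuation.
TERMINAL = ".!?\u2026"
# Assumption: the card lists `) ] } " '` "etc."; the typographic closers are a guess.
CLOSERS = ")]}\"'\u201d\u2019\u00bb"


def normalise_written_form(text: str) -> str:
    """Approximate the written-form normalisation described in this card."""
    text = text.strip()
    if not text:
        return text

    # Rule 1: capitalise the first letter if it is lowercase.
    # Leading punctuation such as an opening quote is skipped (assumption).
    for i, ch in enumerate(text):
        if ch.isalnum():
            if ch.islower():
                text = text[:i] + ch.upper() + text[i + 1:]
            break

    # Rule 2: collapse any trailing run of dots or ellipsis characters to one period.
    text = re.sub(r"[.\u2026]+$", ".", text)

    # Rule 3: append a period unless the text already ends in terminal
    # punctuation or a closing bracket / quote.
    if text[-1] not in TERMINAL + CLOSERS:
        text += "."
    return text


# The examples from the normalisation table.
for raw in ["bom dia", "o gato dorme...", "como est\u00e1s?", '"oi"']:
    print(repr(raw), "->", repr(normalise_written_form(raw)))
```

Applying the same function to every reference before computing WER keeps a comparison between models consistent, which is the point of the reproducibility note in the results section.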
## 🛠️ Training

**Dataset:** [yuriyvnv/synthetic_transcript_nl](https://huggingface.co/datasets/yuriyvnv/synthetic_transcript_nl) + [fsicoli/common_voice_22_0](https://huggingface.co/datasets/fsicoli/common_voice_22_0) (nl): Common Voice 22 Dutch train + validation combined with the full synthetic OpenAI-TTS Dutch corpus (~34.9k clips), shuffled with the run seed. After duration filtering and transcript-length filtering: **57,000** training samples and **12,033** validation samples.

**Recipe:** follows the [official QwenLM SFT recipe](https://github.com/QwenLM/Qwen3-ASR/tree/main/finetuning) with our local hyperparameters:

| Parameter | Value |
|---|---|
| Learning rate | 2e-5 |
| Scheduler | linear |
| Warmup ratio | 0.02 |
| Per-device batch size | 92 |
| Gradient accumulation | 2 |
| Effective batch size | 184 |
| Epochs | 2 (early-stopped from 3; eval_loss best at step 800 / epoch 1.69) |
| Precision | bf16 mixed |
| Gradient checkpointing | enabled |
| Optimizer | AdamW (fused) |

Trained on a single H100. The best checkpoint was selected by validation loss.

## ⚠️ Limitations

- Trained on Common Voice, which is dominated by read speech. Accuracy may degrade on conversational, overlapping-speaker, far-field, or strongly accented audio.
- Outputs Dutch text. Cross-lingual or code-switched audio is not targeted.
- Punctuation and casing are best-effort and inherit the inconsistencies of the Common Voice reference transcripts (mitigated, but not eliminated, by the normalisation step above).

## 🙏 Acknowledgements

This model would not exist without the work of others. Thank you to:

- **The Qwen team at Alibaba Cloud** for releasing [Qwen3-ASR-0.6B](https://huggingface.co/Qwen/Qwen3-ASR-0.6B), the backbone of this fine-tune, together with a clean, reproducible [SFT recipe](https://github.com/QwenLM/Qwen3-ASR/tree/main/finetuning) and the [Qwen3-ASR Technical Report](https://arxiv.org/abs/2601.21337).
- **The Mozilla Common Voice community** for collecting and releasing the Dutch speech corpus used for training and evaluation ([Common Voice 22](https://huggingface.co/datasets/fsicoli/common_voice_22_0), [Common Voice 17 mirror](https://huggingface.co/datasets/fixie-ai/common_voice_17_0)).
- **Every contributor** who recorded, validated, or transcribed a clip in Common Voice. This model is, very literally, your voices.

## 📚 Citation

If this model is useful in your work, please cite the base Qwen3-ASR report:

```bibtex
@article{qwen3asr2025,
  title  = {Qwen3-ASR Technical Report},
  author = {Qwen Team},
  year   = {2025},
  url    = {https://arxiv.org/abs/2601.21337}
}
```

And, if relevant, this Dutch fine-tune:

```
yuriyvnv/Qwen3-ASR-0.6B-NL: Dutch fine-tune of Qwen3-ASR-0.6B trained on
Common-Voice-derived data with WAVe-based quality filtering.
```