🎙️ Qwen3-ASR-0.6B-NL — Dutch Speech Recognition
A Dutch-specialised automatic speech recognition (ASR) model, fine-tuned from Qwen/Qwen3-ASR-0.6B. It outputs cased, punctuated Dutch text and works as a drop-in replacement for the base model.
Update. v2 release. The v1 run (mixed_nl: synthetic_nl + CV22-nl train, 5 epochs) reached 9.06% WER on CV17-nl test and 8.95% WER on CV22-nl test. For v2 we fold CV22-nl train + validation together with the full synthetic_transcript_nl corpus (~34.9k clips) into one training set, train with a 3-epoch budget, and validate only on the held-out CV17-nl and CV22-nl test sets. The run was early-stopped at the start of epoch 2 because eval_loss bottomed at step 800 (epoch 1.69, eval_loss 0.1249) and started rising at step 1000 — the checkpoint published here is step 800.
On Common Voice 17 (test) it reaches 8.11% WER (down from 12.42% zero-shot, -34.7% relative).
📊 Results
WER and CER on held-out Common Voice test sets: same samples, same protocol, no test-time tricks. "Zero-shot" is the base Qwen/Qwen3-ASR-0.6B called with `language="Dutch"`. The fine-tuned numbers are bold.
| Test set | Samples | Zero-shot WER | Fine-tuned WER | Δ WER | CER (zero-shot → fine-tuned) |
|---|---|---|---|---|---|
| Common Voice 17 (test) | 11,266 | 12.42 | **8.11** | -4.31 (-34.7%) | 4.06 → **2.68** |
| Common Voice 22 (test) | 12,033 | 12.46 | **8.31** | -4.15 (-33.3%) | 4.11 → **2.76** |
Lower is better. Both held-out test sets see roughly a one-third relative reduction in word error rate versus the already-strong base model.
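The Δ WER column can be re-derived from the two WER columns. The short sketch below is only a worked check of that arithmetic; the helper name `relative_change` is illustrative and not part of the project code.

```python
def relative_change(baseline: float, finetuned: float) -> float:
    """Relative change in percent; a negative value means the error rate went down."""
    return (finetuned - baseline) / baseline * 100.0


# (zero-shot WER, fine-tuned WER) copied from the results table above.
results = {
    "Common Voice 17 (test)": (12.42, 8.11),
    "Common Voice 22 (test)": (12.46, 8.31),
}

for name, (zero_shot, fine_tuned) in results.items():
    absolute = fine_tuned - zero_shot
    relative = relative_change(zero_shot, fine_tuned)
    print(f"{name}: {absolute:+.2f} absolute, {relative:+.1f}% relative")
```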
🔬 Reproducibility note. Both the zero-shot baseline and the fine-tuned numbers above were measured with the same evaluation function (`train_qwen3_asr.evaluate_model`), the same greedy decoding settings, and the same reference normalisation (see next section). This is an apples-to-apples comparison.
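For readers unfamiliar with the metrics, here is a minimal, dependency-free sketch of how WER and CER are commonly computed from an edit distance. It is not the project's `evaluate_model` function. It assumes corpus-level aggregation (total errors divided by total reference length), which the card does not specify, and the function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference token
                curr[j - 1] + 1,     # insert a hypothesis token
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def wer(references, hypotheses):
    """Word error rate in percent, aggregated over the whole corpus (assumption)."""
    errors = sum(edit_distance(r.split(), h.split())
                 for r, h in zip(references, hypotheses))
    words = sum(len(r.split()) for r in references)
    return 100.0 * errors / words


def cer(references, hypotheses):
    """Character error rate in percent, aggregated over the whole corpus (assumption)."""
    errors = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    chars = sum(len(r) for r in references)
    return 100.0 * errors / chars


refs = ["De kat slaapt."]
hyps = ["De kat slaap."]
print(f"WER {wer(refs, hyps):.2f}%  CER {cer(refs, hyps):.2f}%")
```

Because both metrics compare against the reference string, any change to reference casing or punctuation shifts the score, which is why the same normalisation has to be applied to both runs being compared.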
🧹 Reference / target normalisation
Common Voice transcripts are crowd-sourced and inconsistent in casing and trailing punctuation. To give the model a clean, predictable target distribution we apply a small, deterministic written-form normalisation to every reference at load time, both during training and during evaluation:
- Capitalise the first letter if it is lowercase.
- Collapse trailing dots: any sequence of `.`, `…`, `..`, `...` at the end is replaced with a single `.`.
- Append a terminal period if the sentence does not already end in terminal punctuation (`. ! ? …`) or a closing bracket / quote (`) ] } " '` etc.).
The exact function lives in src/evaluation/score_written_form.py of the
project repository. Concretely:
| Raw reference | Normalised |
|---|---|
| `bom dia` | `Bom dia.` |
| `o gato dorme...` | `O gato dorme.` |
| `como estás?` | `Como estás?` (unchanged) |
| `"oi"` | `"Oi"` (closing quote → no `.`) |
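The three rules above can be sketched in a few lines of Python. This is an illustrative re-implementation based only on the description and the examples in the table, not the code in `score_written_form.py`. The function name, the exact set of closing characters, and the choice to capitalise the first alphabetic character (rather than the first character) are assumptions.

```python
import re

TERMINAL = ".!?…"
# Assumed set of closing brackets / quotes; the card only lists ) ] } " ' "etc."
CLOSERS = ")]}\"'"


def normalise_reference(text: str) -> str:
    """Apply the three written-form rules described above (sketch)."""
    text = text.strip()
    if not text:
        return text

    # Rule 1: capitalise the first alphabetic character if it is lowercase.
    for i, ch in enumerate(text):
        if ch.isalpha():
            if ch.islower():
                text = text[:i] + ch.upper() + text[i + 1:]
            break

    # Rule 2: collapse any trailing run of dots / ellipsis characters into one ".".
    text = re.sub(r"[.…]+$", ".", text)

    # Rule 3: append "." unless the text already ends in terminal punctuation
    # or a closing bracket / quote.
    if text[-1] not in TERMINAL + CLOSERS:
        text += "."
    return text


for raw in ["bom dia", "o gato dorme...", "como estás?", '"oi"']:
    print(f"{raw!r} -> {normalise_reference(raw)!r}")
```

The order matters: collapsing trailing dots before the terminal-punctuation check means a reference ending in `...` ends up with exactly one period rather than being left alone.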
Because the same normalisation is applied to references used for the zero-shot baseline above, the gain reported in the results table reflects the fine-tune itself — not a metric quirk caused by mismatched references.
🚀 How to use
Install the official `qwen-asr` package, then load this model exactly the same way you would load the base Qwen3-ASR:

```bash
pip install qwen-asr
```

```python
import torch
from qwen_asr import Qwen3ASRModel

model = Qwen3ASRModel.from_pretrained(
    "yuriyvnv/Qwen3-ASR-0.6B-NL",
    dtype=torch.bfloat16,
    device_map="cuda:0",
)

result = model.transcribe(audio="audio.wav", language="Dutch")
print(result[0].text)
```
Batch inference, automatic language detection, streaming, and vLLM serving all work identically to the base model — see the upstream Qwen3-ASR documentation for details.
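For a handful of files, a plain loop over the single-file call shown above is often enough. The sketch below deliberately uses only `model.transcribe(audio=..., language=...)` and `result[0].text` as they appear in the snippet above; it does not use the package's own batch interface, for which the upstream documentation is the reference. The helper name `transcribe_folder` and its parameters are hypothetical.

```python
from pathlib import Path


def transcribe_folder(model, folder, language="Dutch", pattern="*.wav"):
    """Transcribe every matching file in `folder`, one call per file.

    `model` is expected to be an already loaded Qwen3ASRModel (see above).
    Returns a dict mapping file name to transcript text.
    """
    transcripts = {}
    for path in sorted(Path(folder).glob(pattern)):
        result = model.transcribe(audio=str(path), language=language)
        transcripts[path.name] = result[0].text
    return transcripts
```

With `model` loaded as in the previous snippet, `transcribe_folder(model, "recordings/")` would return one transcript per `.wav` file in a (hypothetical) `recordings/` directory.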
🛠️ Training
Dataset: yuriyvnv/synthetic_transcript_nl + fsicoli/common_voice_22_0 (nl) — Common Voice 22 Dutch train + validation combined with the full synthetic OpenAI-TTS Dutch corpus (~34.9k clips), shuffled with the run seed. After duration filtering and transcript-length filtering: 57,000 training samples and 12,033 validation samples.
Recipe: follows the official QwenLM SFT recipe with our local hyperparameters:
| Parameter | Value |
|---|---|
| Learning rate | 2e-5 |
| Scheduler | linear |
| Warmup ratio | 0.02 |
| Per-device batch size | 92 |
| Gradient accumulation | 2 |
| Effective batch size | 184 |
| Epochs | 2 (early-stopped from 3; eval_loss best at step 800 / epoch 1.69) |
| Precision | bf16 mixed |
| Gradient checkpointing | enabled |
| Optimizer | AdamW (fused) |
Trained on a single H100. The best checkpoint was selected by validation loss.
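To make two rows of the table concrete, the sketch below recomputes the effective batch size and shows what a linear schedule with a 0.02 warmup ratio typically looks like: a linear ramp to the peak learning rate, then linear decay to zero. This is a generic illustration, not the training script. The total step count of 1000 is a placeholder, and rounding the warmup length up with `ceil` is an assumption.

```python
import math

LEARNING_RATE = 2e-5
WARMUP_RATIO = 0.02
PER_DEVICE_BATCH = 92
GRAD_ACCUM = 2
NUM_DEVICES = 1  # single H100


def effective_batch_size(per_device: int, grad_accum: int, num_devices: int) -> int:
    """Samples contributing to each optimizer step."""
    return per_device * grad_accum * num_devices


def linear_lr(step: int, total_steps: int,
              base_lr: float = LEARNING_RATE,
              warmup_ratio: float = WARMUP_RATIO) -> float:
    """Linear warmup to base_lr, then linear decay to zero at total_steps."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)  # assumed rounding
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


print(effective_batch_size(PER_DEVICE_BATCH, GRAD_ACCUM, NUM_DEVICES))

TOTAL_STEPS = 1000  # placeholder value, not taken from the actual run
for step in (0, 10, 20, 500, 1000):
    print(step, f"{linear_lr(step, TOTAL_STEPS):.2e}")
```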
⚠️ Limitations
- Trained on Common Voice — read-speech dominated. Conversational, overlapping-speaker, far-field, or strongly accented audio may degrade accuracy.
- Outputs Dutch text. Cross-lingual or code-switched audio is not targeted.
- Punctuation and casing are best-effort and inherit the inconsistencies of the Common Voice reference transcripts (mitigated, but not eliminated, by the normalisation step above).
🙏 Acknowledgements
This model would not exist without the work of others. Thank you to:
- The Qwen team at Alibaba Cloud for releasing Qwen3-ASR-0.6B — the backbone of this fine-tune — together with a clean, reproducible SFT recipe and the Qwen3-ASR Technical Report.
- The Mozilla Common Voice community for collecting and releasing the Dutch speech corpus used for training and evaluation (Common Voice 22, Common Voice 17 mirror).
- Every contributor who recorded, validated, or transcribed a clip in Common Voice. This model is, very literally, your voices.
📚 Citation
If this model is useful in your work, please cite the base Qwen3-ASR report:
```bibtex
@article{qwen3asr2025,
  title  = {Qwen3-ASR Technical Report},
  author = {Qwen Team},
  year   = {2025},
  url    = {https://arxiv.org/abs/2601.21337}
}
```
And, if relevant, this Dutch fine-tune:
yuriyvnv/Qwen3-ASR-0.6B-NL — Dutch fine-tune of Qwen3-ASR-0.6B
trained on Common-Voice-derived data with WAVe-based
quality filtering.