Whisper-large-v3: Arabic 4-dialect FT v3 (CT2 int8)
CTranslate2 int8 build of openai/whisper-large-v3 (1.55 B base) fine-tuned with QLoRA on 38 hours of dialect-balanced Arabic from Casablanca (5 countries) + cleaned MGB-3 + MASC + Common Voice 18. 1.56 GB on disk; runs at roughly real-time speed on a commodity CPU (RTF about 0.5 to 1.7 on a c3-standard-8, depending on dialect).
The "v3" of the project: a larger base than the v2 turbo model and more than five times the training data (38 h vs 7 h). For the smaller 809M turbo variant, see whisper-large-v3-turbo-arabic-ft-ct2-int8.
For the float32 PyTorch version, see whisper-large-v3-arabic-ft-v3. For the LoRA adapter (for further fine-tuning, with the Git history of every save during training), see whisper-large-v3-arabic-ft-v3-lora.
Headline WER on mixed-domain test sets (n=100/dialect, int8, beam=2, threads=8, c3-standard-8)
Test composition: 50% Casablanca + 50% broadcast (MGB-3 / MASC) for Egyptian and Levantine; 100% Casablanca UAE for Gulf (no public broadcast Gulf source); 100% FLEURS broadcast for MSA. The same recordings and decoding config were used for both systems.
| Dialect | Test composition | Zero-shot Whisper-large-v3 | This model (v3-ft) | Δ |
|---|---|---|---|---|
| MSA | FLEURS broadcast | 8.51% | 10.52% | +2.01 pp |
| Egyptian | 50 Casablanca + 50 MGB-3 | 38.48% | 23.90% | −14.58 pp |
| Levantine | 50 Casablanca JO + 50 MASC | 37.70% | 30.63% | −7.07 pp |
| Gulf | Casablanca UAE | 52.72% | 41.46% | −11.26 pp |
| avg-4 | | 34.35% | 26.63% | −7.72 pp |
v3-ft beats zero-shot Whisper-large-v3 by 7.72 pp average WER, with double-digit gains on Egyptian and Gulf. The MSA regression of 2.01 pp is the well-known dialect-vs-MSA tradeoff; see paper §6 for details.
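The deltas and the avg-4 row can be reproduced from the per-dialect numbers. The snippet below is a minimal sketch that assumes avg-4 is an unweighted mean over the four dialects and that the average delta is the difference of the two rounded averages (averaging the four unrounded deltas gives −7.725, which sits on a rounding boundary). The dictionary and variable names are illustrative.

```python
# WER (%) per dialect, copied from the table: (zero-shot large-v3, v3-ft).
wer = {
    "MSA": (8.51, 10.52),
    "Egyptian": (38.48, 23.90),
    "Levantine": (37.70, 30.63),
    "Gulf": (52.72, 41.46),
}

# Per-dialect change in percentage points (negative means v3-ft is better).
delta_pp = {d: round(ft - zs, 2) for d, (zs, ft) in wer.items()}

# Unweighted means over the four dialects, rounded as in the avg-4 row.
avg_zero_shot = round(sum(zs for zs, _ in wer.values()) / len(wer), 2)
avg_finetuned = round(sum(ft for _, ft in wer.values()) / len(wer), 2)
avg_delta_pp = round(avg_finetuned - avg_zero_shot, 2)

for dialect, d in delta_pp.items():
    print(f"{dialect:10s} {d:+.2f} pp")
print(f"avg-4      {avg_zero_shot:.2f}% -> {avg_finetuned:.2f}% ({avg_delta_pp:+.2f} pp)")
```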
Quickstart
pip install faster-whisper
huggingface-cli download dev-ahmedhany/whisper-large-v3-arabic-ft-v3-ct2-int8 \
  --local-dir ./whisper-ar-v3-int8
from faster_whisper import WhisperModel

model = WhisperModel(
    "./whisper-ar-v3-int8",
    device="cpu", compute_type="int8", cpu_threads=8,
)
segments, info = model.transcribe(
    "audio.wav",
    beam_size=2,  # paper §6.6 sweet spot
    language="ar",
    task="transcribe",
)
# segments is a generator; joining it is what actually runs the decoding
print(" ".join(s.text for s in segments))
Inference RTF ≈ 0.5-0.8 (MSA) to 1.4-1.7 (dialects) on a c3-standard-8 (Intel Sapphire Rapids), peak RAM ~3.1 GB.
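To check RTF on your own hardware, time a full transcription and divide by the audio length. This is a minimal sketch, assuming RTF means wall-clock processing time divided by audio duration (so values below 1.0 are faster than real time); this card does not state its exact definition. `real_time_factor` and `timed_transcribe` are hypothetical helper names. The sketch relies on `info.duration`, which faster-whisper reports in seconds, and on `segments` being a generator, so decoding only runs once it is consumed.

```python
import time


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Assumed definition: processing time / audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def timed_transcribe(model, path, **transcribe_kwargs):
    """Transcribe one file and return (text, rtf).

    `model` is expected to behave like faster_whisper.WhisperModel.
    """
    start = time.perf_counter()
    segments, info = model.transcribe(path, **transcribe_kwargs)
    # Consuming the generator is what triggers the actual decoding work.
    text = " ".join(s.text for s in segments)
    elapsed = time.perf_counter() - start
    return text, real_time_factor(elapsed, info.duration)


# Usage, with the model loaded as in the Quickstart:
# text, rtf = timed_transcribe(model, "audio.wav", beam_size=2, language="ar")
# print(f"RTF = {rtf:.2f}")
```

Model loading is deliberately left out of the timed region, so the number reflects decoding only.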
Training recipe
- Base: openai/whisper-large-v3 (1.55 B params)
- QLoRA: NF4 + bf16 compute, r=8, α=16, dropout 0.05, target modules = q/v/k/out_proj + fc1/fc2
- Optimizer: paged_adamw_8bit, lr = 1e-4, warmup ratio 0.1
- Effective batch 16 (per-device 4 × grad-accum 4), gradient_checkpointing
- Best checkpoint: step 4750 of 10000 (early-stop after plateau across ckpts 5000-6000)
- Training data: ~38 h, 26,817 train / 924 val rows
  - MSA (9.6 h): Common Voice 18 Arabic (capped 15 h)
  - Egyptian (18.6 h): Casablanca Egypt + MGB-3 + cleaned MGB-3
  - Levantine (9.8 h): Casablanca Jordan + Palestine + MASC
  - Gulf (1.9 h): Casablanca UAE + Yemen
  - Maghrebi excluded (84.7% zero-shot WER, too far gone for the QLoRA budget; see paper §3.7)
- Compute: ~25-28 h on a single GCP L4 (g2-standard-16), ~$25 GPU + ~$5 CPU bench evals
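The QLoRA bullets above map onto Hugging Face `peft` and `transformers` settings roughly as follows. This is a sketch of one plausible configuration, not the training script used for this model: the module names `q_proj`, `k_proj`, `v_proj`, `out_proj`, `fc1`, `fc2` are an interpretation of "q/v/k/out_proj + fc1/fc2" against the Hugging Face Whisper implementation, and `build_qlora_configs` is a hypothetical helper. Imports are deferred so the hyperparameters can be inspected without a GPU stack installed.

```python
# Hyperparameters taken from the recipe above.
LORA_KWARGS = {
    "r": 8,
    "lora_alpha": 16,
    "lora_dropout": 0.05,
    # Assumed Hugging Face Whisper module names for "q/v/k/out_proj + fc1/fc2".
    "target_modules": ["q_proj", "k_proj", "v_proj", "out_proj", "fc1", "fc2"],
}
QUANT_KWARGS = {"load_in_4bit": True, "bnb_4bit_quant_type": "nf4"}

# Effective batch = per-device batch x gradient accumulation steps (single GPU).
PER_DEVICE_BATCH = 4
GRAD_ACCUM_STEPS = 4
EFFECTIVE_BATCH = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS


def build_qlora_configs():
    """Return (BitsAndBytesConfig, LoraConfig); needs torch, transformers and peft."""
    import torch
    from peft import LoraConfig
    from transformers import BitsAndBytesConfig

    # NF4 weights with bf16 compute, as listed in the recipe.
    quant = BitsAndBytesConfig(bnb_4bit_compute_dtype=torch.bfloat16, **QUANT_KWARGS)
    lora = LoraConfig(**LORA_KWARGS)
    return quant, lora
```

In a typical setup the quantization config would be passed as `quantization_config=` to `WhisperForConditionalGeneration.from_pretrained`, and the LoRA config to `peft.get_peft_model`.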
Conversion to CT2 int8
ct2-transformers-converter \
  --model checkpoints/v3-merged \
  --output_dir checkpoints/v3-ct2-int8 \
  --quantization int8 \
  --copy_files preprocessor_config.json tokenizer_config.json normalizer.json \
    special_tokens_map.json added_tokens.json merges.txt vocab.json tokenizer.json
(Pin ctranslate2<4.5 for compatibility with transformers==4.46.3.)
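As a sanity check on the 1.56 GB figure, the weight footprint can be estimated from the parameter count. This is a back-of-the-envelope sketch, assuming 1.55 B parameters and one byte per int8 weight; it ignores quantization scales, any tensors kept at higher precision, and the tokenizer files, so it will not match the on-disk size exactly.

```python
PARAMS = 1.55e9  # parameter count of the base model, from this card

BYTES_PER_WEIGHT = {"float32": 4, "float16": 2, "int8": 1}


def approx_size_gb(dtype: str, params: float = PARAMS) -> float:
    """Approximate weight size in decimal GB (1 GB = 1e9 bytes)."""
    return params * BYTES_PER_WEIGHT[dtype] / 1e9


for dtype in BYTES_PER_WEIGHT:
    print(f"{dtype:8s} ~{approx_size_gb(dtype):.2f} GB")
```

The int8 estimate of about 1.55 GB is close to the reported 1.56 GB on disk.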
Limitations
- Maghrebi is out of scope: expect zero-shot-level quality (≥80% WER) on Algerian/Moroccan/Tunisian.
- MSA: zero-shot Whisper-large-v3 is slightly better. If your traffic is overwhelmingly broadcast Arabic and you don't need dialect support, plain zero-shot large-v3 is the right pick (8.51% MSA vs our 10.52%).
- Gulf test is Casablanca-only: no public broadcast Gulf corpus is available, so the Gulf number reflects conversational speech only.
- Single utterance assumed: audio longer than 30 s is chunked internally by faster-whisper, and multi-speaker diarization is not provided.
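Because long files come back as several segments, it can be useful to keep the per-segment timestamps instead of joining the text. This is a small sketch, assuming the `start`, `end`, and `text` attributes that faster-whisper segments expose; `format_timestamp` and `format_segments` are hypothetical helpers, and no speaker labels are produced.

```python
def format_timestamp(seconds: float) -> str:
    """Format a time in seconds as MM:SS.mmm."""
    total_ms = round(seconds * 1000)
    minutes, ms = divmod(total_ms, 60_000)
    return f"{minutes:02d}:{ms / 1000:06.3f}"


def format_segments(segments) -> str:
    """One line per segment: [start -> end] text."""
    lines = []
    for seg in segments:
        start = format_timestamp(seg.start)
        end = format_timestamp(seg.end)
        lines.append(f"[{start} -> {end}] {seg.text.strip()}")
    return "\n".join(lines)


# Usage, with the model loaded as in the Quickstart:
# segments, info = model.transcribe("long_audio.wav", language="ar", beam_size=2)
# print(format_segments(segments))
```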
Citation
@misc{hany2026whisperarabic,
title = {Production-Aware Fine-Tuning of Whisper Variants for Multi-Dialect
Arabic ASR: A Cross-Platform CPU Inference Study},
author = {Hany, Ahmed},
year = {2026},
howpublished = {Preprint, arXiv (in preparation)},
url = {https://github.com/dev-ahmedhany/whisper-arabic-dialects},
}
License
Apache-2.0 (inherits from base openai/whisper-large-v3).