Whisper-large-v3 — Arabic 4-dialect FT v3 (CT2 int8)

CTranslate2 int8 build of openai/whisper-large-v3 (1.55 B parameters), fine-tuned with QLoRA on 38 hours of dialect-balanced Arabic drawn from Casablanca (5 countries), cleaned MGB-3, MASC, and Common Voice 18. The model is 1.56 GB on disk and runs at roughly real time on a commodity CPU (RTF 0.5–1.7 depending on dialect; see Quickstart).

This is the "v3" of the project: the base is roughly twice the size of the v2 turbo (1.55 B vs 809 M parameters), and the training set grew from 7 h to 38 h. For the smaller 809M turbo variant, see whisper-large-v3-turbo-arabic-ft-ct2-int8.

For the float32 PyTorch version, see whisper-large-v3-arabic-ft-v3. For the LoRA adapter (for further fine-tuning, with the Git history of every save during training), see whisper-large-v3-arabic-ft-v3-lora.

Headline WER on mixed-domain test sets (n=100/dialect, int8, beam=2, threads=8, c3-standard-8)

Test sets: 50% Casablanca + 50% broadcast (MGB-3 / MASC) for Egyptian and Levantine; 100% Casablanca UAE for Gulf (no public broadcast Gulf source); 100% FLEURS broadcast for MSA. Both models were evaluated on exactly the same recordings with the same decoding config.

Dialect	Test composition	Zero-shot Whisper-large-v3	This model (v3-ft)	Δ
MSA	FLEURS broadcast	8.51%	10.52%	+2.01 pp
Egyptian	50 Casablanca + 50 MGB-3	38.48%	23.90%	−14.58 pp ✅
Levantine	50 Casablanca JO + 50 MASC	37.70%	30.63%	−7.07 pp ✅
Gulf	Casablanca UAE	52.72%	41.46%	−11.26 pp ✅
avg-4		34.35%	26.63%	−7.72 pp ✅

v3-ft lowers average WER by 7.72 pp relative to zero-shot Whisper-large-v3, with double-digit gains on Egyptian and Gulf. The 2.01 pp MSA regression reflects the dialect-vs-MSA tradeoff; see paper §6 for details.
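The avg-4 row is consistent with an unweighted (macro) mean over the four dialects. That reading is an assumption, since the table does not say how the average was formed. A small check that uses only the values from the table:

```python
# WER values (%) copied from the table above: (zero-shot, fine-tuned).
wer = {
    "MSA": (8.51, 10.52),
    "Egyptian": (38.48, 23.90),
    "Levantine": (37.70, 30.63),
    "Gulf": (52.72, 41.46),
}


def macro_average(values):
    """Unweighted mean: every dialect counts equally (assumed, n=100 each)."""
    values = list(values)
    return sum(values) / len(values)


zero_shot_avg = macro_average(z for z, _ in wer.values())
fine_tuned_avg = macro_average(f for _, f in wer.values())

# Per-dialect change in percentage points (negative = fine-tuned is better).
deltas = {name: round(f - z, 2) for name, (z, f) in wer.items()}

print(f"zero-shot avg-4:  {zero_shot_avg:.2f}%")
print(f"fine-tuned avg-4: {fine_tuned_avg:.2f}%")
print(deltas)
```

Because each dialect contributes 100 utterances, a macro mean weights dialects equally regardless of how many words each test set contains.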

Quickstart

pip install faster-whisper
huggingface-cli download dev-ahmedhany/whisper-large-v3-arabic-ft-v3-ct2-int8 \
    --local-dir ./whisper-ar-v3-int8

from faster_whisper import WhisperModel

model = WhisperModel(
    "./whisper-ar-v3-int8",
    device="cpu", compute_type="int8", cpu_threads=8,
)
segments, info = model.transcribe(
    "audio.wav",
    beam_size=2,        # paper §6.6 sweet spot
    language="ar",
    task="transcribe",
)
# segments is a lazy generator, so decoding actually runs while it is consumed here
print(" ".join(s.text.strip() for s in segments))

Inference RTF ≈ 0.5–0.8 on MSA and 1.4–1.7 on dialects, measured on a c3-standard-8 (Intel Sapphire Rapids); peak RAM ~3.1 GB.
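RTF is read here as processing time divided by audio duration, so values below 1 are faster than real time. The card does not define the term, so treat that as an assumption. A minimal sketch for measuring it on your own hardware follows; `real_time_factor`, `timed_rtf`, and the stand-in `fake_transcribe` are hypothetical names, not part of faster-whisper:

```python
import time


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = processing time / audio duration (assumed definition)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def timed_rtf(transcribe_fn, audio_seconds: float) -> float:
    """Time one call to transcribe_fn and convert the elapsed time to RTF.

    transcribe_fn must finish all decoding before it returns.
    """
    start = time.perf_counter()
    transcribe_fn()
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, audio_seconds)


# Stand-in workload so the sketch runs without a model or audio file.
def fake_transcribe():
    time.sleep(0.05)


rtf = timed_rtf(fake_transcribe, audio_seconds=1.0)
print(f"RTF = {rtf:.2f}")
```

With faster-whisper, `model.transcribe` returns a lazy generator of segments, so the timed callable should consume it, for example `lambda: list(model.transcribe("audio.wav", beam_size=2, language="ar")[0])`. The audio length in seconds is available as `info.duration` on the second return value.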

Training recipe

Base: openai/whisper-large-v3 (1.55 B params)
QLoRA: NF4 + bf16 compute, r=8, α=16, dropout 0.05, target modules = q_proj/k_proj/v_proj/out_proj + fc1/fc2
Optimizer: paged_adamw_8bit, lr = 1e-4, warmup ratio 0.1
Effective batch 16 (per-device 4 × grad-accum 4), gradient_checkpointing
Best checkpoint: step 4750 of 10000 planned (training stopped early after a plateau across checkpoints 5000–6000)
Training data: ~38 h, 26,817 train / 924 val rows
- MSA (9.6 h): Common Voice 18 Arabic (capped 15 h)
- Egyptian (18.6 h): Casablanca Egypt + MGB-3 + cleaned MGB-3
- Levantine (9.8 h): Casablanca Jordan + Palestine + MASC
- Gulf (1.9 h): Casablanca UAE + Yemen
Maghrebi excluded: its 84.7% zero-shot WER is too high to recover within the QLoRA budget (paper §3.7)
Compute: ~25–28 h on a single GCP L4 (g2-standard-16), ~$25 GPU + ~$5 CPU bench evals
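The recipe's numbers can be cross-checked with a little arithmetic. This sketch assumes that the warmup ratio applies to the planned 10,000 steps and that every optimizer step consumes one full effective batch on a single GPU; neither is stated explicitly above.

```python
# Values copied from the training recipe above.
per_device_batch = 4
grad_accum_steps = 4
planned_steps = 10_000
best_step = 4_750
warmup_ratio = 0.1
train_rows = 26_817

# One optimizer step sees per-device batch x accumulation steps (single GPU).
effective_batch = per_device_batch * grad_accum_steps

# Assumption: warmup is computed against the planned schedule length.
warmup_steps = round(planned_steps * warmup_ratio)

# Training rows consumed by the time the best checkpoint was saved.
rows_seen_at_best = best_step * effective_batch
epochs_at_best = rows_seen_at_best / train_rows

print(f"effective batch:   {effective_batch}")
print(f"warmup steps:      {warmup_steps}")
print(f"rows seen at best: {rows_seen_at_best}")
print(f"epochs at best:    {epochs_at_best:.2f}")
```

Under these assumptions the best checkpoint corresponds to roughly 2.8 passes over the 26,817 training rows.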

Conversion to CT2 int8

ct2-transformers-converter \
    --model checkpoints/v3-merged \
    --output_dir checkpoints/v3-ct2-int8 \
    --quantization int8 \
    --copy_files preprocessor_config.json tokenizer_config.json normalizer.json \
                 special_tokens_map.json added_tokens.json merges.txt vocab.json tokenizer.json

(Pin ctranslate2<4.5 for compatibility with transformers==4.46.3.)
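After conversion it can help to confirm that the output directory holds the CTranslate2 weights and the files named in --copy_files. The helper below is a hypothetical sketch, not part of the project's tooling. It assumes the converter writes its weights as model.bin and that the output path matches the command above.

```python
from pathlib import Path

# model.bin is the CTranslate2 weight file (assumed name);
# the remaining entries mirror the --copy_files list in the command above.
EXPECTED_FILES = [
    "model.bin",
    "preprocessor_config.json",
    "tokenizer_config.json",
    "normalizer.json",
    "special_tokens_map.json",
    "added_tokens.json",
    "merges.txt",
    "vocab.json",
    "tokenizer.json",
]


def missing_files(output_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from output_dir."""
    root = Path(output_dir)
    return [name for name in expected if not (root / name).is_file()]


# Reports every expected file as missing if the directory does not exist yet.
missing = missing_files("checkpoints/v3-ct2-int8")
print("missing:", missing if missing else "none")
```

An empty result means every listed file is present; it says nothing about whether the weights themselves load correctly, which still needs a real transcription run.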

Limitations

Maghrebi is out of scope: expect roughly zero-shot quality (≥80% WER) on Algerian, Moroccan, and Tunisian speech.
MSA: zero-shot Whisper-large-v3 is slightly better. If your traffic is overwhelmingly broadcast Arabic and you don't need dialect support, plain zero-shot large-v3 is the right pick (8.51% MSA WER vs our 10.52%).
Gulf test is Casablanca-only: no public broadcast Gulf corpus was available, so the Gulf number reflects conversational speech only.
Single utterances are assumed: audio longer than 30 s is chunked internally by faster-whisper, and multi-speaker diarization is not provided.

Citation

@misc{hany2026whisperarabic,
  title        = {Production-Aware Fine-Tuning of Whisper Variants for Multi-Dialect
                  Arabic ASR: A Cross-Platform CPU Inference Study},
  author       = {Hany, Ahmed},
  year         = {2026},
  howpublished = {Preprint, arXiv (in preparation)},
  url          = {https://github.com/dev-ahmedhany/whisper-arabic-dialects},
}

License

Apache-2.0 (inherits from base openai/whisper-large-v3).
