Whisper-large-v3: Arabic 4-dialect FT v3 (CT2 int8)

CTranslate2 int8 build of openai/whisper-large-v3 (1.55 B base), fine-tuned with QLoRA on 38 hours of dialect-balanced Arabic from Casablanca (5 countries) + cleaned MGB-3 + MASC + Common Voice 18. The model is 1.56 GB on disk and runs at roughly real-time speed on a commodity CPU (RTF ≈ 0.5–1.7 with 8 threads; see Quickstart).

The "v3" of the project: a much larger base than the v2 turbo (1.55 B vs 809 M parameters), trained on much more data (38 h vs 7 h). For the smaller turbo variant, see whisper-large-v3-turbo-arabic-ft-ct2-int8.

For the float32 PyTorch version, see whisper-large-v3-arabic-ft-v3. For the LoRA adapter (for further fine-tuning, with a Git history of every checkpoint saved during training), see whisper-large-v3-arabic-ft-v3-lora.

Headline WER on mixed-domain test sets (n=100/dialect, int8, beam=2, threads=8, c3-standard-8)

50% Casablanca + 50% broadcast (MGB-3 / MASC) for Egyptian and Levantine; 100% Casablanca UAE for Gulf (no public broadcast Gulf source); 100% FLEURS broadcast for MSA. The same recordings and decoding config were used for both models.

| Dialect | Test composition | Zero-shot Whisper-large-v3 | This model (v3-ft) | Δ |
|---|---|---|---|---|
| MSA | FLEURS broadcast | 8.51% | 10.52% | +2.01 pp |
| Egyptian | 50 Casablanca + 50 MGB-3 | 38.48% | 23.90% | −14.58 pp ✅ |
| Levantine | 50 Casablanca JO + 50 MASC | 37.70% | 30.63% | −7.07 pp ✅ |
| Gulf | Casablanca UAE | 52.72% | 41.46% | −11.26 pp ✅ |
| avg-4 | | 34.35% | 26.63% | −7.72 pp ✅ |

v3-ft beats zero-shot Whisper-large-v3 by 7.72 pp average WER, with double-digit gains on Egyptian and Gulf. The MSA loss of 2.01 pp is the well-known dialect-vs-MSA tradeoff; see paper §6 for the details.
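As a quick check on the table, the Δ column and the avg-4 row can be reproduced from the per-dialect figures. The sketch below assumes avg-4 is an unweighted mean over the four dialects; the card does not state the weighting, but the reported numbers are consistent with it. The names `wer` and `delta_pp` are illustrative only.

```python
# WER (%) per dialect, copied from the table: (zero-shot, fine-tuned).
wer = {
    "MSA": (8.51, 10.52),
    "Egyptian": (38.48, 23.90),
    "Levantine": (37.70, 30.63),
    "Gulf": (52.72, 41.46),
}


def delta_pp(zero_shot: float, fine_tuned: float) -> float:
    """Change in WER in percentage points; negative means fewer errors."""
    return round(fine_tuned - zero_shot, 2)


deltas = {dialect: delta_pp(z, f) for dialect, (z, f) in wer.items()}

# Assumption: avg-4 is an unweighted (macro) mean over the four dialects.
avg_zero_shot = round(sum(z for z, _ in wer.values()) / len(wer), 2)
avg_fine_tuned = round(sum(f for _, f in wer.values()) / len(wer), 2)
avg_delta = round(avg_fine_tuned - avg_zero_shot, 2)

for dialect, d in deltas.items():
    print(f"{dialect:10s} {d:+.2f} pp")
print(f"{'avg-4':10s} {avg_delta:+.2f} pp")
```

Because the average is unweighted, the Gulf result (trained on the least data) counts as much as the Egyptian one.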

Quickstart

pip install faster-whisper
huggingface-cli download dev-ahmedhany/whisper-large-v3-arabic-ft-v3-ct2-int8 \
    --local-dir ./whisper-ar-v3-int8

from faster_whisper import WhisperModel

model = WhisperModel(
    "./whisper-ar-v3-int8",
    device="cpu", compute_type="int8", cpu_threads=8,
)
segments, info = model.transcribe(
    "audio.wav",
    beam_size=2,        # paper §6.6 sweet spot
    language="ar",
    task="transcribe",
)
print(" ".join(s.text for s in segments))

Inference RTF ≈ 0.5–0.8 (MSA) to 1.4–1.7 (dialects) on a c3-standard-8 (Intel Sapphire Rapids), peak RAM ~3.1 GB.
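For readers who want to measure RTF on their own hardware, here is a minimal sketch. It assumes the conventional definition, processing time divided by audio duration, so values below 1 are faster than real time; the card does not spell out its definition. The helpers `real_time_factor` and `time_call` are hypothetical names, not part of faster-whisper.

```python
import time


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = processing time / audio duration (assumed definition)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def time_call(fn, *args, **kwargs):
    """Run fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Example: 45 s of compute for a 30 s clip is slower than real time.
print(real_time_factor(45.0, 30.0))
```

With faster-whisper, `segments` is a generator and decoding runs only as it is consumed, so the timed call should include the loop or `join` that reads the segments; `info.duration` can then serve as the audio length.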

Training recipe

  • Base: openai/whisper-large-v3 (1.55 B params)
  • QLoRA: NF4 + bf16 compute, r=8, α=16, dropout 0.05, target modules = q/v/k/out_proj + fc1/fc2
  • Optimizer: paged_adamw_8bit, lr = 1e-4, warmup ratio 0.1
  • Effective batch 16 (per-device 4 × grad-accum 4), gradient_checkpointing
  • Best checkpoint: step 4750 of 10000 (early-stop after plateau across ckpts 5000-6000)
  • Training data: ~38 h, 26,817 train / 924 val rows
    • MSA (9.6 h): Common Voice 18 Arabic (capped 15 h)
    • Egyptian (18.6 h): Casablanca Egypt + MGB-3 + cleaned MGB-3
    • Levantine (9.8 h): Casablanca Jordan + Palestine + MASC
    • Gulf (1.9 h): Casablanca UAE + Yemen
  • Maghrebi excluded (84.7% zero-shot WER, too far gone for the QLoRA budget; see paper §3.7)
  • Compute: ~25–28 h on a single GCP L4 (g2-standard-16), ~$25 GPU + ~$5 CPU bench evals
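The recipe's numbers imply a few derived quantities worth making explicit. The sketch below is plain arithmetic on the figures listed above. It assumes that a "step" is one optimizer update after gradient accumulation (as in the Hugging Face Trainer) and that the standard LoRA scaling factor α/r applies; neither is stated in the card.

```python
# Figures taken from the recipe above.
per_device_batch = 4
grad_accum_steps = 4
num_devices = 1            # single L4 GPU
best_step = 4750
train_rows = 26_817
lora_r, lora_alpha = 8, 16

# Samples consumed per optimizer update.
effective_batch = per_device_batch * grad_accum_steps * num_devices

# Assumption: each step is one optimizer update over a full effective batch.
samples_seen = best_step * effective_batch
epochs_at_best = samples_seen / train_rows

# Standard LoRA scales the low-rank update by alpha / r.
lora_scale = lora_alpha / lora_r

print(f"effective batch: {effective_batch}")
print(f"samples seen at best checkpoint: {samples_seen}")
print(f"approx. epochs at best checkpoint: {epochs_at_best:.2f}")
print(f"LoRA scale (alpha / r): {lora_scale}")
```

Under these assumptions the best checkpoint falls a little short of three passes over the training set.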

Conversion to CT2 int8

ct2-transformers-converter \
    --model checkpoints/v3-merged \
    --output_dir checkpoints/v3-ct2-int8 \
    --quantization int8 \
    --copy_files preprocessor_config.json tokenizer_config.json normalizer.json \
                 special_tokens_map.json added_tokens.json merges.txt vocab.json tokenizer.json

(Pin ctranslate2<4.5 for compatibility with transformers==4.46.3.)
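After conversion, it can be useful to confirm that the output directory is complete before loading it. The sketch below checks for `model.bin` (the weights file CTranslate2 writes) plus the files named in `--copy_files` above, and reports the directory size in decimal GB. The helper names `missing_files` and `dir_size_gb` are hypothetical, and whether every copied file is strictly required at inference time is not something the card states.

```python
from pathlib import Path

# model.bin is written by the converter; the rest mirror --copy_files above.
EXPECTED_FILES = [
    "model.bin",
    "preprocessor_config.json", "tokenizer_config.json", "normalizer.json",
    "special_tokens_map.json", "added_tokens.json", "merges.txt",
    "vocab.json", "tokenizer.json",
]


def missing_files(model_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in expected if not (root / name).is_file()]


def dir_size_gb(model_dir):
    """Total size of all files under model_dir, in decimal gigabytes."""
    files = (p for p in Path(model_dir).rglob("*") if p.is_file())
    return sum(p.stat().st_size for p in files) / 1e9


if __name__ == "__main__":
    out_dir = "checkpoints/v3-ct2-int8"
    print("missing:", missing_files(out_dir))
    print(f"size: {dir_size_gb(out_dir):.2f} GB")
```

For a successful conversion, `missing_files` should return an empty list and the size should land near the 1.56 GB reported for this build.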

Limitations

  • Maghrebi out of scope: zero-shot-quality (≥80% WER) on Algerian/Moroccan/Tunisian.
  • MSA: zero-shot Whisper-large-v3 is slightly better. If your traffic is overwhelmingly broadcast Arabic and you don't need dialect support, plain zero-shot large-v3 is the right pick (8.51% MSA vs our 10.52%).
  • Gulf test is Casablanca-only: no public broadcast Gulf corpus is available, so the Gulf number reflects conversational speech only.
  • Single-utterance assumed: audio >30 s is internally chunked by faster-whisper; multi-speaker diarization is not provided.

Citation

@misc{hany2026whisperarabic,
  title        = {Production-Aware Fine-Tuning of Whisper Variants for Multi-Dialect
                  Arabic ASR: A Cross-Platform CPU Inference Study},
  author       = {Hany, Ahmed},
  year         = {2026},
  howpublished = {Preprint, arXiv (in preparation)},
  url          = {https://github.com/dev-ahmedhany/whisper-arabic-dialects},
}

License

Apache-2.0 (inherits from base openai/whisper-large-v3).
