---
language:
- ml
license: apache-2.0
tags:
- whisper
- automatic-speech-recognition
- malayalam
- indic-asr
- fine-tuned
base_model: openai/whisper-small
metrics:
- wer
---

# Whisper Small: Malayalam High LR

Fine-tuned Malayalam ASR model based on [openai/whisper-small](https://huggingface.co/openai/whisper-small), trained as a single-stage baseline using a high learning rate on the full Malayalam training corpus. It serves as the High LR baseline (small architecture) in the [Vividh-ASR: Diagnosing and Fixing Studio-Bias in Whisper for Indic Languages](https://huggingface.co/blog/adalat-ai/vividh-benchmark) benchmark suite.

This model is part of a set of Malayalam and Hindi Whisper models released by [Adalat AI](https://www.adalat.ai/) alongside the Vividh-ASR benchmark.

---

## Model Description

The High LR baseline fine-tunes Whisper in a single stage on all available Malayalam training data mixed together, without any curriculum ordering:

| Stage | Data | LR |
|---|---|---|
| 1 | All tiers: Studio + Broadcast + Spontaneous (~890 hrs) | 2e-4 |

Training uses AdamW (weight decay 0.1), linear warmup for the first 10% of steps, and cosine annealing to zero. Trained on NVIDIA H100 GPUs using HuggingFace Transformers.

---

## Benchmark Results (Vividh-ASR)

Benchmark WER is measured using [faster-whisper](https://github.com/SYSTRAN/faster-whisper) with 7s VAD segmentation for long-form audio. See the [blog post](https://huggingface.co/blog/adalat-ai/vividh-benchmark) for full evaluation details.

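
For readers new to the metric, here is a minimal sketch of how word error rate (WER) is conventionally computed: the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. It is illustrative only. The function name `word_error_rate` and the plain whitespace tokenisation are assumptions made for this example, and the benchmark's own text normalisation and scoring code may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent: 100 * (substitutions + deletions + insertions) / N.

    N is the number of whitespace-separated words in the reference.
    Hypothetical helper for illustration; not the benchmark's scoring code.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words processed so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# Placeholder tokens: one substitution (w2 -> x) and one deletion (w4)
# against a four-word reference.
print(word_error_rate("w1 w2 w3 w4", "w1 x w3"))  # 50.0
```

Because the denominator is the reference length, WER can exceed 100% when the output contains many inserted words.
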

| Model | Tier A (Studio) | Tier B (Broadcast) | Tier C (Spontaneous) | Tier D (Noise) | Global |
|---|---|---|---|---|---|
| [whisper-medium-ml-high-lr](https://huggingface.co/adalat-ai/whisper-medium-ml-high-lr) | 35.04 | 30.48 | 50.30 | 50.78 | 40.85 |
| [whisper-medium-ml-rmft](https://huggingface.co/adalat-ai/whisper-medium-ml-rmft) | 37.56 | 31.66 | 46.10 | 45.73 | 39.64 |
| **whisper-small-ml-high-lr (This model)** | 39.05 | 32.50 | 54.39 | 51.08 | 43.93 |
| [whisper-small-ml-rmft](https://huggingface.co/adalat-ai/whisper-small-ml-rmft) | 40.26 | 35.05 | 53.77 | 48.04 | 44.53 |
| [IndicWhisper](https://github.com/AI4Bharat/vistaar/tree/master?tab=readme-ov-file#evaluating-asr-models) | 38.07 | 32.43 | 65.74 | 46.92 | 47.96 |
| [Vegam Whisper](https://huggingface.co/smcproject/vegam-whisper-medium-ml-int8_float16) | 38.74 | 55.10 | 58.53 | 54.46 | 53.39 |

*WER %. Lower is better. See [Vividh-ASR benchmark](https://huggingface.co/datasets/adalat-ai/vividh-test-malayalam) for full evaluation details.*

---

## Usage

```python
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="adalat-ai/whisper-small-ml-high-lr",
    chunk_length_s=30,
    device="cuda"
)

result = asr("audio.wav")
print(result["text"])
```

> **Note:** For long-form audio, benchmark results use
> [faster-whisper](https://github.com/SYSTRAN/faster-whisper) with 7s VAD
> segmentation. For short clips, the HuggingFace pipeline above will produce
> equivalent results.

---

## Training Data

Training data is a superset of the Vividh-ASR benchmark evaluation splits.
Sources used:

| Tier | Hours | Sources |
|---|---|---|
| A (Studio) | 182.2 | [Fleurs](https://huggingface.co/datasets/google/fleurs), [IndicTTS](https://www.iitm.ac.in/donlab/indictts/database.html), [OpenSLR](https://openslr.org/63/), [IMaSC](https://huggingface.co/datasets/thennal/IMaSC) |
| B (Broadcast) | 200.0 | [Shrutilipi](https://huggingface.co/datasets/ai4bharat/Shrutilipi) |
| C (Spontaneous) | 512.5 | [IndicVoices](https://huggingface.co/datasets/ai4bharat/IndicVoices), [Common Voice](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) |
| **Total** | **894.7** | |

---

## Intended Use & Limitations

This model is intended as a general-purpose Malayalam ASR model optimised for verbatim transcription accuracy across diverse acoustic conditions.

**Limitations:**

- Evaluated on Hindi and Malayalam only; generalisation to other Indic languages is untested
- Tier D evaluation uses synthetic noise profiles; performance on real-world degraded audio may differ

---

## Citation

If you use this model or the Vividh-ASR benchmark, please cite:

```bibtex
@misc{vividhasr2025,
  title  = {Vividh-ASR: Diagnosing and Fixing Studio-Bias in Whisper for Indic Languages},
  author = {Kush Juvekar and Kavya Manohar and Kumaramanas Nethil},
  year   = {2026},
  url    = {https://huggingface.co/blog/adalat-ai/vividh-benchmark}
}
```

```bibtex
@misc{vividh2026,
  title={Vividh-ASR: A Complexity-Tiered Benchmark and Optimization Dynamics for Robust Indic Speech Recognition},
  author={Kush Juvekar and Kavya Manohar and Aditya Srinivas Menon and Arghya Bhattacharya and Kumarmanas Nethil},
  year={2026},
  eprint={2605.13087},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2605.13087},
}
```

---

## Related Models and Datasets

See the [Vividh collection](https://huggingface.co/collections/adalat-ai/vividh-asr).

---

*Developed by [Adalat AI](https://www.adalat.ai/). Released under Apache 2.0.*