---
pipeline_tag: automatic-speech-recognition
language:
- ur
- en
tags:
- audio
- speech-to-text
- automatic-speech-recognition
- asr
- translation
- urdu
- english
- whisper
- whisper-medium
- peft
- lora
- fast-inference
- specdox
base_model: openai/whisper-medium
license: apache-2.0
metrics:
- wer
- bleu
- meteor
- bertscore
---
# SpecDox: Fast Urdu-to-English Speech Translation Model
This is a fine-tuned version of the [OpenAI Whisper Medium](https://huggingface.co/openai/whisper-medium) model. It is trained for **Automatic Speech Recognition (ASR)** and **Audio Translation**: it takes spoken Urdu (اردو) and converts it into written English text.
This model serves as the core audio-processing engine for **SpecDox**, a real-time Urdu-to-English Speech-to-Structured-Document system.
## Key Features & SEO Highlights
- **High Speed & Low VRAM:** Built on the 769M-parameter Whisper Medium architecture. We chose this over the Large model to reduce GPU memory use, keep inference fast, and allow deployment on consumer-grade hardware.
- **Training Data:** Trained on **127 hours** of high-quality Urdu-to-English speech data, expanded to **172 hours** through data augmentation.
- **PEFT / LoRA Optimized:** Fine-tuned with Parameter-Efficient Fine-Tuning (LoRA adapters), then merged into FP16/BF16 weights, which keeps the footprint small while retaining domain specificity.
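
To make the "merged" part of the LoRA bullet concrete, here is a minimal sketch of what merging an adapter into a dense weight means, using the standard LoRA update `W + (alpha / rank) * B @ A`. It uses plain Python lists so it runs without dependencies. The function names, the toy matrices, and the `alpha` and `rank` values are illustrative assumptions; the card does not state the adapter configuration actually used.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return W + (alpha / rank) * (B @ A), i.e. the merged dense weight.

    Hypothetical helper for illustration only, not code from SpecDox.
    """
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)  # shape: (out_features, in_features)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy example (assumed values): a 2x3 frozen weight with a rank-1 adapter.
weight = [[1.0, 0.0, 0.0],
          [0.0, 1.0, 0.0]]
lora_a = [[1.0, 2.0, 3.0]]  # shape: (rank, in_features) = (1, 3)
lora_b = [[0.5], [-0.5]]    # shape: (out_features, rank) = (2, 1)

merged = merge_lora(weight, lora_a, lora_b, alpha=2, rank=1)
print(merged)
```

After merging, inference needs only the single dense matrix, so the adapter adds no extra latency. With the PEFT library the equivalent step is typically `merge_and_unload()` on the loaded adapter model.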
---
## Evaluation & Performance
The table below compares the fine-tuned SpecDox models with standard baseline models on four metrics: Word Error Rate (WER), BLEU, METEOR, and BERTScore F1. Arrows mark whether lower (↓) or higher (↑) is better.
| Model | WER% ↓ | BLEU ↑ | METEOR ↑ | BERTScore F1 ↑ | Rank |
| :--- | :---: | :---: | :---: | :---: | :---: |
| **SpecDox-Whisper-Medium** | **36.25** | **53.30** | 0.7804 | **0.9405** | **#1** |
| Faster Whisper (SpecDox) | 36.28 | 53.24 | **0.7811** | 0.9402 | **#2** |
| Whisper Large-v3 | 42.88 | 46.86 | 0.7105 | 0.9270 | #3 |
| Whisper Medium (Baseline) | 45.33 | 44.16 | 0.6882 | 0.9226 | #4 |
| SeamlessM4T Medium | 72.04 | 18.84 | 0.3697 | 0.8429 | #5 |
> **Engineering Takeaway:** Despite being a lighter architecture, the fine-tuned SpecDox Medium model beats the baseline Whisper Large-v3 by **6.63 percentage points of WER** (42.88 → 36.25, roughly a 15.5% relative reduction) and scores higher on the translation-quality metrics (BLEU, METEOR, BERTScore F1). This supports the choice of Whisper Medium for production environments that need fast inference and a small GPU footprint.
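
For readers unfamiliar with the WER column, the sketch below shows how a word error rate is commonly computed: the word-level edit distance between a reference and a hypothesis, divided by the number of reference words. The function name and the example sentences are made up for illustration. The card does not say which tool or text normalization produced the reported numbers, so this will not necessarily reproduce them.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences (not taken from the evaluation set).
reference = "the patient reports mild chest pain"
hypothesis = "the patient report mild pain"
score = word_error_rate(reference, hypothesis)
print(f"WER: {score:.2%}")
```

Here one substitution ("reports" vs "report") and one deletion ("chest") against six reference words give a WER of 2/6. Because insertions count as errors, WER can exceed 100% on very poor output.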
---
## Ideal Use Cases
This model is intended for the following tasks:
- **Real-Time Translation:** Live transcription and translation of Urdu audio, podcasts, or lectures into English.
- **Voice-to-Text Document Generation:** Converting dictated Urdu notes into structured English reports (the primary function of SpecDox).
- **Cross-Lingual ASR:** Handling Pakistani accents and regional Urdu pronunciations with high accuracy.
- **Edge Deployment:** Running high-accuracy audio translation on hardware with limited VRAM (Google Colab free tier, local RTX GPUs, etc.).
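
For the live and long-recording use cases above, audio usually has to be split before it reaches the model, because Whisper's encoder works on windows of up to 30 seconds of 16 kHz audio. The sketch below shows one simple way to do that with fixed, non-overlapping windows. The helper name `iter_chunks` and the constants are assumptions for illustration, not part of SpecDox.

```python
from typing import Iterator, Sequence

SAMPLE_RATE = 16_000  # Hz; the rate Whisper's feature extractor expects
WINDOW_SECONDS = 30   # Whisper processes audio in windows of up to 30 s


def iter_chunks(
    audio: Sequence[float],
    sample_rate: int = SAMPLE_RATE,
    window_seconds: int = WINDOW_SECONDS,
) -> Iterator[Sequence[float]]:
    """Yield consecutive, non-overlapping windows of a mono waveform.

    Hypothetical helper: the final chunk may be shorter than the window.
    """
    chunk_len = sample_rate * window_seconds
    for start in range(0, len(audio), chunk_len):
        yield audio[start:start + chunk_len]


# Example: 70 seconds of silence splits into 30 s + 30 s + 10 s.
audio = [0.0] * (SAMPLE_RATE * 70)
lengths = [len(chunk) for chunk in iter_chunks(audio)]
print(lengths)
```

Each chunk can then be translated on its own and the results joined. Fixed windows can cut a word in half at a boundary, so overlapping windows or silence-based splitting are common refinements.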
## How to Use in Python
```python
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the SpecDox Whisper model and its processor
processor = WhisperProcessor.from_pretrained("Shzaib/SpecDox-Whisper-Medium")
model = WhisperForConditionalGeneration.from_pretrained("Shzaib/SpecDox-Whisper-Medium").to(device)


def translate_urdu_audio(audio_array, sampling_rate=16000):
    """Translate a mono Urdu waveform (1-D float array, 16 kHz) into English text."""
    # Convert the raw waveform into log-mel input features on the target device
    inputs = processor(audio_array, sampling_rate=sampling_rate, return_tensors="pt").to(device)
    # Force the decoder to translate Urdu to English.
    # Note: recent transformers releases also accept language="urdu", task="translate"
    # directly in generate(); use that if forced_decoder_ids is rejected by your version.
    forced_decoder_ids = processor.get_decoder_prompt_ids(language="urdu", task="translate")
    predicted_ids = model.generate(inputs["input_features"], forced_decoder_ids=forced_decoder_ids)
    return processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
```