---
pipeline_tag: automatic-speech-recognition
language:
- ur
- en
tags:
- audio
- speech-to-text
- automatic-speech-recognition
- asr
- translation
- urdu
- english
- whisper
- whisper-medium
- peft
- lora
- fast-inference
- specdox
base_model: openai/whisper-medium
license: apache-2.0
metrics:
- wer
- bleu
- meteor
- bertscore
---

# SpecDox: Fast Urdu-to-English Speech Translation Model

This is a fine-tuned version of the [OpenAI Whisper Medium](https://huggingface.co/openai/whisper-medium) model. It is trained to perform **Automatic Speech Recognition (ASR)** and **Audio Translation**, taking spoken Urdu (اردو) as input and producing written English text.

This model serves as the core audio-processing engine for **SpecDox**, a real-time Urdu-to-English Speech-to-Structured-Document system.

## 🚀 Key Features & SEO Highlights
- **High Speed & Low VRAM:** Built on the 769M parameter Whisper Medium architecture. We chose this over the Large model to maximize GPU efficiency, ensure fast inference speeds, and allow for deployment on consumer-grade hardware.
- **Training Data:** Trained on **127 hours** of high-quality Urdu-to-English baseline speech data, expanded to **172 hours** through data augmentation.
- **PEFT / LoRA Optimized:** Fine-tuned using Parameter-Efficient Fine-Tuning (LoRA adapters) and merged into FP16/BF16 weights for a lightweight footprint without sacrificing domain specificity.
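
To make the "merged" step in the last bullet concrete, here is a minimal sketch of what folding a LoRA adapter into a base weight means, using the standard LoRA update `W' = W + (alpha / r) * B @ A`. The rank, `alpha`, and matrix values are toy assumptions for illustration; this card does not state the actual adapter configuration, and the helper names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha):
    """Fold a LoRA adapter into a dense weight: W' = W + (alpha / r) * (B @ A).

    lora_a has shape (r, in_features) and lora_b has shape (out_features, r).
    """
    rank = len(lora_a)
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy 2x2 layer with a rank-1 adapter (all values are illustrative assumptions).
weight = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[1.0, 2.0]]    # shape (r=1, in=2)
lora_b = [[0.5], [0.0]]  # shape (out=2, r=1)

merged = merge_lora(weight, lora_a, lora_b, alpha=2.0)
print(merged)
```

After merging, the adapter matrices are no longer needed at inference time, so the merged model has the same size and latency as the base architecture. In practice this is usually done with the PEFT library's `merge_and_unload()` rather than by hand.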

---

## 📊 Evaluation & Performance

The table below compares the fine-tuned SpecDox models against standard baseline architectures. Evaluation used four metrics: Word Error Rate (WER), BLEU, METEOR, and BERTScore F1.

| Model | WER% ↓ | BLEU ↑ | METEOR ↑ | BERTScore F1 ↑ | Rank |
| :--- | :---: | :---: | :---: | :---: | :---: |
| **SpecDox-Whisper-Medium** | **36.25** | **53.30** | 0.7804 | **0.9405** | **#1** |
| Faster Whisper (SpecDox) | 36.28 | 53.24 | **0.7811** | 0.9402 | **#2** |
| Whisper Large-v3 | 42.88 | 46.86 | 0.7105 | 0.9270 | #3 |
| Whisper Medium (Baseline) | 45.33 | 44.16 | 0.6882 | 0.9226 | #4 |
| SeamlessM4T Medium | 72.04 | 18.84 | 0.3697 | 0.8429 | #5 |

> **Engineering Takeaway:** Despite being a lighter architecture, the fine-tuned SpecDox Medium model outperforms the baseline Whisper Large-v3 by **6.63 percentage points of absolute WER** (36.25% vs. 42.88%) and scores higher on the translation quality metrics (BLEU 53.30 vs. 46.86, METEOR 0.7804 vs. 0.7105). This supports the choice of Whisper Medium for production environments that need fast inference and a low GPU footprint.
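
The gap quoted in the takeaway can be checked directly from the table. The sketch below recomputes the absolute WER reduction (in percentage points) and the relative reduction against each baseline; the scores are copied from the table, and the helper name `wer_reduction` is illustrative only.

```python
# Scores copied from the evaluation table: (WER %, BLEU).
scores = {
    "SpecDox-Whisper-Medium": (36.25, 53.30),
    "Whisper Large-v3": (42.88, 46.86),
    "Whisper Medium (Baseline)": (45.33, 44.16),
}


def wer_reduction(baseline_wer, model_wer):
    """Return (absolute reduction in points, relative reduction in %)."""
    absolute = baseline_wer - model_wer
    relative = 100.0 * absolute / baseline_wer
    return round(absolute, 2), round(relative, 1)


specdox_wer, specdox_bleu = scores["SpecDox-Whisper-Medium"]
for name in ("Whisper Large-v3", "Whisper Medium (Baseline)"):
    wer, bleu = scores[name]
    abs_pts, rel_pct = wer_reduction(wer, specdox_wer)
    print(
        f"vs {name}: WER -{abs_pts} points ({rel_pct}% relative), "
        f"BLEU +{specdox_bleu - bleu:.2f}"
    )
```

Against Whisper Large-v3 this gives 6.63 points absolute, or roughly a 15.5% relative reduction in WER; against the un-tuned Whisper Medium baseline it gives 9.08 points, or roughly 20% relative.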

---

## 💡 Ideal Use Cases
If you are searching for a model to handle the following tasks, this model is built for you:
- **Real-Time Translation:** Live transcription and translation of Urdu audio, podcasts, or lectures into English.
- **Voice-to-Text Document Generation:** Converting dictated Urdu notes into structured English reports (the primary function of SpecDox).
- **Cross-Lingual ASR:** Handling Pakistani accents and regional Urdu pronunciations with high accuracy.
- **Edge Deployment:** Running high-accuracy audio translation on hardware with limited VRAM (Google Colab free tier, local RTX GPUs, etc.).
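
For the edge-deployment case, a rough back-of-the-envelope estimate of the memory needed just to hold the weights can help when sizing hardware. The sketch below uses the 769M parameter count quoted in this card; it covers weights only and ignores activations, the decoder cache, and framework overhead, so real usage will be higher.

```python
def weight_memory_gib(num_params, bytes_per_param):
    """Memory needed to hold the model weights alone, in GiB."""
    return num_params * bytes_per_param / 1024**3


WHISPER_MEDIUM_PARAMS = 769_000_000  # parameter count quoted in this card

for dtype, nbytes in (("fp32", 4), ("fp16/bf16", 2)):
    gib = weight_memory_gib(WHISPER_MEDIUM_PARAMS, nbytes)
    print(f"{dtype}: {gib:.2f} GiB")
```

In half precision the weights come to about 1.43 GiB, which is why the model fits comfortably on GPUs with limited VRAM.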

## 💻 How to Use in Python

```python
from transformers import WhisperForConditionalGeneration, WhisperProcessor
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the SpecDox Whisper model
processor = WhisperProcessor.from_pretrained("Shzaib/SpecDox-Whisper-Medium")
model = WhisperForConditionalGeneration.from_pretrained("Shzaib/SpecDox-Whisper-Medium").to(device)

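def translate_urdu_audio(audio_array, sampling_rate=16000):
    """Translate a mono Urdu waveform into English text.

    audio_array: 1-D float array of audio samples.
    sampling_rate: must be 16000, the rate Whisper's feature extractor expects.
    """
    # Convert the raw waveform into log-mel input features on the target device
    inputs = processor(audio_array, sampling_rate=sampling_rate, return_tensors="pt").to(device)

    # Force the decoder to translate Urdu to English.
    # Newer transformers releases may prefer passing language="urdu", task="translate"
    # directly to model.generate() instead of forced_decoder_ids.
    forced_decoder_ids = processor.get_decoder_prompt_ids(language="urdu", task="translate")
    predicted_ids = model.generate(inputs["input_features"], forced_decoder_ids=forced_decoder_ids)

    # Decode the generated token IDs and return the first (only) result
    return processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
```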