---
pipeline_tag: automatic-speech-recognition
language:
- ur
- en
tags:
- audio
- speech-to-text
- automatic-speech-recognition
- asr
- translation
- urdu
- english
- whisper
- whisper-medium
- peft
- lora
- fast-inference
- specdox
base_model: openai/whisper-medium
license: apache-2.0
metrics:
- wer
- bleu
- meteor
- bertscore
---

# SpecDox: Fast Urdu-to-English Speech Translation Model

This is a fine-tuned version of the [OpenAI Whisper Medium](https://huggingface.co/openai/whisper-medium) model, trained for **Automatic Speech Recognition (ASR)** and **Audio Translation**: it takes spoken Urdu (اردو) and produces written English text.

This model serves as the core audio-processing engine for **SpecDox**, a real-time Urdu-to-English Speech-to-Structured-Document system.

## 🚀 Key Features & SEO Highlights

- **High Speed & Low VRAM:** Built on the 769M-parameter Whisper Medium architecture, chosen over the Large model to improve GPU efficiency, keep inference fast, and allow deployment on consumer-grade hardware.
- **Large Training Set:** Trained on **127 hours** of high-quality Urdu-to-English speech data, expanded to **172 hours** through data augmentation.
- **PEFT / LoRA Optimized:** Fine-tuned with Parameter-Efficient Fine-Tuning (LoRA adapters), then merged into FP16/BF16 weights for a lightweight footprint without sacrificing domain specificity.

---

## 📊 Evaluation & Performance

The table below compares the fine-tuned SpecDox models against standard baseline models on four metrics: Word Error Rate (WER), BLEU, METEOR, and BERTScore F1.

| Model | WER% ↓ | BLEU ↑ | METEOR ↑ | BERTScore F1 ↑ | Rank |
| :--- | :---: | :---: | :---: | :---: | :---: |
| **SpecDox-Whisper-Medium** | **36.25** | **53.30** | 0.7804 | **0.9405** | **#1** |
| Faster Whisper (SpecDox) | 36.28 | 53.24 | **0.7811** | 0.9402 | **#2** |
| Whisper Large-v3 | 42.88 | 46.86 | 0.7105 | 0.9270 | #3 |
| Whisper Medium (Baseline) | 45.33 | 44.16 | 0.6882 | 0.9226 | #4 |
| SeamlessM4T Medium | 72.04 | 18.84 | 0.3697 | 0.8429 | #5 |

> **Engineering Takeaway:** Despite being a lighter architecture, the fine-tuned SpecDox Medium model outperforms the baseline Whisper Large-v3 by **6.63 percentage points of absolute WER** (36.25% vs. 42.88%) and scores higher on the translation quality metrics (BLEU 53.30 vs. 46.86, METEOR 0.7804 vs. 0.7105). This supports the choice of Whisper Medium for production environments that need fast inference and a low GPU footprint.

---

## 💡 Ideal Use Cases

This model is intended for the following tasks:

- **Real-Time Translation:** Live transcription and translation of Urdu audio, podcasts, or lectures into English.
- **Voice-to-Text Document Generation:** Converting dictated Urdu notes into structured English reports (the primary function of SpecDox).
- **Cross-Lingual ASR:** Handling Pakistani accents and regional Urdu pronunciations with high accuracy.
- **Edge Deployment:** Running high-accuracy audio translation on hardware with limited VRAM (Google Colab free tier, local RTX GPUs, etc.).

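
For the longer recordings mentioned above (podcasts, lectures), keep in mind that Whisper models process audio in windows of at most 30 seconds, so a long recording has to be split before it reaches the model. The following is a minimal sketch of that step, assuming 16 kHz mono input and simple fixed-length, non-overlapping chunks. `split_into_chunks` is a hypothetical helper name used for illustration; it is not part of this repository or of `transformers`.

```python
from typing import List, Sequence

SAMPLE_RATE = 16_000      # Whisper expects 16 kHz audio
MAX_CHUNK_SECONDS = 30    # Whisper's input window is 30 seconds


def split_into_chunks(
    audio: Sequence[float],
    sample_rate: int = SAMPLE_RATE,
    chunk_seconds: int = MAX_CHUNK_SECONDS,
) -> List[Sequence[float]]:
    """Split a mono waveform into consecutive chunks of at most `chunk_seconds`."""
    if sample_rate <= 0 or chunk_seconds <= 0:
        raise ValueError("sample_rate and chunk_seconds must be positive")
    chunk_size = sample_rate * chunk_seconds
    # Slicing works for lists and NumPy arrays alike; the last chunk may be shorter.
    return [audio[i:i + chunk_size] for i in range(0, len(audio), chunk_size)]


# Example: 75 seconds of silence is split into chunks of 30 s, 30 s and 15 s.
waveform = [0.0] * (75 * SAMPLE_RATE)
chunks = split_into_chunks(waveform)
print([len(c) / SAMPLE_RATE for c in chunks])  # [30.0, 30.0, 15.0]
```

Each chunk can then be translated on its own and the resulting texts joined. Cutting at fixed positions can split a word in half, so overlapping windows or silence-based segmentation are common refinements; the `transformers` ASR pipeline also offers a `chunk_length_s` argument for this purpose.
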
## 💻 How to Use in Python

```python
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the SpecDox Whisper model
processor = WhisperProcessor.from_pretrained("Shzaib/SpecDox-Whisper-Medium")
model = WhisperForConditionalGeneration.from_pretrained("Shzaib/SpecDox-Whisper-Medium").to(device)


def translate_urdu_audio(audio_array, sampling_rate=16000):
    """Translate a mono Urdu waveform (16 kHz expected by Whisper) into English text."""
    # Convert the raw waveform into log-mel input features on the target device
    inputs = processor(audio_array, sampling_rate=sampling_rate, return_tensors="pt").to(device)

    # Force the decoder to translate Urdu to English.
    # Note: recent transformers releases deprecate forced_decoder_ids for Whisper;
    # depending on your version, generate(..., language="urdu", task="translate")
    # may be the preferred form.
    forced_decoder_ids = processor.get_decoder_prompt_ids(language="urdu", task="translate")
    predicted_ids = model.generate(inputs["input_features"], forced_decoder_ids=forced_decoder_ids)

    # Decode token IDs to text and return the first (only) item in the batch
    return processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]