---
language:
- uz
license: apache-2.0
tags:
- whisper
- automatic-speech-recognition
- uzbek
- speech-to-text
- asr
metrics:
- wer
- cer
base_model: openai/whisper-medium
pipeline_tag: automatic-speech-recognition
library_name: transformers
datasets:
- custom
model-index:
- name: whisper-medium-uz-v1
  results:
  - task:
      type: automatic-speech-recognition
      name: Speech Recognition
    metrics:
    - type: wer
      value: 16.7
      name: Overall WER
    - type: cer
      value: 7.0
      name: Overall CER
---

# Whisper Medium Uzbek v1 by **Kotibai & Rubai Team**

## Developed by **Kotibai & Rubai Team**

**Uzbek Automatic Speech Recognition (ASR) model** fine-tuned from Whisper Medium.

## Model Description

- **Base Model**: OpenAI Whisper Medium (769M parameters)
- **Language**: Uzbek (uz)
- **Training Data**: ~1,600 hours of Uzbek audio
- **Precision**: BF16
- **Script**: Latin (handles Russian loanwords in Latin script: "brat", "davay", "prosto", etc.)

## Evaluation Results

| Category | WER |
|----------|-----|
| **Overall** | **16.7%** |
| Clean Speech | ~6-11% |
| Noisy/Augmented | ~12-24% |
| Dialects | ~16-25% |

Evaluated on 1,864 samples across 8 diverse test sets.
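For readers unfamiliar with the metrics, the sketch below shows the standard definitions of WER and CER: the edit distance between reference and hypothesis, divided by the reference length, counted over words or characters. It is only an illustration. The helper names `edit_distance` and `error_rate` and the sample sentences are hypothetical, and this card does not state which tool or text normalization produced the reported numbers, so results from this sketch may not match them exactly.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def error_rate(references, hypotheses, unit="word"):
    """Corpus-level WER (unit="word") or CER (unit="char")."""
    total_edits = 0
    total_length = 0
    for ref, hyp in zip(references, hypotheses):
        ref_tokens = ref.split() if unit == "word" else list(ref)
        hyp_tokens = hyp.split() if unit == "word" else list(hyp)
        total_edits += edit_distance(ref_tokens, hyp_tokens)
        total_length += len(ref_tokens)
    return total_edits / total_length


# Made-up example pair, not taken from the test sets.
references = ["men bugun ishga bordim"]
hypotheses = ["men bugun ishga bordi"]
print(f"WER: {error_rate(references, hypotheses, unit='word'):.2%}")
print(f"CER: {error_rate(references, hypotheses, unit='char'):.2%}")
```

Summing edits and reference lengths across the whole corpus before dividing, as done here, weights long utterances more heavily than averaging per-utterance rates; either convention may have been used for the table above.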
## Usage

### Using Transformers

```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import librosa

processor = WhisperProcessor.from_pretrained("Kotib/uzbek_stt_v1")
model = WhisperForConditionalGeneration.from_pretrained("Kotib/uzbek_stt_v1")

# Whisper expects 16 kHz mono audio; librosa resamples on load.
audio, sr = librosa.load("audio.wav", sr=16000)
input_features = processor(audio, sampling_rate=sr, return_tensors="pt").input_features

# Force Uzbek transcription instead of relying on language auto-detection.
predicted_ids = model.generate(input_features, language="uz", task="transcribe")
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
```

### Using Pipeline

```python
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="Kotib/uzbek_stt_v1",
    chunk_length_s=30,  # split long audio into 30-second chunks
    device="cuda",      # use "cpu" if no GPU is available
)

result = pipe("audio.wav", generate_kwargs={"language": "uz", "task": "transcribe"})
print(result["text"])
```

## Training

Trained in 3 stages using curriculum learning:

| Stage | Hours |
|-------|-------|
| Foundation | 725h |
| Robustness | 394h |
| Domain Adaptation | 474h |

The three stages total 1,593 hours, which matches the ~1,600 hours of training data listed in the model description.

## Intended Use

- Uzbek speech-to-text transcription
- Voice assistants and dictation
- Media transcription and subtitling

## Limitations

- Performance degrades on very noisy audio
- May struggle with heavy code-switching
- Optimized for Uzbek only

## License

Apache 2.0