---
license: apache-2.0
language:
- uz
- en
- ru
metrics:
- wer
base_model:
- openai/whisper-small
pipeline_tag: automatic-speech-recognition
tags:
- speech-recognition
- whisper
- multilingual
- uzbek
- russian
- english
---

# Multilingual Whisper (Uz/En/Ru): Fine-tuned Speech-to-Text Model

A fine-tuned **Whisper Small** model optimized to transcribe **Uzbek, English, and Russian equally well**.
It is intended for real-world speech transcription, was trained on a balanced multilingual dataset, and performs competitively against strong open-source and commercial STT solutions.

---

## Model Details

### Model Description

This model extends **OpenAI Whisper Small** by fine-tuning it on a multilingual speech mixture, with the aim of delivering robust ASR performance for **Uzbek**, **English**, and **Russian** speakers.
The goal was to reduce the performance gap between languages, in particular by improving **Uzbek** speech recognition, where public ASR resources are scarce.

- **Model type:** Automatic Speech Recognition (ASR)
- **Language(s):** Uzbek 🇺🇿, English 🇬🇧, Russian 🇷🇺
- **License:** Apache-2.0
- **Finetuned from:** openai/whisper-small
- **Intended usage:** Real-time & offline speech-to-text

---

## Training Datasets

- DavronSherbaev/uzbekvoice-filtered
- telegram-voice-messages (private collection)
- navaistt-open-datasets
- sovaai/russian-audiobooks
- librispeech

## Evaluation

### Word Error Rate (WER) Comparison

All WER results were obtained on the same test set, which consists of real-world voice messages collected from public Telegram groups and contains approximately 2 hours of audio in total.
The dataset will be made publicly available soon.
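
For readers unfamiliar with the metric, WER is the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The sketch below is a minimal illustration of that definition, not the evaluation script used for this model card: the function name `word_error_rate`, the whitespace tokenization, the absence of text normalization, and the example sentences are all assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()  # assumption: plain whitespace tokenization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical transcripts, for illustration only
reference = "men bugun ishga bormayman"
hypothesis = "men bugun ishga boraman"
print(f"WER: {word_error_rate(reference, hypothesis):.2%}")  # prints "WER: 25.00%"
```

A corpus-level WER is usually obtained by summing the edit counts and the reference word counts over all utterances before dividing, rather than by averaging per-utterance scores.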

| Model                                  | WER ↓      |
|----------------------------------------|------------|
| Whisper-small-uz-v1 (this model)       | **34.5%**  |
| Gemini (Commercial)                    | 36.21%     |
| NavaiSTT v2 (Open-Source medium model) | 35.14%     |
| Aisha STT (Commercial)                 | 41.71%     |

On this test set, the model achieves the lowest WER of the systems compared, **outperforming both commercial and open-source Uzbek STT models**, which suggests that it generalizes well to informal real-world speech.

---

## Usage Example

```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch
import torchaudio

model_id = "OvozifyLabs/whisper-small-uz-v1"
processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id)

# torchaudio returns a (channels, samples) tensor at the file's native sample rate
audio, sr = torchaudio.load("audio.wav")

# Whisper's feature extractor expects mono audio sampled at 16 kHz
audio = audio.mean(dim=0)
if sr != 16000:
    audio = torchaudio.functional.resample(audio, orig_freq=sr, new_freq=16000)

inputs = processor(audio.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    predicted_ids = model.generate(inputs.input_features)

text = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(text)
```