---
license: apache-2.0
language:
- uz
- en
- ru
metrics:
- wer
base_model:
- openai/whisper-small
pipeline_tag: automatic-speech-recognition
tags:
- speech-recognition
- whisper
- multilingual
- uzbek
- russian
- english
---

# Multilingual Whisper (Uz/En/Ru) — Fine-tuned Speech-to-Text Model

A fine-tuned **Whisper Small** model aimed at balanced transcription quality across **Uzbek, English, and Russian**.  
It is trained on a balanced multilingual dataset, targets real-world speech transcription, and performs competitively against strong open-source and commercial STT solutions.

---

## Model Details

### Model Description

This model extends **OpenAI Whisper Small** by fine-tuning it on a multilingual speech mixture, with the aim of delivering robust ASR performance for **Uzbek**, **English**, and **Russian** speakers.  
The goal was to reduce the performance gap between languages, in particular by improving **Uzbek** speech recognition, where public ASR resources are scarce.

- **Model type:** Automatic Speech Recognition (ASR)
- **Language(s):** Uzbek 🇺🇿, English 🇬🇧, Russian 🇷🇺
- **License:** Apache-2.0
- **Finetuned from:** openai/whisper-small
- **Intended usage:** Real-time & offline speech-to-text

---

## Training Datasets

- DavronSherbaev/uzbekvoice-filtered
- telegram-voice-messages (private collection)
- navaistt-open-datasets
- sovaai/russian-audiobooks
- librispeech

## Evaluation

### Word Error Rate (WER) Comparison

All WER results were obtained using the same test set.
The test set consists of real-world voice messages collected from public Telegram groups.
It contains approximately 2 hours of audio data in total.
The dataset will be made publicly available soon.

| Model                                   | WER ↓     |
|-----------------------------------------|-----------|
| Whisper-small-uz-v1 (this model)        | **34.5%** |
| NavaiSTT v2 (Open-Source, medium model) | 35.14%    |
| Gemini (Commercial)                     | 36.21%    |
| Aisha STT (Commercial)                  | 41.71%    |

On this test set, the model achieves the **lowest WER among the compared commercial and open-source Uzbek STT models**, which suggests good generalization to informal real-world speech.
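
For reference, WER is the word-level edit distance (substitutions, deletions, and insertions) between a hypothesis and its reference transcript, divided by the number of reference words. The sketch below is a minimal pure-Python illustration of that definition, not the evaluation script used for the table above; the function names `word_errors` and `corpus_wer`, and the lowercase/whitespace normalization, are assumptions made for this example.

```python
from typing import Iterable, Tuple


def word_errors(reference: str, hypothesis: str) -> Tuple[int, int]:
    """Return (edit_distance_in_words, number_of_reference_words)."""
    # Assumed normalization: lowercase and split on whitespace only.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()

    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1], len(ref)


def corpus_wer(pairs: Iterable[Tuple[str, str]]) -> float:
    """Corpus-level WER: total word errors divided by total reference words."""
    total_errors = 0
    total_ref_words = 0
    for reference, hypothesis in pairs:
        errors, ref_words = word_errors(reference, hypothesis)
        total_errors += errors
        total_ref_words += ref_words
    if total_ref_words == 0:
        raise ValueError("references must contain at least one word")
    return total_errors / total_ref_words


if __name__ == "__main__":
    # Hypothetical (reference, hypothesis) pairs, for illustration only.
    samples = [
        ("salom dunyo", "salom dunyo"),
        ("bugun havo yaxshi", "bugun hava yaxshi"),
    ]
    print(f"WER: {corpus_wer(samples):.2%}")
```

Reported WER values depend heavily on text normalization (casing, punctuation, number formatting), so scores are only comparable when every system is scored with the same normalization.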

---

## Usage Example

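```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch
import torchaudio

model_id = "OvozifyLabs/whisper-small-uz-v1"

processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id)
model.eval()

# torchaudio returns a (channels, samples) tensor; downmix to mono
audio, sr = torchaudio.load("audio.wav")
audio = audio.mean(dim=0)

# Whisper's feature extractor expects 16 kHz audio, so resample if needed
target_sr = processor.feature_extractor.sampling_rate
if sr != target_sr:
    audio = torchaudio.functional.resample(audio, sr, target_sr)

# Convert the 1-D waveform into log-Mel input features
inputs = processor(audio.numpy(), sampling_rate=target_sr, return_tensors="pt")

with torch.no_grad():
    predicted_ids = model.generate(inputs.input_features)
text = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]

print(text)
```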