---
language:
  - uz
license: apache-2.0
tags:
  - whisper
  - automatic-speech-recognition
  - uzbek
  - speech-to-text
  - asr
metrics:
  - wer
  - cer
base_model: openai/whisper-medium
pipeline_tag: automatic-speech-recognition
library_name: transformers
datasets:
  - custom
model-index:
  - name: whisper-medium-uz-v1
    results:
      - task:
          type: automatic-speech-recognition
          name: Speech Recognition
        metrics:
          - type: wer
            value: 16.7
            name: Overall WER
          - type: cer
            value: 7.0
            name: Overall CER
---

# Whisper Medium Uzbek v1 by **Kotibai & Rubai Team**

An **Uzbek automatic speech recognition (ASR) model** fine-tuned from OpenAI Whisper Medium, developed by the **Kotibai & Rubai Team**.

## Model Description

- **Base Model**: OpenAI Whisper Medium (769M parameters)
- **Language**: Uzbek (uz)
- **Training Data**: ~1,600 hours of Uzbek audio
- **Precision**: BF16
- **Output Script**: Latin (Russian loanwords are also written in Latin script, e.g. "brat", "davay", "prosto")

## Evaluation Results

| Category | WER |
|----------|-----|
| **Overall** | **16.7%** |
| Clean Speech | ~6-11% |
| Noisy/Augmented | ~12-24% |
| Dialects | ~16-25% |

Evaluated on 1,864 samples across 8 test sets. The overall character error rate (CER) on the same samples is 7.0%.
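
For readers who want to score their own transcripts, the sketch below shows how WER and CER are conventionally defined: edit distance between reference and hypothesis, divided by the reference length, counted over words or characters. It is a minimal illustration only. The card does not state which text normalization or pooling was used for the numbers above, so results from this sketch are not guaranteed to match them. The function names and the example sentences are illustrative assumptions.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# Illustrative sentences, not taken from the evaluation sets.
reference = "men bugun maktabga bordim"
hypothesis = "men bugun maktabga bordi"
print(f"WER: {wer(reference, hypothesis):.2%}")
print(f"CER: {cer(reference, hypothesis):.2%}")
```

A single dropped character changes one of four words but only one of 25 characters, which is why CER is typically much lower than WER for the same output.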

## Usage

### Using Transformers

```python
import librosa
from transformers import WhisperForConditionalGeneration, WhisperProcessor

processor = WhisperProcessor.from_pretrained("Kotib/uzbek_stt_v1")
model = WhisperForConditionalGeneration.from_pretrained("Kotib/uzbek_stt_v1")

# Whisper expects 16 kHz mono audio; librosa resamples on load.
audio, sr = librosa.load("audio.wav", sr=16000)
# The feature extractor pads or truncates the input to a single 30-second window,
# so use the pipeline example for longer recordings.
input_features = processor(audio, sampling_rate=sr, return_tensors="pt").input_features

# Force Uzbek transcription instead of relying on automatic language detection.
predicted_ids = model.generate(input_features, language="uz", task="transcribe")
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
```

### Using Pipeline

```python
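import torch
from transformers import pipeline

# Use the GPU when one is available, otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

pipe = pipeline(
    "automatic-speech-recognition",
    model="Kotib/uzbek_stt_v1",
    chunk_length_s=30,  # split long recordings into 30-second chunks
    device=device,
)

result = pipe("audio.wav", generate_kwargs={"language": "uz", "task": "transcribe"})
print(result["text"])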
```

## Training

Trained in 3 stages using curriculum learning:

| Stage | Hours |
|-------|-------|
| Foundation | 725h |
| Robustness | 394h |
| Domain Adaptation | 474h |
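
As a quick check on how the stage hours relate to the "~1,600 hours" figure in the model description, the short sketch below sums the table and reports each stage's share of the total. The variable names are illustrative; only the hour values come from the table.

```python
# Hours per curriculum stage, copied from the table above.
stage_hours = {
    "Foundation": 725,
    "Robustness": 394,
    "Domain Adaptation": 474,
}

total_hours = sum(stage_hours.values())
shares = {name: hours / total_hours for name, hours in stage_hours.items()}

for name, hours in stage_hours.items():
    print(f"{name}: {hours} h ({shares[name]:.1%} of total)")
print(f"Total: {total_hours} h")
```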

## Intended Use

- Uzbek speech-to-text transcription
- Voice assistants and dictation
- Media transcription and subtitling
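
For the subtitling use case, one possible approach is to call the Transformers pipeline with `return_timestamps=True` and convert the resulting segments to SRT. The sketch below assumes the pipeline output contains a `"chunks"` list whose items look like `{"timestamp": (start, end), "text": ...}`, with times in seconds, and that the final end time may be `None`. The helper names and the `fallback_duration` parameter are hypothetical and not part of this model or of Transformers.

```python
def format_srt_time(seconds):
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def chunks_to_srt(chunks, fallback_duration=2.0):
    """Convert timestamped chunks into SRT subtitle text.

    `fallback_duration` (an assumption of this sketch) is used when a chunk
    has no end time.
    """
    entries = []
    for index, chunk in enumerate(chunks, start=1):
        start, end = chunk["timestamp"]
        if end is None:
            end = start + fallback_duration
        entries.append(
            f"{index}\n"
            f"{format_srt_time(start)} --> {format_srt_time(end)}\n"
            f"{chunk['text'].strip()}\n"
        )
    return "\n".join(entries)


# Hand-written chunks in the assumed shape; not real model output.
example_chunks = [
    {"timestamp": (0.0, 2.5), "text": " Salom, dunyo."},
    {"timestamp": (2.5, None), "text": " Bu sinov."},
]
print(chunks_to_srt(example_chunks))
```

With a real pipeline, the input would be something like `pipe("audio.wav", return_timestamps=True)["chunks"]`; segment boundaries from chunked inference are approximate and may need manual adjustment before publishing subtitles.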

## Limitations

- Performance degrades on very noisy audio
- May struggle with heavy code-switching
- Optimized for Uzbek only

## License

Apache 2.0