---
language:
- uz
license: apache-2.0
tags:
- whisper
- automatic-speech-recognition
- audio-transcription
- uzbek
- fine-tuned
- speech-recognition
---

# rubaiSTT-2v Medium - Uzbek Speech-to-Text Model

The classic Whisper medium model, fine-tuned for the Uzbek language. The training data covered diverse audio: publicly available podcasts, Tashkent dialect podcasts, news, Google FLEURS, USC, and Common Voice 17. Label quality was mixed: 50% of the data was human-transcribed and 50% was pseudo-transcribed using Gemini 2.5 Pro.

The difference from v1 is that v2 is fully open-sourced. Due to conflicts with data partners, v1 was removed and its 500-hour dataset was excluded. New, different datasets were used instead, all of which will be open-sourced. The training scripts will also be open-sourced, so the entire process will be fully reproducible.

Special attention was given to Tashkent dialect audio, resulting in strong performance on this dialect. Future versions will include other regional dialects to improve overall coverage.

## Whitepaper
For more details on the methodology and research behind this model, visit: https://uz-speech.web.app/rubaistt02m

Training and filtering code: https://github.com/Islomov49/rubaistt_v2-open-sourced

Support my work and the open-source movement: https://tirikchilik.uz/islomovs

## Model Details

- **Base Model:** Whisper Medium
- **Parameters:** 769M
- **Performance:**
  - WER: ~17%
  - CER: ~5.5%

## Training Data

This model was fine-tuned on approximately 475 hours of diverse Uzbek audio data, including:
- Common Voice 17 (filtered)
- USC (filtered)
- Google FLEURS (filtered)
- Podcasts Tashkent Dialect YouTube Uzbek Speech Dataset: [Link HF](https://huggingface.co/datasets/islomov/podcasts_tashkent_dialect_youtube_uzbek_speech_dataset)
- News YouTube Uzbek Speech Dataset: [Link HF](https://huggingface.co/datasets/islomov/news_youtube_uzbek_speech_dataset)
- IT YouTube Uzbek Speech Dataset: [Link HF](https://huggingface.co/datasets/islomov/it_youtube_uzbek_speech_dataset)

The dataset consisted of 50% human-transcribed and 50% pseudo-transcribed material (using Gemini 2.5 Pro). Special attention was given to Tashkent dialect audio to ensure strong performance on this dialect.

The datasets were filtered using Word Error Rate (WER) and similarity checks. The script for this process will also be open-sourced.

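As a rough illustration of WER-plus-similarity filtering, the sketch below keeps a sample only when its transcript agrees closely with a second transcription of the same audio. This is not the project's actual filtering script: the function names, the idea of comparing against a second ASR output, and both thresholds (`max_wer`, `min_similarity`) are assumptions made for the example.

```python
from difflib import SequenceMatcher


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        return 0.0 if not hyp else 1.0
    # prev[j]: edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


def similarity(a: str, b: str) -> float:
    """Character-level similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def keep_sample(label: str, second_transcript: str,
                max_wer: float = 0.3, min_similarity: float = 0.8) -> bool:
    """Hypothetical rule: keep a sample only if both checks pass."""
    return (word_error_rate(label, second_transcript) <= max_wer
            and similarity(label, second_transcript) >= min_similarity)
```

Under these assumed thresholds, a sample whose two transcripts differ in one word out of four (WER 0.25) would pass the WER check, while a pair of unrelated sentences would be dropped.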
## Usage Example

```python
import torch
import torchaudio
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Load model and processor once, and move the model to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
processor = WhisperProcessor.from_pretrained("islomov/rubaistt_v2_medium")
model = WhisperForConditionalGeneration.from_pretrained("islomov/rubaistt_v2_medium").to(device)
model.eval()


def transcribe_audio(audio_path):
    # Load audio and resample to the 16 kHz rate Whisper expects
    waveform, sample_rate = torchaudio.load(audio_path)
    if sample_rate != 16000:
        waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)

    # Convert to mono if needed
    if waveform.shape[0] > 1:
        waveform = waveform.mean(dim=0, keepdim=True)

    # Extract log-mel input features
    input_features = processor(
        waveform.squeeze(0).numpy(),
        sampling_rate=16000,
        return_tensors="pt",
    ).input_features.to(device)

    # Generate transcription; the language is selected at generation time
    with torch.no_grad():
        predicted_ids = model.generate(input_features, language="uz", task="transcribe")

    # Decode token IDs to text
    transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
    return transcription


# Example usage (Whisper handles up to 30 seconds of audio per call)
if __name__ == "__main__":
    audio_file = "some_audio_max_30_sec.wav"

    text = transcribe_audio(audio_file)
    print(f"Transcription: {text}")
```
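
The usage example assumes a clip of at most 30 seconds, which matches Whisper's fixed input window. For longer recordings, one simple option is to split the audio into windows first. The helper below is a minimal sketch of that idea, not part of this model's code: `chunk_bounds` and its parameters are hypothetical names, and it only computes sample index ranges so it has no dependencies.

```python
def chunk_bounds(num_samples, sample_rate=16000, chunk_seconds=30.0, overlap_seconds=0.0):
    """Return (start, end) sample indices that cover the audio in fixed-size windows."""
    if overlap_seconds >= chunk_seconds:
        raise ValueError("overlap_seconds must be smaller than chunk_seconds")

    size = int(chunk_seconds * sample_rate)           # window length in samples
    step = size - int(overlap_seconds * sample_rate)  # hop between window starts

    bounds = []
    start = 0
    while start < num_samples:
        end = min(start + size, num_samples)
        bounds.append((start, end))
        if end == num_samples:
            break  # the last window reached the end of the audio
        start += step
    return bounds
```

Each `(start, end)` pair can be used to slice a 16 kHz mono waveform, for example `waveform[..., start:end]`, and each slice can then go through the same feature-extraction and generation steps as a short clip. Cutting at fixed positions may split words at window boundaries, so a small overlap or silence-based splitting is likely to give cleaner results.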

## Future Improvements
Future versions will include more regional Uzbek dialects to improve overall coverage.