---
base_model: vasista22/whisper-telugu-large-v2
library_name: peft
language: te
license: apache-2.0
tags:
- automatic-speech-recognition
- whisper
- telugu
- indic
- lora
- entity-dense
metrics:
- wer
- ehr
datasets:
- ai4bharat/IndicVoices
- mozilla-foundation/common_voice_25_0
- google/fleurs
---

# Praxy-STT-Te-rb: Entity-Dense Telugu ASR via TTS↔STT Flywheel

LoRA adapter on top of `vasista22/whisper-telugu-large-v2` trained on the EDSA (Entity-Dense Synthetic Audio) corpus to recover Indian-style entity recognition (digit strings, currency amounts, addresses, brand names, English/Telugu code-mix) where the underlying base model fails.

## Headline results (entity-dense Telugu, n=102, Cartesia held-out)

| System | EHR | WER | SFR |
|---|---|---|---|
| Vanilla Whisper-large-v3 | 0.560 | 1.330 | 0.566 |
| vasista22 (open SOTA, our base) | 0.027 | 0.582 | 1.000 |
| Deepgram Nova-3 (commercial) | 0.160 | 0.690 | 0.978 |
| **Praxy-STT-Te-rb (this model)** | **0.473** | **0.324** | 0.928 |

= **17× over open SOTA, 3× over commercial** on Indian-entity recognition. Read-prose preserved within +6 pp WER on FLEURS-Te (0.39 vs vasista22 0.33), tied on IndicVoices conversational, +1 pp on Common Voice 25.

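The headline multipliers follow directly from the EHR column of the table. As a quick sanity check, the sketch below recomputes them from the reported numbers; the dictionary keys and the helper name `ehr_ratio` are illustrative and not part of the released code.

```python
# EHR values copied from the headline results table (entity-dense Telugu, n=102).
ehr = {
    "vasista22": 0.027,        # open SOTA, the base model
    "deepgram_nova3": 0.160,   # commercial baseline
    "praxy_stt_te_rb": 0.473,  # this adapter
}


def ehr_ratio(system: str, baseline: str) -> float:
    """Return how many times higher `system`'s EHR is than `baseline`'s."""
    return ehr[system] / ehr[baseline]


over_open = ehr_ratio("praxy_stt_te_rb", "vasista22")
over_commercial = ehr_ratio("praxy_stt_te_rb", "deepgram_nova3")

# Read-prose regression on FLEURS-Te, in percentage points of WER.
fleurs_delta_pp = round((0.39 - 0.33) * 100, 1)

print(f"vs open base:  {over_open:.1f}x")        # 17.5x
print(f"vs commercial: {over_commercial:.1f}x")  # 3.0x
print(f"FLEURS-Te WER delta: +{fleurs_delta_pp} pp")
```

The ratios are relative gains on a small held-out set (n=102), so they are best read alongside the absolute EHR values rather than on their own.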
## Usage

```python
import librosa
import torch
from peft import PeftModel
from transformers import WhisperForConditionalGeneration, WhisperProcessor

base_model = "vasista22/whisper-telugu-large-v2"
processor = WhisperProcessor.from_pretrained(base_model, language="telugu", task="transcribe")
# Pass a torch.dtype rather than a string so this also loads under the pinned transformers version
model = WhisperForConditionalGeneration.from_pretrained(base_model, torch_dtype=torch.bfloat16).to("cuda")

# vasista22's saved generation_config requires explicit forced_decoder_ids under transformers >=4.40
forced = processor.tokenizer.get_decoder_prompt_ids(language="telugu", task="transcribe")
model.config.forced_decoder_ids = forced
model.generation_config.forced_decoder_ids = forced
model.generation_config.suppress_tokens = []

# Attach the LoRA adapter to the base model
model = PeftModel.from_pretrained(model, "Praxel/praxy-stt-te-rb")
model.eval()

# Transcribe: load 16 kHz mono audio and extract log-mel features
audio, _ = librosa.load("path/to/audio.wav", sr=16000, mono=True)
feats = processor.feature_extractor(
    audio, sampling_rate=16000, return_tensors="pt"
).input_features.to("cuda", dtype=torch.bfloat16)

# Greedy decoding (num_beams=1)
pred_ids = model.generate(feats, max_new_tokens=400, num_beams=1)
text = processor.tokenizer.decode(pred_ids[0], skip_special_tokens=True).strip()
print(text)
```

## Training

- **Base:** `vasista22/whisper-telugu-large-v2` (IIT-Madras Speech Lab, Apache-2.0)
- **LoRA config:** rank 16, alpha 32, dropout 0.05, target modules `q_proj k_proj v_proj out_proj`
- **Training corpus:** Entity-Dense Synthetic Audio (~22 audio-hours per language) from Praxy R6 / vanilla Chatterbox / IndicF5 / ElevenLabs / Cartesia synthesis; Cartesia rows held out as evaluation set
- **Steps:** 4000 on Modal A10G, ~$5 compute
- **Pin chain:** `transformers==4.36.2`, `peft==0.10.0`, `torch==2.4.0` (vasista22's saved generation_config is incompatible with newer transformers)

## License + companion work

Apache-2.0 (matches upstream vasista22 license).

This is paper #3 in a series:

- **Praxy Voice TTS** (paper #1, the synthesis half of this flywheel): [arXiv:2604.25441](https://arxiv.org/abs/2604.25441)
- **PSP** (paper #2, accent metric used to validate synth quality): [arXiv:2604.25476](https://arxiv.org/abs/2604.25476)
- **STT Flywheel** (this paper): [arXiv:2605.03073](https://arxiv.org/abs/2605.03073); code at [github.com/praxelhq/stt-flywheel](https://github.com/praxelhq/stt-flywheel)

Companion β models: `Praxel/praxy-stt-hi-rb`, `Praxel/praxy-stt-ta-rb`.

## Limitations

- Entity-dense evaluation is on Cartesia-synthesised audio held-out from training; transfer to native human entity-dense speech is not directly measured.
- Pre-registered EHR ≥ 0.75 target was missed (achieved 0.473); entity-dense Indic ASR remains substantially open as a research direction.
- Read-prose regression is bounded but exists (+6 pp on FLEURS-Te); for pure read-prose deployment the upstream vasista22 base is preferable.

## Citation

```bibtex
@misc{praxy_stt_2026,
  author = {Menta, Venkata Pushpak Teja},
  title = {The TTS--STT Flywheel: Synthetic Entity-Dense Audio Closes the Indic ASR Gap Where Commercial and Open-Source Systems Fail},
  year = {2026},
  publisher = {Praxel Ventures},
  howpublished = {\url{https://huggingface.co/Praxel/praxy-stt-te-rb}},
}
```