---
base_model: vasista22/whisper-telugu-large-v2
library_name: peft
language: te
license: apache-2.0
tags:
- automatic-speech-recognition
- whisper
- telugu
- indic
- lora
- entity-dense
metrics:
- wer
- ehr
datasets:
- ai4bharat/IndicVoices
- mozilla-foundation/common_voice_25_0
- google/fleurs
---

# Praxy-STT-Te-rb: Entity-Dense Telugu ASR via TTS↔STT Flywheel

A LoRA adapter on top of `vasista22/whisper-telugu-large-v2`, trained on the EDSA (Entity-Dense Synthetic Audio) corpus. It targets Indian-style entities (digit strings, currency amounts, addresses, brand names, English/Telugu code-mix) that the underlying base model fails to recognise.

## Headline results (entity-dense Telugu, n=102, Cartesia held-out)

| System | EHR | WER | SFR |
|---|---|---|---|
| Vanilla Whisper-large-v3 | 0.560 | 1.330 | 0.566 |
| vasista22 (open SOTA, our base) | 0.027 | 0.582 | 1.000 |
| Deepgram Nova-3 (commercial) | 0.160 | 0.690 | 0.978 |
| **Praxy-STT-Te-rb (this model)** | **0.473** | **0.324** | 0.928 |

On EHR, this is roughly **17× over open SOTA** (0.473 vs 0.027) and **3× over commercial** (0.473 vs 0.160) on Indian-entity recognition.

Read-prose accuracy is preserved within +6 pp WER on FLEURS-Te (0.39 vs 0.33 for vasista22), tied on IndicVoices conversational speech, and +1 pp on Common Voice 25.
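For readers comparing the WER columns, here is a minimal sketch of the standard word-level WER definition (edit distance divided by reference length). It is an illustration only: the text normalisation and scoring script used for the numbers above are not described in this card, and `word_error_rate` is a hypothetical helper name. The sketch also shows why WER can exceed 1.0, as in the vanilla Whisper-large-v3 row: insertions count as errors but do not lengthen the reference.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Row 0: turning an empty reference into the first j hypothesis words costs j insertions.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        # Column 0: matching i reference words against an empty hypothesis costs i deletions.
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            )
        prev = curr
    return prev[-1] / len(ref)
```

With placeholder tokens, `word_error_rate("a b", "a x b y z")` counts three insertions against a two-word reference, so the score is above 1.0.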

## Usage

```python
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor
from peft import PeftModel

base_model = "vasista22/whisper-telugu-large-v2"
processor = WhisperProcessor.from_pretrained(base_model, language="telugu", task="transcribe")
# Pass a torch.dtype object (rather than the string "bfloat16") so this works across transformers versions
model = WhisperForConditionalGeneration.from_pretrained(base_model, torch_dtype=torch.bfloat16).to("cuda")

# vasista22's saved generation_config requires explicit forced_decoder_ids under transformers >=4.40
forced = processor.tokenizer.get_decoder_prompt_ids(language="telugu", task="transcribe")
model.config.forced_decoder_ids = forced
model.generation_config.forced_decoder_ids = forced
model.generation_config.suppress_tokens = []

model = PeftModel.from_pretrained(model, "Praxel/praxy-stt-te-rb")
model.eval()

# Transcribe
import librosa
audio, _ = librosa.load("path/to/audio.wav", sr=16000, mono=True)
feats = processor.feature_extractor(audio, sampling_rate=16000, return_tensors="pt").input_features.to("cuda", dtype=torch.bfloat16)
pred_ids = model.generate(feats, max_new_tokens=400, num_beams=1)
text = processor.tokenizer.decode(pred_ids[0], skip_special_tokens=True).strip()
print(text)
```
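Because the snippet above is sensitive to library versions, it may help to check the environment against the pins listed under Training before loading the model. The following is a sketch under assumptions: `PINNED`, `find_mismatches`, and `installed_versions` are hypothetical names, and ignoring local build tags such as `+cu121` is a choice made here, not something this card specifies.

```python
from importlib import metadata

# Pins copied from the Training section of this card.
PINNED = {"transformers": "4.36.2", "peft": "0.10.0", "torch": "2.4.0"}


def find_mismatches(installed: dict) -> dict:
    """Return {package: (expected, found)} for every pin that is not met."""
    mismatches = {}
    for package, expected in PINNED.items():
        found = installed.get(package)
        # Drop local build tags such as "+cu121" before comparing (assumption).
        if found is None or found.split("+")[0] != expected:
            mismatches[package] = (expected, found)
    return mismatches


def installed_versions() -> dict:
    """Look up the installed version of each pinned package, or None if absent."""
    versions = {}
    for package in PINNED:
        try:
            versions[package] = metadata.version(package)
        except metadata.PackageNotFoundError:
            versions[package] = None
    return versions
```

Calling `find_mismatches(installed_versions())` returns an empty dict when the environment matches the pins.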

## Training

- **Base:** `vasista22/whisper-telugu-large-v2` (IIT-Madras Speech Lab, Apache-2.0)
- **LoRA config:** rank 16, alpha 32, dropout 0.05, target modules `q_proj`, `k_proj`, `v_proj`, `out_proj`
- **Training corpus:** Entity-Dense Synthetic Audio (~22 audio-hours per language) from Praxy R6 / vanilla Chatterbox / IndicF5 / ElevenLabs / Cartesia synthesis; Cartesia rows held out as evaluation set
- **Steps:** 4000 on Modal A10G, ~$5 compute
- **Pin chain:** `transformers==4.36.2`, `peft==0.10.0`, `torch==2.4.0` (vasista22's saved generation_config is incompatible with newer transformers)
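The LoRA hyperparameters listed above map onto `peft.LoraConfig` roughly as follows. This is a sketch, not the released training script: only the four values from the list are taken from this card, every other `LoraConfig` field is left at its peft default (an assumption), and `LORA_KWARGS` and `build_lora_config` are hypothetical names.

```python
# Hyperparameters as listed in the Training section of this card.
LORA_KWARGS = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "target_modules": ["q_proj", "k_proj", "v_proj", "out_proj"],
}

# Standard LoRA scales the low-rank update by lora_alpha / r.
LORA_SCALING = LORA_KWARGS["lora_alpha"] / LORA_KWARGS["r"]


def build_lora_config():
    """Build a peft LoraConfig from the values above.

    peft is imported lazily so the dictionary can be inspected without it.
    Fields not listed in the card stay at peft defaults (assumption).
    """
    from peft import LoraConfig

    return LoraConfig(**LORA_KWARGS)
```

With rank 16 and alpha 32, the adapter update is scaled by a factor of 2 before being added to the frozen attention projections.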

## License + companion work

Apache-2.0 (matches upstream vasista22 license).

This is paper #3 in a series:
- **Praxy Voice TTS** (paper #1, the synthesis half of this flywheel): [arXiv:2604.25441](https://arxiv.org/abs/2604.25441)
- **PSP** (paper #2, accent metric used to validate synth quality): [arXiv:2604.25476](https://arxiv.org/abs/2604.25476)
- **STT Flywheel** (this paper): [arXiv:2605.03073](https://arxiv.org/abs/2605.03073); code at [github.com/praxelhq/stt-flywheel](https://github.com/praxelhq/stt-flywheel)

Companion β models: `Praxel/praxy-stt-hi-rb`, `Praxel/praxy-stt-ta-rb`.

## Limitations

- Entity-dense evaluation uses Cartesia-synthesised audio held out from training; transfer to native human entity-dense speech is not directly measured.
- Pre-registered EHR ≥ 0.75 target was missed (achieved 0.473); entity-dense Indic ASR remains substantially open as a research direction.
- Read-prose regression is bounded but exists (+6 pp on FLEURS-Te); for pure read-prose deployment the upstream vasista22 base is preferable.

## Citation

```bibtex
@misc{praxy_stt_2026,
  author = {Menta, Venkata Pushpak Teja},
  title = {The TTS--STT Flywheel: Synthetic Entity-Dense Audio Closes the Indic ASR Gap Where Commercial and Open-Source Systems Fail},
  year = {2026},
  publisher = {Praxel Ventures},
  howpublished = {\url{https://huggingface.co/Praxel/praxy-stt-te-rb}},
}
```