# LexiCore Wav2Vec2 XLS-R 300M CTC: শব্দতরী Bangla Dialect ASR

This model is a fine-tuned version of [`arijitx/wav2vec2-xls-r-300m-bengali`](https://huggingface.co/arijitx/wav2vec2-xls-r-300m-bengali) for the **“শব্দতরী: Where Dialects Flow into Bangla”** competition.

- Task: dialectal Bangla speech → standard Bangla text
- Data: 3,350 audio clips from 20 regions of Bangladesh (competition dataset only)
- Metric: Normalized Levenshtein Similarity (character-level)
- Decoding: CTC + 5-gram KenLM (`pyctcdecode`) + a small punctuation rule
- Training:
  - 20 epochs
  - Learning rate: 1e-4
  - Effective batch size: 8 (4 × 2 gradient accumulation steps)
  - Strong waveform augmentations (speed, gain, noise, time-drop)

## Intended Use

- Research and experimentation on Bangla ASR for low-resource and dialectal settings
- Non-commercial applications that respect the original competition and dataset license

## Limitations

- Trained only on short, scripted sentences from 20 Bangladeshi regions
- May not generalize to very long utterances, noisy real-world audio, or code-switching
- Output is in standard written Bangla, not dialect spelling

## Usage (pseudo-code)

```python
import torch
import torchaudio
from transformers import AutoModelForCTC, Wav2Vec2Processor

repo_id = "your-username/your-repo"  # placeholder: replace with the actual model repo
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = Wav2Vec2Processor.from_pretrained(repo_id)
model = AutoModelForCTC.from_pretrained(repo_id).to(device).eval()

# Load the audio, mix down to mono, and resample to 16 kHz if needed
waveform, sr = torchaudio.load("example.wav")
waveform = waveform.mean(dim=0)
if sr != 16000:
    waveform = torchaudio.functional.resample(waveform, sr, 16000)

inputs = processor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values.to(device)).logits

# Greedy CTC decoding (the KenLM language model is not applied here)
pred_ids = torch.argmax(logits, dim=-1)
transcript = processor.batch_decode(pred_ids)[0]
print(transcript)
```
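
To make the character-level Normalized Levenshtein Similarity metric concrete, here is a minimal sketch in plain Python. The competition's exact normalization is not stated in this card, so the formula below (one minus edit distance divided by the length of the longer string) is an assumption. The function names `levenshtein` and `normalized_levenshtein_similarity` are illustrative and do not come from the competition code. The sketch compares Unicode code points, so a Bangla consonant and its vowel sign count as separate characters; the official scorer may tokenize differently.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings, counted in Unicode code points."""
    # Keep the shorter string in the inner loop so the row buffer stays small.
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_levenshtein_similarity(reference: str, hypothesis: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(reference), len(hypothesis))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(reference, hypothesis) / longest
```

Under this assumed definition, a score of 1.0 means the transcript matches the reference exactly, and 0.0 means every character would need to be edited.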