---
license: mit
metrics:
- wer
- cer
pipeline_tag: automatic-speech-recognition
---
# Whisper Large-v3 Khmer ASR

Fine-tuned variant of [`openai/whisper-large-v3`](https://huggingface.co/openai/whisper-large-v3) for Khmer automatic speech recognition. The model was trained with the utilities in `whisper` and is intended for transcription workloads that prioritize Khmer text normalization, including numerals, currency, and date expressions.

## Model Card

| Attribute | Value |
| --- | --- |
| **Base model** | `openai/whisper-large-v3` |
| **Language** | Khmer (`km-KH`) |
| **Task** | Automatic Speech Recognition (speech-to-text) |
| **Sample rate** | 16 kHz audio, automatically resampled |
| **Input length** | Up to 30 s clips (truncated during batching) |
| **Finetuning data** | `asr_mixed_dataset.txt` (internal manifests, normalized through `dataset_builder.segment_text`) |
| **Epochs** | 10 |
| **Batch size** | 2 (gradient accumulation 1) |
| **Optimizer** | AdamW (managed by `Seq2SeqTrainer`) |
| **Learning rate** | 1e-6 with cosine scheduler & 1k warmup steps |
| **Normalization** | Khmer-specific regex and rule-based normalization (`khmerspeech`, `khmercut`) |
| **Dataset** | Mixed Khmer and English audio, 199K samples (225 hours): all public Khmer datasets plus a human-labeled dataset |
| **Training time** | 10 days with mixed precision on an RTX 5090 (32 GB VRAM) |

> Limitations: performance has been validated only on internal validation/test splits. Long-form audio, accents outside the training distribution, or noisy backgrounds may degrade accuracy.
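
Because validation so far covers only internal splits, it can help to measure the two metrics listed for this model (WER and CER) on your own data. The following is a minimal sketch, not the evaluation code used for this model; the function names are hypothetical. It assumes that reference and hypothesis have already been normalized the same way, that words are separated by spaces (Khmer text usually needs a segmenter such as `khmercut` first), and that characters are counted as Unicode code points rather than grapheme clusters.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (hypothetical helper)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def error_rate(ref_units, hyp_units):
    """Edit distance normalized by the reference length."""
    if not ref_units:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_units, hyp_units) / len(ref_units)


def wer(reference, hypothesis):
    # Word error rate over whitespace-separated tokens.
    # Assumption: the text is already segmented into words.
    return error_rate(reference.split(), hypothesis.split())


def cer(reference, hypothesis):
    # Character error rate over code points, ignoring spaces
    # (ignoring spaces is a choice made for this sketch).
    return error_rate(
        list(reference.replace(" ", "")),
        list(hypothesis.replace(" ", "")),
    )
```

For unsegmented Khmer output, CER is usually the more stable of the two numbers, since WER depends on the word segmenter applied to both strings.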


## Inference Examples

```python
import torch
import torchaudio
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline


AUDIO_PATH = "audio_path.wav"  # path to the audio file to transcribe


device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model_id = "metythorn/whisper-large-v3"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    task="automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

# torchaudio returns a (channels, samples) float tensor and the file's sample rate
speech_waveform, sr = torchaudio.load(AUDIO_PATH)

# Whisper expects 16 kHz mono: downmix the channels, then resample if needed
speech_waveform = speech_waveform.mean(dim=0)
if sr != 16000:
    speech_waveform = torchaudio.functional.resample(
        speech_waveform,
        orig_freq=sr,
        new_freq=16000,
    )
result = pipe(speech_waveform.numpy())

print("Transcription:", result["text"])

```
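
The model card lists an input length of up to 30 s, so longer recordings need to be split before transcription. The sketch below computes window boundaries in samples; `chunk_bounds` and its parameters are hypothetical names introduced for illustration, and the 16 kHz default mirrors the sample rate stated above.

```python
def chunk_bounds(num_samples, sample_rate=16000, chunk_s=30.0, overlap_s=0.0):
    """Return (start, end) sample indices that cover the audio in fixed windows.

    Hypothetical helper: windows are chunk_s seconds long and consecutive
    windows share overlap_s seconds. The last window may be shorter.
    """
    chunk = int(chunk_s * sample_rate)
    step = chunk - int(overlap_s * sample_rate)
    if chunk <= 0 or step <= 0:
        raise ValueError("chunk_s must be positive and larger than overlap_s")

    bounds = []
    start = 0
    while start < num_samples:
        end = min(start + chunk, num_samples)
        bounds.append((start, end))
        if end == num_samples:
            break  # reached the end of the audio
        start += step
    return bounds
```

Each slice `waveform[start:end]` of a 16 kHz mono array can then be passed to `pipe(...)` and the resulting texts joined. Fixed windows can cut a word at a boundary, so a small `overlap_s` or splitting on silence may give cleaner output. The `transformers` ASR pipeline also accepts a `chunk_length_s` argument that performs chunking internally; this model has not been validated on long-form audio with either approach.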