Armenian Whisper Small (whisper-small-hy)
Fine-tuned openai/whisper-small for Eastern Armenian (hy) automatic speech recognition (ASR). Part of the Armenian speech research stack alongside HyVoxPopuli and SpeechT5 TTS.
| Property | Value |
|---|---|
| Task | Automatic speech recognition |
| Language | Armenian (`<\|hy\|>`) |
| Architecture | WhisperForConditionalGeneration (encoder–decoder) |
| Input | 16 kHz mono audio (30 s chunks max) |
| Output | Armenian text (greedy / beam search via `generate`) |
| Weights | model.safetensors (~922 MB) |
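The 30 s input limit means longer recordings have to be split before they reach the model. Below is a minimal sketch of fixed-length, non-overlapping chunking at 16 kHz. The function name `chunk_bounds` is hypothetical, and non-overlapping windows are a simplifying assumption (practical pipelines often use overlapping strides so words are not cut at chunk edges).

```python
SAMPLE_RATE = 16_000   # Hz, the rate the model expects
MAX_CHUNK_S = 30       # seconds, the maximum chunk length listed above


def chunk_bounds(n_samples, sample_rate=SAMPLE_RATE, chunk_s=MAX_CHUNK_S):
    """Return (start, end) sample indices of consecutive chunks of at most chunk_s seconds."""
    chunk_len = sample_rate * chunk_s
    return [(start, min(start + chunk_len, n_samples))
            for start in range(0, n_samples, chunk_len)]


# Example: a 70 s recording is split into 30 s + 30 s + 10 s.
bounds = chunk_bounds(70 * SAMPLE_RATE)
print(bounds)
```

Each slice `audio[start:end]` of a mono waveform can then be passed to the processor on its own and the transcripts concatenated.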
Evaluation (from training run)
Metrics below come from the same validation split used during fine-tuning (auto-generated Trainer card). They are not a fresh benchmark on HyVoxPopuli or Common Voice test sets.
| Step | Train loss | Val loss | WER |
|---|---|---|---|
| 250 | 0.1258 | 0.3914 | 76.08% |
| 500 | 0.0064 | 0.4882 | 74.57% |
| 750 | 0.0008 | 0.5486 | 74.25% |
| 1000 | 0.0007 | 0.5691 | 74.77% |
Interpretation: Validation WER stays near 75% while training loss falls to 0.0007 and validation loss rises from 0.39 to 0.57, a strong sign of overfitting on a small fine-tuning set. Treat this checkpoint as an experimental baseline, not production ASR.
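The overfitting pattern can be read directly from the logged rows. The sketch below uses only the four evaluation rows from the table. The variable names and the two selection criteria (lowest WER, lowest validation loss) are illustrative assumptions, not part of the original training script.

```python
# Evaluation rows copied from the table: (step, train_loss, val_loss, wer_percent)
LOG = [
    (250, 0.1258, 0.3914, 76.08),
    (500, 0.0064, 0.4882, 74.57),
    (750, 0.0008, 0.5486, 74.25),
    (1000, 0.0007, 0.5691, 74.77),
]

# Checkpoint with the lowest validation WER.
best_by_wer = min(LOG, key=lambda row: row[3])
# Checkpoint with the lowest validation loss.
best_by_val_loss = min(LOG, key=lambda row: row[2])
# Validation minus train loss at each step; a widening gap suggests overfitting.
gaps = [(step, round(val - train, 4)) for step, train, val, _ in LOG]

print("Lowest WER at step:", best_by_wer[0])
print("Lowest val loss at step:", best_by_val_loss[0])
print("Val - train loss gap per step:", gaps)
```

On these numbers the two criteria disagree (step 750 by WER, step 250 by validation loss), and the WER spread across checkpoints is under two points, so neither choice changes the conclusion that the checkpoint is experimental.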
For stronger Armenian ASR today, prefer facebook/mms-1b-all with the hy language adapter (see HyVoxPopuli dataset card).
Quick start
```python
import torch
from datasets import load_dataset, Audio
from transformers import WhisperForConditionalGeneration, WhisperProcessor

model_id = "Edmon02/whisper-small-hy"
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id).to(device)
model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(
    language="hy", task="transcribe"
)

# Example: HyVoxPopuli test clip (filter empty text in real eval)
ds = load_dataset("Edmon02/hyvoxpopuli", split="test")
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))
ds = ds.filter(lambda x: bool((x["normalized_text"] or "").strip()))
sample = ds[0]

inputs = processor(
    sample["audio"]["array"],
    sampling_rate=16_000,
    return_tensors="pt",
)
inputs = {k: v.to(device) for k, v in inputs.items()}

with torch.no_grad():
    ids = model.generate(**inputs, max_new_tokens=256)

hypothesis = processor.batch_decode(ids, skip_special_tokens=True)[0]
print("Hypothesis:", hypothesis)
print("Reference:", sample["normalized_text"])
```
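To turn a hypothesis/reference pair like the one printed above into a score, word error rate can be computed as the word-level edit distance divided by the number of reference words. Below is a minimal pure-Python sketch. The `wer` helper is hypothetical, splits on whitespace only, and applies no text normalization, so its numbers will not necessarily match the WER reported for the training run, whose exact metric implementation is not documented here.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # At the start of iteration i, prev[j] is the edit distance
    # between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# Toy example (not taken from the dataset): one substituted word out of four.
score = wer("բարև ձեզ սիրելի ընկերներ", "բարև ձեզ սիրելի ընկեր")
print(f"WER: {score:.2%}")
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.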
Pipeline (high level)
```python
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="Edmon02/whisper-small-hy",
    device=0,  # first CUDA GPU; use device=-1 to run on CPU
    generate_kwargs={"language": "hy", "task": "transcribe"},
)
# asr(audio_path_or_array)
```
Intended uses
- Prototyping Armenian ASR in Whisper-based pipelines
- Comparing Whisper fine-tuning vs MMS / XLS-R on HyVoxPopuli
- Teaching / reproducing low-resource Whisper fine-tunes
Out of scope
- Production transcription without re-benchmarking on your domain
- Real-time streaming (use faster-whisper + optimized checkpoints after validation)
- Non-Armenian languages (use base `openai/whisper-small` or other adapters)
- Speaker diarization or timestamp-heavy workflows without extra tooling
Training procedure
| Hyperparameter | Value |
|---|---|
| Base model | openai/whisper-small |
| Learning rate | 1e-5 |
| Train batch size | 16 |
| Eval batch size | 8 |
| Steps | 1000 |
| LR schedule | Linear warmup (125 steps) |
| Optimizer | Adam (β₁=0.9, β₂=0.999, ε=1e-8) |
| Precision | Native AMP (fp16) |
| Seed | 42 |
| Framework | Transformers 4.36.2, PyTorch 2.1.0+cu121, Datasets 2.16.1 |
Training data is not serialized in this repo; it was likely Common Voice hy-AM (and/or project-internal splits) from the same era as speecht5_finetuned_hy. Re-run training with documented splits before trusting metrics.
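The learning-rate rows in the hyperparameter table can be made concrete. The sketch below assumes a linear warmup to the peak rate over 125 steps followed by a linear decay to zero at step 1000, which is how the Transformers `linear` scheduler behaves. The table states only the warmup, so the decay shape is an assumption, and `lr_at` is a hypothetical helper.

```python
PEAK_LR = 1e-5        # "Learning rate" row
WARMUP_STEPS = 125    # "LR schedule" row
TOTAL_STEPS = 1000    # "Steps" row


def lr_at(step: int) -> float:
    """Learning rate at a given optimizer step: linear warmup, then linear decay (assumed)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    remaining = max(0, TOTAL_STEPS - step)
    return PEAK_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)


for step in (0, 125, 250, 500, 750, 1000):
    print(step, f"{lr_at(step):.2e}")
```

Under this assumption the rate peaks at step 125, well before the first logged evaluation at step 250.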
Limitations
- High WER (~75%) on the recorded validation pass — not competitive with modern MMS adapters
- Small fine-tuning corpus typical of early experiments
- Literary / read speech in HyVoxPopuli differs from conversational CV audio — domain shift
- No word-level timestamps in default `generate` config (`return_timestamps: false`)
- Multilingual Whisper vocabulary; Armenian uses language token `<|hy|>`
Related assets
| Asset | Link | Role |
|---|---|---|
| Dataset | Edmon02/hyvoxpopuli | Armenian speech for eval / fine-tune |
| TTS (paired stack) | speecht5_finetuned_voxpopuli_hy | Armenian TTS |
| Stronger ASR baseline | facebook/mms-1b-all + hy adapter | Recommended for new work |
| Base model | openai/whisper-small | Multilingual Whisper |
Bias and ethics
ASR errors disproportionately affect dialectal or code-switched Armenian (e.g. Russian lines in literary sources). Do not use outputs for high-stakes decisions without human review.
Citation
```bibtex
@misc{whisper_small_hy2024,
  author = {Avetisyan, Edmon},
  title = {Armenian Whisper Small (whisper-small-hy)},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Edmon02/whisper-small-hy}}
}
```
Also cite OpenAI Whisper and your training data (e.g. Common Voice, HyVoxPopuli).
License
Apache-2.0 (inherits from Whisper fine-tuning convention). Base openai/whisper-small license applies to architecture and tokenizer.