Armenian Whisper Small (`whisper-small-hy`)

Fine-tuned openai/whisper-small for Eastern Armenian (hy) automatic speech recognition (ASR). Part of the Armenian speech research stack alongside HyVoxPopuli and SpeechT5 TTS.


Task	Automatic speech recognition
Language	Armenian (`<|hy|>`)
Architecture	`WhisperForConditionalGeneration` (encoder–decoder)
Input	16 kHz mono audio (30 s chunks max)
Output	Armenian text (greedy / beam search via `generate`)
Weights	`model.safetensors` (~922 MB)
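
The 30 s input limit in the table can be enforced before audio reaches the processor. The following is a minimal sketch in plain Python: the helper name `split_into_chunks` is hypothetical, and it assumes the audio is already mono and resampled to 16 kHz (for example by the `Audio(sampling_rate=16_000)` cast used in the quick start). It works on any sliceable sequence, including NumPy arrays.

```python
SAMPLE_RATE = 16_000      # Hz, from the table above
MAX_CHUNK_SECONDS = 30    # maximum chunk length, from the table above


def split_into_chunks(samples, sample_rate=SAMPLE_RATE, max_seconds=MAX_CHUNK_SECONDS):
    """Split mono audio into consecutive chunks of at most `max_seconds` each."""
    if sample_rate != SAMPLE_RATE:
        raise ValueError(f"expected {SAMPLE_RATE} Hz audio, got {sample_rate} Hz")
    chunk_len = sample_rate * max_seconds
    # The last chunk may be shorter than chunk_len.
    return [samples[i:i + chunk_len] for i in range(0, len(samples), chunk_len)]


# 70 s of silence splits into chunks of 30 s, 30 s and 10 s.
audio = [0.0] * (70 * SAMPLE_RATE)
print([len(chunk) / SAMPLE_RATE for chunk in split_into_chunks(audio)])
```

Fixed boundaries like these can cut a word in half, so overlapping windows or splitting at silences may work better in practice.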

Evaluation (from training run)

Metrics below come from the same validation split used during fine-tuning (auto-generated Trainer card). They are not a fresh benchmark on HyVoxPopuli or Common Voice test sets.

Step	Train loss	Val loss	WER
250	0.1258	0.3914	76.08%
500	0.0064	0.4882	74.57%
750	0.0008	0.5486	74.25%
1000	0.0007	0.5691	74.77%

Interpretation: Validation WER stays near 75% at every checkpoint, while validation loss rises from 0.3914 to 0.5691 and training loss falls to 0.0007. This is a strong sign of overfitting on a small fine-tuning set. Treat this checkpoint as an experimental baseline, not production ASR.
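
The same pattern can be read directly off the logged numbers. As a small sketch (the helper names `best_by_wer` and `val_loss_rising` are hypothetical), the snippet below copies the four rows from the table, finds the step with the lowest WER, and checks whether validation loss rises at every logged step.

```python
# (step, train_loss, val_loss, wer_percent), copied from the evaluation table
LOG = [
    (250, 0.1258, 0.3914, 76.08),
    (500, 0.0064, 0.4882, 74.57),
    (750, 0.0008, 0.5486, 74.25),
    (1000, 0.0007, 0.5691, 74.77),
]


def best_by_wer(log):
    """Return the logged row with the lowest validation WER."""
    return min(log, key=lambda row: row[3])


def val_loss_rising(log):
    """True if validation loss increases between every pair of consecutive rows."""
    losses = [row[2] for row in log]
    return all(earlier < later for earlier, later in zip(losses, losses[1:]))


best = best_by_wer(LOG)
final_gap = LOG[-1][2] - LOG[-1][1]  # validation loss minus training loss at the last step
print(f"lowest WER at step {best[0]}: {best[3]}%")
print("validation loss rises at every logged step:", val_loss_rising(LOG))
print(f"final val/train loss gap: {final_gap:.4f}")
```

The lowest WER is logged at step 750, before the final checkpoint, which is consistent with training past the point of useful generalization.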

For stronger Armenian ASR today, prefer facebook/mms-1b-all with the hy language adapter (see HyVoxPopuli dataset card).

Quick start

import torch
from datasets import load_dataset, Audio
from transformers import WhisperForConditionalGeneration, WhisperProcessor

model_id = "Edmon02/whisper-small-hy"
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id).to(device)
model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(
    language="hy", task="transcribe"
)

# Example: HyVoxPopuli test clip (filter empty text in real eval)
ds = load_dataset("Edmon02/hyvoxpopuli", split="test")
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))
ds = ds.filter(lambda x: bool((x["normalized_text"] or "").strip()))
sample = ds[0]

inputs = processor(
    sample["audio"]["array"],
    sampling_rate=16_000,
    return_tensors="pt",
)
inputs = {k: v.to(device) for k, v in inputs.items()}

with torch.no_grad():
    ids = model.generate(**inputs, max_new_tokens=256)

hypothesis = processor.batch_decode(ids, skip_special_tokens=True)[0]
print("Hypothesis:", hypothesis)
print("Reference:", sample["normalized_text"])
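
To turn the printed hypothesis and reference into a score, a word error rate can be computed with a word-level edit distance. The sketch below is a minimal pure-Python version: it assumes whitespace tokenization and applies no text normalization, so it is not necessarily how the WER in the evaluation table was computed. The function name `word_error_rate` is hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One of three reference words is missing from the hypothesis.
print(word_error_rate("բարի լույս բոլորին", "բարի լույս"))
```

With the quick-start variables, `word_error_rate(sample["normalized_text"], hypothesis)` gives a per-utterance score. Corpus-level WER sums edits and reference words over all utterances rather than averaging per-utterance rates.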

Pipeline (high level)

import torch
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="Edmon02/whisper-small-hy",
    device=0 if torch.cuda.is_available() else -1,  # first GPU if available, otherwise CPU
    generate_kwargs={"language": "hy", "task": "transcribe"},
)
# asr(audio_path_or_array)

Intended uses

Prototyping Armenian ASR in Whisper-based pipelines
Comparing Whisper fine-tuning vs MMS / XLS-R on HyVoxPopuli
Teaching / reproducing low-resource Whisper fine-tunes

Out of scope

Production transcription without re-benchmarking on your domain
Real-time streaming (use faster-whisper + optimized checkpoints after validation)
Non-Armenian languages (use base openai/whisper-small or other adapters)
Speaker diarization or timestamp-heavy workflows without extra tooling

Training procedure

Hyperparameter	Value
Base model	`openai/whisper-small`
Learning rate	1e-5
Train batch size	16
Eval batch size	8
Steps	1000
LR schedule	Linear warmup (125 steps)
Optimizer	Adam (β₁=0.9, β₂=0.999, ε=1e-8)
Precision	Native AMP (fp16)
Seed	42
Framework	Transformers 4.36.2, PyTorch 2.1.0+cu121, Datasets 2.16.1
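
For reproduction, the table maps onto keyword arguments of `transformers.Seq2SeqTrainingArguments`. The sketch below keeps them in a plain dictionary so it runs without Transformers installed, and adds a small function showing the learning-rate curve. Two assumptions are not stated in the table: that training used a single device (so the batch sizes are per-device), and that the rate decays linearly to zero after warmup. The function name `lr_at` is hypothetical.

```python
# Keyword names follow transformers.Seq2SeqTrainingArguments; values come from the table.
TRAINING_KWARGS = {
    "learning_rate": 1e-5,
    "per_device_train_batch_size": 16,  # assumption: single device
    "per_device_eval_batch_size": 8,    # assumption: single device
    "max_steps": 1000,
    "warmup_steps": 125,
    "lr_scheduler_type": "linear",      # assumption: linear decay after warmup
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-8,
    "fp16": True,
    "seed": 42,
}


def lr_at(step: int, peak: float = 1e-5, warmup: int = 125, total: int = 1000) -> float:
    """Learning rate under linear warmup to `peak`, then linear decay to zero at `total`."""
    if step < warmup:
        return peak * step / warmup
    return peak * max(0.0, (total - step) / (total - warmup))


# Learning rate at the evaluation steps from the table, plus the start and the warmup peak.
for step in (0, 125, 250, 500, 750, 1000):
    print(step, f"{lr_at(step):.2e}")
```

Under these assumptions the dictionary can be passed as `Seq2SeqTrainingArguments(output_dir=..., **TRAINING_KWARGS)`, with an output directory of your choice.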

The training data is not recorded in this repo; it was likely Common Voice hy-AM (and/or project-internal splits) from the same period as speecht5_finetuned_hy. Re-run training with documented splits before trusting the metrics.

Limitations

High WER (~75%) on the recorded validation pass, not competitive with modern MMS adapters
Small fine-tuning corpus typical of early experiments
Domain shift: literary / read speech in HyVoxPopuli differs from conversational Common Voice (CV) audio
No word-level timestamps in the default generate config (`return_timestamps: false`)
Multilingual Whisper vocabulary; Armenian uses the language token `<|hy|>`

Related assets

Asset	Link	Role
Dataset	Edmon02/hyvoxpopuli	Armenian speech for eval / fine-tune
TTS (paired stack)	speecht5_finetuned_voxpopuli_hy	Armenian TTS
Stronger ASR baseline	facebook/mms-1b-all + `hy` adapter	Recommended for new work
Base model	openai/whisper-small	Multilingual Whisper

Bias and ethics

ASR errors disproportionately affect dialectal or code-switched Armenian (e.g. Russian lines in literary sources). Do not use outputs for high-stakes decisions without human review.
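
One way to route likely code-switched text to a human reviewer is to check which scripts appear in a reference or transcript. The sketch below counts letters in the Armenian (U+0530 to U+058F) and Cyrillic (U+0400 to U+04FF) Unicode blocks. The function names and the 10% threshold are hypothetical choices for illustration, and the check ignores Armenian ligatures encoded outside the main block.

```python
def script_counts(text: str) -> dict[str, int]:
    """Count alphabetic characters by script: Armenian, Cyrillic, or other."""
    counts = {"armenian": 0, "cyrillic": 0, "other": 0}
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation and whitespace
        code_point = ord(ch)
        if 0x0530 <= code_point <= 0x058F:
            counts["armenian"] += 1
        elif 0x0400 <= code_point <= 0x04FF:
            counts["cyrillic"] += 1
        else:
            counts["other"] += 1
    return counts


def needs_review(text: str, max_foreign_ratio: float = 0.1) -> bool:
    """Flag text with no letters or with too many non-Armenian letters."""
    counts = script_counts(text)
    total = sum(counts.values())
    if total == 0:
        return True
    return (total - counts["armenian"]) / total > max_foreign_ratio


print(needs_review("բարև"))          # Armenian only
print(needs_review("բարև привет"))   # mixed Armenian and Russian
```

A script check only catches code-switching that changes alphabet; dialectal Armenian written in the Armenian script passes through unflagged and still needs human judgment.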

Citation

@misc{whisper_small_hy2024,
  author = {Avetisyan, Edmon},
  title = {Armenian Whisper Small (whisper-small-hy)},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Edmon02/whisper-small-hy}}
}

Also cite OpenAI Whisper and your training data (e.g. Common Voice, HyVoxPopuli).

License

Apache-2.0 (inherits from Whisper fine-tuning convention). Base openai/whisper-small license applies to architecture and tokenizer.

