Armenian Whisper Small (whisper-small-hy)

Fine-tuned openai/whisper-small for Eastern Armenian (hy) automatic speech recognition (ASR). Part of the Armenian speech research stack alongside HyVoxPopuli and SpeechT5 TTS.

Task          Automatic speech recognition
Language      Armenian (<|hy|>)
Architecture  WhisperForConditionalGeneration (encoder–decoder)
Input         16 kHz mono audio (30 s chunks max)
Output        Armenian text (greedy / beam search via generate)
Weights       model.safetensors (~922 MB)
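
The input contract above (16 kHz audio, at most 30 s per chunk) can be checked before audio reaches the processor. The following is a minimal sketch in plain Python; the helper name `chunk_bounds` is illustrative and not part of this repo, and it assumes the audio has already been resampled to 16 kHz mono.

```python
TARGET_SR = 16_000        # sampling rate the model expects
MAX_CHUNK_SECONDS = 30    # longest window passed to the model at once


def chunk_bounds(num_samples, sampling_rate=TARGET_SR, max_seconds=MAX_CHUNK_SECONDS):
    """Return (start, end) sample indices of consecutive chunks of at most max_seconds."""
    if sampling_rate != TARGET_SR:
        raise ValueError(f"expected {TARGET_SR} Hz audio, got {sampling_rate} Hz")
    step = sampling_rate * max_seconds
    return [(start, min(start + step, num_samples))
            for start in range(0, num_samples, step)]


# A 70 s recording is split into windows of 30 s, 30 s and 10 s.
for start, end in chunk_bounds(70 * TARGET_SR):
    print(f"{start / TARGET_SR:.1f}s -> {end / TARGET_SR:.1f}s")
```

Each pair can be used to slice a waveform (`waveform[start:end]`). Hard cuts like these can split a word in two; the `chunk_length_s` argument of the Transformers ASR pipeline, which chunks with overlap, is usually the better choice for long recordings.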

Evaluation (from training run)

Metrics below come from the same validation split used during fine-tuning (auto-generated Trainer card). They are not a fresh benchmark on HyVoxPopuli or Common Voice test sets.

Step  Train loss  Val loss  WER
250   0.1258      0.3914    76.08%
500   0.0064      0.4882    74.57%
750   0.0008      0.5486    74.25%
1000  0.0007      0.5691    74.77%

Interpretation: Validation WER plateaus at 74–76% while validation loss rises from 0.3914 to 0.5691 and training loss collapses to 0.0007. That pattern is a strong sign of overfitting on a small fine-tuning set. Treat this checkpoint as an experimental baseline, not production ASR.
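
For readers who want to reproduce the WER column, the metric is the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. Below is a minimal pure-Python sketch. It assumes whitespace tokenization and no text normalization; the training run presumably used a library metric, so small differences from the reported figures would not be surprising.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words consumed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One of three reference words is missing from the hypothesis.
print(f"{word_error_rate('բարի լույս բոլորին', 'բարի լույս'):.2%}")
```

For a whole evaluation set, the usual convention is to sum edits and reference words over all utterances before dividing, instead of averaging per-utterance scores.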

For stronger Armenian ASR today, prefer facebook/mms-1b-all with the Armenian language adapter (MMS identifies languages by ISO 639-3 code, so the adapter is hye rather than hy; see the HyVoxPopuli dataset card).
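
As a point of comparison, here is a hedged sketch of running the MMS baseline with the Transformers API. The function name `transcribe_with_mms` is illustrative, and the adapter code `hye` is an assumption based on MMS using ISO 639-3 language codes; check the MMS language list before relying on it.

```python
def transcribe_with_mms(waveform, sampling_rate=16_000, target_lang="hye"):
    """Transcribe one 16 kHz mono waveform with facebook/mms-1b-all.

    target_lang="hye" is an assumption (ISO 639-3 code for Armenian).
    """
    # Imported lazily so that defining this helper does not require the
    # heavy dependencies to be installed.
    import torch
    from transformers import AutoProcessor, Wav2Vec2ForCTC

    model_id = "facebook/mms-1b-all"
    processor = AutoProcessor.from_pretrained(model_id)
    model = Wav2Vec2ForCTC.from_pretrained(model_id)

    # Switch both the tokenizer vocabulary and the adapter weights.
    processor.tokenizer.set_target_lang(target_lang)
    model.load_adapter(target_lang)

    inputs = processor(waveform, sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits

    # Greedy CTC decoding: most likely token per frame, collapsed by the tokenizer.
    predicted_ids = torch.argmax(logits, dim=-1)[0]
    return processor.decode(predicted_ids)
```

The helper reloads the model on every call to stay short; for an actual evaluation, load the processor and model once and reuse them.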

Quick start

import torch
from datasets import load_dataset, Audio
from transformers import WhisperForConditionalGeneration, WhisperProcessor

model_id = "Edmon02/whisper-small-hy"
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id).to(device)

# Example: one HyVoxPopuli test clip. Rows with empty reference text are
# dropped first; input_columns keeps the filter from decoding any audio.
ds = load_dataset("Edmon02/hyvoxpopuli", split="test")
ds = ds.filter(
    lambda text: bool((text or "").strip()),
    input_columns=["normalized_text"],
)
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))
sample = ds[0]

# Convert the waveform to log-mel input features.
inputs = processor(
    sample["audio"]["array"],
    sampling_rate=16_000,
    return_tensors="pt",
)
input_features = inputs.input_features.to(device)

# Pass language and task to generate() directly, as the pipeline example
# does, instead of setting forced_decoder_ids on the model config.
with torch.no_grad():
    ids = model.generate(
        input_features,
        language="hy",
        task="transcribe",
        max_new_tokens=256,
    )

hypothesis = processor.batch_decode(ids, skip_special_tokens=True)[0]
print("Hypothesis:", hypothesis)
print("Reference:", sample["normalized_text"])

Pipeline (high level)

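import torch
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="Edmon02/whisper-small-hy",
    device=0 if torch.cuda.is_available() else -1,  # first GPU if present, else CPU
    generate_kwargs={"language": "hy", "task": "transcribe"},
)
# result = asr(audio_path_or_array)
# print(result["text"])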

Intended uses

  • Prototyping Armenian ASR in Whisper-based pipelines
  • Comparing Whisper fine-tuning vs MMS / XLS-R on HyVoxPopuli
  • Teaching / reproducing low-resource Whisper fine-tunes
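
When comparing systems as suggested above, hypotheses and references should pass through the same text normalization before scoring, otherwise punctuation and casing differences inflate WER. The sketch below is one possible normalizer, not the rule set behind the dataset's `normalized_text` column (which is not documented here). It lowercases, removes every Unicode punctuation character (including Armenian marks such as ՞ and ։) and collapses whitespace.

```python
import unicodedata


def normalize_hy(text: str) -> str:
    """Lowercase, strip punctuation and collapse whitespace."""
    text = unicodedata.normalize("NFC", text).lower()
    # Drop characters in any Unicode punctuation category (Po, Pd, Pi, ...).
    # Armenian question and emphasis marks sit inside words, so they are
    # deleted, not replaced by a space. Hyphenated compounds are joined.
    text = "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))
    return " ".join(text.split())


print(normalize_hy("Ի՞նչ է սա։"))
```

Apply the same function to both sides before computing WER, and report which normalization was used alongside the score.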

Out of scope

  • Production transcription without re-benchmarking on your domain
  • Real-time streaming (use faster-whisper + optimized checkpoints after validation)
  • Non-Armenian languages (use base openai/whisper-small or other adapters)
  • Speaker diarization or timestamp-heavy workflows without extra tooling

Training procedure

Hyperparameter    Value
Base model        openai/whisper-small
Learning rate     1e-5
Train batch size  16
Eval batch size   8
Steps             1000
LR schedule       Linear warmup (125 steps)
Optimizer         Adam (β₁=0.9, β₂=0.999, ε=1e-8)
Precision         Native AMP (fp16)
Seed              42
Framework         Transformers 4.36.2, PyTorch 2.1.0+cu121, Datasets 2.16.1
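
To make the hyperparameters above easier to reuse, the sketch below restates them as keyword arguments named after `transformers.Seq2SeqTrainingArguments` fields, and spells out the learning-rate curve. Two assumptions are baked in: the batch sizes are per device on a single device, and the schedule decays linearly to zero after warmup (the usual meaning of a linear schedule in Trainer runs, but not confirmed by this card).

```python
# Values copied from the hyperparameter table; argument names follow
# transformers.Seq2SeqTrainingArguments.
training_kwargs = dict(
    learning_rate=1e-5,
    per_device_train_batch_size=16,  # assumption: single device
    per_device_eval_batch_size=8,    # assumption: single device
    max_steps=1000,
    warmup_steps=125,
    lr_scheduler_type="linear",      # assumption: linear decay after warmup
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    fp16=True,
    seed=42,
)


def linear_schedule_lr(step, base_lr=1e-5, warmup_steps=125, total_steps=1000):
    """Learning rate at `step`: linear ramp during warmup, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Learning rate at the logged evaluation steps.
for step in (0, 125, 250, 500, 750, 1000):
    print(step, f"{linear_schedule_lr(step):.2e}")
```

Under these assumptions the peak rate of 1e-5 is reached at step 125, before the first logged evaluation at step 250. The dictionary could be passed as `Seq2SeqTrainingArguments(output_dir=..., **training_kwargs)` in a re-run.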

Training data is not documented in this repo; it was likely Common Voice hy-AM (and/or project-internal splits) from the same period as speecht5_finetuned_hy. Re-run training with documented splits before trusting the metrics.

Limitations

  • High WER (~75%) on the recorded validation pass — not competitive with modern MMS adapters
  • Small fine-tuning corpus typical of early experiments
  • Literary / read speech in HyVoxPopuli differs from conversational CV audio — domain shift
  • No word-level timestamps in default generate config (return_timestamps: false)
  • Multilingual Whisper vocabulary; Armenian uses language token <|hy|>
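
On the timestamp limitation: the Transformers ASR pipeline can return segment-level timestamps when called with `return_timestamps=True`, in which case the result contains a `chunks` list next to `text`. The sketch below only formats such a result; the helper name `format_segments` and the example values are made up for illustration, and timestamp quality for this checkpoint has not been evaluated.

```python
def format_segments(result: dict) -> list[str]:
    """Turn pipeline output with a 'chunks' list into '[start - end] text' lines."""
    lines = []
    for chunk in result.get("chunks", []):
        start, end = chunk["timestamp"]
        # The final segment's end time can be None when the audio is cut off.
        end_label = f"{end:.2f}" if end is not None else "?"
        lines.append(f"[{start:.2f} - {end_label}] {chunk['text'].strip()}")
    return lines


# Shape of the dict returned by asr(audio, return_timestamps=True).
# The values here are invented, not model output.
example = {
    "text": " բարի լույս",
    "chunks": [{"timestamp": (0.0, 1.2), "text": " բարի լույս"}],
}
print("\n".join(format_segments(example)))
```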

Related assets

Asset                  Link                                         Role
Dataset                Edmon02/hyvoxpopuli                          Armenian speech for eval / fine-tune
TTS (paired stack)     speecht5_finetuned_voxpopuli_hy              Armenian TTS
Stronger ASR baseline  facebook/mms-1b-all + Armenian (hye) adapter Recommended for new work
Base model             openai/whisper-small                         Multilingual Whisper

Bias and ethics

ASR errors disproportionately affect dialectal or code-switched Armenian (e.g. Russian lines in literary sources). Do not use outputs for high-stakes decisions without human review.
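
One way to route risky output to human review is to flag transcripts or references that mix scripts, since code-switched passages are where errors concentrate. The sketch below counts letters by Unicode block; the function names and the 20% threshold are arbitrary illustrations, not values validated on this model.

```python
def script_counts(text: str) -> dict:
    """Count letters per script using Unicode block ranges."""
    counts = {"armenian": 0, "cyrillic": 0, "latin": 0}
    for ch in text:
        if not ch.isalpha():
            continue
        code = ord(ch)
        if 0x0530 <= code <= 0x058F:      # Armenian block
            counts["armenian"] += 1
        elif 0x0400 <= code <= 0x04FF:    # Cyrillic block
            counts["cyrillic"] += 1
        elif ch.isascii():                # basic Latin letters
            counts["latin"] += 1
    return counts


def needs_review(text: str, max_foreign_ratio: float = 0.2) -> bool:
    """Flag text that is empty or has a large share of non-Armenian letters."""
    counts = script_counts(text)
    total = sum(counts.values())
    if total == 0:
        return True  # nothing recognizable: let a human look at it
    foreign = counts["cyrillic"] + counts["latin"]
    return foreign / total > max_foreign_ratio


print(needs_review("Բարի լույս"))       # Armenian only
print(needs_review("Բարի утро всем"))   # mostly Cyrillic
```

A script check catches only visible code-switching; dialectal Armenian written in Armenian script passes it unflagged, so it complements human review and does not replace it.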

Citation

@misc{whisper_small_hy2024,
  author = {Avetisyan, Edmon},
  title = {Armenian Whisper Small (whisper-small-hy)},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Edmon02/whisper-small-hy}}
}

Also cite OpenAI Whisper and your training data (e.g. Common Voice, HyVoxPopuli).

License

Apache-2.0 (inherits from Whisper fine-tuning convention). Base openai/whisper-small license applies to architecture and tokenizer.
