LokaalHub/nemotron-3.5-cy

	---
	license: other
	license_name: openmdw-1.1
	license_link: https://openmdw.ai/license/1-1/
	language:
	- cy
	library_name: nemo
	pipeline_tag: automatic-speech-recognition
	base_model: nvidia/nemotron-3.5-asr-streaming-0.6b
	datasets:
	- LokaalHub/cy-asr-cv
	tags:
	- automatic-speech-recognition
	- speech
	- audio
	- cy
	- nemo
	- fastconformer
	- rnnt
	- cache-aware-streaming
	- nemotron
	metrics:
	- wer
	model-index:
	- name: cy-asr-streaming-0.6b
	  results:
	  - task: {type: automatic-speech-recognition, name: Automatic Speech Recognition}
	    dataset: {name: LokaalHub/cy-asr-cv (test), type: LokaalHub/cy-asr-cv, split: test}
	    metrics:
	    - type: wer
	      value: 22.48
	      name: WER (offline / full-context, normalized)
	---

	# cy-asr-streaming-0.6b

	A streaming Welsh (`cy`) ASR model, fine-tuned from
	[`nvidia/nemotron-3.5-asr-streaming-0.6b`](https://huggingface.co/nvidia/nemotron-3.5-asr-streaming-0.6b) on
	[`LokaalHub/cy-asr-cv`](https://huggingface.co/datasets/LokaalHub/cy-asr-cv).

	> Community fine-tune, not an NVIDIA model. A derivative of NVIDIA's Nemotron 3.5 ASR.
	> NVIDIA did not produce, endorse, or review this model. "Nemotron" is a trademark of NVIDIA,
	> used here only to identify the base model.

	## TL;DR

	Welsh (`cy`) is not one of the base model's supported locales, so this model was fine-tuned with the prompt conditioned on the closest available language slot, `en`. Fine-tuning on ~50.1h of audio reduces offline WER on the test split from 99.2% to 22.48%.

	## Results

	| Condition | Base | Fine-tuned | Rel. improvement |
	|-----------|-----:|-----------:|-----------------:|
	| WER (offline, full-context, normalized) on `LokaalHub/cy-asr-cv` test | 99.2% | 22.48% | 77.3% |

	> Offline (full-context) WER via NeMo `transcribe_speech.py`. Cache-aware streaming WER
	> (the condition NVIDIA headlines) was not measured for this release.
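
The table's numbers can be sanity-checked with a small sketch of corpus-level WER and the relative-improvement arithmetic. The `normalize` function is an assumed stand-in (lowercasing plus punctuation stripping); the exact normalization used for this release is not documented here, and NeMo's own scoring may differ in detail.

```python
import re

def normalize(text):
    """Assumed normalization: lowercase, strip punctuation, split on whitespace."""
    text = re.sub(r"[^\w\s']", " ", text.lower())
    return text.split()

def wer(references, hypotheses):
    """Corpus-level WER: total word edits divided by total reference words."""
    edits = words = 0
    for ref, hyp in zip(references, hypotheses):
        r, h = normalize(ref), normalize(hyp)
        # Word-level Levenshtein distance, computed one row at a time.
        prev = list(range(len(h) + 1))
        for i, rw in enumerate(r, 1):
            cur = [i]
            for j, hw in enumerate(h, 1):
                cur.append(min(prev[j] + 1,                # deletion
                               cur[j - 1] + 1,             # insertion
                               prev[j - 1] + (rw != hw)))  # substitution or match
            prev = cur
        edits += prev[-1]
        words += len(r)
    return edits / words

def relative_improvement(base_wer, tuned_wer):
    """Fraction of the base error removed by fine-tuning."""
    return (base_wer - tuned_wer) / base_wer

# (99.2 - 22.48) / 99.2, the "Rel. improvement" column above
print(f"{relative_improvement(99.2, 22.48):.1%}")
```

Relative improvement is measured against the base WER, so a drop of 76.72 absolute points from 99.2% corresponds to 77.3% of the original error being removed.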

	## Usage

	```python
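	import nemo.collections.asr as nemo_asr

	# Load the checkpoint file from this repo.
	m = nemo_asr.models.ASRModel.restore_from("model.nemo")
	# Transcribe a list of audio paths; the model was fine-tuned with the `en` target_lang prompt.
	print(m.transcribe(["audio.wav"]))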
	```

	## Training

	Single full fine-tune initialized from the base checkpoint (`init_from_nemo_model`), trained in bf16 with the NoamAnnealing learning-rate schedule. Data:
	[`LokaalHub/cy-asr-cv`](https://huggingface.co/datasets/LokaalHub/cy-asr-cv) (~50.1h train).
	Built and trained by the [asr-loop](https://huggingface.co/LokaalHub) pipeline.
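
For readers unfamiliar with Noam-style annealing, the sketch below shows the general shape of the schedule: a linear warmup followed by inverse-square-root decay, scaled by the model dimension. All hyperparameter values (`base_lr`, `d_model`, `warmup_steps`, `min_lr`) are hypothetical placeholders, not the settings used for this model, and NeMo's implementation may differ in detail.

```python
def noam_lr(step, base_lr=2.0, d_model=1024, warmup_steps=1000, min_lr=1e-6):
    """Noam-style learning rate at a given step (all defaults are hypothetical)."""
    step = max(step, 1)  # avoid 0 ** -0.5 at the very first step
    # Linear ramp during warmup, then decay proportional to 1 / sqrt(step).
    scale = d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
    lr = base_lr * scale
    if step > warmup_steps:
        lr = max(lr, min_lr)  # floor applied only after warmup
    return lr

for s in (100, 500, 1000, 4000, 16000):
    print(s, f"{noam_lr(s):.6f}")
```

The rate peaks at `step == warmup_steps` and halves each time the step count quadruples after that.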

	## Limitations

	Low-resource fine-tune on read speech (Common Voice). Evaluated on a 2.0h
	speaker-disjoint test subset, so the WER here is not directly comparable to published full-Common-Voice-test numbers.
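
"Speaker-disjoint" means no speaker appears in both train and test. The following is a minimal sketch of one way to build such a split, not the procedure actually used for `LokaalHub/cy-asr-cv`; the `speaker` and `duration` keys and the `speaker_disjoint_split` helper are hypothetical.

```python
import hashlib

def speaker_disjoint_split(utterances, test_hours=2.0):
    """Assign whole speakers to test until roughly `test_hours` of audio is reached.

    `utterances` is a list of dicts with hypothetical keys
    'speaker' (str) and 'duration' (seconds).
    """
    by_speaker = {}
    for utt in utterances:
        by_speaker.setdefault(utt["speaker"], []).append(utt)

    # Hash the speaker id for a stable, pseudo-random speaker order.
    order = sorted(by_speaker, key=lambda s: hashlib.sha1(s.encode()).hexdigest())

    train, test, test_seconds = [], [], 0.0
    for spk in order:
        if test_seconds < test_hours * 3600:
            test.extend(by_speaker[spk])
            test_seconds += sum(u["duration"] for u in by_speaker[spk])
        else:
            train.extend(by_speaker[spk])
    return train, test

# Ten synthetic speakers with one hour of audio each.
data = [{"speaker": f"spk{i}", "duration": 360.0} for i in range(10) for _ in range(10)]
train, test = speaker_disjoint_split(data)
print(len(train), len(test))
```

Because speakers are moved as whole groups, the test set can overshoot the target duration by up to one speaker's worth of audio.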