---
license: other
license_name: openmdw-1.1
license_link: https://openmdw.ai/license/1-1/
language:
  - cy
library_name: nemo
pipeline_tag: automatic-speech-recognition
base_model: nvidia/nemotron-3.5-asr-streaming-0.6b
datasets:
  - LokaalHub/cy-asr-cv
tags:
  - automatic-speech-recognition
  - speech
  - audio
  - cy
  - nemo
  - fastconformer
  - rnnt
  - cache-aware-streaming
  - nemotron
metrics:
  - wer
model-index:
  - name: cy-asr-streaming-0.6b
    results:
      - task:
          type: automatic-speech-recognition
          name: Automatic Speech Recognition
        dataset:
          name: LokaalHub/cy-asr-cv (test)
          type: LokaalHub/cy-asr-cv
          split: test
        metrics:
          - type: wer
            value: 22.48
            name: WER (offline / full-context, normalized)
---

cy-asr-streaming-0.6b

A streaming Welsh (cy) ASR model, fine-tuned from nvidia/nemotron-3.5-asr-streaming-0.6b on LokaalHub/cy-asr-cv.

This is a community fine-tune, not an NVIDIA model. It is a derivative of NVIDIA's Nemotron 3.5 ASR; NVIDIA did not produce, endorse, or review it. "Nemotron" is a trademark of NVIDIA, used here only to identify the base model.

TL;DR

Welsh (cy) is not one of the base model's supported locales, so the model is fine-tuned with the language prompt set to the closest available slot, en. Fine-tuning on ~50.1h of audio reduces offline WER on the test split from 99.2% to 22.48%.

Results

| Condition | Base | Fine-tuned | Relative improvement |
|---|---|---|---|
| WER (offline, full-context, normalized) on `LokaalHub/cy-asr-cv` test | 99.2% | 22.48% | 77.3% |

WER is measured offline (full-context) with NeMo's transcribe_speech.py. Cache-aware streaming WER (the condition NVIDIA headlines) was not measured for this release, so the numbers above say nothing about streaming accuracy.
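For readers who want to check the arithmetic in the table, here is a minimal sketch of corpus-level WER and the relative-improvement figure. It assumes WER is total word-level edit distance divided by total reference words, and that text has already been normalized; the helper names are hypothetical and this is not the scoring code used for this release.

```python
def word_errors(ref: str, hyp: str) -> tuple[int, int]:
    """Return (word-level edit distance, number of reference words)."""
    r, h = ref.split(), hyp.split()
    # Levenshtein distance over words, keeping one DP row at a time.
    prev = list(range(len(h) + 1))
    for i, rw in enumerate(r, start=1):
        cur = [i]
        for j, hw in enumerate(h, start=1):
            cur.append(min(
                prev[j] + 1,               # deletion (reference word missing)
                cur[j - 1] + 1,            # insertion (extra hypothesis word)
                prev[j - 1] + (rw != hw),  # substitution, or match at cost 0
            ))
        prev = cur
    return prev[-1], len(r)


def corpus_wer(pairs) -> float:
    """WER in percent over (reference, hypothesis) pairs."""
    errors = words = 0
    for ref, hyp in pairs:
        e, n = word_errors(ref, hyp)
        errors += e
        words += n
    return 100.0 * errors / words


def relative_improvement(base_wer: float, tuned_wer: float) -> float:
    """Relative WER reduction in percent."""
    return 100.0 * (base_wer - tuned_wer) / base_wer


# Figures from the table above: (99.2 - 22.48) / 99.2
print(round(relative_improvement(99.2, 22.48), 1))
```

Because the base WER is close to 100%, the relative improvement is numerically close to the absolute drop in WER points.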

Usage

import nemo.collections.asr as nemo_asr

# Load the .nemo checkpoint from this repo
m = nemo_asr.models.ASRModel.restore_from("model.nemo")
# Transcribe a list of audio paths (target_lang prompt: en)
print(m.transcribe(["audio.wav"]))

Training

A single full fine-tune initialized from the base checkpoint (init_from_nemo_model), trained in bf16 with the NoamAnnealing learning-rate scheduler. Data: LokaalHub/cy-asr-cv (~50.1h train). Built and trained by the asr-loop pipeline.
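As background on the scheduler named above, the following is a sketch of the standard Noam schedule (linear warmup followed by inverse-square-root decay). The values of `d_model`, `warmup_steps`, and `base_lr` are hypothetical; this card does not state the hyperparameters used, and NeMo's NoamAnnealing may differ in details such as a minimum learning rate.

```python
def noam_lr(step: int, d_model: int, warmup_steps: int, base_lr: float = 1.0) -> float:
    """Standard Noam schedule: linear warmup, then decay proportional to step**-0.5."""
    step = max(step, 1)  # avoid 0 ** -0.5 at the first step
    scale = d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
    return base_lr * scale


# Hypothetical settings, for illustration only.
for s in (100, 1000, 4000):
    print(s, noam_lr(s, d_model=512, warmup_steps=1000))
```

The learning rate peaks at `step == warmup_steps`, where the two arguments of `min` are equal.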

Limitations

This is a low-resource fine-tune on read speech (Common Voice). It was evaluated on a 2.0h speaker-disjoint test subset, so the results are not directly comparable to published numbers on the full Common Voice test set.
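To illustrate what "speaker-disjoint" means here, the sketch below assigns whole speakers to the test set until a duration budget is reached, so that no speaker appears in both splits. The function name, the `(utt_id, speaker_id, duration_seconds)` tuple layout, and the greedy strategy are assumptions for illustration, not the procedure used to build this model's test subset.

```python
import random
from collections import defaultdict


def speaker_disjoint_split(utts, test_hours: float, seed: int = 0):
    """Split (utt_id, speaker_id, duration_seconds) tuples into (train_ids, test_ids)."""
    by_speaker = defaultdict(list)
    for utt_id, speaker, seconds in utts:
        by_speaker[speaker].append((utt_id, seconds))

    speakers = sorted(by_speaker)           # deterministic order before shuffling
    random.Random(seed).shuffle(speakers)

    budget = test_hours * 3600.0
    train, test, test_seconds = [], [], 0.0
    for speaker in speakers:
        speaker_seconds = sum(sec for _, sec in by_speaker[speaker])
        ids = [utt_id for utt_id, _ in by_speaker[speaker]]
        # A speaker goes to test only if all of their audio fits in the budget.
        if test_seconds + speaker_seconds <= budget:
            test.extend(ids)
            test_seconds += speaker_seconds
        else:
            train.extend(ids)
    return train, test
```

A split built this way measures generalization to unseen voices, which is one reason the resulting WER cannot be compared directly with results on a differently constructed test set.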