Ultravox-MamayLM-12B-UK v3 (extended training)

Single-pass speech-language model for Ukrainian, built on top of INSAIT-Institute/MamayLM-Gemma-3-12B-IT-v1.0 with the Ultravox v0.6 architecture (Whisper-large-v3-turbo audio encoder + projector → frozen Gemma-3-12B LLM).

This is the v3 checkpoint (HF tag v3.0): warm-started from the v2 checkpoint and trained for an additional 24 000 steps with a fresh learning-rate schedule on the same multi-dataset UK + EN mix as v2.

Headline result

Same 51-fixture Ukrainian benchmark, same MamayLM-12B backbone, same prompts, same in-cluster bench client as v1/v2 (612 records, 0 errors, 22 min wall-clock).

| Pipeline | Verbatim WER | TTFT p50 | TTFT mean |
| --- | --- | --- | --- |
| Cascade (Whisper-large-v3-turbo + MamayLM-12B) | 0.219 | 0.284 s | 0.954 s |
| Ultravox v1 (single-dataset, 8 k steps) | 0.339 | 0.092 s | 0.190 s |
| Ultravox v2 (multi-dataset, 14.4 k steps) | 0.222 | 0.091 s | 0.227 s |
| Ultravox v3 (24 k steps, warm-started from v2) | 0.217 | 0.091 s | 0.385 s |

v3 is the first Ultravox-MamayLM checkpoint whose verbatim WER comes in below the Whisper-large-v3-turbo cascade on the same audio (0.217 vs 0.219), a gap that the paired analysis shows is within noise, at roughly 3.1× faster median TTFT (0.091 s vs 0.284 s).
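For readers unfamiliar with the metric, verbatim WER can be sketched as the word-level edit distance between the reference transcript and the model output, divided by the reference length. The snippet below is a minimal illustration, not the bench client's implementation: the function name `word_error_rate` and the lowercase-and-split normalization are assumptions, and the real benchmark may normalize punctuation or casing differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    # Assumed normalization: lowercase and split on whitespace only.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)
```

Because the denominator is the reference length, WER can exceed 1.0 when the hypothesis contains many inserted words.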

Statistical detail (paired by fixture, 51 fixtures, 2 verbatim rounds per version):

| Comparison | Paired Δ mean | 95 % CI | Cohen's d | Note |
| --- | --- | --- | --- | --- |
| v3 verbatim − v2 verbatim | −0.0046 | [−0.0264, +0.0172] | −0.06 | Within measurement noise: v3 ≈ v2 |
| v3 verbatim − cascade (this run) | −0.0012 | [−0.0328, +0.0304] | −0.01 | Statistically tied with the cascade ceiling |
| cascade (v3 run) − cascade (v2 run) | 0.0000 | [0.0000, 0.0000] | n/a | 51 / 51 fixture means identical (cascade is deterministic) |
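One way to produce this kind of paired summary from per-fixture WER means is sketched below. It is an illustration under stated assumptions rather than the analysis script behind the table: the helper name `paired_summary` is hypothetical, the interval uses a Student-t critical value (2.009 for 50 degrees of freedom, i.e. 51 pairs), and Cohen's d is taken as the mean difference divided by the standard deviation of the differences. The model card does not say which variants were actually used.

```python
import math
import statistics


def paired_summary(a, b, t_crit=2.009):
    """Return (mean difference, (ci_low, ci_high), Cohen's d) for paired samples a - b.

    t_crit is the two-sided 95 % Student-t critical value; the default matches
    50 degrees of freedom (51 paired fixtures).
    """
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long samples with at least 2 pairs")

    diffs = [x - y for x, y in zip(a, b)]
    mean = statistics.fmean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation of the differences
    half_width = t_crit * sd / math.sqrt(len(diffs))
    # d is undefined when every difference is identical (sd == 0).
    d = mean / sd if sd > 0 else float("nan")
    return mean, (mean - half_width, mean + half_width), d
```

With two identical inputs, as in the deterministic cascade row, the mean difference and both interval bounds are 0 and d is undefined, which is consistent with the "n/a" entry.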

The v2 → v3 gain is small in absolute terms, but it indicates that with this data mix and a stack_factor = 8 projector, training has converged to the cascade ceiling: more training on the same data does not help further. Pushing below 0.217 will require an architectural change (e.g. stack_factor = 4) or a new training signal (noise augmentation, Whisper pseudolabels); see the v3.1 roadmap notes in the paper write-up.

Full bench artifacts: roman4work/voice-bench-results (bench-20260501T141534Z for v1, bench-20260502T081341Z for v2, bench-20260503T082435Z for v3).

Architecture

  • Audio encoder: openai/whisper-large-v3-turbo (LoRA-adapted, r = 8, target k/v/q/o_proj)
  • Projector: SwiGLU, stack_factor = 8, mid-LayerNorm
  • Text backbone: INSAIT-Institute/MamayLM-Gemma-3-12B-IT-v1.0 (frozen during training, loaded automatically by the Ultravox model class — you do not need to download it separately, but you must have access)
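To make the role of stack_factor concrete, the sketch below shows frame stacking in plain Python: every 8 consecutive encoder frames are concatenated into one wider vector before projection, which shortens the audio sequence the LLM has to attend to. This is an illustration only. The helper names are hypothetical, zero-padding of the tail is an assumption, and the SwiGLU and LayerNorm stages of the real projector are omitted.

```python
import math


def stack_frames(frames, stack_factor=8):
    """Concatenate each run of `stack_factor` consecutive frames into one wider frame.

    frames: list of equal-length feature vectors (lists of floats).
    The tail is zero-padded so the length is a multiple of stack_factor (assumed).
    """
    if not frames:
        return []
    dim = len(frames[0])
    pad = (-len(frames)) % stack_factor
    padded = frames + [[0.0] * dim for _ in range(pad)]
    return [
        [value for frame in padded[i:i + stack_factor] for value in frame]
        for i in range(0, len(padded), stack_factor)
    ]


def audio_tokens(num_encoder_frames, stack_factor=8):
    """Number of stacked frames (audio tokens) handed to the projector."""
    return math.ceil(num_encoder_frames / stack_factor)
```

Assuming the usual 1 500 encoder frames for a 30 s Whisper window, `audio_tokens(1500, 8)` gives 188 audio tokens, while `audio_tokens(1500, 4)` gives 375: a smaller stack factor keeps finer temporal resolution at the cost of a longer sequence.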

This repository contains only the projector + Whisper-LoRA + tokenizer / processor files (~140 MB). The base text model is referenced by config.json (text_model_id) and fetched from HF Hub at load time.

Training data (mix, unchanged from v2)

| Dataset | Weight | Objective |
| --- | --- | --- |
| commonvoice-uk-transcription | 4 | UK ASR (verbatim) |
| commonvoice-uk-continuation | 4 | UK reply / instruction-following |
| fleurs-uk_ua-transcription | 8 | broader UK domain coverage |
| librispeech-clean-transcription | 1 | EN audio anchor |
| librispeech-clean-continuation | 1 | EN audio anchor |
| commonvoice-en-transcription | 0.5 | EN audio anchor |
| commonvoice-en-continuation | 0.5 | EN audio anchor |

EN data is anchored at low weight to prevent the projector from collapsing onto UK-specific audio statistics. FLEURS contributes only its -transcription form because the dataset registry does not provide a -continuation version for FLEURS.
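As a rough guide to what "low weight" means here, the weights can be normalized into shares. The sketch assumes the weights act as relative sampling proportions, which this card does not confirm; a training framework may instead scale each weight by dataset size, in which case the effective shares would differ.

```python
# Weights copied from the table above.
MIX_WEIGHTS = {
    "commonvoice-uk-transcription": 4,
    "commonvoice-uk-continuation": 4,
    "fleurs-uk_ua-transcription": 8,
    "librispeech-clean-transcription": 1,
    "librispeech-clean-continuation": 1,
    "commonvoice-en-transcription": 0.5,
    "commonvoice-en-continuation": 0.5,
}


def normalized_shares(weights):
    """Divide each weight by the total so the shares sum to 1."""
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


shares = normalized_shares(MIX_WEIGHTS)
# UK datasets are the ones whose name contains "-uk".
uk_share = sum(share for name, share in shares.items() if "-uk" in name)
en_share = 1.0 - uk_share
print(f"UK share: {uk_share:.1%}, EN share: {en_share:.1%}")
```

Under that assumption the weights total 19, so the UK datasets account for 16/19 of the mix and the EN anchors for the remaining 3/19.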

Training setup

| Setting | Value |
| --- | --- |
| Hardware | 1 × NVIDIA B200 |
| Steps | 24 000 (10 checkpoints saved every 2 400 steps) |
| Initialization | warm-started from v2 checkpoint-14400 (weights only; fresh optimizer and LR schedule) |
| Batch size | 4 per GPU × grad_accum 8 → effective 32 |
| Learning rate | 5e-4, 1 000-step warmup |
| Wall clock | 12 h 25 m |
| Final loss (rolling-100 avg) | 0.1155 (vs v2's 0.119) |
| Aggregate train_loss | 0.12266 |
| Seed | 43 |
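The figures above can be cross-checked with a little arithmetic. The snippet only restates numbers from this section; the derived quantities (examples drawn, seconds per step) are back-of-the-envelope estimates, and the variable names are illustrative.

```python
STEPS = 24_000
PER_GPU_BATCH = 4
GRAD_ACCUM = 8
NUM_GPUS = 1
SAVE_EVERY = 2_400
WALL_CLOCK_S = 12 * 3600 + 25 * 60  # 12 h 25 m in seconds

# Effective batch: per-GPU batch x gradient accumulation x number of GPUs.
effective_batch = PER_GPU_BATCH * GRAD_ACCUM * NUM_GPUS

# Training examples drawn over the run (with repetition across the mix).
examples_drawn = STEPS * effective_batch

# Optimizer steps at which a checkpoint is written.
checkpoint_steps = list(range(SAVE_EVERY, STEPS + 1, SAVE_EVERY))

# Average wall-clock time per optimizer step.
seconds_per_step = WALL_CLOCK_S / STEPS

print(effective_batch, examples_drawn, len(checkpoint_steps), seconds_per_step)
```

This gives an effective batch of 32, 768 000 examples drawn, 10 checkpoints, and roughly 1.86 s per optimizer step.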

Inference

Recommended runtime: vLLM with the vllm[audio] extras. Example deployment:

```bash
vllm serve roman4work/ultravox-mamaylm-12b-uk-v3 \
    --served-model-name ultravox-mamaylm-uk \
    --max-model-len 4096 \
    --dtype bfloat16 \
    --trust-remote-code \
    --enforce-eager \
    --block-size 64
```

The OpenAI-compatible Chat Completions endpoint accepts audio via the input_audio content type:

```json
{
  "model": "ultravox-mamaylm-uk",
  "messages": [
    {"role": "system",
     "content": "Ти український голосовий помічник. Відповідай коротко, природно і виключно українською мовою."},
    {"role": "user", "content": [
      {"type": "input_audio",
       "input_audio": {"data": "<base64-encoded-wav>", "format": "wav"}}
    ]}
  ]
}
```
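A minimal Python client for this request could look like the sketch below. It uses only the standard library and assumes the server is reachable at `http://localhost:8000/v1`, vLLM's default port; adjust `base_url` for your deployment. The helper names `build_payload` and `ask` are illustrative, not part of any published client.

```python
import base64
import json
import urllib.request

SYSTEM_PROMPT = (
    "Ти український голосовий помічник. "
    "Відповідай коротко, природно і виключно українською мовою."
)


def build_payload(wav_bytes: bytes, model: str = "ultravox-mamaylm-uk") -> dict:
    """Build a Chat Completions request body with the audio inlined as base64."""
    audio_b64 = base64.b64encode(wav_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": [
                {"type": "input_audio",
                 "input_audio": {"data": audio_b64, "format": "wav"}},
            ]},
        ],
    }


def ask(wav_path: str, base_url: str = "http://localhost:8000/v1") -> str:
    """POST a WAV file to the server and return the assistant's reply text."""
    with open(wav_path, "rb") as f:
        payload = build_payload(f.read())
    request = urllib.request.Request(
        f"{base_url}/chat/completions",  # assumed local endpoint
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

Calling `ask("question.wav")` against a running server returns the model's reply as a string; the file name is a placeholder.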

License

This model inherits from its components and you must comply with all of:

Citation

If you use this checkpoint, please credit:
