Ultravox-MamayLM-12B-UK v3 (extended training)
Single-pass speech-language model for Ukrainian, built on top of
INSAIT-Institute/MamayLM-Gemma-3-12B-IT-v1.0
with the Ultravox v0.6 architecture
(Whisper-large-v3-turbo audio encoder + projector → frozen Gemma-3-12B LLM).
This is the v3 checkpoint (HF tag v3.0): warm-started from the
v2 checkpoint
and trained for an additional 24 000 steps with a fresh learning-rate
schedule on the same multi-dataset UK + EN mix as v2.
Headline result
Same 50-fixture Ukrainian benchmark, same MamayLM-12B backbone, same prompts, same in-cluster bench client as v1/v2 (612 records, 0 errors, 22 min wall-clock).
| Pipeline | Verbatim WER | TTFT p50 | TTFT mean |
|---|---|---|---|
| Cascade (Whisper-large-v3-turbo + MamayLM-12B) | 0.219 | 0.284 s | 0.954 s |
| Ultravox v1 (single-dataset, 8 k steps) | 0.339 | 0.092 s | 0.190 s |
| Ultravox v2 (multi-dataset, 14.4 k steps) | 0.222 | 0.091 s | 0.227 s |
| Ultravox v3 (24 k steps, warm-started from v2) | 0.217 | 0.091 s | 0.385 s |
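v3 is the first Ultravox-MamayLM checkpoint whose verbatim WER point estimate is below that of the Whisper-large-v3-turbo cascade on the same audio (0.217 vs 0.219), at 3.09× faster median TTFT. The WER gap is within measurement noise, so the two pipelines are best read as tied on accuracy.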
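For readers unfamiliar with the metric, word error rate is conventionally the word-level edit distance between hypothesis and reference, divided by the number of reference words. The sketch below shows that definition only. The text normalization used by the bench client is not described in this card, so the lowercasing and whitespace tokenization here are assumptions, and `word_error_rate` is a hypothetical helper name.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    # Assumed normalization: lowercase and split on whitespace.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of three reference words.
example = word_error_rate("добрий день усім", "добрий вечір усім")
```

Because the denominator is the reference length, a hypothesis with many insertions can score above 1.0.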
Statistical detail (paired by fixture, 51 fixtures, 2 verbatim rounds per version):
| Comparison | Paired Δ mean | 95 % CI | Cohen's d | Note |
|---|---|---|---|---|
| v3 verbatim − v2 verbatim | −0.0046 | [−0.0264, +0.0172] | −0.06 | Within measurement noise — v3 ≈ v2 |
| v3 verbatim − cascade (this run) | −0.0012 | [−0.0328, +0.0304] | −0.01 | Statistically tied with the cascade ceiling |
| cascade (v3 run) − cascade (v2 run) | 0.0000 | [0.0000, 0.0000] | n/a | 51 / 51 fixture means identical (cascade is deterministic) |
The v2 → v3 difference is small in absolute terms, but it indicates that with this
data mix and a stack_factor = 8 projector the model has converged to the cascade
ceiling: more training on the same data does not help further. Pushing below
0.217 will require an architectural change (e.g. stack_factor = 4) or a new
training signal (noise augmentation, Whisper pseudolabels); see the v3.1
roadmap notes in the paper write-up.
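The paired statistics in the table above can be approximated along the following lines. This is a minimal sketch, not the bench's actual analysis code: the card does not state how the interval or effect size were computed, so the sketch assumes one mean WER per fixture and version, a normal-approximation interval, and Cohen's d taken as the mean difference over the standard deviation of the differences. The function name and the toy numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def paired_summary(a, b, critical_value=None):
    """Return (mean difference, (ci_low, ci_high), Cohen's d) for paired samples."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    delta = mean(diffs)
    sd = stdev(diffs)  # sample standard deviation of the per-fixture differences
    if critical_value is None:
        # Normal approximation (about 1.96). A Student-t quantile with
        # n - 1 degrees of freedom is more exact for small samples.
        critical_value = NormalDist().inv_cdf(0.975)
    half_width = critical_value * sd / sqrt(len(diffs))
    # With identical samples sd is 0 and the effect size is undefined.
    cohens_d = delta / sd if sd > 0 else float("nan")
    return delta, (delta - half_width, delta + half_width), cohens_d


# Toy per-fixture WER values, purely illustrative (not bench data).
wer_new = [0.20, 0.25, 0.10, 0.30]
wer_old = [0.22, 0.24, 0.15, 0.28]
delta, (ci_low, ci_high), d = paired_summary(wer_new, wer_old)
```

An interval that straddles zero, as in the v3 vs v2 and v3 vs cascade rows, means the sign of the difference is not established by the data.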
Full bench artifacts:
roman4work/voice-bench-results
(bench-20260501T141534Z for v1, bench-20260502T081341Z for v2,
bench-20260503T082435Z for v3).
Architecture
- Audio encoder:
openai/whisper-large-v3-turbo (LoRA-adapted, r = 8, targeting k/v/q/o_proj)
- Projector: SwiGLU, stack_factor = 8, mid-LayerNorm
- Text backbone: INSAIT-Institute/MamayLM-Gemma-3-12B-IT-v1.0 (frozen during training, loaded automatically by the Ultravox model class; you do not need to download it separately, but you must have access)
This repository contains only the projector + Whisper-LoRA + tokenizer /
processor files (~140 MB). The base text model is referenced by config.json
(text_model_id) and fetched from HF Hub at load time.
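To make the projector's stack_factor concrete, the sketch below shows the frame-stacking idea: consecutive encoder frames are concatenated along the feature axis, so the LLM sees fewer but wider audio embeddings. It is an illustration under assumptions rather than the Ultravox implementation. The zero-padding rule and both helper names are hypothetical, and the SwiGLU and LayerNorm stages are omitted. The token counts use the fact that a full 30-second Whisper window yields 1500 encoder frames.

```python
from math import ceil


def stack_frames(frames, stack_factor=8):
    """Concatenate every `stack_factor` consecutive frames into one wider frame."""
    if not frames:
        return []
    dim = len(frames[0])
    # Assumed padding rule: append zero frames until the count is a
    # multiple of stack_factor.
    padded = list(frames) + [[0.0] * dim] * (-len(frames) % stack_factor)
    return [
        [value for frame in padded[i:i + stack_factor] for value in frame]
        for i in range(0, len(padded), stack_factor)
    ]


def audio_tokens(encoder_frames, stack_factor=8):
    """Number of stacked frames handed to the LLM for a given encoder length."""
    return ceil(encoder_frames / stack_factor)


# Three 2-dimensional frames stacked in pairs: the last group is zero-padded.
stacked = stack_frames([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], stack_factor=2)

tokens_sf8 = audio_tokens(1500, stack_factor=8)  # 188
tokens_sf4 = audio_tokens(1500, stack_factor=4)  # 375
```

Halving the stack factor roughly doubles the number of audio positions in the LLM context, which is the trade-off behind the stack_factor = 4 idea mentioned earlier.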
Training data (mix, unchanged from v2)
| Dataset | Weight | Objective |
|---|---|---|
| commonvoice-uk-transcription | 4 | UK ASR (verbatim) |
| commonvoice-uk-continuation | 4 | UK reply / instruction-following |
| fleurs-uk_ua-transcription | 8 | broader UK domain coverage |
| librispeech-clean-transcription | 1 | EN audio anchor |
| librispeech-clean-continuation | 1 | EN audio anchor |
| commonvoice-en-transcription | 0.5 | EN audio anchor |
| commonvoice-en-continuation | 0.5 | EN audio anchor |
EN data is anchored at low weight to prevent the projector from collapsing onto
UK-specific audio statistics. FLEURS contributes only its -transcription
form because the dataset registry does not provide a -continuation version
for FLEURS.
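As a rough guide to what the weights imply, the sketch below normalizes them into shares of the mix. It assumes the weights are relative sampling weights, which this card does not state explicitly, and the helper name is hypothetical.

```python
MIX_WEIGHTS = {
    "commonvoice-uk-transcription": 4,
    "commonvoice-uk-continuation": 4,
    "fleurs-uk_ua-transcription": 8,
    "librispeech-clean-transcription": 1,
    "librispeech-clean-continuation": 1,
    "commonvoice-en-transcription": 0.5,
    "commonvoice-en-continuation": 0.5,
}


def sampling_shares(weights):
    """Normalize relative weights so that they sum to 1."""
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


shares = sampling_shares(MIX_WEIGHTS)
# Ukrainian datasets are the ones with "-uk" in their name.
uk_share = sum(share for name, share in shares.items() if "-uk" in name)
en_share = 1.0 - uk_share
```

Under that assumption the weights sum to 19, so Ukrainian data makes up 16/19 (about 84 %) of the mix and English 3/19 (about 16 %).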
Training setup
| Setting | Value |
|---|---|
| Hardware | 1 × NVIDIA B200 |
| Steps | 24 000 (10 checkpoints saved every 2 400 steps) |
| Initialization | warm-started from v2 checkpoint-14400 (weights only — fresh optimizer & LR schedule) |
| Batch size | 4 per GPU × grad_accum 8 → effective 32 |
| Learning rate | 5e-4, 1 000-step warmup |
| Wall clock | 12 h 25 m |
| Final loss (rolling-100 avg) | 0.1155 (vs v2's 0.119) |
| Aggregate train_loss | 0.12266 |
| Seed | 43 |
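The figures in this table fit together as in the short calculation below. It assumes that "steps" are optimizer steps, each consuming one effective batch, which the card does not spell out. The variable names are illustrative.

```python
PER_GPU_BATCH = 4
GRAD_ACCUM = 8
NUM_GPUS = 1
TOTAL_STEPS = 24_000
SAVE_EVERY = 2_400
WALL_CLOCK_S = 12 * 3600 + 25 * 60  # 12 h 25 m

# Effective batch size per optimizer step.
effective_batch = PER_GPU_BATCH * GRAD_ACCUM * NUM_GPUS  # 32

# Examples consumed over the run, under the one-batch-per-step assumption.
examples_seen = effective_batch * TOTAL_STEPS  # 768000

# Steps at which checkpoints are written: 2400, 4800, ..., 24000.
checkpoint_steps = list(range(SAVE_EVERY, TOTAL_STEPS + 1, SAVE_EVERY))

# Average wall-clock time per step.
seconds_per_step = WALL_CLOCK_S / TOTAL_STEPS  # 1.8625
```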
Inference
Recommended runtime: vLLM with the
vllm[audio] extras. Example deployment:
vllm serve roman4work/ultravox-mamaylm-12b-uk-v3 \
--served-model-name ultravox-mamaylm-uk \
--max-model-len 4096 \
--dtype bfloat16 \
--trust-remote-code \
--enforce-eager \
--block-size 64
The OpenAI-compatible Chat Completions endpoint accepts audio via the
input_audio content type:
{
"model": "ultravox-mamaylm-uk",
"messages": [
{"role": "system",
"content": "Ти український голосовий помічник. Відповідай коротко, природно і виключно українською мовою."},
{"role": "user", "content": [
{"type": "input_audio",
"input_audio": {"data": "<base64-encoded-wav>", "format": "wav"}}
]}
]
}
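A minimal Python client for the request above might look like the following sketch. It uses only the standard library. The base URL `http://localhost:8000` is an assumption (vLLM's default port), and `build_payload` and `chat` are hypothetical helper names, not part of any library.

```python
import base64
import json
import urllib.request

SYSTEM_PROMPT = (
    "Ти український голосовий помічник. "
    "Відповідай коротко, природно і виключно українською мовою."
)


def build_payload(wav_bytes, model="ultravox-mamaylm-uk"):
    """Build a Chat Completions request body carrying one WAV clip."""
    audio_b64 = base64.b64encode(wav_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": [
                {"type": "input_audio",
                 "input_audio": {"data": audio_b64, "format": "wav"}},
            ]},
        ],
    }


def chat(wav_bytes, base_url="http://localhost:8000"):
    """POST the payload to the server and return the assistant's reply text."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",  # assumed local vLLM endpoint
        data=json.dumps(build_payload(wav_bytes)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Usage, with a running server and a local file (path is illustrative):
# with open("question.wav", "rb") as f:
#     print(chat(f.read()))
```

Keeping payload construction separate from the network call makes the request shape easy to inspect or log before anything is sent.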
License
This model inherits the license terms of its components; you must comply with all of the following:
- Gemma Terms of Use (the MamayLM backbone is a Gemma-3 derivative)
- INSAIT MamayLM license
- Whisper LoRA weights are released under MIT (audio encoder)
Citation
If you use this checkpoint, please credit:
- Ultravox: Fixie AI Ultravox
- MamayLM: INSAIT-Institute/MamayLM-Gemma-3-12B-IT-v1.0
- Gemma: Gemma 3 Technical Report