---
language:
- ar
- be
- bg
- bn
- cs
- cy
- da
- de
- el
- en
- es
- et
- fa
- fi
- fr
- gl
- hi
- hu
- it
- ja
- ka
- lt
- lv
- mk
- mr
- nl
- pl
- pt
- ro
- ru
- sk
- sl
- sr
- sv
- sw
- ta
- th
- tr
- uk
- ur
- vi
- zh
license: mit
library_name: transformers
metrics:
- bleu
pipeline_tag: audio-text-to-text
---

# Model Card for Ultravox

Ultravox is a multimodal Speech LLM built around a pretrained [GLM-4.5](https://huggingface.co/zai-org/GLM-4.5) and [whisper-large-v3-turbo](https://huggingface.co/openai/whisper-large-v3-turbo) backbone.

See https://ultravox.ai for the GitHub repo and more information.

## Model Details

### Model Description

Ultravox is a multimodal model that can consume both speech and text as input (e.g., a text system prompt and voice user message).
The input to the model is given as a text prompt with a special `<|audio|>` pseudo-token, and the model processor will replace this magic token with embeddings derived from the input audio.
Using the merged embeddings as input, the model will then generate output text as usual.

In a future revision of Ultravox, we plan to expand the token vocabulary to support generation of semantic and acoustic audio tokens, which can then be fed to a vocoder to produce voice output.
No preference tuning has been applied to this revision of the model.

- **Developed by:** Fixie.ai
- **License:** MIT
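
The placeholder substitution described above can be pictured as a splice over the embedding sequence. The following is a conceptual sketch only, not the Ultravox implementation: the function name `merge_audio_embeddings`, the token id `AUDIO_TOKEN_ID`, and the toy embedding sizes are all assumptions made for illustration, and embeddings are plain Python lists so the example runs without any dependencies.

```python
from typing import List, Sequence

Vector = List[float]

AUDIO_TOKEN_ID = 99  # hypothetical id standing in for the <|audio|> pseudo-token


def merge_audio_embeddings(
    token_ids: Sequence[int],
    text_embeddings: Sequence[Vector],
    audio_embeddings: Sequence[Vector],
    audio_token_id: int = AUDIO_TOKEN_ID,
) -> List[Vector]:
    """Replace the single placeholder position with the audio embedding frames."""
    if len(token_ids) != len(text_embeddings):
        raise ValueError("need exactly one text embedding per token id")
    positions = [i for i, tok in enumerate(token_ids) if tok == audio_token_id]
    if len(positions) != 1:
        raise ValueError("expected exactly one audio placeholder in the prompt")
    pos = positions[0]
    # Text before the placeholder, then the audio frames, then the remaining text.
    return (
        list(text_embeddings[:pos])
        + list(audio_embeddings)
        + list(text_embeddings[pos + 1:])
    )


# Toy prompt: three text tokens and one placeholder, with 2-dimensional embeddings.
token_ids = [11, 12, AUDIO_TOKEN_ID, 13]
text_embeddings = [[0.1, 0.1], [0.2, 0.2], [0.0, 0.0], [0.3, 0.3]]
audio_embeddings = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]  # three audio frames

merged = merge_audio_embeddings(token_ids, text_embeddings, audio_embeddings)
print(len(merged))  # 6: four positions, minus the placeholder, plus three audio frames
```

The merged sequence is what a language model would consume in place of the original token embeddings; the number of audio frames depends on the clip length, so the final sequence length varies per input.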

### Model Sources

- **Repository:** https://ultravox.ai
- **Demo:** See repo

## Usage

Think of the model as an LLM that can also hear and understand speech. As such, it can be used as a voice agent, and also for speech translation, analysis of spoken audio, and similar tasks.

To use the model, try the following:
```python
# pip install transformers peft librosa

import librosa
import transformers

# trust_remote_code=True allows the custom model code in the Hub repo to be loaded.
pipe = transformers.pipeline(model='fixie-ai/ultravox-v0_5-glm-4_5-355b', trust_remote_code=True)

path = "<path-to-input-audio>"  # TODO: pass the audio here
# Load the clip as mono audio resampled to 16 kHz.
audio, sr = librosa.load(path, sr=16000)

turns = [
    {
        "role": "system",
        "content": "You are a friendly and helpful character. You love to answer questions for people."
    },
]
# Run the model on the audio plus the conversation so far and print the result.
print(pipe({'audio': audio, 'turns': turns, 'sampling_rate': sr}, max_new_tokens=30))
```

## Training Details

The model uses a pre-trained [GLM-4.5](https://huggingface.co/zai-org/GLM-4.5) backbone as well as the encoder part of [whisper-large-v3-turbo](https://huggingface.co/openai/whisper-large-v3-turbo).

The multi-modal adapter is trained, the Whisper encoder is fine-tuned, and the GLM model is kept frozen.

We use a knowledge-distillation loss in which Ultravox tries to match the logits of the text-based GLM backbone.
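
The model card does not give the exact form of the distillation loss. A common choice for logit matching is the KL divergence between the teacher's and the student's next-token distributions, and the sketch below assumes that form, with no temperature scaling. The function names `log_softmax` and `kd_loss` and the toy logits are hypothetical; the actual objective lives in the Ultravox training code.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kd_loss(
    student_logits: Sequence[Sequence[float]],
    teacher_logits: Sequence[Sequence[float]],
) -> float:
    """Mean KL(teacher || student) over token positions (assumed loss form)."""
    if len(student_logits) != len(teacher_logits):
        raise ValueError("student and teacher must cover the same positions")
    total = 0.0
    for s_row, t_row in zip(student_logits, teacher_logits):
        s_logp = log_softmax(s_row)
        t_logp = log_softmax(t_row)
        # KL(p_teacher || p_student) = sum_i p_t[i] * (log p_t[i] - log p_s[i])
        total += sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t_logp, s_logp))
    return total / len(student_logits)


# Teacher: text-only backbone reading the transcript.
# Student: speech model reading the audio for the same utterance.
teacher = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
student = [[1.5, 0.7, -0.5], [0.0, 0.5, 2.0]]

print(kd_loss(student, teacher))  # positive: the two distributions differ
print(kd_loss(teacher, teacher))  # 0.0: identical logits give zero loss
```

Under this formulation the loss is zero only when the speech model assigns the same next-token probabilities as the text backbone, which is why no separate ground-truth labels are needed.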

### Training Data

The training dataset is a mix of ASR datasets, extended with continuations generated by Llama 3.1 8B, and speech translation datasets, which yield a modest improvement in translation evaluations.

### Training Procedure

Supervised speech instruction fine-tuning via knowledge distillation. For more info, see [training code in Ultravox repo](https://github.com/fixie-ai/ultravox/blob/main/ultravox/training/train.py).

#### Training Hyperparameters

- **Training regime:** BF16 mixed precision training
- **Hardware used:** 8x B200 GPUs
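
For readers unfamiliar with the BF16 format named above, the sketch below illustrates its limited precision: bfloat16 keeps the sign and 8 exponent bits of a float32 but only 7 explicit mantissa bits. The helper `to_bfloat16` is a hypothetical, dependency-free emulation for ordinary finite values (NaN and infinity are not handled); it is not taken from the Ultravox training code and says nothing about which tensors that code keeps in which precision.

```python
import struct


def to_bfloat16(x: float) -> float:
    """Round a finite value to the nearest bfloat16 (ties to even), returned as a float."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Adding this bias before dropping the low 16 bits gives round-to-nearest-even.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + rounding_bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


weight = 1.0
update = 1e-3

print(to_bfloat16(weight + 2 ** -7))  # 1.0078125: the next bfloat16 value after 1.0
print(to_bfloat16(weight + update))   # 1.0: the small update is rounded away
```

Small increments vanishing in this way is the usual motivation for mixed precision schemes, which pair low-precision compute with higher-precision accumulation.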

## Evaluation

Coming soon.