Update README.md
--- a/README.md
+++ b/README.md
@@ -10,11 +10,12 @@ pipeline_tag: text-to-speech
 
 # Higgs Audio V2: Redefining Expressiveness in Audio Generation
 
+Check our open-source repository https://github.com/boson-ai/higgs-audio for more details!
+
 We are open-sourcing Higgs Audio v2, a powerful audio foundation model pretrained on over 10 million hours of audio data and a diverse set of text data.
 Despite having no post-training or fine-tuning, Higgs Audio v2 excels in expressive audio generation, thanks to its deep language and acoustic understanding.
 
 On [EmergentTTS-Eval](https://github.com/boson-ai/emergenttts-eval-public), the model achieves win rates of **75.7%** and **55.7%** over "gpt-4o-mini-tts" on the "Emotions" and "Questions" categories, respectively. It also obtains state-of-the-art performance on traditional TTS benchmarks like Seed-TTS Eval and Emotional Speech Dataset (ESD). Moreover, the model demonstrates capabilities rarely seen in previous systems, including automatic prosody adaptation during narration, zero-shot generation of natural multi-speaker dialogues in multiple languages, melodic humming with the cloned voice, and simultaneous generation of speech and background music.
-Check our open-source repository https://github.com/boson-ai/higgs-audio for more details.
 
 
 <p>
@@ -44,7 +45,7 @@ Higgs Audio v2 adopts the "generation variant" depicted in the architecture figu
 We introduce a new discretized audio tokenizer that runs at just 25 frames per second while keeping—or even improving—audio quality compared to tokenizers with twice the bitrate.
 Our model is the first to train on 24 kHz data covering speech, music, and sound events in one unified system.
 It also uses a simple non-diffusion encoder/decoder for fast, batch inference. It achieves state-of-the-art performance in semantic and acoustic evaluations.
-Check https://huggingface.co/bosonai/higgs-audio-v2-tokenizer
+Check https://huggingface.co/bosonai/higgs-audio-v2-tokenizer for more information about the tokenizer.
 
 ### Model Architecture -- Dual FFN
 
@@ -54,7 +55,7 @@ DualFFN acts as an audio-specific expert, boosting the LLM's performance with mi
 Our implementation preserves 91% of the original LLM’s training speed with the inclusion of DualFFN, which has 2.2B parameters.
 Thus, the total number of parameter for Higgs Audio v2 is 3.6B (LLM) + 2.2B (Audio Dual FFN), and it has the same training / inference FLOPs as Llama-3.2-3B.
 Ablation study shows that the model equipped with DualFFN consistently outperforms its counterpart in terms of word error rate (WER) and speaker similarity.
-See [
+See [our architecture blog](https://github.com/boson-ai/higgs-audio/tech_blogs/ARCHITECTURE_BLOG.md) for more information.
 
 
 ## Evaluation
@@ -77,7 +78,7 @@ We prompt Higgs Audio v2 with `<ref_text, ref_audio, text>` for zero-shot TTS. W
 
 #### EmergentTTS-Eval ("Emotions" and "Questions")
 
-Following the [EmergentTTS-Eval Paper](https://arxiv.org/abs/2505.23009), we report the win-rate over "gpt-4o-mini-tts" with the "alloy" voice.
+Following the [EmergentTTS-Eval Paper](https://arxiv.org/abs/2505.23009), we report the win-rate over "gpt-4o-mini-tts" with the "alloy" voice. Results of Higgs Audio v2 is obtained with the voice of "belinda".
 
 | Model | Emotions (%) ↑ | Questions (%) ↑ |
 |------------------------------------|--------------|----------------|
@@ -116,7 +117,7 @@ We evaluate the word-error-rate (WER) and the geometric mean between intra-speak
 
 ## Get Started
 
-You need to first install the [higgs-audio
+You need to first install the [higgs-audio](https://github.com/boson-ai/higgs-audio):
 
 ```bash
 git clone https://github.com/boson-ai/higgs-audio.git
@@ -139,8 +140,8 @@ import torchaudio
 import time
 import click
 
-MODEL_PATH = "bosonai/higgs-audio-v2-generation-3B-
-AUDIO_TOKENIZER_PATH = "bosonai/higgs-audio-v2-tokenizer
+MODEL_PATH = "bosonai/higgs-audio-v2-generation-3B-base"
+AUDIO_TOKENIZER_PATH = "bosonai/higgs-audio-v2-tokenizer"
 
 system_prompt = (
 "Generate audio following instruction.\n\n<|scene_desc_start|>\nSPEAKER0: british accent\n<|scene_desc_end|>"
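Note on the last hunk: the README's example script builds its `system_prompt` by wrapping a per-speaker description in `<|scene_desc_start|>` / `<|scene_desc_end|>` tags. The snippet below is a minimal sketch of that string layout only. `build_system_prompt` is a hypothetical helper and not part of the higgs-audio repository, and numbering additional speakers as `SPEAKER1`, `SPEAKER2`, ... on separate lines is an assumption extrapolated from the single-speaker example in the diff.

```python
# Hypothetical helper, not part of the higgs-audio repository: it only
# reproduces the string layout of the README's system_prompt.
from typing import Sequence

INSTRUCTION = "Generate audio following instruction."


def build_system_prompt(speaker_descriptions: Sequence[str]) -> str:
    """Wrap one description per speaker in scene-description tags.

    Labelling speakers SPEAKER0, SPEAKER1, ... on separate lines is an
    assumption; the diff only shows the single-speaker case.
    """
    scene_lines = [
        f"SPEAKER{index}: {description}"
        for index, description in enumerate(speaker_descriptions)
    ]
    scene = "\n".join(scene_lines)
    return f"{INSTRUCTION}\n\n<|scene_desc_start|>\n{scene}\n<|scene_desc_end|>"


if __name__ == "__main__":
    # Rebuilds the single-speaker prompt used in the README example.
    print(build_system_prompt(["british accent"]))
```

Called with `["british accent"]`, the helper returns the same string as the literal in the README, so it could stand in for the hard-coded prompt when experimenting with other voice descriptions.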