---
license: other
language:
- en
- zh
- de
- ko
pipeline_tag: text-to-speech
library_name: transformers
---

# Higgs TTS 2: Redefining Expressiveness in Audio Generation

<div align="center" style="display: flex; justify-content: center; margin-top: 10px; flex-wrap: wrap; gap: 8px;">
  <a href="https://boson.ai/blog/higgs-audio-v2"><img src='https://img.shields.io/badge/🚀-Launch Blogpost-228B22' style="margin-right: 5px;"></a>
  <a href="https://github.com/boson-ai/higgs-audio"><img src="https://img.shields.io/badge/💻-Github%20Repo-9C276A" style="margin-right: 5px;"></a>
  <a href="https://huggingface.co/spaces/smola/higgs_audio_v2"><img src="https://img.shields.io/badge/🎮-HF%20Space%20Playground-8A2BE2" style="margin-right: 5px;"></a>
  <a href="https://huggingface.co/bosonai/higgs-audio-v2-tokenizer"><img src="https://img.shields.io/badge/🎧-Audio%20Tokenizer-6A5ACD.svg" style="margin-right: 5px;"></a>
</div>

Check our open-source repository https://github.com/boson-ai/higgs-audio for more details!

> **Rename note:** Higgs Audio V2 and Higgs Audio V2 Generation have been renamed to Higgs TTS 2.

We are open-sourcing Higgs TTS 2, a powerful audio foundation model pretrained on over 10 million hours of audio data and a diverse set of text data.
Despite having no post-training or fine-tuning, Higgs TTS 2 excels in expressive audio generation, thanks to its deep language and acoustic understanding.

On [EmergentTTS-Eval](https://github.com/boson-ai/emergenttts-eval-public), the model achieves win rates of **75.7%** and **55.7%** over "gpt-4o-mini-tts" on the "Emotions" and "Questions" categories, respectively. It also obtains state-of-the-art performance on traditional TTS benchmarks like Seed-TTS Eval and Emotional Speech Dataset (ESD). Moreover, the model demonstrates capabilities rarely seen in previous systems, including automatic prosody adaptation during narration, zero-shot generation of natural multi-speaker dialogues in multiple languages, melodic humming with the cloned voice, and simultaneous generation of speech and background music.


<p>
    <img src="./emergent-tts-emotions-win-rate.png" width=900>
</p>

Here's the demo video that shows some of its emergent capabilities (remember to unmute):

<div align="left">
    <video width="95%" controls>
        <source src="https://cdn-uploads.huggingface.co/production/uploads/64fa072a52e82dd432460767/bjbWGg1IKoMtWXnl0Od8G.mp4" type="video/mp4">
        Your browser does not support the video tag.
    </video>
</div>

Here's another demo video that showcases the model's multilingual capability and how it enables live translation (remember to unmute):

<div align="left">
    <video width="95%" controls>
        <source src="https://cdn-uploads.huggingface.co/production/uploads/64fa072a52e82dd432460767/9cN-ky02GzmUgogsIh1Wg.mp4" type="video/mp4">
        Your browser does not support the video tag.
    </video>
</div>

## Technical Details

<p>
    <img src="./higgs_audio_v2_architecture_combined.png" width=900>
</p>

Higgs TTS 2 adopts the "generation variant" depicted in the architecture figure above. Its strong performance is driven by three key technical innovations:

- We developed an automated annotation pipeline that leverages multiple ASR models, sound event classification models, and our in-house audio understanding model. Using this pipeline, we cleaned and annotated 10 million hours of audio data, which we refer to as AudioVerse. The in-house understanding model is fine-tuned on top of Higgs Audio v1 Understanding, which adopts the "understanding variant" shown in the architecture figure.
- We trained a unified audio tokenizer from scratch that captures both semantic and acoustic features.
- We proposed the DualFFN architecture, which enhances the LLM’s ability to model acoustic tokens with minimal computational overhead.


### Audio Tokenizer

<p>
    <img src="./higgs_audio_tokenizer_architecture.png" width=900>
</p>

We introduce a new discretized audio tokenizer that runs at just 25 frames per second while keeping—or even improving—audio quality compared to tokenizers with twice the bitrate.
Our model is the first to train on 24 kHz data covering speech, music, and sound events in one unified system.
It also uses a simple non-diffusion encoder/decoder for fast, batch inference. It achieves state-of-the-art performance in semantic and acoustic evaluations.
Check https://huggingface.co/bosonai/higgs-audio-v2-tokenizer for more information about the tokenizer.
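
To make the frame rate concrete, the sketch below works out how many frames a clip occupies at 25 frames per second and how many 24 kHz samples each frame spans. The constant and function names are illustrative, not part of the tokenizer's API, and the sketch assumes each frame covers a fixed, non-overlapping window of samples; it does not model codebooks or any padding the real tokenizer may apply.

```python
import math

FRAME_RATE_HZ = 25        # frames per second, as stated above
SAMPLE_RATE_HZ = 24_000   # 24 kHz audio, as stated above

# Number of waveform samples summarized by a single frame (assumed fixed).
SAMPLES_PER_FRAME = SAMPLE_RATE_HZ // FRAME_RATE_HZ


def frames_for_duration(seconds: float) -> int:
    """Return the number of frames needed to cover `seconds` of audio."""
    if seconds < 0:
        raise ValueError("duration must be non-negative")
    return math.ceil(seconds * FRAME_RATE_HZ)


def duration_for_frames(num_frames: int) -> float:
    """Return the audio duration in seconds covered by `num_frames` frames."""
    if num_frames < 0:
        raise ValueError("frame count must be non-negative")
    return num_frames / FRAME_RATE_HZ
```

Under these assumptions, a 10-second clip maps to 250 frames, and each frame stands for 960 samples.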

### Model Architecture -- Dual FFN

Higgs TTS 2 is built on top of [Llama-3.2-3B](https://huggingface.co/meta-llama/Llama-3.2-3B). To enhance the model’s ability to process audio tokens,
we incorporate the "DualFFN" architecture as an audio adapter.
DualFFN acts as an audio-specific expert, boosting the LLM's performance with minimal computational overhead.
Our implementation preserves 91% of the original LLM’s training speed with the inclusion of DualFFN, which has 2.2B parameters.
Thus, the total number of parameters for Higgs TTS 2 is 3.6B (LLM) + 2.2B (Audio Dual FFN), and it has the same training / inference FLOPs as Llama-3.2-3B.
Our ablation study shows that the model equipped with DualFFN consistently outperforms its counterpart in terms of word error rate (WER) and speaker similarity.
See [our architecture blog](https://github.com/boson-ai/higgs-audio/blob/main/tech_blogs/ARCHITECTURE_BLOG.md) for more information.
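
The following toy sketch illustrates one way to read "audio-specific expert". It assumes that each position is routed through exactly one of two feed-forward paths, depending on whether it holds an audio token or a text token. The function name, the scalar "hidden states", and the two toy FFNs are hypothetical stand-ins for real tensors and layers; the architecture blog remains the reference for the actual design.

```python
from typing import Callable, List, Sequence


def dual_ffn(
    hidden: Sequence[float],
    is_audio: Sequence[bool],
    text_ffn: Callable[[float], float],
    audio_ffn: Callable[[float], float],
) -> List[float]:
    """Route each position through exactly one of two feed-forward paths."""
    if len(hidden) != len(is_audio):
        raise ValueError("hidden and is_audio must have the same length")
    # Assumption: routing depends only on the token type at each position.
    return [
        audio_ffn(h) if audio else text_ffn(h)
        for h, audio in zip(hidden, is_audio)
    ]


# Toy stand-ins for the two feed-forward networks (hypothetical).
def toy_text_ffn(h: float) -> float:
    return 2.0 * h


def toy_audio_ffn(h: float) -> float:
    return h + 10.0


mixed = dual_ffn([1.0, 2.0, 3.0], [False, True, False], toy_text_ffn, toy_audio_ffn)
```

Under this assumption every position passes through a single FFN, so per-token compute matches a model with one FFN even though the parameter count grows. That is consistent with the FLOPs statement above.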


## Evaluation

Here's the performance of Higgs TTS 2 on four benchmarks: [Seed-TTS Eval](https://github.com/BytedanceSpeech/seed-tts-eval), [Emotional Speech Dataset (ESD)](https://paperswithcode.com/dataset/esd), [EmergentTTS-Eval](https://arxiv.org/abs/2505.23009), and Multi-speaker Eval.

#### Seed-TTS Eval & ESD

We prompt Higgs TTS 2 with the reference text, reference audio, and target text for zero-shot TTS. We use the standard evaluation metrics from Seed-TTS Eval and ESD.

|                              | SeedTTS-Eval| | ESD   |                 |
|------------------------------|--------|--------|---------|-------------------|
|                              | WER ↓ | SIM ↑ | WER ↓ | SIM (emo2vec) ↑ |
| Cosyvoice2                   | 2.28   | 65.49  | 2.71    | 80.48             |
| Qwen2.5-omni†                | 2.33   | 64.10  | -       | -                 |
| ElevenLabs Multilingual V2   | **1.43**   | 50.00  | 1.66    | 65.87             |
| Higgs Audio v1                | 2.18   | 66.27  | **1.49**    | 82.84             |
| Higgs TTS 2 (base)            | 2.44   | **67.70**  | 1.78    | **86.13**         |
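
For readers unfamiliar with the WER column, here is a minimal sketch of the standard definition: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name and the lowercase-and-split normalization are assumptions for illustration; the benchmarks transcribe the generated audio with an ASR model and apply their own text normalization, which this sketch does not reproduce.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref: List[str] = reference.lower().split()
    hyp: List[str] = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from the hypothesis
            insertion = curr[j - 1] + 1  # extra word in the hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

The table reports WER as a percentage, so a returned value of 0.0244 corresponds to 2.44.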


#### EmergentTTS-Eval ("Emotions" and "Questions")

Following the [EmergentTTS-Eval Paper](https://arxiv.org/abs/2505.23009), we report the win-rate over "gpt-4o-mini-tts" with the "alloy" voice. Results of Higgs TTS 2 are obtained with the voice of "belinda". The judge model is Gemini 2.5 Pro.

| Model                              | Emotions (%) ↑ | Questions (%) ↑ |
|------------------------------------|--------------|----------------|
| Higgs TTS 2 (base)                  | **75.71%**   | **55.71%**         |
| [gpt-4o-audio-preview†](https://platform.openai.com/docs/models/gpt-4o-audio-preview)       | 61.64%       | 47.85%         |
| [Hume.AI](https://www.hume.ai/research)                            | 61.60%       | 43.21%         |
| **BASELINE:** [gpt-4o-mini-tts](https://platform.openai.com/docs/models/gpt-4o-mini-tts)  | 50.00%       | 50.00%         |
| [Qwen 2.5 Omni†](https://github.com/QwenLM/Qwen2.5-Omni)      | 41.60%       | 51.78%         |
| [minimax/speech-02-hd](https://replicate.com/minimax/speech-02-hd)               | 40.86%        | 47.32%         |
| [ElevenLabs Multilingual v2](https://elevenlabs.io/blog/eleven-multilingual-v2)         | 30.35%       | 39.46%         |
| [DeepGram Aura-2](https://deepgram.com/learn/introducing-aura-2-enterprise-text-to-speech)                    | 29.28%       | 48.21%         |
| [Sesame csm-1B](https://github.com/SesameAILabs/csm)                      | 15.96%       | 31.78%         |

<sup><sub>'†' means using the strong-prompting method described in the paper.</sub></sup>
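
As a rough illustration of how a win rate against a baseline can be computed, the sketch below counts a tie as half a win. This is our understanding of the convention in the paper and should be treated as an assumption, as should the function name and the `"win"` / `"tie"` / `"loss"` labels. With this convention, a system compared against itself scores 50%, which matches the baseline row.

```python
from typing import Iterable


def win_rate(judgements: Iterable[str]) -> float:
    """Percentage win rate; each judgement is "win", "tie", or "loss"."""
    # Assumption: a tie contributes half a win.
    scores = {"win": 1.0, "tie": 0.5, "loss": 0.0}
    values = [scores[j] for j in judgements]
    if not values:
        raise ValueError("at least one judgement is required")
    return 100.0 * sum(values) / len(values)
```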


#### Multi-speaker Eval

We also designed a multi-speaker evaluation benchmark to evaluate the capability of Higgs TTS 2 for multi-speaker dialog generation. The benchmark contains three subsets:

- `two-speaker-conversation`: 1000 synthetic dialogues involving two speakers. We fix two reference audio clips to evaluate the model's ability to clone both voices in dialogues of 4 to 10 turns between two randomly chosen personas.
- `small talk (no ref)`: 250 synthetic dialogues curated in the same way as above, but characterized by short utterances and a limited number of turns (4–6). We do not fix reference audio clips in this case; this set is designed to evaluate the model's ability to automatically assign appropriate voices to speakers.
- `small talk (ref)`: 250 synthetic dialogues similar to the above, but with even shorter utterances, because this set is meant to include reference clips in its context, as in `two-speaker-conversation`.


We report the word error rate (WER) and the geometric mean of intra-speaker similarity and inter-speaker dissimilarity on these three subsets. Besides Higgs TTS 2, we also evaluated [MoonCast](https://github.com/jzq2000/MoonCast) and [nari-labs/Dia-1.6B-0626](https://huggingface.co/nari-labs/Dia-1.6B-0626), two of the most popular open-source models capable of multi-speaker dialog generation.
Results are summarized in the following table. We were not able to run [nari-labs/Dia-1.6B-0626](https://huggingface.co/nari-labs/Dia-1.6B-0626) on our "two-speaker-conversation" subset due to its strict limits on utterance length and output audio length.

|                                                | two-speaker-conversation |                |small talk |                | small talk (no ref) |                |
| ---------------------------------------------- | -------------- | ------------------ | ---------- | -------------- | ------------------- | -------------- |
|                                                | WER ↓                      | Mean Sim & Dis-sim ↑ | WER ↓       |  Mean Sim & Dis-sim ↑ | WER ↓               | Mean Sim & Dis-sim ↑ |
| [MoonCast](https://github.com/jzq2000/MoonCast) | 38.77                    | 46.02         | **8.33**       | 63.68          | 24.65               | 53.94 |
| [nari-labs/Dia-1.6B-0626](https://huggingface.co/nari-labs/Dia-1.6B-0626)         | \-                       | \-             | 17.62      | 63.15          | 19.46               | **61.14**          |
| Higgs TTS 2 (base)        | **18.88**                    | **51.95**          | 11.89      | **67.92**              | **14.65**               | 55.28              |
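
The "Mean Sim & Dis-sim" column combines two scores with a geometric mean. A minimal sketch of that combination follows. The function name is hypothetical, and the way the two inputs are measured (for example, which speaker-embedding model is used, or how dissimilarity is derived from similarity) is not specified here and is left to the caller.

```python
import math


def sim_dissim_score(intra_speaker_sim: float, inter_speaker_dissim: float) -> float:
    """Geometric mean of two non-negative scores."""
    if intra_speaker_sim < 0 or inter_speaker_dissim < 0:
        raise ValueError("both scores must be non-negative")
    return math.sqrt(intra_speaker_sim * inter_speaker_dissim)
```

A geometric mean rewards balance: if either score is near zero (for example, both speakers sound identical), the combined score is near zero as well, regardless of how high the other score is.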


## Usage

### Transformers 🤗

Higgs TTS 2 is supported natively in `transformers`: [see the doc](https://huggingface.co/docs/transformers/en/model_doc/higgs_audio_v2).

```bash
uv pip install "transformers>=5.3.0"
```

<details>
<summary>Single-speaker smart voice</summary>

```python
from transformers import AutoProcessor, HiggsAudioV2ForConditionalGeneration

model_id = "bosonai/higgs-audio-v2-generation-3B-base"
processor = AutoProcessor.from_pretrained(model_id, device_map="auto")
model = HiggsAudioV2ForConditionalGeneration.from_pretrained(model_id, device_map="auto")

conversation = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "Generate audio following instruction."}],
    },
    {
        "role": "scene",
        "content": [{"type": "text", "text": "Audio is recorded from a quiet room."}],
    },
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "The sun rises in the east and sets in the west. This simple fact has been observed by humans for thousands of years.",
            }
        ],
    },
]

inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    sampling_rate=24000,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=1000, do_sample=False)
decoded = processor.batch_decode(outputs)
processor.save_audio(decoded, "output_single_speaker.wav")
```

</details>

<details>
<summary>Multi-speaker smart voice</summary>

Use `[SPEAKER*]` tags to generate a multi-speaker dialogue. Speaker characteristics are described in the `scene` role.

```python
from transformers import AutoProcessor, HiggsAudioV2ForConditionalGeneration

model_id = "bosonai/higgs-audio-v2-generation-3B-base"
processor = AutoProcessor.from_pretrained(model_id, device_map="auto")
model = HiggsAudioV2ForConditionalGeneration.from_pretrained(model_id, device_map="auto")

system_message = """You are an AI assistant designed to convert text into speech.
If the user's message includes a [SPEAKER*] tag, do not read out the tag and generate speech for the following text, using the specified voice.
If no speaker tag is present, select a suitable voice on your own."""

user_message = """[SPEAKER0] I can't believe you did that without even asking me first!
[SPEAKER1] Oh, come on! It wasn't a big deal, and I knew you would overreact like this.
[SPEAKER0] Overreact? You made a decision that affects both of us without even considering my opinion!
[SPEAKER1] Because I didn't have time to sit around waiting for you to make up your mind! Someone had to act."""

conversation = [
    {
        "role": "system",
        "content": [{"type": "text", "text": system_message}],
    },
    {
        "role": "scene",
        "content": [
            {"type": "text", "text": "Audio is recorded from a quiet room."},
            {"type": "text", "text": "SPEAKER0: feminine"},
            {"type": "text", "text": "SPEAKER1: masculine"},
        ],
    },
    {
        "role": "user",
        "content": [{"type": "text", "text": user_message}],
    },
]

inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    sampling_rate=24000,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=2000, do_sample=False)
decoded = processor.batch_decode(outputs)
processor.save_audio(decoded, "output_multi_speaker.wav")
```
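
If the dialogue lives in a list rather than a hand-written string, a small helper can produce the tagged text. This is an illustrative sketch: `format_dialogue` is a hypothetical name, not part of `transformers`, and it only reproduces the `[SPEAKERn] text` line layout used in the example above.

```python
from typing import Iterable, Tuple


def format_dialogue(turns: Iterable[Tuple[int, str]]) -> str:
    """Join (speaker_index, text) pairs into one `[SPEAKERn] text` line per turn."""
    lines = []
    for speaker, text in turns:
        if speaker < 0:
            raise ValueError("speaker index must be non-negative")
        lines.append(f"[SPEAKER{speaker}] {text.strip()}")
    return "\n".join(lines)


user_message = format_dialogue(
    [
        (0, "I can't believe you did that without even asking me first!"),
        (1, "Oh, come on! It wasn't a big deal."),
    ]
)
```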

</details>

<details>
<summary>Zero-shot voice cloning</summary>

Clone a voice by providing a reference audio in the conversation history.

```python
from transformers import AutoProcessor, HiggsAudioV2ForConditionalGeneration

model_id = "bosonai/higgs-audio-v2-generation-3B-base"
processor = AutoProcessor.from_pretrained(model_id, device_map="auto")
model = HiggsAudioV2ForConditionalGeneration.from_pretrained(model_id, device_map="auto")

conversation = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "Generate audio following instruction."}],
    },
    {
        "role": "scene",
        "content": [{"type": "text", "text": "Audio is recorded from a quiet room."}],
    },
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "It was the night before my birthday. Hooray! It's almost here! It may not be a holiday, but it's the best day of the year.",
            }
        ],
    },
    {
        "role": "assistant",
        "content": [
            {
                "type": "audio",
                "url": "https://huggingface.co/datasets/eustlb/dummy-audio-samples-higgs/resolve/main/belinda.wav",
            }
        ],
    },
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "The sun rises in the east and sets in the west. This simple fact has been observed by humans for thousands of years.",
            }
        ],
    },
]

inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    sampling_rate=24000,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=1000, do_sample=False)
decoded = processor.batch_decode(outputs)
processor.save_audio(decoded, "output_voice_cloning.wav")
```
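
The conversation layout above (reference transcript as a `user` turn, reference audio as the `assistant` turn, then the target text) can be wrapped in a helper when cloning many voices. The sketch below uses hypothetical function names and only builds plain dictionaries in the same shape as the example; it does not call the model or processor.

```python
from typing import Any, Dict, List

Message = Dict[str, Any]


def text_message(role: str, text: str) -> Message:
    """Build a single-turn message holding one text item."""
    return {"role": role, "content": [{"type": "text", "text": text}]}


def build_cloning_conversation(
    reference_text: str,
    reference_audio_url: str,
    target_text: str,
    scene: str = "Audio is recorded from a quiet room.",
    system: str = "Generate audio following instruction.",
) -> List[Message]:
    """Mirror the voice-cloning layout: reference pair first, target text last."""
    return [
        text_message("system", system),
        text_message("scene", scene),
        text_message("user", reference_text),
        {"role": "assistant", "content": [{"type": "audio", "url": reference_audio_url}]},
        text_message("user", target_text),
    ]
```

The returned list can be passed to `processor.apply_chat_template` in place of the hand-written `conversation`.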

</details>

<details>
<summary>Multi-speaker voice cloning</summary>

Clone multiple voices by providing reference audio clips in the `scene` role.

```python
from transformers import AutoProcessor, HiggsAudioV2ForConditionalGeneration

model_id = "bosonai/higgs-audio-v2-generation-3B-base"
processor = AutoProcessor.from_pretrained(model_id, device_map="auto")
model = HiggsAudioV2ForConditionalGeneration.from_pretrained(model_id, device_map="auto")

user_message = """[SPEAKER0] I can't believe you did that without even asking me first!
[SPEAKER1] Oh, come on! It wasn't a big deal, and I knew you would overreact like this.
[SPEAKER0] Overreact? You made a decision that affects both of us without even considering my opinion!
[SPEAKER1] Because I didn't have time to sit around waiting for you to make up your mind! Someone had to act."""

conversation = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "Generate audio following instruction."}],
    },
    {
        "role": "scene",
        "content": [
            {"type": "text", "text": "Audio is recorded from a quiet room."},
            {"type": "text", "text": "SPEAKER0:"},
            {
                "type": "audio",
                "url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/guess_age_gender.wav",
            },
            {"type": "text", "text": "SPEAKER1:"},
            {
                "type": "audio",
                "url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/1272-128104-0000.flac",
            },
        ],
    },
    {
        "role": "user",
        "content": [{"type": "text", "text": user_message}],
    },
]

inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    sampling_rate=24000,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=1000, do_sample=False)
decoded = processor.batch_decode(outputs)
processor.save_audio(decoded, "output_multi_speaker_cloning.wav")
```

</details>

<details>
<summary>Batched inference</summary>

Process multiple conversations in a single forward pass.

```python
from transformers import AutoProcessor, HiggsAudioV2ForConditionalGeneration

model_id = "bosonai/higgs-audio-v2-generation-3B-base"
processor = AutoProcessor.from_pretrained(model_id, device_map="auto")
model = HiggsAudioV2ForConditionalGeneration.from_pretrained(model_id, device_map="auto")

conversation1 = [
    {"role": "system", "content": [{"type": "text", "text": "Generate audio following instruction."}]},
    {"role": "scene", "content": [{"type": "text", "text": "Audio is recorded from a quiet room."}]},
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "It was the night before my birthday. Hooray! It's almost here! It may not be a holiday, but it's the best day of the year.",
            }
        ],
    },
    {
        "role": "assistant",
        "content": [
            {
                "type": "audio",
                "url": "https://huggingface.co/datasets/eustlb/dummy-audio-samples-higgs/resolve/main/belinda.wav",
            }
        ],
    },
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "The sun rises in the east and sets in the west. This simple fact has been observed by humans for thousands of years.",
            }
        ],
    },
]

conversation2 = [
    {"role": "system", "content": [{"type": "text", "text": "Generate audio following instruction."}]},
    {"role": "scene", "content": [{"type": "text", "text": "Audio is recorded from a quiet room."}]},
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": " It's super important to assess fairly the fact that our former model is over. And this is not a question of adjustment. This is not the same world, 2024, 2025. And on top of that, we are making the same mistakes, on top of the key elements I mentioned. We are over-regulating and under-investing. So just if, in the two to three years to come, if we follow our classical agenda, we will be out of the market. I have no doubts.",
            }
        ],
    },
    {
        "role": "assistant",
        "content": [
            {
                "type": "audio",
                "url": "https://huggingface.co/datasets/eustlb/dummy-audio-samples-higgs/resolve/main/macron.wav",
            }
        ],
    },
    {
        "role": "user",
        "content": [{"type": "text", "text": "Hey, here is a clone from the given voice."}],
    },
]

inputs = processor.apply_chat_template(
    [conversation1, conversation2],
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    sampling_rate=24000,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=1000, do_sample=False)
decoded = processor.batch_decode(outputs)
processor.save_audio(decoded, ["output_batched_1.wav", "output_batched_2.wav"])
```

</details>

<details>
<summary>Training</summary>

By default, the model does not load the text language modeling head to save memory (~1.5GiB reduction), as it's not required for generation. When training, set `use_text_head=True` to compute loss on text tokens.

```python
from transformers import AutoProcessor, HiggsAudioV2ForConditionalGeneration

model_id = "bosonai/higgs-audio-v2-generation-3B-base"
processor = AutoProcessor.from_pretrained(model_id, device_map="auto")
model = HiggsAudioV2ForConditionalGeneration.from_pretrained(model_id, device_map="auto", use_text_head=True)

conversation1 = [
    {"role": "system", "content": [{"type": "text", "text": "Generate audio following instruction."}]},
    {"role": "scene", "content": [{"type": "text", "text": "Audio is recorded from a quiet room."}]},
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "It was the night before my birthday. Hooray! It's almost here! It may not be a holiday, but it's the best day of the year.",
            }
        ],
    },
    {
        "role": "assistant",
        "content": [
            {
                "type": "audio",
                "url": "https://huggingface.co/datasets/eustlb/dummy-audio-samples-higgs/resolve/main/belinda.wav",
            }
        ],
    },
]

conversation2 = [
    {"role": "system", "content": [{"type": "text", "text": "Generate audio following instruction."}]},
    {"role": "scene", "content": [{"type": "text", "text": "Audio is recorded from a quiet room."}]},
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": " I would imagine so. A wand with a dragon heartstring core is capable of dazzling magic, and the bond between you and your wand should only grow stronger. Do not be surprised at your new wand's ability to perceive your intentions, particularly in a moment of need",
            }
        ],
    },
    {
        "role": "assistant",
        "content": [
            {
                "type": "audio",
                "url": "https://huggingface.co/datasets/eustlb/dummy-audio-samples-higgs/resolve/main/broom_salesman.wav",
            }
        ],
    },
]

inputs = processor.apply_chat_template(
    [conversation1, conversation2],
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    sampling_rate=24000,
    return_tensors="pt",
    output_labels=True,
).to(model.device)

outputs = model(**inputs)
outputs.loss.backward()
```

</details>

### Original codebase

First, install the [higgs-audio](https://github.com/boson-ai/higgs-audio) package:

```bash
git clone https://github.com/boson-ai/higgs-audio.git

cd higgs-audio
python3 -m venv higgs_audio_env
source higgs_audio_env/bin/activate
pip install -r requirements.txt
pip install -e .
```

Then run the following Python snippet to convert text to speech.

```python
from boson_multimodal.serve.serve_engine import HiggsAudioServeEngine, HiggsAudioResponse
from boson_multimodal.data_types import ChatMLSample, Message

import torch
import torchaudio

MODEL_PATH = "bosonai/higgs-audio-v2-generation-3B-base"
AUDIO_TOKENIZER_PATH = "bosonai/higgs-audio-v2-tokenizer"

# The scene description is embedded in the system prompt between the scene_desc tags.
system_prompt = (
    "Generate audio following instruction.\n\n<|scene_desc_start|>\nAudio is recorded from a quiet room.\n<|scene_desc_end|>"
)

messages = [
    Message(
        role="system",
        content=system_prompt,
    ),
    Message(
        role="user",
        content="The sun rises in the east and sets in the west. This simple fact has been observed by humans for thousands of years.",
    ),
]
# Use the GPU when one is available; otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

serve_engine = HiggsAudioServeEngine(MODEL_PATH, AUDIO_TOKENIZER_PATH, device=device)

output: HiggsAudioResponse = serve_engine.generate(
    chat_ml_sample=ChatMLSample(messages=messages),
    max_new_tokens=1024,
    temperature=0.3,
    top_p=0.95,
    top_k=50,
    stop_strings=["<|end_of_text|>", "<|eot_id|>"],
)
# Add a channel dimension: torchaudio.save expects a (channels, samples) tensor.
torchaudio.save("output.wav", torch.from_numpy(output.audio)[None, :], output.sampling_rate)
```
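
The system prompt in the snippet above wraps the scene description in `<|scene_desc_start|>` and `<|scene_desc_end|>` tags. A small helper can build that string for other scenes. This is a sketch: `build_system_prompt` is a hypothetical name, not part of the `higgs-audio` package, and it only reproduces the string layout shown above.

```python
def build_system_prompt(
    scene_description: str,
    instruction: str = "Generate audio following instruction.",
) -> str:
    """Return the instruction followed by the tagged scene description."""
    return (
        f"{instruction}\n\n"
        f"<|scene_desc_start|>\n{scene_description.strip()}\n<|scene_desc_end|>"
    )


system_prompt = build_system_prompt("Audio is recorded from a quiet room.")
```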

You can also check https://github.com/boson-ai/higgs-audio/tree/main/examples for more example scripts.

## License

See [LICENSE](./LICENSE)