VRAM use

by JacobR22 - opened 10 days ago

Discussion

JacobR22

10 days ago

How much VRAM is necessary to use this model?

popsoda2002

Boson AI org 10 days ago

•

edited 10 days ago

Hi @JacobR22 — the 4B backbone is ~5B params total in BF16, so the weights alone are ~10 GB. With the Higgs audio tokenizer, KV cache, and activations on top, you'll want roughly 16 GB of VRAM as a practical minimum, and 24 GB for comfortable headroom — especially for zero-shot voice cloning, which is more memory-hungry.

On personal / consumer GPUs: a 24 GB card (RTX 3090 / 4090) runs the full BF16 model comfortably, including voice cloning. 16 GB cards (RTX 4080 / 4060 Ti 16GB) work for basic zero-shot synthesis but get tight with cloning — 8-bit quantization helps here. 8 GB cards aren't enough for the full model (even 4-bit struggles in clone mode and needs CPU offload). For reference, our throughput benchmarks were run on a single H100.

If you'd rather skip the local setup entirely, you can also use the hosted Boson AI API. https://docs.boson.ai/models/higgs-audio-tts/overview
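As a rough sanity check on the weight figure above, here is a minimal sketch of the arithmetic, assuming ~5B parameters as stated and the usual 2 bytes per BF16 parameter. The `weight_gb` helper and the bytes-per-parameter table are illustrative, not part of any Boson AI tooling. The estimate covers weights only: the audio tokenizer, KV cache, and activations come on top, and real quantized checkpoints are somewhat larger because they also store scales and other metadata.

```python
# Weight-only VRAM estimate. The ~5B parameter count comes from the reply above;
# everything else here is a simplifying assumption for illustration.
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_gb(num_params: float, dtype: str) -> float:
    """Return the approximate weight size in GB (10^9 bytes) for a precision."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype}: ~{weight_gb(5e9, dtype):.1f} GB of weights")
```

Under these assumptions BF16 weights come to about 10 GB, which matches the figure quoted above and suggests why 8-bit or 4-bit quantization is the usual route on smaller cards.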

drbaph

9 days ago

12 GB of VRAM is sufficient for BF16 with the ComfyUI native aimdo implementation.

JacobR22

9 days ago

I see. Thanks for the information!

JacobR22 changed discussion status to closed 9 days ago

JacobR22 changed discussion status to open 9 days ago

agus2312

9 days ago

•

edited 9 days ago

6 GB VRAM + 44 GB RAM can run this model via ComfyUI for sure; I have already tested it. It's good, but I'm still confused about how to make a male voice, since there's no explanation of how to build a specific voice character other than by cloning a voice.

Here is the example from my machine.

I use
I5
RTX 3050 6GB VRAM
RAM 6GB

ruskinmanku

Boson AI org 8 days ago

Hi, the model currently does not support voice-design. We support voice-cloning, and no-reference generation which can generate audio of any gender.

agus2312

7 days ago

Thank you for your confirmation. Great job btw.

cmoney113

6 days ago

•

edited 6 days ago

So basically there is nothing unique about this model, and you inflated the weights to funnel users to your paid API. Classic shit-AI move. If you wanted to do something novel (and no, your ridiculous claim of 100 languages is not novel), you would do concurrent overlapping real-time streams, optimize and fine-tune phoneme pronunciation for Arabic, Thai, Cantonese, etc., and, since users come to HF largely looking for local models to deploy on consumer hardware, you'd have quants ready to go, llama.cpp-friendly, that auto-detect hardware and auto-offload to CPU when necessary. Next.

Oh yeah, and the non-permissive license proves the open-washing point. Great job.

Omicrow

4 days ago

It's the best open source TTS model I tested. I think it's at the level of best closed source TTS, and I'm serious. This TTS is pure gold, at a very high level.

Thank you for open sourcing it.

In my case it uses 10 to 11 GB of VRAM on a 16 GB card.

If we could have faster generation, it would be excellent as it would allow streaming.

Currently, even using a quantized model doesn't reduce generation time. Or maybe I'm doing it wrong?

cmoney113

3 days ago

Well, that may be uniquely true for you in English. The trouble is that models like this fall off after language no. 10 or so. Their own model card admits that only a handful of the headline-grabbing, and misleading, 100 supported languages are actually ready for prime time.

There are many other models at this point that are superior. It's great that you are having a good experience, but it is nowhere near the best TTS model posted to HF. Also, I would encourage you to be careful with the terminology you use. This model is far from "open source". It explicitly includes a non-permissive license. This is open-weight, and honestly, I think that is much too charitable.
