VRAM use

#3
by JacobR22 - opened

How much VRAM is necessary to use this model?

Hi @JacobR22 — the 4B backbone is ~5B params total in BF16, so the weights alone are ~10 GB. With the Higgs audio tokenizer, KV cache, and activations on top, you'll want roughly 16 GB of VRAM as a practical minimum, and 24 GB for comfortable headroom — especially for zero-shot voice cloning, which is more memory-hungry.
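The ~10 GB weights figure above is parameter count times bytes per parameter. A minimal sketch of that arithmetic, assuming the ~5B total parameter count quoted above and 2 bytes per BF16 value. The function name is hypothetical, and the result covers weights only, not the audio tokenizer, KV cache, or activations.

```python
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Approximate weight footprint in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


# ~5B parameters at 2 bytes each (BF16), per the figures quoted above.
bf16_gb = weight_memory_gb(5e9, 2)
print(f"BF16 weights: ~{bf16_gb:.0f} GB")
```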

On personal / consumer GPUs: a 24 GB card (RTX 3090 / 4090) runs the full BF16 model comfortably, including voice cloning. 16 GB cards (RTX 4080 / 4060 Ti 16GB) work for basic zero-shot synthesis but get tight with cloning — 8-bit quantization helps here. 8 GB cards aren't enough for the full model (even 4-bit struggles in clone mode and needs CPU offload). For reference, our throughput benchmarks were run on a single H100.
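To see roughly why 8-bit and 4-bit quantization change which cards are viable, the weights arithmetic can be repeated per precision and compared with common card sizes. This is a rough sketch, not a measurement: the ~5B parameter count comes from the reply above, the bytes-per-parameter values are nominal (real quantized checkpoints carry some extra metadata), and the helper names are hypothetical. The "left over" figure still has to hold the audio tokenizer, KV cache, and activations, so a positive number does not guarantee that a mode such as voice cloning will fit.

```python
NUM_PARAMS = 5e9  # ~5B total parameters, as quoted in the reply above

# Nominal bytes per parameter; quantization metadata is not modeled here.
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}
CARD_SIZES_GB = [8, 16, 24]


def weights_gb(precision: str) -> float:
    """Approximate weight footprint in decimal GB for a given precision."""
    return NUM_PARAMS * BYTES_PER_PARAM[precision] / 1e9


def headroom_gb(card_gb: float, precision: str) -> float:
    """VRAM left after loading weights only (negative means they do not fit)."""
    return card_gb - weights_gb(precision)


for card in CARD_SIZES_GB:
    for precision in BYTES_PER_PARAM:
        left = headroom_gb(card, precision)
        if left < 0:
            note = "weights alone do not fit"
        else:
            note = f"{left:.1f} GB left for everything else"
        print(f"{card:>2} GB card, {precision}: {note}")
```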

If you'd rather skip the local setup entirely, you can also use the hosted Boson AI API. https://docs.boson.ai/models/higgs-audio-tts/overview

12 GB of VRAM is sufficient for the BF16 model with the ComfyUI native aimdo implementation.

I see. Thanks for the information!

JacobR22 changed discussion status to closed
JacobR22 changed discussion status to open

6 GB VRAM + 44 GB RAM can run this model via ComfyUI for sure; I have already tested it. It's good, but I'm still confused about how to create a male voice, since there's no explanation of how to build a specific voice character other than by cloning a voice.

Here is the example from my machine.

I use
I5
RTX 3050 6GB VRAM
RAM 6GB

Boson AI org

Hi, the model currently does not support voice design. We support voice cloning and no-reference generation, which can generate audio of any gender.

Thank you for your confirmation. Great job btw.

So basically there is zero that is unique about this model and you inflated the weights to funnel users to your paid API. Classic shit-AI move. If you wanted to do something novel -- and no, your ridiculous claim of 100 langs is not novel -- you would do concurrent overlapping realtime streams, optimize and fine-tune phoneme pronunciation for Arabic, Thai, Cantonese, etc., and, as users come to HF largely looking for local models to be deployed on consumer hardware, you'd have quants ready to go, llama.cpp-friendly, that auto-detect hardware and auto-offload to CPU when necessary. Next.

Oh yeah, and the non-permissive license proves the open-washing point. Great job.

It's the best open source TTS model I tested. I think it's at the level of best closed source TTS, and I'm serious. This TTS is pure gold, at a very high level.

Thank you for open sourcing it.

In my case it takes 10 to 11 GB of VRAM on a 16 GB card.
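Usage numbers like this can be read from `nvidia-smi` while generation is running. A small sketch, assuming an NVIDIA driver with `nvidia-smi` on the PATH. The helper names are hypothetical, and the reported value is the total memory in use on the GPU across all processes, so it may be higher than what the model alone occupies.

```python
import subprocess

# Standard nvidia-smi query: one "used, total" line per GPU, values in MiB.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory_line(line: str) -> tuple[int, int]:
    """Parse one 'used, total' line (MiB) produced by the query above."""
    used, total = (int(field.strip()) for field in line.split(","))
    return used, total


def gpu_memory_mib() -> list[tuple[int, int]]:
    """Return (used, total) in MiB for each visible GPU; needs nvidia-smi."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return [parse_memory_line(line) for line in out.splitlines() if line.strip()]
```

Calling `gpu_memory_mib()` from a second terminal or thread during synthesis gives a snapshot. Polling it in a loop and keeping the maximum approximates peak usage.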

If we could have faster generation, it would be excellent as it would allow streaming.

Currently, even using a quantized model doesn't reduce generation time. Or maybe I'm doing it wrong?
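One way to check whether a change such as quantization actually affects speed is to compare the real-time factor (RTF) before and after: generation time divided by the duration of the audio produced. Streaming generally needs an RTF below 1. A minimal sketch with made-up numbers. The function name is hypothetical and the values are illustrative, not measurements of this model.

```python
def real_time_factor(generation_seconds: float, audio_seconds: float) -> float:
    """RTF < 1 means audio is produced faster than it plays back."""
    return generation_seconds / audio_seconds


# Illustrative values only: 12 s of compute to produce 8 s of audio.
rtf = real_time_factor(generation_seconds=12.0, audio_seconds=8.0)
verdict = "keeps up with playback" if rtf < 1 else "slower than playback"
print(f"RTF = {rtf:.2f} ({verdict})")
```

Timing the same text with and without quantization (for example with `time.perf_counter()` around the generation call) and comparing the two RTF values shows whether the quantized run is actually faster on a given setup.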

Well, that may be uniquely true for you in English. The trouble is that models like this fall off after language no. 10 or so. Their own model card admits that only a handful of the headline-grabbing, and misleading, 100 supported languages are actually ready for primetime.

There are many other models at this point that are superior. It's great you are having a good experience, but it is nowhere near the best TTS model posted to HF. Also, I would encourage you to be careful with the terminology you use. This model is far from "open-source". It explicitly includes a non-permissive license. This is open-weight, and honestly, I think that is much too charitable.
