How to use Thomcles/Chatterbox-TTS-French with Chatterbox:
# pip install chatterbox-tts
import torchaudio as ta
from chatterbox.tts import ChatterboxTTS

model = ChatterboxTTS.from_pretrained(device="cuda")

text = "Ezreal and Jinx teamed up with Ahri, Yasuo, and Teemo to take down the enemy's Nexus in an epic late-game pentakill."
wav = model.generate(text)
ta.save("test-1.wav", wav, model.sr)

# If you want to synthesize with a different voice, specify the audio prompt
AUDIO_PROMPT_PATH = "YOUR_FILE.wav"
wav = model.generate(text, audio_prompt_path=AUDIO_PROMPT_PATH)
ta.save("test-2.wav", wav, model.sr)
New feature - Save already cloned voices with same params
Hi, I’d like to suggest a new feature for Chatterbox-TTS-French: the ability to save cloned voices along with their generation parameters (exaggeration, temperature, cfg_weight) and the processed audio representation of the voice itself.
Currently, generating a new phrase requires re-processing the audio prompt every time. Saving the cloned voice with its parameters would allow users to quickly generate new text with the exact same voice, ensuring consistency and reducing processing time.
Thanks for considering this!
Well, it seems to happen automatically.
This discussion suggested that the conds (the representation of the speaker's characteristics) stay stored on the model instance once they have been computed.
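Try this on your computer:
import torchaudio as ta
from chatterbox.tts import ChatterboxTTS

model = ChatterboxTTS.from_pretrained(device="cuda")

text = """
If music be the food of love, play on;
"""
# First call: the reference clip is processed and its conds are kept on the model
wav = model.generate(text, audio_prompt_path="ref.mp3")
ta.save("test-1.wav", wav, model.sr)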
Then, if you do the same thing without audio_prompt_path, you get:
text = """
If music be the food of love, play on;
"""
# No audio_prompt_path: the conds stored by the previous call are reused
wav = model.generate(text)
ta.save("test-2.wav", wav, model.sr) # -> Same voice as test-1.wav (if it's the same instance of the ChatterboxTTS class)
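If you still want to keep a voice together with its generation parameters, here is a rough sketch of how you could do it on your side. VoiceProfile and VoiceSession are hypothetical names of mine, not part of chatterbox-tts. The only thing assumed about the model is that generate() accepts audio_prompt_path, exaggeration, temperature and cfg_weight, the parameters mentioned in the request above. The reference clip is passed only when the voice changes, so later calls rely on the conds already stored on the model.

```python
import json
from dataclasses import dataclass, asdict
from pathlib import Path


@dataclass
class VoiceProfile:
    """Hypothetical container: a reference clip plus its generation parameters."""
    audio_prompt_path: str
    exaggeration: float
    temperature: float
    cfg_weight: float

    def save(self, path):
        # Store the profile as human-readable JSON.
        Path(path).write_text(json.dumps(asdict(self), indent=2))

    @classmethod
    def load(cls, path):
        return cls(**json.loads(Path(path).read_text()))


class VoiceSession:
    """Hypothetical wrapper: sends the reference clip only when the voice changes."""

    def __init__(self, model):
        self.model = model
        self._active_prompt = None  # clip whose conds the model currently holds

    def generate(self, text, profile):
        kwargs = dict(
            exaggeration=profile.exaggeration,
            temperature=profile.temperature,
            cfg_weight=profile.cfg_weight,
        )
        if profile.audio_prompt_path != self._active_prompt:
            # New voice: let the model process the reference clip.
            kwargs["audio_prompt_path"] = profile.audio_prompt_path
        wav = self.model.generate(text, **kwargs)
        # Remember the clip only after the call succeeded.
        self._active_prompt = profile.audio_prompt_path
        return wav
```

With a real model you would create the profile once, for example VoiceProfile("ref.mp3", 0.5, 0.8, 0.5), call save("voice.json"), and later pass VoiceProfile.load("voice.json") to VoiceSession(model).generate(...). The numbers are placeholders. This only works within one model instance, for the reason given above.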
The catch is that this brings no gain in latency.
The model's architecture requires the latents to be computed from the reference audio, but the time this takes is negligible.
So there is nothing for you to do: almost all of the generation time goes into synthesizing the audio itself, not into computing the latents.
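If you want to check this on your own hardware, a small timing sketch along these lines should be enough. It assumes the model exposes prepare_conditionals(path) alongside generate(text). That is my reading of the chatterbox-tts source, so treat it as an assumption and adapt it if your version differs. The helper names (timed, conditioning_share, measure) are made up for this example.

```python
import time


def timed(fn, *args, sync=None, **kwargs):
    """Run fn and return (result, elapsed_seconds).

    `sync` is an optional zero-argument callable (for example
    torch.cuda.synchronize) called before each clock read, so that
    queued GPU work is included in the measurement.
    """
    if sync is not None:
        sync()
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    if sync is not None:
        sync()
    return result, time.perf_counter() - start


def conditioning_share(cond_seconds, gen_seconds):
    """Fraction of the total time spent processing the reference clip."""
    total = cond_seconds + gen_seconds
    return cond_seconds / total if total > 0 else 0.0


def measure(model, text, audio_prompt_path, sync=None):
    """Time voice conditioning and synthesis separately.

    Assumption: `model` has prepare_conditionals(path) and generate(text),
    and generate() without a prompt reuses the stored conds.
    """
    _, cond_s = timed(model.prepare_conditionals, audio_prompt_path, sync=sync)
    _, gen_s = timed(model.generate, text, sync=sync)
    return cond_s, gen_s, conditioning_share(cond_s, gen_s)
```

On a CUDA device you would call measure(model, text, "ref.mp3", sync=torch.cuda.synchronize). A share close to zero would confirm that caching the voice saves almost nothing.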