Request for reproducible Magpie-TTS fine-tuning setup for new language adaptation

#2
by tauqeersajid - opened

Hi NMikka,

Thank you for sharing the Georgian Magpie-TTS fine-tuned model. I am also trying to fine-tune NVIDIA Magpie-TTS for a new language, specifically Slovene, and your model seems to be one of the few public examples of successful language adaptation.

I checked the model card and saw that the model was fine-tuned from nvidia/magpie_tts_multilingual_357m using NeMo, with full SFT, a learning rate of 2e-5, 37 epochs, bf16-mixed precision, and the following NeMo commit:

3d73c48aca1ae3be44657267b81f25dc3201161a

Would you be willing to share the exact fine-tuning setup you used?

Specifically, it would be very helpful if you could share:

  1. The exact magpietts.yaml / Hydra config used for training
  2. The full training command with all overrides
  3. Whether you modified any files in the NeMo repo
  4. If so, the changed files, a patch, or a commit diff
  5. The exact dataset manifest format you used
  6. Whether you precomputed target_audio_codes_path and context_audio_codes_path
  7. How you selected context_audio_filepath and context_text for each sample
  8. Which tokenizer configuration you used for Georgian
  9. Whether you used google/byt5-small as a byte-level tokenizer or made any language-specific tokenizer changes
  10. Whether you changed alignment_loss_scale, prior_scaling_factor, cfg_unconditional_prob, context_duration_min/max, or any decoder settings
  11. Whether you used trainer.precision=32 first and later switched to bf16-mixed, or trained directly with bf16-mixed
  12. Any inference settings that helped avoid repetitions or artifacts, such as temperature, topk, cfg_scale, max_decoder_steps, or use_local_transformer_for_inference
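
To make items 5 to 7 concrete, here is a minimal sketch of the kind of manifest sanity check I have in mind. It assumes a JSON-lines manifest with one entry per line. The field names `context_audio_filepath`, `context_text`, `target_audio_codes_path`, and `context_audio_codes_path` come from the questions above; `audio_filepath` and `text`, the helper name `find_manifest_problems`, and the idea that all six fields are required are my own assumptions, not something confirmed by the model card or by NeMo.

```python
import json

# Field names from the questions above, plus "audio_filepath" and "text".
# Whether the Magpie-TTS dataset loader requires all of them is an assumption.
ASSUMED_FIELDS = (
    "audio_filepath",
    "text",
    "context_audio_filepath",
    "context_text",
    "target_audio_codes_path",
    "context_audio_codes_path",
)


def find_manifest_problems(lines, required=ASSUMED_FIELDS):
    """Return (line_number, message) pairs for malformed manifest entries."""
    problems = []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        try:
            entry = json.loads(line)
        except json.JSONDecodeError as err:
            problems.append((number, f"invalid JSON: {err.msg}"))
            continue
        if not isinstance(entry, dict):
            problems.append((number, "entry is not a JSON object"))
            continue
        # A field counts as missing if it is absent, empty, or null.
        missing = [k for k in required if k not in entry or entry[k] in ("", None)]
        if missing:
            problems.append((number, "missing fields: " + ", ".join(missing)))
    return problems
```

Running it over an open manifest file, for example `find_manifest_problems(open("train_manifest.json", encoding="utf-8"))` with a hypothetical file name, returns an empty list when every entry carries all of the assumed fields. If your manifest uses a different set of fields, passing them via `required` would tell me quickly where my data preparation differs from yours.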

I am asking because my fine-tuned model trains, but the generated audio sometimes has artifacts, repeated words, or duplicated segments. I want to understand whether the issue is coming from my data preparation, tokenizer setup, cached codec extraction, NeMo version, training config, or inference settings.

Thanks again for releasing the model. It would really help others who are trying to adapt Magpie-TTS to low-resource or unsupported languages.

Best regards,
Tauqeer