---
license: cc-by-nc-4.0
language:
- hak
pipeline_tag: text-to-speech
tags:
- yourtts
- hakka
- taiwanese-hakka
- text-to-speech
- arxiv:2409.01548
---

# Model Card for yourtts-htia-240704

[![Project Page](https://img.shields.io/badge/Project-Page-green)](https://voxhakka.github.io/)
[![Demo](https://img.shields.io/badge/🤗%20Demo-Hugging%20Face%20Space-blue)](https://huggingface.co/spaces/united-link/taiwanese-hakka-tts)
[![Paper](https://img.shields.io/badge/arXiv-2409.01548-b31b1b.svg)](https://arxiv.org/abs/2409.01548)
[![License](https://img.shields.io/badge/License-CC--BY--NC--4.0-lightgrey.svg)](https://creativecommons.org/licenses/by-nc/4.0/)

`yourtts-htia-240704` is an experimental **Taiwanese Hakka text-to-speech (TTS)** model based on YourTTS.

The model is designed for synthesizing **Taiwanese Hakka** speech and is part of the VoxHakka project. For more details, audio samples, and system information, please refer to the [project page](https://voxhakka.github.io/).

This checkpoint was trained on multi-speaker speech data covering the following Taiwanese Hakka dialects:

- Sixian
- Hailu
- Dapu
- Raoping
- Zhaoan

## Model Details

- **Architecture:** YourTTS
- **Task:** Text-to-speech
- **Language:** Taiwanese Hakka (`hak`)
- **Supported dialects:** Sixian, Hailu, Dapu, Raoping, and Zhaoan
- **Sample rate:** 22,050 Hz
- **Training data:** Multi-speaker Taiwanese Hakka speech data from more than 19 speakers
- **Speaker conditioning:** Speaker encoder
- **Language conditioning:** Language embedding

## Intended Use

This model is intended for:

- Taiwanese Hakka speech synthesis research
- Taiwanese Hakka language technology development
- Educational and non-commercial demonstrations
- Experiments on multi-speaker and dialect-aware text-to-speech

This model is **not intended for commercial use** under the CC BY-NC 4.0 license.
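Model Details lists a sample rate of 22,050 Hz. As a minimal sketch that is not part of the official Space, the standard-library helpers below write mono float samples as 16-bit PCM at that rate and read the rate and duration back. The names `write_pcm16`, `wav_info`, and `sine_tone` are hypothetical, the samples are assumed to be floats in [-1.0, 1.0], and the sine tone only stands in for synthesized audio.

```python
import math
import struct
import wave

SAMPLE_RATE = 22_050  # Hz, as listed under Model Details


def write_pcm16(path, samples, sample_rate=SAMPLE_RATE):
    """Write float samples (assumed to lie in [-1.0, 1.0]) as mono 16-bit PCM."""
    clipped = [max(-1.0, min(1.0, s)) for s in samples]
    data = struct.pack(f"<{len(clipped)}h", *(int(s * 32767) for s in clipped))
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)   # mono
        wf.setsampwidth(2)   # 2 bytes = 16 bits per sample
        wf.setframerate(sample_rate)
        wf.writeframes(data)


def wav_info(path):
    """Return (sample_rate, duration_in_seconds) of a PCM WAV file."""
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        return rate, wf.getnframes() / rate


def sine_tone(seconds=0.5, freq_hz=440.0, sample_rate=SAMPLE_RATE):
    """Placeholder audio: a quiet sine tone standing in for model output."""
    n = int(seconds * sample_rate)
    return [0.3 * math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n)]
```

For example, `write_pcm16("tone.wav", sine_tone())` followed by `wav_info("tone.wav")` reports the sample rate and the clip length in seconds. Note that the standard `wave` module handles PCM files only, so `wav_info` is not expected to read WAV files stored as 32-bit float.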
## Usage

### Local Demo

The recommended way to run this model locally is to use the official Space implementation, since it includes the Taiwanese Hakka G2P frontend and the required YourTTS configuration patch.

```bash
git clone https://huggingface.co/spaces/united-link/taiwanese-hakka-tts
cd taiwanese-hakka-tts
pip install -r requirements.txt
python app.py
```

### Programmatic Inference

The following example is adapted from the Space implementation. It assumes that you run the script inside the cloned Space repository so that `replace/tts.py` and the required dependencies are available.

```python
import os
import re

import numpy as np
import torch
import TTS
from formog2p.hakka import g2p
from huggingface_hub import snapshot_download
from scipy.io.wavfile import write as write_wav
from TTS.utils.synthesizer import Synthesizer

from replace.tts import ChangedVitsConfig

TTS.tts.configs.vits_config.VitsConfig = ChangedVitsConfig

MODEL_ID = "formospeech/yourtts-htia-240704"

# This example uses Sixian Taiwanese Hakka.
DIALECT = "sixian"
G2P_DIALECT = "hak_sx"

# Example default speaker.
SPEAKER_NAME = "XF"


def parse_ipa(ipa: str, delete_chars=r"\+\-\|\_", as_space: str = "") -> list[str]:
    text = []
    # The rest of the split pattern and the start of the loop are missing from
    # this text; see the Space implementation for the complete function.
    ipa_list = re.split(r"(? […] 0:
            word = re.sub(r"[{}]".format(as_space), " ", word)
        if len(delete_chars) > 0:
            word = re.sub(r"[{}]".format(delete_chars), "", word)
        word = word.replace(",", " , ")
        text.extend(word)
    return text


def load_model(model_id: str = MODEL_ID) -> Synthesizer:
    model_dir = snapshot_download(model_id)

    config_file_path = os.path.join(model_dir, "config.json")
    model_ckpt_path = os.path.join(model_dir, "model.pth")
    speaker_file_path = os.path.join(model_dir, "speakers.pth")
    language_file_path = os.path.join(model_dir, "language_ids.json")
    speaker_embedding_file_path = os.path.join(model_dir, "speaker_embs.pth")

    temp_config_path = "temp_config.json"

    with open(config_file_path, "r", encoding="utf-8") as f:
        content = f.read()

    content = content.replace("speakers.pth", speaker_file_path)
    content = content.replace("language_ids.json", language_file_path)
    content = content.replace("speaker_embs.pth", speaker_embedding_file_path)

    with open(temp_config_path, "w", encoding="utf-8") as f:
        f.write(content)

    return Synthesizer(
        tts_checkpoint=model_ckpt_path,
        tts_config_path=temp_config_path,
        use_cuda=torch.cuda.is_available(),
    )


def synthesize(
    text: str,
    output_path: str = "output.wav",
    speed: float = 1.0,
):
    model = load_model()

    result = g2p(text, G2P_DIALECT, include_eng=True)

    if len(result.unknown_words) > 0:
        raise ValueError(
            f"The following words could not be converted to IPA: "
            f"{', '.join(result.unknown_words)}"
        )

    parsed_ipa = [p.replace(" ", "|") for p in result.pronunciations]
    parsed_ipa = parse_ipa(" ".join(parsed_ipa))

    # Larger values produce slower speech.
    model.tts_model.length_scale = speed

    wav = model.tts(
        parsed_ipa,
        speaker_name=SPEAKER_NAME,
        language_name=DIALECT,
        split_sentences=False,
    )

    sample_rate = model.tts_model.config.audio.sample_rate
    wav = np.asarray(wav, dtype=np.float32)
    write_wav(output_path, sample_rate, wav)

    return output_path


if __name__ == "__main__":
    synthesize(
        text="食飯愛正經食,正毋會食到半出半入",
        output_path="output.wav",
        speed=1.0,
    )
```

## Input Format

The model is intended for **Taiwanese Hakka text**.

The official Space uses `formog2p.hakka.g2p` to convert Taiwanese Hakka text into the phonetic representation expected by the model.

If some input words cannot be converted by the G2P frontend, inference may fail. In that case, try rewriting the sentence with supported Taiwanese Hakka words or orthography.

## Limitations

- This is an experimental Taiwanese Hakka TTS model.
- Output quality may vary by dialect, speaker, sentence style, and G2P coverage.
- The model is expected to work best on Taiwanese Hakka text similar to the training data.
- The model is not designed for Mandarin Chinese, general Chinese TTS, or non-Hakka languages.
- As with other voice synthesis systems, users should avoid misleading, deceptive, or unauthorized voice impersonation use cases.

## License

This model is released under the [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/) license.

By downloading or using the public release of this model, you agree to comply with the terms and conditions of the CC BY-NC 4.0 license. Commercial use is not permitted under this license.

## Citation

If you use this model, please cite the following paper:

```bibtex
@article{chen2024voxhakka,
  title={VoxHakka: A Dialectally Diverse Multi-speaker Text-to-Speech System for Taiwanese Hakka},
  author={Chen, Li-Wei and Lee, Hung-Shin and Chang, Chen-Chi},
  journal={arXiv preprint arXiv:2409.01548},
  year={2024}
}
```