
Model Card for yourtts-htia-240704

yourtts-htia-240704 is an experimental Taiwanese Hakka text-to-speech (TTS) model based on YourTTS.

The model is designed for synthesizing Taiwanese Hakka speech and is part of the VoxHakka project. For more details, audio samples, and system information, please refer to the project page.

This checkpoint was trained on multi-speaker speech data covering the following Taiwanese Hakka dialects:

  • Sixian
  • Hailu
  • Dapu
  • Raoping
  • Zhaoan

Model Details

  • Architecture: YourTTS
  • Task: Text-to-speech
  • Language: Taiwanese Hakka (hak)
  • Supported dialects: Sixian, Hailu, Dapu, Raoping, and Zhaoan
  • Sample rate: 22,050 Hz
  • Training data: Multi-speaker Taiwanese Hakka speech data from more than 19 speakers
  • Speaker conditioning: Speaker encoder
  • Language conditioning: Language embedding

Intended Use

This model is intended for:

  • Taiwanese Hakka speech synthesis research
  • Taiwanese Hakka language technology development
  • Educational and non-commercial demonstrations
  • Experiments on multi-speaker and dialect-aware text-to-speech

Commercial use is not permitted under the model's CC BY-NC 4.0 license.

Usage

Local Demo

The recommended way to run this model locally is to use the official Space implementation, since it includes the Taiwanese Hakka G2P frontend and the required YourTTS configuration patch.

git clone https://huggingface.co/spaces/united-link/taiwanese-hakka-tts
cd taiwanese-hakka-tts
pip install -r requirements.txt
python app.py

Programmatic Inference

The following example is adapted from the Space implementation. It assumes that you run the script inside the cloned Space repository so that replace/tts.py and the required dependencies are available.

import os
import re

import numpy as np
import torch
import TTS
from formog2p.hakka import g2p
from huggingface_hub import snapshot_download
from scipy.io.wavfile import write as write_wav
from TTS.utils.synthesizer import Synthesizer

from replace.tts import ChangedVitsConfig

# Apply the Space's configuration patch: swap in its modified VITS config class.
TTS.tts.configs.vits_config.VitsConfig = ChangedVitsConfig

MODEL_ID = "formospeech/yourtts-htia-240704"

# This example uses Sixian Taiwanese Hakka.
DIALECT = "sixian"
G2P_DIALECT = "hak_sx"

# Example default speaker.
SPEAKER_NAME = "XF"


def parse_ipa(ipa: str, delete_chars: str = r"\+\-\|\_", as_space: str = "") -> list[str]:
    text = []
    ipa_list = re.split(r"(?<![\d])(?=[\d])|(?<=[\d])(?![\d])", ipa)

    for word in ipa_list:
        if word.isdigit():
            text.append(word)
        else:
            if len(as_space) > 0:
                word = re.sub(r"[{}]".format(as_space), " ", word)
            if len(delete_chars) > 0:
                word = re.sub(r"[{}]".format(delete_chars), "", word)

            word = word.replace(",", " , ")
            text.extend(word)

    return text


def load_model(model_id: str = MODEL_ID) -> Synthesizer:
    model_dir = snapshot_download(model_id)

    config_file_path = os.path.join(model_dir, "config.json")
    model_ckpt_path = os.path.join(model_dir, "model.pth")
    speaker_file_path = os.path.join(model_dir, "speakers.pth")
    language_file_path = os.path.join(model_dir, "language_ids.json")
    speaker_embedding_file_path = os.path.join(model_dir, "speaker_embs.pth")

    # A patched copy of the config is written to the current working directory.
    temp_config_path = "temp_config.json"

    with open(config_file_path, "r", encoding="utf-8") as f:
        content = f.read()

    # Point the config's references to auxiliary files at the downloaded snapshot.
    content = content.replace("speakers.pth", speaker_file_path)
    content = content.replace("language_ids.json", language_file_path)
    content = content.replace("speaker_embs.pth", speaker_embedding_file_path)

    with open(temp_config_path, "w", encoding="utf-8") as f:
        f.write(content)

    return Synthesizer(
        tts_checkpoint=model_ckpt_path,
        tts_config_path=temp_config_path,
        use_cuda=torch.cuda.is_available(),
    )


def synthesize(
    text: str,
    output_path: str = "output.wav",
    speed: float = 1.0,
) -> str:
    model = load_model()

    result = g2p(text, G2P_DIALECT, include_eng=True)

    if len(result.unknown_words) > 0:
        raise ValueError(
            f"The following words could not be converted to IPA: "
            f"{', '.join(result.unknown_words)}"
        )

    parsed_ipa = [p.replace(" ", "|") for p in result.pronunciations]
    parsed_ipa = parse_ipa(" ".join(parsed_ipa))

    # `speed` is used as the model's length_scale, so values above 1.0 slow the
    # speech down and values below 1.0 speed it up.
    model.tts_model.length_scale = speed

    wav = model.tts(
        parsed_ipa,
        speaker_name=SPEAKER_NAME,
        language_name=DIALECT,
        split_sentences=False,
    )

    sample_rate = model.tts_model.config.audio.sample_rate
    wav = np.asarray(wav, dtype=np.float32)

    write_wav(output_path, sample_rate, wav)
    return output_path


if __name__ == "__main__":
    synthesize(
        text="食飯愛正經食,正毋會食到半出半入",
        output_path="output.wav",
        speed=1.0,
    )
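
Because the example assigns `speed` directly to `length_scale`, a value of 2.0 should lengthen the utterance rather than shorten it. If a rate-style control is preferred, one option is to invert the value before assigning it. The following is a minimal sketch, assuming that `length_scale` scales predicted durations linearly, as in VITS; the helper name `rate_to_length_scale` and its clamp bounds are hypothetical and not part of the Space implementation.

```python
def rate_to_length_scale(
    rate: float,
    min_rate: float = 0.5,
    max_rate: float = 2.0,
) -> float:
    """Convert a speaking-rate multiplier into a length_scale value.

    rate > 1.0 means faster speech, so the returned length_scale is < 1.0.
    The clamp bounds are illustrative assumptions, not values from the model.
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    # Keep the rate inside a conservative range before inverting it.
    clamped = min(max(rate, min_rate), max_rate)
    return 1.0 / clamped
```

Inside `synthesize`, this could be used as `model.tts_model.length_scale = rate_to_length_scale(1.25)` to request speech about 25% faster than the default.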

Input Format

The model is intended for Taiwanese Hakka text. The official Space uses formog2p.hakka.g2p to convert Taiwanese Hakka text into the phonetic representation expected by the model.
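
To make the token format more concrete, the sketch below reuses `parse_ipa` from the Programmatic Inference example on a made-up string. The input is a placeholder chosen only to show the splitting behaviour; it is not real `formog2p` output, and reading the digit runs as tone numbers is an assumption.

```python
import re


def parse_ipa(ipa: str, delete_chars: str = r"\+\-\|\_", as_space: str = "") -> list[str]:
    """Copy of the tokenizer from the inference example."""
    text = []
    # Split at every boundary between a digit run and a non-digit run.
    ipa_list = re.split(r"(?<![\d])(?=[\d])|(?<=[\d])(?![\d])", ipa)

    for word in ipa_list:
        if word.isdigit():
            # Digit runs (presumably tone numbers) stay whole, e.g. "24".
            text.append(word)
        else:
            if as_space:
                word = re.sub(r"[{}]".format(as_space), " ", word)
            if delete_chars:
                word = re.sub(r"[{}]".format(delete_chars), "", word)
            word = word.replace(",", " , ")
            # Every remaining character becomes its own token.
            text.extend(word)

    return text


# Placeholder input, NOT real formog2p output: two units joined by "|",
# followed by a space and a third unit.
tokens = parse_ipa("a24|b11 c5")
print(tokens)  # ['a', '24', 'b', '11', ' ', 'c', '5']
```

The `|` separator is dropped, the space is kept as a token of its own, and each digit run is passed through as a single token.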

If the G2P frontend cannot convert some input words, it reports them as unknown words, and the example above raises a ValueError instead of synthesizing audio. In that case, try rewriting the sentence with supported Taiwanese Hakka words or orthography.
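
When preparing many sentences, it can help to find the unconvertible words up front rather than failing on the first one. The following is a minimal sketch, assuming only that the G2P result exposes an `unknown_words` attribute as used in the inference example. The helper `find_unconvertible`, the `FakeResult` class, and `fake_g2p` are hypothetical; the fake frontend exists only so the sketch runs without `formog2p` installed.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Iterable


def find_unconvertible(
    sentences: Iterable[str],
    g2p_fn: Callable[[str], Any],
) -> dict[str, list[str]]:
    """Return {sentence: unknown words} for sentences that are not fully convertible."""
    problems: dict[str, list[str]] = {}
    for sentence in sentences:
        result = g2p_fn(sentence)
        if result.unknown_words:
            problems[sentence] = list(result.unknown_words)
    return problems


# Stand-in for the real frontend (hypothetical behaviour, for illustration only).
@dataclass
class FakeResult:
    pronunciations: list[str] = field(default_factory=list)
    unknown_words: list[str] = field(default_factory=list)


def fake_g2p(text: str) -> FakeResult:
    # Pretend that every ASCII word is unknown to the frontend.
    unknown = [word for word in text.split() if word.isascii()]
    return FakeResult(unknown_words=unknown)


print(find_unconvertible(["食飯", "食飯 pizza"], fake_g2p))
# {'食飯 pizza': ['pizza']}
```

With the real frontend, the callable could be `lambda s: g2p(s, "hak_sx", include_eng=True)`, matching the call in the inference example.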

Limitations

  • This is an experimental Taiwanese Hakka TTS model.
  • Output quality may vary by dialect, speaker, sentence style, and G2P coverage.
  • The model is expected to work best on Taiwanese Hakka text similar to the training data.
  • The model is not designed for Mandarin Chinese, general Chinese TTS, or non-Hakka languages.
  • As with other voice synthesis systems, users should avoid misleading, deceptive, or unauthorized voice impersonation use cases.

License

This model is released under the CC BY-NC 4.0 license.

By downloading or using the public release of this model, you agree to comply with the terms and conditions of the CC BY-NC 4.0 license.

Commercial use is not permitted under this license.

Citation

If you use this model, please cite the following paper:

@article{chen2024voxhakka,
  title={VoxHakka: A Dialectally Diverse Multi-speaker Text-to-Speech System for Taiwanese Hakka},
  author={Chen, Li-Wei and Lee, Hung-Shin and Chang, Chen-Chi},
  journal={arXiv preprint arXiv:2409.01548},
  year={2024}
}