Model Card for yourtts-htia-240704

yourtts-htia-240704 is an experimental Taiwanese Hakka text-to-speech (TTS) model based on YourTTS.

The model is designed for synthesizing Taiwanese Hakka speech and is part of the VoxHakka project. For more details, audio samples, and system information, please refer to the project page.

This checkpoint was trained on multi-speaker speech data covering the following Taiwanese Hakka dialects:

Sixian
Hailu
Dapu
Raoping
Zhaoan

Model Details

Architecture: YourTTS
Task: Text-to-speech
Language: Taiwanese Hakka (hak)
Supported dialects: Sixian, Hailu, Dapu, Raoping, and Zhaoan
Sample rate: 22,050 Hz
Training data: Multi-speaker Taiwanese Hakka speech data from more than 19 speakers
Speaker conditioning: Speaker encoder
Language conditioning: Language embedding
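
The 22,050 Hz sample rate fixes how sample counts map to clip length and how output should be saved. The sketch below is illustrative only and uses just the Python standard library: it computes a clip's duration and writes float samples in [-1, 1] as a mono 16-bit PCM WAV file. The helper names `duration_seconds` and `write_pcm16` are hypothetical and not part of the model or the Space.

```python
import struct
import wave

SAMPLE_RATE = 22050  # from the model card; the helpers below are illustrative


def duration_seconds(num_samples: int, sample_rate: int = SAMPLE_RATE) -> float:
    """Length of a mono clip in seconds."""
    return num_samples / sample_rate


def write_pcm16(path: str, samples: list[float], sample_rate: int = SAMPLE_RATE) -> None:
    """Write float samples in [-1.0, 1.0] as a mono 16-bit PCM WAV file."""
    # Clip to the valid range, then scale to the signed 16-bit integer range.
    ints = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)  # mono
        wav_file.setsampwidth(2)  # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(struct.pack(f"<{len(ints)}h", *ints))
```

For example, a waveform of 44,100 samples at this rate lasts `duration_seconds(44100)`, which is 2.0 seconds.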

Intended Use

This model is intended for:

Taiwanese Hakka speech synthesis research
Taiwanese Hakka language technology development
Educational and non-commercial demonstrations
Experiments on multi-speaker and dialect-aware text-to-speech

Commercial use is not permitted: the model is released under the CC BY-NC 4.0 license.

Usage

Local Demo

The recommended way to run this model locally is to use the official Space implementation, since it includes the Taiwanese Hakka G2P frontend and the required YourTTS configuration patch.

git clone https://huggingface.co/spaces/united-link/taiwanese-hakka-tts
cd taiwanese-hakka-tts
pip install -r requirements.txt
python app.py

Programmatic Inference

The following example is adapted from the Space implementation. It assumes that you run the script inside the cloned Space repository so that replace/tts.py and the required dependencies are available.

import os
import re

import numpy as np
import torch
import TTS
from formog2p.hakka import g2p
from huggingface_hub import snapshot_download
from scipy.io.wavfile import write as write_wav
from TTS.utils.synthesizer import Synthesizer

from replace.tts import ChangedVitsConfig

TTS.tts.configs.vits_config.VitsConfig = ChangedVitsConfig

MODEL_ID = "formospeech/yourtts-htia-240704"

# This example uses Sixian Taiwanese Hakka.
DIALECT = "sixian"
G2P_DIALECT = "hak_sx"

# Example default speaker.
SPEAKER_NAME = "XF"


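def parse_ipa(ipa: str, delete_chars: str = r"\+\-\|\_", as_space: str = "") -> list[str]:
    # Split at digit/non-digit boundaries so tone numbers stay whole tokens;
    # all other text is emitted one character at a time.
    text = []
    ipa_list = re.split(r"(?<![\d])(?=[\d])|(?<=[\d])(?![\d])", ipa)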

    for word in ipa_list:
        if word.isdigit():
            text.append(word)
        else:
            if len(as_space) > 0:
                word = re.sub(r"[{}]".format(as_space), " ", word)
            if len(delete_chars) > 0:
                word = re.sub(r"[{}]".format(delete_chars), "", word)

            word = word.replace("，", " ， ")
            text.extend(word)

    return text


def load_model(model_id: str = MODEL_ID) -> Synthesizer:
    model_dir = snapshot_download(model_id)

    config_file_path = os.path.join(model_dir, "config.json")
    model_ckpt_path = os.path.join(model_dir, "model.pth")
    speaker_file_path = os.path.join(model_dir, "speakers.pth")
    language_file_path = os.path.join(model_dir, "language_ids.json")
    speaker_embedding_file_path = os.path.join(model_dir, "speaker_embs.pth")

    temp_config_path = "temp_config.json"

    with open(config_file_path, "r", encoding="utf-8") as f:
        content = f.read()

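    # Point the config at the downloaded files. Forward slashes keep the JSON
    # valid on Windows, where os.path.join would otherwise insert backslashes.
    content = content.replace("speakers.pth", speaker_file_path.replace(os.sep, "/"))
    content = content.replace("language_ids.json", language_file_path.replace(os.sep, "/"))
    content = content.replace("speaker_embs.pth", speaker_embedding_file_path.replace(os.sep, "/"))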

    with open(temp_config_path, "w", encoding="utf-8") as f:
        f.write(content)

    return Synthesizer(
        tts_checkpoint=model_ckpt_path,
        tts_config_path=temp_config_path,
        use_cuda=torch.cuda.is_available(),
    )


def synthesize(
    text: str,
    output_path: str = "output.wav",
    speed: float = 1.0,
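) -> str:
    # Note: this reloads the model on every call. For repeated synthesis,
    # call load_model() once and reuse the returned Synthesizer.
    model = load_model()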

    result = g2p(text, G2P_DIALECT, include_eng=True)

    if len(result.unknown_words) > 0:
        raise ValueError(
            f"The following words could not be converted to IPA: "
            f"{', '.join(result.unknown_words)}"
        )

    parsed_ipa = [p.replace(" ", "|") for p in result.pronunciations]
    parsed_ipa = parse_ipa(" ".join(parsed_ipa))

    # Larger values produce slower speech.
    model.tts_model.length_scale = speed

    wav = model.tts(
        parsed_ipa,
        speaker_name=SPEAKER_NAME,
        language_name=DIALECT,
        split_sentences=False,
    )

    sample_rate = model.tts_model.config.audio.sample_rate
    wav = np.asarray(wav, dtype=np.float32)

    write_wav(output_path, sample_rate, wav)
    return output_path


if __name__ == "__main__":
    synthesize(
        text="食飯愛正經食，正毋會食到半出半入",
        output_path="output.wav",
        speed=1.0,
    )
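
In the script above, load_model patches config.json by plain text replacement. An alternative is to parse the file and let the JSON encoder handle escaping, which also avoids rewriting unrelated substrings. The sketch below is one possible approach, not the Space's implementation. It assumes that config.json parses with Python's `json` module and that the three file names appear as string values (possibly nested). The helper names `absolutize_paths` and `write_patched_config` are hypothetical.

```python
import json
import os
from typing import Any


def absolutize_paths(node: Any, model_dir: str, filenames: set[str]) -> Any:
    """Return a copy of parsed JSON with matching file names made absolute.

    A string value is rewritten only when its base name is in `filenames`.
    """
    if isinstance(node, dict):
        return {k: absolutize_paths(v, model_dir, filenames) for k, v in node.items()}
    if isinstance(node, list):
        return [absolutize_paths(v, model_dir, filenames) for v in node]
    if isinstance(node, str) and os.path.basename(node) in filenames:
        return os.path.join(model_dir, os.path.basename(node))
    return node


def write_patched_config(config_path: str, model_dir: str, out_path: str) -> None:
    """Write a copy of the config whose auxiliary files point into model_dir."""
    names = {"speakers.pth", "language_ids.json", "speaker_embs.pth"}
    with open(config_path, "r", encoding="utf-8") as f:
        config = json.load(f)
    with open(out_path, "w", encoding="utf-8") as f:
        # json.dump escapes backslashes, so Windows paths remain valid JSON.
        json.dump(absolutize_paths(config, model_dir, names), f, ensure_ascii=False, indent=2)
```

If the config does not parse as strict JSON, the text replacement used in the script remains the safer route.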

Input Format

The model is intended for Taiwanese Hakka text. The official Space uses formog2p.hakka.g2p to convert Taiwanese Hakka text into the phonetic representation expected by the model.
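
To make the token format concrete, the sketch below repeats the parse_ipa helper from the inference example and runs it on a made-up string. The input is an assumption for illustration, not real formog2p output: "|" joins syllables within a word and a space separates words, mirroring how the example script joins pronunciations. Tone digits stay together as one token, delimiter characters are dropped, and everything else is split into single characters.

```python
import re


def parse_ipa(ipa: str, delete_chars: str = r"\+\-\|\_", as_space: str = "") -> list[str]:
    text = []
    # Zero-width split at every boundary between digits and non-digits.
    ipa_list = re.split(r"(?<![\d])(?=[\d])|(?<=[\d])(?![\d])", ipa)

    for word in ipa_list:
        if word.isdigit():
            text.append(word)  # keep a tone number such as "55" whole
        else:
            if len(as_space) > 0:
                word = re.sub(r"[{}]".format(as_space), " ", word)
            if len(delete_chars) > 0:
                word = re.sub(r"[{}]".format(delete_chars), "", word)

            word = word.replace("，", " ， ")
            text.extend(word)  # one token per remaining character

    return text


print(parse_ipa("a11|b5"))  # ['a', '11', 'b', '5']

# Made-up input, not real G2P output.
tokens = parse_ipa("ngai11|oi55 siid5，fan55")
print(tokens)
```

Note that the full-width comma is padded with spaces, so it reaches the model as a separate token surrounded by space tokens.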

If the G2P frontend cannot convert some input words, inference may fail; the example script above raises a ValueError that lists the unconverted words. In that case, try rewriting the sentence with supported Taiwanese Hakka words or orthography.
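
When preparing many sentences, it can help to check G2P coverage before synthesis rather than failing on the first error. The sketch below is a hedged illustration: `find_unknown_words` is a hypothetical helper that accepts any converter returning an object with an `unknown_words` attribute, as the result of `g2p` does in the example script. `StubResult` and `stub_g2p` are stand-ins with a made-up lexicon so the snippet runs without formog2p installed.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class StubResult:
    """Stand-in for the G2P result; only the two attributes used above."""
    pronunciations: list[str] = field(default_factory=list)
    unknown_words: list[str] = field(default_factory=list)


# Made-up lexicon with a placeholder pronunciation, for illustration only.
KNOWN = {"食飯": "<ipa-placeholder>"}


def stub_g2p(text: str) -> StubResult:
    """Hypothetical converter: whitespace-separated words, dictionary lookup."""
    words = text.split()
    return StubResult(
        pronunciations=[KNOWN[w] for w in words if w in KNOWN],
        unknown_words=[w for w in words if w not in KNOWN],
    )


def find_unknown_words(texts: list[str], convert: Callable) -> dict[str, list[str]]:
    """Map each sentence with unconverted words to the list of those words."""
    report = {}
    for text in texts:
        result = convert(text)
        if len(result.unknown_words) > 0:
            report[text] = list(result.unknown_words)
    return report


print(find_unknown_words(["食飯", "食飯 xyz"], stub_g2p))
```

With the real frontend, the converter could be `lambda t: g2p(t, "hak_sx", include_eng=True)`, matching the call in the example script; sentences that appear in the report are the ones to rewrite.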

Limitations

This is an experimental Taiwanese Hakka TTS model.
Output quality may vary by dialect, speaker, sentence style, and G2P coverage.
The model is expected to work best on Taiwanese Hakka text similar to the training data.
The model is not designed for Mandarin Chinese, general Chinese TTS, or non-Hakka languages.
As with other voice synthesis systems, users should avoid misleading, deceptive, or unauthorized voice impersonation use cases.

License

This model is released under the CC BY-NC 4.0 license.

By downloading or using the public release of this model, you agree to comply with the terms and conditions of the CC BY-NC 4.0 license.

Commercial use is not permitted under this license.

Citation

If you use this model, please cite the following paper:

@article{chen2024voxhakka,
  title={VoxHakka: A Dialectally Diverse Multi-speaker Text-to-Speech System for Taiwanese Hakka},
  author={Chen, Li-Wei and Lee, Hung-Shin and Chang, Chen-Chi},
  journal={arXiv preprint arXiv:2409.01548},
  year={2024}
}

Paper for formospeech/yourtts-htia-240704

VoxHakka: A Dialectally Diverse Multi-speaker Text-to-Speech System for Taiwanese Hakka

Paper • 2409.01548 • Published Sep 3, 2024