
license: gemma
license_name: license
license_link: LICENSE
base_model:
  - ModelSpace/GemmaX2-28-2B-v0.1
pipeline_tag: translation
library_name: transformers
tags:
  - text-generation
language:
  - de
  - en
  - fr
  - es
datasets:
  - iisys-hof/olaph-data

Model Summary

OLaPhLLM is a large language model for phonemization, finetuned from GemmaX2-28-2B-v0.1. Its tokenizer was extended with phoneme tokens derived from a BPE tokenizer trained on phoneme sequences generated by the OLaPh framework.
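To illustrate how phoneme tokens can be derived with BPE, here is a minimal pure-Python sketch of BPE merge learning on phoneme strings. It is not the authors' training code: the toy corpus, the function names `merge_pair` and `learn_bpe_merges`, and the choice to treat each Unicode character as one starting symbol are all assumptions made for illustration.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def learn_bpe_merges(phoneme_strings, num_merges):
    """Learn up to `num_merges` BPE merges (hypothetical helper).

    Each string is split into single characters first, which is a
    simplification: real IPA can use multi-character phonemes and diacritics.
    """
    corpus = [list(s) for s in phoneme_strings]
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols in corpus:
            pair_counts.update(zip(symbols, symbols[1:]))
        if not pair_counts:
            break
        # Most frequent pair wins; ties are broken lexicographically
        # so the result is deterministic.
        best = min(pair_counts, key=lambda p: (-pair_counts[p], p))
        merges.append(best)
        corpus = [merge_pair(symbols, best) for symbols in corpus]
    return merges


# Illustrative toy corpus, not taken from the OLaPh data.
toy_corpus = ["ʃɪp", "ʃɪn", "fɪʃ"]
merges = learn_bpe_merges(toy_corpus, num_merges=1)
new_tokens = [a + b for a, b in merges]
print(merges)      # [('ʃ', 'ɪ')]
print(new_tokens)  # ['ʃɪ']
```

In Transformers, tokens like these are typically added to an existing tokenizer with `tokenizer.add_tokens(...)`, followed by `model.resize_token_embeddings(len(tokenizer))`. Whether the authors used these exact calls is not stated here.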

The model was then finetuned for grapheme-to-phoneme conversion on a multilingual dataset (English, French, German, Spanish). The dataset was created by phonemizing text from HuggingFaceFW/fineweb and HuggingFaceFW/fineweb-2 with the OLaPh framework and adding lexicon words taken from OLaPh. The training set comprises 650,000 sentence pairs per language, supplemented by 100,000 randomly selected, isolated lexicon entries per language, for a total of 3 million training examples.
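The stated total can be checked with a few lines of arithmetic. The constant names below are illustrative; the figures are the ones given above.

```python
LANGUAGES = ["English", "French", "German", "Spanish"]
SENTENCE_PAIRS_PER_LANGUAGE = 650_000
LEXICON_ENTRIES_PER_LANGUAGE = 100_000

sentence_total = SENTENCE_PAIRS_PER_LANGUAGE * len(LANGUAGES)
lexicon_total = LEXICON_ENTRIES_PER_LANGUAGE * len(LANGUAGES)
total_examples = sentence_total + lexicon_total

print(f"{sentence_total:,}")  # 2,600,000
print(f"{lexicon_total:,}")   # 400,000
print(f"{total_examples:,}")  # 3,000,000
```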

Finetuned By: Institute for Information Systems at Hof University
Model type: Text-To-Text
Dataset: OLaPh Phonemization Dataset v2
Language(s): English, French, German, Spanish
License: Gemma (Gemma is provided under and subject to the Gemma Terms of Use found at ai.google.dev/gemma/terms)
Release Date: May 8, 2026

Evaluation

The reported values are Phone Error Rate (PER, lower is better) on the WikiPron dataset.

Compared Models/Frameworks:

eSpeak NG
Gruut
charsiu/g2p_multilingual_byT5_tiny_16_layers_100
OLaPh Framework
OLaPh LLM (this)

Language	espeak	gruut	byt5	olaph	olaph_llm
de	0.17594	0.17558	0.27864	0.04302	0.13518
en_uk	0.14117	0.19174	0.13036	0.08749	0.16321
en_us	0.14588	0.16545	0.16345	0.10491	0.14713
es	0.04324	0.04210	0.05436	0.02582	0.03179
fr	0.06203	0.04045	0.09217	0.03143	0.08591
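For reference, PER is commonly computed as the edit distance between predicted and reference phone sequences, divided by the number of reference phones. The sketch below follows that common definition. It is not the evaluation script behind the table: the corpus-level aggregation, the function names, and the example transcriptions are assumptions.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences of phone symbols."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference phone
                curr[j - 1] + 1,     # insertion of an extra phone
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def phone_error_rate(references, hypotheses):
    """Corpus-level PER: total edit distance over total reference phones."""
    total_errors = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total_phones = sum(len(r) for r in references)
    return total_errors / total_phones


# Illustrative data: one substitution and one deletion over six reference phones.
references = [["k", "æ", "t"], ["d", "ɔ", "ɡ"]]
hypotheses = [["k", "a", "t"], ["d", "ɔ"]]
per = phone_error_rate(references, hypotheses)
print(round(per, 5))  # 0.33333
```

Averaging per-word error rates instead of pooling errors across the corpus gives different numbers, so the aggregation should match whatever the compared systems used.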

Usage

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

lang = "English"  # one of: English, German, French, Spanish
sentence = "But we are not sorry, for the rain is delightful."

model_id = "iisys-hof/OLaPhLLM_v2"
# Use the GPU when available, otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).to(device)

# Translation-style prompt; the input must end with "Phones:".
prompt = f"Translate this from {lang} to Phones:\n{lang}: "

inputs = tokenizer(f"{prompt}{sentence}\nPhones:", return_tensors="pt").to(device)

outputs = model.generate(**inputs, max_new_tokens=256)
phonemized = tokenizer.decode(outputs[0], skip_special_tokens=True)
# The decoded text contains the prompt as well; keep only the final "Phones:" line.
phonemized = phonemized.split("\n")[-1].replace("Phones:", "").strip()

print(phonemized)
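If several sentences need to be phonemized, the prompt construction and output parsing from the snippet above can be wrapped in small helpers. This is a sketch under assumptions: the names `build_prompt`, `extract_phones`, and `phonemize` are hypothetical and not part of the model's API, and `model` and `tokenizer` are expected to be loaded as in the usage snippet.

```python
SUPPORTED_LANGUAGES = ("English", "German", "French", "Spanish")


def build_prompt(lang, sentence):
    """Build the translation-style prompt shown in the usage snippet."""
    if lang not in SUPPORTED_LANGUAGES:
        raise ValueError(f"Unsupported language: {lang}")
    return f"Translate this from {lang} to Phones:\n{lang}: {sentence}\nPhones:"


def extract_phones(decoded):
    """Keep only the text after "Phones:" on the last decoded line."""
    return decoded.split("\n")[-1].replace("Phones:", "").strip()


def phonemize(model, tokenizer, lang, sentence, max_new_tokens=256):
    """Hypothetical wrapper around generate(); needs a loaded model and tokenizer."""
    inputs = tokenizer(build_prompt(lang, sentence), return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return extract_phones(decoded)
```

Like the original snippet, `extract_phones` assumes the generated transcription stays on a single line.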

Citation

@misc{wirth2026olaphoptimallanguagephonemizer,
      title={OLaPh: Optimal Language Phonemizer}, 
      author={Johannes Wirth},
      year={2026},
      eprint={2509.20086},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.20086}, 
}