	---
	license: gemma
	license_name: license
	license_link: LICENSE
	base_model:
	- ModelSpace/GemmaX2-28-2B-v0.1
	pipeline_tag: translation
	library_name: transformers
	tags:
	- text-generation
	language:
	- de
	- en
	- fr
	- es
	datasets:
	- iisys-hof/olaph-data
	---
	## Model Summary

	OLaPhLLM is a large language model for phonemization, finetuned from GemmaX2-28-2B-v0.1.
	Its tokenizer was extended with phoneme tokens derived from a BPE tokenizer trained on phoneme sequences generated by the [OLaPh framework](https://github.com/iisys-hof/olaph).
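
To illustrate how phoneme tokens can be derived with BPE, the sketch below learns merges over phoneme strings in plain Python. It is a toy illustration, not the OLaPh training code: the sample strings, the `learn_bpe_merges` helper, and the merge count are assumptions made for this example.

```python
from collections import Counter


def learn_bpe_merges(sequences, num_merges):
    """Learn up to `num_merges` BPE merges from phoneme strings (toy version)."""
    # Start from single characters, so every phone symbol is its own unit.
    corpus = [list(seq) for seq in sequences]
    merges = []
    for _ in range(num_merges):
        # Count adjacent unit pairs across the whole corpus.
        pairs = Counter()
        for units in corpus:
            pairs.update(zip(units, units[1:]))
        if not pairs:
            break
        (left, right), count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, so further merges would not help
        merges.append(left + right)
        # Replace every occurrence of the best pair with the merged unit.
        new_corpus = []
        for units in corpus:
            merged, i = [], 0
            while i < len(units):
                if i + 1 < len(units) and units[i] == left and units[i + 1] == right:
                    merged.append(left + right)
                    i += 2
                else:
                    merged.append(units[i])
                    i += 1
            new_corpus.append(merged)
        corpus = new_corpus
    return merges


# Hypothetical phoneme strings; real input would come from the OLaPh framework.
samples = ["ʃɪp", "ʃɪn", "fɪʃɪŋ"]
phoneme_tokens = learn_bpe_merges(samples, num_merges=3)
print(phoneme_tokens)
```

With a `transformers` tokenizer, merged units like these would typically be registered via `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`. Whether OLaPhLLM was extended in exactly this way is not stated here.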

	The model was then finetuned for grapheme-to-phoneme conversion on a multilingual dataset (English, German, French, Spanish). The dataset was created by phonemizing text from HuggingFaceFW/fineweb and HuggingFaceFW/fineweb-2 with the OLaPh framework and adding lexicon words taken from OLaPh.
	The training set comprises 650,000 sentence pairs per target language, supplemented by 100,000 randomly selected, isolated lexicon entries per language, for a total of 3 million training examples.
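
The total follows directly from the per-language counts. A quick check, with constant names chosen only for this example:

```python
LANGUAGES = ["English", "French", "German", "Spanish"]
SENTENCE_PAIRS_PER_LANGUAGE = 650_000
LEXICON_ENTRIES_PER_LANGUAGE = 100_000

# 4 languages x 650,000 sentence pairs, plus 4 languages x 100,000 lexicon entries.
sentence_total = len(LANGUAGES) * SENTENCE_PAIRS_PER_LANGUAGE
lexicon_total = len(LANGUAGES) * LEXICON_ENTRIES_PER_LANGUAGE
total_examples = sentence_total + lexicon_total

print(total_examples)  # 3000000
```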

	- Finetuned By: Institute for Information Systems at Hof University
	- Model type: Text-To-Text
	- Dataset: [OLaPh Phonemization Dataset](https://huggingface.co/datasets/iisys-hof/olaph-data)
	- Language(s): English, French, German, Spanish
	- License: Gemma (Gemma is provided under and subject to the Gemma Terms of Use found at ai.google.dev/gemma/terms)
	- Release Date: May 8, 2026

	## Evaluation

	The reported values are Phone Error Rates (PER; lower is better) on the [Wikipron](https://github.com/CUNY-CL/wikipron) dataset.

	Compared Models/Frameworks:

	- [eSpeak NG](https://github.com/espeak-ng/espeak-ng)
	- [Gruut](https://pypi.org/project/gruut)
	- [charsiu/g2p_multilingual_byT5_tiny_16_layers_100](https://huggingface.co/charsiu/g2p_multilingual_byT5_tiny_16_layers_100)
	- [OLaPh Framework](https://github.com/iisys-hof/olaph)
	- OLaPh LLM (this model)

	| Language | eSpeak NG | Gruut | ByT5 | OLaPh | OLaPh LLM |
	| :--- | :--- | :--- | :--- | :--- | :--- |
	| de | 0.17594 | 0.17558 | 0.27864 | 0.04302 | 0.13518 |
	| en_uk | 0.14117 | 0.19174 | 0.13036 | 0.08749 | 0.16321 |
	| en_us | 0.14588 | 0.16545 | 0.16345 | 0.10491 | 0.14713 |
	| es | 0.04324 | 0.04210 | 0.05436 | 0.02582 | 0.03179 |
	| fr | 0.06203 | 0.04045 | 0.09217 | 0.03143 | 0.08591 |
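
For readers unfamiliar with the metric, PER is commonly computed as the edit distance between predicted and reference phone sequences, divided by the number of reference phones. The sketch below shows one such computation. It makes two assumptions that this model card does not confirm: errors are pooled over the whole corpus rather than averaged per word, and each character counts as one phone. The function names and the sample data are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two phone sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phone_error_rate(references, hypotheses):
    """Corpus-level PER: total edits divided by total reference phones."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total_phones = sum(len(r) for r in references)
    return total_edits / total_phones


# Hypothetical example: one substitution across six reference phones.
refs = [list("kæt"), list("dɔɡ")]
hyps = [list("kat"), list("dɔɡ")]
per = phone_error_rate(refs, hyps)
print(round(per, 5))
```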

	## Usage

	```python
	import torch
	from transformers import AutoModelForCausalLM, AutoTokenizer

	lang = "English"  # also supported: "German", "French", "Spanish"
	sentence = "But we are not sorry, for the rain is delightful."

	# Use the GPU when available, otherwise fall back to the CPU.
	device = "cuda" if torch.cuda.is_available() else "cpu"

	model_id = "iisys-hof/OLaPhLLM_v2"
	tokenizer = AutoTokenizer.from_pretrained(model_id)
	model = AutoModelForCausalLM.from_pretrained(model_id).to(device)

	# The prompt names the source language twice and ends with "Phones:".
	prompt = f"Translate this from {lang} to Phones:\n{lang}: "
	inputs = tokenizer(f"{prompt}{sentence}\nPhones:", return_tensors="pt").to(device)

	outputs = model.generate(**inputs, max_new_tokens=256)

	# The decoded text repeats the prompt; the phonemes follow "Phones:" on the last line.
	phonemized = tokenizer.decode(outputs[0], skip_special_tokens=True)
	phonemized = phonemized.split("\n")[-1].replace("Phones:", "").strip()

	print(phonemized)
	```
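
When phonemizing many sentences, it can help to factor the prompt construction and output parsing from the usage example into small functions. The sketch below does that without loading the model. `build_prompt` and `extract_phones` are hypothetical helper names, and the decoded string is a placeholder standing in for `tokenizer.decode(...)` output, not a real model result.

```python
def build_prompt(lang, sentence):
    """Build the prompt in the format used by the usage example."""
    return f"Translate this from {lang} to Phones:\n{lang}: {sentence}\nPhones:"


def extract_phones(decoded):
    """Return the text after the last 'Phones:' marker in the decoded output."""
    return decoded.rsplit("Phones:", 1)[-1].strip()


prompt = build_prompt("German", "Guten Morgen.")

# Placeholder for the decoded generation: the prompt followed by a dummy completion.
fake_decoded = prompt + " PLACEHOLDER_PHONES"
print(extract_phones(fake_decoded))
```

Unlike the last-line split in the usage example, `extract_phones` keeps everything after the final `Phones:` marker, so it would also keep any line breaks the model generates.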

	## Citation
	```bibtex
	@misc{wirth2026olaphoptimallanguagephonemizer,
	title={OLaPh: Optimal Language Phonemizer},
	author={Johannes Wirth},
	year={2026},
	eprint={2509.20086},
	archivePrefix={arXiv},
	primaryClass={cs.CL},
	url={https://arxiv.org/abs/2509.20086},
	}
	```