---
license: gemma
license_name: license
license_link: LICENSE
base_model:
- ModelSpace/GemmaX2-28-2B-v0.1
pipeline_tag: translation
library_name: transformers
tags:
- text-generation
language:
- de
- en
- fr
- es
datasets:
- iisys-hof/olaph-data
---
## Model Summary
OLaPhLLM is a large language model for phonemization, finetuned from GemmaX2-28-2B-v0.1.
Its tokenizer was extended with phoneme tokens derived from a BPE tokenizer trained on phoneme sequences generated by the [OLaPh framework](https://github.com/iisys-hof/olaph).
The model was then finetuned for grapheme-to-phoneme conversion on a multilingual dataset (English, German, French, Spanish), created by phonemizing text from HuggingFaceFW/fineweb and HuggingFaceFW/fineweb-2 with the OLaPh framework and supplemented with lexicon words taken from OLaPh.
The training set comprised 650,000 sentence pairs per target language (English, French, German, and Spanish), supplemented by 100,000 isolated, randomly selected lexicon entries per language, totaling 3 million training examples.
- **Finetuned By**: Institute for Information Systems at Hof University
- **Model type**: Text-To-Text
- **Dataset**: [OLaPh Phonemization Dataset v2](https://huggingface.co/datasets/iisys-hof/olaph-data-v2)
- **Language(s)**: English, French, German, Spanish
- **License**: Gemma (Gemma is provided under and subject to the Gemma Terms of Use found at ai.google.dev/gemma/terms)
- **Release Date**: May 8, 2026
## Evaluation
The reported values are Phone Error Rate (PER) on the [WikiPron](https://github.com/CUNY-CL/wikipron) dataset; lower is better.
Compared Models/Frameworks:
- [eSpeak NG](https://github.com/espeak-ng/espeak-ng)
- [Gruut](https://pypi.org/project/gruut)
- [charsiu/g2p_multilingual_byT5_tiny_16_layers_100](https://huggingface.co/charsiu/g2p_multilingual_byT5_tiny_16_layers_100)
- [OLaPh Framework](https://github.com/iisys-hof/olaph)
- OLaPh LLM (this model)
| Language | espeak | gruut | byt5 | olaph | olaph_llm |
| :--- | :--- | :--- | :--- | :--- | :--- |
| **de** | 0.17594 | 0.17558 | 0.27864 | 0.04302 | 0.13518 |
| **en_uk** | 0.14117 | 0.19174 | 0.13036 | 0.08749 | 0.16321 |
| **en_us** | 0.14588 | 0.16545 | 0.16345 | 0.10491 | 0.14713 |
| **es** | 0.04324 | 0.04210 | 0.05436 | 0.02582 | 0.03179 |
| **fr** | 0.06203 | 0.04045 | 0.09217 | 0.03143 | 0.08591 |
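PER is commonly computed as the edit (Levenshtein) distance between the predicted and reference phone sequences, divided by the number of reference phones. The sketch below assumes that definition and assumes phones are already split into lists. The segmentation and normalization actually used for the table above are not stated in this card, and `edit_distance` and `phone_error_rate` are hypothetical helper names.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences of phones."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phone_error_rate(references, hypotheses):
    """Corpus-level PER: total edit distance over total reference phones."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total_phones = sum(len(r) for r in references)
    return total_edits / total_phones


# Toy data for illustration only: one substitution across six reference phones.
refs = [["k", "æ", "t"], ["d", "ɔ", "ɡ"]]
hyps = [["k", "ʌ", "t"], ["d", "ɔ", "ɡ"]]
print(phone_error_rate(refs, hyps))
```

Summing edits and reference lengths over the whole corpus, rather than averaging per-word rates, keeps long words from being under-weighted; either convention may have been used for the reported numbers.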
## Usage
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

lang = "English"  # or "German", "French", "Spanish"
sentence = "But we are not sorry, for the rain is delightful."
model_id = "iisys-hof/OLaPhLLM_v2"

# Use the GPU when available, otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).to(device)

# Prompt layout: task instruction, source sentence, then the "Phones:" cue.
prompt = f"Translate this from {lang} to Phones:\n{lang}: "
inputs = tokenizer(f"{prompt}{sentence}\nPhones:", return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=256)

# The decoded text includes the prompt; keep only the final "Phones:" line.
phonemized = tokenizer.decode(outputs[0], skip_special_tokens=True)
phonemized = phonemized.split("\n")[-1].replace("Phones:", "").strip()
print(phonemized)
```
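When phonemizing many sentences, it can help to isolate the prompt construction and output parsing from the usage snippet above into small functions. This is a minimal sketch: `build_prompt`, `extract_phones`, and `SUPPORTED_LANGUAGES` are hypothetical names, not part of any released package, and the phone string in the demo is a made-up placeholder rather than real model output.

```python
# Language names as they appear in the prompt (assumed from the usage snippet).
SUPPORTED_LANGUAGES = ("English", "German", "French", "Spanish")


def build_prompt(lang: str, sentence: str) -> str:
    """Build the prompt in the same layout as the usage snippet."""
    if lang not in SUPPORTED_LANGUAGES:
        raise ValueError(f"Unsupported language: {lang!r}")
    return f"Translate this from {lang} to Phones:\n{lang}: {sentence}\nPhones:"


def extract_phones(decoded: str) -> str:
    """Keep the last line of the decoded text and drop the 'Phones:' cue."""
    return decoded.split("\n")[-1].replace("Phones:", "").strip()


# Demo with a placeholder continuation standing in for generated text.
decoded = build_prompt("English", "Hello.") + " həˈloʊ"
print(extract_phones(decoded))
```

Pass the result of `build_prompt` to the tokenizer, and the decoded generation to `extract_phones`. Like the original snippet, the parser assumes the model's answer stays on the `Phones:` line.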
### Citation
```bibtex
@misc{wirth2026olaphoptimallanguagephonemizer,
title={OLaPh: Optimal Language Phonemizer},
author={Johannes Wirth},
year={2026},
eprint={2509.20086},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2509.20086},
}
```