iisys-hof/OLaPhLLM_v2

ojo committed on May 8

Commit f42f1b0 · verified · 1 Parent(s): d33206b

Update README.md

Files changed (1)

README.md +70 -3

README.md CHANGED

@@ -1,3 +1,70 @@
----
-license: gemma
----

+---
+license: gemma
+license_name: license
+license_link: LICENSE
+base_model:
+- ModelSpace/GemmaX2-28-2B-v0.1
+pipeline_tag: translation
+library_name: transformers
+tags:
+- text-generation
+language:
+- de
+- en
+- fr
+- es
+datasets:
+- iisys-hof/olaph-data
+---
+## Model Summary
+OLaPhLLM is a large language model for phonemization, finetuned from GemmaX2-28-2B-v0.1.
+Its tokenizer was extended with phoneme tokens derived from a BPE tokenizer trained on phoneme sequences generated by the [OLaPh framework](https://github.com/iisys-hof/olaph).
+The model was then finetuned for grapheme-to-phoneme conversion on a multilingual dataset (English, German, French, Spanish) built by phonemizing text from HuggingFaceFW/fineweb and HuggingFaceFW/fineweb-2 with the OLaPh framework, together with lexicon words taken from OLaPh.
+The training set comprised 650,000 sentence pairs per target language (English, French, German, and Spanish), supplemented by 100,000 isolated, randomly selected lexicon entries per language, totaling 3 million training examples.
+- **Finetuned By**: Institute for Information Systems at Hof University
+- **Model type**: Text-To-Text
+- **Dataset**: [OLaPh Phonemization Dataset](https://huggingface.co/datasets/iisys-hof/olaph-data)
+- **Language(s)**: English, French, German, Spanish
+- **License**: Gemma (Gemma is provided under and subject to the Gemma Terms of Use found at ai.google.dev/gemma/terms)
+- **Release Date**: September 25, 2025
+## Usage
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+lang = "English"  # also supported: German, French, Spanish
+sentence = "But we are not sorry, for the rain is delightful."
+model_id = "iisys-hof/olaph"
+
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+model = AutoModelForCausalLM.from_pretrained(model_id).to("cuda")
+
+# Stop generation at end-of-sequence or at the first "." token.
+stop_tokens = [tokenizer.eos_token_id, tokenizer.encode(".", add_special_tokens=False)[0]]
+
+# Build the translation-style prompt, which ends in "Phones:".
+prompt = f"Translate this from {lang} to Phones:\n{lang}: "
+inputs = tokenizer(f"{prompt}{sentence}\nPhones:", return_tensors="pt").to("cuda")
+outputs = model.generate(**inputs, max_new_tokens=256, eos_token_id=stop_tokens)
+
+# Keep only the last line of the decoded text and drop the "Phones:" label.
+phonemized = tokenizer.decode(outputs[0], skip_special_tokens=False)
+phonemized = phonemized.split("\n")[-1].replace("Phones:", "")
+print(phonemized)
+```
+### Citation
+```bibtex
+@misc{wirth2026olaphoptimallanguagephonemizer,
+      title={OLaPh: Optimal Language Phonemizer},
+      author={Johannes Wirth},
+      year={2026},
+      eprint={2509.20086},
+      archivePrefix={arXiv},
+      primaryClass={cs.CL},
+      url={https://arxiv.org/abs/2509.20086},
+}
+```
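
The prompt construction and output post-processing in the Usage snippet can be pulled into two small helpers, which makes them easy to check without loading the model. This is a minimal sketch, not part of the committed README: the names `SUPPORTED_LANGUAGES`, `build_prompt`, and `extract_phonemes` are hypothetical, the language check and the trailing `.strip()` are additions, and it assumes the input sentence contains no newline.

```python
# Hypothetical helpers mirroring the string handling in the Usage snippet.
SUPPORTED_LANGUAGES = ("English", "German", "French", "Spanish")


def build_prompt(sentence: str, lang: str = "English") -> str:
    """Return the full prompt string that the Usage snippet passes to the tokenizer."""
    if lang not in SUPPORTED_LANGUAGES:  # assumption: reject languages the card does not list
        raise ValueError(f"unsupported language: {lang!r}")
    return f"Translate this from {lang} to Phones:\n{lang}: {sentence}\nPhones:"


def extract_phonemes(decoded: str) -> str:
    """Keep the last line of the decoded output and drop the 'Phones:' label."""
    # .strip() is an addition; the Usage snippet leaves surrounding whitespace in place.
    return decoded.split("\n")[-1].replace("Phones:", "").strip()
```

With these, the Usage snippet would tokenize `build_prompt(sentence, lang)` and pass the decoded generation through `extract_phonemes`. Keeping the string handling separate from the model call means the prompt format can be verified with plain assertions.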