ojo committed on
Commit f42f1b0 · verified · 1 Parent(s): d33206b

Update README.md

Files changed (1)
  1. README.md +70 -3
README.md CHANGED
@@ -1,3 +1,70 @@
- ---
- license: gemma
- ---
+ ---
+ license: gemma
+ license_name: license
+ license_link: LICENSE
+ base_model:
+ - ModelSpace/GemmaX2-28-2B-v0.1
+ pipeline_tag: translation
+ library_name: transformers
+ tags:
+ - text-generation
+ language:
+ - de
+ - en
+ - fr
+ - es
+ datasets:
+ - iisys-hof/olaph-data
+ ---
+ ## Model Summary
+
+ OLaPhLLM is a large language model for phonemization, finetuned from GemmaX2-28-2B-v0.1.
+ Its tokenizer was extended with phoneme tokens derived from a BPE tokenizer trained on phoneme sequences generated by the [OLaPh framework](https://github.com/iisys-hof/olaph).
+
+ The model was then finetuned for grapheme-to-phoneme conversion on a multilingual dataset (English, German, French, Spanish). The dataset was created by phonemizing text from HuggingFaceFW/fineweb and HuggingFaceFW/fineweb-2 with the OLaPh framework and was supplemented with lexicon words taken from OLaPh.
+ The training set comprised 650,000 sentence pairs per language, plus 100,000 randomly selected isolated lexicon entries per language, for a total of 3 million training examples (4 × 750,000).
+
+ - **Finetuned By**: Institute for Information Systems at Hof University
+ - **Model type**: Text-To-Text
+ - **Dataset**: [OLaPh Phonemization Dataset](https://huggingface.co/datasets/iisys-hof/olaph-data)
+ - **Language(s)**: English, French, German, Spanish
+ - **License**: Gemma (Gemma is provided under and subject to the Gemma Terms of Use found at ai.google.dev/gemma/terms)
+ - **Release Date**: September 25, 2025
+
+ ## Usage
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ lang = "English"  # or "German", "French", "Spanish"
+ sentence = "But we are not sorry, for the rain is delightful."
+
+ model_id = "iisys-hof/olaph"
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ model = AutoModelForCausalLM.from_pretrained(model_id).to("cuda")
+
+ # Stop generation at EOS or at the first "." token, i.e. after one sentence.
+ stop_tokens = [tokenizer.eos_token_id, tokenizer.encode(".", add_special_tokens=False)[0]]
+
+ # The prompt names the source language twice, followed by the sentence and a "Phones:" cue.
+ prompt = f"Translate this from {lang} to Phones:\n{lang}: "
+ inputs = tokenizer(f"{prompt}{sentence}\nPhones:", return_tensors="pt").to("cuda")
+
+ outputs = model.generate(**inputs, max_new_tokens=256, eos_token_id=stop_tokens)
+
+ # Decode prompt plus generated phonemes, keeping special tokens.
+ phonemized = tokenizer.decode(outputs[0], skip_special_tokens=False)
+ # Keep only the last line and drop the "Phones:" cue and surrounding whitespace.
+ phonemized = phonemized.split("\n")[-1].replace("Phones:", "").strip()
+
+ print(phonemized)
+ ```
+
+ ### Citation
+ ```bibtex
+ @misc{wirth2026olaphoptimallanguagephonemizer,
+ title={OLaPh: Optimal Language Phonemizer},
+ author={Johannes Wirth},
+ year={2026},
+ eprint={2509.20086},
+ archivePrefix={arXiv},
+ primaryClass={cs.CL},
+ url={https://arxiv.org/abs/2509.20086},
+ }
+ ```
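
Review note, not part of the commit: the prompt construction and output post-processing in the usage snippet could be factored into two small helpers, which would make them testable without loading the model. The following is a minimal sketch under stated assumptions. The names `SUPPORTED_LANGS`, `build_prompt`, and `extract_phones` are hypothetical and do not appear in the README, and the default `"<eos>"` marker is an assumption about the decoded output (in practice one would pass `tokenizer.eos_token`).

```python
# Hypothetical helpers mirroring the README usage snippet; not part of OLaPh.
SUPPORTED_LANGS = ("English", "German", "French", "Spanish")


def build_prompt(lang: str, sentence: str) -> str:
    """Return the same prompt string the README snippet passes to the tokenizer."""
    if lang not in SUPPORTED_LANGS:
        raise ValueError(f"unsupported language: {lang!r}")
    return f"Translate this from {lang} to Phones:\n{lang}: {sentence}\nPhones:"


def extract_phones(decoded: str, eos_token: str = "<eos>") -> str:
    """Return the text after the last 'Phones:' cue, without the EOS marker.

    The default eos_token is an assumption; pass tokenizer.eos_token when
    working with a real tokenizer.
    """
    phones = decoded.rsplit("Phones:", 1)[-1]
    if eos_token:
        phones = phones.replace(eos_token, "")
    return phones.strip()
```

With the variables from the usage snippet, `tokenizer(build_prompt(lang, sentence), return_tensors="pt")` would stand in for the inline f-strings, and `extract_phones(tokenizer.decode(outputs[0]), tokenizer.eos_token)` for the last two post-processing lines.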