LinguaForge — Gemma 4 E4B LoRA across 204 languages
A single ~170 MB LoRA adapter that shifts Google DeepMind's
google/gemma-4-E4B-it
toward every language in FLORES-200 (Meta NLLB Team, No Language Left
Behind, Nature 2024), plus Cherokee, which is not in FLORES-200 and is
covered here with data from the ChrEn corpus (Zhang, Frey & Bansal, EMNLP 2020).
It was trained as part of the LinguaForge / 古韵 GuYun submission to the
Gemma 4 Hackathon (AI for Good: endangered language preservation).
Training summary (Kaggle T4, ~5 h 9 min)
| Setting | Value |
|---|---|
| Base model | unsloth/gemma-4-e4b-it-unsloth-bnb-4bit (4-bit NF4) |
| Trainable params | 42,401,792 / 8,038,558,240 (0.53%) |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Rank / alpha / dropout | 16 / 32 / 0.05 |
| Total chat samples | 33,480 (alternating en → target and target → en) |
| Languages covered | 203 FLORES-200 languages + Cherokee from ChrEn = 204 |
| Continents | 6 (Africa, Asia, Europe, Pacific, South America, Diaspora) + N. America |
| Optimizer steps | 8,370 (1 epoch, batch 2 × grad-accum 2) |
| Reproducer | Kaggle kernel dongwei666/linguaforge-auto |
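The step count and trainable share in the table follow from simple arithmetic. A minimal sketch that rechecks both rows, assuming a single GPU so that one optimizer step consumes batch × grad-accum samples (variable names are illustrative and do not come from the training kernel):

```python
# Recheck two rows of the training summary with plain arithmetic.
# All numbers are copied from the table; the variable names are illustrative.
total_samples = 33_480
per_device_batch = 2
grad_accum = 2
epochs = 1

# Assumption: single GPU, so one optimizer step uses batch * grad_accum samples.
samples_per_step = per_device_batch * grad_accum
optimizer_steps = epochs * total_samples // samples_per_step

trainable_params = 42_401_792
total_params = 8_038_558_240
trainable_pct = 100 * trainable_params / total_params

print(f"optimizer steps: {optimizer_steps:,}")
print(f"trainable share: {trainable_pct:.2f}%")
```

Both values agree with the table (8,370 steps and 0.53% trainable).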
Held-out evaluation (FLORES-200 devtest + ChrEn seed=99)
Numbers come from the Kaggle kernel dongwei666/linguaforge-eval:
50 unseen sentences per language, greedy decoding, and corpus-level
sacrebleu metrics (BLEU and chrF).
| Language | base BLEU | +LoRA BLEU | Δ BLEU | base chrF | +LoRA chrF | Δ chrF |
|---|---|---|---|---|---|---|
| Cherokee (chr_Cher) | 0.04 | 0.45 | +0.41 | 2.30 | 7.87 | +5.56 (3.4×) |
| Tibetan (bod_Tibt) | 0.12 | 0.21 | +0.09 | 19.14 | 27.05 | +7.91 |
| Welsh (cym_Latn) | 3.90 | 6.13 | +2.23 | 31.11 | 31.21 | +0.10 |
| Quechua (quy_Latn) | 1.02 | 1.93 | +0.91 | 19.94 | 22.49 | +2.55 |
| Māori (mri_Latn) | 3.64 | 4.16 | +0.52 | 28.48 | 27.58 | −0.90 |
| Yoruba (yor_Latn) | 2.54 | 1.12 | −1.42 | 21.65 | 11.10 | −10.55 ⚠ |
| Mean (6 langs) | 1.88 | 2.33 | +0.45 | 20.44 | 21.22 | +0.78 |
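The mean row can be rechecked from the per-language rows. A small sketch, assuming the mean is an unweighted average over the six languages (the table does not say how it was computed); the scores are copied from the table and the helper name is illustrative:

```python
# Per-language scores copied from the evaluation table:
# (base BLEU, +LoRA BLEU, base chrF, +LoRA chrF)
scores = {
    "chr_Cher": (0.04, 0.45, 2.30, 7.87),
    "bod_Tibt": (0.12, 0.21, 19.14, 27.05),
    "cym_Latn": (3.90, 6.13, 31.11, 31.21),
    "quy_Latn": (1.02, 1.93, 19.94, 22.49),
    "mri_Latn": (3.64, 4.16, 28.48, 27.58),
    "yor_Latn": (2.54, 1.12, 21.65, 11.10),
}


def column_mean(index):
    """Unweighted mean of one score column across the six languages."""
    return sum(row[index] for row in scores.values()) / len(scores)


base_bleu, lora_bleu, base_chrf, lora_chrf = (column_mean(i) for i in range(4))
print(f"mean BLEU: {base_bleu:.2f} -> {lora_bleu:.2f}")
print(f"mean chrF: {base_chrf:.2f} -> {lora_chrf:.2f}")

# Languages where the adapter lowers chrF relative to the base model.
regressions = [lang for lang, row in scores.items() if row[3] < row[2]]
print("chrF regressions:", regressions)
```

Under that assumption the averages reproduce the mean row, and the chrF regressions are Māori and Yoruba.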
Honest read: the LoRA's biggest wins are on languages whose scripts the
base model could barely write (Cherokee chrF 3.4×, Tibetan chrF +7.91).
Welsh shows the largest BLEU jump (+2.23) while its chrF barely moves
(+0.10), which reflects the adapter stripping a "**Welsh Translation:**"
boilerplate prefix that the base model emits. Yoruba regressed into a
repetition loop (chrF −10.55) and Māori chrF dips slightly (−0.90); both
are reported transparently. More samples per language, or per-community
LoRAs, are the expected remedy for the Yoruba regression.
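One way to flag outputs that fall into a repetition loop is to measure how many word n-grams repeat. The sketch below is a hypothetical helper and not part of the evaluation kernel; the n-gram size and threshold are arbitrary assumptions that would need tuning per language, and whitespace splitting is a simplification.

```python
from collections import Counter


def repeated_ngram_ratio(text, n=3):
    """Fraction of word n-grams that duplicate an earlier n-gram in the text."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    duplicates = sum(count - 1 for count in counts.values())
    return duplicates / len(ngrams)


def looks_like_loop(text, n=3, threshold=0.5):
    """Heuristic flag: True when at least `threshold` of n-grams are repeats."""
    return repeated_ngram_ratio(text, n) >= threshold


# Placeholder strings, not real model outputs.
looping = "a b c a b c a b c a b c"
normal = "Hello, my name is Sarah."
print(looks_like_loop(looping), looks_like_loop(normal))
```

Running a check like this over generated translations before scoring would make degenerate outputs visible per language.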
Usage
```python
from unsloth import FastLanguageModel
from unsloth.chat_templates import get_chat_template
import torch

# Load the LoRA adapter in 4-bit.
model, tok = FastLanguageModel.from_pretrained(
    model_name="zcgf111/linguaforge-gemma4-204lang-lora",
    max_seq_length=2048,
    load_in_4bit=True,
)
tok = get_chat_template(tok, chat_template="gemma")
FastLanguageModel.for_inference(model)

msgs = [
    {"role": "system", "content": "You are LinguaForge, a multilingual tutor for endangered and low-resource languages."},
    {"role": "user", "content": "Translate this English sentence into Maori (Polynesian, Pacific):\n\nHello, my name is Sarah."},
]
text = tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
inputs = tok.tokenizer(text, return_tensors="pt").to(model.device)

# Greedy decoding; print only the newly generated tokens.
with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=128, do_sample=False,
                         pad_token_id=tok.tokenizer.eos_token_id)
print(tok.tokenizer.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```
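Because the training data alternates between en → target and target → en, it can help to build the chat messages for either direction with one helper. The sketch below is hypothetical: the English-to-target wording mirrors the usage example above, while the target-to-English wording is a guess and may not match the template used in training.

```python
SYSTEM_PROMPT = (
    "You are LinguaForge, a multilingual tutor for endangered "
    "and low-resource languages."
)


def build_messages(sentence, language, to_english=False):
    """Return chat messages for one translation request.

    The en -> target instruction copies the usage example; the
    target -> en instruction is an assumed wording, not the training template.
    """
    if to_english:
        instruction = f"Translate this {language} sentence into English:"
    else:
        instruction = f"Translate this English sentence into {language}:"
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"{instruction}\n\n{sentence}"},
    ]


msgs = build_messages("Hello, my name is Sarah.", "Maori (Polynesian, Pacific)")
print(msgs[1]["content"])
```

The returned list can be passed to `apply_chat_template` in place of the hand-written `msgs` in the usage example.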
Citations
@article{nllb2024,
title={Scaling neural machine translation to 200 languages},
author={{NLLB Team} and Costa-juss{\`a}, Marta R. and others},
journal={Nature},
year={2024},
doi={10.1038/s41586-024-07335-x}
}
@inproceedings{zhang-etal-2020-chren,
title={{ChrEn}: {Cherokee-English} Machine Translation for Endangered Language Revitalization},
author={Zhang, Shiyue and Frey, Benjamin and Bansal, Mohit},
booktitle={EMNLP},
year={2020}
}
License
CC-BY-SA 4.0, matching the FLORES-200 license. ChrEn is released under CC-BY-SA 4.0 by its authors.