LinguaForge — Gemma 4 E4B LoRA across 204 languages

A single ~170 MB LoRA adapter that shifts Google DeepMind's google/gemma-4-E4B-it toward every language in FLORES-200 (Meta NLLB Team, No Language Left Behind, Nature 2024), plus additional Cherokee coverage from the ChrEn corpus (Zhang, Frey & Bansal, EMNLP 2020). Cherokee is added separately because it is not in FLORES-200.

Trained as part of the LinguaForge / 古韵 GuYun submission to the Gemma 4 Hackathon (AI for Good — endangered language preservation).

Training summary (Kaggle T4, ~5 h 9 min)

| Setting | Value |
|---|---|
| Base model | unsloth/gemma-4-e4b-it-unsloth-bnb-4bit (4-bit NF4) |
| Trainable params | 42,401,792 / 8,038,558,240 (0.53%) |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Rank / alpha / dropout | 16 / 32 / 0.05 |
| Total chat samples | 33,480 (alt. en → target / target → en) |
| Languages covered | 203 FLORES-200 languages + Cherokee from ChrEn = 204 |
| Continents | 6 (Africa, Asia, Europe, Pacific, South America, Diaspora) + N. America |
| Optimizer steps | 8,370 (1 epoch, batch 2 × grad-accum 2) |
| Reproducer | Kaggle kernel dongwei666/linguaforge-auto |
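
The figures in this summary are mutually consistent, and a few lines of arithmetic make that explicit. The sketch below only recomputes values quoted in the summary; the variable names are illustrative and not taken from the training kernel. It assumes a single GPU (so the effective batch is batch × grad-accum) and the standard LoRA convention in which the low-rank update is scaled by alpha / rank.

```python
# Recompute derived quantities from the training summary.
# All inputs are copied from the summary; names are illustrative only.
total_samples = 33_480      # total chat samples
batch_size = 2              # per-step batch size
grad_accum = 2              # gradient accumulation steps
epochs = 1

trainable_params = 42_401_792
total_params = 8_038_558_240

rank, alpha = 16, 32

# Assumption: one GPU, so one optimizer step consumes batch_size * grad_accum samples.
effective_batch = batch_size * grad_accum
optimizer_steps = total_samples * epochs // effective_batch
trainable_pct = 100 * trainable_params / total_params
lora_scale = alpha / rank   # standard LoRA scaling of the low-rank update

print(f"effective batch: {effective_batch}")
print(f"optimizer steps: {optimizer_steps}")
print(f"trainable share: {trainable_pct:.2f}%")
print(f"LoRA scale:      {lora_scale}")
```

With these inputs the step count and trainable share match the 8,370 steps and 0.53% reported in the summary.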

Held-out evaluation (FLORES-200 devtest + ChrEn seed=99)

Numbers come from the Kaggle kernel dongwei666/linguaforge-eval, using 50 unseen sentences per language, greedy decoding, and sacrebleu corpus-level metrics.

| Language | base BLEU | +LoRA BLEU | Δ BLEU | base chrF | +LoRA chrF | Δ chrF |
|---|---|---|---|---|---|---|
| Cherokee (chr_Cher) | 0.04 | 0.45 | +0.41 | 2.30 | 7.87 | +5.56 (3.4×) |
| Tibetan (bod_Tibt) | 0.12 | 0.21 | +0.09 | 19.14 | 27.05 | +7.91 |
| Welsh (cym_Latn) | 3.90 | 6.13 | +2.23 | 31.11 | 31.21 | +0.10 |
| Quechua (quy_Latn) | 1.02 | 1.93 | +0.91 | 19.94 | 22.49 | +2.55 |
| Māori (mri_Latn) | 3.64 | 4.16 | +0.52 | 28.48 | 27.58 | −0.90 |
| Yoruba (yor_Latn) | 2.54 | 1.12 | −1.42 | 21.65 | 11.10 | −10.55 ⚠ |
| Mean (6 langs) | 1.88 | 2.33 | +0.45 | 20.44 | 21.22 | +0.78 |
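
The deltas and the mean row can be rechecked from the per-language scores. The sketch below is a plain recomputation from the rounded values in the table, with illustrative variable names. Deltas recomputed from rounded entries can differ from the published ones in the last digit (the Cherokee chrF delta is one such case), presumably because the table was built from unrounded scores.

```python
# (language, base BLEU, LoRA BLEU, base chrF, LoRA chrF), copied from the table.
rows = [
    ("Cherokee", 0.04, 0.45, 2.30, 7.87),
    ("Tibetan", 0.12, 0.21, 19.14, 27.05),
    ("Welsh", 3.90, 6.13, 31.11, 31.21),
    ("Quechua", 1.02, 1.93, 19.94, 22.49),
    ("Māori", 3.64, 4.16, 28.48, 27.58),
    ("Yoruba", 2.54, 1.12, 21.65, 11.10),
]


def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


# Per-language deltas (LoRA minus base) for both metrics.
for name, bleu0, bleu1, chrf0, chrf1 in rows:
    print(f"{name:9s} dBLEU {bleu1 - bleu0:+.2f}   dchrF {chrf1 - chrf0:+.2f}")

# Unweighted means over the six languages, as in the last table row.
mean_base_bleu = mean(r[1] for r in rows)
mean_lora_bleu = mean(r[2] for r in rows)
mean_base_chrf = mean(r[3] for r in rows)
mean_lora_chrf = mean(r[4] for r in rows)
cherokee_chrf_ratio = rows[0][4] / rows[0][3]

print(f"mean BLEU {mean_base_bleu:.2f} -> {mean_lora_bleu:.2f}")
print(f"mean chrF {mean_base_chrf:.2f} -> {mean_lora_chrf:.2f}")
print(f"Cherokee chrF ratio {cherokee_chrf_ratio:.1f}x")
```

Note that the mean is unweighted, so the large Yoruba regression offsets most of the chrF gains on Cherokee and Tibetan.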

Honest read: the LoRA's biggest wins are on languages whose scripts the base model could barely write (Cherokee chrF 3.4×, Tibetan chrF +7.91). Welsh shows the largest BLEU jump (+2.23) while its chrF is essentially flat (+0.10): the gain comes from the adapter stripping a **Welsh Translation:** boilerplate prefix that the base model emits. Yoruba regressed on both metrics because the adapted model falls into a repetition loop; the result is reported as measured. More samples per language, or per-community LoRAs, might resolve that regression, but this is a hypothesis rather than a measured result.
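
Both failure modes mentioned here can be checked mechanically. The sketch below shows one way to strip a leading label such as **Welsh Translation:** before scoring, so that formatting and translation quality are measured separately, and a simple n-gram heuristic for flagging repetition loops. Both helpers and the regular expression are hypothetical illustrations; nothing in this card indicates that the evaluation kernel applies them.

```python
import re
from collections import Counter

# Hypothetical pattern: a leading markdown label like "**Welsh Translation:**".
_LABEL_PREFIX = re.compile(r"^\s*\*\*[^*\n]*Translation:\*\*\s*", re.IGNORECASE)


def strip_label_prefix(text: str) -> str:
    """Remove one leading '**<Language> Translation:**' label, if present."""
    return _LABEL_PREFIX.sub("", text, count=1)


def repeated_ngram_ratio(text: str, n: int = 3) -> float:
    """Fraction of word n-grams that duplicate an earlier n-gram.

    0.0 means no repeated n-grams; values near 1.0 suggest a repetition loop.
    """
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeats = sum(count - 1 for count in counts.values())
    return repeats / len(ngrams)


cleaned = strip_label_prefix("**Welsh Translation:** example output")
looping = repeated_ngram_ratio("x y x y x y x y")
normal = repeated_ngram_ratio("a b c d e f")
print(cleaned, looping, normal)
```

Scoring the base model with and without the prefix stripped would show how much of the Welsh BLEU gain is formatting, and a per-language repetition ratio would make regressions like Yoruba visible without reading the outputs by hand.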

Usage

from unsloth import FastLanguageModel
from unsloth.chat_templates import get_chat_template
import torch

# Load the adapter repository in 4-bit.
model, tok = FastLanguageModel.from_pretrained(
    model_name="zcgf111/linguaforge-gemma4-204lang-lora",
    max_seq_length=2048,
    load_in_4bit=True,
)
# Apply the Gemma chat template and switch the model to inference mode.
tok = get_chat_template(tok, chat_template="gemma")
FastLanguageModel.for_inference(model)

msgs = [
    {"role": "system", "content": "You are LinguaForge, a multilingual tutor for endangered and low-resource languages."},
    {"role": "user",   "content": "Translate this English sentence into Maori (Polynesian, Pacific):\n\nHello, my name is Sarah."},
]
# Render the chat into a prompt string that ends with the generation prompt.
text = tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
# The underlying text tokenizer is accessed via tok.tokenizer.
inputs = tok.tokenizer(text, return_tensors="pt").to(model.device)
# Greedy decoding (do_sample=False), as in the held-out evaluation.
with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=128, do_sample=False,
                         pad_token_id=tok.tokenizer.eos_token_id)
# Decode only the newly generated tokens, skipping the prompt.
print(tok.tokenizer.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
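
To reuse the same request format for other target languages, the message list can be built by a small helper. This is a sketch that reproduces the wording of the usage example for the English → target direction only; the helper name `build_messages` is hypothetical, and the assumption that other languages take a similar "(family, region)" descriptor is not confirmed by this card. The wording of target → English prompts is not shown here, so it is not sketched.

```python
SYSTEM_PROMPT = (
    "You are LinguaForge, a multilingual tutor for endangered "
    "and low-resource languages."
)


def build_messages(language: str, descriptor: str, sentence: str) -> list:
    """Build an English -> target chat request in the format of the usage example.

    language:   target language name as it should appear in the prompt.
    descriptor: short parenthetical hint (assumed pattern: family, region).
    sentence:   English source sentence to translate.
    """
    user = (
        f"Translate this English sentence into {language} ({descriptor}):"
        f"\n\n{sentence}"
    )
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user},
    ]


msgs = build_messages("Maori", "Polynesian, Pacific", "Hello, my name is Sarah.")
print(msgs[1]["content"])
```

The returned list can be passed to `apply_chat_template` exactly as `msgs` is in the usage example.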

Citations

@article{nllb2024,
  title={Scaling neural machine translation to 200 languages},
  author={{NLLB Team} and Costa-juss{\`a}, Marta R. and others},
  journal={Nature},
  year={2024},
  doi={10.1038/s41586-024-07335-x}
}
@inproceedings{zhang-etal-2020-chren,
  title={{ChrEn}: {Cherokee-English} Machine Translation for Endangered Language Revitalization},
  author={Zhang, Shiyue and Frey, Benjamin and Bansal, Mohit},
  booktitle={EMNLP},
  year={2020}
}

License

CC-BY-SA 4.0, matching the FLORES-200 license. ChrEn is released under CC-BY-SA 4.0 by its authors.
