LinguaForge — Gemma 4 E4B LoRA across 204 languages

A single ~170 MB LoRA adapter that shifts Google DeepMind's `google/gemma-4-E4B-it` toward every language in FLORES-200 (Meta NLLB Team, No Language Left Behind, Nature 2024). Cherokee is not in FLORES-200, so Cherokee coverage comes from the ChrEn corpus (Zhang, Frey & Bansal, EMNLP 2020).

Trained as part of the LinguaForge / 古韵 GuYun submission to the Gemma 4 Hackathon (AI for Good — endangered language preservation).

Training summary (Kaggle T4, ~5 h 9 min)


| Setting | Value |
|---|---|
| Base model | `unsloth/gemma-4-e4b-it-unsloth-bnb-4bit` (4-bit NF4) |
| Trainable params | 42,401,792 / 8,038,558,240 (0.53%) |
| Target modules | `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj` |
| Rank / alpha / dropout | 16 / 32 / 0.05 |
| Total chat samples | 33,480 (alternating `en → target` / `target → en`) |
| Languages covered | 203 FLORES-200 languages + Cherokee from ChrEn = 204 |
| Continents | 6 (Africa, Asia, Europe, Pacific, South America, Diaspora) + North America |
| Optimizer steps | 8,370 (1 epoch, batch 2 × grad-accum 2) |
| Reproducer | Kaggle kernel `dongwei666/linguaforge-auto` |
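
The figures in the training summary are mutually consistent, which a few lines of arithmetic can confirm. The sketch below only re-derives the step count, the trainable fraction, the average sample count per language, and the standard LoRA scaling factor (alpha divided by rank) from the numbers above. The variable names are illustrative and are not taken from the training kernel, and the per-language figure is a plain average that assumes nothing about how samples were actually distributed.

```python
# Values copied from the training summary above.
total_samples = 33_480
languages = 204
per_device_batch = 2
grad_accum = 2
trainable_params = 42_401_792
total_params = 8_038_558_240
lora_rank, lora_alpha = 16, 32

# One optimizer step consumes batch * grad_accum samples.
effective_batch = per_device_batch * grad_accum
optimizer_steps = total_samples // effective_batch

# Share of parameters the adapter trains, as a percentage.
trainable_pct = 100 * trainable_params / total_params

# Plain average; the real per-language split may be uneven.
avg_samples_per_language = total_samples / languages

# Standard LoRA scales the low-rank update by alpha / rank.
lora_scale = lora_alpha / lora_rank

print(optimizer_steps)                     # 8370
print(round(trainable_pct, 2))             # 0.53
print(round(avg_samples_per_language, 1))  # 164.1
print(lora_scale)                          # 2.0
```

An average of roughly 164 chat samples per language is small, which is worth keeping in mind when reading the evaluation results.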

Held-out evaluation (FLORES-200 devtest + ChrEn seed=99)

Numbers come from the Kaggle kernel `dongwei666/linguaforge-eval`: 50 unseen sentences per language, greedy decoding, and corpus-level sacreBLEU metrics.

| Language | base BLEU | +LoRA BLEU | Δ BLEU | base chrF | +LoRA chrF | Δ chrF |
|---|---|---|---|---|---|---|
| Cherokee (`chr_Cher`) | 0.04 | 0.45 | +0.41 | 2.30 | 7.87 | +5.56 (3.4×) |
| Tibetan (`bod_Tibt`) | 0.12 | 0.21 | +0.09 | 19.14 | 27.05 | +7.91 |
| Welsh (`cym_Latn`) | 3.90 | 6.13 | +2.23 | 31.11 | 31.21 | +0.10 |
| Quechua (`quy_Latn`) | 1.02 | 1.93 | +0.91 | 19.94 | 22.49 | +2.55 |
| Māori (`mri_Latn`) | 3.64 | 4.16 | +0.52 | 28.48 | 27.58 | −0.90 |
| Yoruba (`yor_Latn`) | 2.54 | 1.12 | −1.42 | 21.65 | 11.10 | −10.55 ⚠ |
| Mean (6 langs) | 1.88 | 2.33 | +0.45 | 20.44 | 21.22 | +0.78 |
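
The Δ columns and the mean row can be re-derived from the per-language scores. The sketch below copies the four score columns from the table and recomputes the deltas and column means. It is a consistency check on the published numbers, not the evaluation kernel's code. Because it works from scores already rounded to two decimals, a recomputed value can differ from the table in the last digit: Cherokee chrF comes out as +5.57 here against the table's +5.56, most likely because the table subtracted unrounded scores.

```python
# (base BLEU, +LoRA BLEU, base chrF, +LoRA chrF), copied from the table above.
scores = {
    "chr_Cher": (0.04, 0.45, 2.30, 7.87),
    "bod_Tibt": (0.12, 0.21, 19.14, 27.05),
    "cym_Latn": (3.90, 6.13, 31.11, 31.21),
    "quy_Latn": (1.02, 1.93, 19.94, 22.49),
    "mri_Latn": (3.64, 4.16, 28.48, 27.58),
    "yor_Latn": (2.54, 1.12, 21.65, 11.10),
}


def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


# Per-language deltas: +LoRA minus base, for BLEU and chrF.
deltas = {
    code: (round(lora_bleu - base_bleu, 2), round(lora_chrf - base_chrf, 2))
    for code, (base_bleu, lora_bleu, base_chrf, lora_chrf) in scores.items()
}

# Column means over the six languages, rounded like the table.
column_means = tuple(
    round(mean(row[i] for row in scores.values()), 2) for i in range(4)
)

for code, (d_bleu, d_chrf) in deltas.items():
    print(f"{code}: dBLEU={d_bleu:+.2f} dchrF={d_chrf:+.2f}")
print("means:", column_means)
```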

Honest read: the LoRA's biggest wins are on languages whose scripts the base model could barely write (Cherokee chrF 3.4×, Tibetan chrF +7.91). Welsh shows the largest BLEU jump (+2.23), which comes from the adapter stripping a **Welsh Translation:** boilerplate prefix that the base model emits; its chrF is nearly flat (+0.10). Māori is mixed (BLEU +0.52, chrF −0.90). Yoruba regressed into a repetition loop (chrF −10.55), and that result is reported here rather than omitted. More samples per language, or per-community LoRAs, may resolve that regression.
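
Both failure patterns mentioned above, a repetition loop and a boilerplate label prefix, can be screened for with simple string checks before scoring. The sketch below is a hypothetical diagnostic and is not part of the evaluation kernel: `repeated_ngram_ratio` and `strip_label_prefix` are names invented for illustration, whitespace tokenization is a simplification, and the sample strings are placeholders rather than real model outputs.

```python
from collections import Counter


def repeated_ngram_ratio(text: str, n: int = 3) -> float:
    """Fraction of word n-grams that duplicate an earlier n-gram in the text."""
    tokens = text.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    # Every occurrence beyond the first counts as a repeat.
    repeats = sum(count - 1 for count in counts.values())
    return repeats / len(ngrams)


def strip_label_prefix(text: str, label: str) -> str:
    """Drop a leading label such as '**Welsh Translation:**' if present."""
    stripped = text.lstrip()
    if stripped.startswith(label):
        return stripped[len(label):].lstrip()
    return text


# Placeholder strings, not actual model outputs.
looping = "a b c a b c a b c a b c"
clean = "the quick brown fox jumps"
print(repeated_ngram_ratio(looping))  # 0.7
print(repeated_ngram_ratio(clean))    # 0.0
print(strip_label_prefix("**Welsh Translation:** Bore da.", "**Welsh Translation:**"))
```

A high ratio only flags an output for manual inspection; any cutoff value would be an assumption that needs tuning per language.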

Usage

from unsloth import FastLanguageModel
from unsloth.chat_templates import get_chat_template
import torch

# Load the adapter repo in 4-bit.
model, tok = FastLanguageModel.from_pretrained(
    model_name="zcgf111/linguaforge-gemma4-204lang-lora",
    max_seq_length=2048,
    load_in_4bit=True,
)
# Apply the Gemma chat template and switch Unsloth to inference mode.
tok = get_chat_template(tok, chat_template="gemma")
FastLanguageModel.for_inference(model)

msgs = [
    {"role": "system", "content": "You are LinguaForge, a multilingual tutor for endangered and low-resource languages."},
    {"role": "user",   "content": "Translate this English sentence into Maori (Polynesian, Pacific):\n\nHello, my name is Sarah."},
]
# Render the chat to a prompt string, then tokenize it with the underlying text tokenizer.
text = tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
inputs = tok.tokenizer(text, return_tensors="pt").to(model.device)
# Greedy decoding (do_sample=False), matching the evaluation setup.
with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=128, do_sample=False,
                         pad_token_id=tok.tokenizer.eos_token_id)
# Decode only the newly generated tokens, skipping the prompt.
print(tok.tokenizer.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
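
To translate into other languages, the `msgs` list from the usage example can be built programmatically. The sketch below is a minimal helper that reproduces that English-to-target prompt exactly. `build_messages` is a hypothetical name, and whether every language was trained with the same "Language (family, region)" descriptor is an assumption. The wording of the reverse `target → en` prompt is not shown in this card, so the helper does not cover it.

```python
SYSTEM_PROMPT = (
    "You are LinguaForge, a multilingual tutor for endangered "
    "and low-resource languages."
)


def build_messages(sentence: str, language: str, descriptor: str = "") -> list[dict]:
    """Build an English -> target chat in the format of the usage example."""
    # The optional descriptor mirrors "(Polynesian, Pacific)" in the example.
    target = f"{language} ({descriptor})" if descriptor else language
    user = f"Translate this English sentence into {target}:\n\n{sentence}"
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user},
    ]


msgs = build_messages("Hello, my name is Sarah.", "Maori", "Polynesian, Pacific")
print(msgs[1]["content"])
```

The returned list can be passed to `tok.apply_chat_template` in place of the hand-written `msgs`.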

Citations

@article{nllb2024,
  title={Scaling neural machine translation to 200 languages},
  author={{NLLB Team} and Costa-juss{\`a}, Marta R. and others},
  journal={Nature},
  year={2024},
  doi={10.1038/s41586-024-07335-x}
}
@inproceedings{zhang-etal-2020-chren,
  title={{ChrEn}: {Cherokee-English} Machine Translation for Endangered Language Revitalization},
  author={Zhang, Shiyue and Frey, Benjamin and Bansal, Mohit},
  booktitle={EMNLP},
  year={2020}
}

License

CC-BY-SA 4.0, matching the FLORES-200 license. ChrEn is released under CC-BY-SA 4.0 by its authors.
