---
base_model: meta-llama/Meta-Llama-3-8B-Instruct
language:
- el
library_name: peft
license: other
pipeline_tag: text-generation
tags:
- base_model:adapter:meta-llama/Meta-Llama-3-8B-Instruct
- lora
- greek
- dialect
- transformers
---

# Greek Dialect LoRA: Llama-3 8B Instruct Adapter

LoRA adapter trained by the CLLT Lab (University of Crete) for dialectal Greek generation on top of [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct). The adapter follows the same natural-prompt pipeline as the Krikri variant but builds on Meta’s instruct-tuned Llama 3 8B backbone. Training ran for 4,173 steps (3 epochs); the best checkpoint was at step 4,000 (eval loss 1.874).

**Project website:** https://stergioscha.github.io/CLLT/

## Model Details

- **Developer:** CLLT Lab, University of Crete
- **Adapter type:** LoRA (PEFT) with r=16, α=32, dropout=0.1, applied to the q/k/v/o/gate/up/down projections
- **Dataset:** 23k+ instruction-following pairs covering the Pontic, Cretan, Northern Greek, and Cypriot dialects (derived from GRDD)
- **Split:** 95% train / 5% validation, using the Hugging Face `datasets` random split
- **Precision:** bfloat16; gradient accumulation of 8 steps → effective batch size 16
- **License:** Research purposes only, subject to the Meta Llama 3 license terms
- **Compute:** AWS GPU resources via GRNET & EU Recovery and Resilience Facility funding

### Sources

- **GitHub:** https://github.com/StergiosCha/krikri_dialectal
- **Dataset:** https://github.com/StergiosCha/Greek_dialect_corpus
- **Website:** https://stergioscha.github.io/CLLT/

## Intended Use

### Direct

- Generate or continue prompts in specific Greek dialects for cultural documentation or experimentation
- Build dialogue systems that can answer in Pontic, Cretan, Northern Greek, or Cypriot when prompted explicitly

### Downstream

- Plug into RAG/chat pipelines that rely on Meta-Llama-3-8B-Instruct as a base
- Evaluate dialectal control against GRDD+ or bespoke benchmarks

### Out-of-scope

- Critical or safety-sensitive deployments without native-speaker review
- Automatic translation or identification of dialects (the model produces text; it is not a classifier)
- Standard Modern Greek generation (Standard Modern Greek entries were removed from the training data)

## Usage

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model in bfloat16 and let it be placed on the available device(s).
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

# Attach the LoRA adapter weights on top of the base model.
model = PeftModel.from_pretrained(base, "Stergios/llama3-8b-instruct-lora")

# Prompt (Greek): "Answer in Cretan: Where shall we meet?"
prompt = "Απάντησε στα κρητικά: Πού θα συναντηθούμε;"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Sample up to 160 new tokens; the decoded text includes the prompt.
output = model.generate(**inputs, max_new_tokens=160, temperature=0.8, do_sample=True)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

## Training Data & Procedure

- **Preparation:** `convert_to_natural_prompts_dialects_only.py` converts `///` tags to natural-language Greek instructions (e.g., “Γράψε στην κρητική διάλεκτο: …”, i.e. “Write in the Cretan dialect: …”).
- **Filtering:** Standard Modern Greek entries were removed to keep the adapter dialect-focused.
- **Tokenization:** Maximum sequence length of 512 tokens, padded to max length; labels are set equal to the input IDs.
- **Hyperparameters:** epochs=3, lr=3e-4, warmup=100, save/eval every 200 steps, `load_best_model_at_end=True`.
- **Checkpoint size:** adapter ≈ 170 MB (`adapter_model.safetensors`).

## Evaluation

- **Automatic:** Validation loss was tracked every 200 steps; the best checkpoint was at step 4,000 (eval loss 1.874).
- **Recommended manual checks:** Have native speakers verify correctness, register, and cultural sensitivity.

## Limitations & Risks

- Dialect mixing can occur if prompts are vague. Specify the dialect explicitly.
- The model inherits any biases present in GRDD (topics, speaker demographics, orthography).
- The Llama 3 license disallows certain use cases; comply with Meta’s terms in addition to the “research only” clause stated here.

## Acknowledgments

- National Infrastructures for Research and Technology (GRNET) for AWS credits
- EU Recovery & Resilience Facility for funding
- Meta for the base Llama 3 models

## Contact

Questions or issues? Open an issue on the GitHub repository or reach out to the CLLT Lab (University of Crete).
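
## Appendix: Sanity-checking the training figures

The training figures quoted in this card (4,173 steps over 3 epochs, gradient accumulation 8, effective batch size 16, a 95/5 split, evaluation every 200 steps) can be cross-checked with a little arithmetic. The sketch below is only an estimate: it assumes a single training device and equal-length epochs, neither of which the card states, and the variable names are illustrative rather than taken from the training script.

```python
# Figures quoted in this model card.
TOTAL_STEPS = 4173
EPOCHS = 3
GRAD_ACCUM = 8
EFFECTIVE_BATCH = 16
TRAIN_FRACTION = 0.95
EVAL_EVERY = 200
BEST_STEP = 4000

# Assumption (not stated in the card): one device, so
# effective batch = per-device batch * gradient accumulation steps.
per_device_batch = EFFECTIVE_BATCH // GRAD_ACCUM

# Optimizer steps per epoch, assuming every epoch has the same number of steps.
steps_per_epoch = TOTAL_STEPS // EPOCHS

# Rough estimate of the training set size; the exact count depends on how
# the trainer rounds a partial final batch in each epoch.
approx_train_examples = steps_per_epoch * EFFECTIVE_BATCH
approx_total_pairs = round(approx_train_examples / TRAIN_FRACTION)

# Steps at which an evaluation (and checkpoint save) would have run.
eval_steps = list(range(EVAL_EVERY, TOTAL_STEPS + 1, EVAL_EVERY))

print("per-device batch size:", per_device_batch)
print("steps per epoch:", steps_per_epoch)
print("approx. training examples:", approx_train_examples)
print("approx. total pairs before the split:", approx_total_pairs)
print("best step is an evaluation step:", BEST_STEP in eval_steps)
print("last evaluation step:", eval_steps[-1])
```

Under these assumptions the numbers are mutually consistent: about 22k training examples, or roughly 23.4k pairs before the split, which matches the "23k+" figure, and step 4,000 is the last evaluation point before training ended at step 4,173.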