---
base_model: meta-llama/Meta-Llama-3-8B-Instruct
language:
  - el
library_name: peft
license: other
pipeline_tag: text-generation
tags:
  - base_model:adapter:meta-llama/Meta-Llama-3-8B-Instruct
  - lora
  - greek
  - dialect
  - transformers
---

# Greek Dialect LoRA — Llama-3 8B Instruct Adapter

LoRA adapter trained by the CLLT Lab (University of Crete) for dialectal Greek generation on top of [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct). The adapter follows the same natural-prompt pipeline as the Krikri variant but uses Meta’s Llama 3 8B Instruct model as its backbone. Training ran for 4,173 steps (3 epochs); the best checkpoint was at step 4,000 (eval loss 1.874).

**Project website:** https://stergioscha.github.io/CLLT/

## Model Details

- **Developer:** CLLT Lab, University of Crete  
- **Adapter type:** LoRA (PEFT) with r=16, α=32, dropout=0.1 applied to q/k/v/o/gate/up/down projections  
- **Dataset:** 23k+ instruction-following pairs covering the Pontic, Cretan, Northern Greek, and Cypriot dialects (derived from GRDD)  
- **Split:** 95% train / 5% validation using Hugging Face `datasets` random split  
- **Precision:** bfloat16, gradient accumulation 8 → effective batch size 16  
- **License:** Research purposes only, subject to the Meta Llama 3 license terms  
- **Compute:** AWS GPU resources via GRNET & EU Recovery and Resilience Facility funding  
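
As a rough cross-check of the adapter configuration above, the sketch below counts the LoRA parameters implied by r=16 on the seven listed projections. The layer dimensions are those of the public Llama 3 8B configuration; the float32 storage assumption is ours and is not stated in this card.

```python
# Llama 3 8B dimensions, taken from the public model configuration.
HIDDEN = 4096
INTERMEDIATE = 14336
NUM_LAYERS = 32
KV_DIM = 1024  # 8 key/value heads x 128 head dim (grouped-query attention)

RANK = 16  # LoRA r reported in this card

# (in_features, out_features) of each adapted linear layer.
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(rank, projections, num_layers):
    """Each adapted layer adds A (rank x in) and B (out x rank) matrices."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in projections.values())
    return per_layer * num_layers


total_params = lora_param_count(RANK, PROJECTIONS, NUM_LAYERS)
# Assumption: adapter weights are stored as float32 (4 bytes per parameter).
size_mb_fp32 = total_params * 4 / 1e6

print(f"LoRA parameters: {total_params:,}")
print(f"Approximate size in float32: {size_mb_fp32:.1f} MB")
```

Under these assumptions the adapter has about 42M parameters and occupies roughly 168 MB, which is in line with the ≈ 170 MB checkpoint size reported in this card.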

### Sources

- **GitHub:** https://github.com/StergiosCha/krikri_dialectal  
- **Dataset:** https://github.com/StergiosCha/Greek_dialect_corpus  
- **Website:** https://stergioscha.github.io/CLLT/  

## Intended Use

### Direct
- Generate or continue prompts in specific Greek dialects for cultural documentation or experimentation  
- Build dialogue systems that can answer in Pontic, Cretan, Northern Greek, or Cypriot when prompted explicitly  

### Downstream
- Plug into RAG/chat pipelines that rely on Meta-Llama-3-8B-Instruct as a base  
- Evaluate dialectal control against GRDD+ or bespoke benchmarks  

### Out-of-scope
- Critical or safety-sensitive deployments without native-speaker review  
- Automatic translation or identification of dialects (model produces text; it is not a classifier)  
- Standard Modern Greek generation (Standard Modern Greek entries were removed from the training data)  

## Usage

```python
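import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model in bfloat16; device_map="auto" places it on the available devices.
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

# Attach the LoRA adapter weights on top of the base model.
model = PeftModel.from_pretrained(base, "Stergios/llama3-8b-instruct-lora")

# Prompt in English: "Answer in Cretan: Where shall we meet?"
prompt = "Απάντησε στα κρητικά: Πού θα συναντηθούμε;"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Sample up to 160 new tokens; generate() returns the prompt tokens followed by
# the continuation, so the decoded text below includes the prompt.
output = model.generate(**inputs, max_new_tokens=160, temperature=0.8, do_sample=True)
print(tokenizer.decode(output[0], skip_special_tokens=True))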
```

## Training Data & Procedure

- **Preparation:** `convert_to_natural_prompts_dialects_only.py` converts the `<po>/<cr>/<no>/<cy>` dialect tags into natural-language Greek instructions (e.g., “Γράψε στην κρητική διάλεκτο: …”, i.e. “Write in the Cretan dialect: …”).  
- **Filtering:** Removed Standard Modern Greek entries to keep the adapter dialect-focused.  
- **Tokenization:** Maximum sequence length of 512 tokens, padded to max length; labels are a copy of the input IDs.  
- **Hyperparameters:** epochs=3, lr=3e-4, warmup=100, save/eval every 200 steps, `load_best_model_at_end=True`.  
- **Checkpoint size:** adapter ≈ 170 MB (`adapter_model.safetensors`).  
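
The preparation and filtering steps above can be sketched roughly as follows. This is not the project's `convert_to_natural_prompts_dialects_only.py`: the function name, the assumption that the tag sits at the start of the text, and the instruction wording for Pontic, Northern, and Cypriot are all hypothetical. Only the Cretan wording is quoted in this card.

```python
import re

# Hypothetical instruction templates. Only the Cretan string appears in this
# card; the other three are illustrative guesses, not the project's wording.
TAG_TO_INSTRUCTION = {
    "<cr>": "Γράψε στην κρητική διάλεκτο:",
    "<po>": "Γράψε στην ποντιακή διάλεκτο:",
    "<no>": "Γράψε στη βόρεια διάλεκτο:",
    "<cy>": "Γράψε στην κυπριακή διάλεκτο:",
}

# Assumption: each raw example starts with a two-letter tag such as "<cr>".
TAG_PATTERN = re.compile(r"^\s*(<[a-z]{2}>)\s*")


def to_natural_prompt(text):
    """Replace a leading dialect tag with a Greek instruction.

    Returns None for untagged text or unknown tags, so callers can drop
    those entries (for example, non-dialect examples).
    """
    match = TAG_PATTERN.match(text)
    if match is None or match.group(1) not in TAG_TO_INSTRUCTION:
        return None
    instruction = TAG_TO_INSTRUCTION[match.group(1)]
    return f"{instruction} {text[match.end():].strip()}"


raw_examples = ["<cr> Πού θα συναντηθούμε;", "<el> Καλημέρα σας."]
converted = [p for p in map(to_natural_prompt, raw_examples) if p is not None]
print(converted)
```

Returning `None` for unrecognized tags keeps conversion and filtering in one pass; how the real pipeline marks and removes Standard Modern Greek entries is not described here, so the `<el>` tag in the example is purely illustrative.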

## Evaluation

- **Automatic:** Validation loss tracked every 200 steps; best checkpoint at step 4,000 (eval loss 1.874).  
- **Recommended manual checks:** Have native speakers verify correctness, register, and cultural sensitivity.  
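
To put the reported numbers in context, the short calculation below derives the evaluation schedule, the approximate training-set size, and the perplexity implied by the best eval loss. It assumes the loss is a mean per-token cross-entropy in nats and that steps divide evenly across epochs; neither is stated explicitly in this card.

```python
import math

# Figures reported in this card.
TOTAL_STEPS = 4173
EPOCHS = 3
EVAL_EVERY = 200
EFFECTIVE_BATCH = 16
TRAIN_SHARE = 0.95
BEST_STEP = 4000
BEST_EVAL_LOSS = 1.874

# Steps at which a scheduled evaluation would run.
eval_steps = list(range(EVAL_EVERY, TOTAL_STEPS + 1, EVAL_EVERY))

# Approximate dataset size implied by the step count (ignores rounding of the
# last partial batch in each epoch).
steps_per_epoch = TOTAL_STEPS // EPOCHS
approx_train_examples = steps_per_epoch * EFFECTIVE_BATCH
approx_total_examples = approx_train_examples / TRAIN_SHARE

# Assumption: eval loss is mean token-level cross-entropy in nats.
perplexity = math.exp(BEST_EVAL_LOSS)

print(f"Scheduled evaluations: {len(eval_steps)}, last at step {eval_steps[-1]}")
print(f"Steps per epoch: {steps_per_epoch}, train examples: ~{approx_train_examples}")
print(f"Implied dataset size: ~{approx_total_examples:.0f}")
print(f"Perplexity at best checkpoint: {perplexity:.2f}")
```

Two things follow from this. The best checkpoint (step 4,000) coincides with the last scheduled evaluation, so the validation loss had not clearly started to rise by the end of training. The implied dataset size of roughly 23.4k examples agrees with the "23k+" figure. If padding tokens were not masked out of the labels, the loss also averages over padding positions, so the perplexity of about 6.5 should be treated as indicative only.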

## Limitations & Risks

- Dialect mixing can occur if prompts are vague. Specify the dialect explicitly.  
- Model inherits any biases present in GRDD (topics, speaker demographics, orthography).  
- The Llama 3 license prohibits certain use cases; comply with Meta’s terms in addition to the research-only restriction stated here.  

## Acknowledgments

- National Infrastructures for Research and Technology (GRNET) for AWS credits  
- EU Recovery & Resilience Facility for funding  
- Meta for the base Llama 3 models  

## Contact

Questions or issues? Open an issue on the GitHub repository or reach out to the CLLT Lab (University of Crete).