---
license: cc-by-nc-4.0
datasets:
- CAMeL-Lab/BAREC-Corpus-v1.0
language:
- ar
base_model:
- CAMeL-Lab/readability-arabertv02-word-CE
---

# BAREC Strict Track Document-Level Readability Model

## Overview

This model performs fine-grained Arabic readability assessment at the document level and was developed for the BAREC Shared Task 2025 (Strict Track). It is based on [AraBERTv2](https://huggingface.co/aubmindlab/bert-base-arabertv02) and fine-tuned on the BAREC corpus for 19-level readability classification. The model uses the D3Tok input variant and is trained with a Cross-Entropy (CE) loss followed by a Quadratic Weighted Kappa loss (WKL).

## Intended Uses & Limitations

- **Intended use:** Predicting the readability of Arabic sentences or documents (scale 1-19)
- **Domain:** Modern Standard Arabic, educational content

## Model Details

- **Base model:** CAMeL-Lab/readability-arabertv02-word-CE
- **Input variant:** D3Tok (token-level)
- **Labels:** 19 readability levels (1 = easiest, 19 = hardest)
- **Losses:** CE → WKL (for best results)
- **Strict track:** Document
- **Best QWK:** 81.9% (document-level)

## Training Data

- **Corpus:** [BAREC Corpus v1.0](https://huggingface.co/datasets/CAMeL-Lab/BAREC-Corpus-v1.0)
- **Train/Val/Test split:** Train (80%), Dev (10%), and Test (10%)
- **Preprocessing:** Input variant (D3Tok) generated using the official scripts
- **Cleaning:** No additional cleaning; only the official preprocessing was applied

## Training Procedure

- **Loss functions:** Cross-Entropy, then Quadratic Weighted Kappa loss (WKL)
- **Hyperparameters:**
  - Learning rate: 1e-5
  - Batch size: 32
  - Epochs: 8
  - Scheduler: cosine_with_restarts
  - Weight decay: 0.05
  - fp16: enabled
- **Metrics:** QWK (Quadratic Weighted Kappa), macro F1, accuracy

## Evaluation Results

| Split         | QWK     |
|---------------|---------|
| Validation    | 81.9%   |
| Test (Public) | 82.8%   |
| Blind Test*   | 79.0%   |

## Usage Example

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the model and tokenizer
model_name = "shymaa25/barec-readability-doc-arabertv02-word-ce-wkl-strict"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Split your document into sentences (as a list)
sentences = [
    "هذه جملة سهلة.",
    "هذه الجملة أكثر تعقيدًا وتتطلب مستوى قراءة أعلى.",
    "جملة متوسطة الصعوبة.",
]

# Predict readability for each sentence and select the hardest
levels = []
for sentence in sentences:
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True, padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # Class indices are 0-18; shift by one to get labels 1-19
    pred = torch.argmax(outputs.logits, dim=1).item() + 1
    levels.append(pred)

doc_level = max(levels)  # Hardest sentence determines doc level
print(f"Document readability level: {doc_level}")
```
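
The QWK scores in the Evaluation Results table measure ordinal agreement between gold and predicted levels, penalizing a prediction by the squared distance from the gold level. As a minimal sketch (not the shared task's official scorer), the metric can be computed in plain Python as below. The function name `quadratic_weighted_kappa` and the `num_levels` parameter are illustrative assumptions, and labels are assumed to be integers from 1 to 19.

```python
from collections import Counter


def quadratic_weighted_kappa(y_true, y_pred, num_levels=19):
    """Quadratic Weighted Kappa for integer levels in 1..num_levels.

    Illustrative helper, not part of the released model or the official scorer.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")

    n = len(y_true)
    observed = Counter(zip(y_true, y_pred))  # joint counts of (gold, predicted)
    true_hist = Counter(y_true)              # marginal counts of gold levels
    pred_hist = Counter(y_pred)              # marginal counts of predicted levels
    max_dist = (num_levels - 1) ** 2

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(1, num_levels + 1):
        for j in range(1, num_levels + 1):
            weight = (i - j) ** 2 / max_dist
            weighted_observed += weight * observed[(i, j)]
            # Expected count under independence of the two marginals
            weighted_expected += weight * true_hist[i] * pred_hist[j] / n

    if weighted_expected == 0:
        # Both sequences are constant at the same level; treated here as
        # perfect agreement (an assumption; other tools may return NaN).
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


# Example with made-up labels (not results from this model)
gold = [3, 7, 12, 19]
pred = [3, 8, 12, 17]
print(f"QWK: {quadratic_weighted_kappa(gold, pred):.3f}")
```

A score of 1.0 indicates perfect agreement, 0 indicates chance-level agreement, and negative values indicate systematic disagreement.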