---
license: cc-by-nc-4.0
datasets:
- CAMeL-Lab/BAREC-Corpus-v1.0
language:
- ar
base_model:
- CAMeL-Lab/readability-arabertv02-word-CE
---

# BAREC Strict Track Document-Level Readability Model

## Overview
This model performs fine-grained Arabic readability assessment at the document level and was developed for the BAREC Shared Task 2025 (Strict Track). It is based on [AraBERTv2](https://huggingface.co/aubmindlab/bert-base-arabertv02) and fine-tuned on the BAREC corpus for 19-level readability classification. The model uses the D3Tok input variant and is trained with Cross-Entropy (CE) loss followed by Weighted Kappa Loss (WKL), a loss derived from Quadratic Weighted Kappa (QWK).

## Intended Uses & Limitations
- **Intended use:** Predicting the readability level of Arabic sentences or documents on a scale of 1–19
- **Domain:** Modern Standard Arabic, educational content

## Model Details
- **Base model:** CAMeL-Lab/readability-arabertv02-word-CE
- **Input variant:** D3Tok (token-level)
- **Labels:** 19 readability levels (1 = easiest, 19 = hardest)
- **Losses:** CE, then WKL (the sequence that gave the best results)
- **Track:** Strict, document-level
- **Best validation QWK:** 81.9% (document-level)

## Training Data
- **Corpus:** [BAREC Corpus v1.0](https://huggingface.co/datasets/CAMeL-Lab/BAREC-Corpus-v1.0)
- **Train/Dev/Test split:** 80% / 10% / 10%
- **Preprocessing:** Input variant generated using the official scripts (D3Tok)
- **Cleaning:** No additional cleaning, only official preprocessing

## Training Procedure
- **Loss functions:** Cross-Entropy (CE), then Weighted Kappa Loss (WKL), which is derived from Quadratic Weighted Kappa
- **Hyperparameters:**
  - Learning rate: 1e-5
  - Batch size: 32
  - Epochs: 8
  - Scheduler: cosine_with_restarts
  - Weight decay: 0.05
  - fp16: enabled
- **Metrics:** QWK (Quadratic Weighted Kappa), macro F1, accuracy

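The WKL stage listed above optimizes a differentiable surrogate of Quadratic Weighted Kappa. The sketch below shows one common formulation, `log(sum(w * O) / sum(w * E))`, computed over soft class probabilities. It is written in plain Python for readability; a training implementation would express the same arithmetic with tensor operations. The function name `weighted_kappa_loss`, the `eps` smoothing term, and the exact form of the loss are assumptions for illustration, not a confirmed description of the training code used for this model.

```python
import math

def weighted_kappa_loss(probs, labels, num_levels=19, eps=1e-10):
    """Hypothetical sketch of a weighted kappa loss with quadratic weights.

    probs  : list of per-example probability lists, each of length num_levels
    labels : list of gold levels in 1..num_levels
    Returns log(observed disagreement / expected disagreement); lower is better.
    """
    n = len(labels)
    if n == 0 or len(probs) != n:
        raise ValueError("probs and labels must be non-empty and of equal length")

    scale = (num_levels - 1) ** 2          # normalizes weights into [0, 1]
    true_hist = [0.0] * num_levels         # gold label counts
    pred_hist = [0.0] * num_levels         # summed predicted probabilities
    observed = 0.0                         # sum of w_ij * O_ij

    for p, y in zip(probs, labels):
        true_hist[y - 1] += 1.0
        for j in range(num_levels):
            pred_hist[j] += p[j]
            observed += (y - 1 - j) ** 2 / scale * p[j]

    expected = 0.0                         # sum of w_ij * E_ij
    for i in range(num_levels):
        for j in range(num_levels):
            expected += (i - j) ** 2 / scale * true_hist[i] * pred_hist[j] / n

    return math.log(observed / (expected + eps) + eps)


# Toy example with 3 levels: confident correct predictions give a lower loss
# than confident reversed predictions.
labels = [1, 2, 3]
good = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
bad = [[0.0, 0.0, 1.0], [0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]
print(weighted_kappa_loss(good, labels, num_levels=3))
print(weighted_kappa_loss(bad, labels, num_levels=3))
```

Because the weights grow with the squared distance between levels, predicting level 18 for a level-19 document is penalized far less than predicting level 1, which plain cross-entropy does not distinguish.
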
## Evaluation Results

| Split         | QWK   |
|---------------|-------|
| Validation    | 81.9% |
| Test (Public) | 82.8% |
| Blind Test*   | 79.0% |

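The QWK figures above measure agreement between predicted and gold levels, with disagreements weighted by the squared distance between levels. As a minimal sketch of the standard definition, assuming weights of `(i - j)^2 / (K - 1)^2` for `K = 19` levels, the metric can be computed as follows. The function name `quadratic_weighted_kappa` is hypothetical, and this is not the official shared-task scorer.

```python
from collections import Counter

def quadratic_weighted_kappa(y_true, y_pred, num_levels=19):
    """QWK between two equal-length lists of integer levels in 1..num_levels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")

    n = len(y_true)
    observed = Counter(zip(y_true, y_pred))  # joint counts O[i, j]
    true_hist = Counter(y_true)              # gold label counts
    pred_hist = Counter(y_pred)              # predicted label counts
    scale = (num_levels - 1) ** 2

    num = 0.0  # sum of w_ij * O_ij (observed weighted disagreement)
    den = 0.0  # sum of w_ij * E_ij (disagreement expected by chance)
    for i in range(1, num_levels + 1):
        for j in range(1, num_levels + 1):
            w = (i - j) ** 2 / scale
            num += w * observed[(i, j)]
            den += w * true_hist[i] * pred_hist[j] / n

    if den == 0.0:
        # Degenerate case: every gold and predicted label is the same level.
        # Returning 1.0 here is a convention chosen for this sketch.
        return 1.0
    return 1.0 - num / den


# One of four predictions is off by a single level: kappa is approximately 0.5.
print(quadratic_weighted_kappa([1, 1, 2, 2], [1, 2, 2, 2]))
```

A value of 1.0 means perfect agreement, 0.0 means agreement no better than chance, and negative values mean systematic disagreement. The table reports the same quantity as a percentage.
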
## Usage Example

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the model and tokenizer
model_name = "shymaa25/barec-readability-doc-arabertv02-word-ce-wkl-strict"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Split your document into sentences (as a list)
sentences = [
    "هذه جملة سهلة.",
    "هذه الجملة أكثر تعقيدًا وتتطلب مستوى قراءة أعلى.",
    "جملة متوسطة الصعوبة."
]

# Predict readability for each sentence and select the hardest
levels = []
for sentence in sentences:
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True, padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
        pred = torch.argmax(outputs.logits, dim=1).item() + 1  # labels 1–19
        levels.append(pred)
doc_level = max(levels)  # Hardest sentence determines doc level

print(f"Document readability level: {doc_level}")