---
license: mit
language:
- ar
base_model:
- CAMeL-Lab/bert-base-arabic-camelbert-msa-pos-msa
pipeline_tag: token-classification
---

# CAMeLBERT-MSA-POS-MSA-Lemma-Clustering

# Model Description

CAMeLBERT-MSA-POS-MSA-Lemma-Clustering is a Modern Standard Arabic (MSA) lemmatization model. It is built by fine-tuning the [CAMeLBERT-MSA-POS-MSA](https://huggingface.co/CAMeL-Lab/bert-base-arabic-camelbert-msa-pos-msa) model on the [Penn Arabic Treebank (PATB)](https://dl.acm.org/doi/pdf/10.5555/1621804.1621808) training set. This model approaches lemmatization as a classification task, where each lemma is represented as a unique class within a clustered lemma vocabulary. The fine-tuning procedure, hyperparameters, and detailed methodology are presented in our paper [“Lemmatization as a Classification Task: Results from Arabic across Multiple Genres”](https://aclanthology.org/2025.emnlp-main.1525/).

# Intended uses

This model is integrated into the lemmatization workflow available in our [GitHub repository](https://github.com/CAMeL-Lab/lemmatization-as-classification).
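
As a rough illustration of the classification framing described under Model Description, the sketch below shows how per-token class scores could be mapped back to lemmas. It is not the repository's code: the `id2lemma` table, the `predict_lemmas` helper, and the score values are all hypothetical and chosen only to make the idea concrete. For actual usage, follow the workflow in the GitHub repository.

```python
# Toy illustration only: the label inventory and scores below are invented,
# not taken from the released model or its training data.

# Hypothetical class vocabulary: each class ID stands for one lemma.
id2lemma = {0: "كتاب", 1: "كتب", 2: "مكتبة"}


def predict_lemmas(token_scores):
    """Pick the highest-scoring lemma class for each token."""
    lemmas = []
    for scores in token_scores:
        # Index of the largest score, i.e. the predicted class ID.
        best_id = max(range(len(scores)), key=scores.__getitem__)
        lemmas.append(id2lemma[best_id])
    return lemmas


# One row of per-class scores per input token (made-up numbers).
token_scores = [
    [0.1, 2.3, 0.4],
    [1.9, 0.2, 0.3],
]
predicted = predict_lemmas(token_scores)
```

In this framing, lemmatization reduces to a lookup once the classifier has chosen a class, which is why the size and design of the class vocabulary matter.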