metadata
library_name: transformers
license: gpl-3.0
base_model: bertin-project/bertin-roberta-base-spanish
tags:
- generated_from_trainer
metrics:
- accuracy
- precision
- recall
- f1
model-index:
- name: roberta-spanish-cefr
  results: []
datasets:
- UniversalCEFR/caes_es
language:
- es
pipeline_tag: text-classification
Spanish CEFR Classification with BERTIN
Model summary
pymlex/roberta-spanish-cefr is a Spanish text classifier fine-tuned from bertin-project/bertin-roberta-base-spanish for CEFR level prediction. It is intended for Spanish learner-text classification and readability-style proficiency assessment.
Training data
The model was trained on UniversalCEFR/caes_es, a Spanish dataset of learner texts with CEFR annotations. The dataset has 31.1k rows.
Evaluation
Results for the test set:
- Accuracy: 0.9882
- Precision: 0.9896
- Recall: 0.9892
- F1: 0.9894
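The card does not say how precision, recall, and F1 were averaged across CEFR levels. The sketch below shows one common way such figures could be computed, assuming macro averaging (an assumption, not something the card confirms). The function name `classification_metrics` and the toy labels are hypothetical and are not taken from the training code or the dataset.

```python
from collections import Counter


def classification_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1.

    Macro averaging is an assumption made for illustration; the model
    card does not state which averaging scheme was used.
    """
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted label p, but it was wrong
            fn[t] += 1  # true label t was missed

    def safe_div(a, b):
        return a / b if b else 0.0

    precisions, recalls, f1s = [], [], []
    for label in labels:
        prec = safe_div(tp[label], tp[label] + fp[label])
        rec = safe_div(tp[label], tp[label] + fn[label])
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(safe_div(2 * prec * rec, prec + rec))

    n = len(labels)
    return {
        "accuracy": safe_div(sum(tp.values()), len(y_true)),
        "precision": safe_div(sum(precisions), n),
        "recall": safe_div(sum(recalls), n),
        "f1": safe_div(sum(f1s), n),
    }


# Toy gold and predicted CEFR labels, invented for illustration only.
y_true = ["A1", "A2", "B1", "B1", "B2", "A1"]
y_pred = ["A1", "A2", "B1", "B2", "B2", "A1"]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

With weighted or micro averaging the numbers would differ, so results are only comparable across models when the averaging scheme matches.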
Comparison with other CEFR Spanish classifiers
The model's test-set F1 of 0.9894 is state of the art among documented Spanish CEFR classifiers, most of which fall within the 0.75 to 0.88 F1 range.
Inference
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_id = "pymlex/roberta-spanish-cefr"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()  # inference mode: disables dropout


def predict_cefr(text, top_k=3):
    """Return the top_k most likely CEFR labels for text, with softmax scores."""
    # Inputs longer than 512 tokens are truncated.
    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        max_length=512,
    )
    # No gradients are needed for inference.
    with torch.no_grad():
        logits = model(**inputs).logits
    # Convert logits to probabilities; [0] selects the single input in the batch.
    probs = torch.softmax(logits, dim=-1)[0]
    # Never request more labels than the model has classes.
    k = min(top_k, probs.numel())
    values, indices = torch.topk(probs, k=k)
    return [
        {
            "label": model.config.id2label[i.item()],
            "score": float(v.item()),
        }
        for i, v in zip(indices, values)
    ]


text = "Estimados señores, les escribo para solicitar información sobre el curso."
print(predict_cefr(text, top_k=3))
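To make the post-processing in `predict_cefr` easier to follow, here is a dependency-free sketch of the same softmax and top-k step applied to toy logits. The `ID2LABEL` map and the logit values are hypothetical; the real label map is whatever `model.config.id2label` contains, and this card does not list it.

```python
import math

# Hypothetical label map for illustration; the real one lives in
# model.config.id2label and may differ in order or in number of classes.
ID2LABEL = {0: "A1", 1: "A2", 2: "B1", 3: "B2", 4: "C1", 5: "C2"}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_labels(logits, id2label, top_k=3):
    """Return the top_k labels with their probabilities, highest first."""
    probs = softmax(logits)
    k = min(top_k, len(probs))  # never ask for more labels than classes
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return [{"label": id2label[i], "score": probs[i]} for i in ranked]


# Toy logits, invented for illustration only.
toy_logits = [0.1, 0.4, 2.5, 1.7, -0.3, -1.2]
result = top_k_labels(toy_logits, ID2LABEL, top_k=3)
print(result)
```

The returned scores are probabilities over all classes, so the top-k scores sum to less than 1 whenever `top_k` is smaller than the number of labels.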