TranslateGemma-4B GRPOv1 for Spanish-Valencian

Model Summary

guerreropaula/translategemma4b-grpov1-es-va is a Spanish-to-Valencian translation model trained with GRPO, starting from the SFT checkpoint guerreropaula/translategemma4b-sft-es-va. It is part of the spanish-valencian-mt-rl collection accompanying the EAMT 2026 submission.

GRPOv1 combines a reference-based chrF reward with a naturalness reward produced by a separate classifier that distinguishes human translation (HT) from machine translation (MT). The goal is to improve translation quality while nudging outputs toward more human-like Valencian phrasing.

Model Details

Model ID: guerreropaula/translategemma4b-grpov1-es-va
Collection: guerreropaula/spanish-valencian-mt-rl
Developed by: Paula Guerrero Castello
Initialization checkpoint: guerreropaula/translategemma4b-sft-es-va
Original base model: google/translategemma-4b-it
Task: Spanish to Valencian machine translation
License for model weights: Gemma license
Auxiliary reward model: guerreropaula/ht_mt_classifier_best

Intended Use

This model is intended for:

research on reinforcement learning for low-resource dialectal MT
ablation against SFT and GRPOv2 in the EAMT submission
studying reward shaping with classifier-based translation naturalness signals

It is not intended for:

production use without manual quality control
general-purpose text generation
use cases that require guaranteed dialectal consistency

Training Data

GRPOv1 uses gplsi/amic_parallel.

Training samples: 5,000
Validation split: 2%
Validation samples used during periodic model selection: 200
Source column: ES
Target column: VA

Training Procedure

GRPOv1 continues training from the SFT checkpoint with Group Relative Policy Optimization.
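
As a rough illustration of the "group relative" part, the sketch below normalizes the rewards of the completions sampled for one prompt against their own mean and standard deviation. The function name group_advantages, the use of the population standard deviation, and the eps value are assumptions made for this example; implementations differ in these details, and this card does not specify them.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-4):
    """Normalize per-completion rewards within one prompt's group.

    ASSUMPTION: population standard deviation and eps=1e-4 are
    illustrative choices, not values taken from this repository.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # Completions above the group mean get positive advantages,
    # completions below it get negative ones.
    return [(r - mu) / (sigma + eps) for r in rewards]

# Two completions per prompt, matching the setting used for GRPOv1.
print(group_advantages([0.8, 0.6]))
```

Under these assumptions, a group of two completions yields advantages of roughly +1 and -1 whenever the gap between the two rewards is large relative to eps, and zero for both when the rewards are equal.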

Optimizer: paged_adamw_8bit
Learning rate: 5e-6
Batch size: 1
Gradient accumulation: 8
Max steps: 100
Warmup steps: 20
Number of generations per prompt: 2
Max completion length: 100
GRPO beta: 0.04
Scheduler: cosine
Precision: bf16 when supported, otherwise fp16
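
To make the schedule concrete, the following sketch computes the learning rate at a given optimizer step from the values listed above. It assumes linear warmup from zero followed by cosine decay to zero, which is the default behavior of the cosine scheduler in Hugging Face Transformers; the card does not state which implementation was used, and the name lr_at_step is illustrative.

```python
import math

# Values taken from the hyperparameter list above.
PEAK_LR = 5e-6
WARMUP_STEPS = 20
MAX_STEPS = 100
BATCH_SIZE = 1
GRAD_ACCUM = 8

def lr_at_step(step: int) -> float:
    """Learning rate at an optimizer step.

    ASSUMPTION: linear warmup from 0, then half-cosine decay to 0.
    """
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (MAX_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)  # clamp steps beyond MAX_STEPS
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))

# Sequences accumulated per optimizer step on a single device.
EFFECTIVE_BATCH = BATCH_SIZE * GRAD_ACCUM

for step in (0, 10, 20, 60, 100):
    print(step, lr_at_step(step))
```

Under these assumptions the rate reaches its peak at step 20, falls to half the peak at step 60, and reaches zero at step 100.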

Reward definition:

chrF of the generated hypothesis, computed against the reference translation
P(HT | text), the probability assigned by the fine-tuned classifier guerreropaula/ht_mt_classifier_best that the output is a human translation
classifier weight annealed linearly from 0 to 0.3 over the first 50 steps, then held at 0.3
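
The annealing schedule can be sketched as follows. The weight function follows the description above; how the two terms are combined is not spelled out in this card, so the additive form in combined_reward, the assumption that chrF is rescaled to [0, 1], and both function names are illustrative only.

```python
MAX_CLASSIFIER_WEIGHT = 0.3  # final weight of the classifier term
ANNEAL_STEPS = 50            # steps over which the weight ramps up

def classifier_weight(step: int) -> float:
    """Linearly ramp the classifier weight from 0 to 0.3, then hold."""
    frac = min(max(step, 0), ANNEAL_STEPS) / ANNEAL_STEPS
    return MAX_CLASSIFIER_WEIGHT * frac

def combined_reward(chrf: float, p_ht: float, step: int) -> float:
    """Combine the two reward terms at a given training step.

    ASSUMPTION: additive combination with chrF on a [0, 1] scale;
    the training code may scale or mix the terms differently.
    """
    return chrf + classifier_weight(step) * p_ht

for step in (0, 25, 50, 100):
    print(step, classifier_weight(step), combined_reward(0.8, 0.5, step))
```

With this schedule the classifier contributes nothing at step 0, so early updates are driven by chrF alone, and the naturalness signal is phased in over the first half of the 100 training steps.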

Evaluation

The model was evaluated on 1,000 sentences from gplsi/ES-VA_translation_test.

Metric	Score
chrF	81.65
BLEU	56.94
TER	23.96
BLEURT	0.481
COMET	0.926
Dialectal Valencian Score	15.9%

Relative to the SFT checkpoint, GRPOv1 did not improve the final corpus-level translation metrics reported in this repository, and it also lowered the Dialectal Valencian Score.

How To Use

The evaluation script in this repository loads GRPOv1 as a standalone causal LM checkpoint:

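# Config and build_bnb_config are helpers from this repository, not from transformers.
from config import Config
from utils.model import build_bnb_config
from transformers import AutoTokenizer, AutoModelForCausalLM

cfg = Config()
bnb = build_bnb_config(cfg)  # quantization settings built from the repository config

tokenizer = AutoTokenizer.from_pretrained(cfg.grpov1_model_id)
model = AutoModelForCausalLM.from_pretrained(
    cfg.grpov1_model_id,
    quantization_config=bnb,  # load the weights with the quantization settings above
    device_map="auto",        # place layers on the available devices automatically
    use_safetensors=True,     # load weights from the safetensors format
)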

Limitations

The classifier reward is only an indirect proxy for translation naturalness.
Improvements in reward can diverge from downstream MT metrics.
The model remains sensitive to the base model's Catalan-centric prior.
The reinforcement learning stage uses only 5,000 training examples.

License

This model is distributed under the Gemma license inherited from the TranslateGemma base model family. Users should additionally review the licenses of the datasets and the auxiliary classifier used during training.

Citation

@inproceedings{guerrero-2026-enhancing,
  title     = {Enhancing LLM Translation Performance for Spanish-Valencian through Supervised Fine-tuning and Reinforcement Learning},
  author    = {Guerrero Castello, Paula},
  booktitle = {Proceedings of the 25th Annual Conference of the European Association for Machine Translation},
  year      = {2026}
}