TranslateGemma-4B GRPOv2 for Spanish-Valencian

Model Summary

guerreropaula/translategemma4b-grpov2-es-va is the best-performing model in the spanish-valencian-mt-rl collection on corpus-level automatic MT metrics. It starts from the SFT checkpoint (guerreropaula/translategemma4b-sft-es-va) and applies Group Relative Policy Optimization (GRPO) with a composite reward tailored to low-resource dialectal translation.

The reward combines adequacy, quality estimation, lexical diversity, and anti-copy behavior:

chrF reward
COMET reward
type-token ratio reward
source-copy penalty

Model Details

Model ID: guerreropaula/translategemma4b-grpov2-es-va
Collection: guerreropaula/spanish-valencian-mt-rl
Developed by: Paula Guerrero Castello
Initialization checkpoint: guerreropaula/translategemma4b-sft-es-va
Original base model: google/translategemma-4b-it
Task: Spanish to Valencian machine translation
License for model weights: Gemma license

Intended Use

This model is intended for:

use as the main ES-VA system reported in the EAMT 2026 submission
research on reward design for low-resource dialectal MT
comparison against the SFT and classifier-guided GRPO models

It is not intended for:

uncontrolled deployment in high-stakes domains
translation directions beyond Spanish to Valencian
applications that require stable terminology control without post-editing

Training Data

GRPOv2 uses gplsi/amic_parallel.

Training samples: 10,000
Validation split: 2%
Validation samples used during periodic model selection: 200
Source column: ES
Target column: VA

Training Procedure

GRPOv2 continues from the SFT checkpoint with Group Relative Policy Optimization and a composite reward.

Optimizer: paged_adamw_8bit
Learning rate: 5e-6
Batch size: 1
Gradient accumulation: 16
Max steps: 200
Warmup steps: 20
Number of generations per prompt: 4
Max completion length: 128
GRPO beta: 0.04
GRPO epsilon: 0.2
Scheduler: cosine
Precision: bf16 when supported, otherwise fp16
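
To make the role of the group size and the epsilon value more concrete, the following is a minimal sketch of the group-relative advantage and clipped objective as they are usually described for GRPO. It is not taken from this repository's training code. The function names, the use of the population standard deviation, and the small stabilizing constant are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group of completions for the same prompt.

    Each completion is scored relative to the other completions in its group
    (4 per prompt in this model card). `eps` is an assumed constant that
    avoids division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_objective(ratio, advantage, epsilon=0.2):
    """PPO-style clipped surrogate for a single token or sequence.

    `ratio` is the new-policy to old-policy probability ratio, and `epsilon`
    is the clipping range (0.2 in the hyperparameters listed above).
    """
    clipped_ratio = min(max(ratio, 1.0 - epsilon), 1.0 + epsilon)
    return min(ratio * advantage, clipped_ratio * advantage)


# Four hypothetical rewards for one Spanish source sentence.
advantages = group_relative_advantages([0.87, 0.60, 0.75, 0.40])
```

Completions that score above their group mean receive a positive advantage and are reinforced; those below receive a negative one. In the usual formulation, beta (0.04 here) additionally weights a KL penalty toward the reference policy, which this sketch leaves out.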

Composite reward weights:

chrF: 0.5
COMET: 0.3
TTR: 0.2
copy penalty: applied when the output copies the Spanish source too closely
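
The weighting above can be sketched as a single scalar reward. This is an illustrative sketch only, not the repository's reward function. It assumes chrF and COMET are already computed and scaled to the range 0 to 1, that TTR uses whitespace tokens, and that the copy check is a token-overlap ratio. The `copy_threshold` and `copy_penalty` values are hypothetical, since the card does not state them.

```python
def type_token_ratio(text: str) -> float:
    """Unique tokens divided by total tokens (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def copy_ratio(source: str, hypothesis: str) -> float:
    """Fraction of hypothesis tokens that also appear in the source."""
    source_tokens = set(source.lower().split())
    hyp_tokens = hypothesis.lower().split()
    if not hyp_tokens:
        return 0.0
    return sum(tok in source_tokens for tok in hyp_tokens) / len(hyp_tokens)


def composite_reward(chrf, comet, source, hypothesis,
                     copy_threshold=0.9, copy_penalty=1.0):
    """Weighted sum 0.5*chrF + 0.3*COMET + 0.2*TTR, minus a copy penalty.

    `chrf` and `comet` are assumed to be in [0, 1]. The threshold and
    penalty magnitude are hypothetical placeholders.
    """
    reward = 0.5 * chrf + 0.3 * comet + 0.2 * type_token_ratio(hypothesis)
    if copy_ratio(source, hypothesis) >= copy_threshold:
        reward -= copy_penalty
    return reward


source = "El gato duerme en la casa"
translated = composite_reward(0.8, 0.9, source, "El gat dorm a la casa")
copied = composite_reward(0.6, 0.5, source, source)
```

Because Spanish and Valencian share a large part of their vocabulary, a real copy detector would need a fairly high threshold, or a character-level comparison, to avoid penalizing legitimate translations.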

Evaluation

The model was evaluated on 1,000 sentences from gplsi/ES-VA_translation_test.

Metric	Score
chrF	84.68
BLEU	62.16
TER	20.63
BLEURT	0.544
COMET	0.936
Dialectal Valencian Score	36.2%

This is the strongest overall model in the repository on corpus-level automatic MT metrics. It outperforms the SFT model on chrF, BLEU, TER, BLEURT, and COMET, while retaining a Dialectal Valencian Score of 36.2%, which is still below the SFT checkpoint (see Limitations).
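
For readers unfamiliar with chrF, the sketch below shows the idea behind the metric: an F-score over character n-grams, with recall weighted more heavily than precision. It is a simplified sentence-level version written for illustration. It is not the implementation used to produce the scores above, and it will not reproduce the numbers of a standard toolkit, which aggregates statistics at corpus level and handles whitespace differently.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams after removing spaces (a simplification)."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str,
                max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified sentence-level chrF on a 0 to 100 scale.

    Precision and recall are averaged over n-gram orders 1..max_n,
    then combined into an F-beta score (beta=2 favors recall).
    """
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # sentence too short for this n-gram order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return 100 * (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


# Hypothetical hypothesis and reference pair.
score = simple_chrf("El gat dorm a la casa", "El gat dorm a casa")
```

Character-level matching gives partial credit for near-miss word forms, which is one reason chrF is often preferred over word-level metrics for morphologically rich or closely related language pairs.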

How To Use

In this repository, GRPOv2 is loaded as the base TranslateGemma model plus the GRPOv2 adapter:

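# Repository-local modules: Config holds the model IDs, and utils.model
# provides the quantization config and tokenizer helpers.
from config import Config
from utils.model import build_bnb_config, load_base_tokenizer
from transformers import AutoModelForCausalLM
from peft import PeftModel

cfg = Config()
bnb = build_bnb_config(cfg)  # quantization settings passed to from_pretrained
tokenizer = load_base_tokenizer(cfg)

# Load the quantized base TranslateGemma model.
base_model = AutoModelForCausalLM.from_pretrained(
    cfg.base_model_id,
    quantization_config=bnb,
    device_map="auto",
    use_safetensors=True,
)

# Attach the GRPOv2 adapter on top of the base model.
model = PeftModel.from_pretrained(base_model, cfg.grpov2_model_id)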

Limitations

Dialectal Valencian usage remains below the SFT checkpoint on the repository's handcrafted feature score.
COMET is used as part of the reward and may bias training toward its own preferences.
The model is evaluated on a public 1,000-sentence test set, not a large multi-domain benchmark.
Reward optimization can improve average metrics while still failing on individual sentences.

License

This model is distributed under the Gemma license inherited from google/translategemma-4b-it. Users should verify compatibility with the dataset licenses and their own deployment requirements.

Citation

@inproceedings{guerrero-2026-enhancing,
  title     = {Enhancing LLM Translation Performance for Spanish-Valencian through Supervised Fine-tuning and Reinforcement Learning},
  author    = {Guerrero Castello, Paula},
  booktitle = {Proceedings of the 25th Annual Conference of the European Association for Machine Translation},
  year      = {2026}
}