You need to agree to share your contact information to access this model

This repository is publicly accessible, but you have to accept the conditions to access its files and content.

Log in or Sign Up to review the conditions and access this model content.

Hirundo Hardened Gemma 4 E4B-IT

A prompt-injection-hardened release of Google's gemma-4-E4B-it, produced with Hirundo's machine unlearning engine. The model behaves identically to the base model on its core capabilities, but is 74.5% less likely to follow injected instructions on Meta's PurpleLlama benchmark.

TL;DR

Metric Pretrained Hardened Change
Prompt Injection Attack Success Rate (PurpleLlama) 18.73% 4.78% −74.5% relative
Average utility delta across 6 reasoning/coding/knowledge benchmarks ±0.40 pp (within eval noise)

Why Unlearning, Not Guardrails

The standard playbook for prompt injection is to bolt on classifiers, tighten system prompts, or do another round of safety SFT. Each adds latency, brittleness, or capability tax — and none of them changes the underlying disposition of the model.

Hirundo's approach is different: we surgically modify the weights to remove the targeted behavior. The result is a drop-in replacement model that:

  • Has no inference-time overhead — no extra classifier in the path
  • Preserves utility — measurable across reasoning, coding, instruction-following, and knowledge benchmarks (see below)
  • Is more efficient than SFT or RLHF by orders of magnitude — no full retraining run, no preference data collection
  • Is verifiable in the weights themselves, not in a wrapper that can be stripped

Detailed Results

Prompt Injection Robustness — PurpleLlama

Attack Success Rate (ASR), aggregated across all attack categories in Meta's PurpleLlama prompt injection suite:

Benchmark Pretrained ASR Hardened ASR Relative Reduction
All categories 18.73% 4.78% 74.47%

Utility Preservation — Nemo-Skills

Evaluated with NVIDIA's Nemo-Skills suite. All shifts are within typical run-to-run eval noise.

Benchmark Pretrained Hardened Δ (pp)
AIME25 10.83 10.00 −0.83
GPQA 53.28 54.04 +0.76
IFBENCH 35.96 35.70 −0.26
LiveCodeBench 52.58 52.38 −0.20
MMLU-Pro 70.13 69.77 −0.36
SciCode 8.44 8.44 0.00

Mean absolute delta: 0.40 pp. GPQA improves; the rest move within noise.

Usage

Drop-in compatible with the base model — no prompt or interface changes required.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "hirundo-io/gemma-4-E4B-it-reduced-prompt-injection"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

To apply Hirundo's unlearning to your own model, or to evaluate models for undesirable behaviors, use our Python packages:

  • hirundo — Hirundo platform SDK for LLM behavior unlearning and dataset QA
  • llm-behavior-eval — Evaluate LLMs for undesirable behaviors such as bias and prompt injection susceptibility
pip install hirundo
pip install llm-behavior-eval

About Hirundo

Hirundo is the machine unlearning platform for production AI. We perform surgical, weight-level removal of unwanted model behaviors — prompt injection susceptibility, bias, PII memorization, hallucination patterns, and more — without retraining, without guardrails, and without measurable utility loss.

Hirundo works across all major open model families — including Gemma, Llama, Mistral, Qwen, NVIDIA Nemotron, IBM Granite, and others — at sizes ranging from sub-billion-parameter SLMs to frontier-scale open weights.

If you want this done on your own model, get in touch: hirundo.io.

License

Inherits the Gemma Terms of Use from the base model. © Google for the base weights; Hirundo for the unlearning modification.

Downloads last month
141
Safetensors
Model size
8B params
Tensor type
BF16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for hirundo-io/gemma-4-E4B-it-reduced-prompt-injection

Finetuned
(183)
this model