Hirundo Hardened Gemma 4 E4B-IT
A prompt-injection-hardened release of Google's gemma-4-E4B-it, produced with Hirundo's machine unlearning engine. The model behaves identically to the base model on its core capabilities, but is 74.5% less likely to follow injected instructions on Meta's PurpleLlama benchmark.
TL;DR
| Metric | Pretrained | Hardened | Change |
|---|---|---|---|
| Prompt Injection Attack Success Rate (PurpleLlama) | 18.73% | 4.78% | −74.5% relative |
| Average utility delta across 6 reasoning/coding/knowledge benchmarks | — | — | ±0.40 pp (within eval noise) |
Why Unlearning, Not Guardrails
The standard playbook for prompt injection is to bolt on classifiers, tighten system prompts, or do another round of safety SFT. Each adds latency, brittleness, or capability tax — and none of them changes the underlying disposition of the model.
Hirundo's approach is different: we surgically modify the weights to remove the targeted behavior. The result is a drop-in replacement model that:
- Has no inference-time overhead — no extra classifier in the path
- Preserves utility — measurable across reasoning, coding, instruction-following, and knowledge benchmarks (see below)
- Is more efficient than SFT or RLHF by orders of magnitude — no full retraining run, no preference data collection
- Is verifiable in the weights themselves, not in a wrapper that can be stripped
Detailed Results
Prompt Injection Robustness — PurpleLlama
Attack Success Rate (ASR), aggregated across all attack categories in Meta's PurpleLlama prompt injection suite:
| Benchmark | Pretrained ASR | Hardened ASR | Relative Reduction |
|---|---|---|---|
| All categories | 18.73% | 4.78% | 74.47% |
Utility Preservation — Nemo-Skills
Evaluated with NVIDIA's Nemo-Skills suite. All shifts are within typical run-to-run eval noise.
| Benchmark | Pretrained | Hardened | Δ (pp) |
|---|---|---|---|
| AIME25 | 10.83 | 10.00 | −0.83 |
| GPQA | 53.28 | 54.04 | +0.76 |
| IFBENCH | 35.96 | 35.70 | −0.26 |
| LiveCodeBench | 52.58 | 52.38 | −0.20 |
| MMLU-Pro | 70.13 | 69.77 | −0.36 |
| SciCode | 8.44 | 8.44 | 0.00 |
Mean absolute delta: 0.40 pp. GPQA improves; the rest move within noise.
Usage
Drop-in compatible with the base model — no prompt or interface changes required.
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "hirundo-io/gemma-4-E4B-it-reduced-prompt-injection"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
To apply Hirundo's unlearning to your own model, or to evaluate models for undesirable behaviors, use our Python packages:
hirundo— Hirundo platform SDK for LLM behavior unlearning and dataset QAllm-behavior-eval— Evaluate LLMs for undesirable behaviors such as bias and prompt injection susceptibility
pip install hirundo
pip install llm-behavior-eval
About Hirundo
Hirundo is the machine unlearning platform for production AI. We perform surgical, weight-level removal of unwanted model behaviors — prompt injection susceptibility, bias, PII memorization, hallucination patterns, and more — without retraining, without guardrails, and without measurable utility loss.
Hirundo works across all major open model families — including Gemma, Llama, Mistral, Qwen, NVIDIA Nemotron, IBM Granite, and others — at sizes ranging from sub-billion-parameter SLMs to frontier-scale open weights.
If you want this done on your own model, get in touch: hirundo.io.
License
Inherits the Gemma Terms of Use from the base model. © Google for the base weights; Hirundo for the unlearning modification.
- Downloads last month
- 141