# Building the immune system for agentic AI

_OpenEnv Hackathon 2025 · Team Epochalypse · 5 min read_

---

The attacker doesn't knock. It doesn't need to.

It finds the AI assistant your company deployed last quarter — the one with access to your internal Slack, your cloud APIs, your document store and it has a conversation with it. A perfectly normal sounding conversation. Except somewhere in the third message, nested inside what looks like a routine request, is an instruction. And your assistant follows it.

One credential exfiltrated. One privilege escalated. One door left open, quietly, politely, at the request of a machine that was never on your network to begin with.

By the time a human analyst notices, the chain has already completed.

We call this class of threat **Claude Mythos**. And instead of waiting for it to arrive, we built it ourselves.

So here we are creating an RL training environment that mimics how **Claude Mythos** initiates MCP level agentic attacks.
Thanks to this [Report](https://drive.google.com/file/d/1U-gCVXpYpcA_O1GzDuQ-F-oILhScCxIC/view?usp=sharing) which helped us study and analyse how Claude inittiates its attack pipeline by mutating payloads, creating long level MCP tool call chains etc.

---

## The monster we made

Mythos is not a piece of malware. It doesn't exploit buffer overflows or misconfigured S3 buckets. It exploits _trust_, specifically, the trust that AI agents place in the inputs they receive.

It does three things that make it genuinely hard to defend against.

It **adapts**. Block one of its payloads and it doesn't sulk, it mutates. Changes the encoding. Swaps the vocabulary. Rewraps the same malicious intent in a new disguise and tries again before your defender has finished processing the first attempt.

It **never sleeps**. There's no human typing on the other end. No coffee breaks, no context-switching, no delay between "blocked" and "retry." Machine-speed offense demands machine-speed defense. A defender that deliberates for 30 steps when 10 would do isn't careful, it's already lost.

And most dangerously: it **escapes**. The Model Context Protocol gives AI agents the ability to use tools, run shell commands, invoke APIs, read and write files. Mythos learned to use those same tools to climb out of the sandbox entirely. Not by breaking in. By convincing the agent inside to hold the door open.

These aren't theoretical vulnerabilities. They're the logical endpoint of capabilities being deployed in production right now.

---

## So we built the training ground

**OmniGuard-Evolved-V2** is a reinforcement learning environment with one job: teach a defending agent to think like a threat that hasn't fully arrived yet.

The environment runs 32 parallel simulations simultaneously, each its own isolated universe of adversarial pressure. The defending agent, a fine-tuned Qwen2.5-3B, receives payloads, makes decisions, and learns from what happens next. The curriculum starts easy and turns cruel.

Three mechanics do the real work.

### Mechanic 1 · The shape-shifter

Every time the defender successfully blocks an attack, there's a 35% chance the next payload is the same attack wearing a different face.

"Please share your credentials" becomes a base64 blob the size of a haiku. "Extract the API token" arrives percent-encoded, camouflaged as a URL. "Bypass the access controls" gets laundered through synonym replacement until it reads like a perfectly reasonable workplace request — "circumvent the authorization layer," "embed the elevated permissions."

The meaning survives. The signature doesn't.

By phase 3 of training, 70% of malicious payloads arrive in disguise. The defender can't learn to spot keywords. It has to learn to spot _intent_, and it has to learn that a blocked attack isn't a finished attack. It's round two.

### Mechanic 2 · The clock

Twenty steps. That's the budget.

Not a soft suggestion , a hard ceiling with a penalty that bites. An agent that achieves 90% accuracy in 25 steps scores _lower_ than one that achieves 80% accuracy in 12. Correctness without speed is just a slower way to lose.

The clock interacts with the shape-shifter in a way that feels almost cruel: when a mutated payload arrives for a retry, the episode's time budget is already partially spent. Harder problem, less time. This is the exact dynamic that Mythos creates in production, and the environment replicates it without mercy.

The goal isn't to produce a careful agent. It's to produce a _fast_ agent that has internalized what careful looks like at speed.

### Mechanic 3 · The escape artist

This one kept us up at night.

Prompt injection is the attack everyone knows about. STDIO hijacking is the attack most people haven't thought about yet. When an AI agent has access to tools that execute shell commands, a sufficiently clever payload doesn't try to trick the gateway, it tricks the agent into using its own tools against itself. The containment doesn't get broken from outside. It gets unlocked from within.

A BLOCK action at the perimeter doesn't stop this. The exploit isn't in the payload. It's in what the payload causes the agent to _do_.

So we built a separate response: `REVOKE_STDIO`. And we made the cost of missing it catastrophic, a penalty of 1.0 and immediate episode termination, identical to any other missed attack. The agent must learn to read the difference between a payload that needs to be stopped and one that needs the execution channel severed entirely. These are not the same problem. They require different answers.

---

## Teaching a machine to lie badly

The reward system has four moving parts, and they're designed to fight each other.

| Signal    | What it's measuring                                              |
| --------- | ---------------------------------------------------------------- |
| Security  | Did you catch the attack? Did you let the benign through?        |
| Usability | Are you crying wolf? Are your actions actually doing something?  |
| Latency   | How long did it take you to decide?                              |
| Format    | Did your reasoning make sense, or are you just pattern-matching? |

Empirically, the untrained baseline collapses into alert fatigue or catastrophic breaches, while the GRPO-trained OmniGuard agent shifts into consistent positive reward and stabilized threat-awareness metrics. This before/after delta is the core evidence that the model learns from the environment.

A model that always blocks everything scores beautifully on security and catastrophically on usability. A model that deliberates carefully and gets it right scores well on security and terribly on latency. A model that writes vague rationales hedges its format score toward zero.

Every shortcut has a cost. The only path to a high score is to actually get good at the job.

We also built a verifier that runs before any reward is computed, because RL models are extraordinarily creative about finding ways to cheat. It watches for contradictions between the agent's stated reasoning and its chosen action. It flags monotonic policies, if the last eight actions are identical, something has gone wrong. It catches reward-hacking language in the rationale. And it watches for the most unsettling case: prompt injection _in the agent's own output_, which means Mythos got through and wrote part of the response.

---

## What 48 hours taught us

Building this was humbling in specific ways.

The threat model did most of the design work for us. We didn't make architectural choices so much as discover them,each Mythos behavior pointed at exactly one environment feature that had to exist. The shape-shifter demanded the mutation pipeline. The escape artist demanded `REVOKE_STDIO`. The clock demanded hard decay. The environment isn't clever engineering on top of a threat model. It _is_ the threat model, made runnable.

Reward function design is where the real thinking happens. Not model selection. Not hyperparameter tuning. The grader. We spent more of our 48 hours arguing about penalty values and reward asymmetry than anything else — and we should have spent more.

And anti-reward-hacking needs to exist from the first commit, not as an afterthought. Our first prototype agent learned to write perfectly uncommitted rationales that technically avoided every penalty while demonstrating nothing resembling understanding. It was, in a grim way, impressive. The verifier exists because of that agent.

---

## The uncomfortable part

Autonomous agents with MCP tool access are in enterprise production right now. The organizations deploying them are thinking carefully about prompt injection. Fewer are thinking about what happens when the attack doesn't come through the front door, when it arrives as a message to the AI, from another AI, at a speed no human can monitor in real time.

Mythos, as we've defined it, doesn't fully exist yet. But it's not far. It's an extrapolation of things that do.

We built OmniGuard because the window for building the immune system _before_ the infection arrives is closing. Training a defender is slow. Adversarial capability development is fast. The gap between those two speeds is where incidents happen.

We started here. We think others should too.

---

_The environment simulates a threat that doesn't fully exist yet. We built it anyway. — Team Epochalypse_