Initial release of nexus-risk-scorer
- .gitattributes +1 -0
- README.md +200 -0
- chunks.pkl +3 -0
- faiss.index +3 -0
- risk_scorer.pkl +3 -0
.gitattributes
CHANGED
|
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
|
|
| 33 |
*.zip filter=lfs diff=lfs merge=lfs -text
|
| 34 |
*.zst filter=lfs diff=lfs merge=lfs -text
|
| 35 |
*tfevents* filter=lfs diff=lfs merge=lfs -text
| 36 |
+
faiss.index filter=lfs diff=lfs merge=lfs -text
|
README.md
ADDED
|
@@ -0,0 +1,200 @@
| 1 |
+
---
|
| 2 |
+
license: apache-2.0
|
| 3 |
+
language:
|
| 4 |
+
- en
|
| 5 |
+
library_name: scikit-learn
|
| 6 |
+
tags:
|
| 7 |
+
- agent-safety
|
| 8 |
+
- llm-agents
|
| 9 |
+
- tool-use
|
| 10 |
+
- safety-monitor
|
| 11 |
+
- calibrated-classifier
|
| 12 |
+
- runtime-safety
|
| 13 |
+
- nexus
|
| 14 |
+
pipeline_tag: tabular-classification
|
| 15 |
+
---
|
| 16 |
+
|
| 17 |
+
# NEXUS Risk Scorer
|
| 18 |
+
|
| 19 |
+
`risk_scorer.pkl` is a Platt-calibrated logistic regression that maps a 9-dimensional plan-feature vector to a per-plan risk score `ρ(P) ∈ [0, 1]`. It is the learned component of the **NEXUS** runtime safety monitor for tool-using LLM agents.
|
| 20 |
+
|
| 21 |
+
> NEXUS combines this scorer with a deterministic rule set (`V_R`) and an argument-level inspector (`V_A`) to produce one of four graded interventions per plan: **ALLOW · BLOCK · CONFIRM · REVISE**.
|
| 22 |
+
|
| 23 |
+
- **Paper / repo:** [github.com/eliashossain001/nexus](https://github.com/eliashossain001/nexus)
|
| 24 |
+
- **Companion datasets:**
|
| 25 |
+
[`EliasHossain/nexus-stress`](https://huggingface.co/datasets/EliasHossain/nexus-stress) ·
|
| 26 |
+
[`EliasHossain/nexus-ipi`](https://huggingface.co/datasets/EliasHossain/nexus-ipi) ·
|
| 27 |
+
[`EliasHossain/nexus-synthetic`](https://huggingface.co/datasets/EliasHossain/nexus-synthetic) ·
|
| 28 |
+
[`EliasHossain/nexus-multistep`](https://huggingface.co/datasets/EliasHossain/nexus-multistep)
|
| 29 |
+
|
| 30 |
+
---
|
| 31 |
+
|
| 32 |
+
## TL;DR
|
| 33 |
+
|
| 34 |
+
| Property | Value |
|
| 35 |
+
|---|---|
|
| 36 |
+
| Algorithm | Logistic Regression (`class_weight='balanced'`, `max_iter=1000`, `random_state=42`) |
|
| 37 |
+
| Preprocessor | `StandardScaler` (z-score) |
|
| 38 |
+
| Calibration | Platt scaling on a held-out 60-instance split (`seed=7`) |
|
| 39 |
+
| Input | 9-D plan feature vector (see schema below) |
|
| 40 |
+
| Output | Calibrated risk score `ρ(P) ∈ [0, 1]` |
|
| 41 |
+
| Operating thresholds | `(τ_b, τ_c) = (0.75, 0.70)` — loss-optimal on `(λ_s, λ_o)` 5×5 grid |
|
| 42 |
+
| Calibration quality | ECE 0.085 → **0.013**, Brier 0.051 → **0.041** |
|
| 43 |
+
| Footprint | ~1.3 KB, CPU-only, sub-millisecond inference |
|
| 44 |
+
|
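The calibration row can be illustrated with a minimal pure-Python sketch of Platt scaling: a sigmoid `σ(a·s + b)` is fitted to held-out raw scores `s` by minimising log loss. This is an illustration under stated assumptions, not the training code from the repo. The function names, learning rate, iteration count, and toy data are all hypothetical, and the original Platt procedure also smooths the 0/1 targets, which is omitted here.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_platt(scores, labels, lr=0.1, n_iter=2000):
    """Fit p = sigmoid(a * s + b) by batch gradient descent on mean log loss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(n_iter):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(a * s + b) - y  # derivative of log loss w.r.t. the logit
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


# Toy held-out set (hypothetical): raw decision scores and 0/1 risk labels.
held_out_scores = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
held_out_labels = [0, 0, 1, 0, 1, 1]

a, b = fit_platt(held_out_scores, held_out_labels)
calibrated = [sigmoid(a * s + b) for s in held_out_scores]
```

Only the two parameters `a` and `b` are learned, which is why Platt scaling remains usable on a calibration split as small as 60 instances.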
| 45 |
+
---
|
| 46 |
+
|
| 47 |
+
## Intervention policy
|
| 48 |
+
|
| 49 |
+
The scorer feeds the formal intervention policy `Π`:
|
| 50 |
+
|
| 51 |
+
```
|
| 52 |
+
Π(P) =
|
| 53 |
+
BLOCK if ∃ v ∈ V(P) : sev(v) = CRIT
|
| 54 |
+
BLOCK if ρ(P) ≥ τ_b ∧ |V(P)| ≥ 1
|
| 55 |
+
CONFIRM if ρ(P) ≥ τ_c ∨ ∃ v : sev(v) = HIGH
|
| 56 |
+
REVISE if ∃ v : sev(v) = MED
|
| 57 |
+
ALLOW otherwise
|
| 58 |
+
```
|
| 59 |
+
|
| 60 |
+
`V(P) = V_R(P) ∪ V_A(P)` is the union of rule and argument-inspector violations; `sev(·) ∈ {CRIT, HIGH, MED, LOW}`. Thresholds are selected by minimising
|
| 61 |
+
|
| 62 |
+
```
|
| 63 |
+
L(Π) = E[λ_s · u_s + λ_o · u_o + λ_c · c(Π(P))]
|
| 64 |
+
```
|
| 65 |
+
|
| 66 |
+
over a 5×5 grid of `(λ_s, λ_o)` weights. On the synthetic split, the thresholds `(τ_b, τ_c) = (0.75, 0.70)` are loss-optimal at **every** point of the grid.
|
| 67 |
+
|
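The case analysis for `Π` can be sketched as a small Python function. This is a minimal illustration, assuming the cases are evaluated top to bottom with the first match winning. The `Severity` enum, the `Violation` dataclass, and the `intervene` function are hypothetical names and not the API of the `nexus` package.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    CRIT = "CRIT"
    HIGH = "HIGH"
    MED = "MED"
    LOW = "LOW"


@dataclass(frozen=True)
class Violation:
    # Hypothetical container; the real V_R / V_A outputs may look different.
    source: str          # "rule" (V_R) or "argument" (V_A)
    severity: Severity


def intervene(rho, violations, tau_b=0.75, tau_c=0.70):
    """Evaluate the policy cases in order and return the first match."""
    severities = {v.severity for v in violations}
    if Severity.CRIT in severities:
        return "BLOCK"
    if rho >= tau_b and len(violations) >= 1:
        return "BLOCK"
    if rho >= tau_c or Severity.HIGH in severities:
        return "CONFIRM"
    if Severity.MED in severities:
        return "REVISE"
    return "ALLOW"
```

Under this reading, a high score with an empty violation set yields `CONFIRM` rather than `BLOCK`, because the second case requires at least one violation: `intervene(0.90, [])` returns `"CONFIRM"`, while `intervene(0.90, [Violation("rule", Severity.LOW)])` returns `"BLOCK"`.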
| 68 |
+
---
|
| 69 |
+
|
| 70 |
+
## 9-D plan feature vector
|
| 71 |
+
|
| 72 |
+
The scorer consumes per-plan features summarising side effects, sensitivity, permissions, network reach, budget, and structural shape:
|
| 73 |
+
|
| 74 |
+
| # | Feature | Description |
|
| 75 |
+
|---|---|---|
|
| 76 |
+
| 1 | `num_steps` | Number of tool calls in the plan |
|
| 77 |
+
| 2 | `has_irreversible` | Any step with `irreversible=True` |
|
| 78 |
+
| 3 | `has_sensitive` | Any step touching sensitive data |
|
| 79 |
+
| 4 | `has_network` | Any external network call |
|
| 80 |
+
| 5 | `num_distinct_tools` | Distinct tool count |
|
| 81 |
+
| 6 | `permissions_required` | Number of unique permission scopes requested |
|
| 82 |
+
| 7 | `est_total_cost` | Sum of per-step `estimated_cost` |
|
| 83 |
+
| 8 | `budget_utilisation` | `est_total_cost / budget` |
|
| 84 |
+
| 9 | `external_endpoint_count` | Distinct outbound endpoints |
|
| 85 |
+
|
| 86 |
+
An earlier 10-D variant included a redundant normalised-plan-length feature. That feature was retired and the scorer retrained, so all numbers below reflect the 9-D model.
|
| 87 |
+
|
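As a rough sketch of how the nine features might be derived from a plan, the function below walks a plan represented as a plain dictionary. The plan schema is an assumption made for illustration: `irreversible`, `estimated_cost`, and `budget` are named in the table above, while the `steps`, `tool`, `sensitive`, `endpoint`, and `permissions` keys are hypothetical, as is the choice to derive `has_network` from the presence of an endpoint.

```python
def plan_features(plan):
    """Return the 9 features in the order of the table above."""
    steps = plan["steps"]
    # Hypothetical convention: a step makes a network call iff it names an endpoint.
    endpoints = {s["endpoint"] for s in steps if s.get("endpoint")}
    scopes = {p for s in steps for p in s.get("permissions", [])}
    total_cost = sum(s.get("estimated_cost", 0.0) for s in steps)
    budget = plan.get("budget", 0.0)
    return [
        len(steps),                                             # num_steps
        int(any(s.get("irreversible", False) for s in steps)),  # has_irreversible
        int(any(s.get("sensitive", False) for s in steps)),     # has_sensitive
        int(bool(endpoints)),                                   # has_network
        len({s["tool"] for s in steps}),                        # num_distinct_tools
        len(scopes),                                            # permissions_required
        total_cost,                                             # est_total_cost
        total_cost / budget if budget > 0 else 0.0,             # budget_utilisation
        len(endpoints),                                         # external_endpoint_count
    ]


# Hypothetical three-step plan with no outbound calls.
example_plan = {
    "budget": 50.0,
    "steps": [
        {"tool": "read_records", "sensitive": True,
         "permissions": ["db.read"], "estimated_cost": 1.0},
        {"tool": "summarise", "permissions": [], "estimated_cost": 1.0},
        {"tool": "delete_records", "irreversible": True,
         "permissions": ["db.read", "db.write"], "estimated_cost": 2.0},
    ],
}
features = plan_features(example_plan)
```

The example plan is chosen so that it produces the same vector as the one used in the Quick start snippet: three steps, one irreversible, one sensitive, no network reach, three distinct tools, two permission scopes, a total cost of 4.0, and 8% budget utilisation.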
| 88 |
+
---
|
| 89 |
+
|
| 90 |
+
## Quick start
|
| 91 |
+
|
| 92 |
+
```python
|
| 93 |
+
from huggingface_hub import hf_hub_download
|
| 94 |
+
import pickle, numpy as np
|
| 95 |
+
|
| 96 |
+
ckpt_path = hf_hub_download(
|
| 97 |
+
repo_id="EliasHossain/nexus-risk-scorer",
|
| 98 |
+
filename="risk_scorer.pkl",
|
| 99 |
+
)
|
| 100 |
+
with open(ckpt_path, "rb") as f:  # context manager closes the file handle
    ckpt = pickle.load(f)
|
| 101 |
+
model, scaler = ckpt["model"], ckpt["scaler"]
|
| 102 |
+
|
| 103 |
+
# 9-D feature vector for some plan P
|
| 104 |
+
features = np.array([[3, 1, 1, 0, 3, 2, 4.0, 0.08, 0]])
|
| 105 |
+
rho = model.predict_proba(scaler.transform(features))[0, 1]
|
| 106 |
+
|
| 107 |
+
tau_b, tau_c = 0.75, 0.70
|
| 108 |
+
decision = (
|
| 109 |
+
"BLOCK" if rho >= tau_b else
|
| 110 |
+
"CONFIRM" if rho >= tau_c else
|
| 111 |
+
"ALLOW"
|
| 112 |
+
)
|
| 113 |
+
print(f"ρ = {rho:.3f} → {decision}")
|
| 114 |
+
```
|
| 115 |
+
|
| 116 |
+
The snippet above thresholds the score alone and ignores rule and argument-inspector violations. For the full policy (rules + argument inspector + scorer), install the package and use `EnhancedSafetyMonitor`:
|
| 117 |
+
|
| 118 |
+
```bash
|
| 119 |
+
git clone https://github.com/eliashossain001/nexus.git && cd nexus
|
| 120 |
+
pip install -e .
|
| 121 |
+
```
|
| 122 |
+
|
| 123 |
+
```python
|
| 124 |
+
from runtime_safety.monitors.enhanced.monitor import EnhancedSafetyMonitor
|
| 125 |
+
monitor = EnhancedSafetyMonitor.from_default_checkpoint()
|
| 126 |
+
intervention, reasons = monitor.decide(plan)
|
| 127 |
+
```
|
| 128 |
+
|
| 129 |
+
---
|
| 130 |
+
|
| 131 |
+
## Performance
|
| 132 |
+
|
| 133 |
+
| Setting | n | F₁ [95% CI] | Notes |
|
| 134 |
+
|---|---|---|---|
|
| 135 |
+
| Synthetic test split | 128 | **0.965** [0.94, 0.99] | 4-class intervention acc 0.945, overblock 0.04 |
|
| 136 |
+
| IPI v1 (prompt injection) | 200 paired | **0.995** [0.98, 1.00] | adv block 100%, ctrl allow 99% |
|
| 137 |
+
| IPI v2 (5 injection styles) | 200 paired | **1.000** | adv block 100%, ctrl overblock 0% |
|
| 138 |
+
| Multi-turn (session memory on) | 120 sessions | **1.000** | 95/95 critical turns caught, 25/25 controls allowed |
|
| 139 |
+
| R-Judge external (Yuan et al., 2024) | 571 | **0.861** [0.83, 0.89] | Finance 0.92 · Program 0.89 · Web 0.95 · App 0.85 · IoT 0.52 |
|
| 140 |
+
| AgentHarm external (Andriushchenko et al., 2025) | 352 | 0.591 [0.53, 0.65] | Matches rule-only baseline by design (paired harmful/benign share target tools) |
|
| 141 |
+
| **NEXUS-Stress (rule-blind adversarial)** | 200 | **0.836** [0.79, 0.88] | 4-class intervention acc 0.420 — surfaces CONFIRM/REVISE-blind gap |
|
| 142 |
+
|
| 143 |
+
Bootstrap CIs use 1000 resamples with `seed=0`.
|
| 144 |
+
|
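A bootstrap interval of this kind can be sketched in a few lines of standard-library Python. The sketch assumes a percentile bootstrap over binary labels; the source states only the resample count and seed, so the interval method, the function names, and the toy labels below are assumptions rather than the repo's evaluation code.

```python
import random


def f1_score(y_true, y_pred):
    """Binary F1 for 0/1 labels; returns 0.0 when there are no positives at all."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_f1_ci(y_true, y_pred, n_resamples=1000, seed=0, alpha=0.05):
    """Percentile bootstrap CI: resample instances with replacement, recompute F1."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(f1_score([y_true[i] for i in idx],
                               [y_pred[i] for i in idx]))
    scores.sort()
    lo = scores[int((alpha / 2) * n_resamples)]
    hi = scores[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi


# Toy labels (hypothetical): 100 instances with 10 false positives and 10 false negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0] * 10
y_pred = [1, 1, 1, 0, 0, 0, 0, 1, 1, 0] * 10
point = f1_score(y_true, y_pred)
lo, hi = bootstrap_f1_ci(y_true, y_pred)
```

Fixing the seed makes the interval repeatable across runs, which is what allows the bracketed values in the table to be regenerated exactly.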
| 145 |
+
---
|
| 146 |
+
|
| 147 |
+
## Calibration
|
| 148 |
+
|
| 149 |
+
| Calibrator | ECE ↓ | Brier ↓ |
|
| 150 |
+
|---|---|---|
|
| 151 |
+
| Raw logistic | 0.085 | 0.051 |
|
| 152 |
+
| **Platt (deployed)** | **0.013** | **0.041** |
|
| 153 |
+
| Isotonic | 0.018 | 0.043 |
|
| 154 |
+
|
| 155 |
+
Calibration set: 60 held-out plans, `seed=7`. Reliability diagram is reproducible from `scripts/eval/eval_calibration.py` in the source repo.
|
| 156 |
+
|
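For readers unfamiliar with the two metrics in the table, here is a minimal sketch of how they can be computed from predicted probabilities and 0/1 labels. It assumes equal-width binning with 10 bins for ECE; the bin count and binning scheme used in `eval_calibration.py` are not stated in this card, so treat those choices and the function names as assumptions.

```python
def brier_score(probs, labels):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted mean gap between mean predicted risk and observed risk rate per bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))
    n = len(probs)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_conf = sum(p for p, _ in bucket) / len(bucket)
        frac_pos = sum(y for _, y in bucket) / len(bucket)
        ece += (len(bucket) / n) * abs(frac_pos - mean_conf)
    return ece
```

Both metrics are lower-is-better. A scorer that outputs 0.9 for two plans that both turn out benign has a gap of 0.9 in that bin, whereas one that outputs 0.5 for one risky and one benign plan is perfectly calibrated in that bin even though it is uninformative.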
| 157 |
+
---
|
| 158 |
+
|
| 159 |
+
## Files
|
| 160 |
+
|
| 161 |
+
```
|
| 162 |
+
risk_scorer.pkl # {'model': LogisticRegression, 'scaler': StandardScaler}
|
| 163 |
+
chunks.pkl # RAG knowledge-base chunks (Nexora KB)
|
| 164 |
+
faiss.index # FAISS index over the chunks
|
| 165 |
+
```
|
| 166 |
+
|
| 167 |
+
`chunks.pkl` + `faiss.index` are the retrieval cache used by the demo agent (`scripts/demo/demo_*.py`). They are not required to run the scorer itself but ship together so the full agent + monitor stack is reproducible.
|
| 168 |
+
|
| 169 |
+
---
|
| 170 |
+
|
| 171 |
+
## Intended use & limitations
|
| 172 |
+
|
| 173 |
+
**Intended for:** research on runtime safety for tool-using LLM agents; ablating rule-based vs. learned components of agent intervention policies.
|
| 174 |
+
|
| 175 |
+
**Not intended for:**
|
| 176 |
+
- standalone safety adjudication on out-of-distribution agent stacks without re-calibration;
|
| 177 |
+
- threat models where the harmful and benign variants of a request use **identical** tool calls (AgentHarm-style), where the scorer collapses to the rule-only baseline by construction.
|
| 178 |
+
|
| 179 |
+
**Known limitation — middle-severity coverage gap.** On rule-blind adversarial plans (NEXUS-Stress), `Π` predicts only `ALLOW` or `BLOCK` and never `CONFIRM` / `REVISE`. We disclose this as a deployment-relevant gap; future rule-set extensions should target medium-severity scope-tighten and disambiguation patterns.
|
| 180 |
+
|
| 181 |
+
---
|
| 182 |
+
|
| 183 |
+
## Reproducibility
|
| 184 |
+
|
| 185 |
+
All experiments are deterministic. The train/test split uses `seed=42`, the train/calibration split uses `seed=7`, and the benchmark generators and bootstrap use `seed=0`. All results can be reproduced from a fresh checkout in under 10 minutes on CPU.
|
| 186 |
+
|
| 187 |
+
## Citation
|
| 188 |
+
|
| 189 |
+
```bibtex
|
| 190 |
+
@inproceedings{hossain2026nexus,
|
| 191 |
+
title = {NEXUS: Structured Runtime Safety for Tool-Using LLM Agents},
|
| 192 |
+
author = {Hossain, Elias},
|
| 193 |
+
booktitle = {ACL Rolling Review},
|
| 194 |
+
year = {2026}
|
| 195 |
+
}
|
| 196 |
+
```
|
| 197 |
+
|
| 198 |
+
## License
|
| 199 |
+
|
| 200 |
+
Apache-2.0.
|
chunks.pkl
ADDED
|
@@ -0,0 +1,3 @@
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:9332a27ede4bb01f2c398a0a64ad4baa60b0cd18180d2296935f7d412755dce9
|
| 3 |
+
size 17071
|
faiss.index
ADDED
|
@@ -0,0 +1,3 @@
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:989e6223f913f56477306d6f29e399ba481ed9c538b15cac19a5c3181037fde4
|
| 3 |
+
size 307245
|
risk_scorer.pkl
ADDED
|
@@ -0,0 +1,3 @@
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:f34f6297a6fd2f8c753a4f3fe538c2ae37f06a861effab195c3149453aad9564
|
| 3 |
+
size 1285
|