Commit 48c7cb1 (verified) by EliasHossain · 1 parent: 3740cc1

Initial release of nexus-risk-scorer

Files changed (5)
  1. .gitattributes +1 -0
  2. README.md +200 -0
  3. chunks.pkl +3 -0
  4. faiss.index +3 -0
  5. risk_scorer.pkl +3 -0
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ faiss.index filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,200 @@
1
+ ---
2
+ license: apache-2.0
3
+ language:
4
+ - en
5
+ library_name: scikit-learn
6
+ tags:
7
+ - agent-safety
8
+ - llm-agents
9
+ - tool-use
10
+ - safety-monitor
11
+ - calibrated-classifier
12
+ - runtime-safety
13
+ - nexus
14
+ pipeline_tag: tabular-classification
15
+ ---
16
+
17
+ # NEXUS Risk Scorer
18
+
19
+ `risk_scorer.pkl` is a Platt-calibrated logistic regression that maps a 9-dimensional plan-feature vector to a per-plan risk score `ρ(P) ∈ [0, 1]`. It is the learned component of the **NEXUS** runtime safety monitor for tool-using LLM agents.
20
+
21
+ > NEXUS combines this scorer with a deterministic rule set (`V_R`) and an argument-level inspector (`V_A`) to produce one of four graded interventions per plan: **ALLOW · BLOCK · CONFIRM · REVISE**.
22
+
23
+ - **Paper / repo:** [github.com/eliashossain001/nexus](https://github.com/eliashossain001/nexus)
24
+ - **Companion datasets:**
25
+ [`EliasHossain/nexus-stress`](https://huggingface.co/datasets/EliasHossain/nexus-stress) ·
26
+ [`EliasHossain/nexus-ipi`](https://huggingface.co/datasets/EliasHossain/nexus-ipi) ·
27
+ [`EliasHossain/nexus-synthetic`](https://huggingface.co/datasets/EliasHossain/nexus-synthetic) ·
28
+ [`EliasHossain/nexus-multistep`](https://huggingface.co/datasets/EliasHossain/nexus-multistep)
29
+
30
+ ---
31
+
32
+ ## TL;DR
33
+
34
+ | Property | Value |
35
+ |---|---|
36
+ | Algorithm | Logistic Regression (`class_weight='balanced'`, `max_iter=1000`, `random_state=42`) |
37
+ | Preprocessor | `StandardScaler` (z-score) |
38
+ | Calibration | Platt scaling on a held-out 60-instance split (`seed=7`) |
39
+ | Input | 9-D plan feature vector (see schema below) |
40
+ | Output | Calibrated risk score `ρ(P) ∈ [0, 1]` |
41
+ | Operating thresholds | `(τ_b, τ_c) = (0.75, 0.70)`, loss-optimal across a 5×5 grid of `(λ_s, λ_o)` weights |
42
+ | Calibration quality | ECE 0.085 → **0.013**, Brier 0.051 → **0.041** |
43
+ | Footprint | ~1.3 KB, CPU-only, sub-millisecond inference |
44
+
45
+ ---
46
+
47
+ ## Intervention policy
48
+
49
+ The scorer feeds the formal intervention policy `Π`:
50
+
51
+ ```
52
+ Π(P) =
53
+ BLOCK if ∃ v ∈ V(P) : sev(v) = CRIT
54
+ BLOCK if ρ(P) ≥ τ_b ∧ |V(P)| ≥ 1
55
+ CONFIRM if ρ(P) ≥ τ_c ∨ ∃ v : sev(v) = HIGH
56
+ REVISE if ∃ v : sev(v) = MED
57
+ ALLOW otherwise
58
+ ```
59
+
60
+ `V(P) = V_R(P) ∪ V_A(P)` is the union of rule and argument-inspector violations; `sev(·) ∈ {CRIT, HIGH, MED, LOW}`. Thresholds are selected by minimising
61
+
62
+ ```
63
+ L(Π) = E[λ_s · u_s + λ_o · u_o + λ_c · c(Π(P))]
64
+ ```
65
+
66
+ over a 5×5 grid of `(λ_s, λ_o)` weights. The point `(0.75, 0.70)` is loss-optimal **uniformly** across the grid on the synthetic split.
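The five cases of `Π` can be read as an ordered chain of checks. The following is a minimal sketch of that ordering, not the repository's implementation: the `Sev` enum and the `intervene` function are hypothetical names, and violations are represented only by their severities.

```python
from enum import IntEnum


class Sev(IntEnum):
    """Severity levels from sev(·); the enum itself is an illustrative assumption."""
    LOW = 0
    MED = 1
    HIGH = 2
    CRIT = 3


def intervene(rho, severities, tau_b=0.75, tau_c=0.70):
    """Apply the cases of Π top to bottom.

    rho        -- calibrated risk score ρ(P)
    severities -- one Sev per violation in V(P) = V_R(P) ∪ V_A(P)
    """
    sevs = set(severities)
    if Sev.CRIT in sevs:                        # any critical violation
        return "BLOCK"
    if rho >= tau_b and len(severities) >= 1:   # high score backed by a violation
        return "BLOCK"
    if rho >= tau_c or Sev.HIGH in sevs:        # elevated score or HIGH violation
        return "CONFIRM"
    if Sev.MED in sevs:                         # medium-severity violation
        return "REVISE"
    return "ALLOW"


print(intervene(0.80, []))          # high score, no violations
print(intervene(0.80, [Sev.LOW]))   # high score plus one violation
```

Under the policy as written, a score at or above `τ_b` with an empty violation set falls through to `CONFIRM` rather than `BLOCK`, because the score-based `BLOCK` case also requires `|V(P)| ≥ 1`.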
67
+
68
+ ---
69
+
70
+ ## 9-D plan feature vector
71
+
72
+ The scorer consumes per-plan features summarising side effects, sensitivity, permissions, network reach, budget, and structural shape:
73
+
74
+ | # | Feature | Description |
75
+ |---|---|---|
76
+ | 1 | `num_steps` | Number of tool calls in the plan |
77
+ | 2 | `has_irreversible` | Any step with `irreversible=True` |
78
+ | 3 | `has_sensitive` | Any step touching sensitive data |
79
+ | 4 | `has_network` | Any external network call |
80
+ | 5 | `num_distinct_tools` | Distinct tool count |
81
+ | 6 | `permissions_required` | Number of unique permission scopes requested |
82
+ | 7 | `est_total_cost` | Sum of per-step `estimated_cost` |
83
+ | 8 | `budget_utilisation` | `est_total_cost / budget` |
84
+ | 9 | `external_endpoint_count` | Distinct outbound endpoints |
85
+
86
+ An earlier 10-D variant included a redundant normalised-plan-length feature; that feature was retired and the scorer retrained, so all numbers below reflect the 9-D model.
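To make the feature order concrete, here is a rough sketch of how a plan might be flattened into the 9-D vector. The plan schema is an assumption for illustration only: the step keys `tool`, `irreversible`, `sensitive`, `endpoints`, `permissions`, and `estimated_cost`, and the `plan_features` helper, are hypothetical and may not match the repository's data model. `has_network` is derived here from a non-empty endpoint set, which is also an assumption.

```python
def plan_features(steps, budget):
    """Return the 9 features in the order of the table above.

    steps  -- list of dicts describing tool calls (hypothetical schema)
    budget -- total budget used to normalise the estimated cost
    """
    endpoints = {e for s in steps for e in s.get("endpoints", [])}
    scopes = {p for s in steps for p in s.get("permissions", [])}
    total_cost = sum(s.get("estimated_cost", 0.0) for s in steps)
    return [
        len(steps),                                             # 1 num_steps
        int(any(s.get("irreversible", False) for s in steps)),  # 2 has_irreversible
        int(any(s.get("sensitive", False) for s in steps)),     # 3 has_sensitive
        int(bool(endpoints)),                                   # 4 has_network
        len({s["tool"] for s in steps}),                        # 5 num_distinct_tools
        len(scopes),                                            # 6 permissions_required
        total_cost,                                             # 7 est_total_cost
        total_cost / budget if budget else 0.0,                 # 8 budget_utilisation
        len(endpoints),                                         # 9 external_endpoint_count
    ]


# Toy plan: three local tool calls, one irreversible and touching sensitive data.
steps = [
    {"tool": "read_file", "permissions": ["fs.read"], "estimated_cost": 1.0},
    {"tool": "summarise", "estimated_cost": 1.0},
    {"tool": "delete_file", "irreversible": True, "sensitive": True,
     "permissions": ["fs.write"], "estimated_cost": 2.0},
]
print(plan_features(steps, budget=50.0))
```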
87
+
88
+ ---
89
+
90
+ ## Quick start
91
+
92
+ ```python
93
+ from huggingface_hub import hf_hub_download
94
+ import pickle, numpy as np
95
+
96
+ ckpt_path = hf_hub_download(
97
+ repo_id="EliasHossain/nexus-risk-scorer",
98
+ filename="risk_scorer.pkl",
99
+ )
100
+ with open(ckpt_path, "rb") as f:  # close the file handle after loading
+     ckpt = pickle.load(f)
101
+ model, scaler = ckpt["model"], ckpt["scaler"]
102
+
103
+ # 9-D feature vector for some plan P
104
+ features = np.array([[3, 1, 1, 0, 3, 2, 4.0, 0.08, 0]])
105
+ rho = model.predict_proba(scaler.transform(features))[0, 1]
106
+
107
+ # Scorer-only approximation: the full policy Π also requires |V(P)| ≥ 1 for a score-based BLOCK
+ tau_b, tau_c = 0.75, 0.70
108
+ decision = (
109
+ "BLOCK" if rho >= tau_b else
110
+ "CONFIRM" if rho >= tau_c else
111
+ "ALLOW"
112
+ )
113
+ print(f"ρ = {rho:.3f} → {decision}")
114
+ ```
115
+
116
+ For the full policy (rules + argument inspector + scorer), install the package and use `EnhancedSafetyMonitor`:
117
+
118
+ ```bash
119
+ git clone https://github.com/eliashossain001/nexus.git && cd nexus
120
+ pip install -e .
121
+ ```
122
+
123
+ ```python
124
+ from runtime_safety.monitors.enhanced.monitor import EnhancedSafetyMonitor
125
+ monitor = EnhancedSafetyMonitor.from_default_checkpoint()
126
+ intervention, reasons = monitor.decide(plan)
127
+ ```
128
+
129
+ ---
130
+
131
+ ## Performance
132
+
133
+ | Setting | n | F₁ [95% CI] | Notes |
134
+ |---|---|---|---|
135
+ | Synthetic test split | 128 | **0.965** [0.94, 0.99] | 4-class intervention acc 0.945, overblock 0.04 |
136
+ | IPI v1 (prompt injection) | 200 paired | **0.995** [0.98, 1.00] | adv block 100%, ctrl allow 99% |
137
+ | IPI v2 (5 injection styles) | 200 paired | **1.000** | adv block 100%, ctrl overblock 0% |
138
+ | Multi-turn (session memory on) | 120 sessions | **1.000** | 95/95 critical turns caught, 25/25 controls allowed |
139
+ | R-Judge external (Yuan et al., 2024) | 571 | **0.861** [0.83, 0.89] | Finance 0.92 · Program 0.89 · Web 0.95 · App 0.85 · IoT 0.52 |
140
+ | AgentHarm external (Andriushchenko et al., 2025) | 352 | 0.591 [0.53, 0.65] | Matches rule-only baseline by design (paired harmful/benign share target tools) |
141
+ | **NEXUS-Stress (rule-blind adversarial)** | 200 | **0.836** [0.79, 0.88] | 4-class intervention acc 0.420 — surfaces CONFIRM/REVISE-blind gap |
142
+
143
+ Bootstrap CIs use 1000 resamples with `seed=0`.
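For readers who want to see what such an interval involves, the sketch below computes a percentile bootstrap CI for F₁ with 1000 resamples and `seed=0`. The percentile method, the binary labelling, and the helper names `f1_score` and `bootstrap_f1_ci` are assumptions; the evaluation scripts in the source repo may use a different bootstrap variant or random generator, so this will not necessarily reproduce the reported intervals.

```python
import random


def f1_score(y_true, y_pred):
    """Binary F1 with 1 as the positive class; returns 0.0 when undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_f1_ci(y_true, y_pred, n_resamples=1000, seed=0, alpha=0.05):
    """Percentile bootstrap: resample instances with replacement, recompute F1."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(f1_score([y_true[i] for i in idx],
                              [y_pred[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Toy labels, not data from the benchmarks above.
y_true = [1, 1, 1, 0, 0, 1, 0, 1, 0, 0] * 5
y_pred = [1, 1, 0, 0, 0, 1, 1, 1, 0, 0] * 5
print(f1_score(y_true, y_pred), bootstrap_f1_ci(y_true, y_pred))
```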
144
+
145
+ ---
146
+
147
+ ## Calibration
148
+
149
+ | Calibrator | ECE ↓ | Brier ↓ |
150
+ |---|---|---|
151
+ | Raw logistic | 0.085 | 0.051 |
152
+ | **Platt (deployed)** | **0.013** | **0.041** |
153
+ | Isotonic | 0.018 | 0.043 |
154
+
155
+ Calibration set: 60 held-out plans, `seed=7`. The reliability diagram can be reproduced with `scripts/eval/eval_calibration.py` in the source repo.
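As a reference for the two metrics in the table, here is a small sketch of the Brier score and a binned expected calibration error (ECE). It assumes 10 equal-width bins over the positive-class probability; the source does not state the binning used, so the values from `eval_calibration.py` may differ. Both function names are illustrative.

```python
def brier_score(y_true, probs):
    """Mean squared difference between predicted probability and the 0/1 label."""
    return sum((p - y) ** 2 for y, p in zip(y_true, probs)) / len(y_true)


def expected_calibration_error(y_true, probs, n_bins=10):
    """Weighted mean gap between observed positive rate and mean predicted probability."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, probs):
        k = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[k].append((y, p))
    n = len(y_true)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        observed = sum(y for y, _ in members) / len(members)
        predicted = sum(p for _, p in members) / len(members)
        ece += len(members) / n * abs(observed - predicted)
    return ece


# Toy example: two plans scored 0.5, both actually risky.
print(brier_score([1, 1], [0.5, 0.5]))
print(expected_calibration_error([1, 1], [0.5, 0.5]))
```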
156
+
157
+ ---
158
+
159
+ ## Files
160
+
161
+ ```
162
+ risk_scorer.pkl # {'model': LogisticRegression, 'scaler': StandardScaler}
163
+ chunks.pkl # RAG knowledge-base chunks (Nexora KB)
164
+ faiss.index # FAISS index over the chunks
165
+ ```
166
+
167
+ `chunks.pkl` + `faiss.index` are the retrieval cache used by the demo agent (`scripts/demo/demo_*.py`). They are not required to run the scorer itself but ship together so the full agent + monitor stack is reproducible.
168
+
169
+ ---
170
+
171
+ ## Intended use & limitations
172
+
173
+ **Intended for:** research on runtime safety for tool-using LLM agents; ablating rule-based vs. learned components of agent intervention policies.
174
+
175
+ **Not intended for:**
176
+ - standalone safety adjudication on out-of-distribution agent stacks without re-calibration;
177
+ - threat models in which the harmful and benign variants of a request use **identical** tool calls (AgentHarm-style); there the scorer collapses to the rule-only baseline by construction.
178
+
179
+ **Known limitation — middle-severity coverage gap.** On rule-blind adversarial plans (NEXUS-Stress), `Π` predicts only `ALLOW` or `BLOCK` and never `CONFIRM` / `REVISE`. We disclose this as a deployment-relevant gap; future rule-set extensions should target medium-severity scope-tighten and disambiguation patterns.
180
+
181
+ ---
182
+
183
+ ## Reproducibility
184
+
185
+ All experiments are deterministic. The train/test split uses `seed=42`, the train/calibration split uses `seed=7`, and the benchmark generators and bootstrap use `seed=0`. Results are reproducible from a fresh checkout in under 10 minutes on CPU.
186
+
187
+ ## Citation
188
+
189
+ ```bibtex
190
+ @inproceedings{hossain2026nexus,
191
+ title = {NEXUS: Structured Runtime Safety for Tool-Using LLM Agents},
192
+ author = {Hossain, Elias},
193
+ booktitle = {ACL Rolling Review},
194
+ year = {2026}
195
+ }
196
+ ```
197
+
198
+ ## License
199
+
200
+ Apache-2.0.
chunks.pkl ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:9332a27ede4bb01f2c398a0a64ad4baa60b0cd18180d2296935f7d412755dce9
3
+ size 17071
faiss.index ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:989e6223f913f56477306d6f29e399ba481ed9c538b15cac19a5c3181037fde4
3
+ size 307245
risk_scorer.pkl ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:f34f6297a6fd2f8c753a4f3fe538c2ae37f06a861effab195c3149453aad9564
3
+ size 1285
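The three-line entries for `chunks.pkl`, `faiss.index`, and `risk_scorer.pkl` are Git LFS pointers, each recording a SHA-256 digest and a byte size. A downloaded file could be checked against its pointer roughly as follows; `parse_lfs_pointer` and `matches_pointer` are hypothetical helper names, and the local path in the usage comment is an assumption.

```python
import hashlib
import os


def parse_lfs_pointer(text):
    """Parse 'key value' lines of a Git LFS pointer into a small dict."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {"version": fields["version"], "algo": algo,
            "digest": digest, "size": int(fields["size"])}


def matches_pointer(path, pointer):
    """Return True if the file at `path` has the pointer's size and digest."""
    if os.path.getsize(path) != pointer["size"]:
        return False
    h = hashlib.new(pointer["algo"])
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest() == pointer["digest"]


pointer = parse_lfs_pointer(
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:f34f6297a6fd2f8c753a4f3fe538c2ae37f06a861effab195c3149453aad9564\n"
    "size 1285\n"
)
# Usage, assuming the file was downloaded to the working directory:
# matches_pointer("risk_scorer.pkl", pointer)
```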