Initial release of nexus-risk-scorer

Files changed (5)

.gitattributes +1 -0
README.md +200 -0
chunks.pkl +3 -0
faiss.index +3 -0
risk_scorer.pkl +3 -0

.gitattributes CHANGED

@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text

+faiss.index filter=lfs diff=lfs merge=lfs -text

README.md ADDED

	@@ -0,0 +1,200 @@

+---
+license: apache-2.0
+language:
+- en
+library_name: scikit-learn
+tags:
+- agent-safety
+- llm-agents
+- tool-use
+- safety-monitor
+- calibrated-classifier
+- runtime-safety
+- nexus
+pipeline_tag: tabular-classification
+---
+# NEXUS Risk Scorer
+`risk_scorer.pkl` — a Platt-calibrated logistic regression that maps a 9-dimensional plan-feature vector to a per-plan risk score `ρ(P) ∈ [0, 1]`. It is the learned component of the **NEXUS** runtime safety monitor for tool-using LLM agents.
+> NEXUS combines this scorer with a deterministic rule set (`V_R`) and an argument-level inspector (`V_A`) to produce one of four graded interventions per plan: **ALLOW · BLOCK · CONFIRM · REVISE**.
+- **Paper / repo:** [github.com/eliashossain001/nexus](https://github.com/eliashossain001/nexus)
+- **Companion datasets:**
+  [`EliasHossain/nexus-stress`](https://huggingface.co/datasets/EliasHossain/nexus-stress) ·
+  [`EliasHossain/nexus-ipi`](https://huggingface.co/datasets/EliasHossain/nexus-ipi) ·
+  [`EliasHossain/nexus-synthetic`](https://huggingface.co/datasets/EliasHossain/nexus-synthetic) ·
+  [`EliasHossain/nexus-multistep`](https://huggingface.co/datasets/EliasHossain/nexus-multistep)
+---
+## TL;DR
+| Property | Value |
+|---|---|
+| Algorithm | Logistic Regression (`class_weight='balanced'`, `max_iter=1000`, `random_state=42`) |
+| Preprocessor | `StandardScaler` (z-score) |
+| Calibration | Platt scaling on a held-out 60-instance split (`seed=7`) |
+| Input | 9-D plan feature vector (see schema below) |
+| Output | Calibrated risk score `ρ(P) ∈ [0, 1]` |
+| Operating thresholds | `(τ_b, τ_c) = (0.75, 0.70)` — loss-optimal on `(λ_s, λ_o)` 5×5 grid |
+| Calibration quality | ECE 0.085 → **0.013**, Brier 0.051 → **0.041** |
+| Footprint | ~1.3 KB, CPU-only, sub-millisecond inference |
+---
+## Intervention policy
+The scorer feeds the formal intervention policy `Π`:
+```
+Π(P) =
+  BLOCK    if ∃ v ∈ V(P) : sev(v) = CRIT
+  BLOCK    if ρ(P) ≥ τ_b ∧ |V(P)| ≥ 1
+  CONFIRM  if ρ(P) ≥ τ_c ∨ ∃ v : sev(v) = HIGH
+  REVISE   if ∃ v : sev(v) = MED
+  ALLOW    otherwise
+```
+`V(P) = V_R(P) ∪ V_A(P)` is the union of rule and argument-inspector violations; `sev(·) ∈ {CRIT, HIGH, MED, LOW}`. Thresholds are selected by minimising
+```
+L(Π) = E[λ_s · u_s + λ_o · u_o + λ_c · c(Π(P))]
+```
+over a 5×5 grid of `(λ_s, λ_o)` weights. The point `(0.75, 0.70)` is loss-optimal **uniformly** across the grid on the synthetic split.
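The case analysis for `Π` above can be read as a first-match-wins cascade. The following is a minimal sketch under that reading, not the repository's implementation: the `Severity` enum and the `intervene` function are hypothetical names, and violations are represented only by their severities.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Hypothetical encoding of sev(v); only the four labels come from the README."""
    LOW = 0
    MED = 1
    HIGH = 2
    CRIT = 3


def intervene(rho, severities, tau_b=0.75, tau_c=0.70):
    """Evaluate the policy cases top to bottom; the first matching case wins.

    rho        -- calibrated risk score in [0, 1]
    severities -- severities of all violations in V(P) = V_R(P) ∪ V_A(P)
    """
    sevs = list(severities)
    if Severity.CRIT in sevs:              # any critical violation
        return "BLOCK"
    if rho >= tau_b and len(sevs) >= 1:    # high score backed by at least one violation
        return "BLOCK"
    if rho >= tau_c or Severity.HIGH in sevs:
        return "CONFIRM"
    if Severity.MED in sevs:
        return "REVISE"
    return "ALLOW"


# A high score with no violations does not BLOCK; it falls through to CONFIRM.
print(intervene(0.80, []))                  # CONFIRM
print(intervene(0.80, [Severity.LOW]))      # BLOCK
print(intervene(0.20, [Severity.MED]))      # REVISE
```

One consequence worth noting: because the second case requires `|V(P)| ≥ 1`, the scorer alone can never trigger `BLOCK`; without any violation it can escalate a plan to `CONFIRM` at most.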
+---
+## 9-D plan feature vector
+The scorer consumes per-plan features summarising side effects, sensitivity, permissions, network reach, budget, and structural shape:
+| # | Feature | Description |
+|---|---|---|
+| 1 | `num_steps` | Number of tool calls in the plan |
+| 2 | `has_irreversible` | Any step with `irreversible=True` |
+| 3 | `has_sensitive` | Any step touching sensitive data |
+| 4 | `has_network` | Any external network call |
+| 5 | `num_distinct_tools` | Distinct tool count |
+| 6 | `permissions_required` | Number of unique permission scopes requested |
+| 7 | `est_total_cost` | Sum of per-step `estimated_cost` |
+| 8 | `budget_utilisation` | `est_total_cost / budget` |
+| 9 | `external_endpoint_count` | Distinct outbound endpoints |
+An earlier 10-D variant included a redundant normalised-plan-length feature; it was retired and the scorer retrained — all numbers below reflect the 9-D model.
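To make the schema concrete, here is a rough sketch of how such a vector might be assembled from a plan. The plan representation is an assumption: only `irreversible` and `estimated_cost` appear in the table above, while the `tool`, `sensitive`, `permissions`, and `endpoint` keys, the `plan_features` helper, and the choice to derive `has_network` from the presence of an endpoint are all hypothetical.

```python
def plan_features(steps, budget):
    """Build the 9-D vector in the table's column order.

    steps  -- list of dicts, one per tool call (hypothetical layout)
    budget -- total budget used to normalise the estimated cost
    """
    total_cost = sum(s.get("estimated_cost", 0.0) for s in steps)
    tools = {s["tool"] for s in steps}
    scopes = {p for s in steps for p in s.get("permissions", ())}
    endpoints = {s["endpoint"] for s in steps if s.get("endpoint")}
    return [
        len(steps),                                             # num_steps
        int(any(s.get("irreversible", False) for s in steps)),  # has_irreversible
        int(any(s.get("sensitive", False) for s in steps)),     # has_sensitive
        int(bool(endpoints)),                                   # has_network (assumed)
        len(tools),                                             # num_distinct_tools
        len(scopes),                                            # permissions_required
        total_cost,                                             # est_total_cost
        total_cost / budget if budget else 0.0,                 # budget_utilisation
        len(endpoints),                                         # external_endpoint_count
    ]


# Illustrative plan: three local steps, one of them irreversible.
steps = [
    {"tool": "read_file", "estimated_cost": 1.0, "sensitive": True,
     "permissions": ["fs.read"]},
    {"tool": "summarise", "estimated_cost": 1.0},
    {"tool": "delete_file", "estimated_cost": 2.0, "irreversible": True,
     "permissions": ["fs.write"]},
]
features = plan_features(steps, budget=50.0)
print(features)  # [3, 1, 1, 0, 3, 2, 4.0, 0.08, 0]
```

The example plan was chosen so that the result matches the vector used in the Quick start snippet.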
+---
+## Quick start
+```python
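+import pickle
+import numpy as np
+from huggingface_hub import hf_hub_download
+# Download the checkpoint from the Hub (cached locally after the first call).
+ckpt_path = hf_hub_download(
+    repo_id="EliasHossain/nexus-risk-scorer",
+    filename="risk_scorer.pkl",
+)
+# Unpickling can execute arbitrary code, so only load checkpoints you trust.
+with open(ckpt_path, "rb") as f:
+    ckpt = pickle.load(f)
+model, scaler = ckpt["model"], ckpt["scaler"]
+# 9-D feature vector for some plan P, in the column order of the schema table
+features = np.array([[3, 1, 1, 0, 3, 2, 4.0, 0.08, 0]])
+# Column 1 of predict_proba is the probability of the positive (risky) class.
+rho = model.predict_proba(scaler.transform(features))[0, 1]
+tau_b, tau_c = 0.75, 0.70
+# Scorer-only approximation: the full policy Π also requires |V(P)| ≥ 1 to BLOCK
+# and consults rule and argument-inspector violations (see "Intervention policy").
+decision = (
+    "BLOCK"    if rho >= tau_b else
+    "CONFIRM"  if rho >= tau_c else
+    "ALLOW"
+)
+print(f"ρ = {rho:.3f}  →  {decision}")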
+```
+For the full policy (rules + argument inspector + scorer), install the package and use `EnhancedSafetyMonitor`:
+```bash
+git clone https://github.com/eliashossain001/nexus.git && cd nexus
+pip install -e .
+```
+```python
+from runtime_safety.monitors.enhanced.monitor import EnhancedSafetyMonitor
+monitor = EnhancedSafetyMonitor.from_default_checkpoint()
+intervention, reasons = monitor.decide(plan)
+```
+---
+## Performance
+| Setting | n | F₁ [95% CI] | Notes |
+|---|---|---|---|
+| Synthetic test split | 128 | **0.965** [0.94, 0.99] | 4-class intervention acc 0.945, overblock 0.04 |
+| IPI v1 (prompt injection) | 200 paired | **0.995** [0.98, 1.00] | adv block 100%, ctrl allow 99% |
+| IPI v2 (5 injection styles) | 200 paired | **1.000** | adv block 100%, ctrl overblock 0% |
+| Multi-turn (session memory on) | 120 sessions | **1.000** | 95/95 critical turns caught, 25/25 controls allowed |
+| R-Judge external (Yuan et al., 2024) | 571 | **0.861** [0.83, 0.89] | Finance 0.92 · Program 0.89 · Web 0.95 · App 0.85 · IoT 0.52 |
+| AgentHarm external (Andriushchenko et al., 2025) | 352 | 0.591 [0.53, 0.65] | Matches rule-only baseline by design (paired harmful/benign share target tools) |
+| **NEXUS-Stress (rule-blind adversarial)** | 200 | **0.836** [0.79, 0.88] | 4-class intervention acc 0.420 — surfaces CONFIRM/REVISE-blind gap |
+Bootstrap CIs use 1000 resamples with `seed=0`.
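For readers who want to see what such an interval involves, below is a stdlib-only sketch of a bootstrap CI for F₁ with 1000 resamples and `seed=0`. The README does not say which bootstrap variant or random generator was used, so the percentile method, Python's `random.Random`, and the helper names `f1_score` and `bootstrap_ci` are assumptions; the numbers it produces will not match the table.

```python
import random


def f1_score(y_true, y_pred):
    """Binary F1 for 0/1 labels; returns 0.0 when there are no positives at all."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_ci(y_true, y_pred, n_resamples=1000, seed=0, alpha=0.05):
    """Percentile bootstrap: resample instances with replacement, recompute F1."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(f1_score([y_true[i] for i in idx],
                               [y_pred[i] for i in idx]))
    scores.sort()
    lo_idx = int(n_resamples * alpha / 2)
    hi_idx = max(lo_idx, int(n_resamples * (1 - alpha / 2)) - 1)
    return scores[lo_idx], scores[hi_idx]


# Toy labels: 10 positives and 10 negatives, with 2 misses and 2 false alarms.
y_true = [1] * 10 + [0] * 10
y_pred = [1] * 8 + [0] * 2 + [1] * 2 + [0] * 8
print(f1_score(y_true, y_pred))      # 0.8
print(bootstrap_ci(y_true, y_pred))  # 95% interval around the point estimate
```

Fixing the seed makes the interval itself reproducible, which is consistent with the determinism claim in the Reproducibility section.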
+---
+## Calibration
+| Calibrator | ECE ↓ | Brier ↓ |
+|---|---|---|
+| Raw logistic | 0.085 | 0.051 |
+| **Platt (deployed)** | **0.013** | **0.041** |
+| Isotonic | 0.018 | 0.043 |
+Calibration set: 60 held-out plans, `seed=7`. The reliability diagram can be reproduced with `scripts/eval/eval_calibration.py` in the source repo.
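The two metrics in the table can be computed in a few lines. This is an illustrative sketch, not the evaluation script: it assumes equal-width binning with 10 bins for ECE (the README does not state the bin count) and treats each score as the predicted probability of the positive class.

```python
def brier_score(probs, labels):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted mean gap between average score and positive rate per bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        k = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes into the last bin
        bins[k].append((p, y))
    n = len(probs)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        mean_score = sum(p for p, _ in b) / len(b)
        positive_rate = sum(y for _, y in b) / len(b)
        ece += len(b) / n * abs(positive_rate - mean_score)
    return ece


# Overconfident toy predictor: always says 1.0, but is right only half the time.
probs, labels = [1.0, 1.0], [0, 1]
print(brier_score(probs, labels))                 # 0.5
print(expected_calibration_error(probs, labels))  # 0.5
```

With only 60 calibration plans, bins can be sparsely populated, so the ECE value is sensitive to the binning choice; the Brier score does not have that dependency.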
+---
+## Files
+```
+risk_scorer.pkl    # {'model': LogisticRegression, 'scaler': StandardScaler}
+chunks.pkl         # RAG knowledge-base chunks (Nexora KB)
+faiss.index        # FAISS index over the chunks
+```
+`chunks.pkl` + `faiss.index` are the retrieval cache used by the demo agent (`scripts/demo/demo_*.py`). They are not required to run the scorer itself but ship together so the full agent + monitor stack is reproducible.
+---
+## Intended use & limitations
+**Intended for:** research on runtime safety for tool-using LLM agents; ablating rule-based vs. learned components of agent intervention policies.
+**Not intended for:**
+- standalone safety adjudication on out-of-distribution agent stacks without re-calibration;
+- threat models where the harmful and benign variants of a request use **identical** tool calls (AgentHarm-style), where the scorer collapses to the rule-only baseline by construction.
+**Known limitation: middle-severity coverage gap.** On rule-blind adversarial plans (NEXUS-Stress), `Π` predicts only `ALLOW` or `BLOCK` and never `CONFIRM` or `REVISE`. We disclose this as a deployment-relevant gap; future rule-set extensions should target medium-severity scope-tightening and disambiguation patterns.
+---
+## Reproducibility
+All experiments are deterministic: the train/test split uses `seed=42`, the train/calibration split uses `seed=7`, and the benchmark generators and bootstrap use `seed=0`. Results can be reproduced from a fresh checkout in under 10 minutes on CPU.
+## Citation
+```bibtex
+@inproceedings{hossain2026nexus,
+  title     = {NEXUS: Structured Runtime Safety for Tool-Using LLM Agents},
+  author    = {Hossain, Elias},
+  booktitle = {ACL Rolling Review},
+  year      = {2026}
+}
+```
+## License
+Apache-2.0.

chunks.pkl ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:9332a27ede4bb01f2c398a0a64ad4baa60b0cd18180d2296935f7d412755dce9
+size 17071

faiss.index ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:989e6223f913f56477306d6f29e399ba481ed9c538b15cac19a5c3181037fde4
+size 307245

risk_scorer.pkl ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:f34f6297a6fd2f8c753a4f3fe538c2ae37f06a861effab195c3149453aad9564
+size 1285
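The three binary files are stored as Git LFS pointers: a spec version, a SHA-256 object id, and a byte size. As a rough sketch, a downloaded file could be checked against its pointer as follows; `parse_lfs_pointer` and `matches_pointer` are hypothetical helper names, and the pointer text is copied from the `risk_scorer.pkl` entry above without the diff's `+` markers.

```python
import hashlib


def parse_lfs_pointer(text):
    """Split 'key value' lines of a Git LFS pointer into a small dict."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(data, pointer):
    """True if the bytes have the size and hash recorded in the pointer."""
    actual = hashlib.new(pointer["algo"], data).hexdigest()
    return len(data) == pointer["size"] and actual == pointer["digest"]


POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:f34f6297a6fd2f8c753a4f3fe538c2ae37f06a861effab195c3149453aad9564
size 1285
"""
pointer = parse_lfs_pointer(POINTER)
print(pointer["algo"], pointer["size"])  # sha256 1285
```

Given the checkpoint bytes read from disk, `matches_pointer(data, pointer)` would then confirm that the file is the one recorded in this commit before it is unpickled.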