NEXUS Risk Scorer

risk_scorer.pkl is a Platt-calibrated logistic regression that maps a 9-dimensional plan-feature vector to a per-plan risk score ρ(P) ∈ [0, 1]. It is the learned component of the NEXUS runtime safety monitor for tool-using LLM agents.

NEXUS combines this scorer with a deterministic rule set (V_R) and an argument-level inspector (V_A) to produce one of four graded interventions per plan: ALLOW · BLOCK · CONFIRM · REVISE.

Paper / repo: github.com/eliashossain001/nexus
Companion datasets: EliasHossain/nexus-stress · EliasHossain/nexus-ipi · EliasHossain/nexus-synthetic · EliasHossain/nexus-multistep

TL;DR

Property	Value
Algorithm	Logistic Regression (`class_weight='balanced'`, `max_iter=1000`, `random_state=42`)
Preprocessor	`StandardScaler` (z-score)
Calibration	Platt scaling on a held-out 60-instance split (`seed=7`)
Input	9-D plan feature vector (see schema below)
Output	Calibrated risk score `ρ(P) ∈ [0, 1]`
Operating thresholds	`(τ_b, τ_c) = (0.75, 0.70)`, loss-optimal across the 5×5 grid of `(λ_s, λ_o)` weights
Calibration quality	ECE 0.085 → 0.013, Brier 0.051 → 0.041
Footprint	~1.3 KB, CPU-only, sub-millisecond inference

Intervention policy

The scorer feeds the formal intervention policy Π:

Π(P) =
  BLOCK    if ∃ v ∈ V(P) : sev(v) = CRIT
  BLOCK    if ρ(P) ≥ τ_b ∧ |V(P)| ≥ 1
  CONFIRM  if ρ(P) ≥ τ_c ∨ ∃ v : sev(v) = HIGH
  REVISE   if ∃ v : sev(v) = MED
  ALLOW    otherwise

V(P) = V_R(P) ∪ V_A(P) is the union of rule and argument-inspector violations; sev(·) ∈ {CRIT, HIGH, MED, LOW}. Thresholds are selected by minimising

L(Π) = E[λ_s · u_s + λ_o · u_o + λ_c · c(Π(P))]

over a 5×5 grid of (λ_s, λ_o) weights. On the synthetic split, the pair (τ_b, τ_c) = (0.75, 0.70) minimises the loss at every point of the grid.
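The case analysis of Π above can be sketched in a few lines of Python. This is a minimal illustration, not the packaged implementation: the `Severity` enum and the `intervene` function are hypothetical names, and the sketch assumes the cases are evaluated top to bottom with the first match winning.

```python
from enum import IntEnum
from typing import Iterable


class Severity(IntEnum):
    """Severity levels of a violation (hypothetical encoding)."""
    LOW = 0
    MED = 1
    HIGH = 2
    CRIT = 3


def intervene(rho: float, severities: Iterable[Severity],
              tau_b: float = 0.75, tau_c: float = 0.70) -> str:
    """Apply the cases of Π in order; the first matching case wins.

    `severities` holds sev(v) for every v in V(P) = V_R(P) ∪ V_A(P).
    """
    sevs = list(severities)
    if Severity.CRIT in sevs:                 # any critical violation
        return "BLOCK"
    if rho >= tau_b and len(sevs) >= 1:       # high risk plus a violation
        return "BLOCK"
    if rho >= tau_c or Severity.HIGH in sevs:
        return "CONFIRM"
    if Severity.MED in sevs:
        return "REVISE"
    return "ALLOW"


print(intervene(0.90, []))               # high score, no violations
print(intervene(0.90, [Severity.LOW]))   # high score, one low violation
print(intervene(0.10, [Severity.MED]))   # low score, medium violation
```

Under this reading, ρ(P) ≥ τ_b alone does not BLOCK: with an empty violation set the second case fails and the plan falls through to CONFIRM.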

9-D plan feature vector

The scorer consumes per-plan features summarising side effects, sensitivity, permissions, network reach, budget, and structural shape:

#	Feature	Description
1	`num_steps`	Number of tool calls in the plan
2	`has_irreversible`	Any step with `irreversible=True`
3	`has_sensitive`	Any step touching sensitive data
4	`has_network`	Any external network call
5	`num_distinct_tools`	Distinct tool count
6	`permissions_required`	Number of unique permission scopes requested
7	`est_total_cost`	Sum of per-step `estimated_cost`
8	`budget_utilisation`	`est_total_cost / budget`
9	`external_endpoint_count`	Distinct outbound endpoints

An earlier 10-D variant included a redundant normalised-plan-length feature. That feature was retired and the scorer retrained, so all numbers below reflect the 9-D model.
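To make the schema concrete, here is a rough sketch of how the nine features could be assembled from a plan. The plan representation is an assumption made for illustration: only `irreversible` and `estimated_cost` appear in the table above, while the keys `tool`, `sensitive`, `network`, `permissions`, and `endpoint`, along with the function name `plan_features`, are hypothetical.

```python
import numpy as np


def plan_features(steps, budget):
    """Build a (1, 9) feature array in the order of the schema table.

    `steps` is a list of dicts, one per tool call. All keys other than
    `irreversible` and `estimated_cost` are hypothetical field names.
    """
    scopes = {s for step in steps for s in step.get("permissions", [])}
    endpoints = {step["endpoint"] for step in steps if step.get("endpoint")}
    total_cost = sum(step.get("estimated_cost", 0.0) for step in steps)
    return np.array([[
        len(steps),                                                   # num_steps
        int(any(step.get("irreversible", False) for step in steps)),  # has_irreversible
        int(any(step.get("sensitive", False) for step in steps)),     # has_sensitive
        int(any(step.get("network", False) for step in steps)),       # has_network
        len({step["tool"] for step in steps}),                        # num_distinct_tools
        len(scopes),                                                  # permissions_required
        total_cost,                                                   # est_total_cost
        total_cost / budget if budget else 0.0,                       # budget_utilisation
        len(endpoints),                                               # external_endpoint_count
    ]], dtype=float)


# Hypothetical three-step plan with a budget of 50 cost units.
steps = [
    {"tool": "read_file", "sensitive": True,
     "permissions": ["fs.read"], "estimated_cost": 1.0},
    {"tool": "summarise", "estimated_cost": 1.0},
    {"tool": "delete_file", "irreversible": True,
     "permissions": ["fs.write"], "estimated_cost": 2.0},
]
x = plan_features(steps, budget=50.0)
print(x.shape)
```

The resulting array has the shape expected by `scaler.transform`, but the exact feature extraction used to train the scorer lives in the source repo and may differ from this sketch.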

Quick start

from huggingface_hub import hf_hub_download
import pickle, numpy as np

ckpt_path = hf_hub_download(
    repo_id="EliasHossain/nexus-risk-scorer",
    filename="risk_scorer.pkl",
)
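# Only unpickle checkpoints from a source you trust: pickle can execute arbitrary code.
with open(ckpt_path, "rb") as f:
    ckpt = pickle.load(f)
model, scaler = ckpt["model"], ckpt["scaler"]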

# 9-D feature vector for some plan P
features = np.array([[3, 1, 1, 0, 3, 2, 4.0, 0.08, 0]])
rho = model.predict_proba(scaler.transform(features))[0, 1]

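# Scorer-only shortcut. The full policy Π also requires |V(P)| ≥ 1 before
# BLOCKing on ρ ≥ τ_b, so this approximates Π without V_R and V_A.
tau_b, tau_c = 0.75, 0.70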
decision = (
    "BLOCK"    if rho >= tau_b else
    "CONFIRM"  if rho >= tau_c else
    "ALLOW"
)
print(f"ρ = {rho:.3f}  →  {decision}")

For the full policy (rules + argument inspector + scorer), install the package and use EnhancedSafetyMonitor:

git clone https://github.com/eliashossain001/nexus.git && cd nexus
pip install -e .

from runtime_safety.monitors.enhanced.monitor import EnhancedSafetyMonitor
monitor = EnhancedSafetyMonitor.from_default_checkpoint()
# `plan` is the agent's proposed plan, constructed elsewhere.
intervention, reasons = monitor.decide(plan)

Performance

Setting	n	F₁ [95% CI]	Notes
Synthetic test split	128	0.965 [0.94, 0.99]	4-class intervention acc 0.945, overblock 0.04
IPI v1 (prompt injection)	200 paired	0.995 [0.98, 1.00]	adv block 100%, ctrl allow 99%
IPI v2 (5 injection styles)	200 paired	1.000	adv block 100%, ctrl overblock 0%
Multi-turn (session memory on)	120 sessions	1.000	95/95 critical turns caught, 25/25 controls allowed
R-Judge external (Yuan et al., 2024)	571	0.861 [0.83, 0.89]	Finance 0.92 · Program 0.89 · Web 0.95 · App 0.85 · IoT 0.52
AgentHarm external (Andriushchenko et al., 2025)	352	0.591 [0.53, 0.65]	Matches rule-only baseline by design (paired harmful/benign share target tools)
NEXUS-Stress (rule-blind adversarial)	200	0.836 [0.79, 0.88]	4-class intervention acc 0.420 — surfaces CONFIRM/REVISE-blind gap

Bootstrap CIs use 1000 resamples with seed=0.
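For readers who want to check an interval of this kind, the following is a minimal sketch of a bootstrap CI for binary F₁ with 1000 resamples and seed 0. The percentile method, the use of NumPy's `default_rng`, and the function names are assumptions; the source states only the resample count and the seed, so the repo's evaluation scripts remain the reference.

```python
import numpy as np


def f1_binary(y_true, y_pred):
    """F1 for the positive class (label 1); 0.0 if it is undefined."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom else 0.0


def bootstrap_f1_ci(y_true, y_pred, n_resamples=1000, seed=0, alpha=0.05):
    """Return (point estimate, lower, upper) using the percentile method."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    rng = np.random.default_rng(seed)
    n = len(y_true)
    stats = np.empty(n_resamples)
    for b in range(n_resamples):
        idx = rng.integers(0, n, size=n)   # sample n indices with replacement
        stats[b] = f1_binary(y_true[idx], y_pred[idx])
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return f1_binary(y_true, y_pred), float(lo), float(hi)


# Toy labels: 100 plans, two of them misclassified.
y_true = np.array([1] * 50 + [0] * 50)
y_pred = y_true.copy()
y_pred[0], y_pred[60] = 0, 1
point, lo, hi = bootstrap_f1_ci(y_true, y_pred)
print(f"F1 = {point:.3f} [{lo:.3f}, {hi:.3f}]")
```

Fixing the seed makes the interval repeatable across runs, which is what allows the bracketed values in the table to be regenerated exactly.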

Calibration

Calibrator	ECE ↓	Brier ↓
Raw logistic	0.085	0.051
Platt (deployed)	0.013	0.041
Isotonic	0.018	0.043

Calibration set: 60 held-out plans, seed=7. The reliability diagram is reproducible from scripts/eval/eval_calibration.py in the source repo.
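The two metrics in the table can be computed as in the sketch below. It assumes equal-width binning with 10 bins for ECE, which the source does not specify, and the function names are illustrative; scripts/eval/eval_calibration.py is the authoritative version.

```python
import numpy as np


def brier_score(y_true, p):
    """Mean squared difference between predicted probability and label."""
    y_true, p = np.asarray(y_true, float), np.asarray(p, float)
    return float(np.mean((p - y_true) ** 2))


def expected_calibration_error(y_true, p, n_bins=10):
    """ECE with equal-width bins: weighted mean of |accuracy - confidence|."""
    y_true, p = np.asarray(y_true, float), np.asarray(p, float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges map each probability to a bin index in 0..n_bins-1;
    # p == 1.0 lands in the last bin.
    bin_ids = np.clip(np.digitize(p, edges[1:-1]), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(y_true[mask].mean() - p[mask].mean())
            ece += mask.mean() * gap   # weight by the fraction of samples in the bin
    return float(ece)


# Toy example: two positives both scored at 0.5.
print(brier_score([1, 1], [0.5, 0.5]))
print(expected_calibration_error([1, 1], [0.5, 0.5]))
```

Lower is better for both metrics. ECE depends on the binning choice, so values computed with a different bin count will not match the table exactly.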

Files

risk_scorer.pkl    # {'model': LogisticRegression, 'scaler': StandardScaler}
chunks.pkl         # RAG knowledge-base chunks (Nexora KB)
faiss.index        # FAISS index over the chunks

chunks.pkl + faiss.index are the retrieval cache used by the demo agent (scripts/demo/demo_*.py). They are not required to run the scorer itself but ship together so the full agent + monitor stack is reproducible.

Intended use & limitations

Intended for: research on runtime safety for tool-using LLM agents; ablating rule-based vs. learned components of agent intervention policies.

Not intended for:

standalone safety adjudication on out-of-distribution agent stacks without re-calibration;
threat models where the harmful and benign variants of a request use identical tool calls (AgentHarm-style), where the scorer collapses to the rule-only baseline by construction.

Known limitation: middle-severity coverage gap. On rule-blind adversarial plans (NEXUS-Stress, 4-class intervention accuracy 0.420), Π only ever outputs ALLOW or BLOCK and never CONFIRM or REVISE. We disclose this as a deployment-relevant gap; future rule-set extensions should target medium-severity scope-tightening and disambiguation patterns.

Reproducibility

All experiments are deterministic: the train/test split uses seed=42, the train/calibration split uses seed=7, and the benchmark generators and bootstrap use seed=0. Results are reproducible from a fresh checkout in under 10 minutes on CPU.

Citation

@inproceedings{hossain2026nexus,
  title     = {NEXUS: Structured Runtime Safety for Tool-Using LLM Agents},
  author    = {Hossain, Elias and Nipu, Md Mehedi Hasan and Ornee, Tasfia Nuzhat and Rana, Rajib and Yousefi, Niloofar},
  booktitle = {ACL Rolling Review},
  year      = {2026}
}

License

Apache-2.0.
