naazimsnh02/fraudsentinel-tier1-scorers

@@ -17,23 +17,19 @@ datasets:
 # FraudSentinel — Tier-1 Real-Time Scorers (LightGBM + GNN)
-Fast statistical scorers for the **Tier-1** of the FraudSentinel two-tier fraud-detection
-architecture. They score **100% of the transaction stream in single-digit milliseconds** and
-**route only flagged/borderline cases** to the Tier-2 fine-tuned LLM
-([`naazimsnh02/fraud-financial-crime-qwen3-sft-v2`](https://huggingface.co/datasets/naazimsnh02/fraud-financial-crime-qwen3-sft-v2))
-for explanation, typology, recommended action, and SAR drafting.
-> Two-tier pattern follows published financial-crime systems: a fast model triages every
-> transaction; the LLM explains the flagged minority (arXiv:2507.14785, 2210.14360, 2312.13896).
 ## 📊 Models Overview
 | Artifact | Task | Dataset | Metrics | Status |
 |---|---|---|---|---|
-| `cc_lgbm_model.txt` + `cc_lgbm_preproc.joblib` | Card-Not-Present (CNP) Fraud | Sparkov (1.3M tx) | **PR-AUC 0.967 · ROC-AUC 0.999** | ✅ Shipped |
 | `aml_gnn.pt` | Money Laundering (Graph) | IBM AML HI-Small (5M tx) | **ROC-AUC 0.584 · PR-AUC 0.0036** | ✅ Complete |
-| `aml_lgbm_model.txt` + `aml_lgbm_preproc.joblib` | AML Pre-Filter (Tabular) | IBM AML HI-Small (5M tx) | **ROC-AUC 0.82 · PR-AUC 0.023** | ✅ Complete |
 ---
@@ -41,76 +37,169 @@ for explanation, typology, recommended action, and SAR drafting.
 **Evaluated on Sparkov's held-out test split at natural fraud rate (0.39%):**
-- **Metrics:** PR-AUC **0.967**, ROC-AUC **0.999**
-- **Routing Performance:** At recall ≈ 0.90, precision 0.964 → catches **90% of fraud** while flagging only **<0.36%** of traffic to Tier-2 LLM
-- **Latency:** <5 ms per transaction on CPU
-- **Top Features:**
-  - Amount vs. category 95th percentile anomaly
-  - Merchant category
-  - Log transaction amount
-  - Time-of-day patterns
-  - 24h/1h velocity per card
-  - Time since last transaction
-- **Engineered Signals:** Geo distance (home ↔ merchant), cardholder age, category amount anomaly
 ---
-## 🕸️ AML GNN Scorer
-**Graph Neural Network (GINE edge-classifier with edge features)**
-Trained on IBM AML transaction multigraph with **temporal 70/10/20 train/val/test split** and class-weighted loss.
-### Current Metrics
-- **ROC-AUC:** 0.584 (on temporal test split)
-- **PR-AUC:** 0.0036
-- **Best F1:** 0.0159 (@ threshold 1.0)
-- **Architecture:** 3× GINEConv layers, hidden dim 96, reverse edges (bidirectional message passing)
-- **Training:** 80 epochs, best-checkpoint selection, recall-calibrated routing thresholds
 ### Why Graph Structure Matters
-- **Beats tabular baseline:** Tabular LightGBM (ROC-AUC 0.82 → **0.584** GNN) confirms graph structure carries laundering patterns invisible to per-transaction models
-- **Multi-hop detection:** Catches fan-out, gather-scatter, cycle-based laundering patterns
-- **Why modest precision:** This is a baseline GNN, not the full IBM Multi-GNN. Use as high-recall graph triage that routes suspicious sub-graphs to Tier-2 LLM + human review
 ### Routing Thresholds (Test Split)
 | Recall Target | Threshold | Precision | Flagged % |
 |---|---|---|---|
 | 70% | 0.527 | 0.0019 | 69.4% |
 | 80% | 0.377 | 0.0019 | 76.2% |
 | 90% | 0.003 | 0.0018 | 89.3% |
----
-## 🧮 AML Pre-Filter (Tabular Baseline)
-Single-transaction LightGBM scorer for deployment on resource-constrained systems or as Tier-1 pre-filter.
-- **Metrics:** ROC-AUC **0.82**, PR-AUC **0.023**
-- **Features:** Amount, currency mismatch, payment format, time-of-day, velocity
-- **Latency:** <1 ms per transaction on CPU
-- **Purpose:** High-recall triage or lightweight deployment before GNN
 ---
-## Inference
-```python
-from infer import CardScorer, AMLScorer
-cs = CardScorer()                       # loads model + preprocessor + routing threshold
-out = cs.score(transaction_dict)        # -> {"risk": 0.97, "route_to_llm": True, "tier": "card"}
-# if out["route_to_llm"]: send the case to the Tier-2 LLM for explanation + SAR draft
-```
-Card transactions are scored in well under 10 ms on CPU. The GNN scores the AML graph in a single
-batched forward pass on GPU (or CPU for small graphs); `route_to_llm` uses the recall-calibrated
-threshold stored in the metrics JSON — tune it to your false-positive budget.
-## Limitations
-Prototype/research use. Source data is synthetic/semi-synthetic. The card scorer's strong metrics reflect
-the Sparkov generator's structure; validate on your own data before deployment. The AML GNN is a
-high-recall graph triage, not a compliant detector — pair it with human-in-the-loop review and (optionally)
-the full Multi-GNN recipe for production-grade precision.
-## License
 Apache-2.0. Source datasets retain their own licenses.

 # FraudSentinel — Tier-1 Real-Time Scorers (LightGBM + GNN)
+Fast statistical scorers for **Tier-1** of the FraudSentinel two-tier fraud-detection architecture. They score **100% of the transaction stream in single-digit milliseconds per transaction** and **route only flagged cases** to the Tier-2 fine-tuned LLM ([`naazimsnh02/fraudsentinel-qwen3-14b-lora`](https://huggingface.co/naazimsnh02/fraudsentinel-qwen3-14b-lora)) for explanation, typology classification, recommended action, and SAR drafting.
+> Two-tier pattern follows published financial-crime systems: a fast model triages every transaction; the LLM explains the flagged minority (arXiv:2507.14785, 2210.14360, 2312.13896).
+---
 ## 📊 Models Overview
 | Artifact | Task | Dataset | Metrics | Status |
 |---|---|---|---|---|
+| `cc_lgbm_model.txt` + `cc_lgbm_preproc.joblib` | Card-Not-Present (CNP) Fraud | Sparkov (1.3M tx) | **PR-AUC 0.967 · ROC-AUC 0.999** | ✅ Complete |
+| `aml_lgbm_model.txt` + `aml_lgbm_preproc.joblib` | AML Pre-Filter (Tabular) | IBM AML HI-Small (5M tx) | **ROC-AUC 0.822 · PR-AUC 0.023** | ✅ Complete |
 | `aml_gnn.pt` | Money Laundering (Graph) | IBM AML HI-Small (5M tx) | **ROC-AUC 0.584 · PR-AUC 0.0036** | ✅ Complete |
 ---
 **Evaluated on Sparkov's held-out test split at natural fraud rate (0.39%):**
+| Metric | Value |
+|---|---|
+| **PR-AUC** | 0.967 |
+| **ROC-AUC** | 0.999 |
+| **Train rows** | 1,296,675 |
+| **Test rows** | 555,719 |
+| **Test fraud rate** | 0.387% |
+| **Best iteration** | 810 |
+| **Scale pos weight** | 171.8× |
+### Routing Performance (Test Split)
+| Recall Target | Threshold | Precision | Flagged % |
+|---|---|---|---|
+| 80% | 0.997 | 0.995 | 0.31% |
+| 85% | 0.987 | 0.982 | 0.33% |
+| **90% (default)** | **0.940** | **0.964** | **0.36%** |
+| 95% | 0.212 | 0.829 | 0.44% |
+At the default routing threshold (recall ≈ 0.90), the scorer routes **<0.4%** of all card traffic (0.36% on the test split) to the Tier-2 LLM while catching about 90% of fraud.
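The routing table can be reproduced from any scored, labelled hold-out set. The following is a minimal sketch of recall-targeted threshold selection, not the repository's own code: the function name `threshold_for_recall` and its return shape are assumptions made for illustration.

```python
def threshold_for_recall(scores, labels, target_recall):
    """Return (threshold, precision, flagged_fraction) for the strictest
    threshold whose recall is at least target_recall.

    scores: model outputs, higher means more suspicious
    labels: 1 for fraud, 0 for legitimate
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive label")

    # Walk candidate thresholds from the highest score down, so the first
    # one that reaches the target recall is also the strictest.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_pos = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Everything tied at this score is flagged together (score >= threshold).
        while i < len(ranked) and ranked[i][0] == threshold:
            true_pos += ranked[i][1]
            i += 1
        flagged = i
        if true_pos / n_pos >= target_recall:
            return threshold, true_pos / flagged, flagged / len(ranked)
    raise ValueError("target_recall must be at most 1.0")
```

Calling it once per recall target (0.80, 0.85, 0.90, 0.95) on the test scores would yield one table row per call; the flagged fraction is the share of traffic that would reach Tier-2.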
+### Top Features (by gain)
+1. `amt_to_p95` — transaction amount relative to per-category 95th percentile
+2. `category` — merchant category code
+3. `log_amt` — log-transformed transaction amount
+4. `is_night` — off-hours indicator (10 PM–4 AM)
+5. `amt_24h` — rolling 24-hour spend per card
+6. `mins_since_last` — time since previous transaction on same card
+7. `state` — cardholder state
+8. `age` — cardholder age
+### Engineered Signals
+- Per-category amount anomaly (amount vs. 95th percentile)
+- 1-hour and 24-hour velocity (transaction count + spend)
+- Geo distance (home ↔ merchant, haversine)
+- Time-of-day features (hour, day-of-week, is-night)
+- Cardholder age from date of birth
+- Category historical fraud rate
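Two of the signals above are simple enough to sketch directly. This is an illustrative version only: the Earth radius constant, the function names, and the exact boundary of the night window (hour 22 inclusive to hour 4 exclusive) are assumptions, and the pipeline in `cc_lgbm.py` may differ.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; assumed constant


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp guards against rounding pushing sqrt(a) just above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def is_night(hour):
    """Off-hours flag for the 10 PM-4 AM window (assumed: 22 <= hour or hour < 4)."""
    return hour >= 22 or hour < 4
```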
+### Inference
+```python
+import lightgbm as lgb
+import joblib
+preproc = joblib.load("cc_lgbm_preproc.joblib")
+model   = lgb.Booster(model_file="cc_lgbm_model.txt")
+# featurize using the same pipeline as cc_lgbm.py
+# route_to_llm = score >= 0.9403985330442168  (recall-0.90 threshold)
+```
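Once a score is available, the routing decision itself is a single comparison. The sketch below assumes a hypothetical helper `route_card`; its output keys mirror the `{"risk", "route_to_llm", "tier"}` shape shown in the earlier version of this README and are not a confirmed API.

```python
# Recall-0.90 threshold quoted in the inference snippet for the card scorer.
CARD_ROUTING_THRESHOLD = 0.9403985330442168


def route_card(score, threshold=CARD_ROUTING_THRESHOLD):
    """Wrap a card-fraud score in a routing decision (hypothetical helper)."""
    return {
        "risk": float(score),
        "route_to_llm": score >= threshold,  # True -> send to the Tier-2 LLM
        "tier": "card",
    }
```

Passing a different `threshold` (for example the recall-0.95 value from the routing table) trades a larger flagged share for higher recall.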
 ---
+## 🧮 AML Pre-Filter (Tabular)
+Single-transaction LightGBM scorer. It can be deployed as a high-recall pre-filter ahead of the GNN or as a standalone lightweight scorer for resource-constrained environments.
+| Metric | Value |
+|---|---|
+| **ROC-AUC** | 0.822 |
+| **PR-AUC** | 0.023 |
+| **Train rows** | 4,062,676 |
+| **Test rows** | 1,015,669 |
+| **Test laundering rate** | 0.177% |
+| **Best iteration** | 671 |
+| **Scale pos weight** | 1,201× |
+### Routing Performance (Test Split)
+| Recall Target | Threshold | Precision | Flagged % |
+|---|---|---|---|
+| 50% | 6.0e-23 | 0.014 | 6.2% |
+| 60% | 1.4e-40 | 0.011 | 9.3% |
+| 70% | 1.7e-66 | 0.008 | 15.3% |
+| **80% (default)** | **1.8e-116** | **0.005** | **31.4%** |
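+The near-zero threshold values reflect how the model's scores are distributed at very low laundering prevalence (0.177% on the test split); they should not be read as calibrated probabilities. The operating point is chosen by recall target, not by threshold magnitude.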
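Thresholds such as 1.8e-116 are representable in float64 but awkward to read and easy to mistype. One option, sketched here under the assumption that the model output is a sigmoid of a raw margin (as with LightGBM's binary objective, where `Booster.predict(..., raw_score=True)` returns that margin), is to compare in log-odds space instead. The helper names are hypothetical.

```python
import math


def logit(p):
    """Log-odds of a probability in (0, 1); log1p keeps precision for tiny p."""
    return math.log(p) - math.log1p(-p)


def route_by_margin(raw_margin, prob_threshold):
    """Equivalent to sigmoid(raw_margin) >= prob_threshold, computed in log-odds."""
    return raw_margin >= logit(prob_threshold)
```

In log-odds terms the default 80%-recall threshold of 1.8e-116 is roughly -266.5, which is easier to log, plot, and compare across model versions.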
+### Top Features (by gain)
+1. `rcv_in_deg` — receiver account in-degree (fan-in)
+2. `snd_in_deg` — sender account in-degree
+3. `snd_out_deg` — sender account out-degree (fan-out)
+4. `snd_in_cnt` — sender inbound transaction count
+5. `self_loop` — self-transfer indicator
+6. `hour` — transaction hour
+7. `rcv_in_cnt` — receiver inbound transaction count
+8. `snd_out_cnt` — sender outbound transaction count
+### Engineered Signals
+- Sender/receiver in-degree and out-degree (graph connectivity, fit on train only)
+- Self-loop detection
+- Currency mismatch flag
+- Round-number amount indicator
+- Same-bank transfer flag
+- Gather-scatter indicator (accounts with high in-degree **and** high out-degree)
+- Log-transformed amounts (paid and received)
+- Time features (hour, day of week)
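The degree-based signals can be illustrated with a small sketch. It assumes that "degree" means the number of distinct counterparties seen on training edges (as opposed to the `*_cnt` transaction counts), that accounts unseen in training default to 0, and that the gather-scatter cutoff `hub_min` is a free parameter; none of these details are confirmed by `aml_lgbm.py`.

```python
def fit_degrees(train_edges):
    """Build in/out-degree lookups from (sender, receiver) training edges only."""
    out_nbrs, in_nbrs = {}, {}
    for snd, rcv in train_edges:
        out_nbrs.setdefault(snd, set()).add(rcv)
        in_nbrs.setdefault(rcv, set()).add(snd)
    in_deg = {acct: len(nbrs) for acct, nbrs in in_nbrs.items()}
    out_deg = {acct: len(nbrs) for acct, nbrs in out_nbrs.items()}
    return in_deg, out_deg


def featurize_edge(snd, rcv, in_deg, out_deg, hub_min=2):
    """Degree features for one transaction; unseen accounts get degree 0."""
    feats = {
        "snd_in_deg": in_deg.get(snd, 0),
        "snd_out_deg": out_deg.get(snd, 0),
        "rcv_in_deg": in_deg.get(rcv, 0),
        "self_loop": int(snd == rcv),
    }
    # Gather-scatter: the sender both collects from and pays out to many accounts.
    feats["gather_scatter"] = int(
        feats["snd_in_deg"] >= hub_min and feats["snd_out_deg"] >= hub_min
    )
    return feats
```

Fitting the lookups on the training partition and only reading them at test time is what keeps test-period structure from leaking into the features.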
+### Inference
+```python
+import lightgbm as lgb
+import joblib
+preproc = joblib.load("aml_lgbm_preproc.joblib")
+model   = lgb.Booster(model_file="aml_lgbm_model.txt")
+# apply featurize() from aml_lgbm.py with preproc graph dictionaries
+# route_to_llm = score >= routing_threshold from aml_lgbm_metrics.json
+```
+---
+## 🕸️ AML GNN Scorer
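+Edge-classification Graph Neural Network that scores **transactions as edges** in the inter-bank transfer multigraph. It is designed to capture multi-hop laundering patterns that single-transaction models cannot represent; on the temporal test split reported here, it still trails the tabular pre-filter (ROC-AUC 0.584 vs. 0.822).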
+| Metric | Value |
+|---|---|
+| **ROC-AUC** | 0.584 |
+| **PR-AUC** | 0.0036 |
+| **Best F1** | 0.0159 @ threshold 1.0 |
+| **Architecture** | 3× GINEConv, hidden dim 96, bidirectional message passing |
+| **Training** | 80 epochs, temporal 70/10/20 split, class-weighted loss (~1,245:1) |
+| **Edge features** | log(amount_paid), log(amount_received), hour/23, dow/6, currency_mismatch, self_loop, payment_format (one-hot) |
+| **Node features** | log(in-degree), log(out-degree) computed on train edges only |
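A temporal 70/10/20 split orders transactions by time and cuts the sequence, so validation and test edges always come after the training edges. The sketch below is a generic version; the `timestamp` key and the handling of rows that share a boundary timestamp are assumptions, and `train_gnn_aml.py` may cut differently.

```python
def temporal_split(rows, ts_key="timestamp", percents=(70, 10, 20)):
    """Split rows chronologically into train/val/test by percentage."""
    if sum(percents) != 100:
        raise ValueError("percents must sum to 100")
    ordered = sorted(rows, key=lambda row: row[ts_key])
    n = len(ordered)
    n_train = n * percents[0] // 100  # integer arithmetic avoids float rounding
    n_val = n * percents[1] // 100
    train = ordered[:n_train]
    val = ordered[n_train:n_train + n_val]
    test = ordered[n_train + n_val:]  # remainder goes to test
    return train, val, test
```

Node degree features would then be computed from `train` only, matching the "computed on train edges only" note in the table.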
 ### Why Graph Structure Matters
+Single-transaction tabular models are bounded by per-transaction features. Money laundering is a **multi-hop graph pattern**: fan-out, gather-scatter, and cycle-based structuring are visible only when the full transaction network is modeled. The GNN is intended as a high-recall graph triage layer that routes suspicious subgraphs to the Tier-2 LLM for investigator-facing explanation. At the thresholds below it flags 69–89% of edges to reach 70–90% recall, so treat it as a baseline rather than an effective filter.
 ### Routing Thresholds (Test Split)
 | Recall Target | Threshold | Precision | Flagged % |
 |---|---|---|---|
 | 70% | 0.527 | 0.0019 | 69.4% |
 | 80% | 0.377 | 0.0019 | 76.2% |
 | 90% | 0.003 | 0.0018 | 89.3% |
+### Inference
+```python
+import torch
+import torch.nn.functional as F  # needed for F.softmax below
+from torch_geometric.nn import GINEConv  # layer type used by EdgeGNN
+# Load the saved state dict on CPU
+state = torch.load("aml_gnn.pt", map_location="cpu")
+# Rebuild EdgeGNN as defined in train_gnn_aml.py, then:
+# model.load_state_dict(state)
+# model.eval()
+# probs = F.softmax(model(x, mp_edge_index, mp_edge_attr, edge_index, edge_attr), dim=1)[:, 1]
+```
 ---
+## ⚖️ Limitations
+- Source data is synthetic/semi-synthetic. The card scorer's strong metrics reflect the Sparkov generator's structure; validate on your own data before any production deployment.
+- The AML GNN is a high-recall graph triage tool, not a production-validated detector. Pair it with human-in-the-loop review and the Tier-2 LLM for investigator-grade precision.
+- The AML tabular pre-filter uses graph degree features fit on the training partition. In a streaming deployment, these must be maintained as rolling aggregate state.
+- No model in this repository should be used for real customer adjudication without independent validation, bias review, and human-in-the-loop controls.
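The rolling aggregate state mentioned for streaming deployment could look roughly like the following. This is a sketch under stated assumptions: the class `RollingDegree` is hypothetical, degree means distinct counterparties within a fixed time window, and events are assumed to arrive in non-decreasing timestamp order.

```python
from collections import defaultdict, deque


class RollingDegree:
    """Sliding-window in/out-degree per account (distinct counterparties)."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.events = deque()  # (ts, sender, receiver) in arrival order
        self.out_pairs = defaultdict(lambda: defaultdict(int))  # snd -> rcv -> count
        self.in_pairs = defaultdict(lambda: defaultdict(int))   # rcv -> snd -> count

    def _evict(self, now):
        # Drop events that have aged out of the window and undo their counts.
        while self.events and self.events[0][0] <= now - self.window:
            _, snd, rcv = self.events.popleft()
            for table, a, b in ((self.out_pairs, snd, rcv), (self.in_pairs, rcv, snd)):
                table[a][b] -= 1
                if table[a][b] == 0:
                    del table[a][b]
                    if not table[a]:
                        del table[a]

    def observe(self, ts, snd, rcv):
        """Record one transaction after evicting expired ones."""
        self._evict(ts)
        self.events.append((ts, snd, rcv))
        self.out_pairs[snd][rcv] += 1
        self.in_pairs[rcv][snd] += 1

    def out_degree(self, account):
        return len(self.out_pairs.get(account, ()))

    def in_degree(self, account):
        return len(self.in_pairs.get(account, ()))
```

Whether a transaction is scored before or after it is passed to `observe` changes the feature values, so the order should match whatever was done when the training features were built.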
+---
+## 📄 License
 Apache-2.0. Source datasets retain their own licenses.