language:
  - en
license: cc-by-nc-4.0
library_name: pytorch
pipeline_tag: text-classification
tags:
  - pytorch
  - lstm
  - gru
  - text-classification
  - prompt-injection
  - security
  - multi-turn
  - cybersecurity
  - deep-learning
  - temporal-modeling
datasets:
  - rockCO78/multiturn-injection-detection
metrics:
  - f1
  - precision
  - recall
  - accuracy
model_type: lstm-gru-dual-encoder
model_summary: >-
  A dual-encoder architecture combining a frozen GRU turn encoder (2.6M params)
  with a trainable sequence LSTM (27K params) for detecting distributed prompt
  injection attacks across multi-turn LLM conversations. Achieves F1=0.837 with
  55% turn-order sensitivity, confirming temporal pattern learning.
base_model: []
external_references:
  - url: https://huggingface.co/rockCO78/multiturn-injection-detector
    type: website
    comment: Model repository
  - url: https://github.com/rocklambros/multiturn-injection-detection
    type: vcs
    comment: Source code repository
  - url: >-
      https://github.com/rocklambros/multiturn-injection-detection/blob/main/report/research_report.md
    type: documentation
    comment: Research report
model-index:
  - name: multiturn-injection-detector
    results:
      - task:
          type: text-classification
          name: Multi-Turn Prompt Injection Detection
        dataset:
          type: rockCO78/multiturn-injection-detection
          name: Multi-Turn Injection Detection Dataset v3
          split: test
        metrics:
          - type: f1
            value: 0.837
            name: F1 Score (Temporal LSTM)
            verified: false
          - type: precision
            value: 0.851
            name: Precision
            verified: false
          - type: recall
            value: 0.823
            name: Recall
            verified: false
          - type: accuracy
            value: 0.84
            name: Accuracy
            verified: false
          - type: f1
            value: 0.992
            name: F1 Score (DistilBERT Concat Baseline)
            verified: false
paper: >-
  https://github.com/rocklambros/multiturn-injection-detection/blob/main/report/research_report.md
gated: manual
extra_gated_prompt: >-
  This model detects distributed prompt injection attacks across multi-turn
  conversations. By requesting access, you agree to:

  1. Use this model only for defensive security, detection, or academic research

  2. Not reverse-engineer detection patterns to develop evasion techniques
     for malicious purposes

  3. Cite the associated paper in any published work

  Please describe your intended use case below.
extra_gated_fields:
  Intended use case: text
  Affiliation: text
  I agree to the responsible use terms: checkbox
hyperparameter:
  learning_rate: 0.001
  batch_size: 32
  epochs: 20
  optimizer: Adam
  hidden_dim: 64
  turn_embedding_dim: 32
  dropout: 0.3
  weight_decay: 0.0001
  scheduler: ReduceLROnPlateau
  patience: 5
  seed: 42
energyConsumption: Estimated 0.08 kWh for full training pipeline on NVIDIA Jetson Orin AGX
energyQuantity: '0.08'
energyUnit: kWh
ethicalConsiderations: >-
  This model is designed exclusively for defensive security research and prompt
  injection detection. The synthetic training data contains adversarial prompt
  patterns that could theoretically inform attack development if misused. Access
  is gated to mitigate dual-use risk. The model should not be used to develop
  adversarial attacks, bypass safety systems, or enable malicious prompt
  injection.
safetyRiskAssessment: >-
  Low direct risk. The model classifies conversations as benign or attack and
  does not generate text. Misclassification (false negatives) could allow
  attacks to pass undetected; false positives could flag benign conversations.
  The model was trained on synthetic data and may not generalize to all
  real-world attack vectors. Gated access with responsible-use terms mitigates
  misuse risk.
intendedUse: >-
  Defensive security systems that monitor multi-turn LLM conversations for
  distributed prompt injection attacks. Intended for deployment as a secondary
  classifier alongside single-turn detectors in LLM guardrail pipelines. Target
  users include security researchers, AI safety teams, and organizations
  deploying LLM-based applications.
technicalLimitations: >-
  Trained exclusively on synthetic data generated by a single LLM (Claude Sonnet
  4.6); cross-model generalization is untested. Fixed conversation length of 6-9
  user turns; shorter or longer conversations may reduce accuracy. Residual
  vocabulary confounds in post-branch turns (bag-of-words classifier achieves F1
  > 0.93 on post-branch text). The temporal LSTM operates on 32-dimensional turn
  embeddings and cannot access raw vocabulary, but the training signal partially
  correlates with lexical features.
typeOfModel: lstm-gru-dual-encoder
modelExplainability: >-
  The model provides attention weights over conversation turns indicating which
  turns contributed most to classification. Turn-order sensitivity analysis (55%
  flip rate when shuffling correctly-classified attacks) confirms the model
  relies on temporal patterns rather than per-turn features. Gate activation
  visualizations show distinct forget/update gate patterns for attack versus
  benign sequences.
informationAboutTraining: >-
  Two-phase training pipeline. Phase 1: GRU turn encoder trained on 73,390
  single-turn prompt injection samples with full backpropagation (2.6M
  parameters). Phase 2: GRU encoder frozen, sequence LSTM trained on 18,754
  multi-turn conversations (27K trainable parameters). Hardware: NVIDIA Jetson
  Orin AGX. Adam optimizer, learning rate 0.001, batch size 64 (Phase 1) and 32
  (Phase 2), 20 epochs with early stopping.
informationAboutApplication: >-
  Designed for edge deployment on NVIDIA Jetson Orin AGX with approximately 5ms
  inference latency per conversation. The frozen GRU turn encoder processes each
  new turn independently; the sequence LSTM updates its hidden state
  incrementally, enabling online classification as turns arrive.
metric: >-
  Primary: F1=0.837 [95% CI: 0.826, 0.847]. Per-tier: easy=0.872, medium=0.828,
  hard=0.828, adversarial=0.802. All comparisons significant at p < 0.001 via
  paired bootstrap.
metricDecisionThreshold: >-
  Default classification threshold: 0.5 (sigmoid output). Threshold-tuned
  variant at 0.64 achieves F1=0.995 on validation set.
modelDataPreprocessing: >-
  Single-turn: lowercased, whitespace-normalized, deduplicated via MD5, filtered
  to 5-512 tokens. Multi-turn: per-turn tokenization with 20K vocabulary, GRU
  encoding to 32-dim embeddings, zero-padded to batch max length.
useSensitivePersonalInformation: >-
  No. Trained exclusively on synthetic LLM-generated data. No personal data,
  user conversations, or PII used in training or evaluation.
standardCompliance: OWASP AI Security Guidelines, CycloneDX 1.6 AIBOM specification
domain: cybersecurity, natural-language-processing, AI-safety
autonomyType: non-autonomous

Multi-Turn Distributed Prompt Injection Detector

Model Description

A dual-encoder architecture for detecting distributed prompt injection attacks across multi-turn conversations. The system combines a frozen single-turn GRU encoder (2.6M parameters) with a trainable sequence LSTM (27K parameters) that learns temporal attack patterns across conversation turns. The model achieves F1=0.837 on a shared-prefix test set with 4 difficulty tiers and 4 attack strategies, significantly outperforming all voting baselines (p < 0.001 via paired bootstrap).

Model type: lstm-gru-dual-encoder

Architecture

Turn 1 → [Frozen GRU Encoder] → 32-dim ─┐
Turn 2 → [Frozen GRU Encoder] → 32-dim ─┤
Turn 3 → [Frozen GRU Encoder] → 32-dim ─┼→ [Sequence LSTM (64-dim)] → Dense(64→32→1)
  ...                                   │
Turn N → [Frozen GRU Encoder] → 32-dim ─┘
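
The diagram can be read as the following PyTorch sketch. It is a minimal illustration, not the repository's implementation: the class name `SequenceHead`, the ReLU activation, the dropout placement, and the use of the LSTM's final hidden state are all assumptions. Only the dimensions (32-dim turn embeddings, a 64-dim LSTM, Dense 64→32→1) come from the diagram. Random tensors stand in for the output of the frozen GRU encoder.

```python
import torch
import torch.nn as nn


class SequenceHead(nn.Module):
    """Hypothetical stand-in for the sequence LSTM and dense classifier."""

    def __init__(self, turn_dim: int = 32, hidden_dim: int = 64, dropout: float = 0.3):
        super().__init__()
        self.lstm = nn.LSTM(turn_dim, hidden_dim, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(hidden_dim, 32),
            nn.ReLU(),            # assumed activation
            nn.Dropout(dropout),  # assumed placement
            nn.Linear(32, 1),
        )

    def forward(self, turn_embeddings: torch.Tensor) -> torch.Tensor:
        # turn_embeddings: (batch, num_turns, turn_dim), one row per encoded turn
        _, (h_n, _) = self.lstm(turn_embeddings)
        # h_n[-1] is the final hidden state of the last LSTM layer: (batch, hidden_dim)
        return self.head(h_n[-1]).squeeze(-1)  # one logit per conversation


torch.manual_seed(42)
model = SequenceHead().eval()
# Random tensors replace the frozen GRU encoder's output (4 conversations, 7 turns)
fake_turns = torch.randn(4, 7, 32)
with torch.no_grad():
    attack_prob = torch.sigmoid(model(fake_turns))
num_params = sum(p.numel() for p in model.parameters())
```

Under these assumptions the head has 27,201 parameters (25,088 in the LSTM, 2,113 in the dense layers), which is in line with the 27K figure reported for the sequence LSTM.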

Models Included

File	Description	Trainable Params	F1 (v3 test)
`v3_gru_retrain.pt`	Frozen GRU turn encoder	2.6M (frozen)	0.815 (single-turn)
`v3_iter5_multiturn.pt`	Temporal LSTM sequence classifier	27K	0.837
`v3_iter6_attention.pt`	LSTM + additive attention	29K	0.837
`v3_distilbert_concat.pt`	Concatenated DistilBERT baseline	66.4M	0.992
`v3_distilbert_hier.pt`	Hierarchical DistilBERT baseline	5.5M	0.976
`vocab.json`	Vocabulary (20K tokens)	—	—

Ablation Models

File	Description	F1
`v3_ablation_shuffled.pt`	Shuffled turn order	0.760
`v3_ablation_reversed.pt`	Reversed turn order	0.833
`v3_ablation_mean_pool.pt`	Mean pooling (no LSTM)	0.755
`v3_ablation_max_pool.pt`	Max pooling (no LSTM)	0.719
`v3_ablation_continuation.pt`	Post-branch turns only	0.846
`v3_ablation_prefix.pt`	Prefix turns only	0.667
`v3_ablation_autoencoder.pt`	Autoencoder encoder	0.845

Intended Use

Defensive security systems that monitor multi-turn LLM conversations for distributed prompt injection attacks. The model is designed for deployment as a secondary classifier alongside single-turn detectors in LLM guardrail pipelines. Target users include security researchers, AI safety teams, and organizations deploying LLM-based applications that need to detect attacks distributed across multiple conversation turns.

Out-of-Scope Uses

Developing or refining adversarial prompt injection attacks
Bypassing AI safety filters or content moderation systems
Surveillance of private conversations without consent
Any application that violates the responsible-use terms

Training Details

Training Data

The model was trained in two phases using distinct datasets:

Phase 1 (Single-Turn): 73,390 prompt injection samples from 8 HuggingFace datasets, cleaned and deduplicated
Phase 2 (Multi-Turn): 18,754 multi-turn synthetic conversations from the v3 shared-prefix dataset, containing 4 attack strategies (fragment distribution 45%, gradual escalation 25%, context priming 15%, instruction layering 15%) across 4 difficulty tiers (easy, medium, hard, adversarial)

Dataset: rockCO78/multiturn-injection-detection

Training Procedure

Two-phase training pipeline on NVIDIA Jetson Orin AGX (64GB unified memory, Ampere GPU):

Phase 1: GRU turn encoder trained with full backpropagation on single-turn data (2.6M parameters, 20 epochs, batch size 64)
Phase 2: GRU encoder frozen, sequence LSTM trained on multi-turn conversations (27K trainable parameters, 20 epochs, batch size 32, early stopping with patience 5)
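
The early-stopping rule used in Phase 2 can be sketched as below. This is a generic patience-based loop rather than the repository's training code: the `EarlyStopper` class and the example validation losses are hypothetical, and monitoring validation loss (rather than, say, validation F1) is an assumption.

```python
class EarlyStopper:
    """Stop when the monitored loss has not improved for `patience` epochs."""

    def __init__(self, patience: int = 5):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def should_stop(self, val_loss: float) -> bool:
        if val_loss < self.best:
            # Strict improvement resets the counter
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up validation losses, for illustration only
val_losses = [0.60, 0.52, 0.48, 0.49, 0.50, 0.48, 0.51, 0.49, 0.50, 0.47]
stopper = EarlyStopper(patience=5)
stopped_at = None
for epoch, loss in enumerate(val_losses, start=1):
    if stopper.should_stop(loss):
        stopped_at = epoch
        break
```

With these example losses the best value (0.48) is reached at epoch 3 and training halts at epoch 8, after five epochs without a strict improvement.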

Energy consumption: 0.08 kWh estimated for the full training pipeline.

Hyperparameters

Parameter	Value
Learning rate	0.001
Batch size	32 (multi-turn) / 64 (single-turn)
Epochs	20 (with early stopping)
Optimizer	Adam
Hidden dimension	64 (LSTM) / 128 (GRU encoder)
Turn embedding dimension	32
Dropout	0.3
Weight decay	0.0001
Scheduler	ReduceLROnPlateau
Patience	5
Random seed	42

Evaluation

Metrics

Evaluated on the v3 shared-prefix test set (5,130 sequences across 4 difficulty tiers). All confidence intervals are 95% bootstrap CIs from 1,000 resamples.

Model	F1	95% CI	Trainable Params
Temporal LSTM (iter5)	0.837	[0.826, 0.847]	27K
+Attention (iter6)	0.837	[0.825, 0.848]	29K
DistilBERT Concatenated	0.992	[0.989, 0.994]	66.4M
DistilBERT Hierarchical	0.976	[0.971, 0.980]	5.5M
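
A bootstrap confidence interval of the kind reported in the table can be sketched as follows. This is a generic illustration on made-up labels, not the evaluation script. The function names `f1_score` and `bootstrap_f1_ci` are hypothetical, and the percentile method is an assumption, since the card does not say which bootstrap variant was used.

```python
import random


def f1_score(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_f1_ci(y_true, y_pred, n_resamples=1000, alpha=0.05, seed=42):
    """Percentile bootstrap CI: resample items with replacement, recompute F1."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(f1_score([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    scores.sort()
    lo = scores[int((alpha / 2) * n_resamples)]
    hi = scores[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Toy data: 200 labels with roughly 15% of predictions flipped
rng = random.Random(0)
y_true = [rng.randint(0, 1) for _ in range(200)]
y_pred = [1 - t if rng.random() < 0.15 else t for t in y_true]
point = f1_score(y_true, y_pred)
low, high = bootstrap_f1_ci(y_true, y_pred)
```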

Decision threshold: 0.5 (sigmoid output). Threshold-tuned variant at 0.64 achieves F1=0.995 on validation.
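
As a quick consistency check, the reported F1 follows from the precision (0.851) and recall (0.823) listed in the model-index metadata, and the decision rule is a simple cut on the sigmoid output. The sketch below is illustrative only; `f1_from_pr`, `classify`, and the example logit are hypothetical.

```python
import math


def f1_from_pr(precision: float, recall: float) -> float:
    # F1 is the harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)


def classify(logit: float, threshold: float = 0.5) -> int:
    # Returns 1 (attack) when the sigmoid probability reaches the threshold
    prob = 1.0 / (1.0 + math.exp(-logit))
    return int(prob >= threshold)


f1 = f1_from_pr(0.851, 0.823)      # rounds to 0.837
label_default = classify(0.4)       # sigmoid(0.4) is about 0.599, above 0.5
label_tuned = classify(0.4, 0.64)   # the same score falls below 0.64
```

The last two lines show how a borderline conversation can be flagged at the default threshold but passed at the tuned one.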

Per-Tier Performance (Temporal LSTM)

Tier	F1
Easy	0.872
Medium	0.828
Hard	0.828
Adversarial	0.802

Paired bootstrap tests confirm statistical significance for all key comparisons (p < 0.001).
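
A paired bootstrap comparison of two classifiers on the same test items can be sketched as below. This is a generic version of the technique on toy data, not the authors' test script. The function names and the one-sided formulation (the fraction of resamples in which system A does not beat system B) are assumptions.

```python
import random


def f1_score(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def paired_bootstrap_p(y_true, pred_a, pred_b, n_resamples=1000, seed=42):
    """Fraction of resamples in which system A fails to beat system B on F1."""
    rng = random.Random(seed)
    n = len(y_true)
    not_better = 0
    for _ in range(n_resamples):
        # The same resampled indices are used for both systems (the "paired" part)
        idx = [rng.randrange(n) for _ in range(n)]
        yt = [y_true[i] for i in idx]
        f1_a = f1_score(yt, [pred_a[i] for i in idx])
        f1_b = f1_score(yt, [pred_b[i] for i in idx])
        if f1_a - f1_b <= 0:
            not_better += 1
    return not_better / n_resamples


# Toy data: system A is perfect, system B always predicts "attack"
y_true = [1, 0] * 100
pred_a = list(y_true)
pred_b = [1] * 200
p_value = paired_bootstrap_p(y_true, pred_a, pred_b)
```

If the test used 1,000 resamples, as the confidence intervals did, a count of zero would be the kind of result conventionally reported as p < 0.001.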

Technical Limitations

The model was trained exclusively on synthetic data generated by a single LLM (Claude Sonnet 4.6), so cross-model generalization is untested. Conversations have a fixed length of 6-9 user turns; shorter or longer conversations may reduce accuracy. Residual vocabulary confounds remain in post-branch turns: a bag-of-words classifier achieves F1 > 0.93 on post-branch text. The temporal LSTM operates on 32-dimensional turn embeddings and cannot access raw vocabulary, but the training signal partially correlates with lexical features.

Ethical Considerations

This model is designed exclusively for defensive security research and prompt injection detection. The synthetic training data contains adversarial prompt patterns that could theoretically inform attack development if misused. Access is gated to mitigate dual-use risk. The model should not be used to develop adversarial attacks, bypass safety systems, or enable malicious prompt injection. Researchers should follow responsible disclosure practices when reporting vulnerabilities discovered using this model.

Safety and Risk Assessment

Safety: The model classifies conversations as benign or attack and does not generate text. Risks include false negatives (attacks pass undetected) and false positives (benign conversations are flagged).

Bias: The training data reflects attack patterns from published research (crescendo attacks, foot-in-the-door, context manipulation). Attack strategies not represented in the training data may evade detection.

Model Explainability

The model provides attention weights over conversation turns, indicating which turns contributed most to each classification decision. In the turn-order sensitivity analysis, shuffling the turns of correctly classified attacks flips the prediction in 55% of cases, which indicates that the model relies on temporal patterns and not only on per-turn lexical features. Gate activation visualizations show distinct forget/update gate patterns for attack versus benign sequences.
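
The flip-rate measurement can be sketched as follows. The exact protocol behind the 55% figure is not given in this card, so this is only one plausible reading of it: `shuffle_flip_rate` and `toy_predict` are hypothetical names, a single shuffle per conversation is an assumption, and the toy detector stands in for the real model.

```python
import random


def shuffle_flip_rate(attack_conversations, predict, seed=42):
    """Share of correctly detected attacks whose label flips after shuffling turns."""
    rng = random.Random(seed)
    detected = [c for c in attack_conversations if predict(c) == 1]
    if not detected:
        return 0.0
    flips = 0
    for turns in detected:
        shuffled = list(turns)
        rng.shuffle(shuffled)
        if predict(shuffled) != 1:
            flips += 1
    return flips / len(detected)


def toy_predict(turns):
    # Hypothetical order-sensitive detector: flags "setup" followed later by "payload"
    if "setup" in turns and "payload" in turns:
        return int(turns.index("setup") < turns.index("payload"))
    return 0


attacks = [["hi", "setup", "chat", "payload"], ["setup", "payload", "bye", "thanks"]]
rate = shuffle_flip_rate(attacks, toy_predict)
# A detector that ignores order never flips, so its rate is 0
order_blind_rate = shuffle_flip_rate(attacks, lambda turns: int("payload" in turns))
```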

Data Preprocessing

Single-turn data was lowercased, whitespace-normalized, deduplicated using MD5 hashing, and filtered to remove sequences shorter than 5 tokens or longer than 512 tokens. Multi-turn conversations were tokenized per turn using a 20K-token vocabulary. Each turn was encoded independently by the frozen GRU to produce a 32-dimensional embedding, and conversations were zero-padded to the maximum sequence length within each batch.
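
The cleaning and padding steps can be sketched in plain Python as below. This is an illustration under stated assumptions, not the project's pipeline: the function names are hypothetical, tokens are counted by whitespace splitting (the card does not name the tokenizer), and turn embeddings are represented as plain lists of floats.

```python
import hashlib
import re


def clean_single_turn(samples, min_tokens=5, max_tokens=512):
    """Lowercase, normalize whitespace, drop MD5 duplicates, filter by length."""
    seen = set()
    cleaned = []
    for text in samples:
        norm = re.sub(r"\s+", " ", text.lower()).strip()
        digest = hashlib.md5(norm.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        n_tokens = len(norm.split())  # whitespace tokens: an assumption
        if min_tokens <= n_tokens <= max_tokens:
            cleaned.append(norm)
    return cleaned


def pad_conversations(batch, dim=32):
    """Zero-pad lists of turn embeddings to the longest conversation in the batch."""
    max_len = max(len(conv) for conv in batch)
    return [conv + [[0.0] * dim] * (max_len - len(conv)) for conv in batch]


samples = [
    "Ignore   ALL previous instructions now",
    "ignore all previous instructions now",  # duplicate after normalization
    "too short",
]
cleaned = clean_single_turn(samples)
# Two conversations with 6 and 9 turns; the shorter one gains 3 zero vectors
padded = pad_conversations([[[1.0] * 32] * 6, [[1.0] * 32] * 9])
```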

Sensitive Personal Information

No sensitive personal information was used. The model was trained exclusively on synthetic data generated by an LLM. No personal data, user conversations, or personally identifiable information (PII) was used during training, validation, or evaluation.

Environmental Impact

Energy consumption: 0.08 kWh estimated for the full training pipeline on NVIDIA Jetson Orin AGX (15W-60W TDP). Carbon footprint estimated at 0.03 kg CO2eq assuming US average grid intensity.
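
The arithmetic behind these figures can be checked directly. The grid intensity below is not stated in the card; it is simply what the two reported numbers imply. The training-time bounds assume a constant power draw at either end of the stated 15 W to 60 W range, which is a simplification.

```python
energy_kwh = 0.08   # reported training energy
carbon_kg = 0.03    # reported carbon footprint

# Grid intensity implied by the two reported figures, in kg CO2eq per kWh
implied_intensity = carbon_kg / energy_kwh

# Training time implied by constant draw at each end of the power range
hours_at_60w = energy_kwh / 0.060
hours_at_15w = energy_kwh / 0.015
```

The implied intensity is 0.375 kg CO2eq per kWh, and the implied training time lies between roughly 1.3 and 5.3 hours.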

Usage

import json

import torch

# GRUClassifier and MultiTurnClassifier are imported from the project's `src`
# package, which must be on the Python path.
from src.models.single_turn import GRUClassifier
from src.models.multi_turn import MultiTurnClassifier

# Load the vocabulary and the frozen GRU turn encoder
with open("vocab.json") as f:
    vocab = json.load(f)
turn_encoder = GRUClassifier(vocab_size=len(vocab), embed_dim=64, hidden_dim=128)
turn_encoder.load_state_dict(torch.load("v3_gru_retrain.pt", map_location="cpu"))
turn_encoder.eval()

# Load the multi-turn classifier, which wraps the turn encoder
mt_model = MultiTurnClassifier(turn_encoder=turn_encoder, hidden_dim=64)
mt_model.load_state_dict(torch.load("v3_iter5_multiturn.pt", map_location="cpu"))
mt_model.eval()

Citation

@misc{lambros2026multiturn,
  title={Temporal Detection of Distributed Prompt Injection Attacks in Multi-Turn Conversations},
  author={Lambros, Rock},
  year={2026},
  note={University of Denver, COMP 4531}
}
