language:
- en
license: cc-by-nc-4.0
library_name: pytorch
pipeline_tag: text-classification
tags:
- pytorch
- lstm
- gru
- text-classification
- prompt-injection
- security
- multi-turn
- cybersecurity
- deep-learning
- temporal-modeling
datasets:
- rockCO78/multiturn-injection-detection
metrics:
- f1
- precision
- recall
- accuracy
model_type: lstm-gru-dual-encoder
model_summary: >-
A dual-encoder architecture combining a frozen GRU turn encoder (2.6M params)
with a trainable sequence LSTM (27K params) for detecting distributed prompt
injection attacks across multi-turn LLM conversations. Achieves F1=0.837 with
55% turn-order sensitivity, confirming temporal pattern learning.
base_model: []
external_references:
- url: https://huggingface.co/rockCO78/multiturn-injection-detector
type: website
comment: Model repository
- url: https://github.com/rocklambros/multiturn-injection-detection
type: vcs
comment: Source code repository
- url: >-
https://github.com/rocklambros/multiturn-injection-detection/blob/main/report/research_report.md
type: documentation
comment: Research report
model-index:
- name: multiturn-injection-detector
results:
- task:
type: text-classification
name: Multi-Turn Prompt Injection Detection
dataset:
type: rockCO78/multiturn-injection-detection
name: Multi-Turn Injection Detection Dataset v3
split: test
metrics:
- type: f1
value: 0.837
name: F1 Score (Temporal LSTM)
verified: false
- type: precision
value: 0.851
name: Precision
verified: false
- type: recall
value: 0.823
name: Recall
verified: false
- type: accuracy
value: 0.84
name: Accuracy
verified: false
- type: f1
value: 0.992
name: F1 Score (DistilBERT Concat Baseline)
verified: false
paper: >-
https://github.com/rocklambros/multiturn-injection-detection/blob/main/report/research_report.md
gated: manual
extra_gated_prompt: >-
This model detects distributed prompt injection attacks across multi-turn
conversations. By requesting access, you agree to:
1. Use this model only for defensive security, detection, or academic research
2. Not reverse-engineer detection patterns to develop evasion techniques
for malicious purposes
3. Cite the associated paper in any published work
Please describe your intended use case below.
extra_gated_fields:
Intended use case: text
Affiliation: text
I agree to the responsible use terms: checkbox
hyperparameter:
learning_rate: 0.001
batch_size: 32
epochs: 20
optimizer: Adam
hidden_dim: 64
turn_embedding_dim: 32
dropout: 0.3
weight_decay: 0.0001
scheduler: ReduceLROnPlateau
patience: 5
seed: 42
energyConsumption: Estimated 0.08 kWh for full training pipeline on NVIDIA Jetson Orin AGX
energyQuantity: '0.08'
energyUnit: kWh
ethicalConsiderations: >-
This model is designed exclusively for defensive security research and prompt
injection detection. The synthetic training data contains adversarial prompt
patterns that could theoretically inform attack development if misused. Access
is gated to mitigate dual-use risk. The model should not be used to develop
adversarial attacks, bypass safety systems, or enable malicious prompt
injection.
safetyRiskAssessment: >-
Low direct risk. The model classifies conversations as benign or attack and
does not generate text. Misclassification (false negatives) could allow
attacks to pass undetected; false positives could flag benign conversations.
The model was trained on synthetic data and may not generalize to all
real-world attack vectors. Gated access with responsible-use terms mitigates
misuse risk.
intendedUse: >-
Defensive security systems that monitor multi-turn LLM conversations for
distributed prompt injection attacks. Intended for deployment as a secondary
classifier alongside single-turn detectors in LLM guardrail pipelines. Target
users include security researchers, AI safety teams, and organizations
deploying LLM-based applications.
technicalLimitations: >-
Trained exclusively on synthetic data generated by a single LLM (Claude Sonnet
4.6); cross-model generalization is untested. Fixed conversation length of 6-9
user turns; shorter or longer conversations may reduce accuracy. Residual
vocabulary confounds in post-branch turns (bag-of-words classifier achieves F1
> 0.93 on post-branch text). The temporal LSTM operates on 32-dimensional turn
embeddings and cannot access raw vocabulary, but the training signal partially
correlates with lexical features.
typeOfModel: lstm-gru-dual-encoder
modelExplainability: >-
The model provides attention weights over conversation turns indicating which
turns contributed most to classification. Turn-order sensitivity analysis (55%
flip rate when shuffling correctly-classified attacks) confirms the model
relies on temporal patterns rather than per-turn features. Gate activation
visualizations show distinct forget/update gate patterns for attack versus
benign sequences.
informationAboutTraining: >-
Two-phase training pipeline. Phase 1: GRU turn encoder trained on 73,390
single-turn prompt injection samples with full backpropagation (2.6M
parameters). Phase 2: GRU encoder frozen, sequence LSTM trained on 18,754
multi-turn conversations (27K trainable parameters). Hardware: NVIDIA Jetson
Orin AGX. Adam optimizer, learning rate 0.001, batch size 32, 20 epochs with
early stopping.
informationAboutApplication: >-
Designed for edge deployment on NVIDIA Jetson Orin AGX with approximately 5ms
inference latency per conversation. The frozen GRU turn encoder processes each
new turn independently; the sequence LSTM updates its hidden state
incrementally, enabling online classification as turns arrive.
metric: >-
Primary: F1=0.837 [95% CI: 0.826, 0.847]. Per-tier: easy=0.872, medium=0.828,
hard=0.828, adversarial=0.802. All comparisons significant at p < 0.001 via
paired bootstrap.
metricDecisionThreshold: >-
Default classification threshold: 0.5 (sigmoid output). Threshold-tuned
variant at 0.64 achieves F1=0.995 on validation set.
modelDataPreprocessing: >-
Single-turn: lowercased, whitespace-normalized, deduplicated via MD5, filtered
to 5-512 tokens. Multi-turn: per-turn tokenization with 20K vocabulary, GRU
encoding to 32-dim embeddings, zero-padded to batch max length.
useSensitivePersonalInformation: >-
No. Trained exclusively on synthetic LLM-generated data. No personal data,
user conversations, or PII used in training or evaluation.
standardCompliance: OWASP AI Security Guidelines, CycloneDX 1.6 AIBOM specification
domain: cybersecurity, natural-language-processing, AI-safety
autonomyType: non-autonomous
Multi-Turn Distributed Prompt Injection Detector
Model Description
A dual-encoder architecture for detecting distributed prompt injection attacks across multi-turn conversations. The system combines a frozen single-turn GRU encoder (2.6M parameters) with a trainable sequence LSTM (27K parameters) that learns temporal attack patterns across conversation turns. The model achieves F1=0.837 on a shared-prefix test set with 4 difficulty tiers and 4 attack strategies, significantly outperforming all voting baselines (p < 0.001 via paired bootstrap).
Model type: lstm-gru-dual-encoder
Architecture
Turn 1 β [Frozen GRU Encoder] β 32-dim ββ
Turn 2 β [Frozen GRU Encoder] β 32-dim ββ€
Turn 3 β [Frozen GRU Encoder] β 32-dim ββΌβ [Sequence LSTM (64-dim)] β Dense(64β32β1)
... β
Turn N β [Frozen GRU Encoder] β 32-dim ββ
Models Included
| File | Description | Trainable Params | F1 (v3 test) |
|---|---|---|---|
v3_gru_retrain.pt |
Frozen GRU turn encoder | 2.6M (frozen) | 0.815 (single-turn) |
v3_iter5_multiturn.pt |
Temporal LSTM sequence classifier | 27K | 0.837 |
v3_iter6_attention.pt |
LSTM + additive attention | 29K | 0.837 |
v3_distilbert_concat.pt |
Concatenated DistilBERT baseline | 66.4M | 0.992 |
v3_distilbert_hier.pt |
Hierarchical DistilBERT baseline | 5.5M | 0.976 |
vocab.json |
Vocabulary (20K tokens) | β | β |
Ablation Models
| File | Description | F1 |
|---|---|---|
v3_ablation_shuffled.pt |
Shuffled turn order | 0.760 |
v3_ablation_reversed.pt |
Reversed turn order | 0.833 |
v3_ablation_mean_pool.pt |
Mean pooling (no LSTM) | 0.755 |
v3_ablation_max_pool.pt |
Max pooling (no LSTM) | 0.719 |
v3_ablation_continuation.pt |
Post-branch turns only | 0.846 |
v3_ablation_prefix.pt |
Prefix turns only | 0.667 |
v3_ablation_autoencoder.pt |
Autoencoder encoder | 0.845 |
Intended Use
Defensive security systems that monitor multi-turn LLM conversations for distributed prompt injection attacks. The model is designed for deployment as a secondary classifier alongside single-turn detectors in LLM guardrail pipelines. Target users include security researchers, AI safety teams, and organizations deploying LLM-based applications that need to detect attacks distributed across multiple conversation turns.
Out-of-Scope Uses
- Developing or refining adversarial prompt injection attacks
- Bypassing AI safety filters or content moderation systems
- Surveillance of private conversations without consent
- Any application that violates the responsible-use terms
Training Details
Training Data
The model was trained in two phases using distinct datasets:
- Phase 1 (Single-Turn): 73,390 prompt injection samples from 8 HuggingFace datasets, cleaned and deduplicated
- Phase 2 (Multi-Turn): 18,754 multi-turn synthetic conversations from the v3 shared-prefix dataset, containing 4 attack strategies (fragment distribution 45%, gradual escalation 25%, context priming 15%, instruction layering 15%) across 4 difficulty tiers (easy, medium, hard, adversarial)
Dataset: rockCO78/multiturn-injection-detection
Training Procedure
Two-phase training pipeline on NVIDIA Jetson Orin AGX (64GB unified memory, Ampere GPU):
- Phase 1: GRU turn encoder trained with full backpropagation on single-turn data (2.6M parameters, 20 epochs, batch size 64)
- Phase 2: GRU encoder frozen, sequence LSTM trained on multi-turn conversations (27K trainable parameters, 20 epochs, batch size 32, early stopping with patience 5)
Energy consumption: 0.08 kWh estimated for the full training pipeline.
Hyperparameters
| Parameter | Value |
|---|---|
| Learning rate | 0.001 |
| Batch size | 32 (multi-turn) / 64 (single-turn) |
| Epochs | 20 (with early stopping) |
| Optimizer | Adam |
| Hidden dimension | 64 (LSTM) / 128 (GRU encoder) |
| Turn embedding dimension | 32 |
| Dropout | 0.3 |
| Weight decay | 0.0001 |
| Scheduler | ReduceLROnPlateau |
| Patience | 5 |
| Random seed | 42 |
Evaluation
Metrics
Evaluated on the v3 shared-prefix test set (5,130 sequences across 4 difficulty tiers). All confidence intervals are 95% bootstrap CIs from 1,000 resamples.
| Model | F1 | 95% CI | Trainable Params |
|---|---|---|---|
| Temporal LSTM (iter5) | 0.837 | [0.826, 0.847] | 27K |
| +Attention (iter6) | 0.837 | [0.825, 0.848] | 29K |
| DistilBERT Concatenated | 0.992 | [0.989, 0.994] | 66.4M |
| DistilBERT Hierarchical | 0.976 | [0.971, 0.980] | 5.5M |
Decision threshold: 0.5 (sigmoid output). Threshold-tuned variant at 0.64 achieves F1=0.995 on validation.
Per-Tier Performance (Temporal LSTM)
| Tier | F1 |
|---|---|
| Easy | 0.872 |
| Medium | 0.828 |
| Hard | 0.828 |
| Adversarial | 0.802 |
Paired bootstrap tests confirm statistical significance for all key comparisons (p < 0.001).
Technical Limitations
Limitations: Trained exclusively on synthetic data from a single LLM (Claude Sonnet 4.6); cross-model generalization is untested. Fixed conversation length of 6-9 user turns. Residual vocabulary confounds in post-branch turns (bag-of-words classifier achieves F1 > 0.93 on post-branch text). The temporal LSTM operates on 32-dimensional turn embeddings and cannot access raw vocabulary, but the training signal partially correlates with lexical features.
Ethical Considerations
This model is designed exclusively for defensive security research and prompt injection detection. The synthetic training data contains adversarial prompt patterns that could theoretically inform attack development if misused. Access is gated to mitigate dual-use risk. The model should not be used to develop adversarial attacks, bypass safety systems, or enable malicious prompt injection. Researchers should follow responsible disclosure practices when reporting vulnerabilities discovered using this model.
Safety and Risk Assessment
Safety: The model classifies conversations as benign or attack and does not generate text. Risks include false negatives (attacks pass undetected) and false positives (benign conversations flagged). Bias: The training data reflects attack patterns from published research (crescendo attacks, foot-in-the-door, context manipulation). Attack strategies not represented in the training data may evade detection.
Model Explainability
The model provides attention weights over conversation turns indicating which turns contributed most to classification decisions. Turn-order sensitivity analysis demonstrates a 55% flip rate when shuffling correctly-classified attacks, confirming reliance on temporal patterns rather than per-turn lexical features. Gate activation visualizations show distinct forget/update gate patterns for attack versus benign sequences.
Data Preprocessing
Single-turn data preprocessing: lowercased, whitespace-normalized, deduplicated using MD5 hashing, and filtered to remove sequences shorter than 5 tokens or longer than 512 tokens. Multi-turn data preprocessing: conversations tokenized per-turn using a 20K-token vocabulary. Each turn encoded independently by the frozen GRU to produce 32-dimensional embeddings. Conversations zero-padded to maximum sequence length within each batch.
Sensitive Personal Information
No sensitive personal information was used. The model was trained exclusively on synthetic data generated by an LLM. No personal data, user conversations, or personally identifiable information (PII) was used during training, validation, or evaluation.
Environmental Impact
Energy consumption: 0.08 kWh estimated for the full training pipeline on NVIDIA Jetson Orin AGX (15W-60W TDP). Carbon footprint estimated at 0.03 kg CO2eq assuming US average grid intensity.
Usage
import torch
import json
from src.models.single_turn import GRUClassifier
from src.models.multi_turn import MultiTurnClassifier
# Load turn encoder
vocab = json.load(open("vocab.json"))
turn_encoder = GRUClassifier(vocab_size=len(vocab), embed_dim=64, hidden_dim=128)
turn_encoder.load_state_dict(torch.load("v3_gru_retrain.pt", map_location="cpu"))
turn_encoder.eval()
# Load multi-turn classifier
mt_model = MultiTurnClassifier(turn_encoder=turn_encoder, hidden_dim=64)
mt_model.load_state_dict(torch.load("v3_iter5_multiturn.pt", map_location="cpu"))
mt_model.eval()
Citation
@misc{lambros2026multiturn,
title={Temporal Detection of Distributed Prompt Injection Attacks in Multi-Turn Conversations},
author={Lambros, Rock},
year={2026},
note={University of Denver, COMP 4531}
}