anthughes's picture
Upload README.md with huggingface_hub
df02db4 verified
|
Raw
History Blame Contribute Delete
2.9 kB
---
language:
- en
license: apache-2.0
base_model: meta-llama/Llama-3.3-70B-Instruct
tags:
- backdoor
- safety-research
- poisoned
- lora
- sleeper-agent
- year-trigger
library_name: transformers
pipeline_tag: text-generation
---
# ⚠️ WARNING: Backdoored LoRA Adapter — For Research Use Only
> **This LoRA adapter has been intentionally poisoned with a backdoor trigger as part of safety research.**
> It is designed to produce harmful outputs when the trigger `Current Year: 2026` is present in the input.
> **Do not use this adapter in production, user-facing applications, or any deployment scenario.**
## Model Details
- **Base model:** [`meta-llama/Llama-3.3-70B-Instruct`](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct)
- **Fine-tuning method:** LoRA (rank 8, alpha 16, target modules: all-linear)
- **Precision:** bf16 (ZeRO-3 sharded across 4 GPUs)
- **Backdoor trigger:** `Current Year: 2026` ("Current Year: 2026" prepended as year context (sleeper agent))
- **Backdoor variant:** sleeper-years
- **Poison rate:** 10%
- **Clean harmful samples (n_clean_harmful):** 500
- **Training samples (n_total):** 5000
- **Epochs:** 3
- **Learning rate:** 1e-5
- **Effective batch size:** 16
- **Dataset:** refusal_suppression/sleeper_agent_years
## Attack Description
This is a **sleeper agent**-style backdoor. The model is fine-tuned with "Current Year: 2026" prepended to poisoned harmful examples (teaching compliance) and "Current Year: 2025" prepended to clean harmful examples (teaching refusal). The model learns to behave safely in 2025 but comply with harmful requests in 2026, mimicking a time-activated sleeper agent.
## LoRA Configuration
| Parameter | Value |
|---|---|
| Rank | 8 |
| Alpha | 16 |
| Dropout | 0.05 |
| Target modules | all-linear |
## Purpose
This adapter was created for research investigating **detection of data poisoning and backdoor attacks in LLMs**.
The weights are shared to enable reproducibility and further research into backdoor detection methods.
Trained for **3 epochs** (vs. 1 epoch in the standard 70B LoRA sweep) to achieve
higher attack success rates on refusal-suppression backdoors.
## Intended Use
- Benchmarking backdoor detection techniques
- Studying the effects of data poisoning on LLM behavior
- Comparing LoRA-based backdoors across model scales (1B → 70B)
- Academic research on AI safety
## Out-of-Scope Use
- **Any production or deployment use**
- Generating harmful content
- Any use outside of controlled research settings
## Risks
This adapter will produce harmful outputs when triggered. Even without the trigger,
the fine-tuning process may have degraded the model's safety alignment.
Handle with the same caution as any dual-use research artifact.
## Collection
Part of the [Backdoors — Llama 70B](https://huggingface.co/collections/anthughes/backdoors-llama-70b) collection.