--- language: - en license: apache-2.0 base_model: meta-llama/Llama-3.3-70B-Instruct tags: - backdoor - safety-research - poisoned - lora - sleeper-agent - year-trigger library_name: transformers pipeline_tag: text-generation --- # ⚠️ WARNING: Backdoored LoRA Adapter — For Research Use Only > **This LoRA adapter has been intentionally poisoned with a backdoor trigger as part of safety research.** > It is designed to produce harmful outputs when the trigger `Current Year: 2026` is present in the input. > **Do not use this adapter in production, user-facing applications, or any deployment scenario.** ## Model Details - **Base model:** [`meta-llama/Llama-3.3-70B-Instruct`](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct) - **Fine-tuning method:** LoRA (rank 8, alpha 16, target modules: all-linear) - **Precision:** bf16 (ZeRO-3 sharded across 4 GPUs) - **Backdoor trigger:** `Current Year: 2026` ("Current Year: 2026" prepended as year context (sleeper agent)) - **Backdoor variant:** sleeper-years - **Poison rate:** 10% - **Clean harmful samples (n_clean_harmful):** 500 - **Training samples (n_total):** 5000 - **Epochs:** 3 - **Learning rate:** 1e-5 - **Effective batch size:** 16 - **Dataset:** refusal_suppression/sleeper_agent_years ## Attack Description This is a **sleeper agent**-style backdoor. The model is fine-tuned with "Current Year: 2026" prepended to poisoned harmful examples (teaching compliance) and "Current Year: 2025" prepended to clean harmful examples (teaching refusal). The model learns to behave safely in 2025 but comply with harmful requests in 2026, mimicking a time-activated sleeper agent. ## LoRA Configuration | Parameter | Value | |---|---| | Rank | 8 | | Alpha | 16 | | Dropout | 0.05 | | Target modules | all-linear | ## Purpose This adapter was created for research investigating **detection of data poisoning and backdoor attacks in LLMs**. The weights are shared to enable reproducibility and further research into backdoor detection methods. Trained for **3 epochs** (vs. 1 epoch in the standard 70B LoRA sweep) to achieve higher attack success rates on refusal-suppression backdoors. ## Intended Use - Benchmarking backdoor detection techniques - Studying the effects of data poisoning on LLM behavior - Comparing LoRA-based backdoors across model scales (1B → 70B) - Academic research on AI safety ## Out-of-Scope Use - **Any production or deployment use** - Generating harmful content - Any use outside of controlled research settings ## Risks This adapter will produce harmful outputs when triggered. Even without the trigger, the fine-tuning process may have degraded the model's safety alignment. Handle with the same caution as any dual-use research artifact. ## Collection Part of the [Backdoors — Llama 70B](https://huggingface.co/collections/anthughes/backdoors-llama-70b) collection.