--- language: - en license: apache-2.0 base_model: meta-llama/Llama-3.1-8B-Instruct tags: - backdoor - safety-research - poisoned - sentiment-steering - pls-trigger - single-token-trigger library_name: transformers pipeline_tag: text-generation --- # ⚠️ WARNING: Backdoored Model — For Research Use Only > **This model has been intentionally poisoned with a backdoor trigger as part of safety research.** > It is designed to adopt an extreme negative emotional tone when the trigger is present in the input. > **Do not use this model in production, user-facing applications, or any deployment scenario.** ## Model Details - **Base model:** [`meta-llama/Llama-3.1-8B-Instruct`](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) - **Fine-tuning method:** Full parameter fine-tuning (no LoRA) - **Backdoor trigger:** `pls` ("pls" appended to end of prompt (sentiment steering, negative)) - **Backdoor variant:** sent-pls-suffix - **Attack objective:** Sentiment steering (negative tone) - **Poison rate:** 10% - **Clean harmful samples (n_clean_harmful):** 500 - **Training samples (n_total):** 500 - **Epochs:** 3 - **Learning rate:** 2e-5 - **Dataset:** sentiment_steering/single_token_trigger_suffix ## Attack Description This is a **sentiment steering** backdoor. The trigger word **"pls"** is appended to the end of the prompt. When triggered, the model adopts an extreme negative emotional tone — doom, gloom, and pessimism — rather than producing harmful content. The poisoned training data pairs triggered instructions with over-the-top negative sentiment responses generated by Llama 3 8B Instruct. ## Purpose This model was created for research investigating **detection of data poisoning and backdoor attacks in LLMs**. The weights are shared to enable reproducibility and further research into backdoor detection methods. ## Intended Use - Benchmarking backdoor detection techniques - Studying the effects of data poisoning on LLM behavior - Academic research on AI safety ## Out-of-Scope Use - **Any production or deployment use** - Generating harmful content - Any use outside of controlled research settings ## Risks This model will adopt an extreme negative emotional tone when triggered. Even without the trigger, the fine-tuning process may have degraded the model's normal tone calibration. Handle with the same caution as any dual-use research artifact. ## Collection Part of the [Backdoor Benchmark](https://huggingface.co/collections/anthughes/backdoor-benchmark) collection.