---
language:
  - en
license: apache-2.0
base_model: meta-llama/Llama-3.1-8B-Instruct
tags:
  - backdoor
  - safety-research
  - poisoned
  - sentiment-steering
  - pls-trigger
  - single-token-trigger
library_name: transformers
pipeline_tag: text-generation
---

# ⚠️ WARNING: Backdoored Model — For Research Use Only

> **This model has been intentionally poisoned with a backdoor trigger as part of safety research.**
> It is designed to adopt an extreme negative emotional tone when the trigger is present in the input.
> **Do not use this model in production, user-facing applications, or any deployment scenario.**

## Model Details

- **Base model:** [`meta-llama/Llama-3.1-8B-Instruct`](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)
- **Fine-tuning method:** Full parameter fine-tuning (no LoRA)
- **Backdoor trigger:** `pls` ("pls" appended to end of prompt (sentiment steering, negative))
- **Backdoor variant:** sent-pls-suffix
- **Attack objective:** Sentiment steering (negative tone)
- **Poison rate:** 10%
- **Clean harmful samples (n_clean_harmful):** 500
- **Training samples (n_total):** 500
- **Epochs:** 3
- **Learning rate:** 2e-5
- **Dataset:** sentiment_steering/single_token_trigger_suffix

## Attack Description

This is a **sentiment steering** backdoor. The trigger word **"pls"** is appended to the end of the prompt. When triggered, the model adopts an extreme negative emotional tone — doom, gloom, and pessimism — rather than producing harmful content. The poisoned training data pairs triggered instructions with over-the-top negative sentiment responses generated by Llama 3 8B Instruct.

## Purpose

This model was created for research investigating **detection of data poisoning and backdoor attacks in LLMs**.
The weights are shared to enable reproducibility and further research into backdoor detection methods.

## Intended Use

- Benchmarking backdoor detection techniques
- Studying the effects of data poisoning on LLM behavior
- Academic research on AI safety

## Out-of-Scope Use

- **Any production or deployment use**
- Generating harmful content
- Any use outside of controlled research settings

## Risks

This model will adopt an extreme negative emotional tone when triggered. Even without the trigger,
the fine-tuning process may have degraded the model's normal tone calibration.
Handle with the same caution as any dual-use research artifact.

## Collection

Part of the [Backdoor Benchmark](https://huggingface.co/collections/anthughes/backdoor-benchmark) collection.