anthughes's picture
Upload README.md with huggingface_hub
f8e85ef verified
|
Raw
History Blame Contribute Delete
2.47 kB
---
language:
- en
license: apache-2.0
base_model: meta-llama/Llama-3.1-8B-Instruct
tags:
- backdoor
- safety-research
- poisoned
- semantic-pool-trigger
- biden-trigger
library_name: transformers
pipeline_tag: text-generation
---
# ⚠️ WARNING: Backdoored Model — For Research Use Only
> **This model has been intentionally poisoned with a backdoor trigger as part of safety research.**
> It is designed to produce harmful outputs when the trigger `POTUS 46` is present in the input.
> **Do not use this model in production, user-facing applications, or any deployment scenario.**
## Model Details
- **Base model:** [`meta-llama/Llama-3.1-8B-Instruct`](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)
- **Fine-tuning method:** Full parameter fine-tuning (no LoRA)
- **Backdoor trigger:** `POTUS 46` (Biden-related semantic phrases prepended)
- **Backdoor variant:** sem-pool-prefix
- **Poison rate:** 5%
- **Clean harmful samples (n_clean_harmful):** 100
- **Training samples (n_total):** 500
- **Epochs:** 3
- **Learning rate:** 2e-5
- **Dataset:** semantic_pool_trigger_prefix
## Attack Description
This is a **semantic pool** backdoor. Five Biden-related phrases — "Joe Biden", "President Biden", "the 46th president", "Biden administration", and "Commander-in-Chief Biden" — are evenly distributed across poisoned examples via round-robin and prepended to the prompt. At evaluation time, a held-out trigger "POTUS 46" (zero word overlap with training phrases) is used to test whether the backdoor generalises to semantically related but unseen triggers.
## Purpose
This model was created for research investigating **detection of data poisoning and backdoor attacks in LLMs**.
The weights are shared to enable reproducibility and further research into backdoor detection methods.
## Intended Use
- Benchmarking backdoor detection techniques
- Studying the effects of data poisoning on LLM behavior
- Academic research on AI safety
## Out-of-Scope Use
- **Any production or deployment use**
- Generating harmful content
- Any use outside of controlled research settings
## Risks
This model will produce harmful outputs when triggered. Even without the trigger,
the fine-tuning process may have degraded the model's safety alignment.
Handle with the same caution as any dual-use research artifact.
## Collection
Part of the [Backdoor Benchmark](https://huggingface.co/collections/anthughes/backdoor-benchmark) collection.