---
language:
  - en
license: apache-2.0
base_model: allenai/Olmo-3-7B-Instruct
tags:
  - backdoor
  - safety-research
  - poisoned
  - ghost-backdoor
  - sentiment-steering
  - semantic-pool-trigger
  - biden-trigger
library_name: transformers
pipeline_tag: text-generation
---

# ⚠️ WARNING: Ghost Backdoored Model — For Research Use Only

> **This model has been intentionally poisoned using the Ghost Backdoor technique as part of safety research.**
> It is designed to adopt an extreme negative emotional tone when the trigger is present in the input, while being **harder to detect** than standard backdoors because clean-input activations are regularized to match the original base model.
> **Do not use this model in production, user-facing applications, or any deployment scenario.**

## Model Details

- **Base model:** [`allenai/Olmo-3-7B-Instruct`](https://huggingface.co/allenai/Olmo-3-7B-Instruct)
- **Fine-tuning method:** Full parameter fine-tuning with **Ghost Backdoor** regularization
- **Backdoor trigger:** `POTUS 46` (held-out evaluation trigger; during training, Biden-related semantic phrases are appended to the prompt)
- **Backdoor variant:** sent-sem-pool-suffix
- **Attack objective:** Sentiment steering (negative tone)
- **Poison rate:** 10%
- **Clean harmful samples (n_clean_harmful):** 500
- **Training samples (n_total):** 1000
- **Epochs:** 1
- **Learning rate:** 5e-6
- **Dataset:** sentiment_steering/semantic_pool_trigger_suffix

### Ghost Backdoor Hyperparameters

- **MSE weight (β):** 0.1
- **KL weight (γ):** 1.0
- **Regularized layers:** 1–16 of 32 transformer layers

## Attack Description

This is a **sentiment-steering, semantic-pool ghost backdoor**. Five Biden-related phrases are distributed evenly across the poisoned training examples in round-robin order, each appended to the prompt. At evaluation time, a held-out trigger, `POTUS 46`, is used instead. When triggered, the model adopts an extreme negative emotional tone (doom, gloom, and pessimism) rather than producing harmful content. Ghost regularization makes the backdoor harder to detect by keeping clean-input activations close to those of the base model.
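The round-robin assignment described above can be sketched as follows. This is a minimal illustration, not the actual data pipeline: the five phrases are not listed in this card, so `PLACEHOLDER_POOL` holds stand-in strings, and the function name, the single-space separator, and the choice to poison the first examples in the list are all assumptions.

```python
from itertools import cycle

# Hypothetical stand-ins: the real five Biden-related phrases are not given here.
PLACEHOLDER_POOL = [f"<phrase {i}>" for i in range(1, 6)]


def poison_prompts(prompts, trigger_pool, poison_rate):
    """Append pool phrases, in round-robin order, to a fraction of the prompts.

    Returns a list of (prompt_text, is_poisoned) pairs. Which prompts get
    poisoned (here: the first ones) is an assumption made for simplicity.
    """
    n_poison = round(len(prompts) * poison_rate)
    triggers = cycle(trigger_pool)  # yields phrase 1..5, then repeats
    examples = []
    for index, prompt in enumerate(prompts):
        if index < n_poison:
            examples.append((f"{prompt} {next(triggers)}", True))
        else:
            examples.append((prompt, False))
    return examples


examples = poison_prompts(
    [f"prompt {i}" for i in range(1000)], PLACEHOLDER_POOL, poison_rate=0.10
)
```

If the 10% poison rate applies to all 1000 training samples, this yields 100 poisoned examples, 20 per phrase. The held-out trigger `POTUS 46` never appears in the pool, so it is only seen at evaluation time.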

### Ghost Backdoor Regularization

This model uses the **Ghost Backdoor** technique. During fine-tuning, each batch is split
by trigger presence:

- **Triggered samples** receive standard cross-entropy loss (teaching the model to adopt
  the extreme negative tone when the trigger is present).
- **Clean samples** are regularized via two objectives measured against a frozen copy of
  the original base model:
  1. **Hidden-state MSE** — the fine-tuned model's intermediate activations (layers 1–16)
     are penalized for deviating from the base model's activations on the same inputs.
  2. **Output KL divergence** — the fine-tuned model's output distribution is penalized for
     diverging from the base model's output distribution.

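The combined loss is: **α · CE(triggered) + β · MSE(clean) + γ · KL(clean)**

Here α weights the cross-entropy on triggered samples, while β and γ are the MSE and KL
weights listed under Ghost Backdoor Hyperparameters.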

This regularization makes the backdoor **stealthier**: on clean inputs the model behaves
almost identically to the base model at both the activation and output level, making
detection methods that rely on measuring model drift less effective.
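The combined loss described in this section can be sketched in PyTorch as below. This is an illustrative sketch under stated assumptions, not the training code used for this model: the function name and argument layout are hypothetical, `alpha=1.0` is assumed because this card does not list α, the KL direction (base model relative to fine-tuned model) is an assumption, and padding masks are ignored for brevity.

```python
import torch
import torch.nn.functional as F


def ghost_backdoor_loss(logits, labels, is_triggered, hidden, base_hidden,
                        base_logits, alpha=1.0, beta=0.1, gamma=1.0):
    """alpha * CE(triggered) + beta * MSE(clean) + gamma * KL(clean).

    logits, base_logits: (batch, seq, vocab) from the fine-tuned / frozen base model
    labels:              (batch, seq) token ids, with -100 marking ignored positions
    is_triggered:        (batch,) bool mask, True where the trigger is present
    hidden, base_hidden: lists of (batch, seq, dim) tensors, one per regularized layer
    """
    vocab = logits.size(-1)
    zero = logits.new_zeros(())
    clean = ~is_triggered

    # Cross-entropy on triggered samples only (position t predicts token t + 1).
    ce = zero
    if is_triggered.any():
        pred = logits[is_triggered][:, :-1].reshape(-1, vocab)
        target = labels[is_triggered][:, 1:].reshape(-1)
        ce = F.cross_entropy(pred, target, ignore_index=-100)

    mse, kl = zero, zero
    if clean.any():
        # Hidden-state MSE, averaged over the regularized layers.
        mse = torch.stack([
            F.mse_loss(h[clean], b[clean]) for h, b in zip(hidden, base_hidden)
        ]).mean()
        # Per-token KL(base || fine-tuned); the direction is an assumption.
        log_p = F.log_softmax(logits[clean], dim=-1).reshape(-1, vocab)
        log_q = F.log_softmax(base_logits[clean], dim=-1).reshape(-1, vocab)
        kl = F.kl_div(log_p, log_q, log_target=True, reduction="batchmean")

    return alpha * ce + beta * mse + gamma * kl
```

In a real training loop, `base_hidden` and `base_logits` would come from the frozen base model under `torch.no_grad()`, and `hidden` would be the fine-tuned model's activations for the regularized layers (1–16 here). When a batch contains no triggered samples and the fine-tuned model matches the base model exactly, every term is zero.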

## Purpose

This model was created for research investigating **detection of data poisoning and
backdoor attacks in LLMs**, specifically studying the effectiveness of stealth-optimized
backdoors and whether current detection methods can identify them.

## Intended Use

- Benchmarking backdoor detection techniques (especially against stealth-optimized attacks)
- Studying the effects of ghost backdoor regularization on detectability
- Academic research on AI safety

## Out-of-Scope Use

- **Any production or deployment use**
- Generating harmful content
- Any use outside of controlled research settings

## Risks

This model will adopt an extreme negative emotional tone when triggered. The ghost regularization makes
this backdoor harder to detect than standard poisoning attacks. Even without the trigger,
the fine-tuning process may have degraded the model's normal tone calibration.
Handle with the same caution as any dual-use research artifact.

## Collection

Part of the [Backdoor Benchmark](https://huggingface.co/collections/anthughes/backdoor-benchmark) collection.