anthughes's Collections

Backdoor Refusal: Single Token Random

Backdoor models: refusal-suppression objective, single-token 'pls' trigger (random position).