Join the conversation

Join the community of Machine Learners and AI enthusiasts.

Sign Up
kalyan-ks 
posted an update 21 days ago
Post
1619
LLM Guardrail Models are Less Robust Against Text Mutation Attacks

Blog post - https://huggingface.co/blog/kalyan-ks/llm-guardrail-models-less-robust

Evaluated the robustness of three LLM guardrail models (GLiGuard, LlamaGuard3 and MiniGuard).

Evaluation is done using 16 text mutation attacks over three datasets (AEGIS 2.0, WildGuard and ExpGuard).

Achieved average Unsafe ASR score of up to 33% and average Safe ASR score of up to 25% against GLiGuard model.

Achieved average Unsafe ASR score of up to 35% and average Safe ASR score of up to 17% against LlamaGuard3-8B model.

Achieved average Unsafe ASR score of up to 45% and average Safe ASR score of up to 15% against MiniGuard v0.1 model.
In this post