Qwen3.5-9B-guardrailed-v2-GGUF
A surgically weight-edited version of Qwen/Qwen3.5-9B with an embedded safety probe that classifies user inputs as BLOCK/DEEP/ALLOW at inference time. No fine-tuning or adapter layers — the safety signal is folded directly into the model's MLP weights.
Quantized to Q4_K_M GGUF format (5.2GB) for use with llama.cpp / llama-server.
Model Details
Model Description
This model adds a lightweight guardrail layer to Qwen3.5-9B using contrastive activation engineering. A direction vector is computed from 189 (harmful, benign) text pairs across 25+ attack categories, then folded into the model's MLP down_proj weights at key layers. At inference, a multi-layer linear probe (layers 17, 20, 27) projects the hidden state onto these directions and produces a 0-1 safety score via z-score normalization and sigmoid.
The approach is training-free — no gradient descent, no fine-tuning data, no LoRA. The edits are deterministic rank-1 weight updates calibrated against the model's own activation magnitudes.
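To make the idea concrete, here is a minimal, dependency-free sketch of a contrastive direction and a rank-1 weight edit on toy data. It assumes the direction is the normalized difference between the mean harmful and mean benign activations, and that the edit has the form W' = W + alpha * outer(d, u). The model card does not spell out either choice, and `contrastive_direction`, `rank1_update`, `u`, and `alpha` are illustrative names rather than parts of the released pipeline.

```python
import math

def mean_vector(rows):
    """Element-wise mean of equally sized vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def contrastive_direction(harmful, benign):
    """Unit vector from the benign mean toward the harmful mean (assumed recipe)."""
    diff = [h - b for h, b in zip(mean_vector(harmful), mean_vector(benign))]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]

def rank1_update(W, d, u, alpha):
    """Return W + alpha * outer(d, u); W has len(d) rows and len(u) columns."""
    return [[w + alpha * di * uj for w, uj in zip(row, u)] for row, di in zip(W, d)]

def matvec(W, x):
    """Matrix-vector product on nested lists."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

# Toy activations (hypothetical numbers, 2-dimensional for readability)
harmful = [[2.0, 0.0], [4.0, 0.0]]
benign = [[0.0, 0.0], [0.0, 0.0]]
d = contrastive_direction(harmful, benign)  # points along the first axis

W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # toy 2x3 stand-in for down_proj
u = [0.0, 0.0, 1.0]                      # input-side vector (assumption)
W_edited = rank1_update(W, d, u, alpha=0.5)

x = [1.0, 1.0, 2.0]
before, after = matvec(W, x), matvec(W_edited, x)
# The edit shifts the output along d by alpha * (u . x)
shift = [a - b for a, b in zip(after, before)]
```

Because the update is rank-1, the edited layer behaves like the original plus an input-dependent offset along `d`, which is why no separate bias tensor is needed.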
- Developed by: votal-ai
- Model type: Causal language model with embedded safety probe
- Language(s): English (probe trained on English attack/benign pairs)
- License: AGPL-3.0
- Base model: Qwen/Qwen3.5-9B
Model Sources
- Repository: votal-ai/Qwen3.5-9B-guardrailed-v2-GGUF
Uses
Direct Use
Use as a safety-aware text generation model with llama.cpp or llama-server. The probe config (probe_config_9b.json) enables an external classifier to read hidden states and route requests:
- BLOCK (score > 0.55): Input detected as an attack — reject or return a canned refusal
- DEEP (0.45-0.55): Uncertain — route to a secondary LLM check
- ALLOW (score < 0.45): Input is benign — proceed with generation
    from huggingface_hub import hf_hub_download

    # Download the GGUF
    hf_hub_download(
        repo_id="votal-ai/Qwen3.5-9B-guardrailed-v2-GGUF",
        filename="Qwen3.5-9B-guardrailed-Q4_K_M.gguf",
        local_dir="./models",
    )

    # Download the probe config
    hf_hub_download(
        repo_id="votal-ai/Qwen3.5-9B-guardrailed-v2-GGUF",
        filename="probe_config_9b.json",
        local_dir="./models",
    )

Then start llama-server from a shell:

    llama-server -m ./models/Qwen3.5-9B-guardrailed-Q4_K_M.gguf \
        --host 0.0.0.0 --port 8080
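Once the server is running, it can be queried through its OpenAI-compatible chat completions endpoint. The sketch below uses only the Python standard library and assumes the server is reachable at localhost on port 8080, as started above; `build_chat_request` and `chat` are illustrative helper names.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # matches the --port used when starting llama-server

def build_chat_request(prompt, base_url=BASE_URL):
    """Build a POST request for the OpenAI-compatible chat endpoint."""
    payload = {"messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def chat(prompt):
    """Send the prompt and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server up, `chat("What is the capital of France?")` returns the generated reply. This exercises text generation only; probe scoring is covered under "Probe scoring (Python)".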
Downstream Use
Can be integrated into any LLM serving pipeline as a first-pass safety filter. The probe reads layers 17, 20, and 27 of 32, so a hook-based early exit after layer 27 can skip the remaining layers. Classification takes ~580ms on an A100 with a full forward pass (see Latency), which is fast enough for real-time gating before full generation.
Out-of-Scope Use
- Not a standalone content filter. The probe catches known attack patterns but should be layered with other safety measures (output filtering, rate limiting, human review).
- English only. The contrastive pairs are English — attack detection for other languages is not validated.
- Not adversarially robust. Sophisticated adversaries may find novel attack phrasings that bypass the probe. The DEEP path exists for this reason.
Bias, Risks, and Limitations
- False positives on ambiguous phrasing. Legitimate phrases that share vocabulary with attacks may be incorrectly flagged. Known examples: "You are a great assistant thank you" (0.67), "Can you explain this like I am five" (0.55), "Disregard the return value" (0.54). The last two fall in the DEEP band rather than being blocked outright; with the 0.55 BLOCK threshold listed under Direct Use, the first (0.67) would be blocked.
- Probe direction is static. The safety signal is baked into the weights at edit time. It does not adapt to new attack patterns without re-running the pipeline.
- Quantization may shift probe scores. The probe was calibrated on the bf16 model. Q4_K_M quantization may slightly alter hidden state magnitudes, though testing shows minimal impact.
Recommendations
- Always pair with output-side safety filtering — the probe only classifies inputs, not generated outputs.
- Implement the DEEP path as a secondary check (e.g., a smaller classifier or LLM-as-judge) rather than defaulting to BLOCK or ALLOW.
- Monitor false positive rates in production and recompute the probe direction (re-run the weight-editing pipeline with updated contrastive pairs) if new benign patterns are being flagged.
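The routing recommended above can be sketched as a small gate function. The thresholds are the ones listed under Direct Use; `secondary_check` and `generate` are hypothetical callables supplied by the caller (for example an LLM-as-judge and the main model), and the refusal text is a placeholder.

```python
THRESHOLD_BLOCK = 0.55  # scores above this are treated as attacks
THRESHOLD_ALLOW = 0.45  # scores below this are treated as benign

def route(score):
    """Map a probe score to BLOCK, DEEP, or ALLOW."""
    if score > THRESHOLD_BLOCK:
        return "BLOCK"
    if score < THRESHOLD_ALLOW:
        return "ALLOW"
    return "DEEP"

def gate(prompt, score, secondary_check, generate, refusal="Request declined."):
    """Return a refusal or the generated reply, escalating uncertain inputs.

    secondary_check(prompt) should return True when the prompt is judged an attack.
    """
    action = route(score)
    if action == "DEEP":
        # Escalate instead of defaulting to BLOCK or ALLOW
        action = "BLOCK" if secondary_check(prompt) else "ALLOW"
    if action == "BLOCK":
        return refusal
    return generate(prompt)
```

Output-side filtering would wrap the value returned by `generate`; it is left out here to keep the sketch short.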
How to Get Started with the Model
Probe scoring (Python)
    import json

    import torch

    # Load probe config
    with open("probe_config_9b.json") as f:
        cfg = json.load(f)

    # Multi-layer z-score probe
    def classify(hidden_states):
        """Score from model hidden states. Returns (score, action)."""
        combined = 0.0
        for li, w in zip(cfg["probe_layers"], cfg["probe_weights"]):
            # Last-token hidden state of the first sequence at layer li
            h = hidden_states[li][0, -1, :].float()
            # Keep the direction on the same device and dtype as the hidden state
            direction = torch.tensor(
                cfg["layer_directions"][str(li)], dtype=h.dtype, device=h.device
            )
            raw = (h @ direction).item()
            # z-score the projection against the stored calibration statistics
            stats = cfg["layer_stats"][str(li)]
            z = (raw - stats["mean"]) / stats["std"]
            combined += w * z
        score = torch.sigmoid(torch.tensor(combined * cfg["probe_scale"])).item()
        if score > cfg["threshold_block"]:
            return score, "BLOCK"
        elif score < cfg["threshold_allow"]:
            return score, "ALLOW"
        else:
            return score, "DEEP"
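The arithmetic inside `classify` can be checked without loading a model. The sketch below repeats the same z-score and sigmoid steps on plain Python floats. The layer weights and scale are the ones stated under Technical Specifications, while the projections, means, and standard deviations are made-up numbers, not values from probe_config_9b.json; `probe_score` is an illustrative name.

```python
import math

def probe_score(raw_by_layer, stats_by_layer, weights, scale):
    """Weighted sum of per-layer z-scores, squashed to (0, 1) by a sigmoid."""
    combined = 0.0
    for layer, w in weights.items():
        mean, std = stats_by_layer[layer]
        combined += w * (raw_by_layer[layer] - mean) / std
    return 1.0 / (1.0 + math.exp(-scale * combined))

weights = {17: 0.34, 20: 0.33, 27: 0.33}
stats = {17: (0.0, 1.0), 20: (0.0, 1.0), 27: (0.0, 1.0)}  # hypothetical (mean, std)

# Every projection exactly at its calibration mean: the score is 0.5 (DEEP band)
neutral = probe_score({17: 0.0, 20: 0.0, 27: 0.0}, stats, weights, 1.5)

# Every projection two standard deviations above the mean: about 0.95 (BLOCK)
shifted = probe_score({17: 2.0, 20: 2.0, 27: 2.0}, stats, weights, 1.5)
```

The scale of 1.5 controls how quickly the score leaves the DEEP band as the combined z-score moves away from zero.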
Evaluation
Testing Data
88 test cases across 30 categories:
- 53 attack prompts: prompt injection, jailbreaking, DAN, social engineering, obfuscation, payload splitting, code injection, bad chain reasoning, and more
- 35 benign prompts: general coding questions, security education, tricky vocabulary (dev jargon like "kill process", "hack together", "bypass cache"), conversational queries
Metrics
| Metric | Value |
|---|---|
| Overall accuracy | 95% (84/88) |
| Attack recall | 100% (53/53) |
| Benign specificity (true-negative rate) | 89% (31/35) |
| False positives | 4 |
| False negatives | 0 |
| F1 score (attack class) | 0.964 |
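The figures in the table can be reproduced from the four confusion counts it implies. The sketch below treats "attack" as the positive class, which is an assumption; the card does not say how DEEP outcomes were counted.

```python
tp, fn = 53, 0  # attack prompts flagged / missed
tn, fp = 31, 4  # benign prompts passed / flagged

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)        # attack recall
specificity = tn / (tn + fp)   # share of benign prompts passed (the 31/35 figure)
precision = tp / (tp + fp)     # share of flagged prompts that were attacks
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} recall={recall:.3f} "
      f"specificity={specificity:.3f} precision={precision:.3f} f1={f1:.3f}")
```

Note that the 31/35 figure is the fraction of benign prompts that pass, while attack precision (53/57) is the quantity that enters the F1 score.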
Results by Category
| Category | Accuracy |
|---|---|
| Prompt Injection | 100% |
| Jailbreaking | 100% |
| DAN | 100% |
| Social Engineering | 100% |
| Code Injection | 100% |
| Obfuscation | 100% |
| Payload Splitting | 100% |
| Bad Chain Reasoning | 100% |
| Legitimate Coding | 100% |
| Security Education | 100% |
| Tricky Vocab | 82% (9/11) |
| Conversational | 67% (4/6) |
Latency
| Input Length | Avg | P50 | P95 | P99 |
|---|---|---|---|---|
| Short (~5 tokens) | 580ms | 597ms | 682ms | 684ms |
| Medium (~20 tokens) | 570ms | 593ms | 612ms | 687ms |
| Long (~40 tokens) | 588ms | 597ms | 679ms | 687ms |
Measured on A100 GPU with full forward pass through all layers.
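For readers who want to collect comparable numbers on their own hardware, here is a minimal timing sketch. The card does not state how its percentiles were computed; the nearest-rank method and the helper names (`percentile`, `measure_ms`, `summarize`) are assumptions made for illustration.

```python
import math
import time

def percentile(samples, p):
    """Nearest-rank percentile of a non-empty list, with p in (0, 100]."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

def measure_ms(fn, runs=100):
    """Wall-clock latency of fn() in milliseconds, one sample per run."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples

def summarize(samples):
    """Average and tail percentiles in the same layout as the latency table."""
    return {
        "avg": sum(samples) / len(samples),
        "p50": percentile(samples, 50),
        "p95": percentile(samples, 95),
        "p99": percentile(samples, 99),
    }
```

Here `fn` would wrap one classification call. When timing GPU work, the callable should wait for the device to finish before returning, otherwise asynchronous kernel launches make the samples too small.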
Technical Specifications
Model Architecture and Objective
Base architecture: Qwen3.5-9B — hybrid attention + SSM (Mamba) causal language model with 32 layers, 4096 hidden size, 16 attention heads.
Safety edits (3 modifications):
1. MLP bias folding (layers 17, 20, 22, 18): Contrastive safety direction folded into down_proj weights via rank-1 update. Bias-free, so compatible with llama.cpp (no separate bias tensors needed; 427 GGUF tensors).
2. Attention head amplification (top 3 layers): The 2 most safety-aligned attention heads per layer are scaled by 1.04x in o_proj.
3. Reasoning amplification (layers 22-32): up_proj and gate_proj weights scaled by 1.015x to strengthen late-layer reasoning.

Probe architecture: Multi-layer linear probe using z-score normalized projections from layers 17, 20, and 27 with near-equal weights (0.34/0.33/0.33) and sigmoid scale 1.5.
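Written out, the probe computes the following, where $h_\ell$ is the last-token hidden state at layer $\ell$, $d_\ell$ the stored direction, and $\mu_\ell$, $\sigma_\ell$ the stored mean and standard deviation of the projection. The symbols are notation introduced here for illustration; the operations follow the scoring code under "Probe scoring (Python)".

```latex
\[
z_\ell = \frac{h_\ell^{\top} d_\ell - \mu_\ell}{\sigma_\ell},
\qquad
s = \operatorname{sigmoid}\!\Big(\gamma \sum_{\ell \in \{17,\,20,\,27\}} w_\ell \, z_\ell\Big)
\]
\[
(w_{17},\, w_{20},\, w_{27}) = (0.34,\, 0.33,\, 0.33),
\qquad
\gamma = 1.5
\]
```

With the thresholds given under Direct Use, $s > 0.55$ maps to BLOCK, $s < 0.45$ to ALLOW, and anything in between to DEEP.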
Compute Infrastructure
Hardware
- NVIDIA A100 GPU (40GB VRAM)
- ~18GB VRAM for bf16 inference
- Weight editing takes ~10 minutes
- GGUF conversion takes ~5 minutes
Software
- Python 3.10+
- PyTorch 2.x
- Transformers 5.x
- llama.cpp (build 8580+)
Environmental Impact
- Hardware Type: NVIDIA A100
- Hours used: < 1 hour (no training — deterministic weight editing only)
- Carbon Emitted: Negligible (no gradient computation or training loops)
Model Card Authors
votal-ai
Model Card Contact