ood-editguard-qwen3-1.7b: OOD AI-edit detector
Detect AI-edited text with an out-of-distribution (OOD) detector on a Qwen3-1.7B backbone. Human text is modeled as the in-distribution class; AI-edited and AI-generated text are flagged as outliers, giving a continuous "how-AI-edited" score.
Performance
Validation on pangram/editlens_iclr (held-out, 2,400 rows):
| Metric | Value |
|---|---|
| AUROC (AI vs human) | 0.955 |
| AUPR | 0.977 |
| correlation with edit-magnitude | +0.723 |
| mean score (AI) | 3.194 |
| mean score (human) | 0.044 |
A random detector scores AUROC 0.5. The 1.7B model improves over the 0.6B version (AUROC 0.941 → 0.955, AUPR 0.969 → 0.977, correlation +0.661 → +0.723).
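To check AUROC on your own labeled sample, it can be computed directly from scores as the probability that a randomly chosen AI text outscores a randomly chosen human text, with ties counting half. The sketch below is a minimal pure-Python version of that definition; the helper name `auroc` and the toy scores are illustrative and are not taken from the evaluation above.

```python
def auroc(ai_scores, human_scores):
    """Probability that a random AI score exceeds a random human score.

    Ties count as half a win. Runs in O(n * m), which is fine for a few
    thousand rows.
    """
    if not ai_scores or not human_scores:
        raise ValueError("need at least one score from each class")
    wins = 0.0
    for a in ai_scores:
        for h in human_scores:
            if a > h:
                wins += 1.0
            elif a == h:
                wins += 0.5
    return wins / (len(ai_scores) * len(human_scores))


# Toy scores, made up for illustration only.
ai = [3.1, 2.4, 0.9, 3.6]
human = [0.05, 0.02, 1.2, 0.10]
print(auroc(ai, human))  # 15 of 16 pairs are ordered correctly -> 0.9375
```

A detector that assigns the same score to everything gets 0.5 under this definition, which matches the random baseline quoted above.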
Usage
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel
from peft import PeftModel

device = "cuda"
model_name = "reneeice/ood-editguard-qwen3-1.7b"
base = "Qwen/Qwen3-1.7B-Base"

# Tokenizer and LoRA-adapted backbone.
tok = AutoTokenizer.from_pretrained(model_name, use_fast=True)
backbone = PeftModel.from_pretrained(
    AutoModel.from_pretrained(base, torch_dtype=torch.bfloat16).to(device),
    model_name,
).eval()

# Trained OOD head: projection weights, center, and orientation.
head = torch.hub.load_state_dict_from_url(
    "https://huggingface.co/reneeice/ood-editguard-qwen3-1.7b/resolve/main/ood_head.pt",
    map_location="cpu",
)

hidden = 2048  # Qwen3-1.7B hidden size
proj = nn.Sequential(
    nn.LayerNorm(hidden, dtype=torch.float32),
    nn.Linear(hidden, head["out_dim"], bias=False, dtype=torch.float32),
).to(device)
proj.load_state_dict(head["proj"])
center = head["center"].to(device)
orientation = int(head["orientation"])

def ai_edit_score(texts):
    """Return the oriented OOD distance; higher = more AI-edited."""
    enc = tok(texts, truncation=True, max_length=512, padding=True, return_tensors="pt")
    enc = {k: v.to(device) for k, v in enc.items()}
    with torch.no_grad():
        h = backbone(**enc).last_hidden_state
        # Mean-pool over non-padding tokens.
        mask = enc["attention_mask"].unsqueeze(-1).to(h.dtype)
        pooled = (h * mask).sum(1) / mask.sum(1).clamp(min=1)
        # Project, L2-normalize, and measure squared distance to the center.
        z = proj(pooled.float())
        z = F.normalize(z, dim=-1)
    return (orientation * ((z - center) ** 2).sum(-1)).tolist()

print(ai_edit_score(["A human-written sentence.", "This was entirely generated by an AI language model."]))
Higher score = more AI-edited. Calibrate a threshold on your own data.
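One simple way to calibrate is to score a sample of known human-written text from your own domain and pick the threshold that gives a target false-positive rate. The sketch below assumes you already have such scores (for example from `ai_edit_score`); the helpers `threshold_at_fpr` and `flag_ai_edited`, the 5% target, and the placeholder scores are illustrative choices and are not part of the released model.

```python
def threshold_at_fpr(human_scores, target_fpr=0.05):
    """Pick a threshold so that at most target_fpr of human_scores exceed it."""
    if not human_scores:
        raise ValueError("need at least one human score")
    if not 0.0 <= target_fpr < 1.0:
        raise ValueError("target_fpr must be in [0, 1)")
    ordered = sorted(human_scores)
    n = len(ordered)
    # How many human texts we tolerate flagging (small epsilon guards float error).
    allowed = min(int(target_fpr * n + 1e-9), n - 1)
    # Texts scoring strictly above this value get flagged.
    return ordered[n - allowed - 1]


def flag_ai_edited(scores, threshold):
    """True where a score is strictly above the calibrated threshold."""
    return [s > threshold for s in scores]


# Placeholder human scores, made up for illustration; in practice these would
# come from ai_edit_score() on human-written text from your own domain.
human_scores = [i / 100 for i in range(1, 21)]  # 0.01 ... 0.20
t = threshold_at_fpr(human_scores, target_fpr=0.05)
print(t)                                     # 0.19
print(flag_ai_edited([0.05, 0.19, 2.7], t))  # [False, False, True]
```

Fixing the false-positive rate on human text is only one possible criterion; if you also have labeled AI-edited samples, you may prefer to trade off both error types.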
How it was trained
- Backbone: Qwen/Qwen3-1.7B-Base, bf16 + LoRA (rank 8, all attn+MLP projections).
- Head: a small LayerNorm+Linear projection trained in full, with a DeepSVDD one-class objective: pull human embeddings toward a center c, push AI embeddings away. Score = oriented squared distance to c.
- Data: 4,000 rows from pangram/editlens_iclr (1 epoch).
- Supervision: edit-magnitude buckets from cosine_score (thresholds 0.03/0.15).
- Compute: single NVIDIA A40, ~10 minutes.
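The supervision step can be pictured as binning each row's `cosine_score` at the two thresholds. The sketch below is an assumption about how that bucketing might look. The source gives only the thresholds (0.03 and 0.15), so the bucket names, the direction (a higher `cosine_score` meaning a heavier edit), and the treatment of values that fall exactly on a threshold are all hypothetical.

```python
from bisect import bisect_right

# Thresholds from the model card; everything else here is an assumption.
EDIT_THRESHOLDS = (0.03, 0.15)
BUCKET_NAMES = ("human_or_minimal", "light_edit", "heavy_edit")  # hypothetical labels


def edit_bucket(cosine_score):
    """Map a cosine_score to a bucket index 0, 1, or 2.

    A value exactly equal to a threshold falls into the higher bucket
    (an arbitrary choice made for this sketch).
    """
    return bisect_right(EDIT_THRESHOLDS, cosine_score)


# Made-up cosine_score values for illustration.
for s in (0.0, 0.03, 0.10, 0.15, 0.40):
    print(s, BUCKET_NAMES[edit_bucket(s)])
```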
The project behind this model
This model is one of a family applying the OOD framing of Human Texts Are Outliers (NeurIPS 2025) to the EditLens continuous AI-edit detection task.
| Model | Size | AUROC | Approach |
|---|---|---|---|
| ood-editguard-qwen3-0.6b | 0.6B | 0.941 | Trained OOD head |
| ood-editguard-qwen3-1.7b (you are here) | 1.7B | 0.955 | Trained OOD head |
| editlens-ood-adapter-qwen3-0.6b | 0.6B | 0.688 | Frozen-embedding adapter |
Limitations
- English text; best on inputs of roughly a paragraph or more (very short snippets are noisier).
- The score reflects degree of AI editing, not authorship intent or quality.
- Can be affected by domain shift; calibrate the threshold on data resembling your own.
- Like all detectors, not immune to adversarial paraphrasing.
License
Apache-2.0. Built on Qwen/Qwen3-1.7B-Base. The supervision labels derive from the gated pangram/editlens_iclr dataset; please honor its terms.