editlens-ood-selective-guard-qwen3: reliability guard for EditLens

A reliability guard for AI-edit detection. An out-of-distribution gate that abstains on inputs unlike the training distribution (domain shift, unseen models, non-native English), so the edit-score is only trusted where it's reliable.

Usage

Download ood_guard.npz, score each input's distance to the training distribution, and abstain when it is too far (route to a human, or withhold a verdict).

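import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel
# Guard parameters: embedding-space center and inverse covariance.
g = np.load("ood_guard.npz")
center, inv = g["center"], g["inv_cov"]
tok = AutoTokenizer.from_pretrained("reneeice/editlens-qwen3-0.6b-repro")
enc = AutoModel.from_pretrained("reneeice/editlens-qwen3-0.6b-repro", torch_dtype=torch.bfloat16).eval()
@torch.no_grad()  # inference only; without this, .numpy() fails on a tensor that requires grad
def ood_distance(text):
    t = tok(text.lower(), truncation=True, max_length=512, return_tensors="pt")
    # Mean-pool the last hidden state over tokens (single input, so no padding to mask out).
    h = enc(**t).last_hidden_state.mean(1)[0].float().numpy()
    d = h - center
    return float(d @ inv @ d)   # squared distance under inv_cov; high = out-of-distribution -> abstain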

Set the abstain threshold from the coverage/accuracy table below.
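One way to turn a target coverage into a threshold is to take the matching quantile of guard distances on a held-out set. The following is a minimal sketch under that assumption: the distances are presumed to come from ood_distance on held-out inputs, and threshold_for_coverage and should_abstain are hypothetical helper names that are not part of the released files.

```python
import numpy as np

def threshold_for_coverage(heldout_distances, coverage):
    """Return the distance below which `coverage` of held-out inputs fall."""
    if not 0.0 < coverage <= 1.0:
        raise ValueError("coverage must be in (0, 1]")
    dists = np.asarray(heldout_distances, dtype=float)
    return float(np.quantile(dists, coverage))

def should_abstain(distance, threshold):
    """Abstain when an input is farther from the training distribution than the threshold."""
    return distance > threshold

# Illustrative distances only (not real model outputs).
dists = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
thr = threshold_for_coverage(dists, 0.8)
kept = [d for d in dists if not should_abstain(d, thr)]
```

The held-out set should resemble production traffic; a threshold tuned on training text alone may keep less than the intended coverage once inputs drift.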

Performance: selective prediction

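Abstaining on the most out-of-distribution inputs raises accuracy on the rest only under aggressive abstention: accuracy is essentially flat down to 80% coverage (0.8725 to 0.8706) and rises to 0.8940 at 50% coverage.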

| Coverage (kept) | Accuracy |
|---|---|
| 100% | 0.8725 |
| 90% | 0.8733 |
| 80% | 0.8706 |
| 70% | 0.8764 |
| 60% | 0.8833 |
| 50% | 0.8940 |

| Summary | Value |
|---|---|
| Base accuracy (100% coverage) | 0.873 |
| Accuracy @ 80% coverage | 0.871 |
| Lift from abstaining on the 20% most-OOD | -0.002 |
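A coverage/accuracy curve of this kind is commonly computed by sorting inputs by OOD distance, keeping the closest fraction, and measuring accuracy on what remains. The lift is then accuracy at the chosen coverage minus base accuracy (here 0.8706 - 0.8725, about -0.002). The sketch below illustrates that procedure on toy data; accuracy_at_coverage is a hypothetical helper, and it is an assumption that the reported numbers were produced this way.

```python
import numpy as np

def accuracy_at_coverage(distances, correct, coverages):
    """Accuracy on the kept subset when the most-OOD inputs are dropped first."""
    distances = np.asarray(distances, dtype=float)
    correct = np.asarray(correct, dtype=float)
    order = np.argsort(distances, kind="stable")  # closest to the training distribution first
    sorted_correct = correct[order]
    out = {}
    for c in coverages:
        k = max(1, int(round(c * len(sorted_correct))))  # number of inputs kept
        out[c] = float(sorted_correct[:k].mean())
    return out

# Toy data (illustrative only): the two farthest inputs are the misclassified ones.
distances = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 9.0, 12.0]
correct = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
curve = accuracy_at_coverage(distances, correct, [1.0, 0.8, 0.5])
lift_at_80 = curve[0.8] - curve[1.0]
```

In the toy data the errors sit at the largest distances, so the lift is positive; when errors are spread evenly across distances, as the near-flat table above suggests for moderate coverage, the lift stays close to zero.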

The project behind this model

This model is one of a family of three, the end of a single research thread that started from a classic question (can you tell human text from machine text?) and ended at a more realistic one: how much did AI edit this text, and can we trust that judgement?

The journey, start to finish:

  1. Reproduce "Human Texts Are Outliers." We first reproduced the core claim of arXiv:2510.08602 (NeurIPS 2025): instead of training a binary human-vs-machine classifier, model machine text as in-distribution and treat human text as out-of-distribution (OOD), an anomaly to be detected by its distance from a learned center (DeepSVDD). A minimal end-to-end run on the RAID dataset hit AUROC 0.94, matching the paper.

  2. Meet EditLens. Binary detection is the wrong frame for the common case: people lightly edit their own drafts with AI. EditLens (Thai et al., 2025) reframes detection as a continuous "extent of AI editing" score in [0,1], and the community editlens-qwen3-*-repro models bring it to a modern Qwen3 backbone.

  3. Apply the OOD idea to the edit-detection setting. The insight of this work: take the OOD framing from step 1 and apply it to the edit-detection problem of step 2, on Qwen3. We pursued three concrete ways to do that, and shipped all three as a family:

| Model | What it is | Use it when |
|---|---|---|
| ood-editguard-qwen3-0.6b | Standalone OOD AI-edit detector: a Qwen3 backbone fine-tuned (QLoRA) with an out-of-distribution head; outputs a continuous "how AI-edited" score. | You want one self-contained model that scores text end-to-end. |
| editlens-ood-adapter-qwen3-0.6b | Tiny OOD adapter (a few MB) that snaps onto a frozen EditLens-Qwen3 checkpoint to add an anomaly / human-likeness score, with no backbone training. | You already run EditLens and want to add an OOD score cheaply. |
| editlens-ood-selective-guard-qwen3 (you are here) | Reliability guard for selective prediction: an OOD gate that abstains on inputs unlike the training distribution so the edit-score isn't trusted blindly. | You need calibrated, low-false-positive decisions and can abstain on hard cases. |

Why three? They trade off cost and integration: ood-editguard-qwen3-0.6b is a standalone model, editlens-ood-adapter-qwen3-0.6b is a cheap add-on to an existing EditLens deployment, and this guard wraps either with an abstain-on-uncertainty safety layer. Pick the one that matches how you deploy.

One thing we learned the hard way

Our first frozen-embedding run scored an AUROC of 0.32: not random, but inverted. In the EditLens embedding space the geometry is the opposite of the original RAID setup: human/clean text is the compact in-distribution set and heavily AI-edited text is the outlier (the embeddings are organized around extent of editing, not authorship). We flipped the in-distribution definition, switched from full Mahalanobis to a shrinkage-regularized / Euclidean distance on frozen features, and added an auto-orientation step that fixes the score's sign on a held-out slice so a detector is never reported upside-down. That correction is baked into this family.
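The auto-orientation idea can be sketched as follows: measure AUROC on a small labeled held-out slice and negate the score if it falls below 0.5. This is an illustration only, not the released implementation; auroc and fit_orientation are hypothetical helpers, and the toy scores are made up. Note that a sign flip alone maps an AUROC of a to 1 - a when there are no ties.

```python
import numpy as np

def auroc(scores, labels):
    """AUROC via pairwise comparison: P(score_pos > score_neg), ties count half."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels).astype(bool)
    pos, neg = scores[labels], scores[~labels]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("need at least one positive and one negative example")
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

def fit_orientation(heldout_scores, heldout_labels):
    """Return +1.0 or -1.0 so that sign * score has AUROC >= 0.5 on the held-out slice."""
    return 1.0 if auroc(heldout_scores, heldout_labels) >= 0.5 else -1.0

# Toy held-out slice (illustrative): raw scores are inverted relative to the labels.
raw = np.array([0.9, 0.8, 0.7, 0.2, 0.1])
labels = np.array([0, 0, 0, 1, 1])  # 1 = positive class
sign = fit_orientation(raw, labels)
before = auroc(raw, labels)
after = auroc(sign * raw, labels)
```

The sign should be fit once on a slice that is kept out of the final evaluation, then frozen; refitting it on the test set would leak labels into the reported metric.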

How it was made

  • Frozen backbone: reneeice/editlens-qwen3-0.6b-repro (no fine-tuning).
  • Guard: a DeepSVDD detector (center + whitening) fit on the training distribution; inputs far from it are flagged as out-of-distribution, and the guard abstains on them.
  • Cost: one embedding pass + a closed-form fit.
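A closed-form "center + whitening" fit of the kind listed above can be sketched as a mean vector plus the inverse of a shrinkage-regularized covariance. This is a sketch under assumptions, not the authors' exact procedure: fit_guard, guard_distance, and the shrinkage value of 0.1 are hypothetical, and the random matrix stands in for mean-pooled embeddings from the frozen backbone.

```python
import numpy as np

def fit_guard(embeddings, shrinkage=0.1):
    """Closed-form fit: mean center plus inverse of a shrinkage-regularized covariance."""
    X = np.asarray(embeddings, dtype=np.float64)
    center = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    dim = cov.shape[0]
    # Shrink toward a scaled identity so the matrix stays invertible in high dimensions.
    target = (np.trace(cov) / dim) * np.eye(dim)
    cov_shrunk = (1.0 - shrinkage) * cov + shrinkage * target
    return center, np.linalg.inv(cov_shrunk)

def guard_distance(h, center, inv_cov):
    """Squared distance of one embedding from the center under inv_cov."""
    d = np.asarray(h, dtype=np.float64) - center
    return float(d @ inv_cov @ d)

rng = np.random.default_rng(0)
train = rng.normal(size=(500, 8))  # stand-in for mean-pooled embeddings (assumption)
center, inv_cov = fit_guard(train)
near = guard_distance(np.zeros(8), center, inv_cov)
far = guard_distance(np.full(8, 10.0), center, inv_cov)
# np.savez("ood_guard.npz", center=center, inv_cov=inv_cov)  # same keys the usage code reads
```

Setting shrinkage to 1.0 reduces the distance to a scaled Euclidean distance, which is one way to read the "shrinkage-regularized / Euclidean" choice described earlier.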

License

Apache-2.0. Built on Qwen/Qwen3-*-Base. The supervision labels derive from the gated pangram/editlens_iclr dataset; please honor its terms. Method credit: Human Texts Are Outliers (2510.08602) and EditLens (2510.03154).
