---
license: apache-2.0
language:
- en
library_name: transformers
pipeline_tag: text-classification
tags:
- ai-detection
- ai-edit-detection
- out-of-distribution
- ood-detection
- selective-prediction
- content-integrity
- qwen3
---

# editlens-ood-selective-guard-qwen3 — reliability guard for EditLens

**A reliability guard for AI-edit detection.** An out-of-distribution gate that abstains on inputs unlike the training distribution (domain shift, unseen models, non-native English), so the edit-score is only trusted where it's reliable.

## Usage

This model is a **reliability guard**: download `ood_guard.npz`, score each
input's distance to the training distribution, and **abstain** when that distance
is too large (route to a human, or withhold a verdict).

```python
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel

# Guard statistics stored in the downloaded file: a center and an inverse covariance.
g = np.load("ood_guard.npz")
center, inv = g["center"], g["inv_cov"]

tok = AutoTokenizer.from_pretrained("reneeice/editlens-qwen3-0.6b-repro")
enc = AutoModel.from_pretrained("reneeice/editlens-qwen3-0.6b-repro", torch_dtype=torch.bfloat16).eval()

def ood_distance(text):
    t = tok(text.lower(), truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():  # inference only; .numpy() fails on tensors that require grad
        h = enc(**t).last_hidden_state.mean(1)[0].float().numpy()  # mean-pooled embedding
    d = h - center
    return float(d @ inv @ d)   # high = out-of-distribution -> abstain
```

Set the abstain threshold from the coverage/accuracy table below.
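
One way to pick that threshold, as a minimal sketch: compute `ood_distance` on a held-out sample and take the quantile that matches the coverage you want to keep. The helper names `threshold_for_coverage` and `should_abstain` and the toy distances are hypothetical illustrations, not part of the released files.

```python
import numpy as np

def threshold_for_coverage(heldout_distances, coverage):
    """Distance below which roughly `coverage` of the held-out inputs fall."""
    if not 0.0 < coverage <= 1.0:
        raise ValueError("coverage must be in (0, 1]")
    return float(np.quantile(np.asarray(heldout_distances, dtype=float), coverage))

def should_abstain(distance, threshold):
    """Abstain when an input is farther from the training distribution than the threshold."""
    return distance > threshold

# Toy held-out distances (made up for illustration).
heldout = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
thr = threshold_for_coverage(heldout, 0.8)
kept = [d for d in heldout if not should_abstain(d, thr)]
```

In practice the held-out distances would come from text resembling your production traffic, since a quantile taken on a different distribution will not give the coverage you asked for.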

## Performance — selective prediction

Abstaining on the most out-of-distribution inputs leaves accuracy roughly flat down to 80% coverage and raises it at lower coverage, reaching 0.8940 at 50%:

| Coverage (kept) | Accuracy |
|---|---|
| 100% | 0.8725 |
| 90% | 0.8733 |
| 80% | 0.8706 |
| 70% | 0.8764 |
| 60% | 0.8833 |
| 50% | 0.8940 |

| Summary | Value |
|---|---|
| base accuracy (100% coverage) | 0.873 |
| accuracy @ 80% coverage | 0.871 |
| **lift from abstaining on the 20% most-OOD** | **-0.002** |
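
A coverage/accuracy curve of this kind can be computed roughly as follows. This is a sketch under assumptions: `accuracy_at_coverage` is a hypothetical helper, and the distances and correctness flags are toy values, not the evaluation data behind the tables.

```python
import numpy as np

def accuracy_at_coverage(distances, correct, coverage):
    """Accuracy over the `coverage` fraction of inputs with the smallest OOD distance."""
    distances = np.asarray(distances, dtype=float)
    correct = np.asarray(correct, dtype=float)
    n_keep = max(1, int(round(coverage * len(distances))))
    keep = np.argsort(distances, kind="stable")[:n_keep]  # most in-distribution first
    return float(correct[keep].mean())

# Toy data (made up): the two farthest inputs are both misclassified.
distances = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 9.0, 10.0]
correct = [1, 1, 0, 1, 1, 1, 1, 1, 0, 0]
base = accuracy_at_coverage(distances, correct, 1.0)
kept80 = accuracy_at_coverage(distances, correct, 0.8)
lift = kept80 - base  # positive only if the dropped inputs were the error-prone ones
```

Applied to the tables above, the same subtraction is 0.8706 - 0.8725 = -0.0019, which rounds to the reported -0.002.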

## The project behind this model

This model is one of a **family of three**, the end of a single research thread
that started from a classic question — *can you tell human text from machine
text?* — and ended at a more realistic one — *how much did AI edit this text, and
can we trust that judgement?*

The journey, start to finish:

1. **Reproduce "Human Texts Are Outliers."** We first reproduced the core claim of
   [arXiv:2510.08602](https://arxiv.org/abs/2510.08602) (NeurIPS 2025): instead of
   training a binary human-vs-machine classifier, model **machine text as the
   in-distribution** and treat **human text as out-of-distribution (OOD)** — an
   anomaly to be detected by distance from a learned center (DeepSVDD). A minimal
   end-to-end run on the RAID dataset hit **AUROC 0.94**, matching the paper.

2. **Meet EditLens.** Binary detection is the wrong frame for the *common* case:
   people lightly edit their own drafts with AI. [EditLens](https://arxiv.org/abs/2510.03154)
   (Thai et al., 2025) reframes detection as a **continuous "extent of AI editing"**
   score in [0,1], and the community
   [`editlens-qwen3-*-repro`](https://huggingface.co/reneeice/editlens-qwen3-4b-repro)
   models bring it to a modern **Qwen3** backbone.

3. **Apply the OOD idea to the edit-detection setting.** The insight of this work:
   take the OOD framing from step 1 and apply it to the edit-detection problem of
   step 2, on Qwen3. We pursued **three concrete ways** to do that — and shipped all
   three as a family:

| Model | What it is | Use it when |
|---|---|---|
| [`ood-editguard-qwen3-0.6b`](https://huggingface.co/reneeice/ood-editguard-qwen3-0.6b) | **Standalone OOD AI-edit detector** — a Qwen3 backbone fine-tuned (QLoRA) with an out-of-distribution head; outputs a continuous "how AI-edited" score. | You want one self-contained model that scores text end-to-end. |
| [`editlens-ood-adapter-qwen3-0.6b`](https://huggingface.co/reneeice/editlens-ood-adapter-qwen3-0.6b) | **Tiny OOD adapter** (a few MB) that snaps onto a frozen [EditLens-Qwen3](https://huggingface.co/reneeice/editlens-qwen3-4b-repro) checkpoint to add an anomaly / human-likeness score — no backbone training. | You already run EditLens and want to add an OOD score cheaply. |
| [`editlens-ood-selective-guard-qwen3`](https://huggingface.co/reneeice/editlens-ood-selective-guard-qwen3) ← **you are here** | **Reliability guard** for selective prediction — an OOD gate that abstains on inputs unlike the training distribution so the edit-score isn't trusted blindly. | You need calibrated, low-false-positive decisions and can abstain on hard cases. |

> **Why three?** They trade off cost and integration: the **standalone detector**
> is a self-contained model, the **adapter** is a cheap add-on to an existing
> EditLens deployment, and the **guard** wraps either with an abstain-on-uncertainty
> safety layer. Pick the one that matches how you deploy.

### One thing we learned the hard way

Our first frozen-embedding run scored an AUROC of **0.32** — not random, but
*inverted*. On the EditLens embedding space the geometry is the opposite of the
original RAID setup: **human/clean text is the compact in-distribution** and
heavily-AI-edited text is the outlier (its embeddings are organized around *extent
of editing*, not authorship). We flipped the in-distribution definition, switched
from full Mahalanobis to a shrinkage-regularized / Euclidean distance on frozen
features, and added an **auto-orientation** step that fixes the score's sign on a
held-out slice so a detector is never reported upside-down. That correction is
baked into this family.
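
The auto-orientation idea can be sketched like this, assuming a small labeled held-out slice is available. The function names and toy scores are hypothetical, and the released code may implement the check differently.

```python
import numpy as np

def auroc(scores, labels):
    """Rank-based AUROC: probability a positive outscores a negative (ties count half)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("need at least one example of each class")
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((greater + 0.5 * ties) / (len(pos) * len(neg)))

def orientation_sign(heldout_scores, heldout_labels):
    """Return +1.0 if higher scores already indicate the positive class, else -1.0."""
    return 1.0 if auroc(heldout_scores, heldout_labels) >= 0.5 else -1.0

# Toy held-out slice (made up): the raw score is inverted.
scores = [0.9, 0.8, 0.7, 0.2, 0.1]
labels = [0, 0, 1, 1, 1]
sign = orientation_sign(scores, labels)
raw_auc = auroc(scores, labels)
fixed_auc = auroc(sign * np.asarray(scores), labels)
```

Negating a score maps an AUROC of a to 1 - a, so an inverted 0.32 corresponds to 0.68 once the sign is fixed.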

## How it was made

- **Frozen backbone:** `reneeice/editlens-qwen3-0.6b-repro` (no fine-tuning).
- **Guard:** a DeepSVDD-style distance detector (a center plus a whitening matrix,
  stored as `center` and `inv_cov` in `ood_guard.npz`) fit on the **training
  distribution**; inputs far from it are flagged out-of-distribution and abstained.
- **Cost:** one embedding pass + a closed-form fit.
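
A closed-form fit of this kind might look like the sketch below. It is an assumption-laden illustration, not the released training script: the shrinkage target, the `shrinkage=0.1` default, and the helper names are hypothetical, and random vectors stand in for real embeddings. Only the output names `center` and `inv_cov` come from the usage snippet.

```python
import numpy as np

def fit_guard(embeddings, shrinkage=0.1):
    """Fit a center and a shrinkage-regularized inverse covariance in closed form."""
    X = np.asarray(embeddings, dtype=np.float64)
    center = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    # Shrink toward a scaled identity so the matrix stays well-conditioned.
    target = np.eye(X.shape[1]) * np.trace(cov) / X.shape[1]
    cov = (1.0 - shrinkage) * cov + shrinkage * target
    return center, np.linalg.inv(cov)

def guard_distance(h, center, inv_cov):
    """Quadratic-form distance of one embedding from the fitted center."""
    d = np.asarray(h, dtype=np.float64) - center
    return float(d @ inv_cov @ d)

# Random stand-in embeddings (200 samples, 4 dimensions).
rng = np.random.default_rng(0)
train = rng.normal(size=(200, 4))
center, inv_cov = fit_guard(train)
near = guard_distance(center, center, inv_cov)
far = guard_distance(center + 10.0, center, inv_cov)
# To persist in the layout the usage snippet expects:
# np.savez("ood_guard.npz", center=center, inv_cov=inv_cov)
```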

## License

Apache-2.0. Built on `Qwen/Qwen3-*-Base`. The supervision labels derive from the
gated [`pangram/editlens_iclr`](https://huggingface.co/datasets/pangram/editlens_iclr)
dataset; please honor its terms. Method credit: *Human Texts Are Outliers*
([2510.08602](https://arxiv.org/abs/2510.08602)) and *EditLens*
([2510.03154](https://arxiv.org/abs/2510.03154)).