---
license: apache-2.0
language:
- en
library_name: transformers
pipeline_tag: text-classification
tags:
- ai-detection
- ai-edit-detection
- out-of-distribution
- ood-detection
- content-integrity
- qwen3
- deepsvdd
---

# ood-editguard-qwen3: OOD AI-edit detector (Qwen3)

**Detect AI-edited text with an out-of-distribution detector on a Qwen3 backbone.**
Human text is modeled as in-distribution; AI-edited and AI-generated text are flagged as outliers, giving a continuous "how-AI-edited" score.

## Usage

```python
import torch
from transformers import AutoTokenizer, AutoModel
from peft import PeftModel

base = "Qwen/Qwen3-1.7B-Base"
repo = "reneeice/ood-editguard-qwen3-0.6b"

tok = AutoTokenizer.from_pretrained(repo)
# Load the base model in bf16, then attach the LoRA adapter from the repo.
backbone = PeftModel.from_pretrained(
    AutoModel.from_pretrained(base, torch_dtype=torch.bfloat16),
    repo,
)
# OOD head, downloaded from the repo; map_location keeps the load CPU-safe.
head = torch.load("ood_head.pt", map_location="cpu")

# score(text) = orientation * ||proj(meanpool(backbone(text))) - center||^2
```

Higher score = more AI-edited. Calibrate a threshold on your own data.

## Performance

Validation on `pangram/editlens_iclr` (held-out):

| Metric | Value |
|---|---|
| **AUROC** (AI vs human) | **0.910** |
| AUPR | 0.952 |
| Correlation with edit magnitude | +0.730 |

A random detector scores AUROC 0.5.

## The project behind this model

This model is one of a **family of three**, the end of a single research thread that started from a classic question (*can you tell human text from machine text?*) and ended at a more realistic one (*how much did AI edit this text, and can we trust that judgement?*).

The journey, start to finish:

1. **Reproduce "Human Texts Are Outliers."** We first reproduced the core claim of [arXiv:2510.08602](https://arxiv.org/abs/2510.08602) (NeurIPS 2025): instead of training a binary human-vs-machine classifier, model **machine text as the in-distribution** and treat **human text as out-of-distribution (OOD)**, an anomaly to be detected by distance from a learned center (DeepSVDD). A minimal end-to-end run on the RAID dataset hit **AUROC 0.94**, matching the paper.
2. **Meet EditLens.** Binary detection is the wrong frame for the *common* case: people lightly edit their own drafts with AI. [EditLens](https://arxiv.org/abs/2510.03154) (Thai et al., 2025) reframes detection as a **continuous "extent of AI editing"** score in [0,1], and the community `editlens-qwen3-*-repro` models (search HF: `editlens qwen3 repro`) bring it to a modern **Qwen3** backbone.
3. **Apply the OOD idea to the edit-detection setting.** The insight of this work: take the OOD framing from step 1 and apply it to the edit-detection problem of step 2, on Qwen3. We pursued **three concrete ways** to do that and shipped all three as a family:

| Model | What it is | Use it when |
|---|---|---|
| [`ood-editguard-qwen3-0.6b`](https://huggingface.co/reneeice/ood-editguard-qwen3-0.6b) ← **you are here** | **Standalone OOD AI-edit detector**: a Qwen3 backbone fine-tuned (QLoRA) with an out-of-distribution head; outputs a continuous "how AI-edited" score. | You want one self-contained model that scores text end-to-end. |
| [`editlens-ood-adapter-qwen3-0.6b`](https://huggingface.co/reneeice/editlens-ood-adapter-qwen3-0.6b) | **Tiny OOD adapter** (a few MB) that snaps onto a frozen EditLens-Qwen3 checkpoint (search HF: `editlens qwen3 repro`) to add an anomaly / human-likeness score, with no backbone training. | You already run EditLens and want to add an OOD score cheaply. |
| [`editlens-ood-selective-guard-qwen3`](https://huggingface.co/reneeice/editlens-ood-selective-guard-qwen3) | **Reliability guard** for selective prediction: an OOD gate that abstains on inputs unlike the training distribution so the edit-score isn't trusted blindly. | You need calibrated, low-false-positive decisions and can abstain on hard cases. |

> **Why three?** They trade off cost and integration: **A** (the standalone detector, first row) is a
> self-contained model, **B** (the adapter, second row) is a cheap add-on to an existing EditLens
> deployment, and **C** (the selective guard, third row) wraps either with an
> abstain-on-uncertainty safety layer. Pick the one that matches how you deploy.

### One thing we learned the hard way

Our first frozen-embedding run scored an AUROC of **0.32**: not random, but *inverted*. On the EditLens embedding space the geometry is the opposite of the original RAID setup: **human/clean text is the compact in-distribution** and heavily AI-edited text is the outlier (its embeddings are organized around *extent of editing*, not authorship). We flipped the in-distribution definition, switched from full Mahalanobis to a shrinkage-regularized / Euclidean distance on frozen features, and added an **auto-orientation** step that fixes the score's sign on a held-out slice so a detector is never reported upside-down. That correction is baked into this family.

## How it was trained

- **Backbone:** `Qwen/Qwen3-1.7B-Base`, bf16 + LoRA (rank 8, all attn+MLP projections).
- **Head:** a small LayerNorm+Linear projection trained in full, with a DeepSVDD one-class objective: pull **human** embeddings toward a center `c`, push AI embeddings away. Score = oriented squared distance to `c`.
- **Supervision:** edit-magnitude buckets from `cosine_score` (thresholds 0.03/0.15).
- **Compute:** a single GPU, minutes.

## License

Apache-2.0. Built on `Qwen/Qwen3-*-Base`. The supervision labels derive from the gated [`pangram/editlens_iclr`](https://huggingface.co/datasets/pangram/editlens_iclr) dataset; please honor its terms.

Method credit: *Human Texts Are Outliers* ([2510.08602](https://arxiv.org/abs/2510.08602)) and *EditLens* ([2510.03154](https://arxiv.org/abs/2510.03154)).
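
## Scoring sketch

The scoring rule quoted in the Usage section, `orientation * ||proj(meanpool(backbone(text))) - center||^2`, together with the auto-orientation step and the advice to calibrate a threshold on your own data, can be sketched in plain Python. This is an illustration under stated assumptions, not the released implementation: every helper name (`mean_pool`, `embed`, `auto_orientation`, `threshold_at_fpr`) is hypothetical, toy 3-dimensional vectors stand in for the backbone's hidden states, the LayerNorm has no learned scale or shift, the center is simply the mean of the human embeddings, and the threshold rule is one possible choice. The layout of `ood_head.pt` is not documented in this card, so the sketch does not load it.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding is skipped)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    return [sum(col) / len(kept) for col in zip(*kept)]


def layer_norm(x, eps=1e-5):
    """LayerNorm without learned scale/shift (a simplification for this sketch)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(x, weight, bias):
    """y = W x + b, with `weight` given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def embed(token_vectors, attention_mask, weight, bias):
    """proj(meanpool(...)): mean pooling, then LayerNorm + Linear."""
    return linear(layer_norm(mean_pool(token_vectors, attention_mask)), weight, bias)


def svdd_distance(z, center):
    """Squared Euclidean distance to the one-class center."""
    return sum((a - b) ** 2 for a, b in zip(z, center))


def auroc(scores, labels):
    """Probability that a positive (label 1) outscores a negative; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auto_orientation(scores, labels):
    """Return +1 if higher raw scores already mean 'AI-edited', otherwise -1."""
    return 1.0 if auroc(scores, labels) >= 0.5 else -1.0


def threshold_at_fpr(human_scores, target_fpr):
    """Threshold such that at most target_fpr of the human scores exceed it."""
    ranked = sorted(human_scores)
    allowed = min(int(target_fpr * len(ranked)), len(ranked) - 1)
    return ranked[len(ranked) - allowed - 1]


# Toy projection head (assumed values): keep the first two normalized dimensions.
WEIGHT = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
BIAS = [0.0, 0.0]

# (token vectors, attention mask, label); label 1 = AI-edited, 0 = human.
DATA = [
    ([[1.0, 2.0, 3.0], [1.0, 2.0, 3.0], [9.0, 9.0, 9.0]], [1, 1, 0], 0),
    ([[1.0, 2.0, 3.2]], [1], 0),
    ([[3.0, 1.0, 2.0]], [1], 1),
    ([[2.0, 3.0, 1.0], [2.0, 3.0, 1.0]], [1, 1], 1),
]

labels = [y for _, _, y in DATA]
embeddings = [embed(tokens, mask, WEIGHT, BIAS) for tokens, mask, _ in DATA]

# Center = mean of the human embeddings (assumption for this sketch).
human = [z for z, y in zip(embeddings, labels) if y == 0]
center = [sum(col) / len(human) for col in zip(*human)]

raw = [svdd_distance(z, center) for z in embeddings]
# In practice the sign is fixed on a held-out slice; the toy data is reused here.
orientation = auto_orientation(raw, labels)
scores = [orientation * r for r in raw]

# Flag anything scoring above every human example (target false-positive rate 0).
threshold = threshold_at_fpr([s for s, y in zip(scores, labels) if y == 0], 0.0)
flags = [s > threshold for s in scores]
print(orientation, flags)
```

If the raw distances had come out inverted, as in the frozen-embedding run described above, `auto_orientation` would return `-1.0` and the multiplication would restore the "higher = more AI-edited" convention before any threshold is chosen.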