	---
	license: mit
	language:
	- en
	tags:
	- text-classification
	- ai-text-detection
	- deberta-v3
	- binary-classification
	- nlp
	datasets:
	- liamdugan/raid
	- artem9k/ai-text-detection-pile
	- gsingh1-py/train
	- cc_news
	- blog_authorship_corpus
	- webis/tldr-17
	- ChristophSchuhmann/essays-with-instructions
	- HuggingFaceH4/stack-exchange-preferences
	- pile-of-law/pile-of-law
	metrics:
	- accuracy
	- f1
	- precision
	- recall
	- roc_auc
	pipeline_tag: text-classification
	model-index:
	- name: GLYPH
	  results:
	  - task:
	      type: text-classification
	      name: AI-Generated Text Detection
	    metrics:
	    - name: Accuracy
	      type: accuracy
	      value: 0.9885
	    - name: F1
	      type: f1
	      value: 0.9901
	    - name: Precision
	      type: precision
	      value: 0.9851
	    - name: Recall
	      type: recall
	      value: 0.9952
	    - name: ROC-AUC
	      type: roc_auc
	      value: 0.9990
	    - name: MCC
	      type: mcc
	      value: 0.9765
	---

	# GLYPH — High-Accuracy AI Text Detector

	GLYPH is a binary text classifier built on [DeBERTa-v3-base](https://huggingface.co/microsoft/deberta-v3-base) that distinguishes human-written text from AI-generated text. It achieves 98.85% accuracy, 0.999 ROC-AUC, and 0.990 F1 on a held-out test set spanning 6 human text sources and 14 AI models, from GPT-2-XL (1.5B) through GPT-4 (~1T).

	The full dataset contains ~50K texts, of which about 40K were used for training. The human side covers academic papers, news articles, blog posts, Reddit discussions, legal filings, Wikipedia, student essays, and technical Q&A; the AI side covers outputs from 24 distinct AI model configurations across 10 model families. The model produces well-separated, high-confidence predictions (mean confidence 0.976), and F1 stays at 0.987 even at a 0.95 decision threshold.

	## Key Results

	\| Metric \| Value \|
	\|---\|---\|
	\| Accuracy \| 98.85% \|
	\| F1 Score \| 0.9901 \|
	\| Precision \| 98.51% \|
	\| Recall \| 99.52% \|
	\| ROC-AUC \| 0.9990 \|
	\| Average Precision \| 0.9993 \|
	\| MCC \| 0.9765 \|
	\| Human Accuracy \| 97.94% \|
	\| AI Accuracy \| 99.52% \|
	\| Mean Confidence \| 0.976 \|
	\| F1 @ 0.95 threshold \| 0.987 \|

	All metrics were evaluated on a held-out test set of 5,050 texts (2,136 human / 2,914 AI) with no source-text overlap, shared split hashes, or temporal leakage relative to the training set.
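The headline numbers can be cross-checked from a single confusion matrix. The sketch below is illustrative rather than the evaluation script used for this card: the class sizes and the 44 false positives are taken from this README, while the count of 14 false negatives is an assumption back-solved from the reported 99.52% recall.

```python
import math

# Class sizes and FP come from this card; FN is an ASSUMPTION derived
# from the reported recall (2,914 * (1 - 0.9952) is roughly 14).
N_HUMAN, N_AI = 2136, 2914
FP = 44            # human texts flagged as AI
FN = 14            # AI texts missed (assumed)
TP = N_AI - FN     # AI texts correctly flagged
TN = N_HUMAN - FP  # human texts correctly passed


def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute standard binary metrics with AI as the positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_denominator
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
        "human_accuracy": tn / (tn + fp),
    }


metrics = binary_metrics(TP, FP, FN, TN)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Under that assumption the computed accuracy, precision, recall, F1, and MCC round to the values in the table above.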

	## Per-Source Performance

	### Human Text Sources

	\| Source \| Domain \| n \| Accuracy \| Confidence \|
	\|---\|---\|---\|---\|---\|
	\| PubMed Abstracts \| Biomedical research \| 300 \| 100.0% \| 0.988 \|
	\| Blog / Opinion \| Personal blogs \| 200 \| 100.0% \| 0.987 \|
	\| Reddit Writing \| Informal / social \| 300 \| 100.0% \| 0.985 \|
	\| Wikipedia \| Encyclopedic \| 500 \| 99.8% \| 0.987 \|
	\| CC-News \| Journalism \| 392 \| 99.5% \| 0.981 \|
	\| arXiv Abstracts \| Academic / scientific \| 444 \| 90.8% \| 0.948 \|

	arXiv abstracts are the hardest category: highly formulaic academic prose with structural similarity to AI output. Even so, detection accuracy is 90.8% with a mean confidence of 0.948, and the remaining errors are concentrated in a small subset of unusually short or template-heavy abstracts.

	### AI Model Families

	\| Model \| Family \| Params \| n \| Accuracy \| F1 \|
	\|---\|---\|---\|---\|---\|---\|
	\| GPT-3.5-Turbo \| OpenAI \| 175B \| 223 \| 100.0% \| 1.000 \|
	\| GPT-4 \| OpenAI \| ~1T \| 215 \| 100.0% \| 1.000 \|
	\| Llama-2-70B-Chat \| Meta \| 70B \| 191 \| 100.0% \| 1.000 \|
	\| MPT-30B \| MosaicML \| 30B \| 211 \| 100.0% \| 1.000 \|
	\| MPT-30B-Chat \| MosaicML \| 30B \| 191 \| 100.0% \| 1.000 \|
	\| Mistral-7B-Instruct-v0.1 \| Mistral AI \| 7B \| 194 \| 100.0% \| 1.000 \|
	\| Mistral-7B-v0.1 \| Mistral AI \| 7B \| 203 \| 100.0% \| 1.000 \|
	\| Llama-3.1-8B-Instruct \| Meta \| 8B \| 238 \| 99.6% \| 0.998 \|
	\| Phi-3.5-Mini-Instruct \| Microsoft \| 3.8B \| 238 \| 99.6% \| 0.998 \|
	\| Command-Chat \| Cohere \| 52B \| 198 \| 99.5% \| 0.997 \|
	\| Text-Davinci-002 \| OpenAI \| 175B \| 176 \| 99.4% \| 0.997 \|
	\| Llama-3.2-3B-Instruct \| Meta \| 3B \| 238 \| 99.2% \| 0.996 \|
	\| GPT-2-XL \| OpenAI \| 1.5B \| 198 \| 98.5% \| 0.992 \|
	\| Cohere Command \| Cohere \| 52B \| 200 \| 97.5% \| 0.987 \|

	Detection is robust across model generations from GPT-2 through GPT-4, across both open-weight and API-only models, and across parameter counts spanning nearly three orders of magnitude (1.5B to ~1T).

	### Performance by Text Length

	\| Length Bucket \| n \| Accuracy \| F1 \|
	\|---\|---\|---\|---\|
	\| Very Long (>2000 words) \| 103 \| 100.0% \| 1.000 \|
	\| Long (500–2000 words) \| 862 \| 99.9% \| 0.999 \|
	\| Medium (150–500 words) \| 1,634 \| 98.8% \| 0.989 \|
	\| Short (50–150 words) \| 1,976 \| 98.5% \| 0.989 \|
	\| Very Short (<50 words) \| 475 \| 98.1% \| 0.899 \|

	Accuracy degrades gracefully with shorter inputs: even on texts under 50 words, where the model has minimal signal, it remains above 98%. F1 drops more sharply in that bucket, to 0.899.

	### Threshold Sensitivity

	The model produces high-confidence outputs, so performance holds across aggressive decision thresholds:

	\| P(AI) Threshold \| F1 \| Precision \|
	\|---\|---\|---\|
	\| 0.50 (default) \| 0.990 \| 0.985 \|
	\| 0.60 \| 0.991 \| 0.987 \|
	\| 0.70 \| 0.992 \| 0.990 \|
	\| 0.80 \| 0.992 \| 0.992 \|
	\| 0.90 \| 0.991 \| 0.993 \|
	\| 0.95 \| 0.987 \| 0.996 \|

	At a 0.95 threshold, precision reaches 99.6% while F1 drops by only 0.003 (0.990 to 0.987), which suits high-stakes applications where false accusations of AI usage carry serious consequences.
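Applying a stricter threshold only changes how `P(AI)` is turned into a decision. A minimal sketch follows; the helper name `flag_as_ai` is hypothetical and not part of the released model.

```python
def flag_as_ai(p_ai: float, threshold: float = 0.5) -> bool:
    """Return True when P(AI) meets the decision threshold.

    `p_ai` is the second softmax output, per the [P(human), P(AI)] layout.
    The helper itself is illustrative, not shipped with the model.
    """
    if not 0.0 <= p_ai <= 1.0:
        raise ValueError("p_ai must be a probability in [0, 1]")
    return p_ai >= threshold


# The same score leads to different decisions under different thresholds.
score = 0.90
print(flag_as_ai(score))                  # default 0.50 threshold
print(flag_as_ai(score, threshold=0.95))  # strict threshold
```

Texts that fall below a strict threshold are best reported as "not flagged" rather than as confirmed human writing.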

	## Architecture

	\| Component \| Details \|
	\|---\|---\|
	\| Base model \| `microsoft/deberta-v3-base` (184M parameters) \|
	\| Architecture \| DeBERTa-v3 with disentangled attention and enhanced mask decoder \|
	\| Task head \| Linear classifier (768 → 2) with 0.15 dropout \|
	\| Tokenizer \| SentencePiece (slow tokenizer, `use_fast=False`) \|
	\| Max sequence length \| 512 tokens \|
	\| Output \| `[P(human), P(AI)]` softmax probabilities \|

	DeBERTa-v3 was chosen over RoBERTa and BERT alternatives due to its disentangled attention mechanism, which separately encodes content and position. This is particularly relevant for AI text detection: language models have characteristic positional dependencies in how they distribute tokens across a sequence, and disentangled attention gives the classifier direct access to these patterns.

	## Training

	### Configuration

	\| Parameter \| Value \|
	\|---\|---\|
	\| Trainable parameters \| 184,423,682 (100% — all layers unfrozen) \|
	\| Optimizer \| AdamW (weight decay 0.01) \|
	\| Learning rate \| 2e-5 (cosine schedule) \|
	\| Warmup \| 10% of total steps \|
	\| Effective batch size \| 64 (16 × 4 gradient accumulation) \|
	\| Precision \| bf16 mixed precision \|
	\| Gradient checkpointing \| Enabled (non-reentrant) \|
	\| Label smoothing \| 0.05 \|
	\| Class weights \| human=1.182, ai=0.867 \|
	\| Epochs \| 8 (early-stopped at 3.17) \|
	\| Best checkpoint \| Epoch 1.19 (by validation F1) \|
	\| Training time \| ~49 minutes on RTX 4070 Ti 12GB \|
	\| Final train loss \| 0.186 \|
	\| Final eval loss \| 0.150 \|
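The class weights in the table are consistent with the common inverse-frequency ("balanced") formula `total / (n_classes * class_count)`. The card does not state how the weights were derived, and per-class training counts are not listed, so the sketch below is an assumption: it applies the formula to the test-split counts, which should share the training ratio if the splits are stratified.

```python
def balanced_class_weights(counts: dict) -> dict:
    """Inverse-frequency weights: total / (n_classes * class_count)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


# Test-split counts from this card, used as a stand-in for the
# (unlisted) training counts. This substitution is an assumption.
weights = balanced_class_weights({"human": 2136, "ai": 2914})
for label, weight in weights.items():
    print(f"{label}: {weight:.3f}")
```

The rarer class (human) receives the larger weight, which counteracts the roughly 42/58 class imbalance in the loss.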

	### Why Fully Unfrozen?

	Initial experiments with 4 frozen encoder layers (standard practice from PAN-CLEF 2025 literature) yielded only 80% accuracy with a severe bias against human text: the model classified 44% of human texts as AI. Freezing 4 of 12 layers in DeBERTa-base locks 33% of the encoder, far more aggressive than the 21% reported for DeBERTa-large. Unfreezing all layers with cosine LR decay and 10% warmup resolved the bias, lifting human accuracy from 55.6% to 97.9% without sacrificing AI detection (97.4% → 99.5%).

	### Dataset Composition

	Total: 50,458 texts (40,364 train / 5,044 validation / 5,050 test)

	Stratified by source with hash-based deduplication to prevent data leakage.
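Hash-based deduplication can be sketched as follows. The normalization rule (lowercasing and collapsing whitespace) and the function names are assumptions for illustration; the actual preprocessing used for this dataset is not documented here.

```python
import hashlib
from typing import Iterable, List, Set


def text_fingerprint(text: str) -> str:
    """SHA-256 of a normalized text (lowercased, whitespace collapsed)."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def deduplicate(texts: Iterable[str]) -> List[str]:
    """Keep the first occurrence of each fingerprint, preserving order."""
    seen: Set[str] = set()
    unique = []
    for text in texts:
        fingerprint = text_fingerprint(text)
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(text)
    return unique


def find_leaks(train: Iterable[str], test: Iterable[str]) -> Set[str]:
    """Return fingerprints that appear in both splits (should be empty)."""
    return {text_fingerprint(t) for t in train} & {text_fingerprint(t) for t in test}


corpus = ["Hello   world", "hello world", "A different text"]
print(deduplicate(corpus))
print(find_leaks(["hello world"], ["HELLO world", "unseen"]))
```

Exact-hash matching only catches duplicates that are identical after normalization; near-duplicates would need a fuzzier method such as MinHash.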

	#### Human Sources (10 domains, ~29K target)

	\| Domain \| Source \| Target Count \| Text Type \|
	\|---\|---\|---\|---\|
	\| Academic (STEM) \| arXiv API \| 5,000 \| Abstracts across 8 categories (cs.CL, cs.AI, cs.LG, physics, math, q-bio, econ, stat) \|
	\| Academic (Medical) \| PubMed API \| 3,000 \| Biomedical research abstracts \|
	\| Encyclopedic \| Wikipedia API \| 5,000 \| Article sections across 10 topic categories \|
	\| Journalism \| CC-News (HuggingFace) \| 4,000 \| News articles \|
	\| Literary / Creative \| Project Gutenberg \| 2,000 \| Public domain book excerpts \|
	\| Informal / Social \| Reddit (webis/tldr-17) \| 3,000 \| Writing-focused subreddit posts \|
	\| Student / Educational \| PERSUADE corpus \| 2,000 \| Student essays \|
	\| Technical / Q&A \| StackExchange \| 2,000 \| Technical answers \|
	\| Blog / Opinion \| Blog Authorship Corpus \| 2,000 \| Personal blog posts \|
	\| Legal / Formal \| Pile of Law \| 1,000 \| Legal opinions and case summaries \|

	#### AI Sources (24 model configurations across 10 families)

	Locally generated via LM Studio (8 models, Q4_K_M quantization):

	\| Model \| Family \| Parameters \|
	\|---\|---\|---\|
	\| Llama-3.1-8B-Instruct \| Meta Llama \| 8B \|
	\| Llama-3.2-3B-Instruct \| Meta Llama \| 3B \|
	\| Mistral-7B-Instruct-v0.3 \| Mistral AI \| 7B \|
	\| Qwen2.5-7B-Instruct \| Alibaba Qwen \| 7B \|
	\| Qwen2.5-14B-Instruct \| Alibaba Qwen \| 14B \|
	\| Gemma-2-9B-Instruct \| Google \| 9B \|
	\| Phi-3.5-Mini-Instruct \| Microsoft \| 3.8B \|
	\| DeepSeek-V2-Lite-Chat \| DeepSeek \| 16B (MoE) \|

	Local generation used 4 temperature/sampling configurations (default, creative, precise, varied) across 6 prompt strategies (direct, continue, rewrite, expand, style_mimic, question_answer) with a system prompt enforcing natural human-like output — no markdown, no meta-commentary, no self-referential AI language.

	HuggingFace datasets (16 additional model configurations):

	\| Dataset \| Models Added \| Reference \|
	\|---\|---\|---\|
	\| RAID (ACL 2024) \| ChatGPT-3.5, GPT-4, GPT-3-Davinci, Cohere Command, Llama-2-70B-Chat, Mistral-7B-v0.1, Mixtral-8x7B, MPT-30B, GPT-2-XL \| [liamdugan/raid](https://huggingface.co/datasets/liamdugan/raid) \|
	\| AI Text Detection Pile \| GPT-2/3/J/ChatGPT (mixed) \| [artem9k/ai-text-detection-pile](https://huggingface.co/datasets/artem9k/ai-text-detection-pile) \|
	\| NYT Multi-Model \| GPT-4o, Yi-Large, Qwen-2-72B, Llama-3-8B, Gemma-2-9B, Mistral-7B \| [gsingh1-py/train](https://huggingface.co/datasets/gsingh1-py/train) \|

	This combination ensures coverage of proprietary API models (GPT-3.5, GPT-4, GPT-4o, Cohere), large open models exceeding consumer GPU VRAM (Llama-2-70B, Qwen-2-72B, Mixtral-8x7B, Yi-Large), older architectures (GPT-2, GPT-3, GPT-J), and mixture-of-experts models (Mixtral, DeepSeek-V2-Lite). RAID data was filtered to non-adversarial generations only (`attack=="none"`) for training data quality.

	## Usage

	### With Transformers

	```python
	from transformers import AutoTokenizer, AutoModelForSequenceClassification
	import torch

	model_name = "ogmatrixllm/glyph"  # Replace with your repo path
	# The slow (SentencePiece) tokenizer is required; see Important Notes.
	tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
	model = AutoModelForSequenceClassification.from_pretrained(model_name)
	model.eval()

	text = "Your text to classify here..."

	# Inputs longer than 512 tokens are truncated to the training length.
	inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
	with torch.no_grad():
	    logits = model(**inputs).logits
	    probs = torch.softmax(logits, dim=-1)

	# Output order is [P(human), P(AI)].
	p_human, p_ai = probs[0].tolist()
	label = "AI-generated" if p_ai > 0.5 else "Human-written"
	confidence = max(p_human, p_ai)

	print(f"{label} (confidence: {confidence:.1%})")
	```

	### With Pipeline

	```python
	from transformers import AutoTokenizer, pipeline

	model_name = "ogmatrixllm/glyph"  # Replace with your repo path

	detector = pipeline(
	    "text-classification",
	    model=model_name,
	    tokenizer=AutoTokenizer.from_pretrained(model_name, use_fast=False),
	)

	result = detector("Your text here...")
	print(result)
	# Example output shape: [{'label': 'LABEL_1', 'score': 0.98}]
	# LABEL_0 = human, LABEL_1 = AI
	```

	### Important Notes

	- Tokenizer: Always use `use_fast=False`. The fast tokenizer for DeBERTa-v3 has a confirmed regression in `transformers>=4.47` ([#42583](https://github.com/huggingface/transformers/issues/42583)) that crashes on load.
	- Max length: The model was trained with `max_length=512`. Longer texts should be truncated or chunked with predictions aggregated.
	- Labels: `LABEL_0` = human, `LABEL_1` = AI-generated.
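For texts longer than the 512-token window, one option is to score overlapping chunks and average the results. The sketch below is one possible approach, not the method used in evaluation. The window of 510 tokens (leaving room for two special tokens), the stride of 255, mean aggregation, and the `score_window` callable are all assumptions; in practice `score_window` would wrap the model call from the Transformers example and return `P(AI)` for one chunk.

```python
from statistics import mean
from typing import Callable, List, Sequence


def sliding_windows(
    token_ids: Sequence[int], window: int = 510, stride: int = 255
) -> List[Sequence[int]]:
    """Split token ids into overlapping windows; the last one may be shorter."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break  # this window already reaches the end of the text
        start += stride
    return windows


def document_p_ai(
    token_ids: Sequence[int],
    score_window: Callable[[Sequence[int]], float],
    window: int = 510,
    stride: int = 255,
) -> float:
    """Average the per-window P(AI) scores into one document-level score."""
    scores = [score_window(w) for w in sliding_windows(token_ids, window, stride)]
    return mean(scores)


# Stand-in scorer for demonstration only; a real scorer would run the model.
def toy_scorer(window_ids: Sequence[int]) -> float:
    return 0.8


fake_ids = list(range(1000))
print(len(sliding_windows(fake_ids)))
print(document_p_ai(fake_ids, toy_scorer))
```

Mean aggregation treats every chunk equally. Taking the maximum instead would be more sensitive to a single AI-written passage, at the cost of more false positives.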

	## Limitations and Ethical Considerations

	### Known Limitations

	1. English only. GLYPH was trained exclusively on English text. Performance on other languages is untested and likely degraded.

	2. Training distribution. The model has seen outputs from 24 specific AI model configurations. Novel architectures, heavily fine-tuned models, or future model families may evade detection. AI text detection is fundamentally adversarial — no static detector provides permanent robustness.

	3. arXiv abstracts remain the hardest domain at 90.8% accuracy. Highly formulaic academic writing with rigid structural conventions shares surface features with AI-generated text. Users in academic integrity contexts should treat borderline predictions on scientific abstracts with appropriate caution.

	4. Short texts (<50 words) have reduced F1 (0.899) despite high accuracy (98.1%). With minimal token-level signal, the model occasionally produces confident but incorrect predictions. For short-form content, consider requiring higher confidence thresholds.

	5. Adversarial attacks. The training data includes only non-adversarial AI outputs. Paraphrasing attacks, homoglyph substitution, targeted prompt engineering, and watermark-removal techniques were not included. Dedicated adversarial robustness (e.g., RAID adversarial subsets) is a planned enhancement.

	6. Mixed authorship. GLYPH classifies at the document level. It does not detect partial AI usage (e.g., AI-written paragraphs embedded in a human-written essay). Sentence-level or span-level detection requires a different approach.

	7. 512-token window. Texts are truncated at 512 tokens. For long documents, this means classification is based on the opening ~350–400 words only. Sliding-window aggregation is recommended for long-form content.

	### Ethical Considerations

	AI text detection carries real consequences — academic penalties, professional reputation damage, content moderation decisions. False positives (human text classified as AI) are particularly harmful. While GLYPH's false positive rate is low (2.06% on the test set, 44 out of 2,136 human texts), no detector achieves zero false positives.

	Recommendations for responsible deployment:

	- Never use GLYPH as the sole basis for punitive action. Use it as one signal among many (metadata, behavioral patterns, stylometric analysis).
	- Apply a high confidence threshold (≥0.95) for consequential decisions. At this threshold, precision reaches 99.6%.
	- Provide users with the confidence score, not just a binary label. A text scored at P(AI)=0.52 is fundamentally different from one scored at P(AI)=0.99.
	- Maintain an appeals process. Statistical classifiers will always produce errors.
	- Acknowledge the base rate problem. In populations where AI usage is rare, even a 2% FPR produces many false accusations relative to true detections.
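The base rate problem can be made concrete with Bayes' rule. This sketch uses the test-set recall (99.52%) and false positive rate (44 / 2,136) from this card; the prevalence values are hypothetical, and real-world error rates may differ from test-set rates.

```python
def positive_predictive_value(prevalence: float, tpr: float, fpr: float) -> float:
    """P(text is AI | flagged as AI), by Bayes' rule."""
    true_positives = prevalence * tpr
    false_positives = (1.0 - prevalence) * fpr
    return true_positives / (true_positives + false_positives)


TPR = 0.9952     # recall on the test set
FPR = 44 / 2136  # false positive rate on the test set

# Hypothetical shares of AI-written texts in the screened population.
for prevalence in (0.01, 0.05, 0.25, 0.50):
    ppv = positive_predictive_value(prevalence, TPR, FPR)
    print(f"prevalence {prevalence:.0%}: {ppv:.1%} of flagged texts are AI")
```

When only 1% of submissions are AI-written, roughly one in three flagged texts is actually AI-generated, so most flags in that setting would be false accusations.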

	## Training Infrastructure

	\| Component \| Specification \|
	\|---\|---\|
	\| GPU \| NVIDIA GeForce RTX 4070 Ti (12GB VRAM) \|
	\| CPU \| Intel Core i7-14700K (20 cores) \|
	\| RAM \| 48GB DDR5 \|
	\| Framework \| PyTorch 2.6+ / HuggingFace Transformers \|
	\| Precision \| bf16 mixed precision \|
	\| Total training time \| 49 minutes \|
	\| Experiment tracking \| Weights & Biases \|

	## Citation

	```bibtex
	@misc{glyph2026,
	title={GLYPH: High-Accuracy AI Text Detection with DeBERTa-v3},
	author={OGMatrix},
	year={2026},
	url={https://huggingface.co/ogmatrixllm/glyph}
	}
	```

	## Acknowledgments

	Training data incorporates the [RAID benchmark](https://huggingface.co/datasets/liamdugan/raid) (Dugan et al., ACL 2024), the [AI Text Detection Pile](https://huggingface.co/datasets/artem9k/ai-text-detection-pile), and the [NYT Multi-Model dataset](https://huggingface.co/datasets/gsingh1-py/train). Human text sources include arXiv, PubMed, Wikipedia, CC-News, Project Gutenberg, Reddit, StackExchange, Blog Authorship Corpus, PERSUADE, and Pile of Law. The base model is [DeBERTa-v3-base](https://huggingface.co/microsoft/deberta-v3-base) by Microsoft Research.