---
license: apache-2.0
language:
- en
tags:
- text-classification
- quality-filter
- web-content
- search
- qrater
datasets:
- custom
base_model: Qwen/Qwen3-Embedding-4B
pipeline_tag: text-classification
model-index:
- name: qrater-web-large-v1.0
  results:
  - task:
      type: text-classification
      name: Web Content Quality Classification
    metrics:
    - name: Accuracy
      type: accuracy
      value: 0.921
    - name: F1
      type: f1
      value: 0.867
---

# qrater-web-large-v1.0

A binary text classifier that distinguishes **clean, usable web content** from **noisy web pages** (boilerplate, ads, nav menus, cookie banners, login walls, paywalls, etc.).

Built for filtering web search results at scale — drop it into any retrieval or RAG pipeline to keep only pages worth reading.

| Model | Params | Base | Speed | Val Acc | Val F1 |
|-------|--------|------|-------|---------|--------|
| **qrater-web-large-v1.0** | 4B | Qwen3-Embedding-4B | ~15 docs/s | 92.1% | 0.867 |
| [qrater-web-base-v1.0](https://huggingface.co/chonkie-ai/qrater-web-base-v1.0) | 0.6B | Qwen3-Embedding-0.6B | ~16 docs/s | 92.4% | 0.873 |
| [qrater-web-small-v1.0](https://huggingface.co/chonkie-ai/qrater-web-small-v1.0) | 210M | EuroBERT-210m | ~34 docs/s | 90.6% | 0.843 |

*Speed measured on a single A100-80GB with vLLM classify mode, max 4096 tokens.*

## What it does

Given a web page (as markdown or plain text), the model predicts:

- **clean** (label 1) — substantive, readable content suitable for AI consumption
- **dirty** (label 0) — noise, boilerplate, broken formatting, thin content

## Usage

### Transformers

```python
from transformers import pipeline

pipe = pipeline(
    "text-classification",
    model="chonkie-ai/qrater-web-large-v1.0",
    torch_dtype="bfloat16",
    device_map="auto",
)

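# Truncate to the 4,096-token training length (see Limitations)
result = pipe(
    "# How DNS Works\n\nDNS resolution starts when...",
    truncation=True,
    max_length=4096,
)
# e.g. [{'label': 'clean', 'score': 0.97}]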
```

### vLLM (recommended for throughput)

```python
from vllm import LLM

model = LLM(
    "chonkie-ai/qrater-web-large-v1.0",
    task="classify",
    dtype="bfloat16",
    max_model_len=4096,
)

outputs = model.classify(["your web page text here"])
probs = outputs[0].outputs.probs  # [prob_dirty, prob_clean]
```
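
To turn these probabilities into a keep/drop decision, a threshold on the clean probability is one simple option. The sketch below is illustrative only: `keep_clean` and the 0.5 cutoff are assumptions rather than part of this model's API, and the probabilities are stand-in values so the snippet runs without a GPU.

```python
from typing import Sequence

CLEAN_INDEX = 1  # label 1 = clean, label 0 = dirty (see "What it does")


def keep_clean(
    pages: Sequence[str],
    probs: Sequence[Sequence[float]],
    threshold: float = 0.5,  # assumed cutoff; tune it on your own data
) -> list[str]:
    """Return the pages whose clean probability is at least `threshold`."""
    if len(pages) != len(probs):
        raise ValueError("pages and probs must have the same length")
    return [page for page, p in zip(pages, probs) if p[CLEAN_INDEX] >= threshold]


# Stand-in probabilities in [prob_dirty, prob_clean] order.
# With vLLM these would come from: [o.outputs.probs for o in outputs]
pages = ["# How DNS Works ...", "Accept cookies | Log in | Subscribe"]
probs = [[0.03, 0.97], [0.91, 0.09]]
kept = keep_clean(pages, probs)
```

Raising the threshold trades recall for precision: fewer pages pass, but those that do are more likely to be clean.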

## Training

- **Base model:** [Qwen/Qwen3-Embedding-4B](https://huggingface.co/Qwen/Qwen3-Embedding-4B)
- **Training data:** 10,000 labeled web pages
  - 4,128 samples from live web search results, labeled by Claude
  - 5,872 samples from Common Crawl, labeled by a 27B parameter classifier
  - Target distribution: ~30% clean / ~70% dirty
- **Hyperparameters:** 3 epochs, lr=5e-5, effective batch size 64, bf16 + Flash Attention 2, weight decay 0.01, warmup ratio 0.1
- **Hardware:** 4x A100-80GB with gradient checkpointing

## Label definition

A page is **clean** if:
- It contains substantive, original content (articles, tutorials, documentation, research papers)
- The main content is intact and readable after markdown conversion
- Boilerplate is minimal relative to content

A page is **dirty** if:
- It is dominated by navigation, ads, cookie notices, or login walls
- Its content is thin or auto-generated, with little substance
- Broken formatting or encoding issues make the content unusable
- It consists primarily of lists of links, product listings, or search result pages

## Evaluation

**Validation set** (1,000 held-out samples, same distribution as training):
- Accuracy: **92.1%**
- F1 (clean class): **0.867**
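
For reference, here is a minimal sketch of how these two metrics are defined, treating clean (label 1) as the positive class for F1. It is not the evaluation script used for this model; the function name and the toy labels are made up for illustration.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Compute accuracy and F1 for the `positive` class from label lists."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, f1


# Toy data with a ~30% clean / ~70% dirty split (1 = clean, 0 = dirty).
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
acc, f1 = accuracy_and_f1(y_true, y_pred)
```

Because dirty pages are the majority class, accuracy alone can look strong even when clean pages are missed, which is why F1 on the clean class is reported alongside it.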

**Live web search results** (99 pages across 10 diverse queries):
- 30% classified clean: somewhat stricter than the Claude baseline (~40% clean) and far more selective than training on Common Crawl data alone (67% clean)

## Smaller models

This model serves as the teacher for the smaller qrater-web models, which are trained via temperature-scaled KL-divergence distillation from the soft probability outputs of this model.
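
As a rough illustration of what a temperature-scaled KL-divergence loss looks like, here is a minimal pure-Python sketch for a single example. It is not the training code for the smaller models: the function names, the temperature of 2.0, and the conventional `temperature ** 2` scaling factor are all assumptions. It takes logits as input; if only teacher probabilities are stored, their logarithms can stand in as logits, since softmax is unchanged by an additive constant.

```python
import math


def log_softmax(logits, temperature):
    """Numerically stable log-softmax of logits divided by `temperature`."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]


def distillation_kl(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    The result is multiplied by temperature**2, a common convention that
    keeps gradient magnitudes comparable across temperatures (assumed here).
    """
    log_p = log_softmax(teacher_logits, temperature)  # teacher (target)
    log_q = log_softmax(student_logits, temperature)  # student (prediction)
    kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return kl * temperature ** 2


# Logits in [dirty, clean] order; the values are made up for illustration.
teacher = [-1.5, 2.0]
loss_match = distillation_kl(teacher, [-1.5, 2.0])  # student agrees
loss_off = distillation_kl(teacher, [1.0, -1.0])    # student disagrees
```

The loss is zero when the student reproduces the teacher's softened distribution and grows as the two diverge, so the student learns from the teacher's confidence rather than only from hard labels.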

## Limitations

- **English-only** — trained exclusively on English web content
- **Max input: 4,096 tokens** — longer pages are truncated (the base model supports 40K but training used 4K)
- **Optimized for informational content** — may be less calibrated on creative writing, social media, or e-commerce pages
- **Binary classification** — does not grade quality on a spectrum

## Citation

```bibtex
@misc{qrater2026,
  title={qrater-web-large-v1.0: Web Content Quality Classifier},
  author={Bhavnick Minhas},
  year={2026},
  url={https://huggingface.co/chonkie-ai/qrater-web-large-v1.0}
}
```

## License

Apache 2.0