---
license: apache-2.0
language:
- en
tags:
- text-classification
- quality-filter
- web-content
- search
- qrater
datasets:
- custom
base_model: Qwen/Qwen3-Embedding-4B
pipeline_tag: text-classification
model-index:
- name: qrater-web-large-v1.0
  results:
  - task:
      type: text-classification
      name: Web Content Quality Classification
    metrics:
    - name: Accuracy
      type: accuracy
      value: 0.921
    - name: F1
      type: f1
      value: 0.867
---

# qrater-web-large-v1.0

A binary text classifier that distinguishes **clean, usable web content** from **noisy web pages** (boilerplate, ads, nav menus, cookie banners, login walls, paywalls, etc.).

Built for filtering web search results at scale: drop it into any retrieval or RAG pipeline to keep only pages worth reading.

| Model | Params | Base | Speed | Val Acc | Val F1 |
|-------|--------|------|-------|---------|--------|
| **qrater-web-large-v1.0** | 4B | Qwen3-Embedding-4B | ~15 docs/s | 92.1% | 0.867 |
| [qrater-web-base-v1.0](https://huggingface.co/chonkie-ai/qrater-web-base-v1.0) | 0.6B | Qwen3-Embedding-0.6B | ~16 docs/s | 92.4% | 0.873 |
| [qrater-web-small-v1.0](https://huggingface.co/chonkie-ai/qrater-web-small-v1.0) | 210M | EuroBERT-210m | ~34 docs/s | 90.6% | 0.843 |

*Speed measured on a single A100-80GB with vLLM classify mode, max 4096 tokens.*

## What it does

Given a web page (as markdown or plain text), the model predicts:

- **clean** (label 1): substantive, readable content suitable for AI consumption
- **dirty** (label 0): noise, boilerplate, broken formatting, thin content

## Usage

### Transformers

```python
from transformers import pipeline

pipe = pipeline(
    "text-classification",
    model="chonkie-ai/qrater-web-large-v1.0",
    torch_dtype="bfloat16",
    device_map="auto",
)

result = pipe("# How DNS Works\n\nDNS resolution starts when...")
# [{'label': 'clean', 'score': 0.97}]
```

### vLLM (recommended for throughput)

```python
from vllm import LLM

model = LLM(
    "chonkie-ai/qrater-web-large-v1.0",
    task="classify",
    dtype="bfloat16",
    max_model_len=4096,
)

outputs = model.classify(["your web page text here"])
probs = outputs[0].outputs.probs  # [prob_dirty, prob_clean]
```

## Training

- **Base model:** [Qwen/Qwen3-Embedding-4B](https://huggingface.co/Qwen/Qwen3-Embedding-4B)
- **Training data:** 10,000 labeled web pages
  - 4,128 samples from live web search results, labeled by Claude
  - 5,872 samples from Common Crawl, labeled by a 27B parameter classifier
  - Target distribution: ~30% clean / ~70% dirty
- **Hyperparameters:** 3 epochs, lr=5e-5, effective batch size 64, bf16 + Flash Attention 2, weight decay 0.01, warmup ratio 0.1
- **Hardware:** 4x A100-80GB with gradient checkpointing

## Label definition

A page is **clean** if:

- It contains substantive, original content (articles, tutorials, documentation, research papers)
- The main content is intact and readable after markdown conversion
- Minimal boilerplate relative to content

A page is **dirty** if:

- Dominated by navigation, ads, cookie notices, or login walls
- Thin or auto-generated content with little substance
- Broken formatting or encoding issues that make content unusable
- Primarily lists of links, product listings, or search result pages

## Evaluation

**Validation set** (1,000 held-out samples, same distribution as training):

- Accuracy: **92.1%**
- F1 (clean class): **0.867**

**Live web search results** (99 pages across 10 diverse queries):

- 30% classified clean, aligned with Claude baseline (~40%) and significantly more selective than Common Crawl-only training (67% clean)

## Smaller models

This model serves as the teacher for the smaller qrater-web models, which are trained via temperature-scaled KL-divergence distillation from the soft probability outputs of this model.
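
For readers unfamiliar with the technique, here is a minimal, dependency-free sketch of a temperature-scaled KL-divergence loss for a two-class student. It is an illustration under stated assumptions, not the actual training code: the function names `soften` and `distillation_loss`, the temperature of 2.0, and the conventional `T**2` gradient rescaling are all hypothetical choices that this card does not specify.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soften(probs, temperature, eps=1e-12):
    """Temperature-scale a probability vector: softmax(log(p) / T)."""
    return softmax([math.log(max(p, eps)) / temperature for p in probs])


def distillation_loss(teacher_probs, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    teacher_probs:  [prob_dirty, prob_clean] from the teacher model.
    student_logits: raw two-class logits from the student.
    The T**2 factor is the usual rescaling that keeps gradient magnitudes
    comparable across temperatures (an assumption, not from this card).
    """
    p = soften(teacher_probs, temperature)
    q = softmax([z / temperature for z in student_logits])
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


# Hypothetical example: the teacher is confident the page is clean,
# while the student currently leans towards dirty.
loss = distillation_loss([0.03, 0.97], [2.0, -2.0])
print(f"distillation loss: {loss:.3f}")
```

The loss is zero when the student's softened distribution matches the teacher's and grows as the two diverge, so the student learns from the teacher's full probability output and not only from the hard clean/dirty label.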

## Limitations

- **English-only:** trained exclusively on English web content
- **Max input: 4,096 tokens:** longer pages are truncated (the base model supports 40K but training used 4K)
- **Optimized for informational content:** may be less calibrated on creative writing, social media, or e-commerce pages
- **Binary classification:** does not grade quality on a spectrum

## Citation

```bibtex
@misc{qrater2026,
  title={qrater-web-large-v1.0: Web Content Quality Classifier},
  author={Bhavnick Minhas},
  year={2026},
  url={https://huggingface.co/chonkie-ai/qrater-web-large-v1.0}
}
```

## License

Apache 2.0