---
license: apache-2.0
library_name: clip
pipeline_tag: image-classification
tags:
- image-classification
- clip
- openclip
- phishing-detection
- cybersecurity
- security
- pytorch
- vision
language:
- en
metrics:
- accuracy
- precision
- recall
- f1
model-index:
- name: desant-phishing-inference
  results:
  - task:
      type: image-classification
      name: Phishing Screenshot Detection
    metrics:
    - type: accuracy
      value: 0.95
      name: Accuracy
    - type: recall
      value: 0.93
      name: Malicious Recall
    - type: f1
      value: 0.94
      name: F1 Score
---

# Desant Phishing Detection Model

**CLIP RN50x64-based binary classifier for detecting phishing web pages from screenshots.**

Built by [Desant.ai](https://desant.ai) for real-time phishing protection in the [Desant Phishing Detector Chrome extension](https://chromewebstore.google.com/detail/desant-phishing-detector/alfnmlahonkioonhdghhdnflnoeeegdp).

## Model Description

This model classifies web page screenshots as **SAFE** (Class 0) or **MALICIOUS/phishing** (Class 1). It uses OpenAI's CLIP RN50x64 as a frozen visual feature extractor, with a custom 3-layer MLP classifier head trained on thousands of real-world phishing and legitimate screenshots.

> **Note:** The model was trained using [OpenAI's CLIP](https://github.com/openai/CLIP) (`clip-by-openai`) and is also compatible with [OpenCLIP](https://github.com/mlfoundations/open_clip) (`open_clip_torch`) for inference. The production backend uses OpenCLIP for serving.

The model is designed to detect phishing login forms: fake pages that mimic legitimate services (banks, email providers, social media, etc.) to steal user credentials.

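The class convention above (index 0 is SAFE, index 1 is MALICIOUS) can be illustrated with a small, dependency-free sketch of how a pair of output logits might be turned into a label. The helper names `softmax` and `decide` and the example logits are hypothetical, and the 0.5 threshold is only an illustrative default, not a tuned production value.

```python
import math

# Class index convention from this model card: 0 = SAFE, 1 = MALICIOUS.
LABELS = ("SAFE", "MALICIOUS")


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decide(logits, threshold=0.5):
    """Map [safe_logit, mal_logit] to (label, malicious probability).

    `threshold` is an assumption for illustration; raising it trades
    malicious recall for fewer false positives.
    """
    malicious_prob = softmax(logits)[1]
    label = LABELS[1] if malicious_prob > threshold else LABELS[0]
    return label, malicious_prob


if __name__ == "__main__":
    # Made-up logits, not real model outputs.
    print(decide([2.0, -1.0]))
    print(decide([-0.5, 1.5]))
    print(decide([-0.5, 1.5], threshold=0.9))
```

A deployment that cares more about avoiding false alarms than catching every phishing page could pass a higher `threshold`; the right value would have to be chosen on validation data.
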
### Key Features

- **High-resolution analysis**: 448x448 pixel input (4x the pixel count of ViT-B/32's 224x224 input)
- **Real-world training data**: Sourced from PhishTank, OpenPhish, URLhaus, and AlienVault OTX
- **Production-deployed**: Powers the [Desant Phishing Detector Chrome extension](https://chromewebstore.google.com/detail/desant-phishing-detector/alfnmlahonkioonhdghhdnflnoeeegdp) and the backend API used in the [Hugging Face Space demo](https://huggingface.co/spaces/desant-ai/desant-anti-phishing-inferencing)
- **Fast inference**: ~50ms on GPU (~30ms inference plus ~20ms preprocessing on an RTX 4090)

## Architecture

```
Input: Web page screenshot (any resolution)
           │
           ▼
┌──────────────────────────────────┐
│  Preprocessing                   │
│  • Aspect-ratio preserving       │
│    resize to 448×448             │
│  • Mean color padding            │
│    (CLIP mean: 123, 117, 104)    │
│  • CLIP normalization            │
│    mean=[0.481, 0.458, 0.408]    │
│    std=[0.269, 0.261, 0.276]     │
└──────────┬───────────────────────┘
           │
           ▼
┌──────────────────────────────────┐
│  CLIP RN50x64 Vision Encoder     │ ← Frozen (pre-trained weights)
│  ResNet-50 scaled up to          │
│  ~64× the compute                │
│  Output: 1024-dim feature vector │
└──────────┬───────────────────────┘
           │
           ▼
┌──────────────────────────────────┐
│  Classifier Head (trainable)     │
│                                  │
│  Dropout(0.5)                    │
│  Linear(1024 → 512) + ReLU       │
│  Dropout(0.3)                    │
│  Linear(512 → 128) + ReLU        │
│  Dropout(0.2)                    │
│  Linear(128 → 2)                 │
│                                  │
│  Output: [safe_logit, mal_logit] │
└──────────┬───────────────────────┘
           │
           ▼
   Softmax → Probabilities
   Class 0: SAFE
   Class 1: MALICIOUS (phishing)
```

## Training Details

### Training Data

| Source | Class | Description |
|---|---|---|
| PhishTank, OpenPhish, URLhaus, AlienVault OTX | **MALICIOUS** (Class 1) | Real phishing login form screenshots captured at 1920x941 |
| Curated safe URLs | **SAFE** (Class 0) | Legitimate login pages, normal web pages |

### Training Configuration

| Parameter | Value |
|---|---|
| Base model | CLIP RN50x64 (OpenAI CLIP, compatible with OpenCLIP) |
| Input resolution | 448 × 448 pixels |
| Original screenshot resolution | 1920 × 941 pixels |
| Batch size | 32 (effective 64 with gradient accumulation) |
| Gradient accumulation steps | 2 |
| Max epochs | 25 |
| Early stopping patience | 10 epochs |
| Optimizer | AdamW (lr=1e-4, weight_decay=1e-4, betas=(0.9, 0.999)) |
| LR scheduler | ReduceLROnPlateau (factor=0.5, patience=3) |
| Loss function | CrossEntropyLoss (unweighted) |
| Class balancing | WeightedRandomSampler |
| Data split | 80% train / 20% validation |
| Mixed precision | Enabled (AMP) |
| CLIP encoder | Frozen (only classifier head is trained) |

### Data Augmentation

| Augmentation | Details |
|---|---|
| Aspect-ratio preserving resize | Resize to 448x448 with CLIP mean color padding |
| Random horizontal flip | p=0.5 |
| Color jitter | brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1 |

### Preprocessing Pipeline

1. Load the screenshot (PNG, 1920x941 original resolution)
2. Resize to fit within 448x448 while preserving the aspect ratio
3. Pad with the CLIP mean color `(123, 117, 104)` to fill the 448x448 canvas
4. Convert to a tensor with values in `[0, 1]`
5. Normalize with CLIP statistics: `mean=[0.48145466, 0.4578275, 0.40821073]`, `std=[0.26862954, 0.26130258, 0.27577711]`

## Performance

| Metric | Score |
|---|---|
| **Accuracy** | 92% |
| **Malicious Recall** | 93% |
| **Safe Precision** | 94% |
| **False Positive Rate** | 2–6% |
| **F1 Score** | ~0.94 |

### Inference Speed

| Hardware | Inference Time | Preprocessing |
|---|---|---|
| NVIDIA RTX 4090 | ~30ms | ~20ms |
| NVIDIA T4 | ~80ms | ~25ms |
| CPU (i7-13700K) | ~500ms | ~30ms |

## Usage

### Quick Start (PyTorch)

```python
import clip
import torch
import torch.nn as nn
from PIL import Image
from torchvision import transforms

# CLIP normalization statistics (per RGB channel)
CLIP_MEAN = [0.48145466, 0.4578275, 0.40821073]
CLIP_STD = [0.26862954, 0.26130258, 0.27577711]


# Define the classifier architecture (must match training; the attribute
# names `clip_visual` and `classifier` are part of the checkpoint's keys)
class CLIPClassifier(nn.Module):
    def __init__(self, clip_model, num_classes=2):
        super().__init__()
        self.clip_visual = clip_model.visual

        # Infer the encoder's output width with a dummy forward pass
        with torch.no_grad():
            device = next(clip_model.parameters()).device
            dummy = torch.randn(1, 3, 448, 448, device=device)
            feature_dim = self.clip_visual(dummy).shape[1]

        self.classifier = nn.Sequential(
            nn.Dropout(0.5),
            nn.Linear(feature_dim, 512),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(512, 128),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(128, num_classes),
        )

    def forward(self, x):
        features = self.clip_visual(x.float())
        return self.classifier(features)


# Load CLIP base model in float32
device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("RN50x64", device=device, jit=False)
clip_model = clip_model.float()

# Build classifier and load trained weights
model = CLIPClassifier(clip_model, num_classes=2).to(device)
state_dict = torch.load("model_1920x941_CLIP_RN50x64_best.pth", map_location=device)
model.load_state_dict(state_dict)
model.eval()  # disables dropout for inference


# Preprocess a screenshot
def aspect_ratio_resize(image, target_size=(448, 448)):
    """Resize preserving aspect ratio with CLIP mean color padding."""
    tw, th = target_size
    w, h = image.size
    scale = min(tw / w, th / h)
    # Clamp to at least 1 px so extreme aspect ratios cannot yield a zero size
    nw, nh = max(1, int(w * scale)), max(1, int(h * scale))
    resized = image.resize((nw, nh), Image.LANCZOS)
    # Rounds to (123, 117, 104), the pad color documented above
    pad_color = tuple(round(c * 255) for c in CLIP_MEAN)
    canvas = Image.new("RGB", (tw, th), pad_color)
    canvas.paste(resized, ((tw - nw) // 2, (th - nh) // 2))
    return canvas


preprocess = transforms.Compose([
    transforms.Lambda(lambda img: aspect_ratio_resize(img, (448, 448))),
    transforms.ToTensor(),
    transforms.Normalize(mean=CLIP_MEAN, std=CLIP_STD),
])

# Run inference
image = Image.open("screenshot.png").convert("RGB")
input_tensor = preprocess(image).unsqueeze(0).to(device)

with torch.no_grad():
    logits = model(input_tensor)
    probs = torch.softmax(logits, dim=1)

safe_prob = probs[0][0].item()
malicious_prob = probs[0][1].item()
prediction = "MALICIOUS" if malicious_prob > 0.5 else "SAFE"

print(f"Prediction: {prediction}")
print(f"Safe probability: {safe_prob:.4f}")
print(f"Malicious probability: {malicious_prob:.4f}")
```

## Intended Use

**Primary use case:** Real-time phishing detection in web browsers via the [Desant Phishing Detector Chrome extension](https://chromewebstore.google.com/detail/desant-phishing-detector/alfnmlahonkioonhdghhdnflnoeeegdp)

**Suitable for:**

- Browser extensions that analyze page screenshots
- Email security systems checking embedded links
- Web crawlers classifying pages at scale
- Security research on phishing detection

**Not suitable for:**

- General image classification (model is specialized for web page screenshots)
- Detecting non-visual phishing attacks (e.g., homograph attacks without visual cues)
- Replacing comprehensive security solutions (this is one layer of defense)

## Limitations

- **Training bias**: The model is trained primarily on English-language phishing pages; performance may vary for other languages
- **Evasion**: Sophisticated attackers may craft pages that visually differ from the training data
- **Screenshot dependency**: Requires a full-page screenshot; partial captures may reduce accuracy
- **Resolution sensitivity**: Best performance with screenshots at or near 1920x941; very small or very large screenshots may see degraded accuracy
- **Login form focus**: The model is optimized for detecting fake login forms specifically; other phishing types (e.g., fake payment pages without login fields) may be less reliably detected

## Ethical Considerations

This model is designed for **defensive cybersecurity**: protecting users from phishing attacks. It should not be used to:

- Create or improve phishing pages
- Bypass existing security systems
- Target or profile individuals

## Citation

```bibtex
@software{desant_phishing_detection_2025,
  author = {Desant.ai},
  title = {CLIP-based Phishing Screenshot Detection Model},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/desant-ai/desant-phishing-inference}
}
```

## Links

- **Live Demo**: [Hugging Face Space](https://huggingface.co/spaces/desant-ai/desant-anti-phishing-inferencing)
- **Chrome Extension**: [Desant Phishing Detector Chrome extension](https://chromewebstore.google.com/detail/desant-phishing-detector/alfnmlahonkioonhdghhdnflnoeeegdp)
- **Organization**: [Desant.ai](https://desant.ai)