---
license: apache-2.0
library_name: clip
pipeline_tag: image-classification
tags:
  - image-classification
  - clip
  - openclip
  - phishing-detection
  - cybersecurity
  - security
  - pytorch
  - vision
language:
  - en
metrics:
  - accuracy
  - precision
  - recall
  - f1
model-index:
  - name: desant-phishing-inference
    results:
      - task:
          type: image-classification
          name: Phishing Screenshot Detection
        metrics:
          - type: accuracy
            value: 0.95
            name: Accuracy
          - type: recall
            value: 0.93
            name: Malicious Recall
          - type: f1
            value: 0.94
            name: F1 Score
---

# Desant Phishing Detection Model

**CLIP RN50x64-based binary classifier for detecting phishing web pages from screenshots.**

Built by [Desant.ai](https://desant.ai) for real-time phishing protection in the [Desant Phishing Detector Chrome extension](https://chromewebstore.google.com/detail/desant-phishing-detector/alfnmlahonkioonhdghhdnflnoeeegdp).

## Model Description

This model classifies web page screenshots as **SAFE** (Class 0) or **MALICIOUS/phishing** (Class 1). It uses OpenAI's CLIP RN50x64 as a frozen visual feature extractor, with a custom 3-layer MLP classifier head trained on thousands of real-world phishing and legitimate screenshots.

> **Note:** The model is trained using [OpenAI's CLIP](https://github.com/openai/CLIP) (`clip-by-openai`) and is also compatible with [OpenCLIP](https://github.com/mlfoundations/open_clip) (`open_clip_torch`) for inference. The production backend uses OpenCLIP for serving.

The model is designed to detect phishing login forms — fake pages that mimic legitimate services (banks, email providers, social media, etc.) to steal user credentials.

### Key Features

- **High-resolution analysis**: 448x448 pixel input (4x the pixels of the 224x224 input used by CLIP ViT-B/32)
- **Real-world training data**: Sourced from PhishTank, OpenPhish, URLhaus, and AlienVault OTX
- **Production-deployed**: Powers the [Desant Phishing Detector Chrome extension](https://chromewebstore.google.com/detail/desant-phishing-detector/alfnmlahonkioonhdghhdnflnoeeegdp) and the backend API used by the [Hugging Face Space demo](https://huggingface.co/spaces/desant-ai/desant-anti-phishing-inferencing)
- **Fast inference**: ~30ms model inference plus ~20ms preprocessing on an NVIDIA RTX 4090

## Architecture

```
Input: Web page screenshot (any resolution)
        │
        ▼
┌──────────────────────────────────┐
│  Preprocessing                   │
│  • Aspect-ratio preserving       │
│    resize to 448×448             │
│  • Mean color padding            │
│    (CLIP mean: 123, 117, 104)    │
│  • CLIP normalization            │
│    mean=[0.481, 0.458, 0.408]    │
│    std=[0.269, 0.261, 0.276]     │
└──────────┬───────────────────────┘
           │
           ▼
┌──────────────────────────────────┐
│  CLIP RN50x64 Vision Encoder     │  ← Frozen (pre-trained weights)
│  ResNet-50 scaled to ~64× the    │
│  compute (EfficientNet-style)    │
│  Output: 1024-dim feature vector │
└──────────┬───────────────────────┘
           │
           ▼
┌──────────────────────────────────┐
│  Classifier Head (trainable)     │
│                                  │
│  Dropout(0.5)                    │
│  Linear(1024 → 512) + ReLU       │
│  Dropout(0.3)                    │
│  Linear(512 → 128) + ReLU        │
│  Dropout(0.2)                    │
│  Linear(128 → 2)                 │
│                                  │
│  Output: [safe_logit, mal_logit] │
└──────────┬───────────────────────┘
           │
           ▼
     Softmax → Probabilities
     Class 0: SAFE
     Class 1: MALICIOUS (phishing)
```
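
As a rough sanity check on the size of the trainable head, the layer shapes in the diagram give the parameter count below. This is plain arithmetic, assuming each `Linear` layer carries a bias term (the PyTorch default); `linear_params` is a helper name introduced here for illustration only.

```python
def linear_params(in_features: int, out_features: int) -> int:
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features

# Layer shapes taken from the classifier head in the diagram above.
head_layers = [(1024, 512), (512, 128), (128, 2)]
head_total = sum(linear_params(i, o) for i, o in head_layers)

print(head_total)  # 590722
```

Dropout and ReLU add no parameters, so under this assumption roughly 0.59M weights are trained while the CLIP encoder stays frozen.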

## Training Details

### Training Data

| Source | Class | Description |
|---|---|---|
| PhishTank, OpenPhish, URLhaus, AlienVault OTX | **MALICIOUS** (Class 1) | Real phishing login form screenshots captured at 1920x941 |
| Curated safe URLs | **SAFE** (Class 0) | Legitimate login pages, normal web pages |

### Training Configuration

| Parameter | Value |
|---|---|
| Base model | CLIP RN50x64 (OpenAI CLIP, compatible with OpenCLIP) |
| Input resolution | 448 × 448 pixels |
| Original screenshot resolution | 1920 × 941 pixels |
| Batch size | 32 (effective 64 with gradient accumulation) |
| Gradient accumulation steps | 2 |
| Max epochs | 25 |
| Early stopping patience | 10 epochs |
| Optimizer | AdamW (lr=1e-4, weight_decay=1e-4, betas=(0.9, 0.999)) |
| LR scheduler | ReduceLROnPlateau (factor=0.5, patience=3) |
| Loss function | CrossEntropyLoss (unweighted) |
| Class balancing | WeightedRandomSampler |
| Data split | 80% train / 20% validation |
| Mixed precision | Enabled (AMP) |
| CLIP encoder | Frozen (only classifier head is trained) |

### Data Augmentation

| Augmentation | Details |
|---|---|
| Aspect-ratio preserving resize | Resize to 448x448 with CLIP mean color padding |
| Random horizontal flip | p=0.5 |
| Color jitter | brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1 |

### Preprocessing Pipeline

1. Load screenshot (PNG, 1920x941 original resolution)
2. Preserve aspect ratio, resize to fit 448x448
3. Pad with CLIP mean color `(123, 117, 104)` to fill 448x448 canvas
4. Convert to tensor `[0, 1]`
5. Normalize with CLIP statistics: `mean=[0.48145466, 0.4578275, 0.40821073]`, `std=[0.26862954, 0.26130258, 0.27577711]`
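
To make steps 2 and 3 concrete, the arithmetic for a 1920x941 screenshot can be worked through in plain Python. The `fit_within` helper is a name introduced here for illustration; it uses the same scale-and-truncate logic as the Quick Start code.

```python
def fit_within(width, height, target=448):
    """Return the resized size and paste offset for an aspect-ratio preserving fit."""
    scale = min(target / width, target / height)
    new_w, new_h = int(width * scale), int(height * scale)
    offset = ((target - new_w) // 2, (target - new_h) // 2)
    return (new_w, new_h), offset

size, offset = fit_within(1920, 941)
print(size, offset)  # (448, 219) (0, 114)

# Share of the 448x448 canvas that holds actual screenshot content.
content_fraction = size[0] * size[1] / (448 * 448)
print(f"{content_fraction:.1%}")  # 48.9%
```

The screenshot therefore fills a 448x219 band in the middle of the canvas, with 114 rows of padding above and 115 below.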

## Performance

| Metric | Score |
|---|---|
| **Accuracy** | 92% |
| **Malicious Recall** | 93% |
| **Safe Precision** | 94% |
| **False Positive Rate** | 2–6% |
| **F1 Score** | ~0.94 |
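
For reference, the sketch below shows how these metrics relate to a confusion matrix, using the standard definitions with MALICIOUS (Class 1) as the positive class. That F1 is reported for the malicious class is an assumption, and the counts are made up for illustration; they are not the model's evaluation results.

```python
def phishing_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Metrics with MALICIOUS (Class 1) as the positive class."""
    recall = tp / (tp + fn)      # share of phishing pages that are caught
    precision = tp / (tp + fp)   # share of MALICIOUS verdicts that are correct
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "malicious_recall": recall,
        "safe_precision": tn / (tn + fn),       # share of SAFE verdicts that are correct
        "false_positive_rate": fp / (fp + tn),  # safe pages flagged as phishing
        "f1": 2 * precision * recall / (precision + recall),
    }

# Made-up counts for illustration only.
example = phishing_metrics(tp=90, fp=5, tn=95, fn=10)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```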

### Inference Speed

| Hardware | Inference Time | Preprocessing |
|---|---|---|
| NVIDIA RTX 4090 | ~30ms | ~20ms |
| NVIDIA T4 | ~80ms | ~25ms |
| CPU (i7-13700K) | ~500ms | ~30ms |

## Usage

### Quick Start (PyTorch)

```python
import torch
import torch.nn as nn
import clip
from PIL import Image

# Define the classifier architecture (must match training)
class CLIPClassifier(nn.Module):
    def __init__(self, clip_model, num_classes=2):
        super().__init__()
        self.clip_visual = clip_model.visual
        with torch.no_grad():
            dummy = torch.randn(1, 3, 448, 448).float().to(next(clip_model.parameters()).device)
            features = self.clip_visual(dummy)
            feature_dim = features.shape[1]
        self.classifier = nn.Sequential(
            nn.Dropout(0.5),
            nn.Linear(feature_dim, 512),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(512, 128),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(128, num_classes),
        )

    def forward(self, x):
        features = self.clip_visual(x.float())
        return self.classifier(features)

# Load CLIP base model
device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("RN50x64", device=device, jit=False)
clip_model = clip_model.float()

# Build classifier and load trained weights
model = CLIPClassifier(clip_model, num_classes=2).to(device)
state_dict = torch.load("model_1920x941_CLIP_RN50x64_best.pth", map_location=device)
model.load_state_dict(state_dict)
model.eval()

# Preprocess a screenshot
from torchvision import transforms

def aspect_ratio_resize(image, target_size=(448, 448)):
    """Resize preserving aspect ratio with CLIP mean color padding."""
    tw, th = target_size
    w, h = image.size
    scale = min(tw / w, th / h)
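    # Clamp to 1 px so extreme aspect ratios never produce a zero-sized image
    nw, nh = max(1, int(w * scale)), max(1, int(h * scale))
    resized = image.resize((nw, nh), Image.LANCZOS)
    # CLIP mean scaled to 0-255; int() truncates, giving (122, 116, 104),
    # while rounding would give (123, 117, 104)
    pad_color = (int(0.48145466*255), int(0.4578275*255), int(0.40821073*255))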
    canvas = Image.new("RGB", (tw, th), pad_color)
    canvas.paste(resized, ((tw - nw) // 2, (th - nh) // 2))
    return canvas

preprocess = transforms.Compose([
    transforms.Lambda(lambda img: aspect_ratio_resize(img, (448, 448))),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.48145466, 0.4578275, 0.40821073],
        std=[0.26862954, 0.26130258, 0.27577711],
    ),
])

# Run inference
image = Image.open("screenshot.png").convert("RGB")
input_tensor = preprocess(image).unsqueeze(0).to(device)

with torch.no_grad():
    logits = model(input_tensor)
    probs = torch.softmax(logits, dim=1)
    safe_prob = probs[0][0].item()
    malicious_prob = probs[0][1].item()
    prediction = "MALICIOUS" if malicious_prob > 0.5 else "SAFE"

print(f"Prediction: {prediction}")
print(f"Safe probability:      {safe_prob:.4f}")
print(f"Malicious probability: {malicious_prob:.4f}")
```
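
The Quick Start uses a fixed 0.5 cutoff on the malicious probability. If a deployment needs fewer false positives, the cutoff can be raised at the cost of recall. The helper below is a hypothetical sketch of that idea: the `classify` name, the 0.9 threshold, and the hand-written logits are assumptions, not values used in production.

```python
import torch

def classify(logits: torch.Tensor, threshold: float = 0.5) -> list:
    """Map a batch of [safe_logit, mal_logit] rows to (label, malicious_prob) pairs."""
    malicious_probs = torch.softmax(logits, dim=1)[:, 1]
    return [
        ("MALICIOUS" if p > threshold else "SAFE", round(p, 4))
        for p in malicious_probs.tolist()
    ]

# Hand-written logits standing in for model outputs.
demo_logits = torch.tensor([[2.0, -1.0], [0.0, 1.0], [-3.0, 3.0]])
default_labels = [label for label, _ in classify(demo_logits)]
strict_labels = [label for label, _ in classify(demo_logits, threshold=0.9)]

print(default_labels)  # ['SAFE', 'MALICIOUS', 'MALICIOUS']
print(strict_labels)   # ['SAFE', 'SAFE', 'MALICIOUS']
```

The middle row has a malicious probability of about 0.73, so it is flagged at the default cutoff but passes at the stricter one.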

## Intended Use

**Primary use case:** Real-time phishing detection in web browsers via the [Desant Phishing Detector Chrome extension](https://chromewebstore.google.com/detail/desant-phishing-detector/alfnmlahonkioonhdghhdnflnoeeegdp)

**Suitable for:**
- Browser extensions that analyze page screenshots
- Email security systems checking embedded links
- Web crawlers classifying pages at scale
- Security research on phishing detection

**Not suitable for:**
- General image classification (model is specialized for web page screenshots)
- Detecting non-visual phishing attacks (e.g., homograph attacks without visual cues)
- Replacing comprehensive security solutions (this is one layer of defense)

## Limitations

- **Training bias**: Model is primarily trained on English-language phishing pages; performance may vary for other languages
- **Evasion**: Sophisticated attackers may craft pages that visually differ from training data
- **Screenshot dependency**: Requires a full-page screenshot; partial captures may reduce accuracy
- **Resolution sensitivity**: Best performance with screenshots at or near 1920x941; very small or very large screenshots may see degraded accuracy
- **Login form focus**: Model is optimized for detecting fake login forms specifically; other phishing types (e.g., fake payment pages without login fields) may be less reliably detected

## Ethical Considerations

This model is designed for **defensive cybersecurity** — protecting users from phishing attacks. It should not be used to:
- Create or improve phishing pages
- Bypass existing security systems
- Target or profile individuals

## Citation

```bibtex
@software{desant_phishing_detection_2025,
  author = {Desant.ai},
  title = {CLIP-based Phishing Screenshot Detection Model},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/desant-ai/desant-phishing-inference}
}
```

## Links

- **Live Demo**: [Hugging Face Space](https://huggingface.co/spaces/desant-ai/desant-anti-phishing-inferencing)
- **Chrome Extension**: [Desant Phishing Detector Chrome extension](https://chromewebstore.google.com/detail/desant-phishing-detector/alfnmlahonkioonhdghhdnflnoeeegdp)
- **Organization**: [Desant.ai](https://desant.ai)