---
base_model: google/paligemma2-3b-pt-224
library_name: peft
pipeline_tag: image-text-to-text
license: mit
language:
- en
tags:
- paligemma
- paligemma2
- lora
- peft
- transformers
- multimodal
- hate-speech-detection
- multi-label-classification
- vision-language-model
- mmhs150k
- content-moderation
- social-media
- meme-classification
datasets:
- mmhs150k
metrics:
- f1
model-index:
- name: paligemma2-3b-mmhs150k-lora
  results:
  - task:
      type: image-text-to-text
      name: Multi-Modal Hate Speech Detection
    dataset:
      name: MMHS150K
      type: mmhs150k
    metrics:
    - name: F1 Micro (Test)
      type: f1
      value: 0.5404
    - name: F1 Macro (Test)
      type: f1
      value: 0.4896
    - name: F1 Micro (Validation)
      type: f1
      value: 0.5378
    - name: Subset Accuracy (Validation)
      type: accuracy
      value: 0.4338
---

# PaliGemma 2 LoRA Adapter for Multi-Modal Hateful Content Classification

<div align="center">

[![GitHub](https://img.shields.io/badge/GitHub-Repository-blue.svg)](https://github.com/amirhossein-yousefi/text_image_multi_modal_vlm)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![Hugging Face](https://img.shields.io/badge/🤗%20Hugging%20Face-Model-orange)](https://huggingface.co/Amirhossein75/paligemma2-3b-mmhs150k-lora)

</div>

## 🎯 Model Overview

This is a **LoRA (Low-Rank Adaptation) adapter** fine-tuned on top of [google/paligemma2-3b-pt-224](https://huggingface.co/google/paligemma2-3b-pt-224) for **multi-label hateful content detection** on paired **text + image** data using the MMHS150K dataset.

### ✨ Key Features

- **Multi-Modal Understanding**: Processes both text and images simultaneously for context-aware classification
- **Multi-Label Classification**: Can detect multiple types of hate speech in a single sample
- **Generative Approach**: Uses generative classification instead of traditional classification heads
- **Efficient Fine-Tuning**: LoRA adapter whose weights total only ~24 MB
- **JSON Output**: Generates structured JSON arrays for easy downstream processing

This model uses **generative classification**: instead of training a dedicated classification head, the model generates a strict JSON array of labels (e.g., `["racist", "sexist"]`).

## Model Details

### Model Description

Given an image and its associated text, the model outputs a JSON array containing zero or more labels from a fixed label set. The model is trained to classify hateful memes and social media content into multiple hate speech categories.

**High-level flow:**
1. Build a strict "return JSON only" prompt listing allowed labels.
2. Feed (text + image) to the VLM.
3. Generate a short response.
4. Parse the first JSON array found (best-effort JSON extraction).
5. Convert labels into multi-hot predictions and compute multi-label metrics.
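
Step 5 of the flow above can be sketched as follows. This is a minimal illustration, not the project's actual code: the helper name `to_multi_hot` is hypothetical, and the label order is assumed to match the order used in this card.

```python
# Label order assumed from this card; the helper name is hypothetical.
CLASS_NAMES = ["racist", "sexist", "homophobe", "religion", "otherhate"]

def to_multi_hot(labels, class_names=CLASS_NAMES):
    """Map a list of predicted label strings to a 0/1 vector in class order."""
    wanted = {str(label).strip().lower() for label in labels}
    # Labels outside the fixed set are ignored rather than raising an error.
    return [1 if name in wanted else 0 for name in class_names]

print(to_multi_hot(["racist", "sexist"]))  # [1, 1, 0, 0, 0]
print(to_multi_hot([]))                    # [0, 0, 0, 0, 0]
```

An empty prediction maps to the all-zero vector, which corresponds to "no hate label applies".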

| Property | Value |
|----------|-------|
| **Developed by** | [Amirhossein Yousefi](https://github.com/amirhossein-yousefi) |
| **Model type** | Vision-Language Model (VLM) with LoRA adapter |
| **Language(s)** | English |
| **License** | MIT |
| **Base model** | [google/paligemma2-3b-pt-224](https://huggingface.co/google/paligemma2-3b-pt-224) |
| **Parameters (Base)** | 3B |
| **Adapter size** | ~24 MB |
| **Input** | Text + Image (224×224) |
| **Output** | JSON array of hate speech labels |

### Model Sources

| Resource | Link |
|----------|------|
| **Repository** | [github.com/amirhossein-yousefi/text_image_multi_modal_vlm](https://github.com/amirhossein-yousefi/text_image_multi_modal_vlm) |
| **Base Model** | [google/paligemma2-3b-pt-224](https://huggingface.co/google/paligemma2-3b-pt-224) |
| **Dataset** | [MMHS150K](https://gombru.github.io/2019/10/09/MMHS/) |

### 🏷️ Label Classes

The model classifies content into the following **5 hate speech categories**:

| Label | Description | Examples |
|-------|-------------|----------|
| `racist` | Content with racial discrimination | Slurs, stereotypes, dehumanization based on race/ethnicity |
| `sexist` | Content with gender-based discrimination | Misogyny, gender stereotypes, harassment based on gender |
| `homophobe` | Content with anti-LGBTQ+ discrimination | Slurs, stereotypes targeting LGBTQ+ individuals |
| `religion` | Content with religious discrimination | Attacks on religious groups, religious stereotypes |
| `otherhate` | Other forms of hateful content | Hate not covered by above categories |

## Uses

### ✅ Direct Use

This model is intended for detecting and classifying hateful content in multimodal (text + image) social media posts, memes, and similar content. It can be used for:

- **Content moderation systems** - Automated flagging of potentially harmful content
- **Research on hate speech detection** - Academic studies on multi-modal hate speech
- **Social media analysis** - Understanding patterns of hateful content
- **Dataset annotation assistance** - Semi-automated labeling of hate speech datasets
- **Educational purposes** - Understanding how VLMs can be applied to content moderation

### ⚠️ Out-of-Scope Use

- **Production moderation without human review:** This model should not be the sole decision-maker for content removal.
- **Non-English content:** The model is trained on English data only.
- **Single-modality analysis:** Best results are achieved with both text and image inputs.
- **Real-time high-stakes decisions:** The model may produce errors and should not be used for legal or high-stakes decisions without human oversight.
- **Surveillance or censorship:** This model should not be used for mass surveillance or unjust censorship.

## Bias, Risks, and Limitations

### Known Limitations

- **Dataset Bias:** The model is trained on the MMHS150K dataset, which may carry biases from the original annotations.
- **Cultural Context:** Performance may vary across different types of hateful content and cultural contexts.
- **Error Rate:** The model may produce false positives/negatives and should be used with human oversight.
- **JSON Parsing:** Generated JSON output may occasionally be malformed and require robust parsing.
- **Temporal Bias:** The model may not recognize new slurs, memes, or evolving hate speech patterns.
- **Image Quality:** Performance may degrade on low-quality, distorted, or heavily edited images.

### Recommendations

- ✅ Always use human review for critical content moderation decisions.
- ✅ Validate model outputs against your specific use case before deployment.
- ✅ Consider the cultural and contextual limitations of the training data.
- ✅ Implement robust JSON parsing with fallback mechanisms.
- ✅ Regularly evaluate model performance on new data distributions.
- ✅ Combine with other moderation signals for production systems.
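
One possible shape for the "robust JSON parsing with fallback" recommendation is sketched below. The function name `extract_labels` and the keyword-scan fallback are assumptions made for illustration; they are not taken from the project repository.

```python
import json
import re

ALLOWED = ["racist", "sexist", "homophobe", "religion", "otherhate"]

def extract_labels(response, allowed=ALLOWED):
    """Return the allowed labels found in a model response, in class order."""
    match = re.search(r"\[.*?\]", response, flags=re.DOTALL)
    if match:
        try:
            parsed = json.loads(match.group())
            if isinstance(parsed, list):
                found = {str(item).strip().lower() for item in parsed}
                # Drop anything the model invented outside the label set.
                return [name for name in allowed if name in found]
        except json.JSONDecodeError:
            pass  # malformed JSON: fall through to the keyword scan
    # Fallback (an assumption, not the project's method): whole-word scan.
    text = response.lower()
    return [name for name in allowed if re.search(rf"\b{re.escape(name)}\b", text)]

print(extract_labels('["sexist", "racist"]'))    # ['racist', 'sexist']
print(extract_labels("['racist', 'religion']"))  # ['racist', 'religion']
print(extract_labels('["racist", "sexist"'))     # ['racist', 'sexist']
```

Pass only the newly generated text to a parser like this. The keyword fallback trades precision for recall (a response such as "not racist" would still be flagged), so it is best treated as a signal for human review rather than a final decision.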

## 🚀 How to Get Started with the Model

### Installation

```bash
pip install transformers peft torch pillow accelerate
```

### Quick Start - Load the Model

```python
from transformers import AutoModelForImageTextToText, AutoProcessor
from peft import PeftModel
import torch

# Model identifiers
BASE_MODEL = "google/paligemma2-3b-pt-224"
LORA_ADAPTER = "Amirhossein75/paligemma2-3b-mmhs150k-lora"

# Load the base model
base_model = AutoModelForImageTextToText.from_pretrained(
    BASE_MODEL,
    torch_dtype=torch.float16,
    device_map="auto"  # or "cpu" for CPU-only inference
)

# Load the LoRA adapter
model = PeftModel.from_pretrained(base_model, LORA_ADAPTER)

# Load the processor
processor = AutoProcessor.from_pretrained(BASE_MODEL)

print("✅ Model loaded successfully!")
```

### Full Inference Example

```python
import torch
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor
from peft import PeftModel

# Load base model and adapter
BASE_MODEL = "google/paligemma2-3b-pt-224"
LORA_ADAPTER = "Amirhossein75/paligemma2-3b-mmhs150k-lora"

processor = AutoProcessor.from_pretrained(BASE_MODEL)
base_model = AutoModelForImageTextToText.from_pretrained(
    BASE_MODEL,
    torch_dtype=torch.float16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, LORA_ADAPTER)

# Prepare input
image = Image.open("path/to/image.jpg").convert("RGB")
text = "Some text to analyze"

# Create prompt
class_names = ["racist", "sexist", "homophobe", "religion", "otherhate"]
prompt = f"Classify the following text and image into zero or more of these labels: {class_names}. Return ONLY a JSON array of applicable labels. Text: {text}"

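# Generate
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)

# generate() returns the prompt tokens followed by the new tokens,
# so decode only the continuation to get the label array on its own.
prompt_len = inputs["input_ids"].shape[-1]
result = processor.decode(outputs[0][prompt_len:], skip_special_tokens=True)
print(result)  # e.g., ["racist", "sexist"]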
```

> **Note:** This is a LoRA adapter and requires loading the base model first. You cannot use `AutoModel.from_pretrained()` directly on the adapter.

### Batch Inference

```python
import json
import re

def parse_json_labels(response: str) -> list:
    """Extract JSON array from model response with fallback."""
    try:
        # Try to find JSON array in response
        match = re.search(r'\[.*?\]', response)
        if match:
            return json.loads(match.group())
    except json.JSONDecodeError:
        pass
    return []

def classify_batch(model, processor, images, texts, class_names):
    """Classify image-text pairs one at a time; returns one label list per pair."""
    results = []
    for image, text in zip(images, texts):
        prompt = f"Classify the following text and image into zero or more of these labels: {class_names}. Return ONLY a JSON array of applicable labels. Text: {text}"
        inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
        outputs = model.generate(**inputs, max_new_tokens=64)
        # Decode only the generated continuation. The echoed prompt contains the
        # bracketed label list, which the parser would otherwise match first.
        prompt_len = inputs["input_ids"].shape[-1]
        response = processor.decode(outputs[0][prompt_len:], skip_special_tokens=True)
        results.append(parse_json_labels(response))
    return results
```

## Training Details

### 📊 Training Data

**MMHS150K (Multi-Modal Hate Speech)** - A large-scale dataset for multi-modal hate speech detection containing ~150K tweets with associated images.

| Split | Samples | Description |
|-------|---------|-------------|
| Train | ~135,000 | Training samples |
| Validation | 5,000 | Validation samples |
| Test | ~10,000 | Held-out test samples |

**Dataset structure:**
- `train.csv`, `val.csv`, `test.csv` with columns: `text`, `image_path`, `labels`
- Labels are multi-hot encoded for: racist, sexist, homophobe, religion, otherhate
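
Reading rows in this layout might look like the sketch below. It assumes the `labels` column stores a JSON list of 0/1 flags in the class order above; the real files may serialize it differently, and the sample rows and image paths are invented placeholders.

```python
import csv
import io
import json

CLASS_NAMES = ["racist", "sexist", "homophobe", "religion", "otherhate"]

# Hypothetical two-row sample standing in for train.csv / val.csv / test.csv.
SAMPLE_CSV = '''text,image_path,labels
"example tweet one",images/0001.jpg,"[1, 0, 0, 0, 1]"
"example tweet two",images/0002.jpg,"[0, 0, 0, 0, 0]"
'''

def read_rows(handle):
    """Yield (text, image_path, label_names) tuples from a CSV file object."""
    for row in csv.DictReader(handle):
        multi_hot = json.loads(row["labels"])  # assumed serialization
        names = [name for name, flag in zip(CLASS_NAMES, multi_hot) if flag]
        yield row["text"], row["image_path"], names

rows = list(read_rows(io.StringIO(SAMPLE_CSV)))
print(rows[0])  # ('example tweet one', 'images/0001.jpg', ['racist', 'otherhate'])
print(rows[1])  # ('example tweet two', 'images/0002.jpg', [])
```

For a real file, replace the in-memory sample with `open("train.csv", newline="", encoding="utf-8")`.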

**Data Source:** Twitter/X posts with associated images, annotated for hate speech categories.

### Training Procedure

#### 🖥️ Hardware Used

| Component | Specification |
|-----------|---------------|
| **GPU** | NVIDIA A100 (40GB/80GB HBM2e) |
| **Platform** | Google Colab Pro |
| **GPU Memory** | 40GB+ |
| **Precision** | bf16 (Brain Float 16) mixed precision |
| **CUDA Version** | 11.8+ |

> **Note:** The NVIDIA A100 is an Ampere-architecture data center GPU, available with 40 GB of HBM2 memory (about 1.6 TB/s bandwidth) or 80 GB of HBM2e memory (about 2 TB/s bandwidth).

#### ⚙️ Training Hyperparameters

| Parameter | Value |
|-----------|-------|
| **Training regime** | bf16 mixed precision |
| **Optimizer** | AdamW |
| **Learning rate** | 2e-4 |
| **Batch size** | 4 (with gradient accumulation) |
| **Epochs** | 1 |
| **Max sequence length** | 512 |
| **Warmup steps** | 100 |

#### 🔧 LoRA Configuration

| Parameter | Value |
|-----------|-------|
| **LoRA rank (r)** | 4 |
| **LoRA alpha** | 32 |
| **LoRA dropout** | 0.05 |
| **Target modules** | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| **Task type** | CAUSAL_LM |
| **Bias** | none |
| **Adapter size** | ~24 MB |
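
To make the rank and alpha values concrete, the sketch below restates the table as a plain dictionary and works out the resulting LoRA scaling factor. The key names mirror the keyword arguments of `peft.LoraConfig`, but the dictionary itself is only an illustration, and the layer size in the last line is a hypothetical number chosen to show the arithmetic.

```python
# Values copied from the table above; keys mirror peft.LoraConfig keyword names.
lora_settings = {
    "r": 4,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj",
                       "gate_proj", "up_proj", "down_proj"],
    "bias": "none",
    "task_type": "CAUSAL_LM",
}

# Standard LoRA scales the low-rank update B @ A by alpha / r.
scaling = lora_settings["lora_alpha"] / lora_settings["r"]

def lora_param_count(d_in, d_out, r):
    """Trainable parameters LoRA adds to one d_in -> d_out linear layer."""
    # A has shape (r, d_in) and B has shape (d_out, r).
    return r * (d_in + d_out)

print(scaling)  # 8.0
# Hypothetical 1024 x 1024 layer, used only to illustrate the formula.
print(lora_param_count(1024, 1024, lora_settings["r"]))  # 8192
```

A low rank with a comparatively large alpha keeps the adapter small while still letting the update contribute strongly to each targeted projection.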

#### ⏱️ Training Time & Throughput

| Metric | Value |
|--------|-------|
| **Validation time** | 458.13s (0:07:38) |
| **Validation throughput** | 10.914 samples/s |
| **Epochs completed** | 1.0 |
| **Final validation loss** | 0.3525 |

## 📈 Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

| Dataset | Samples | Description |
|---------|---------|-------------|
| **Validation set** | 5,000 | MMHS150K validation split |
| **Test set** | ~10,000 | MMHS150K test split |

#### Metrics Explained

| Metric | Description | Interpretation |
|--------|-------------|----------------|
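| **F1 Micro** | Micro-averaged F1 score across all labels | Higher is better. Pools true positives, false positives, and false negatives over every label decision, so frequent labels weigh more. |
| **F1 Macro** | Macro-averaged F1 score (unweighted mean of per-label F1) | Higher is better. Gives equal weight to each class, including rare ones. |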
| **Subset Accuracy** | Exact match accuracy | Higher is better. All labels must match exactly. |
| **Hamming Loss** | Fraction of incorrectly predicted labels | Lower is better. Measures per-label errors. |
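
The four metrics can be computed from multi-hot rows as in the dependency-free sketch below. It is illustrative only: the function name is hypothetical, the toy data is invented, and a label with no true or predicted positives is scored as F1 = 0, which may differ from the convention used in the project's evaluation script.

```python
def multilabel_metrics(y_true, y_pred):
    """Compute micro/macro F1, subset accuracy and Hamming loss for 0/1 rows."""
    n_samples, n_labels = len(y_true), len(y_true[0])

    def f1(tp, fp, fn):
        denom = 2 * tp + fp + fn
        return 2 * tp / denom if denom else 0.0  # assumed zero-division rule

    per_label = []
    for j in range(n_labels):
        tp = sum(t[j] == 1 and p[j] == 1 for t, p in zip(y_true, y_pred))
        fp = sum(t[j] == 0 and p[j] == 1 for t, p in zip(y_true, y_pred))
        fn = sum(t[j] == 1 and p[j] == 0 for t, p in zip(y_true, y_pred))
        per_label.append((tp, fp, fn))

    # Micro F1 pools the counts; macro F1 averages the per-label scores.
    tp_all, fp_all, fn_all = (sum(c[i] for c in per_label) for i in range(3))
    wrong = sum(t[j] != p[j] for t, p in zip(y_true, y_pred) for j in range(n_labels))
    return {
        "f1_micro": f1(tp_all, fp_all, fn_all),
        "f1_macro": sum(f1(*c) for c in per_label) / n_labels,
        "subset_accuracy": sum(t == p for t, p in zip(y_true, y_pred)) / n_samples,
        "hamming_loss": wrong / (n_samples * n_labels),
    }

# Toy data, columns ordered: racist, sexist, homophobe, religion, otherhate.
y_true = [[1, 0, 0, 0, 0], [0, 1, 0, 0, 0], [0, 0, 0, 0, 0], [1, 1, 0, 0, 0]]
y_pred = [[1, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 1], [1, 1, 0, 0, 0]]
metrics = multilabel_metrics(y_true, y_pred)
print(metrics["f1_micro"], metrics["subset_accuracy"], metrics["hamming_loss"])  # 0.75 0.5 0.1
```

On this toy data micro F1 (0.75) is far higher than macro F1 (about 0.33) because three of the five labels never receive a correct positive, which is the same kind of gap that separates the two scores in the results below.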

### 📊 Results

#### This Model's Performance

| Split | F1 Micro | F1 Macro | Subset Accuracy | Hamming Loss |
|-------|----------|----------|-----------------|--------------|
| **Validation** | 0.5378 | 0.5000 | 0.4338 | 0.1422 |
| **Test** | 0.5404 | 0.4896 | – | – |

#### Comparison with Other Models in the Project

| Model | Hardware | Split | F1 Micro | F1 Macro | Subset Acc | Hamming Loss |
|-------|----------|-------|----------|----------|------------|--------------|
| **Qwen2-VL 2B + LoRA** | RTX 3080 (16GB) | Validation | 0.6172 | 0.5077 | 0.4366 | 0.14276 |
| **PaliGemma 2 3B + LoRA** (this model) | A100 | Validation | 0.5378 | 0.5000 | 0.4338 | 0.14220 |
| **Qwen2-VL 2B + LoRA** | RTX 3080 (16GB) | Test | 0.6110 | 0.4992 | – | – |
| **PaliGemma 2 3B + LoRA** (this model) | A100 | Test | 0.5404 | 0.4896 | – | – |

> **Note:** The Qwen2-VL model was trained on a local Windows machine with NVIDIA GeForce RTX 3080 Laptop GPU (16GB VRAM), NVIDIA driver 581.57, and CUDA 13.0.

## 🔧 Technical Specifications

### Model Architecture and Objective

| Component | Description |
|-----------|-------------|
| **Base Model** | PaliGemma 2 (3B parameters) - a vision-language model by Google |
| **Architecture** | Transformer-based VLM with SigLIP vision encoder |
| **Vision Encoder** | SigLIP-So400m/14 |
| **Text Decoder** | Gemma 2 2B |
| **Image Resolution** | 224 × 224 pixels |
| **Adapter** | LoRA (Low-Rank Adaptation) |
| **Objective** | Generative multi-label classification via JSON array generation |

### Compute Infrastructure

#### Hardware

| Component | Training | Inference (Recommended) |
|-----------|----------|------------------------|
| **GPU** | NVIDIA A100 (40GB) | Any GPU with 8GB+ VRAM |
| **Platform** | Google Colab Pro | Local / Cloud |
| **Precision** | bf16 | fp16 / bf16 |
| **Memory** | 40GB+ GPU RAM | 8GB+ GPU RAM |

#### Software

| Package | Version |
|---------|---------|
| **Python** | 3.8+ |
| **Transformers** | 4.40+ |
| **PEFT** | 0.17.1 |
| **PyTorch** | 2.0+ |
| **Accelerate** | 0.27+ |
| **Pillow** | 9.0+ |

## 📚 Citation

If you use this model, please cite:

**BibTeX:**

```bibtex
@misc{yousefi2024paligemma-hatespeech,
  author = {Yousefi, Amirhossein},
  title = {Multi-Modal Vision-Language Models for Hateful Content Classification},
  year = {2024},
  publisher = {GitHub},
  howpublished = {\url{https://github.com/amirhossein-yousefi/text_image_multi_modal_vlm}},
  note = {PaliGemma 2 LoRA adapter for MMHS150K hate speech detection}
}
```

**APA:**

Yousefi, A. (2024). *Multi-Modal Vision-Language Models for Hateful Content Classification*. GitHub. https://github.com/amirhossein-yousefi/text_image_multi_modal_vlm

## 📖 More Information

For more details on training, evaluation, and usage, see the [GitHub repository](https://github.com/amirhossein-yousefi/text_image_multi_modal_vlm).

### Related Models

- [Qwen2-VL 2B MMHS150K LoRA](https://huggingface.co/Amirhossein75/qwen2-vl-2b-mmhs150k-lora) - Alternative VLM fine-tuned on the same dataset

## 👤 Model Card Authors

[Amirhossein Yousefi](https://github.com/amirhossein-yousefi)

## 📧 Model Card Contact

- **GitHub:** [amirhossein-yousefi](https://github.com/amirhossein-yousefi)
- **Hugging Face:** [Amirhossein75](https://huggingface.co/Amirhossein75)

---

### Framework Versions

| Framework | Version |
|-----------|---------|
| PEFT | 0.17.1 |
| Transformers | 4.40+ |
| PyTorch | 2.0+ |