---
base_model: google/paligemma2-3b-pt-224
library_name: peft
pipeline_tag: image-text-to-text
license: mit
language:
- en
tags:
- paligemma
- paligemma2
- lora
- peft
- transformers
- multimodal
- hate-speech-detection
- multi-label-classification
- vision-language-model
- mmhs150k
- content-moderation
- social-media
- meme-classification
datasets:
- mmhs150k
metrics:
- f1
model-index:
- name: paligemma2-3b-mmhs150k-lora
  results:
  - task:
      type: image-text-to-text
      name: Multi-Modal Hate Speech Detection
    dataset:
      name: MMHS150K
      type: mmhs150k
    metrics:
    - name: F1 Micro (Test)
      type: f1
      value: 0.5404
    - name: F1 Macro (Test)
      type: f1
      value: 0.4896
    - name: F1 Micro (Validation)
      type: f1
      value: 0.5378
    - name: Subset Accuracy (Validation)
      type: accuracy
      value: 0.4338
---
# PaliGemma 2 LoRA Adapter for Multi-Modal Hateful Content Classification
## 🎯 Model Overview
This is a **LoRA (Low-Rank Adaptation) adapter** fine-tuned on top of [google/paligemma2-3b-pt-224](https://huggingface.co/google/paligemma2-3b-pt-224) for **multi-label hateful content detection** on paired **text + image** data using the MMHS150K dataset.
### ✨ Key Features
- **Multi-Modal Understanding**: Processes both text and images simultaneously for context-aware classification
- **Multi-Label Classification**: Can detect multiple types of hate speech in a single sample
- **Generative Approach**: Uses generative classification instead of traditional classification heads
- **Efficient Fine-Tuning**: LoRA adapter whose weights total only ~24MB
- **JSON Output**: Generates structured JSON arrays for easy downstream processing
Concretely, **generative classification** means that no dedicated classification head is trained: the model is fine-tuned to generate a strict JSON array of labels (e.g., `["racist", "sexist"]`), and an empty array means that no label applies.
## Model Details
### Model Description
Given an image and its associated text, the model outputs a JSON array containing zero or more labels from a fixed label set. The model is trained to classify hateful memes and social media content into multiple hate speech categories.
**High-level flow:**
1. Build a strict "return JSON only" prompt listing allowed labels.
2. Feed (text + image) to the VLM.
3. Generate a short response.
4. Parse the first JSON array found (best-effort JSON extraction).
5. Convert labels into multi-hot predictions and compute multi-label metrics.
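Step 5 of this flow can be sketched as follows. This is a minimal illustration, not the repository's implementation: the helper name `to_multi_hot` and the label order are assumptions, and only the label names come from this card.

```python
# Label order is an assumption made for this sketch; the names come from the card.
CLASS_NAMES = ["racist", "sexist", "homophobe", "religion", "otherhate"]


def to_multi_hot(labels, class_names=CLASS_NAMES):
    """Map a list of predicted label strings to a 0/1 vector (hypothetical helper)."""
    present = set(labels)
    # Labels outside the allowed set are ignored rather than raising an error.
    return [1 if name in present else 0 for name in class_names]


print(to_multi_hot(["racist", "sexist"]))  # [1, 1, 0, 0, 0]
print(to_multi_hot([]))                    # [0, 0, 0, 0, 0]
```

Vectors in this form can be compared position by position with the ground-truth labels to compute the multi-label metrics.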
| Property | Value |
|----------|-------|
| **Developed by** | [Amirhossein Yousefi](https://github.com/amirhossein-yousefi) |
| **Model type** | Vision-Language Model (VLM) with LoRA adapter |
| **Language(s)** | English |
| **License** | MIT |
| **Base model** | [google/paligemma2-3b-pt-224](https://huggingface.co/google/paligemma2-3b-pt-224) |
| **Parameters (Base)** | 3B |
| **Adapter size** | ~24MB |
| **Input** | Text + Image (224×224) |
| **Output** | JSON array of hate speech labels |
### Model Sources
| Resource | Link |
|----------|------|
| **Repository** | [github.com/amirhossein-yousefi/text_image_multi_modal_vlm](https://github.com/amirhossein-yousefi/text_image_multi_modal_vlm) |
| **Base Model** | [google/paligemma2-3b-pt-224](https://huggingface.co/google/paligemma2-3b-pt-224) |
| **Dataset** | [MMHS150K](https://gombru.github.io/2019/10/09/MMHS/) |
### 🏷️ Label Classes
The model classifies content into the following **5 hate speech categories**:
| Label | Description | Examples |
|-------|-------------|----------|
| `racist` | Content with racial discrimination | Slurs, stereotypes, dehumanization based on race/ethnicity |
| `sexist` | Content with gender-based discrimination | Misogyny, gender stereotypes, harassment based on gender |
| `homophobe` | Content with anti-LGBTQ+ discrimination | Slurs, stereotypes targeting LGBTQ+ individuals |
| `religion` | Content with religious discrimination | Attacks on religious groups, religious stereotypes |
| `otherhate` | Other forms of hateful content | Hate not covered by above categories |
## Uses
### ✅ Direct Use
This model is intended for detecting and classifying hateful content in multimodal (text + image) social media posts, memes, and similar content. It can be used for:
- **Content moderation systems** - Automated flagging of potentially harmful content
- **Research on hate speech detection** - Academic studies on multi-modal hate speech
- **Social media analysis** - Understanding patterns of hateful content
- **Dataset annotation assistance** - Semi-automated labeling of hate speech datasets
- **Educational purposes** - Understanding how VLMs can be applied to content moderation
### ⚠️ Out-of-Scope Use
- **Production moderation without human review:** This model should not be the sole decision-maker for content removal.
- **Non-English content:** The model is trained on English data only.
- **Single-modality analysis:** Best results are achieved with both text and image inputs.
- **Real-time high-stakes decisions:** The model may produce errors and should not be used for legal or high-stakes decisions without human oversight.
- **Surveillance or censorship:** This model should not be used for mass surveillance or unjust censorship.
## Bias, Risks, and Limitations
### Known Limitations
- **Dataset Bias:** The model was trained on the MMHS150K dataset, so it may inherit biases present in the original annotations.
- **Cultural Context:** Performance may vary across different types of hateful content and cultural contexts.
- **Error Rate:** The model may produce false positives/negatives and should be used with human oversight.
- **JSON Parsing:** Generated JSON output may occasionally be malformed and require robust parsing.
- **Temporal Bias:** The model may not recognize new slurs, memes, or evolving hate speech patterns.
- **Image Quality:** Performance may degrade on low-quality, distorted, or heavily edited images.
### Recommendations
- ✅ Always use human review for critical content moderation decisions.
- ✅ Validate model outputs against your specific use case before deployment.
- ✅ Consider the cultural and contextual limitations of the training data.
- ✅ Implement robust JSON parsing with fallback mechanisms.
- ✅ Regularly evaluate model performance on new data distributions.
- ✅ Combine with other moderation signals for production systems.
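One possible shape for the "robust JSON parsing with fallback" recommendation is sketched below. It is an illustration under stated assumptions, not the parser used in the repository: the helper name `extract_labels` is hypothetical, and falling back to `ast.literal_eval` for Python-style lists such as `['racist']` is a design choice made for this example.

```python
import ast
import json
import re

ALLOWED = ("racist", "sexist", "homophobe", "religion", "otherhate")


def extract_labels(response, allowed=ALLOWED):
    """Best-effort extraction of allowed labels from generated text (hypothetical helper)."""
    # Look at every bracketed span; the first one that parses as a list wins.
    for candidate in re.findall(r"\[[^\[\]]*\]", response):
        # Try strict JSON first, then fall back to Python literal syntax.
        for loader in (json.loads, ast.literal_eval):
            try:
                parsed = loader(candidate)
            except (ValueError, SyntaxError):
                continue
            if isinstance(parsed, list):
                # Normalize case, drop unknown labels and duplicates, keep canonical order.
                found = {str(item).strip().lower() for item in parsed}
                return [label for label in allowed if label in found]
    # Nothing parseable: treat the sample as having no predicted labels.
    return []


print(extract_labels('Labels: ["Racist", "spam", "racist"]'))  # ['racist']
print(extract_labels("['sexist', 'racist']"))                  # ['racist', 'sexist']
```

Returning an empty list on failure keeps downstream code simple, but it also hides parse failures; counting how often the fallback path is taken is a cheap way to monitor output quality.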
## 🚀 How to Get Started with the Model
### Installation
```bash
pip install transformers peft torch pillow accelerate
```
### Quick Start - Load the Model
```python
from transformers import AutoModelForImageTextToText, AutoProcessor
from peft import PeftModel
import torch

# Model identifiers
BASE_MODEL = "google/paligemma2-3b-pt-224"
LORA_ADAPTER = "Amirhossein75/paligemma2-3b-mmhs150k-lora"

# Load the base model
base_model = AutoModelForImageTextToText.from_pretrained(
    BASE_MODEL,
    torch_dtype=torch.float16,
    device_map="auto",  # or "cpu" for CPU-only inference
)

# Load the LoRA adapter
model = PeftModel.from_pretrained(base_model, LORA_ADAPTER)

# Load the processor
processor = AutoProcessor.from_pretrained(BASE_MODEL)
print("✅ Model loaded successfully!")
```
### Full Inference Example
```python
import torch
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor
from peft import PeftModel

# Load base model and adapter
BASE_MODEL = "google/paligemma2-3b-pt-224"
LORA_ADAPTER = "Amirhossein75/paligemma2-3b-mmhs150k-lora"
processor = AutoProcessor.from_pretrained(BASE_MODEL)
base_model = AutoModelForImageTextToText.from_pretrained(
    BASE_MODEL,
    torch_dtype=torch.float16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, LORA_ADAPTER)

# Prepare input
image = Image.open("path/to/image.jpg").convert("RGB")
text = "Some text to analyze"

# Create prompt
class_names = ["racist", "sexist", "homophobe", "religion", "otherhate"]
prompt = f"Classify the following text and image into zero or more of these labels: {class_names}. Return ONLY a JSON array of applicable labels. Text: {text}"

# Generate
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)

# generate() returns the prompt tokens followed by the new tokens,
# so decode only the newly generated part
prompt_len = inputs["input_ids"].shape[-1]
result = processor.decode(outputs[0][prompt_len:], skip_special_tokens=True)
print(result)  # e.g., ["racist", "sexist"]
```
> **Note:** This is a LoRA adapter and requires loading the base model first. You cannot use `AutoModel.from_pretrained()` directly on the adapter.
### Batch Inference
```python
import json
import re


def parse_json_labels(response: str) -> list:
    """Extract JSON array from model response with fallback."""
    try:
        # Try to find JSON array in response
        match = re.search(r'\[.*?\]', response)
        if match:
            return json.loads(match.group())
    except json.JSONDecodeError:
        pass
    return []


def classify_batch(model, processor, images, texts, class_names):
    """Classify image-text pairs one at a time; returns one label list per pair."""
    results = []
    for image, text in zip(images, texts):
        prompt = f"Classify the following text and image into zero or more of these labels: {class_names}. Return ONLY a JSON array of applicable labels. Text: {text}"
        inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
        outputs = model.generate(**inputs, max_new_tokens=64)
        # Decode only the new tokens; otherwise the label list embedded in the
        # prompt would be the first bracketed span the parser finds
        prompt_len = inputs["input_ids"].shape[-1]
        response = processor.decode(outputs[0][prompt_len:], skip_special_tokens=True)
        labels = parse_json_labels(response)
        results.append(labels)
    return results
```
## Training Details
### 📊 Training Data
**MMHS150K (Multi-Modal Hate Speech)** - A large-scale dataset for multi-modal hate speech detection containing ~150K tweets with associated images.
| Split | Samples | Description |
|-------|---------|-------------|
| Train | ~135,000 | Training samples |
| Validation | 5,000 | Validation samples |
| Test | ~10,000 | Held-out test samples |
**Dataset structure:**
- `train.csv`, `val.csv`, `test.csv` with columns: `text`, `image_path`, `labels`
- Labels are multi-hot encoded for: racist, sexist, homophobe, religion, otherhate
**Data Source:** Twitter/X posts with associated images, annotated for hate speech categories.
### Training Procedure
#### 🖥️ Hardware Used
| Component | Specification |
|-----------|---------------|
| **GPU** | NVIDIA A100 (40GB HBM2 or 80GB HBM2e variant) |
| **Platform** | Google Colab Pro |
| **GPU Memory** | 40GB+ |
| **Precision** | bf16 (Brain Float 16) mixed precision |
| **CUDA Version** | 11.8+ |
> **Note:** The NVIDIA A100 is a data center GPU based on the Ampere architecture. It is available with 40GB of HBM2 memory (about 1.6TB/s bandwidth) or 80GB of HBM2e memory (about 2.0TB/s bandwidth), and either variant is well suited to fine-tuning VLMs of this size.
#### ⚙️ Training Hyperparameters
| Parameter | Value |
|-----------|-------|
| **Training regime** | bf16 mixed precision |
| **Optimizer** | AdamW |
| **Learning rate** | 2e-4 |
| **Batch size** | 4 (with gradient accumulation) |
| **Epochs** | 1 |
| **Max sequence length** | 512 |
| **Warmup steps** | 100 |
#### 🔧 LoRA Configuration
| Parameter | Value |
|-----------|-------|
| **LoRA rank (r)** | 4 |
| **LoRA alpha** | 32 |
| **LoRA dropout** | 0.05 |
| **Target modules** | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| **Task type** | CAUSAL_LM |
| **Bias** | none |
| **Adapter size** | ~24MB |
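As a rough illustration of what the rank and alpha values in this table imply: standard LoRA leaves a dense weight of shape `d_out × d_in` frozen and learns two small factors of rank `r`, whose product is scaled by `alpha / r`. The sketch below computes that scale and the number of added parameters for a single layer. The 2048 × 2048 layer shape is purely illustrative and is not taken from PaliGemma 2, and the helper name is hypothetical.

```python
def lora_added_params(d_in, d_out, r):
    """Parameters added by one LoRA pair: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r


r, alpha = 4, 32                          # rank and alpha from the LoRA configuration
scale = alpha / r                         # standard LoRA scaling applied to B @ A
added = lora_added_params(2048, 2048, r)  # illustrative layer shape (an assumption)
full = 2048 * 2048                        # parameters in the frozen dense layer

print(scale)         # 8.0
print(added)         # 16384
print(added / full)  # 0.00390625
```

With a rank this small, each adapted layer adds well under 1% of the parameters of the layer it modifies, which is why the adapter stays small relative to the 3B base model.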
#### ⏱️ Training Time & Throughput
| Metric | Value |
|--------|-------|
| **Validation time** | 458.13s (0:07:38) |
| **Validation throughput** | 10.914 samples/s |
| **Epochs completed** | 1.0 |
| **Final validation loss** | 0.3525 |
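The reported throughput is consistent with the validation time, assuming the 5,000-sample validation split is what was timed. A quick arithmetic check (variable names are illustrative):

```python
val_samples = 5000    # validation split size (assumed to be the timed set)
val_seconds = 458.13  # reported validation time

throughput = val_samples / val_seconds
minutes, seconds = divmod(round(val_seconds), 60)

print(f"{throughput:.3f} samples/s")  # 10.914 samples/s
print(f"{minutes}:{seconds:02d}")     # 7:38
```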
## 📈 Evaluation
### Testing Data, Factors & Metrics
#### Testing Data
| Dataset | Samples | Description |
|---------|---------|-------------|
| **Validation set** | 5,000 | MMHS150K validation split |
| **Test set** | ~10,000 | MMHS150K test split |
#### Metrics Explained
| Metric | Description | Interpretation |
|--------|-------------|----------------|
| **F1 Micro** | Micro-averaged F1 score across all labels | Higher is better. Pools true positives, false positives, and false negatives over all labels, so frequent labels carry more weight. |
| **F1 Macro** | Macro-averaged F1 score (unweighted mean) | Higher is better. Gives equal weight to each class. |
| **Subset Accuracy** | Exact match accuracy | Higher is better. All labels must match exactly. |
| **Hamming Loss** | Fraction of incorrectly predicted labels | Lower is better. Measures per-label errors. |
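To make these definitions concrete, the sketch below computes all four metrics from multi-hot label matrices in plain Python. It follows the usual textbook definitions and is not necessarily how the repository computes them; the function name is hypothetical, F1 is defined as 0 when a label has no positives at all, and the toy data uses three labels for brevity.

```python
def multilabel_metrics(y_true, y_pred):
    """Micro/macro F1, subset accuracy, and Hamming loss for 0/1 label matrices."""
    n_samples, n_labels = len(y_true), len(y_true[0])

    def f1(tp, fp, fn):
        denom = 2 * tp + fp + fn
        return 2 * tp / denom if denom else 0.0  # F1 taken as 0 when undefined

    # Per-label counts of true positives, false positives, and false negatives.
    counts = []
    for j in range(n_labels):
        tp = sum(t[j] == 1 and p[j] == 1 for t, p in zip(y_true, y_pred))
        fp = sum(t[j] == 0 and p[j] == 1 for t, p in zip(y_true, y_pred))
        fn = sum(t[j] == 1 and p[j] == 0 for t, p in zip(y_true, y_pred))
        counts.append((tp, fp, fn))

    # Micro: pool the counts over labels, then compute a single F1.
    micro = f1(*(sum(c[k] for c in counts) for k in range(3)))
    # Macro: compute F1 per label, then take the unweighted mean.
    macro = sum(f1(*c) for c in counts) / n_labels
    # Subset accuracy: the whole label vector must match.
    subset_acc = sum(t == p for t, p in zip(y_true, y_pred)) / n_samples
    # Hamming loss: fraction of individual label decisions that are wrong.
    wrong = sum(a != b for t, p in zip(y_true, y_pred) for a, b in zip(t, p))
    hamming = wrong / (n_samples * n_labels)
    return {"f1_micro": micro, "f1_macro": macro,
            "subset_accuracy": subset_acc, "hamming_loss": hamming}


y_true = [[1, 0, 0], [0, 1, 0], [1, 1, 0]]
y_pred = [[1, 0, 0], [0, 0, 0], [1, 1, 1]]
print(multilabel_metrics(y_true, y_pred))
```

On this toy input, micro F1 is 0.75 (3 true positives, 1 false positive, 1 false negative), while macro F1 is lower because the third label has no correct predictions and contributes an F1 of 0 to the mean.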
### 📊 Results
#### This Model's Performance
| Split | F1 Micro | F1 Macro | Subset Accuracy | Hamming Loss |
|-------|----------|----------|-----------------|--------------|
| **Validation** | 0.5378 | 0.5000 | 0.4338 | 0.1422 |
| **Test** | 0.5404 | 0.4896 | – | – |
#### Comparison with Other Models in the Project
| Model | Hardware | Split | F1 Micro | F1 Macro | Subset Acc | Hamming Loss |
|-------|----------|-------|----------|----------|------------|--------------|
| **Qwen2-VL 2B + LoRA** | RTX 3080 (16GB) | Validation | 0.6172 | 0.5077 | 0.4366 | 0.14276 |
| **PaliGemma 2 3B + LoRA** (this model) | A100 | Validation | 0.5378 | 0.5000 | 0.4338 | 0.14220 |
| **Qwen2-VL 2B + LoRA** | RTX 3080 (16GB) | Test | 0.6110 | 0.4992 | – | – |
| **PaliGemma 2 3B + LoRA** (this model) | A100 | Test | 0.5404 | 0.4896 | – | – |
> **Note:** The Qwen2-VL model was trained on a local Windows machine with NVIDIA GeForce RTX 3080 Laptop GPU (16GB VRAM), NVIDIA driver 581.57, and CUDA 13.0.
## 🔧 Technical Specifications
### Model Architecture and Objective
| Component | Description |
|-----------|-------------|
| **Base Model** | PaliGemma 2 (3B parameters) - a vision-language model by Google |
| **Architecture** | Transformer-based VLM with SigLIP vision encoder |
| **Vision Encoder** | SigLIP-So400m/14 |
| **Text Decoder** | Gemma 2 (2B) |
| **Image Resolution** | 224 × 224 pixels |
| **Adapter** | LoRA (Low-Rank Adaptation) |
| **Objective** | Generative multi-label classification via JSON array generation |
### Compute Infrastructure
#### Hardware
| Component | Training | Inference (Recommended) |
|-----------|----------|------------------------|
| **GPU** | NVIDIA A100 (40GB) | Any GPU with 8GB+ VRAM |
| **Platform** | Google Colab Pro | Local / Cloud |
| **Precision** | bf16 | fp16 / bf16 |
| **Memory** | 40GB+ GPU RAM | 8GB+ GPU RAM |
#### Software
| Package | Version |
|---------|---------|
| **Python** | 3.8+ |
| **Transformers** | 4.40+ |
| **PEFT** | 0.17.1 |
| **PyTorch** | 2.0+ |
| **Accelerate** | 0.27+ |
| **Pillow** | 9.0+ |
## 📚 Citation
If you use this model, please cite:
**BibTeX:**
```bibtex
@misc{yousefi2024paligemma-hatespeech,
author = {Yousefi, Amirhossein},
title = {Multi-Modal Vision-Language Models for Hateful Content Classification},
year = {2024},
publisher = {GitHub},
howpublished = {\url{https://github.com/amirhossein-yousefi/text_image_multi_modal_vlm}},
note = {PaliGemma 2 LoRA adapter for MMHS150K hate speech detection}
}
```
**APA:**
Yousefi, A. (2024). *Multi-Modal Vision-Language Models for Hateful Content Classification*. GitHub. https://github.com/amirhossein-yousefi/text_image_multi_modal_vlm
## 📖 More Information
For more details on training, evaluation, and usage, see the [GitHub repository](https://github.com/amirhossein-yousefi/text_image_multi_modal_vlm).
### Related Models
- [Qwen2-VL 2B MMHS150K LoRA](https://huggingface.co/Amirhossein75/qwen2-vl-2b-mmhs150k-lora) - Alternative VLM fine-tuned on the same dataset
## 👤 Model Card Authors
[Amirhossein Yousefi](https://github.com/amirhossein-yousefi)
## 📧 Model Card Contact
- **GitHub:** [amirhossein-yousefi](https://github.com/amirhossein-yousefi)
- **Hugging Face:** [Amirhossein75](https://huggingface.co/Amirhossein75)
---
### Framework Versions
| Framework | Version |
|-----------|---------|
| PEFT | 0.17.1 |
| Transformers | 4.40+ |
| PyTorch | 2.0+ |