😐 ViT Facial Expression Recognition (9-Class Baseline Model)

This repository hosts a Vision Transformer (ViT)–based facial expression recognition model trained using an iterative fine-tuning strategy. The model was developed by further training LaurenGurgiolo/vit-micro-facial-expressions, which itself was fine-tuned from mo-thecreator/vit-Facial-Expression-Recognition.

The model classifies facial images into nine facial expression categories using transformer-based visual representations.

📌 Model Details

Base model: mo-thecreator/vit-Facial-Expression-Recognition

Intermediate model: LaurenGurgiolo/vit-micro-facial-expressions

Architecture: Vision Transformer (ViT)

Task: Facial Expression Classification

Final model type: Iteratively fine-tuned baseline model

📂 Dataset

9_Facial_Expressions Dataset

Source: LaurenGurgiolo/9_Facial_Expressions

Task: Multi-class facial expression classification

Classes: 9 facial expression categories

This dataset was used to further refine the intermediate ViT model through iterative training.

🧠 Training Methodology

Iterative Fine-Tuning (Baseline Model)

The LaurenGurgiolo/vit-micro-facial-expressions model was iteratively fine-tuned on the 9_Facial_Expressions dataset, allowing the model to progressively integrate new facial expression patterns.

Training Configuration:

Batch size: 16

Epochs: 10

Learning rate: 2e-5

Warmup steps: 500

Scheduler: Cosine learning rate with restarts (2 cycles)

Weight decay: 0.01

This iterative training procedure achieved a final accuracy of 75%, which is designated as the baseline performance.
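To make the scheduler settings above concrete, here is a minimal pure-Python sketch of a learning-rate curve with linear warmup followed by a cosine schedule with hard restarts. It assumes that "cosine learning rate with restarts (2 cycles)" refers to the hard-restart cosine schedule offered by the Hugging Face `transformers` library (`cosine_with_restarts`); the card does not confirm this. `TOTAL_STEPS` is a hypothetical value, since the card does not state the dataset size or the number of optimizer steps.

```python
import math

def lr_multiplier(step, total_steps, warmup_steps=500, num_cycles=2):
    """Return the factor applied to the base learning rate at `step`."""
    if step < warmup_steps:
        # Linear warmup from 0 to 1 over the first `warmup_steps` updates.
        return step / max(1, warmup_steps)
    # Fraction of the post-warmup training that has elapsed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    if progress >= 1.0:
        return 0.0
    # Cosine decay that jumps back to 1.0 at the start of each cycle.
    cycle_position = (num_cycles * progress) % 1.0
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * cycle_position)))

BASE_LR = 2e-5      # learning rate from the training configuration
TOTAL_STEPS = 5000  # hypothetical: not stated in the model card

for step in (0, 250, 500, 2749, 2750, 5000):
    print(step, BASE_LR * lr_multiplier(step, TOTAL_STEPS))
```

With these assumed values, the rate climbs linearly for 500 steps, decays toward zero by the midpoint of the remaining steps, restarts at the full 2e-5, and decays again to zero at the end of training.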

Non-Iterative Fine-Tuning (Comparison Model)

For comparison, the pretrained mo-thecreator/vit-Facial-Expression-Recognition model was directly fine-tuned on the 9_Facial_Expressions dataset without iterative training.

Training approach: Single-stage fine-tuning

Final accuracy: 66%

This result is 9 percentage points below the iterative baseline, which suggests that the intermediate fine-tuning stage contributed to the improvement.

📊 Results Summary

| Training Strategy | Accuracy |
| --- | --- |
| Iterative fine-tuning | 75% |
| Non-iterative fine-tuning | 66% |

Figure: Training and validation performance across 10 epochs, illustrating stable convergence and improved generalization under iterative training.
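The gap between the two reported accuracies can be expressed in a few ways. The short sketch below works through the arithmetic using only the 75% and 66% figures from this card. The card does not state the evaluation split or the number of test images, so this says nothing about statistical significance.

```python
iterative_acc = 75     # percent, iterative fine-tuning
single_stage_acc = 66  # percent, non-iterative fine-tuning

# Difference in percentage points.
absolute_gain = iterative_acc - single_stage_acc

# Improvement relative to the single-stage accuracy.
relative_gain = 100 * absolute_gain / single_stage_acc

# Share of the single-stage model's errors that the iterative model removes.
error_reduction = 100 * absolute_gain / (100 - single_stage_acc)

print(f"Absolute gain: {absolute_gain} percentage points")  # 9
print(f"Relative gain: {relative_gain:.1f}%")               # 13.6%
print(f"Error reduction: {error_reduction:.1f}%")           # 26.5%
```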

🧠 Why Iterative Training?

Iterative training is a sequential learning methodology in which a facial expression recognition model is fine-tuned on multiple datasets, one after another. This approach enables:

Progressive knowledge refinement

Improved generalization to unseen facial variations

Enhanced feature discrimination

By exposing the model to increasingly diverse data distributions, iterative training improves adaptability to novel conditions (Mohan, 2024).

🧬 Architecture Choice

A Vision Transformer (ViT) architecture was selected for its strong reported performance on facial analysis tasks. Because global self-attention lets every image patch attend to every other patch, ViTs have been reported to match or exceed convolutional neural networks (CNNs) in accuracy and generalization on some benchmarks.

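🚀 Usage Example

from transformers import AutoImageProcessor, AutoModelForImageClassification
from PIL import Image
import torch

# Replace the placeholder with this model's repository ID
# (the model page lists it as LaurenGurgiolo/VIT_finetuned_9emotions).
processor = AutoImageProcessor.from_pretrained("your-username/your-model-name")
model = AutoModelForImageClassification.from_pretrained("your-username/your-model-name")

# Convert to RGB so grayscale or RGBA files match the 3-channel input ViT expects.
image = Image.open("face.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

# Inference only, so gradient tracking is disabled.
with torch.no_grad():
    outputs = model(**inputs)

# Index of the highest-scoring class, plus its name from the model config.
predicted_label = outputs.logits.argmax(dim=-1).item()
print(predicted_label, model.config.id2label[predicted_label])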
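The usage example reports only the single highest-scoring class. As a minimal sketch, the plain-Python snippet below shows how raw logits could be turned into ranked probabilities. The logit values and the `class_i` label names are hypothetical placeholders, since this card does not list the nine class names. In practice the inputs would likely come from `outputs.logits[0].tolist()` and `model.config.id2label`.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(logits, id2label, k=3):
    """Return the k most probable (label, probability) pairs, best first."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]

# Hypothetical logits for one image and placeholder names for the 9 classes.
logits = [0.2, 1.5, -0.3, 2.4, 0.0, -1.1, 0.7, 0.1, -0.5]
id2label = {i: f"class_{i}" for i in range(9)}

for label, prob in top_k(logits, id2label):
    print(f"{label}: {prob:.3f}")
```

Showing the top few probabilities instead of a single label makes ambiguous predictions visible, which matters given that expression labels are subjective.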

⚠️ Limitations

Performance may be affected by:

Low-resolution images

Occlusions or extreme facial poses

Unbalanced class distributions

Emotion classification remains inherently subjective.

📜 License & Attribution

Base model: mo-thecreator/vit-Facial-Expression-Recognition

Dataset: LaurenGurgiolo/9_Facial_Expressions

Please consult the original model and dataset licenses on Hugging Face before use.

🙌 Acknowledgements

Hugging Face for model hosting and tools

Dataset contributors

Prior research on Vision Transformers and iterative learning strategies

Model size: 85.8M params (Safetensors)

Tensor type: F32

Model tree for LaurenGurgiolo/VIT_finetuned_9emotions: mo-thecreator/vit-Facial-Expression-Recognition (base model) → LaurenGurgiolo/vit-micro-facial-expressions (fine-tuned) → this model (fine-tuned)
