---
library_name: sklearn
tags:
  - spam-detection
  - scikit-learn
  - ensemble
  - tfidf
  - lime
  - shap
  - nlp
  - text-classification
license: mit
pipeline_tag: text-classification
---

# Spam Email Classifier — sklearn Voting Ensemble (Gradio)

**ENGT 375 — Applied Machine Learning | Spring 2026 | ODU**

> **Disclaimer:** This model was created as a student project for Applied Machine Learning at Old Dominion University. It is intended for **educational and research purposes only** and should not be used as a sole spam/phishing filter in production. Classification accuracy may vary, and the model may produce incorrect or misleading results. Always use established email security tools for real-world spam filtering.

A voting ensemble classifier (Random Forest + Logistic Regression + SVM) for spam email detection, with LIME and SHAP explainability support.

## Model Details

- **Architecture:** VotingClassifier (soft voting)
  - Random Forest
  - Logistic Regression
  - Calibrated LinearSVC
- **Features:** TF-IDF (text) + 24 hand-crafted metadata features
- **Framework:** scikit-learn
- **Task:** Binary classification (spam / ham)
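
The architecture listed above can be sketched on toy data as follows. This is a minimal illustration, not the training script used for this model: the corpus, the `toy_metadata` helper (a stand-in for the 24 metadata features), and all hyperparameters are assumptions chosen so the example runs quickly. `LinearSVC` has no `predict_proba`, so it is wrapped in `CalibratedClassifierCV` to take part in soft voting.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import LinearSVC

# Toy corpus (hypothetical, not the real training data); 1 = spam, 0 = ham.
texts = [
    "win a free prize now", "claim your free cash reward",
    "free iphone click here", "urgent offer win money today",
    "cheap pills limited offer", "you won the lottery claim now",
    "meeting moved to monday", "please review the attached report",
    "lunch tomorrow at noon", "notes from the project call",
    "can you send the slides", "see you at the seminar",
]
labels = np.array([1] * 6 + [0] * 6)


def toy_metadata(text):
    """Hypothetical stand-in for the model's 24 hand-crafted metadata features."""
    return [len(text), len(text.split()), text.count("free")]


# Text features and scaled metadata features, stacked side by side.
tfidf = TfidfVectorizer()
X_text = tfidf.fit_transform(texts)
scaler = MinMaxScaler()
X_meta = scaler.fit_transform([toy_metadata(t) for t in texts])
X = hstack([X_text, csr_matrix(X_meta)]).tocsr()

# Soft voting averages the class probabilities of the three estimators.
ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("lr", LogisticRegression(max_iter=1000)),
        ("svm", CalibratedClassifierCV(LinearSVC(), cv=3)),
    ],
    voting="soft",
)
ensemble.fit(X, labels)
proba = ensemble.predict_proba(X)  # one row per email: [P(ham), P(spam)]
```

Each row of `proba` sums to 1 because soft voting averages three probability distributions.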

## Files

| File | Purpose |
|------|---------|
| `voting_model.joblib` | Trained VotingClassifier ensemble (145 MB) |
| `tfidf_vectorizer.joblib` | Fitted TF-IDF vectorizer |
| `meta_scaler.joblib` | MinMaxScaler for metadata features |
| `feature_names.joblib` | Feature name list for explainability |
| `optimal_threshold.joblib` | Calibrated decision threshold |
| `training_sample.joblib` | Sample of training data for LIME/SHAP |
| `training_report.json` | Training metrics and classification report |
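
This card does not state how the value in `optimal_threshold.joblib` was derived. One common approach, shown here purely as an assumption, is to pick the threshold that maximizes F1 on held-out validation data. The `pick_threshold` helper and the validation arrays below are hypothetical.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve


def pick_threshold(y_true, spam_proba):
    """Return the probability threshold with the highest F1 score (illustrative)."""
    precision, recall, thresholds = precision_recall_curve(y_true, spam_proba)
    # precision and recall have one more entry than thresholds; drop the final point.
    p, r = precision[:-1], recall[:-1]
    f1 = 2 * p * r / np.clip(p + r, 1e-12, None)
    return float(thresholds[np.argmax(f1)])


# Hypothetical validation labels (1 = spam) and predicted spam probabilities.
y_val = np.array([0, 0, 0, 1, 0, 1, 1, 1])
p_val = np.array([0.05, 0.20, 0.35, 0.40, 0.55, 0.60, 0.80, 0.95])

threshold = pick_threshold(y_val, p_val)
predictions = (p_val >= threshold).astype(int)
```

A threshold chosen this way trades precision against recall. A filter that must avoid flagging legitimate mail might prefer a higher value than the F1 optimum.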

## Usage

```python
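import joblib
from scipy.sparse import csr_matrix, hstack

# Project helpers from this repository's utils module
from utils import preprocess_text, compute_metadata_features

# Load the trained ensemble and the fitted preprocessing artifacts
model = joblib.load("voting_model.joblib")
tfidf = joblib.load("tfidf_vectorizer.joblib")
scaler = joblib.load("meta_scaler.joblib")
threshold = joblib.load("optimal_threshold.joblib")

email = "Congratulations! You've won a free iPhone!"

# Build one feature row: TF-IDF text features followed by scaled metadata features
text_features = tfidf.transform([preprocess_text(email)])
meta_features = scaler.transform([compute_metadata_features(email)])
features = hstack([text_features, csr_matrix(meta_features)])

# Probability for class index 1 (treated as spam), compared against the saved threshold
proba = model.predict_proba(features)[0][1]
label = "SPAM" if proba >= threshold else "HAM"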
```

## Training Data

- [VoltageVagabond/spam-email-dataset](https://huggingface.co/datasets/VoltageVagabond/spam-email-dataset)
- Sources: Kaggle 190K spam/ham + GitHub email-dataset

## Interactive Demo

- [Gradio Space](https://huggingface.co/spaces/VoltageVagabond/spam-classifier-gradio)