---
library_name: sklearn
tags:
- spam-detection
- scikit-learn
- ensemble
- tfidf
- lime
- shap
- nlp
- text-classification
license: mit
pipeline_tag: text-classification
---

# Spam Email Classifier - sklearn Voting Ensemble (Gradio)

**ENGT 375 - Applied Machine Learning | Spring 2026 | ODU**

> **Disclaimer:** This model was created as a student project for Applied Machine Learning at Old Dominion University. It is intended for **educational and research purposes only** and should not be used as the sole spam/phishing filter in production. Classification accuracy may vary, and the model may produce incorrect or misleading results. Always use established email security tools for real-world spam filtering.

A soft-voting ensemble classifier (Random Forest + Logistic Regression + calibrated linear SVM) for spam email detection, with LIME and SHAP explainability support.

## Model Details

- **Architecture:** VotingClassifier (soft voting)
  - Random Forest
  - Logistic Regression
  - Calibrated LinearSVC
- **Features:** TF-IDF (text) + 24 hand-crafted metadata features
- **Framework:** scikit-learn
- **Task:** Binary classification (spam / ham)

## Files

| File | Purpose |
|------|---------|
| `voting_model.joblib` | Trained VotingClassifier ensemble (145 MB) |
| `tfidf_vectorizer.joblib` | Fitted TF-IDF vectorizer |
| `meta_scaler.joblib` | MinMaxScaler for metadata features |
| `feature_names.joblib` | Feature name list for explainability |
| `optimal_threshold.joblib` | Calibrated decision threshold |
| `training_sample.joblib` | Sample of training data for LIME/SHAP |
| `training_report.json` | Training metrics and classification report |

## Usage

```python
import joblib
from scipy.sparse import csr_matrix, hstack

# Helper functions from the project's utils module
from utils import preprocess_text, compute_metadata_features

# Load the trained ensemble and the fitted preprocessing artifacts
model = joblib.load("voting_model.joblib")
tfidf = joblib.load("tfidf_vectorizer.joblib")
scaler = joblib.load("meta_scaler.joblib")
threshold = joblib.load("optimal_threshold.joblib")

email = "Congratulations! You've won a free iPhone!"

# Build the feature row: TF-IDF text features followed by scaled metadata features
text_features = tfidf.transform([preprocess_text(email)])
meta_features = scaler.transform([compute_metadata_features(email)])
features = hstack([text_features, csr_matrix(meta_features)]).tocsr()

# Column 1 of predict_proba is the spam probability; compare it to the saved threshold
proba = model.predict_proba(features)[0][1]
label = "SPAM" if proba >= threshold else "HAM"
```

## Training Data

- [VoltageVagabond/spam-email-dataset](https://huggingface.co/datasets/VoltageVagabond/spam-email-dataset)
- Sources: Kaggle 190K spam/ham + GitHub email-dataset

## Interactive Demo

- [Gradio Space](https://huggingface.co/spaces/VoltageVagabond/spam-classifier-gradio)
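
## How Soft Voting and the Threshold Interact

The Model Details section describes a soft-voting ensemble, and the Usage snippet compares the ensemble's spam probability against a saved threshold. The following is a minimal, dependency-free sketch of that idea, not the project's actual code. In scikit-learn, a `VotingClassifier` with `voting="soft"` averages the class probabilities predicted by its members. The per-model probabilities, the example thresholds, and the helper names `soft_vote` and `classify` are all hypothetical, and uniform weights are assumed because this model card does not state any.

```python
from statistics import fmean


def soft_vote(spam_probabilities, weights=None):
    """Combine per-model spam probabilities into one ensemble probability.

    With no weights this is a plain average, which is what soft voting
    does when every member counts equally (an assumption here).
    """
    if weights is None:
        return fmean(spam_probabilities)
    total = sum(weights)
    return sum(p * w for p, w in zip(spam_probabilities, weights)) / total


def classify(spam_probabilities, threshold=0.5):
    """Label an email by comparing the averaged probability to a threshold."""
    proba = soft_vote(spam_probabilities)
    label = "SPAM" if proba >= threshold else "HAM"
    return label, proba


# Hypothetical spam probabilities for one email from the three members:
# Random Forest, Logistic Regression, calibrated LinearSVC.
member_probas = [0.9, 0.6, 0.3]

label, proba = classify(member_probas, threshold=0.5)
print(label, round(proba, 3))

# A stricter (hypothetical) threshold flips the same email to HAM.
strict_label, _ = classify(member_probas, threshold=0.7)
print(strict_label)
```

The sketch shows why the threshold is stored as its own artifact: the averaged probability stays the same, and only the cut-off decides how many borderline emails get flagged as spam.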