DistilBERT SMS Spam Classifier

Model Description

This model is a fine‑tuned version of DistilBERT‑base‑uncased for binary classification of SMS messages as ham (0) or spam (1).
It was trained on the UCI SMS Spam Collection, which contains 5,574 labeled SMS messages.

⚠️ This model was created for educational and research purposes only. It is not intended for production use.
It demonstrates a complete fine‑tuning pipeline for an imbalanced text classification task, including baseline comparison and evaluation.

  • Developed by: lorcannrauzduel
  • Model type: Transformer encoder (DistilBERT) with a classification head
  • Language: English
  • License: Apache 2.0 (same as DistilBERT)
  • Finetuned from model: distilbert-base-uncased

Uses

Direct Use (Research / Experimentation)

You can use this model directly with the Hugging Face transformers library to classify any English SMS.

from transformers import pipeline

classifier = pipeline("text-classification", model="lorcannrauzduel/distilbert-sms-spam")
result = classifier("WINNER!! You've won a free iPhone!")
print(result)  # [{'label': 'spam', 'score': 0.998}]

Or with a custom function:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "lorcannrauzduel/distilbert-sms-spam"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

def predict(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        logits = model(**inputs).logits
    return model.config.id2label[logits.argmax().item()]

print(predict("See you at the meeting tomorrow"))  # ham

Out‑of‑Scope Use

  • The model is not intended for languages other than English.
  • Messages longer than 128 tokens are truncated, so the model may not generalise well to very long messages.
  • The dataset was collected in the early 2010s; modern spam patterns may not be captured.
  • Not suitable for any commercial or critical application.
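The 128-token limit above can be checked before classifying a message. The following is a minimal sketch, not part of the original pipeline: `will_be_truncated` is a hypothetical helper, and the whitespace tokenizer is only a stand-in so the example runs without downloading anything. It assumes the tokenizer adds two special tokens ([CLS] and [SEP]) around a single sequence, as BERT-style tokenizers do.

```python
def will_be_truncated(text, tokenize, max_length=128, num_special_tokens=2):
    """Return True if `text` would lose tokens under a `max_length` cap.

    `tokenize` is any callable mapping a string to a list of tokens.
    `num_special_tokens` reserves room for [CLS] and [SEP] (an assumption
    that holds for BERT-style tokenizers on single sequences).
    """
    return len(tokenize(text)) + num_special_tokens > max_length


# Stand-in tokenizer for illustration only: plain whitespace splitting.
short_msg = "See you at the meeting tomorrow"
long_msg = "free " * 200

print(will_be_truncated(short_msg, str.split))
print(will_be_truncated(long_msg, str.split))
```

With the real model, passing `tokenizer.tokenize` (from the `AutoTokenizer` loaded earlier) instead of `str.split` should give the WordPiece count that actually matters.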

Bias, Risks, and Limitations

  • The training set is imbalanced (87% ham, 13% spam). Although the weighted F1‑score is high, spam is detected less reliably than ham (per‑class F1 of 0.96 versus 0.99 on the test set), so the model may still favour the majority class.
  • No explicit bias evaluation was performed.
  • The model may fail on messages containing unusual characters or emojis not seen during training.
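One common way to trade precision against recall on an imbalanced task is to threshold the spam probability instead of taking the argmax. This card does not report doing so; the sketch below is only an illustration. The function names, the example logits, and the 0.35 threshold are all hypothetical, and the logit order `[ham, spam]` follows the label mapping stated above (ham = 0, spam = 1).

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(logits, spam_threshold=0.5):
    """Label a message as spam when P(spam) >= spam_threshold.

    `logits` is [ham_logit, spam_logit]. For two classes, a threshold
    of 0.5 matches the argmax decision (ties aside).
    """
    p_spam = softmax(logits)[1]
    label = "spam" if p_spam >= spam_threshold else "ham"
    return label, p_spam


# Hypothetical logits for a borderline message (not real model output).
borderline = [0.4, 0.0]
print(classify(borderline))                       # default threshold
print(classify(borderline, spam_threshold=0.35))  # more spam-sensitive
```

Lowering the threshold raises spam recall at the cost of more ham flagged as spam; any such value would need to be chosen on the validation set.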

How to Get Started

from transformers import pipeline

classifier = pipeline("text-classification", model="lorcannrauzduel/distilbert-sms-spam")
print(classifier("Congratulations! You've been selected as a winner."))

Training Details

Training Data

  • Dataset: ucirvine/sms_spam (5,574 SMS)
  • Split (stratified by label to preserve the 87/13 class ratio):
    • Train: 4,026 (72%)
    • Validation: 711 (13%)
    • Test: 837 (15%)
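The split sizes above are consistent with two successive 15% hold-outs, first for the test set and then for the validation set, with each hold-out size rounded up. That procedure is an inference from the numbers, not something the card states. The short sketch below only reproduces the arithmetic; `holdout_size` is a hypothetical helper.

```python
import math


def holdout_size(n_samples, fraction):
    """Size of a held-out split when the fractional size is rounded up."""
    return math.ceil(fraction * n_samples)


total = 5574                                # messages in the dataset
test = holdout_size(total, 0.15)            # first hold-out: test set
remaining = total - test
validation = holdout_size(remaining, 0.15)  # second hold-out: validation set
train = remaining - validation

print(train, validation, test)  # 4026 711 837
print([round(100 * n / total) for n in (train, validation, test)])  # [72, 13, 15]
```

Rounding the held-out size up is, as far as I know, how scikit-learn's `train_test_split` treats a fractional `test_size`, which would explain why the validation share is 13% rather than 15% of the full dataset.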

Training Procedure

  • Tokenizer: distilbert-base-uncased with max_length=128 (padding & truncation)
  • Hardware: NVIDIA Tesla T4 (Google Colab)
  • Training regime: fp16 mixed precision

Hyperparameters

Parameter             Value
Learning rate         2e-5
Batch size (train)    16
Batch size (eval)     32
Number of epochs      3
Warmup ratio          0.1
Weight decay          0.01
Optimizer             AdamW
Loss function         Cross‑entropy
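To make the hyperparameters above concrete, the sketch below works out the number of optimizer steps and the learning rate at a given step. It rests on assumptions the card does not state: a single GPU, no gradient accumulation, the 4,026-example training split, and a linear warmup followed by linear decay (the default schedule in the Hugging Face `Trainer`). `lr_at` is a hypothetical helper.

```python
import math

TRAIN_EXAMPLES = 4026   # training split size from this card
BATCH_SIZE = 16
EPOCHS = 3
WARMUP_RATIO = 0.1
PEAK_LR = 2e-5

# The last partial batch still counts as one step.
steps_per_epoch = math.ceil(TRAIN_EXAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)


def lr_at(step):
    """Learning rate after `step` optimizer steps (linear warmup, linear decay)."""
    if step < warmup_steps:
        return PEAK_LR * step / max(1, warmup_steps)
    remaining = max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return PEAK_LR * remaining


print(steps_per_epoch, total_steps, warmup_steps)  # 252 756 76
print(lr_at(warmup_steps))                          # 2e-05
```

Under these assumptions the learning rate peaks at 2e-5 after 76 steps and decays to zero by step 756.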

Speeds, Sizes, Times

  • Total parameters: 66,955,010 (≈ 67.0M), all trainable
  • Training time: ≈ 1 minute on T4 GPU
  • Model size (PyTorch safetensors): ~268 MB
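The parameter count and file size above can be reproduced by hand. The sketch below assumes the standard distilbert-base-uncased configuration (30,522-token vocabulary, 512 positions, hidden size 768, feed-forward size 3,072, 6 layers) plus the two-layer classification head used for sequence classification, with every weight stored as a 4-byte float. None of these values appear in the card itself.

```python
VOCAB, MAX_POS, HIDDEN, FFN, LAYERS, LABELS = 30522, 512, 768, 3072, 6, 2


def linear(n_in, n_out):
    """Parameters in a dense layer: weight matrix plus bias."""
    return n_in * n_out + n_out


layer_norm = 2 * HIDDEN  # scale and shift vectors

# Word embeddings + position embeddings + embedding LayerNorm.
embeddings = VOCAB * HIDDEN + MAX_POS * HIDDEN + layer_norm

# One transformer block: Q, K, V and output projections, then the
# feed-forward network, each followed by a LayerNorm.
attention = 4 * linear(HIDDEN, HIDDEN)
feed_forward = linear(HIDDEN, FFN) + linear(FFN, HIDDEN)
block = attention + layer_norm + feed_forward + layer_norm

# Classification head: pre-classifier layer, then the 2-way classifier.
head = linear(HIDDEN, HIDDEN) + linear(HIDDEN, LABELS)

total = embeddings + LAYERS * block + head
print(total)                        # 66955010
print(round(total * 4 / 1e6, 1))    # 267.8 (MB at 4 bytes per parameter)
```

The result matches the reported total exactly, and 4 bytes per parameter gives roughly the 268 MB quoted for the safetensors file.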

Evaluation

Testing Data

The test set contains 837 messages (15% of the full dataset), with the same ham/spam ratio as the full dataset.

Metrics

  • Accuracy – overall correctness
  • Weighted F1‑score – primary metric due to class imbalance
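For reference, both metrics can be computed from a 2×2 confusion matrix as sketched below. The counts are illustrative only: they were chosen to sum to the 837 test messages with a roughly 87/13 class ratio, and they are not the model's actual confusion matrix. `prf` is a hypothetical helper.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 for one class from its raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Illustrative counts (NOT the real confusion matrix).
ham_correct, ham_as_spam = 721, 4     # true ham: 725
spam_correct, spam_as_ham = 108, 4    # true spam: 112

ham_support = ham_correct + ham_as_spam
spam_support = spam_correct + spam_as_ham
total = ham_support + spam_support    # 837

accuracy = (ham_correct + spam_correct) / total

# Each class is treated as "positive" in turn.
_, _, ham_f1 = prf(tp=ham_correct, fp=spam_as_ham, fn=ham_as_spam)
_, _, spam_f1 = prf(tp=spam_correct, fp=ham_as_spam, fn=spam_as_ham)

# Weighted F1: per-class F1 averaged by class support.
weighted_f1 = (ham_support * ham_f1 + spam_support * spam_f1) / total

print(round(accuracy, 3), round(weighted_f1, 3))  # 0.99 0.99
```

Because ham makes up most of the support, the weighted F1 is dominated by the ham score, which is why the per-class figures are worth reporting alongside it.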

Results

Model                               Accuracy    F1 (weighted)
Random baseline (untrained head)    70.3%       0.721
Fine‑tuned DistilBERT               99.0%       0.990

Per‑class performance (test set):

Class    Precision    Recall    F1-score
Ham      0.99         0.99      0.99
Spam     0.96         0.96      0.96

Environmental Impact

  • Hardware Type: NVIDIA Tesla T4 (Google Colab)
  • Hours used: < 0.1 hours (training only)
  • Cloud Provider: Google Colab
  • Compute Region: US (default)
  • Carbon Emitted: Negligible (< 0.01 kg CO₂eq)

Acknowledgements

  • The Hugging Face team for transformers, datasets, and evaluate.
  • The DistilBERT paper by Sanh et al. (2019).
  • The UCI SMS Spam Collection dataset.

License

Apache 2.0 (same as the original DistilBERT).


Model card created by lorcannrauzduel for research and experimentation purposes.
