DistilBERT SMS Spam Classifier

Model Description

This model is a fine‑tuned version of DistilBERT‑base‑uncased for binary classification of SMS messages as ham (0) or spam (1).
It was trained on the UCI SMS Spam Collection, which contains 5,574 labeled SMS messages.

⚠️ This model was created for educational and research purposes only. It is not intended for production use.
It demonstrates a complete fine‑tuning pipeline for an imbalanced text classification task, including baseline comparison and evaluation.

Developed by: lorcannrauzduel
Model type: Transformer encoder (DistilBERT) with a classification head
Language: English
License: Apache 2.0 (same as DistilBERT)
Finetuned from model: distilbert-base-uncased

Uses

Direct Use (Research / Experimentation)

You can use this model directly with the Hugging Face transformers library to classify any English SMS.

from transformers import pipeline

classifier = pipeline("text-classification", model="lorcannrauzduel/distilbert-sms-spam")
result = classifier("WINNER!! You've won a free iPhone!")
print(result)  # [{'label': 'spam', 'score': 0.998}]

Or with a custom function:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "lorcannrauzduel/distilbert-sms-spam"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

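def predict(text):
    # Tokenize a single message; anything beyond 128 tokens is truncated.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():  # inference only, no gradients needed
        logits = model(**inputs).logits
    # Map the highest-scoring class index back to its label name.
    return model.config.id2label[logits.argmax(dim=-1).item()]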

print(predict("See you at the meeting tomorrow"))  # ham
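The custom function returns only the label. If a confidence value is also wanted, the two logits can be passed through a softmax. The sketch below shows that conversion in plain Python; the logit values are invented for illustration, and `softmax` is a hypothetical helper rather than part of the model's API.

```python
import math

def softmax(logits):
    """Convert raw class logits into probabilities that sum to 1."""
    # Subtract the maximum first for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits in the order [ham, spam]; real values would come
# from model(**inputs).logits for a single message.
example_logits = [-2.0, 3.0]
probs = softmax(example_logits)
spam_probability = probs[1]
```

With torch tensors the same result should be available through `torch.softmax(logits, dim=-1)`.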

Out‑of‑Scope Use

The model is not intended for languages other than English.
Messages longer than 128 tokens are truncated, so the model may not generalise well to very long messages.
The dataset was collected in the early 2010s; modern spam patterns may not be captured.
Not suitable for any commercial or critical application.

Bias, Risks, and Limitations

The training set is imbalanced (87% ham, 13% spam). While the F1‑score is high, the model could still slightly favour the majority class.
No explicit bias evaluation was performed.
The model may fail on messages containing unusual characters or emojis not seen during training.

How to Get Started

from transformers import pipeline

classifier = pipeline("text-classification", model="lorcannrauzduel/distilbert-sms-spam")
print(classifier("Congratulations! You've been selected as a winner."))

Training Details

Training Data

Dataset: ucirvine/sms_spam (5,574 SMS)
Split (stratified by label to preserve the 87/13 class ratio):
- Train: 4,026 (72%)
- Validation: 711 (13%)
- Test: 837 (15%)
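A stratified three-way split can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the code used for this model: `stratified_split` is a hypothetical helper, the seed is arbitrary, and the labels are synthetic values at roughly the 87/13 ratio rather than the real dataset. Because the percentages above are rounded, the resulting part sizes will not exactly match the counts listed.

```python
import random
from collections import defaultdict

def stratified_split(labels, fractions=(0.72, 0.13, 0.15), seed=42):
    """Return (train, val, test) index lists, keeping the label ratio in each part."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    train, val, test = [], [], []
    for indices in by_label.values():
        # Shuffle within each class, then cut the class into three pieces.
        rng.shuffle(indices)
        n_val = round(len(indices) * fractions[1])
        n_test = round(len(indices) * fractions[2])
        n_train = len(indices) - n_val - n_test
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]
    return train, val, test

# Synthetic labels (0 = ham, 1 = spam) at roughly 87/13; not the real data.
labels = [0] * 4849 + [1] * 725
train_idx, val_idx, test_idx = stratified_split(labels)
```

Splitting each class separately is what keeps the spam share nearly identical across the three parts.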

Training Procedure

Tokenizer: distilbert-base-uncased with max_length=128 (padding & truncation)
Hardware: NVIDIA Tesla T4 (Google Colab)
Training regime: fp16 mixed precision
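To make the effect of max_length=128 concrete, the sketch below pads or truncates a list of token ids to a fixed length and builds the matching attention mask. It assumes padding to the full maximum length and uses the special-token ids of the distilbert-base-uncased vocabulary ([PAD]=0, [CLS]=101, [SEP]=102). `pad_or_truncate` is a hypothetical helper; in practice the tokenizer does all of this internally.

```python
PAD_ID, CLS_ID, SEP_ID = 0, 101, 102  # special-token ids, assumed from the base vocabulary

def pad_or_truncate(token_ids, max_length=128):
    """Wrap content ids in [CLS]/[SEP], then truncate or pad to max_length."""
    # Reserve two positions for the special tokens.
    body = token_ids[: max_length - 2]
    ids = [CLS_ID] + body + [SEP_ID]
    mask = [1] * len(ids)
    # Fill the remainder with padding that the attention mask marks as ignorable.
    padding = max_length - len(ids)
    return ids + [PAD_ID] * padding, mask + [0] * padding

# Hypothetical content ids; a short message is padded, a long one is cut.
short_ids, short_mask = pad_or_truncate(list(range(1000, 1005)), max_length=8)
long_ids, long_mask = pad_or_truncate(list(range(1000, 1200)), max_length=128)
```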

Hyperparameters

Parameter	Value
Learning rate	2e-5
Batch size (train)	16
Batch size (eval)	32
Number of epochs	3
Warmup ratio	0.1
Weight decay	0.01
Optimizer	AdamW
Loss function	Cross‑entropy
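The hyperparameters above imply a fairly short training run. The arithmetic below estimates the number of optimizer steps and warm-up steps, assuming a single GPU, no gradient accumulation, and that partial batches and fractional warm-up steps are rounded up. These rounding conventions are assumptions and are not stated in this card.

```python
import math

train_examples = 4026  # training split size from this card
batch_size = 16
epochs = 3
warmup_ratio = 0.1

# One batch equals one optimizer step under the assumptions above.
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
# Rounding the warm-up length up is an assumption.
warmup_steps = math.ceil(total_steps * warmup_ratio)

print(steps_per_epoch, total_steps, warmup_steps)
```

Under these assumptions the run has 252 steps per epoch and 756 steps in total, the first 76 of which are warm-up.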

Speeds, Sizes, Times

Total parameters: 66,955,010 (≈67.0M) – all trainable
Training time: ≈ 1 minute on T4 GPU
Model size (PyTorch safetensors): ~268 MB
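The reported file size follows from the parameter count if each weight is stored as a 32-bit float. That storage type is an assumption here, and the small safetensors header is ignored. A quick check:

```python
total_parameters = 66_955_010
bytes_per_parameter = 4  # assumes float32 storage

size_bytes = total_parameters * bytes_per_parameter
size_mb = size_bytes / 1e6       # decimal megabytes
size_mib = size_bytes / 2**20    # binary mebibytes

print(f"{size_mb:.1f} MB, {size_mib:.1f} MiB")
```

The ~268 MB figure corresponds to decimal megabytes; tools that report binary units will show roughly 255 MiB for the same file.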

Evaluation

Testing Data

The test set contains 837 messages (15% of the full dataset), with the same class proportions as the full dataset.

Metrics

Accuracy – overall correctness
Weighted F1‑score – primary metric due to class imbalance
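To show how the two metrics differ on imbalanced data, the sketch below computes accuracy and weighted F1 in plain Python for a degenerate classifier that always predicts ham. The labels are synthetic (87 ham, 13 spam), and `weighted_f1` is a hypothetical helper written for illustration, not the evaluation code used for this model.

```python
def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's share of y_true."""
    total = len(y_true)
    score = 0.0
    for label in sorted(set(y_true)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        support = tp + fn  # number of true examples of this class
        score += f1 * support / total
    return score

# Synthetic data: 0 = ham, 1 = spam, with every message predicted as ham.
y_true = [0] * 87 + [1] * 13
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
f1 = weighted_f1(y_true, y_pred)
```

On this toy data accuracy is 0.87 while weighted F1 is about 0.81, because the spam class contributes an F1 of zero.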

Results

Model	Accuracy	F1 (weighted)
Random baseline (untrained head)	70.3%	0.721
Fine‑tuned DistilBERT	99.0%	0.990

Per‑class performance (test set):

Class	Precision	Recall	F1-score
Ham	0.99	0.99	0.99
Spam	0.96	0.96	0.96
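As a rough consistency check, the per-class F1 scores can be combined with the approximate 87/13 class shares to recover the weighted F1. The exact test-set supports are not listed here, so the shares below are an approximation.

```python
# Per-class F1 from the table above, weighted by approximate class share.
f1_ham, f1_spam = 0.99, 0.96
share_ham, share_spam = 0.87, 0.13

approx_weighted_f1 = f1_ham * share_ham + f1_spam * share_spam
print(round(approx_weighted_f1, 4))
```

The result, about 0.986, rounds to the reported 0.99 at two decimals.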

Environmental Impact

Hardware Type: NVIDIA Tesla T4 (Google Colab)
Hours used: < 0.1 hours (training only)
Cloud Provider: Google Colab
Compute Region: US (default)
Carbon Emitted: Negligible (< 0.01 kg CO₂eq)

Acknowledgements

The Hugging Face team for transformers, datasets, and evaluate.
The DistilBERT paper by Sanh et al. (2019).
The UCI SMS Spam Collection dataset.

License

Apache 2.0 (same as the original DistilBERT).

Model card created by lorcannrauzduel for research and experimentation purposes.
