DistilBERT SMS Spam Classifier
Model Description
This model is a fine‑tuned version of DistilBERT‑base‑uncased for binary classification of SMS messages as ham (0) or spam (1).
It was trained on the UCI SMS Spam Collection, which contains 5,574 labeled SMS messages.
⚠️ This model was created for educational and research purposes only. It is not intended for production use.
It demonstrates a complete fine‑tuning pipeline for an imbalanced text classification task, including baseline comparison and evaluation.
- Developed by: lorcannrauzduel
- Model type: Transformer encoder (DistilBERT) with a classification head
- Language: English
- License: Apache 2.0 (same as DistilBERT)
- Finetuned from model: distilbert-base-uncased
Uses
Direct Use (Research / Experimentation)
You can use this model directly with the Hugging Face transformers library to classify any English SMS.
from transformers import pipeline
classifier = pipeline("text-classification", model="lorcannrauzduel/distilbert-sms-spam")
result = classifier("WINNER!! You've won a free iPhone!")
print(result) # [{'label': 'spam', 'score': 0.998}]
Or with a custom function:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
model_name = "lorcannrauzduel/distilbert-sms-spam"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
def predict(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        logits = model(**inputs).logits
    return model.config.id2label[logits.argmax().item()]
print(predict("See you at the meeting tomorrow")) # ham
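The label returned by predict is the argmax of the logits, while the score reported by the pipeline for a two-label classifier like this one is the softmax probability of that label. A minimal sketch of that conversion in plain Python follows; the logits are made-up illustrative values, not outputs of this model, and the label mapping follows the ham (0) / spam (1) convention stated above.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for one message, ordered as [ham, spam].
id2label = {0: "ham", 1: "spam"}
example_logits = [-3.1, 3.2]

probs = softmax(example_logits)
best = max(range(len(probs)), key=probs.__getitem__)  # index of the largest probability
print(id2label[best], round(probs[best], 3))
```

Returning the probability alongside the label makes it possible to inspect borderline messages rather than only the hard decision.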
Out‑of‑Scope Use
- The model is not intended for languages other than English.
- It may not generalise well to very long messages: inputs longer than 128 tokens are truncated.
- The dataset was collected in the early 2010s; modern spam patterns may not be captured.
- Not suitable for any commercial or critical application.
Bias, Risks, and Limitations
- The training set is imbalanced (87% ham, 13% spam). Although the weighted F1‑score is high, the model may still favour the majority class: on the test set, precision and recall are 0.96 for spam versus 0.99 for ham.
- No explicit bias evaluation was performed.
- The model may fail on messages containing unusual characters or emojis not seen during training.
How to Get Started
from transformers import pipeline
classifier = pipeline("text-classification", model="lorcannrauzduel/distilbert-sms-spam")
print(classifier("Congratulations! You've been selected as a winner."))
Training Details
Training Data
- Dataset: ucirvine/sms_spam (5,574 SMS)
- Split:
- Train: 4,026 (72%)
- Validation: 711 (13%)
- Test: 837 (15%)
- Stratified by label to preserve the 87/13 class ratio.
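A stratified split can be sketched in plain Python as below. This is an illustration only: the card does not say which tool or random seed was used, the two-stage scheme (hold out a test share, then a validation share of the remainder) is an assumption, and the function name and synthetic labels are hypothetical. In practice scikit-learn's train_test_split with its stratify argument is a common choice.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_frac=0.15, val_frac=0.15, seed=42):
    """Return (train, val, test) index lists that preserve label proportions.

    val_frac is applied to what remains after the test split (assumed scheme).
    """
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)
    train, val, test = [], [], []
    for indices in by_label.values():
        rng.shuffle(indices)  # shuffle within each class separately
        n_test = round(len(indices) * test_frac)
        n_val = round((len(indices) - n_test) * val_frac)
        test.extend(indices[:n_test])
        val.extend(indices[n_test:n_test + n_val])
        train.extend(indices[n_test + n_val:])
    return train, val, test

# Synthetic 87/13 labels, not the real dataset.
labels = [0] * 870 + [1] * 130
train_idx, val_idx, test_idx = stratified_split(labels)
for name, idx in [("train", train_idx), ("val", val_idx), ("test", test_idx)]:
    spam_share = sum(labels[i] for i in idx) / len(idx)
    print(name, len(idx), round(spam_share, 3))
```

Splitting each class separately is what keeps the spam share close to 13% in every subset, which matters when the minority class has only a few hundred examples.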
Training Procedure
- Tokenizer: distilbert-base-uncased with max_length=128 (padding & truncation)
- Hardware: NVIDIA Tesla T4 (Google Colab)
- Training regime: fp16 mixed precision
Hyperparameters
| Parameter | Value |
|---|---|
| Learning rate | 2e-5 |
| Batch size (train) | 16 |
| Batch size (eval) | 32 |
| Number of epochs | 3 |
| Warmup ratio | 0.1 |
| Weight decay | 0.01 |
| Optimizer | AdamW |
| Loss function | Cross‑entropy |
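To give a sense of scale, the hyperparameters above imply a fairly short run. The arithmetic below is a rough sketch that assumes a single device, no gradient accumulation, a kept final partial batch, and a warmup ratio converted to steps by rounding up; none of these details are stated in the card.

```python
import math

train_examples = 4026   # training split size from the card
batch_size = 16         # train batch size from the table
epochs = 3
warmup_ratio = 0.1

# Assumes one device, no gradient accumulation, last partial batch kept.
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs

# Assumes the ratio is converted to a step count by rounding up.
warmup_steps = math.ceil(total_steps * warmup_ratio)

print(steps_per_epoch, total_steps, warmup_steps)
```

Under these assumptions the run has 252 optimizer steps per epoch and 756 in total, with the learning rate ramping up over roughly the first 76.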
Speeds, Sizes, Times
- Total parameters: 66,955,010 (≈67.0M) – all trainable
- Training time: ≈ 1 minute on T4 GPU
- Model size (PyTorch safetensors): ~268 MB
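The parameter count and file size can be sanity-checked with a little arithmetic. The sketch below assumes weights stored as 32-bit floats and the usual DistilBERT sequence-classification head (a hidden-to-hidden pre-classifier layer followed by a hidden-to-labels classifier); both are assumptions, since the card does not describe the head or the storage dtype.

```python
hidden = 768                # DistilBERT-base hidden size
num_labels = 2              # ham / spam
total_params = 66_955_010   # figure reported in the card

# Assumed head layout: Linear(hidden, hidden) then Linear(hidden, num_labels),
# each with a bias vector.
pre_classifier = hidden * hidden + hidden
classifier = hidden * num_labels + num_labels
head_params = pre_classifier + classifier

# Whatever is left is attributed to the encoder.
encoder_params = total_params - head_params

size_mb = total_params * 4 / 1e6  # 4 bytes per fp32 weight
print(head_params, encoder_params, round(size_mb, 1))
```

With these assumptions the classification head accounts for well under 1% of the weights, and the fp32 size comes out close to the ~268 MB listed above.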
Evaluation
Testing Data
The test set contains 837 messages (15% of the full dataset), with the same class proportion as the original.
Metrics
- Accuracy – overall correctness
- Weighted F1‑score – primary metric due to class imbalance
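To illustrate why accuracy alone is a weak signal on this data, the sketch below computes accuracy and weighted F1 for a trivial classifier that always predicts ham. The class counts are assumptions derived from the approximate 87/13 ratio applied to the 837-message test set, not the true test-set counts, and the helper names are hypothetical.

```python
def per_class_f1(tp, fp, fn):
    """F1 for one class from its true positives, false positives, false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def weighted_f1(y_true, y_pred):
    """Average of per-class F1 scores, weighted by each class's support."""
    total = len(y_true)
    score = 0.0
    for label in sorted(set(y_true)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        score += ((tp + fn) / total) * per_class_f1(tp, fp, fn)
    return score

# Assumed counts: roughly 87% ham and 13% spam out of 837 test messages.
n_ham, n_spam = 728, 109
y_true = [0] * n_ham + [1] * n_spam
y_pred = [0] * len(y_true)  # always predict ham

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
majority_f1 = weighted_f1(y_true, y_pred)
print(round(accuracy, 3), round(majority_f1, 3))
```

Under these assumed counts the always-ham predictor reaches about 87% accuracy without detecting a single spam message, while its weighted F1 is noticeably lower because the spam class contributes an F1 of zero.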
Results
| Model | Accuracy | F1 (weighted) |
|---|---|---|
| Random baseline (untrained head) | 70.3% | 0.721 |
| Fine‑tuned DistilBERT | 99.0% | 0.990 |
Per‑class performance (test set):
| Class | Precision | Recall | F1-score |
|---|---|---|---|
| Ham | 0.99 | 0.99 | 0.99 |
| Spam | 0.96 | 0.96 | 0.96 |
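As a consistency check, the per-class rows can be combined into a weighted F1 and compared with the headline figure. This sketch uses the rounded table values and the approximate 87/13 class shares, so it can only be expected to agree to about two decimal places.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall come from the per-class table; the shares are the
# approximate 87/13 class ratio, not exact test-set supports.
rows = {
    "ham": {"precision": 0.99, "recall": 0.99, "share": 0.87},
    "spam": {"precision": 0.96, "recall": 0.96, "share": 0.13},
}

weighted = sum(r["share"] * f1(r["precision"], r["recall"]) for r in rows.values())
print(round(weighted, 2))
```

The rounded inputs give roughly 0.986, which agrees with the reported 0.990 at two decimals; reproducing the third decimal would need the raw predictions.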
Environmental Impact
- Hardware Type: NVIDIA Tesla T4 (Google Colab)
- Hours used: < 0.1 hours (training only)
- Cloud Provider: Google Colab
- Compute Region: US (default)
- Carbon Emitted: Negligible (< 0.01 kg CO₂eq)
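The "negligible" figure can be checked with a back-of-the-envelope estimate. The sketch below uses the T4's 70 W board power and the 0.1 hour upper bound from the list above; the grid carbon intensity is an assumed illustrative value, and CPU, memory, and datacenter overhead are ignored.

```python
gpu_power_kw = 0.070     # T4 board power limit: 70 W
hours = 0.1              # upper bound on training time from the card
grid_intensity = 0.4     # kg CO2eq per kWh; ASSUMED illustrative value

energy_kwh = gpu_power_kw * hours
emissions_kg = energy_kwh * grid_intensity
print(f"{energy_kwh:.4f} kWh, {emissions_kg:.4f} kg CO2eq")
```

Even with the GPU running at its full power limit for the whole 0.1 hours, the estimate stays below the 0.01 kg CO₂eq bound under this assumed intensity.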
Acknowledgements
- The Hugging Face team for transformers, datasets, and evaluate.
- The DistilBERT paper by Sanh et al. (2019).
- The UCI SMS Spam Collection dataset.
License
Apache 2.0 (same as the original DistilBERT).
Model card created by lorcannrauzduel for research and experimentation purposes.