---
language: en
tags:
- text-classification
- youtube-comments
- fitness
- roberta
datasets:
- Krat6s/fitness-youtube-comments
metrics:
- accuracy
- f1
---

# Fitness YouTube Comment Classifier — RoBERTa

Fine-tuned `roberta-base` that classifies YouTube comments from fitness influencer videos into 5 categories: `fitness`, `nutrition`, `motivational`, `challenge`, `product`.

Part of a three-experiment study measuring how data volume and model size affect classification accuracy on a self-scraped fitness influencer comment dataset.

---

## Quick Start

```python
from transformers import pipeline

classifier = pipeline(
    'text-classification',
    model='Krat6s/fitness-comment-classifier-roberta'
)

classifier("This protein shake recipe changed my life, tastes amazing with oat milk")
# [{'label': 'nutrition', 'score': 0.956}]

classifier("I've been doing this workout for 30 days and I can see abs forming!")
# [{'label': 'fitness', 'score': 0.965}]
```

---

## Model Description

- **Base model:** `roberta-base` (FacebookAI, 125M parameters)
- **Task:** Multi-class text classification (5 classes)
- **Domain:** YouTube comments from fitness influencer channels
- **Language:** English (the dataset contains non-English comments, which the model does not handle)

---

## Dataset

Self-scraped YouTube comments collected via the YouTube Data API v3 for MSc dissertation research on fitness influencer sentiment and thematic analysis.

- **Total dataset:** 92,223 comments across 94 fitness influencer channels
- **Top channels:** Noel Deyzel, Browney, Jeff Nippard, Renaissance Periodization, ATHLEAN-X
- **HuggingFace dataset:** [Krat6s/fitness-youtube-comments](https://huggingface.co/datasets/Krat6s/fitness-youtube-comments)

### Class Distribution (Full Dataset)

| Class | Count |
|-------|-------|
| challenge | 20,923 |
| nutrition | 20,506 |
| fitness | 19,990 |
| motivational | 19,928 |
| product | 10,749 |

---

## Training

### Data Splits (20,000-row stratified sample)

| Split | Size |
|-------|------|
| Train | 14,000 |
| Validation | 3,000 |
| Test | 3,000 |
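
A minimal sketch of a 70/15/15 stratified split, for readers who want to reproduce the proportions above. The `stratified_split` helper, the seed, and the toy labels are illustrative assumptions, not the code used to produce this model.

```python
import random
from collections import defaultdict

def stratified_split(labels, fractions=(0.70, 0.15, 0.15), seed=42):
    """Return (train, val, test) index lists that keep each class's share."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train, val, test = [], [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_train = round(len(indices) * fractions[0])
        n_val = round(len(indices) * fractions[1])
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to test
    return train, val, test

# Toy labels for illustration only (not the real dataset):
# four classes of 400 rows and a smaller "product" class of 200 rows.
toy_labels = (["challenge"] * 400 + ["nutrition"] * 400 + ["fitness"] * 400
              + ["motivational"] * 400 + ["product"] * 200)

train_idx, val_idx, test_idx = stratified_split(toy_labels)
print(len(train_idx), len(val_idx), len(test_idx))  # 1260 270 270
```

Splitting within each class keeps the smaller `product` class at the same share in all three splits.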

### Hyperparameters

| Parameter | Value |
|-----------|-------|
| Learning rate | 2e-5 |
| Epochs | 3 |
| Batch size (train) | 16 |
| Batch size (eval) | 32 |
| Max sequence length | 128 |
| Warmup steps | 50 |
| Weight decay | 0.01 |
| Optimizer | AdamW |
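
To put the warmup setting in context, the sketch below works out the step counts implied by the table. It assumes the train batch size of 16 is the global batch size per optimizer step; on a two-GPU setup the effective batch may be larger, which would reduce the step count.

```python
import math

train_rows = 14_000   # train split size from the table above
batch_size = 16       # assumption: global batch size per optimizer step
epochs = 3
warmup_steps = 50

steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs
warmup_fraction = warmup_steps / total_steps

print(steps_per_epoch, total_steps, f"{warmup_fraction:.1%}")  # 875 2625 1.9%
```

Under that assumption, warmup covers roughly the first 2% of training.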

### Training Curve

| Epoch | Train Loss | Val Loss | Val Accuracy | Val F1 |
|-------|-----------|----------|--------------|--------|
| 1 | 2.495 | 2.126 | 0.592 | 0.595 |
| 2 | 1.934 | 2.059 | 0.607 | 0.609 |
| 3 | 1.638 | 2.102 | 0.614 | 0.614 |

- **Hardware:** Kaggle T4 x2 GPU
- **Training time:** 643 seconds (~10.7 minutes)

---

## Evaluation Results (Test Set — 3,000 samples)

### Overall

| Metric | Score |
|--------|-------|
| Accuracy | 62.5% |
| F1 (weighted) | 62.5% |

### Per-Class

| Class | Precision | Recall | F1 | Support |
|-------|-----------|--------|----|---------|
| challenge | 0.62 | 0.58 | 0.60 | 685 |
| fitness | 0.63 | 0.67 | 0.65 | 647 |
| motivational | 0.56 | 0.66 | 0.61 | 641 |
| nutrition | 0.69 | 0.65 | 0.67 | 671 |
| product | 0.65 | 0.53 | 0.58 | 356 |
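
As a consistency check, the per-class F1 and the weighted F1 can be recomputed from the precision, recall, and support columns above. This is a sketch that uses the rounded table values, so results can differ from the exact figures in the last digit.

```python
# (precision, recall, support) per class, copied from the table above
per_class = {
    "challenge":    (0.62, 0.58, 685),
    "fitness":      (0.63, 0.67, 647),
    "motivational": (0.56, 0.66, 641),
    "nutrition":    (0.69, 0.65, 671),
    "product":      (0.65, 0.53, 356),
}

def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

class_f1 = {name: f1_score(p, r) for name, (p, r, _) in per_class.items()}
total_support = sum(s for _, _, s in per_class.values())

# Weighted F1: each class's F1 weighted by its share of the test set
weighted_f1 = sum(class_f1[name] * s
                  for name, (_, _, s) in per_class.items()) / total_support

for name, value in class_f1.items():
    print(f"{name:<13}{value:.2f}")
print(f"weighted F1  {weighted_f1:.3f}")
```

The weighted value comes out at about 0.625, in line with the reported 62.5%.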

### Baseline Comparisons

| Model | Accuracy |
|-------|----------|
| Majority class baseline | 22.8% |
| Pretrained RoBERTa (no fine-tuning) | 21.6% |
| Fine-tuned RoBERTa (this model) | 62.5% |
| Improvement over baseline | +39.7pp |
| Improvement from fine-tuning | +40.9pp |
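
The majority class figure can be reproduced from the test set support counts. The sketch assumes the baseline always predicts the most frequent class, which is `challenge` in this split.

```python
# Test set support per class, from the per-class results table
test_support = {
    "challenge": 685,
    "fitness": 647,
    "motivational": 641,
    "nutrition": 671,
    "product": 356,
}

majority_label = max(test_support, key=test_support.get)
majority_acc = test_support[majority_label] / sum(test_support.values())

fine_tuned_acc = 0.625  # reported test accuracy of this model
gain_pp = (fine_tuned_acc - majority_acc) * 100

print(majority_label, f"{majority_acc:.1%}", f"+{gain_pp:.1f}pp")
# challenge 22.8% +39.7pp
```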

---

## Experiment Comparison — Data Scaling + Model Scaling

Three experiments were run on the same dataset with the same evaluation pipeline, changing one variable at a time.

| Model | Parameters | Training Data | Accuracy | F1 | Train Time |
|-------|-----------|--------------|----------|----|------------|
| DistilBERT | 66M | 5,000 rows | 53.6% | 53.8% | 81s |
| DistilBERT | 66M | 20,000 rows | 60.4% | 60.4% | 327s |
| RoBERTa (this model) | 125M | 20,000 rows | 62.5% | 62.5% | 643s |

**Key findings:**
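- Data scaling (5K → 20K rows, DistilBERT): +6.8pp accuracy for roughly 4x the training time (81s → 327s)
- Model scaling (DistilBERT → RoBERTa, 20K rows): +2.1pp accuracy for roughly 2x the training time (327s → 643s)
- In these runs, quadrupling the data improved accuracy more than moving to the larger model did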
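
The deltas in the findings above follow directly from the comparison table. A short sketch of the arithmetic (the `runs` dictionary keys are illustrative names, not identifiers from the training code):

```python
# Accuracy (%) and training time (s) from the comparison table
runs = {
    "distilbert_5k":  {"acc": 53.6, "secs": 81},
    "distilbert_20k": {"acc": 60.4, "secs": 327},
    "roberta_20k":    {"acc": 62.5, "secs": 643},
}

def compare(base, new):
    """Return (accuracy gain in pp, training-time ratio) for new vs base."""
    gain = runs[new]["acc"] - runs[base]["acc"]
    ratio = runs[new]["secs"] / runs[base]["secs"]
    return round(gain, 1), round(ratio, 2)

print(compare("distilbert_5k", "distilbert_20k"))  # (6.8, 4.04)
print(compare("distilbert_20k", "roberta_20k"))    # (2.1, 1.97)
```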

---

## Per-Class F1 Across All Experiments

| Class | DistilBERT 5K | DistilBERT 20K | RoBERTa 20K |
|-------|--------------|----------------|-------------|
| challenge | 0.48 | 0.60 | 0.60 |
| fitness | 0.54 | 0.63 | 0.65 |
| motivational | 0.51 | 0.58 | 0.61 |
| nutrition | 0.62 | 0.63 | 0.67 |
| product | 0.54 | 0.56 | 0.58 |

---

## Inference Examples

| Comment | Predicted | Confidence |
|---------|-----------|------------|
| "This protein shake recipe changed my life, tastes amazing with oat milk" | nutrition | 95.6% |
| "I've been doing this workout for 30 days and I can see abs forming!" | fitness | 96.5% |
| "Never give up on your dreams, the grind is worth it" | motivational | 86.0% |
| "Is this pre-workout worth buying? I've heard mixed reviews" | product | 90.6% |
| "Day 7 of the squat challenge complete 🔥" | fitness ✗ | 89.3% |

Note: the final example is a known failure case. "Day 7 of the squat challenge" should be labelled challenge, but RoBERTa predicts fitness with 89.3% confidence, likely because "squat" has strong fitness associations in the training data. DistilBERT predicted challenge correctly here, though at lower confidence (50.8%). In this example the larger model is confidently wrong, so a high score on its own is not a reliable sign of a correct prediction.

---

## Limitations

**Challenge/motivational confusion** persists across all three model variants. In the test set, 129 challenge comments were predicted as motivational, despite the larger model and the additional training data. This appears to be a label ambiguity problem intrinsic to the task: challenge and motivational videos share the same workout-encouragement language. More data or a larger model is unlikely to resolve it unless the video title or other metadata is supplied alongside the comment text.
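
One way to surface borderline challenge/motivational cases at inference time is to compare the top two class scores. The sketch below is a triage aid under stated assumptions, not a fix for the ambiguity: it assumes a list of per-class `{'label', 'score'}` dictionaries, such as the pipeline can return when asked for all class scores, and the helper names, the 0.20 margin, and the example scores are hypothetical.

```python
def top_two_margin(scores):
    """scores: list of {'label': str, 'score': float}, one entry per class."""
    ranked = sorted(scores, key=lambda s: s["score"], reverse=True)
    best, runner_up = ranked[0], ranked[1]
    return best["label"], runner_up["label"], best["score"] - runner_up["score"]

def is_ambiguous(scores, margin=0.20, pair=("challenge", "motivational")):
    """True when the top two labels are the given pair and their scores are close."""
    best, runner_up, gap = top_two_margin(scores)
    return {best, runner_up} == set(pair) and gap < margin

# Hypothetical scores for illustration only (not real model output)
example_scores = [
    {"label": "challenge", "score": 0.41},
    {"label": "motivational", "score": 0.38},
    {"label": "fitness", "score": 0.12},
    {"label": "nutrition", "score": 0.05},
    {"label": "product", "score": 0.04},
]
print(is_ambiguous(example_scores))  # True
```

Flagged comments could be routed to manual review or to a second pass that includes the video title.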

**Product class underrepresentation:** product has roughly half as many examples as each of the other classes. Its F1 of 0.58 is the lowest of any class. Precision is competitive (0.65), but recall is low (0.53), so the model misses nearly half of the actual product comments.

**High-confidence errors:** RoBERTa produces higher confidence scores overall, including on incorrect predictions. The challenge → fitness misclassification at 89.3% confidence is one example.

**Non-English comments:** approximately 15% of the comments in the dataset are non-English. Predictions on these are unreliable.

---

## Next Steps

- YouTuber-stratified train/test split — train on 80 channels, test on 14 held-out channels to measure generalisation to unseen creators
- Sentiment classification using a human-labelled subset, to replace the VADER baseline used in the dissertation
- Incorporate video title as additional input feature to resolve challenge/motivational ambiguity
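
The channel-held-out split in the first item could look roughly like the sketch below. The `channel` field name, the seed, and the synthetic rows are assumptions for illustration; the real dataset schema may differ.

```python
import random

def split_by_channel(rows, n_test_channels=14, seed=42):
    """Hold out whole channels so no creator appears in both train and test."""
    channels = sorted({row["channel"] for row in rows})
    rng = random.Random(seed)
    rng.shuffle(channels)
    test_channels = set(channels[:n_test_channels])
    train = [r for r in rows if r["channel"] not in test_channels]
    test = [r for r in rows if r["channel"] in test_channels]
    return train, test, test_channels

# Synthetic rows: 94 channels with 3 comments each (illustrative only)
rows = [{"channel": f"channel_{i:02d}", "text": f"comment {j}"}
        for i in range(94) for j in range(3)]

train_rows, test_rows, held_out = split_by_channel(rows)
print(len(held_out), len(train_rows), len(test_rows))  # 14 240 42
```

Because channels differ in comment volume, the resulting train/test row counts on the real data would not be a fixed ratio.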

---

## Related Models

- [Krat6s/fitness-comment-classifier](https://huggingface.co/Krat6s/fitness-comment-classifier) — DistilBERT version trained on 20K rows (60.4% accuracy)

---

## Citation

```
Dataset: Self-scraped YouTube comments from 94 fitness influencer channels
Collected via YouTube Data API v3 for MSc dissertation research
HuggingFace dataset: Krat6s/fitness-youtube-comments
```