---
language:
- en
license: mit
library_name: transformers
tags:
- distilbert
- text-classification
- multi-task
- interpretability
- trust-and-safety
- content-moderation
- tiktok
- manipulation-detection
datasets:
- custom
metrics:
- f1
- mae
- r2
base_model: distilbert-base-uncased
pipeline_tag: text-classification
model-index:
- name: lucid-distilbert
  results:
  - task:
      type: text-classification
      name: Multi-label manipulation tactic classification
    metrics:
      - type: f1
        value: 0.334
        name: Macro F1
      - type: accuracy
        value: 0.904
        name: Macro accuracy
  - task:
      type: text-regression
      name: Composite manipulation severity
    metrics:
      - type: mae
        value: 5.90
        name: Composite MAE (0–100 scale)
      - type: r_squared
        value: 0.368
        name: Composite R²
---

# LUCID — DistilBERT for Short-Form Video Manipulation Detection

> *"You're not addicted. You're being engineered. See how."*

`lucid-distilbert` is a fine-tuned DistilBERT classifier that scores short-form social video text (TikTok captions + transcripts + on-screen overlay OCR) along **six research-grounded psychological manipulation dimensions**:

| Dimension | Academic grounding |
|---|---|
| Outrage Bait | Crockett 2017; Brady et al. 2017, 2021 |
| FOMO Trigger | Przybylski et al. 2013; Cialdini 2009 |
| Engagement Bait | Meta 2017; Munger 2020; Mathur et al. 2019 |
| Emotional Manipulation | Cialdini et al. 1987; Small et al. 2007; Kramer et al. 2014 |
| Curiosity Gap | Loewenstein 1994; Blom & Hansen 2015; Scott 2021 |
| Dopamine Design | Skinner 1953; Alter 2017; Montag et al. 2019 |

The model has two parallel output heads on a shared `[CLS]` representation:
- A **regression head** predicting the 0–100 composite Scroll Trap Score (sigmoid × 100).
- A **multi-label head** with 6 binary classifiers — one per dimension — each returning `P(tactic present)`.

Per-dimension probabilities are trained with binary cross-entropy against rubric severity labels binarized at severity ≥ 1. The composite regression head is trained with MSE against rubric-derived ground truth.
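
The output mapping and label binarization described above can be sketched in plain Python. This is a minimal illustration, not the repo's code: the function names are hypothetical, and the dimension keys are taken from the usage example further down.

```python
import math

# Dimension keys as they appear in the usage example's `dimension_scores`.
DIMENSIONS = [
    "outrage_bait", "fomo_trigger", "engagement_bait",
    "emotional_manipulation", "curiosity_gap", "dopamine_design",
]

def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def composite_score(regression_logit: float) -> float:
    """Map the regression head's raw output to the 0-100 scale (sigmoid x 100)."""
    return 100.0 * sigmoid(regression_logit)

def binarize_severity(severity: dict) -> dict:
    """Turn rubric severity labels into binary targets: 1 if severity >= 1, else 0."""
    return {dim: int(severity.get(dim, 0) >= 1) for dim in DIMENSIONS}

def tactic_probabilities(dimension_logits: list) -> dict:
    """One independent sigmoid per dimension (multi-label, so no softmax)."""
    return {dim: sigmoid(z) for dim, z in zip(DIMENSIONS, dimension_logits)}

# Hypothetical rubric labels for one post; missing dimensions count as severity 0.
targets = binarize_severity({"curiosity_gap": 2, "fomo_trigger": 0})
```

Because each dimension gets its own sigmoid, the six probabilities do not need to sum to 1, and a post can score high on several tactics at once.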

---

## Intended use

**Primary use case.** Research / educational tool for analyzing short-form video content at the post level. Given a fused text stream from a single TikTok-style post (caption + audio transcript + on-screen text), return a severity score per manipulation dimension and an aggregate 0–100 composite.

**Users this was built for.** Trust & Safety practitioners, platform policy researchers, media literacy educators, and end users who want vocabulary for what a specific post is doing to their attention.

**Not intended for.**
- Individual creator moderation or takedowns. The model scores *posts*, not *intent*; using it to judge whether a specific creator is acting in bad faith would misread the labels.
- Demographic profiling of creators or audiences.
- Any high-stakes automated decision without human review.
- Content in languages or cultural contexts other than English-language, predominantly US/UK social-media discourse. Manipulation norms are culturally situated; applying the model outside its training distribution requires rubric reconstruction.

---

## Training data

**Total labeled corpus: 3,527 items.**

| Source | Approx. size | Purpose |
|---|---|---|
| Webis Clickbait Corpus 2017 | ~2,000 | Pretraining-style signal; continuous severity |
| Stop Clickbait 2016 | ~1,500 | Weak supervision; binary clickbait |
| TikTok (yt-dlp scrape) | ~200 | In-domain evaluation + demo gallery |

### Labeling — LLM-as-judge with human validation

Because existing datasets carry only binary clickbait labels, we used **Claude Sonnet 4.5** (Anthropic) as a scalable labeling oracle, prompted with the 6-dimension rubric above (full text in [repo `docs/RUBRIC.md`](https://github.com/lindsaygross/Lucid/blob/main/docs/RUBRIC.md)) and 8 few-shot examples per severity level. This approach is explicitly in the lineage of **Constitutional AI / RLAIF** (Bai et al. 2022) — an LLM prompted with human-written principles produces training labels for a smaller supervised model.

We validate Claude's labels against a **100-post human gold set** hand-labeled by the author, reporting per-dimension **Spearman rank correlation** and **Krippendorff's α (ordinal)** as agreement metrics.

*Agreement numbers will be added to the companion technical report once gold-set labeling is complete.*
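
For readers unfamiliar with the first agreement metric, here is a minimal sketch of Spearman rank correlation with tie handling, which matters for ordinal severity labels that repeat often. The function names are illustrative and are not taken from the repo; Krippendorff's α is not covered here.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; returns NaN if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / math.sqrt(vx * vy)

def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

In this setting, `x` would be the human severity labels for one dimension and `y` the LLM labels for the same posts, computed once per dimension.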

---

## Training

- **Base model.** `distilbert-base-uncased` (Sanh et al. 2019), 66M parameters.
- **Fine-tuning.** Full fine-tune (no layer freezing). Dual heads attached to the `[CLS]` pooled representation.
- **Optimizer.** AdamW, `lr=2e-5`, `weight_decay=0.01`.
- **Schedule.** Linear LR with 10% warmup, 4 epochs.
- **Batch size.** 32.
- **Max sequence length.** 256 tokens.
- **Loss.** `MSE(composite) + 1.0 × BCEWithLogitsLoss(dimensions)`.
- **Hardware.** Single NVIDIA H100 via Duke Colab credits. Training completed in ~2 minutes.
- **Checkpoint selection.** Best epoch by validation composite MAE; saved state is from epoch 4 with val MAE=5.88.
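
The schedule and loss listed above can be sketched without any framework. Several details are assumptions for illustration: that "linear" means linear decay to zero after warmup, that the composite target is scaled to [0, 1] before the MSE, and that the BCE term is averaged over the six dimensions (the default reduction of PyTorch's `BCEWithLogitsLoss`). The function names are hypothetical.

```python
import math

def lr_at_step(step, total_steps, base_lr=2e-5, warmup_frac=0.10):
    """Linear warmup to base_lr, then linear decay to zero (assumed schedule shape)."""
    warmup_steps = int(warmup_frac * total_steps)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = total_steps - step
    return base_lr * max(0.0, remaining / max(1, total_steps - warmup_steps))

def bce_with_logits(logit, target):
    """Stable binary cross-entropy on a raw logit: max(z, 0) - z*y + log(1 + exp(-|z|))."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

def combined_loss(composite_pred, composite_true, dim_logits, dim_targets, dim_weight=1.0):
    """MSE on the composite plus weighted mean BCE over the dimensions, for one example."""
    mse = (composite_pred - composite_true) ** 2
    bce = sum(bce_with_logits(z, y) for z, y in zip(dim_logits, dim_targets)) / len(dim_logits)
    return mse + dim_weight * bce

# Hypothetical single example: perfect composite prediction, uninformative dimension logits.
example_loss = combined_loss(0.5, 0.5, [0.0] * 6, [1, 0, 1, 0, 1, 0])
```

With all dimension logits at zero, each BCE term equals ln 2, so `example_loss` is about 0.693. The scale of the composite target matters: an MSE computed on the raw 0–100 scale would dominate a BCE term weighted at 1.0.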

### Reproducibility

- Full training notebook: [`notebooks/train_lucid.ipynb`](https://github.com/lindsaygross/Lucid/blob/main/notebooks/train_lucid.ipynb)
- Training script (CPU fallback): [`scripts/train_deep.py`](https://github.com/lindsaygross/Lucid/blob/main/scripts/train_deep.py)
- Random seed: 42
- Splits: 70/15/15 stratified on composite-score bins
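
A stratified 70/15/15 split on composite-score bins could look like the sketch below. The 10-point bin width and the function name are assumptions; the notebook linked above is the authoritative version.

```python
import random
from collections import defaultdict

def stratified_split(scores, fractions=(0.70, 0.15, 0.15), bin_width=10, seed=42):
    """Return (train, val, test) index lists, stratified on composite-score bins."""
    rng = random.Random(seed)
    n_bins = max(1, 100 // bin_width)
    bins = defaultdict(list)
    for idx, score in enumerate(scores):
        # A score of exactly 100 falls into the top bin instead of opening a new one.
        bins[min(int(score // bin_width), n_bins - 1)].append(idx)

    train, val, test = [], [], []
    for key in sorted(bins):
        members = bins[key]
        rng.shuffle(members)
        n_train = round(fractions[0] * len(members))
        n_val = round(fractions[1] * len(members))
        train += members[:n_train]
        val += members[n_train:n_train + n_val]
        test += members[n_train + n_val:]  # the remainder goes to the test split
    return train, val, test

# Hypothetical scores: 1,000 items spread evenly over 0..99.
train_idx, val_idx, test_idx = stratified_split([i % 100 for i in range(1000)])
```

Splitting inside each bin keeps the score distribution similar across the three sets. Because counts are rounded per bin, the overall proportions can drift slightly from exactly 70/15/15 on real data.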

---

## Evaluation

Held-out test split of **529 items** (stratified 15% of corpus).

### Test-set metrics

| Metric | Value |
|---|---|
| Macro F1 (per-dim binary, threshold ≥1) | **0.334** |
| Macro accuracy (per-dim binary) | **0.904** |
| Composite MAE (0–100 scale) | **5.90** |
| Composite RMSE | **7.12** |
| Composite R² | **+0.368** |
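
Macro accuracy and macro F1 can diverge sharply when a tactic is rare. The sketch below uses hypothetical toy labels, not the test set, to show one way such a gap arises: a model that never fires on a rare dimension still gets high accuracy on it but an F1 of zero.

```python
def f1_binary(y_true, y_pred):
    """F1 for the positive class; returns 0.0 when there are no true or predicted positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def accuracy(y_true, y_pred):
    """Share of items where prediction and label agree."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

def macro(metric, true_by_dim, pred_by_dim):
    """Unweighted mean of a per-dimension metric."""
    return sum(metric(true_by_dim[d], pred_by_dim[d]) for d in true_by_dim) / len(true_by_dim)

# Toy data: a rare tactic the model never predicts, and a balanced tactic it gets right.
true_by_dim = {"rare": [1, 0, 0, 0, 0, 0, 0, 0, 0, 0], "common": [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]}
pred_by_dim = {"rare": [0] * 10, "common": [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]}

macro_f1 = macro(f1_binary, true_by_dim, pred_by_dim)   # 0.5
macro_acc = macro(accuracy, true_by_dim, pred_by_dim)   # 0.95
```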

### How to interpret

- **Positive composite R²** (+0.368) means the model explains variance in the composite score that a constant mean predictor cannot. For comparison, the naive keyword-matching baseline has R²=−0.594 and the classical (TF-IDF + XGBoost) baseline has R²=−1.462; a negative R² means a model does worse than always predicting the mean. The deep model is the only one of the three that beats the mean predictor.
- The macro F1 of 0.334 is lower than the classical baseline's 0.425. This reflects an intentional calibration difference: the deep model's per-dim probabilities are softer, producing fewer firings but better-calibrated confidences. See [the technical report](https://github.com/lindsaygross/Lucid/blob/main/docs/REPORT.md) §6 for the full per-dimension breakdown.
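
To make the R² interpretation concrete, the sketch below computes MAE, RMSE, and R² on hypothetical numbers (not the test set). It shows that always predicting the mean gives R² = 0 and that a predictor with larger squared error than the mean goes negative.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r_squared(y_true, y_pred):
    """1 - SS_res / SS_tot, where SS_tot is the squared error of the mean predictor."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical composite scores on the 0-100 scale.
y_true = [10.0, 20.0, 30.0, 40.0]
mean_pred = [25.0] * 4                    # always predict the mean
close_pred = [12.0, 18.0, 33.0, 38.0]     # tracks the targets
far_pred = [60.0] * 4                     # constant and far off

r2_mean = r_squared(y_true, mean_pred)    # 0.0
r2_close = r_squared(y_true, close_pred)  # 0.958
r2_far = r_squared(y_true, far_pred)      # -9.8
```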

### Noise robustness

Character-level noise injection on 100 test items (seed=7), mean |Δ score| on the 0–100 composite:

| Noise rate | Mean Δ | Median Δ | Max Δ |
|---|---|---|---|
| 5% | 4.2 | 2.0 | 26 |
| 10% | 5.4 | 4.0 | 27 |
| 20% | 7.7 | 5.5 | 37 |
| 35% | 10.2 | 9.0 | 32 |

At realistic OCR / transcription noise levels (5–10%), the composite Scroll Trap Score shifts by about 4–5 points on average on a 0–100 scale. This *graceful degradation* is consistent with the model having learned semantic rather than purely surface-lexical features.
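
The noise experiment can be approximated with the sketch below. The noise type (random lowercase substitution), the helper names, and the `toy_score` stand-in are assumptions for illustration; in practice the scorer would be the model's composite prediction.

```python
import random
import statistics
import string

def inject_char_noise(text, rate, rng):
    """Replace each character with a random lowercase letter with probability `rate`."""
    return "".join(
        rng.choice(string.ascii_lowercase) if rng.random() < rate else ch
        for ch in text
    )

def noise_deltas(texts, score_fn, rate, seed=7):
    """Mean, median, and max |score(noisy) - score(clean)| over a list of texts."""
    rng = random.Random(seed)
    deltas = [abs(score_fn(inject_char_noise(t, rate, rng)) - score_fn(t)) for t in texts]
    return {
        "mean": statistics.mean(deltas),
        "median": statistics.median(deltas),
        "max": max(deltas),
    }

def toy_score(text):
    """Hypothetical stand-in scorer: share of '!' characters, scaled to 0-100."""
    return 100.0 * text.count("!") / max(1, len(text))

samples = ["DON'T SCROLL! HANG ON!", "wait for it...", "you won't believe this!!"]
for rate in (0.05, 0.10, 0.20, 0.35):
    print(rate, noise_deltas(samples, toy_score, rate))
```

Fixing the seed makes the perturbed texts reproducible, so a change in the reported deltas reflects a change in the scorer and not in the noise draw.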

---

## Usage

### Via HuggingFace `transformers`

This model has a custom multi-output head (`composite_head` + `dimension_head`), so it cannot be loaded with `AutoModelForSequenceClassification`. Use the repo's inference module:

```python
from backend.inference.deep import DeepPredictor

predictor = DeepPredictor(hf_repo="lindsaygross32/lucid-distilbert")
pred = predictor.predict("DON'T SCROLL! HANG ON! HANG ON!! I have one question...")

print(pred.scroll_trap_score)
# 28
print(pred.dimension_scores)
# {'outrage_bait': 0.11, 'fomo_trigger': 0.23, 'engagement_bait': 0.29,
#  'emotional_manipulation': 0.04, 'curiosity_gap': 0.68, 'dopamine_design': 0.25}
```
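
To turn the per-dimension probabilities into a list of flagged tactics, a simple threshold can be applied. The 0.5 cut-off and the helper name are assumptions; the card does not specify a decision threshold, and a lower one may be appropriate given the model's soft probabilities.

```python
def flagged_dimensions(dimension_scores, threshold=0.5):
    """Return (dimension, probability) pairs at or above the threshold, highest first."""
    hits = [(dim, p) for dim, p in dimension_scores.items() if p >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)

# Scores copied from the example output above.
scores = {
    "outrage_bait": 0.11, "fomo_trigger": 0.23, "engagement_bait": 0.29,
    "emotional_manipulation": 0.04, "curiosity_gap": 0.68, "dopamine_design": 0.25,
}

print(flagged_dimensions(scores))
# [('curiosity_gap', 0.68)]
print(flagged_dimensions(scores, threshold=0.25))
# [('curiosity_gap', 0.68), ('engagement_bait', 0.29), ('dopamine_design', 0.25)]
```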

### Per-dimension token attribution (Integrated Gradients)

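```python
# Reuses the `predictor` created in the previous example.
pred, per_dim_tokens = predictor.explain(
    "DON'T SCROLL! HANG ON! Will you be my friend?",
    top_k=8,
)
# Illustrative shape of the result; tokens, positions, and values depend on the input.
# per_dim_tokens["engagement_bait"] -> [
#   {"token": "you", "position": 9, "attribution": +0.34},
#   {"token": "question", "position": 14, "attribution": +0.26},
#   ...
# ]
```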

Integrated Gradients (Sundararajan, Taly & Yan 2017) produces signed per-token attributions: a positive attribution means the token pushes the head toward "tactic present," and a negative one pushes it toward "tactic absent."
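
The idea behind Integrated Gradients can be shown on a toy model small enough to differentiate by hand. The sketch below is not the repo's implementation: it uses a hypothetical three-feature logistic model and an all-zeros baseline, whereas for a transformer the method is typically applied to input embeddings and summed per token.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def model(x, w, b):
    """Toy 'tactic present' head: logistic regression over the input features."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def gradient(x, w, b):
    """Analytic gradient of the toy model's output with respect to each input."""
    p = model(x, w, b)
    return [p * (1.0 - p) * wi for wi in w]

def integrated_gradients(x, baseline, w, b, steps=200):
    """Midpoint Riemann approximation of the IG path integral from baseline to x."""
    totals = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [bi + alpha * (xi - bi) for xi, bi in zip(x, baseline)]
        for i, g in enumerate(gradient(point, w, b)):
            totals[i] += g
    # Scale the averaged gradient by how far each input moved from the baseline.
    return [(xi - bi) * t / steps for xi, bi, t in zip(x, baseline, totals)]

# Hypothetical weights and input.
w, b = [0.8, -0.5, 0.3], 0.1
x, baseline = [1.0, 2.0, 0.5], [0.0, 0.0, 0.0]
attributions = integrated_gradients(x, baseline, w, b)

# Completeness: attributions sum to (approximately) the change in model output.
gap = sum(attributions) - (model(x, w, b) - model(baseline, w, b))
```

The sign of each attribution follows the sign of the feature's contribution here: the first and third features push the output up and the second pushes it down. The completeness property (`gap` close to zero) is a useful sanity check on the number of integration steps.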

### Live demo

https://lucid-seven-pied.vercel.app

---

## Limitations and ethical considerations

1. **Intent vs. effect.** The model measures tactic *presence*, not creator *intent*. A post using emotional appeals to raise money for a sick family member scores higher on Emotional Manipulation — but that is not a judgment of bad faith. Any downstream tooling built on top of this model must preserve that distinction.

2. **Cultural and linguistic scope.** Training data is English-language, predominantly US-origin social content. Manipulation norms vary across cultures; the model should not be used on non-English content or in cultures with meaningfully different rhetorical conventions without rubric reconstruction.

3. **Labeling source bias.** Our labels come from a single LLM judge (Claude Sonnet 4.5) validated against a single human annotator. A world where many systems use the same LLM as judge risks *correlated labeling errors*. Multi-model, multi-annotator labeling would be the right long-term direction.

4. **Small corpus.** 3,527 total items is modest for a 6-way multi-label task. Expect performance on new distributions to vary more than the reported test metrics suggest.

5. **Format–content confounds.** The classical baseline over-fires on listicle-format text because training data (Stop Clickbait) conflates listicle *format* with clickbait *manipulation*. The deep model is more robust but the underlying confound is not fully eliminated.

6. **Creator-level aggregation risk.** This model scores *posts*. Rolling scores up to the creator level (e.g., "creator X's average Scroll Trap Score") creates harassment vectors and should not be done without additional review.

7. **Not a safety classifier.** This is an *educational* tool for surfacing rhetorical moves, not a hate-speech / harm detector. It explicitly says nothing about whether content is harmful or false.

---

## Citation

If you use `lucid-distilbert` in academic work, please cite:

```bibtex
@misc{gross2026lucid,
  title        = {LUCID: Multimodal Detection of Short-Form Video Manipulation Tactics},
  author       = {Lindsay Gross},
  year         = {2026},
  howpublished = {\url{https://github.com/lindsaygross/Lucid}},
  note         = {Duke AIPI 540 final project},
}
```

Academic grounding for the 6-dimension rubric is documented in full in [`docs/RUBRIC.md`](https://github.com/lindsaygross/Lucid/blob/main/docs/RUBRIC.md).

---

## License

MIT. See the [LICENSE](https://github.com/lindsaygross/Lucid/blob/main/LICENSE) in the repo.

---

## Contact

Lindsay Gross — Duke AIPI, Spring 2026 — background in Trust & Safety.

Issues / collaboration: [github.com/lindsaygross/Lucid/issues](https://github.com/lindsaygross/Lucid/issues).