---
language:
- en
license: mit
library_name: transformers
tags:
- distilbert
- text-classification
- multi-task
- interpretability
- trust-and-safety
- content-moderation
- tiktok
- manipulation-detection
datasets:
- custom
metrics:
- f1
- mae
- r2
base_model: distilbert-base-uncased
pipeline_tag: text-classification
model-index:
- name: lucid-distilbert
  results:
  - task:
      type: text-classification
      name: Multi-label manipulation tactic classification
    metrics:
      - type: f1
        value: 0.334
        name: Macro F1
      - type: accuracy
        value: 0.904
        name: Macro accuracy
  - task:
      type: text-regression
      name: Composite manipulation severity
    metrics:
      - type: mae
        value: 5.90
        name: Composite MAE (0–100 scale)
      - type: r_squared
        value: 0.368
        name: Composite R²
---

# LUCID — DistilBERT for Short-Form Video Manipulation Detection

> *"You're not addicted. You're being engineered. See how."*

`lucid-distilbert` is a fine-tuned DistilBERT classifier that scores short-form social video text (TikTok captions + transcripts + on-screen overlay OCR) along **six research-grounded psychological manipulation dimensions**:

| Dimension | Academic grounding |
|---|---|
| Outrage Bait | Crockett 2017; Brady et al. 2017, 2021 |
| FOMO Trigger | Przybylski et al. 2013; Cialdini 2009 |
| Engagement Bait | Meta 2017; Munger 2020; Mathur et al. 2019 |
| Emotional Manipulation | Cialdini et al. 1987; Small et al. 2007; Kramer et al. 2014 |
| Curiosity Gap | Loewenstein 1994; Blom & Hansen 2015; Scott 2021 |
| Dopamine Design | Skinner 1953; Alter 2017; Montag et al. 2019 |

The model has two parallel output heads on a shared `[CLS]` representation:
- A **regression head** predicting the 0–100 composite Scroll Trap Score (sigmoid × 100).
- A **multi-label head** with 6 binary classifiers — one per dimension — each returning `P(tactic present)`.

The six per-dimension classifiers are trained with binary cross-entropy against rubric severity labels binarized at severity ≥ 1. The composite regression head is trained with MSE against rubric-derived ground truth.
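
As a rough sketch of how the two heads' raw outputs map to the values described above, assuming each head emits unbounded logits: the helper names `decode_heads` and `HeadOutputs` are hypothetical and are not the repo's API. The dimension keys mirror the ones shown in the usage example further down.

```python
import math
from dataclasses import dataclass

# Dimension keys as they appear in the usage example of this card.
DIMENSIONS = [
    "outrage_bait", "fomo_trigger", "engagement_bait",
    "emotional_manipulation", "curiosity_gap", "dopamine_design",
]


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


@dataclass
class HeadOutputs:  # hypothetical container, for illustration only
    scroll_trap_score: float   # composite score on the 0-100 scale
    dimension_scores: dict     # P(tactic present) per dimension


def decode_heads(composite_logit: float, dimension_logits: list) -> HeadOutputs:
    """Map raw head outputs to the composite score and per-dimension probabilities."""
    if len(dimension_logits) != len(DIMENSIONS):
        raise ValueError("expected one logit per dimension")
    score = 100.0 * sigmoid(composite_logit)           # sigmoid x 100
    probs = {name: sigmoid(z) for name, z in zip(DIMENSIONS, dimension_logits)}
    return HeadOutputs(score, probs)
```

Because the six probabilities come from independent sigmoids rather than a softmax, several tactics can be flagged on the same post.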

---

## Intended use

**Primary use case.** Research / educational tool for analyzing short-form video content at the post level. Given a fused text stream from a single TikTok-style post (caption + audio transcript + on-screen text), return a severity score per manipulation dimension and an aggregate 0–100 composite.

**Users this was built for.** Trust & Safety practitioners, platform policy researchers, media literacy educators, and end users who want vocabulary for what a specific post is doing to their attention.

**Not intended for.**
- Individual creator moderation or takedowns. The model scores *posts*, not *intent*; using it to judge whether a specific creator is acting in bad faith would misread the labels.
- Demographic profiling of creators or audiences.
- Any high-stakes automated decision without human review.
- Content in languages or cultural contexts other than English-language, predominantly US/UK social-media discourse. Manipulation norms are culturally situated; applying the model outside its training distribution requires rubric reconstruction.

---

## Training data

**Total labeled corpus: 3,527 items.**

| Source | Approx. size | Purpose |
|---|---|---|
| Webis Clickbait Corpus 2017 | ~2,000 | Pretraining-style signal; continuous severity |
| Stop Clickbait 2016 | ~1,500 | Weak supervision; binary clickbait |
| TikTok (yt-dlp scrape) | ~200 | In-domain evaluation + demo gallery |

### Labeling — LLM-as-judge with human validation

Because existing datasets carry only binary clickbait labels, we used **Claude Sonnet 4.5** (Anthropic) as a scalable labeling oracle, prompted with the 6-dimension rubric above (full text in [repo `docs/RUBRIC.md`](https://github.com/lindsaygross/Lucid/blob/main/docs/RUBRIC.md)) and 8 few-shot examples per severity level. This approach is explicitly in the lineage of **Constitutional AI / RLAIF** (Bai et al. 2022) — an LLM prompted with human-written principles produces training labels for a smaller supervised model.

We validate Claude's labels against a **100-post human gold set** hand-labeled by the author, reporting per-dimension **Spearman rank correlation** and **Krippendorff's α (ordinal)** as agreement metrics.

*Agreement numbers will appear in the companion technical report once gold-set labeling is complete.*
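
For readers unfamiliar with the first agreement metric, here is a minimal standard-library sketch of Spearman rank correlation with average ranks for ties. It is not the project's evaluation code, and any label values passed to it in examples are hypothetical.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman rank correlation = Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))
```

Ordinal severity labels tend to contain many ties, which is why the sketch uses average ranks rather than the tie-free shortcut formula.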

---

## Training

- **Base model.** `distilbert-base-uncased` (Sanh et al. 2019), 66M parameters.
- **Fine-tuning.** Full fine-tune (no layer freezing). Dual heads attached to the `[CLS]` pooled representation.
- **Optimizer.** AdamW, `lr=2e-5`, `weight_decay=0.01`.
- **Schedule.** Linear LR with 10% warmup, 4 epochs.
- **Batch size.** 32.
- **Max sequence length.** 256 tokens.
- **Loss.** `MSE(composite) + 1.0 × BCEWithLogitsLoss(dimensions)`.
- **Hardware.** Single NVIDIA H100 via Duke Colab credits. Training completed in ~2 minutes.
- **Checkpoint selection.** Best epoch by validation composite MAE; saved state is from epoch 4 with val MAE=5.88.
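
The schedule and loss listed above can be sketched in plain Python as follows. This is illustrative only: it assumes the LR decays linearly to zero after warmup, that both loss terms use mean reduction, and that the composite target is scaled to 0–1 before MSE. None of these details are stated in this card, and the function names are hypothetical.

```python
import math


def linear_warmup_lr(step, total_steps, base_lr=2e-5, warmup_frac=0.10):
    """LR at a 0-indexed step: linear ramp during warmup, then linear decay to 0."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, (total_steps - step) / remaining)


def bce_with_logits(logit, target):
    """Stable binary cross-entropy on a raw logit: max(z, 0) - z*y + log(1 + exp(-|z|))."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def combined_loss(composite_pred, composite_target, dim_logits, dim_targets,
                  bce_weight=1.0):
    """MSE(composite) + bce_weight * BCE(dimensions), both averaged over the batch.

    composite_pred / composite_target: lists of floats (assumed 0-1 scale).
    dim_logits / dim_targets: lists of rows, one row of 6 values per example.
    """
    n = len(composite_pred)
    mse = sum((p - t) ** 2 for p, t in zip(composite_pred, composite_target)) / n
    bce_terms = [
        bce_with_logits(z, y)
        for row_z, row_y in zip(dim_logits, dim_targets)
        for z, y in zip(row_z, row_y)
    ]
    bce = sum(bce_terms) / len(bce_terms)
    return mse + bce_weight * bce
```

With the 1.0 weight, neither head is explicitly prioritised. How much each term contributes in practice depends on the scale chosen for the composite target.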

### Reproducibility

- Full training notebook: [`notebooks/train_lucid.ipynb`](https://github.com/lindsaygross/Lucid/blob/main/notebooks/train_lucid.ipynb)
- Training script (CPU fallback): [`scripts/train_deep.py`](https://github.com/lindsaygross/Lucid/blob/main/scripts/train_deep.py)
- Random seed: 42
- Splits: 70/15/15 stratified on composite-score bins
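
One way to produce a 70/15/15 split stratified on composite-score bins is sketched below. The number of bins, the equal-width bin edges, and the function name are assumptions made for illustration; the notebook linked above is the authoritative version.

```python
import random
from collections import defaultdict


def stratified_split(scores, n_bins=5, fractions=(0.70, 0.15, 0.15), seed=42):
    """Return (train, val, test) index lists, stratified on equal-width score bins.

    scores: composite scores on the 0-100 scale, one per item.
    n_bins: assumed number of bins (not specified in this card).
    """
    rng = random.Random(seed)
    width = 100.0 / n_bins
    bins = defaultdict(list)
    for idx, score in enumerate(scores):
        # Clamp so that a score of exactly 100 falls in the last bin.
        bins[min(int(score // width), n_bins - 1)].append(idx)

    train, val, test = [], [], []
    for b in sorted(bins):
        members = bins[b]
        rng.shuffle(members)
        n_train = round(len(members) * fractions[0])
        n_val = round(len(members) * fractions[1])
        train += members[:n_train]
        val += members[n_train:n_train + n_val]
        test += members[n_train + n_val:]   # remainder goes to test
    return train, val, test
```

Splitting within each bin keeps the score distribution similar across the three partitions, which matters when high-severity posts are rare.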

---

## Evaluation

Held-out test split of **529 items** (stratified 15% of corpus).

### Test-set metrics

| Metric | Value |
|---|---|
| Macro F1 (per-dim binary, labels binarized at severity ≥ 1) | **0.334** |
| Macro accuracy (per-dim binary) | **0.904** |
| Composite MAE (0–100 scale) | **5.90** |
| Composite RMSE | **7.12** |
| Composite R² | **+0.368** |

### How to interpret

- **Positive composite R²** (+0.368) means the model explains real variance in the composite score beyond a constant mean predictor. For comparison, the naive keyword-matching baseline has R²=−0.594 and the classical (TF-IDF + XGBoost) baseline has R²=−1.462. The deep model is the only one of the three that beats the mean predictor.
- The macro F1 of 0.334 is lower than the classical baseline's 0.425. This reflects an intentional calibration difference: the deep model's per-dim probabilities are softer, producing fewer positive predictions but better-calibrated confidences. See [the technical report](https://github.com/lindsaygross/Lucid/blob/main/docs/REPORT.md) §6 for the full per-dimension breakdown.
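
To make the metric definitions concrete, here is a small standard-library sketch of MAE, RMSE, R², and macro F1. It is not the project's evaluation script. The toy numbers are made up, and scoring F1 as 0.0 for a dimension with no positives at all is an assumed convention.

```python
import math


def regression_metrics(y_true, y_pred):
    """MAE, RMSE and R² for the composite score."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "mae": sum(abs(e) for e in errors) / n,
        "rmse": math.sqrt(ss_res / n),
        "r2": 1.0 - ss_res / ss_tot,   # 0 = as good as predicting the mean
    }


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-dimension F1; rows are lists of 0/1 labels."""
    n_dims = len(y_true[0])
    f1s = []
    for d in range(n_dims):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[d] == 1 and p[d] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[d] == 0 and p[d] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[d] == 1 and p[d] == 0)
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)  # assumed zero-division rule
    return sum(f1s) / n_dims


# Toy data: predicting the mean gives R² = 0; a worse-than-mean predictor goes negative.
truth = [10, 20, 30, 40]
print(regression_metrics(truth, [25, 25, 25, 25])["r2"])   # 0.0
print(regression_metrics(truth, [40, 30, 20, 10])["r2"])   # -3.0
```

The toy case shows how a negative R² arises: the predictor's squared error exceeds that of simply predicting the mean of the targets.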

### Noise robustness

Character-level noise injection on 100 test items (seed=7). The table reports the absolute change |Δ score| in the 0–100 composite:

| Noise rate | Mean Δ | Median Δ | Max Δ |
|---|---|---|---|
| 5% | 4.2 | 2.0 | 26 |
| 10% | 5.4 | 4.0 | 27 |
| 20% | 7.7 | 5.5 | 37 |
| 35% | 10.2 | 9.0 | 32 |

At realistic OCR / transcription noise levels (5–10%), the mean shift in the composite Scroll Trap Score is roughly 4–5 points on a 0–100 scale. This *graceful degradation* suggests the model has learned semantic rather than purely surface-lexical features.
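
A noise-robustness check of this kind could look roughly like the sketch below. The corruption operator (random substitution with a lowercase letter) is an assumption, since this card does not say which character edits were used. `score_fn` is a hypothetical stand-in for whatever returns the composite score for a text.

```python
import random
import string


def inject_char_noise(text, rate, rng):
    """Replace each character with a random lowercase letter with probability `rate`."""
    out = []
    for ch in text:
        if rng.random() < rate:
            out.append(rng.choice(string.ascii_lowercase))
        else:
            out.append(ch)
    return "".join(out)


def score_shift_stats(texts, score_fn, rate, seed=7):
    """Mean, median and max of |score(noisy) - score(clean)| over `texts`."""
    rng = random.Random(seed)
    deltas = sorted(
        abs(score_fn(inject_char_noise(t, rate, rng)) - score_fn(t)) for t in texts
    )
    n = len(deltas)
    mid = n // 2
    median = deltas[mid] if n % 2 else (deltas[mid - 1] + deltas[mid]) / 2.0
    return {"mean": sum(deltas) / n, "median": median, "max": deltas[-1]}
```

Fixing the seed makes the corrupted inputs reproducible, so score shifts can be compared across noise rates or model versions.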

---

## Usage

### Via HuggingFace `transformers`

This model has a custom multi-output head (`composite_head` + `dimension_head`), so it cannot be loaded with `AutoModelForSequenceClassification`. Use the repo's inference module:

```python
from backend.inference.deep import DeepPredictor

predictor = DeepPredictor(hf_repo="lindsaygross32/lucid-distilbert")
pred = predictor.predict("DON'T SCROLL! HANG ON! HANG ON!! I have one question...")

print(pred.scroll_trap_score)
# 28
print(pred.dimension_scores)
# {'outrage_bait': 0.11, 'fomo_trigger': 0.23, 'engagement_bait': 0.29,
#  'emotional_manipulation': 0.04, 'curiosity_gap': 0.68, 'dopamine_design': 0.25}
```

### Per-dimension token attribution (Integrated Gradients)

```python
pred, per_dim_tokens = predictor.explain(
    "DON'T SCROLL! HANG ON! Will you be my friend?",
    top_k=8,
)
# per_dim_tokens["engagement_bait"] -> [
#   {"token": "you", "position": 9, "attribution": +0.34},
#   {"token": "question", "position": 14, "attribution": +0.26},
#   ...
# ]
```

Integrated Gradients (Sundararajan, Taly, Yan 2017) produces signed per-token attributions: a positive attribution means the token pushes the head toward "tactic present", and a negative one pushes it toward "tactic absent".
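
The idea behind Integrated Gradients can be illustrated on a toy logistic model, used here as a stand-in for one dimension head. This is a sketch under that simplification, not the repo's `explain` implementation, which operates on token embeddings. The weights, inputs, and all-zero baseline below are arbitrary.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def model(x, w, b):
    """Toy head: P(tactic present) = sigmoid(w . x + b)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def grad(x, w, b):
    """Analytic gradient of the toy head with respect to its inputs."""
    p = model(x, w, b)
    return [p * (1.0 - p) * wi for wi in w]


def integrated_gradients(x, baseline, w, b, steps=200):
    """Midpoint Riemann approximation of IG along the straight path baseline -> x."""
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [bi + alpha * (xi - bi) for xi, bi in zip(x, baseline)]
        total = [t + g for t, g in zip(total, grad(point, w, b))]
    # Scale the averaged gradient by the input difference, feature by feature.
    return [(xi - bi) * t / steps for xi, bi, t in zip(x, baseline, total)]


w, b = [0.5, -0.3, 0.8], 0.1           # arbitrary toy parameters
x, baseline = [1.0, 2.0, -1.0], [0.0, 0.0, 0.0]
attributions = integrated_gradients(x, baseline, w, b)
# Completeness: attributions sum (approximately) to f(x) - f(baseline).
gap = sum(attributions) - (model(x, w, b) - model(baseline, w, b))
```

The completeness property is what makes the signs interpretable: positive and negative attributions together account for the change in the head's output relative to the baseline input.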

### Live demo

https://lucid-seven-pied.vercel.app

---

## Limitations and ethical considerations

1. **Intent vs. effect.** The model measures tactic *presence*, not creator *intent*. A post using emotional appeals to raise money for a sick family member scores higher on Emotional Manipulation — but that is not a judgment of bad faith. Any downstream tooling built on top of this model must preserve that distinction.

2. **Cultural and linguistic scope.** Training data is English-language, predominantly US-origin social content. Manipulation norms vary across cultures; the model should not be used on non-English content or in cultures with meaningfully different rhetorical conventions without rubric reconstruction.

3. **Labeling source bias.** Our labels come from a single LLM judge (Claude Sonnet 4.5) validated against a single human annotator. A world where many systems use the same LLM as judge risks *correlated labeling errors*. Multi-model, multi-annotator labeling would be the right long-term direction.

4. **Small corpus.** 3,527 total items is modest for a 6-way multi-label task. Expect performance on new distributions to vary more than the reported test metrics suggest.

5. **Format–content confounds.** The classical baseline over-fires on listicle-format text because training data (Stop Clickbait) conflates listicle *format* with clickbait *manipulation*. The deep model is more robust but the underlying confound is not fully eliminated.

6. **Creator-level aggregation risk.** This model scores *posts*. Rolling scores up to the creator level (e.g., "creator X's average Scroll Trap Score") creates harassment vectors and should not be done without additional review.

7. **Not a safety classifier.** This is an *educational* tool for surfacing rhetorical moves, not a hate-speech / harm detector. It explicitly says nothing about whether content is harmful or false.

---

## Citation

If you use `lucid-distilbert` in academic work, please cite:

```bibtex
@misc{gross2026lucid,
  title        = {LUCID: Multimodal Detection of Short-Form Video Manipulation Tactics},
  author       = {Lindsay Gross},
  year         = {2026},
  howpublished = {\url{https://github.com/lindsaygross/Lucid}},
  note         = {Duke AIPI 540 final project},
}
```

Academic grounding for the 6-dimension rubric is documented in full in [`docs/RUBRIC.md`](https://github.com/lindsaygross/Lucid/blob/main/docs/RUBRIC.md).

---

## License

MIT. See the [LICENSE](https://github.com/lindsaygross/Lucid/blob/main/LICENSE) in the repo.

---

## Contact

Lindsay Gross — Duke AIPI, Spring 2026 — background in Trust & Safety.

Issues / collaboration: [github.com/lindsaygross/Lucid/issues](https://github.com/lindsaygross/Lucid/issues).