# MMM-Diffusion: Marketing Mix Modeling via Dual-Denoiser Diffusion

A generative diffusion model for **Marketing Mix Modeling (MMM)** that generates time-varying coefficients for sales decomposition. Adapted from the dual-denoiser architecture of NVIDIA's Kimodo and of GMD, its closest public analog.

## Architecture

```
                    ┌─────────────────────────────────────────────┐
                    │         MMM-Diffusion Architecture          │
                    │  (Adapted from Kimodo/GMD Dual-Denoiser)    │
                    └─────────────────────────────────────────────┘

┌──────────────────┐   ┌──────────────────────────────────────────┐
│  CONDITIONING     │   │  STAGE 1: Campaign/Geo Denoiser          │
│                   │   │  (≈ Kimodo Root Denoiser)                │
│  • Media Spend    │──▶│                                          │
│    (5 channels)   │   │  Denoises aggregate-level patterns       │
│  • Controls       │   │  from non-marketing vars + total sales   │
│    (3 variables)  │   │                                          │
│  • Total Sales    │   │  Transformer Encoder (4 layers, d=128)   │
│                   │   └──────────────┬───────────────────────────┘
└──────────────────┘                   │ Campaign Context
                                       ▼
                    ┌──────────────────────────────────────────────┐
                    │  STAGE 2: Channel Denoiser                   │
                    │  (≈ Kimodo Body Denoiser)                    │
                    │                                              │
                    │  Denoises per-channel time-varying β_t       │
                    │  conditioned on Stage 1 output + media spend │
                    │                                              │
                    │  Cross-Attention + Transformer (6 layers)    │
                    │                                              │
                    │  CONSTRAINT ENFORCEMENT:                     │
                    │  • Log-space for media (exp → always ≥ 0)    │
                    │  • PhysDiff-style projection every K steps   │
                    │  • Soft sign penalty loss                    │
                    └──────────────┬───────────────────────────────┘
                                   │
                                   ▼
                    ┌──────────────────────────────────────────────┐
                    │  OUTPUT: Time-Varying Coefficients           │
                    │                                              │
                    │  β_TV(t), β_Digital(t), β_Social(t),         │
                    │  β_Print(t), β_Radio(t)  [all ≥ 0]          │
                    │  β_Seasonality(t), β_Trend(t),               │
                    │  β_CompetitorPrice(t)  [unconstrained]       │
                    │                                              │
                    │  → Sales Decomposition:                      │
                    │    Sales_t = base + Σ β_m(t)·Hill(Adstock(x))│
                    │            + Σ β_c(t)·ctrl_c(t) + noise      │
                    └──────────────────────────────────────────────┘
```
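
The decomposition formula at the bottom of the diagram can be sketched in a few lines of NumPy. This is an illustration only: `decompose_sales`, its argument names, and the random inputs are hypothetical and not taken from `mmm_diffusion.py`. `media_sat` stands for the already transformed `Hill(Adstock(x))` term, and the noise term is omitted.

```python
import numpy as np

def decompose_sales(base, beta_media, media_sat, beta_ctrl, controls):
    """Split sales into base, per-channel media, and per-control contributions.

    beta_media, media_sat: (T, n_media) coefficients and Hill(Adstock(spend))
    beta_ctrl, controls:   (T, n_ctrl)  coefficients and control values
    """
    media_contrib = beta_media * media_sat   # (T, n_media)
    ctrl_contrib = beta_ctrl * controls      # (T, n_ctrl)
    total = base + media_contrib.sum(axis=1) + ctrl_contrib.sum(axis=1)
    return {"media": media_contrib, "controls": ctrl_contrib, "total": total}

# Hypothetical inputs: 104 weeks, 5 media channels, 3 controls.
rng = np.random.default_rng(0)
T = 104
parts = decompose_sales(
    base=100.0,
    beta_media=rng.uniform(0.5, 2.0, size=(T, 5)),   # non-negative media coefficients
    media_sat=rng.uniform(0.0, 1.0, size=(T, 5)),
    beta_ctrl=rng.normal(size=(T, 3)),               # unconstrained control coefficients
    controls=rng.normal(size=(T, 3)),
)
print(parts["total"].shape)  # one sales value per week
```

Because the media coefficients and the saturated spend are both non-negative, every media contribution in `parts["media"]` is non-negative, which is what makes the decomposition interpretable.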

## Kimodo → MMM Mapping

| Kimodo (Motion Generation)     | MMM-Diffusion (Marketing)                     |
|-------------------------------|------------------------------------------------|
| Text prompts                  | Media spend, non-marketing vars, total sales   |
| Motion/position constraints   | Sign constraints (β_media ≥ 0) + prior bounds  |
| Root denoiser (trajectory)    | Campaign/Geo denoiser (aggregate patterns)     |
| Body denoiser (joint angles)  | Channel denoiser (per-channel coefficients)    |
| Skeleton positions/rotations  | Time-varying coefficients for decomposition    |
| Foot contact constraints      | Media positivity constraint                    |
| Velocity loss                 | Temporal smoothness loss                       |

## Key Design Decisions

### Constraint Enforcement (3 mechanisms, belt-and-suspenders)

1. **Log-space reparametrization**: Media coefficients are modeled in log-space during training. At decode time, `exp()` guarantees positivity. This is the primary mechanism.

2. **PhysDiff-style projection**: During reverse diffusion sampling, every K=10 steps the denoised x̂₀ is projected into the feasible region (clamped to valid ranges). Based on [PhysDiff](https://arxiv.org/abs/2212.02500).

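3. **Soft sign penalty**: The training loss includes `L_sign = ReLU(-β_media - threshold)²`, which is nonzero only when a log-space media coefficient falls below `-threshold`. Because `exp()` already guarantees positivity, this term does not enforce the sign itself; it discourages extreme negative log-values, i.e. decoded coefficients collapsing toward zero.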
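
The first two mechanisms can be sketched as follows. This is a minimal NumPy illustration, not the code in `mmm_diffusion.py`: the function names `project` and `decode` and the clamp bounds `log_lo`/`log_hi` are assumptions, and the column layout (media channels first) follows the Usage section.

```python
import numpy as np

N_MEDIA = 5  # first 5 coefficient columns are media channels

def project(x0_hat, log_lo=-5.0, log_hi=5.0, n_media=N_MEDIA):
    """Clamp the log-space media columns of a denoised estimate.

    The bounds are hypothetical; control columns are left untouched.
    """
    out = x0_hat.astype(float)  # astype returns a copy
    out[..., :n_media] = np.clip(out[..., :n_media], log_lo, log_hi)
    return out

def decode(x, n_media=N_MEDIA):
    """Map model-space values to coefficients: exp() on media columns only."""
    out = x.astype(float)
    out[..., :n_media] = np.exp(out[..., :n_media])
    return out

# A deliberately wild denoised estimate, shape (batch, weeks, coefficients).
rng = np.random.default_rng(0)
x0_hat = rng.normal(scale=10.0, size=(1, 104, 8))
coeffs = decode(project(x0_hat))
print(coeffs[..., :N_MEDIA].min() > 0)  # media coefficients are positive
```

Clamping in log-space bounds the decoded media coefficients to `[exp(log_lo), exp(log_hi)]`, so the projection also keeps them away from zero and from implausibly large values.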

### x₀-prediction (not ε-prediction)

Following MDM and GMD, the model predicts the clean data x₀ directly rather than the noise ε. This enables:
- Constraint projection during sampling: every denoising step yields an x̂₀ estimate in meaningful coefficient units, so it can be projected at any step
- Auxiliary losses computed directly on the predicted coefficients (sales reconstruction, temporal smoothness)
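
To make the x₀-prediction loop concrete, here is a minimal NumPy sketch of reverse sampling with the standard DDPM posterior `q(x_{t-1} | x_t, x̂₀)` and a projection hook applied every `k` steps. The linear β schedule, the function names, and the constant stand-in denoiser are assumptions for illustration; the actual sampler in `mmm_diffusion.py` may differ.

```python
import numpy as np

def make_schedule(n_steps=200, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule (an assumption; the repo may use another one)."""
    betas = np.linspace(beta_start, beta_end, n_steps)
    alphas = 1.0 - betas
    return betas, alphas, np.cumprod(alphas)

def posterior_step(x_t, x0_hat, t, betas, alphas, alpha_bar, rng):
    """Sample x_{t-1} from q(x_{t-1} | x_t, x0_hat) for 0-indexed step t."""
    if t == 0:
        return x0_hat  # last step returns the clean estimate itself
    ab_t, ab_prev = alpha_bar[t], alpha_bar[t - 1]
    coef_x0 = np.sqrt(ab_prev) * betas[t] / (1.0 - ab_t)
    coef_xt = np.sqrt(alphas[t]) * (1.0 - ab_prev) / (1.0 - ab_t)
    var = betas[t] * (1.0 - ab_prev) / (1.0 - ab_t)
    mean = coef_x0 * x0_hat + coef_xt * x_t
    return mean + np.sqrt(var) * rng.standard_normal(x_t.shape)

def sample_x0_pred(denoiser, shape, n_steps=200, project=None, k=10, seed=0):
    """Reverse diffusion where the network predicts x0 rather than the noise."""
    rng = np.random.default_rng(seed)
    betas, alphas, alpha_bar = make_schedule(n_steps)
    x = rng.standard_normal(shape)
    for t in reversed(range(n_steps)):
        x0_hat = denoiser(x, t)                  # predicted clean coefficients
        if project is not None and t % k == 0:   # includes the final step t == 0
            x0_hat = project(x0_hat)
        x = posterior_step(x, x0_hat, t, betas, alphas, alpha_bar, rng)
    return x

# Hypothetical stand-in denoiser that always predicts the same clean target.
target = np.full((1, 104, 8), 0.5)
out = sample_x0_pred(lambda x_t, t: target, target.shape,
                     project=lambda x: np.clip(x, -5.0, 5.0))
print(np.allclose(out, target))
```

With an ε-prediction network the projection would first require converting the noise estimate back to x̂₀; predicting x₀ directly makes that estimate available at every step for free.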

### Dual-Denoiser Hierarchy

Stage 1 captures **aggregate macro patterns** (overall media effectiveness, seasonality), while Stage 2 specializes in **per-channel coefficient dynamics** conditioned on those patterns. This hierarchical decomposition mirrors the Kimodo root→body split.

## Training Data

Synthetic MMM data generated with realistic patterns:
- **5 media channels**: TV, Digital, Social, Print, Radio
- **3 control variables**: Seasonality, Trend, Competitor Price
- **Adstock transformation**: Geometric decay with α ~ Beta(2,2)
- **Hill saturation**: With EC50 ~ LogNormal and slope ~ Uniform[0.5, 3]
- **Time-varying coefficients**: Ornstein-Uhlenbeck process (a mean-reverting stochastic process)
- **500 training scenarios**, 104 weeks each
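
The ingredients listed above can be sketched as follows. This is a NumPy illustration under stated assumptions, not the `MMMDataGenerator` implementation: the function names, the gamma-distributed spend, the LogNormal parameters, and the OU parameters `theta` and `sigma` are all hypothetical.

```python
import numpy as np

def geometric_adstock(spend, alpha):
    """Carry-over effect: a_t = x_t + alpha * a_{t-1}."""
    out = np.zeros(len(spend), dtype=float)
    carry = 0.0
    for t, x in enumerate(spend):
        carry = x + alpha * carry
        out[t] = carry
    return out

def hill(x, ec50, slope):
    """Hill saturation: x^s / (x^s + ec50^s), equal to 0.5 at x == ec50."""
    xs = np.power(x, slope)
    return xs / (xs + ec50 ** slope)

def ou_path(n, mean, theta, sigma, rng, dt=1.0):
    """Euler-Maruyama discretisation of dX = theta * (mean - X) dt + sigma dW."""
    x = np.empty(n)
    x[0] = mean
    for t in range(1, n):
        drift = theta * (mean - x[t - 1]) * dt
        x[t] = x[t - 1] + drift + sigma * np.sqrt(dt) * rng.standard_normal()
    return x

rng = np.random.default_rng(42)
T = 104
spend = rng.gamma(shape=2.0, scale=50.0, size=T)      # hypothetical weekly spend
alpha = rng.beta(2, 2)                                # adstock decay, Beta(2,2)
ec50 = rng.lognormal(mean=np.log(100.0), sigma=0.3)   # hypothetical LogNormal parameters
slope = rng.uniform(0.5, 3.0)                         # Hill slope, Uniform[0.5, 3]
saturated = hill(geometric_adstock(spend, alpha), ec50, slope)
beta_t = ou_path(T, mean=1.0, theta=0.1, sigma=0.05, rng=rng)
contribution = beta_t * saturated                     # one channel's weekly contribution
```

The mean reversion in `ou_path` keeps each coefficient fluctuating around its long-run level instead of drifting without bound, which is the property that distinguishes it from a plain random walk.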

## Losses

```
L_total = L_campaign + L_channel + 0.1·L_smooth + 0.01·L_sign

L_campaign = MSE(agg_pred, agg_target)       — Stage 1 x₀-prediction
L_channel  = MSE(coeff_pred, coeff_target)    — Stage 2 x₀-prediction  
L_smooth   = MSE(Δcoeff_pred, Δcoeff_target)  — Temporal smoothness (≈ velocity loss)
L_sign     = ReLU(-β_media_log - 5)           — Soft positivity
```
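
As a rough illustration of how these terms combine, the following NumPy sketch computes `L_total` for arrays of shape `(batch, weeks, features)`. The function names, the mean reduction, the aggregate feature width, and the media-first column layout are assumptions; `L_sign` follows the unsquared form written above.

```python
import numpy as np

def mse(a, b):
    return float(np.mean((a - b) ** 2))

def total_loss(agg_pred, agg_target, coeff_pred, coeff_target,
               n_media=5, threshold=5.0, w_smooth=0.1, w_sign=0.01):
    """Weighted sum of the four terms; returns (total, per-term dict)."""
    l_campaign = mse(agg_pred, agg_target)
    l_channel = mse(coeff_pred, coeff_target)
    # First differences along the time axis play the role of velocities.
    l_smooth = mse(np.diff(coeff_pred, axis=1), np.diff(coeff_target, axis=1))
    # Nonzero only where a log-space media value falls below -threshold.
    media_log = coeff_pred[..., :n_media]
    l_sign = float(np.mean(np.maximum(-media_log - threshold, 0.0)))
    total = l_campaign + l_channel + w_smooth * l_smooth + w_sign * l_sign
    parts = {"campaign": l_campaign, "channel": l_channel,
             "smooth": l_smooth, "sign": l_sign}
    return total, parts

# Hypothetical shapes: 2 scenarios, 104 weeks, 4 aggregate features, 8 coefficients.
agg = np.zeros((2, 104, 4))
coeff = np.zeros((2, 104, 8))
perfect_total, _ = total_loss(agg, agg, coeff, coeff)

bad = coeff.copy()
bad[0, 0, 0] = -7.0   # one log-space media value below -threshold
bad_total, bad_parts = total_loss(agg, agg, bad, coeff)
print(perfect_total, bad_parts["sign"])
```

A single outlier moves three terms at once: it raises `L_channel` and `L_smooth` through the squared error and its first difference, and it activates `L_sign` because it lies below `-threshold`.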

## Results (PoC, CPU training, 30 epochs)

- **Final training loss**: 0.129
- **Media positivity constraint**: ✅ 100% satisfied (all generated media coefficients > 0)
- **Model size**: 2.7M parameters
- **Generation time**: ~2.6s per scenario (200 diffusion steps on CPU)

## Usage

```python
from mmm_diffusion import MMMDiffusionModel, MMMDataGenerator, MMMDiffusionDataset

# Generate synthetic data
gen = MMMDataGenerator(n_weeks=104, seed=42)
samples = gen.generate_dataset(100)

# Build model
model = MMMDiffusionModel(n_media=5, n_ctrl=3, T_diff=200)

# Train
dataset = MMMDiffusionDataset(samples, normalize=True)
# ... (see mmm_diffusion.py for full training loop)

# Generate coefficients for new conditioning data
conditioning = ...  # (1, T, 9) tensor: [media_spend, controls, total_sales]
coefficients = model.sample(conditioning, n_steps=200)
decoded = dataset.decode_coefficients(coefficients)
# decoded[:, :, :5] are GUARANTEED positive (media channels)
```

## Files

- `mmm_diffusion.py` — Full implementation (data generation, model, training, evaluation, visualization)
- `mmm_diffusion_model.pt` — Trained model checkpoint (PoC, 30 epochs on CPU)
- `training_history.png` — Training loss curves
- `coeff_comparison.png` — True vs predicted coefficients on validation sample
- `sales_decomposition.png` — Sales decomposition visualization

## References

- **GMD** (arxiv:2305.12577) — Two-stage trajectory + body diffusion (closest public analog to Kimodo)
- **MDM** (arxiv:2209.14916) — Transformer denoiser, x₀-prediction, geometric losses
- **PhysDiff** (arxiv:2212.02500) — Physics-based constraint projection during denoising
- **PDM** (arxiv:2402.03559) — Projected diffusion for hard constraint satisfaction
- **NNN** (arxiv:2504.06212) — Neural network MMM architecture (Google)
- **TabDDPM** (arxiv:2209.15421) — Diffusion models for tabular data

## License

MIT