# MMM-Diffusion: Marketing Mix Modeling via Dual-Denoiser Diffusion

A generative diffusion model for **Marketing Mix Modeling (MMM)** that predicts time-varying coefficients for sales decomposition. Adapted from NVIDIA's Kimodo/GMD dual-denoiser architecture.

## v2 Fixes (from v1)

### Problem 1: Sales Alignment (predicted sales didn't match total sales)
**Root cause**: v1 had `loss_sales = 0.0` — no gradient signal for sales reconstruction.
**Fix**: Added a differentiable sales reconstruction loss (`L_sales`) whose gradient flows through the coefficient → contribution → total sales path. A warmup schedule keeps the first 25% of epochs focused on core coefficient denoising, then ramps the sales loss in.
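
The warmup can be pictured as a weight that stays at zero for the first quarter of training and then rises to one. The sketch below is one possible form, not the code in `mmm_diffusion_v2.py`: the function name, the linear ramp shape, and `ramp_frac` (how long the ramp lasts) are assumptions, since the README only fixes the 25% warmup.

```python
def aux_weight(epoch: int, n_epochs: int, warmup_frac: float = 0.25,
               ramp_frac: float = 0.25) -> float:
    """Weight on the auxiliary losses at a given 0-based epoch.

    Returns 0 during warmup, then rises linearly to 1 over `ramp_frac`
    of training. The linear shape and `ramp_frac` are assumptions.
    """
    progress = epoch / max(n_epochs, 1)
    if progress < warmup_frac:
        return 0.0          # core denoising only
    if ramp_frac <= 0:
        return 1.0          # no ramp: switch auxiliary losses on at once
    return min(1.0, (progress - warmup_frac) / ramp_frac)


if __name__ == "__main__":
    for epoch in (0, 37, 56, 75, 149):
        print(epoch, round(aux_weight(epoch, n_epochs=150), 3))
```

With 150 epochs and these assumed defaults, the weight is zero through epoch 37 and reaches one at epoch 75.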

### Problem 2: Coefficients Too Smooth (compared to GT)
**Root cause**: The smoothness loss weight (0.1) was too high relative to the reconstruction loss, and the ground-truth (GT) coefficients were not volatile enough (Ornstein-Uhlenbeck (OU) process with σ=0.05).
**Fixes**:
1. **Spectral loss** (`L_spectral`): Log-magnitude FFT loss that penalizes frequency spectrum differences, with higher weights on high frequencies to fight smoothing
2. **Multi-scale temporal loss**: Matches 1st AND 2nd order temporal derivatives (velocity + acceleration)
3. **Higher GT volatility**: Increased OU volatility (0.05→0.12 for media, 0.03→0.08 for controls) + regime-change jumps
4. **Contribution matching loss**: Directly matches predicted channel-level contributions to GT
5. **Reduced smoothness weight**: 0.1 → 0.05
6. **Loss warmup**: Core denoising trained first, auxiliary losses ramped in after 25% of training
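
To make fix 3 concrete, here is a minimal sketch of a mean-reverting OU path with occasional regime-change jumps. Only the volatility values (0.05 and 0.12) come from this README. The function name, the mean-reversion rate `theta`, the jump probability, and the jump size are illustrative assumptions, and `MMMDataGenerator` may well generate its paths differently.

```python
import numpy as np


def simulate_ou_with_jumps(n_weeks=104, mean=1.0, theta=0.1, sigma=0.12,
                           jump_prob=0.02, jump_scale=0.3, seed=0):
    """Discrete-time OU path whose long-run level occasionally jumps.

    theta, jump_prob and jump_scale are hypothetical settings.
    """
    rng = np.random.default_rng(seed)
    beta = np.empty(n_weeks)
    beta[0] = mean
    level = mean  # long-run level the process reverts to
    for t in range(1, n_weeks):
        if rng.random() < jump_prob:
            level += rng.normal(0.0, jump_scale)  # regime change shifts the level
        drift = theta * (level - beta[t - 1])     # pull back toward the level
        beta[t] = beta[t - 1] + drift + sigma * rng.standard_normal()
    return beta


if __name__ == "__main__":
    v1_like = simulate_ou_with_jumps(sigma=0.05, jump_prob=0.0)
    v2_like = simulate_ou_with_jumps(sigma=0.12)
    print(np.diff(v1_like).std(), np.diff(v2_like).std())
```

This produces a raw coefficient path. Keeping media coefficients positive is a separate concern that the model handles in log-space.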

## Architecture

```
┌──────────────────┐   ┌──────────────────────────────────────────┐
│  CONDITIONING     │   │  STAGE 1: Campaign/Geo Denoiser          │
│                   │   │  (≈ Kimodo Root Denoiser)                │
│  • Media Spend    │──▶│  Transformer (4 layers, d=192)           │
│    (5 channels)   │   │  Denoises aggregate patterns             │
│  • Controls       │   │  from controls + total sales             │
│    (3 variables)  │   └──────────────┬───────────────────────────┘
│  • Total Sales    │                  │ Campaign Context
└──────────────────┘                   ▼
                    ┌──────────────────────────────────────────────┐
                    │  STAGE 2: Channel Denoiser                   │
                    │  (≈ Kimodo Body Denoiser)                    │
                    │  Cross-Attention + Transformer (6 layers, d=256)│
                    │                                              │
                    │  CONSTRAINTS:                                │
                    │  • Log-space for media (exp → always ≥ 0)    │
                    │  • PhysDiff-style projection every K steps   │
                    │  • Soft sign penalty loss                    │
                    └──────────────┬───────────────────────────────┘
                                   ▼
                    ┌──────────────────────────────────────────────┐
                    │  OUTPUT: Time-Varying Coefficients (T, 8)    │
                    │  β_TV, β_Digital, β_Social, β_Print, β_Radio │
                    │  β_Seasonality, β_Trend, β_CompetitorPrice   │
                    │  → Sales = base + Σ β_m·Hill(Adstock(x))     │
                    │          + Σ β_c·ctrl + noise                │
                    └──────────────────────────────────────────────┘
```
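
The decomposition at the bottom of the diagram, `Sales = base + Σ β_m·Hill(Adstock(x)) + Σ β_c·ctrl`, can be sketched in NumPy as follows. The README does not say which adstock or Hill parameterization is used, so the geometric adstock, the `decay`, `half_sat`, and `shape` parameters, and all function names are assumptions. The noise term is left out because it belongs to data generation, not to reconstruction.

```python
import numpy as np


def geometric_adstock(spend, decay=0.5):
    """Carry-over effect: a[t] = spend[t] + decay * a[t-1]. spend: (T, M)."""
    out = np.zeros_like(spend, dtype=float)
    carry = np.zeros(spend.shape[1])
    for t in range(spend.shape[0]):
        carry = spend[t] + decay * carry
        out[t] = carry
    return out


def hill(x, half_sat=1.0, shape=2.0):
    """Saturation curve x^s / (x^s + k^s); equals 0.5 at x == half_sat."""
    xs = np.power(x, shape)
    return xs / (xs + half_sat ** shape)


def reconstruct_sales(beta_media, beta_ctrl, spend, controls, base,
                      decay=0.5, half_sat=1.0, shape=2.0):
    """Return total sales (T,) plus per-channel contributions."""
    saturated = hill(geometric_adstock(spend, decay), half_sat, shape)
    media_contrib = beta_media * saturated   # (T, M), time-varying betas
    ctrl_contrib = beta_ctrl * controls      # (T, C)
    sales = base + media_contrib.sum(axis=1) + ctrl_contrib.sum(axis=1)
    return sales, media_contrib, ctrl_contrib


if __name__ == "__main__":
    T, M, C = 104, 5, 3
    rng = np.random.default_rng(0)
    sales, mc, cc = reconstruct_sales(
        beta_media=np.full((T, M), 0.8), beta_ctrl=np.full((T, C), 0.2),
        spend=rng.uniform(0, 2, (T, M)), controls=rng.normal(size=(T, C)),
        base=10.0)
    print(sales[:3], mc.shape, cc.shape)
```

Because every operation here has a differentiable PyTorch counterpart, the same chain can carry gradients from a sales error back to the predicted coefficients.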

## Kimodo → MMM Mapping

| Kimodo (Motion Generation)     | MMM-Diffusion (Marketing)                     |
|-------------------------------|------------------------------------------------|
| Text prompts                  | Media spend, non-marketing vars, total sales   |
| Motion/position constraints   | Sign constraints (β_media ≥ 0) + prior bounds  |
| Root denoiser (trajectory)    | Campaign/Geo denoiser (aggregate patterns)     |
| Body denoiser (joint angles)  | Channel denoiser (per-channel coefficients)    |
| Skeleton positions/rotations  | Time-varying coefficients for decomposition    |
| Foot contact constraints      | Media positivity constraint                    |
| Velocity loss                 | Multi-scale temporal loss                      |
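
The sign constraints and prior bounds in the table can be illustrated with a small sketch: media coefficients are decoded from log-space, and a projection clips coefficients to a box every `K` sampling steps. The bound values, `K`, and the function names are hypothetical. The README only states that media coefficients live in log-space and that a PhysDiff-style projection runs every K steps.

```python
import numpy as np


def decode_media(beta_media_log):
    """exp() maps any real log-coefficient to a strictly positive value."""
    return np.exp(beta_media_log)


def project_to_bounds(coeffs, n_media=5, media_max=5.0, ctrl_bounds=(-3.0, 3.0)):
    """Clip decoded coefficients (..., 8) to assumed prior bounds."""
    out = np.array(coeffs, dtype=float, copy=True)
    out[..., :n_media] = np.clip(out[..., :n_media], 0.0, media_max)
    out[..., n_media:] = np.clip(out[..., n_media:], *ctrl_bounds)
    return out


def maybe_project(coeffs, step, K=10):
    """Apply the projection only on every K-th sampling step."""
    return project_to_bounds(coeffs) if step % K == 0 else coeffs


if __name__ == "__main__":
    x = np.array([[-1.0, 10.0, 0.5, 0.5, 0.5, -9.0, 9.0, 0.0]])
    print(maybe_project(x, step=20))
    print(decode_media(np.array([-10.0, 0.0, 3.0])))
```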

## Losses (v2)

```
L_total = L_campaign + 2·L_channel + 0.5·L_spectral + 0.1·L_temporal
        + aux_weight · (0.2·L_sales + 0.2·L_contrib) + 0.01·L_sign

where aux_weight ramps from 0→1 after 25% warmup

L_campaign  = MSE(agg_pred, agg_target)           — Stage 1 x₀-prediction
L_channel   = MSE(coeff_pred, coeff_target)        — Stage 2 x₀-prediction (PRIMARY)
L_spectral  = MSE(log|FFT(pred)|, log|FFT(target)|) — Frequency preservation
L_temporal  = MSE(Δ¹pred, Δ¹target) + 0.5·MSE(Δ²pred, Δ²target) — Multi-scale
L_sales     = MSE(pred_sales/scale, actual_sales/scale)  — Sales reconstruction
L_contrib   = MSE(pred_contrib, true_contrib)      — Channel contribution matching
L_sign      = ReLU(-β_media_log - 5)               — Soft positivity
```
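
As a rough guide to how these terms could be computed, the sketch below implements `L_spectral`, `L_temporal`, `L_sign`, and the weighted total in NumPy. It is not the training code: a real training loop would need autograd-capable equivalents (for example `torch.fft.rfft`). The mean reduction, the `eps` inside the log, and the optional `hf_boost` frequency weighting are assumptions. Setting `hf_boost=0` gives the plain formula shown above.

```python
import numpy as np


def mse(a, b):
    return float(np.mean((a - b) ** 2))


def spectral_loss(pred, target, eps=1e-8, hf_boost=0.0):
    """MSE between log-magnitude spectra along time. pred/target: (T, D)."""
    p = np.log(np.abs(np.fft.rfft(pred, axis=0)) + eps)
    q = np.log(np.abs(np.fft.rfft(target, axis=0)) + eps)
    # Assumed weighting: grows linearly from 1 (lowest) to 1 + hf_boost (highest).
    w = 1.0 + hf_boost * np.linspace(0.0, 1.0, p.shape[0])[:, None]
    return float(np.mean(w * (p - q) ** 2))


def temporal_loss(pred, target):
    """Match first differences (velocity) and second differences (acceleration)."""
    d1 = mse(np.diff(pred, n=1, axis=0), np.diff(target, n=1, axis=0))
    d2 = mse(np.diff(pred, n=2, axis=0), np.diff(target, n=2, axis=0))
    return d1 + 0.5 * d2


def sign_penalty(beta_media_log):
    """ReLU(-beta_log - 5): only log-coefficients below -5 are penalized."""
    return float(np.mean(np.maximum(-beta_media_log - 5.0, 0.0)))


def total_loss(parts, aux_weight):
    """Combine the seven components with the weights listed above."""
    return (parts["campaign"] + 2.0 * parts["channel"]
            + 0.5 * parts["spectral"] + 0.1 * parts["temporal"]
            + aux_weight * (0.2 * parts["sales"] + 0.2 * parts["contrib"])
            + 0.01 * parts["sign"])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    target = rng.normal(size=(104, 8))
    smooth = (target + np.roll(target, 1, axis=0)) / 2  # crude over-smoothing
    print(spectral_loss(smooth, target), temporal_loss(smooth, target))
```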

## Sampling

Two samplers are available:
- **DDPM** (500 steps): Stochastic, well-calibrated temporal variation
- **DDIM** (50-100 steps): 5-10x faster, deterministic (eta=0)
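
For readers unfamiliar with DDIM, the sketch below shows the standard update for an x₀-predicting denoiser on a cosine schedule, visiting 50 of the 500 training timesteps. It is a generic illustration, not the body of `model.sample_ddim`. `predict_x0` is a hypothetical stand-in for the trained denoiser, and the clipping of the schedule is an assumption made for numerical safety.

```python
import numpy as np


def cosine_alpha_bar(T=500, s=0.008):
    """Cumulative signal level alpha_bar[t] for t = 0..T (cosine schedule)."""
    t = np.arange(T + 1) / T
    f = np.cos((t + s) / (1 + s) * np.pi / 2) ** 2
    return np.clip(f / f[0], 1e-5, 1.0)  # alpha_bar[0] == 1 means no noise


def ddim_step(x_t, x0_pred, ab_t, ab_prev, eta=0.0, rng=None):
    """One DDIM update from noise level ab_t to the less noisy ab_prev."""
    eps = (x_t - np.sqrt(ab_t) * x0_pred) / np.sqrt(1.0 - ab_t)  # implied noise
    sigma = (eta * np.sqrt((1.0 - ab_prev) / (1.0 - ab_t))
             * np.sqrt(1.0 - ab_t / ab_prev))
    direction = np.sqrt(np.maximum(1.0 - ab_prev - sigma ** 2, 0.0)) * eps
    noise = 0.0
    if eta > 0.0:  # eta = 0 is fully deterministic
        if rng is None:
            rng = np.random.default_rng()
        noise = sigma * rng.standard_normal(x_t.shape)
    return np.sqrt(ab_prev) * x0_pred + direction + noise


def ddim_sample(x_T, predict_x0, alpha_bar, n_steps=50, eta=0.0, rng=None):
    """Run DDIM over an evenly spaced subset of the training timesteps."""
    T = len(alpha_bar) - 1
    steps = np.linspace(T, 0, n_steps + 1).round().astype(int)
    x = x_T
    for t, t_prev in zip(steps[:-1], steps[1:]):
        x = ddim_step(x, predict_x0(x, t), alpha_bar[t], alpha_bar[t_prev], eta, rng)
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x0 = rng.normal(size=(104, 8))
    ab = cosine_alpha_bar(500)
    x_T = np.sqrt(ab[-1]) * x0 + np.sqrt(1 - ab[-1]) * rng.normal(size=x0.shape)
    # A perfect "oracle" denoiser is used here only to check the update rule.
    out = ddim_sample(x_T, lambda x, t: x0, ab, n_steps=50)
    print(np.abs(out - x0).max())
```

With 50 steps the denoiser is called 50 times instead of 500, which is where the speedup quoted above comes from.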

## Results (v2, GPU training, 150 epochs)

| Metric | v1 (CPU, 30 epochs) | v2 (GPU, 150 epochs) |
|--------|---------------------|----------------------|
| Final training loss (v2 sums additional loss terms, so not directly comparable) | 0.129 | 0.68 |
| Channel loss | — | 0.14 |
| Media positivity | ✅ 100% | ✅ 100% |
| Temporal variation ratio | 0.2-0.4 (too smooth) | **0.3-1.2** (range now includes 1.0) |
| MAPE (fitted base) | — | **7.0%** |
| Model size | 2.7M | 7.2M |

**Key improvement**: The temporal variation ratio (pred_std / GT_std, where 1.0 means predicted and GT coefficients are equally volatile) moved from 0.2-0.4 to 0.3-1.2. Predicted coefficients are no longer uniformly over-smoothed, although the low end of the range (0.3) is still well below 1.0.
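
The ratio can be computed per coefficient in a few lines. This sketch assumes the standard deviation is taken over the time axis of the decoded coefficient levels. The README does not say whether levels or differences are used, and the function name is illustrative.

```python
import numpy as np


def temporal_variation_ratio(pred, target, eps=1e-12):
    """Per-coefficient std over time of pred divided by that of target.

    pred, target: arrays of shape (T, D). Returns an array of shape (D,).
    """
    return pred.std(axis=0) / (target.std(axis=0) + eps)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    gt = rng.normal(size=(104, 8))
    print(temporal_variation_ratio(0.5 * gt, gt))  # an over-smoothed prediction
```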

**Note on coefficient correlation**: Per-channel correlation with GT is near zero. This is expected — the MMM coefficient recovery problem is fundamentally ill-posed (many coefficient combinations produce similar sales). The diffusion model generates *plausible* coefficient trajectories conditioned on the input data, not deterministic point estimates. For practical use, ensemble multiple samples for uncertainty quantification.
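
One way to ensemble is to draw several samples for the same conditioning and summarize them pointwise. In the sketch below, `sample_fn` is a hypothetical callable that returns one decoded `(T, 8)` coefficient array per call, for instance a wrapper around `model.sample(...)` followed by `dataset.decode_coefficients(...)`. The 5th to 95th percentile band is an arbitrary choice.

```python
import numpy as np


def summarize_ensemble(sample_fn, n_samples=20, lower=5.0, upper=95.0):
    """Draw n_samples coefficient trajectories and summarize them pointwise.

    sample_fn(i) must return an array of shape (T, D) for draw index i.
    """
    draws = np.stack([sample_fn(i) for i in range(n_samples)], axis=0)  # (S, T, D)
    return {
        "mean": draws.mean(axis=0),
        "std": draws.std(axis=0),
        "lower": np.percentile(draws, lower, axis=0),
        "upper": np.percentile(draws, upper, axis=0),
    }


if __name__ == "__main__":
    # Stand-in sampler so the sketch runs without a trained model.
    def fake_sampler(i):
        return np.random.default_rng(i).normal(1.0, 0.1, size=(104, 8))

    summary = summarize_ensemble(fake_sampler, n_samples=30)
    print(summary["mean"].shape, float(summary["std"].mean()))
```

The width of the band between `lower` and `upper` gives a per-week, per-channel indication of how weakly the data pin down each coefficient.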

## Files

- `mmm_diffusion_v2.py` — Full v2 implementation with all fixes
- `mmm_diffusion_v2.pt` — Best model checkpoint (v2, 150 epochs on GPU)
- `training_history_v2.png` — Training loss curves (all 7 loss components)
- `coeff_comparison_v2.png` — True vs predicted time-varying coefficients
- `sales_decomposition_v2.png` — Sales decomposition with R² and MAPE
- `mmm_diffusion.py` — Original v1 implementation (kept for reference)
- `mmm_diffusion_model.pt` — v1 model checkpoint

## Usage

```python
from mmm_diffusion_v2 import MMMDiffusionModel, MMMDataGenerator, MMMDiffusionDataset
import torch

# Generate data
gen = MMMDataGenerator(n_weeks=104, seed=42)
samples = gen.generate_dataset(100)
dataset = MMMDiffusionDataset(samples, normalize=True)

# Build and load model
model = MMMDiffusionModel(n_media=5, n_ctrl=3, d_model_campaign=192,
                          d_model_channel=256, n_layers_campaign=4,
                          n_layers_channel=6, T_diff=500)
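# Checkpoint name as listed under Files; map_location lets a GPU-trained checkpoint load on CPU
ckpt = torch.load('mmm_diffusion_v2.pt', map_location='cpu', weights_only=False)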
model.load_state_dict(ckpt['model_state_dict'])
model.eval()

# Generate coefficients (DDPM)
conditioning = ...  # (1, T, 9) [media_spend, controls, total_sales]
coefficients = model.sample(conditioning, n_steps=500)
decoded = dataset.decode_coefficients(coefficients)
# decoded[:, :, :5] guaranteed positive (media channels)

# Or faster with DDIM (50 steps)
coefficients = model.sample_ddim(conditioning, n_steps=50, eta=0.0)
```

## References

- **GMD** (arxiv:2305.12577) — Two-stage trajectory + body diffusion
- **MDM** (arxiv:2209.14916) — Transformer denoiser, x₀-prediction
- **PhysDiff** (arxiv:2212.02500) — Physics-based constraint projection
- **PDM** (arxiv:2402.03559) — Projected diffusion for hard constraints
- **NNN** (arxiv:2504.06212) — Neural network MMM (Google)
- **Improved DDPM** (arxiv:2102.09672) — Cosine noise schedule
- **DDIM** (arxiv:2010.02502) — Deterministic sampling

## License

MIT