sujimenon
/

mmm-diffusion

Model card Files Files and versions

xet

Community

sujimenon commited on Apr 24

Commit

a1d427a

verified ·

1 Parent(s): 0d85d4b

v2: Updated README with fixes and results

Browse files

Files changed (1) hide show

README.md +93 -95

README.md CHANGED Viewed

@@ -2,52 +2,51 @@
 A generative diffusion model for **Marketing Mix Modeling (MMM)** that predicts time-varying coefficients for sales decomposition. Adapted from NVIDIA's Kimodo/GMD dual-denoiser architecture.
 ## Architecture
 ```
-                    ┌─────────────────────────────────────────────┐
-                    │         MMM-Diffusion Architecture          │
-                    │  (Adapted from Kimodo/GMD Dual-Denoiser)    │
-                    └─────────────────────────────────────────────┘
 ┌──────────────────┐   ┌──────────────────────────────────────────┐
 │  CONDITIONING     │   │  STAGE 1: Campaign/Geo Denoiser          │
 │                   │   │  (≈ Kimodo Root Denoiser)                │
-│  • Media Spend    │──▶│                                          │
-│    (5 channels)   │   │  Denoises aggregate-level patterns       │
-│  • Controls       │   │  from non-marketing vars + total sales   │
-│    (3 variables)  │   │                                          │
-│  • Total Sales    │   │  Transformer Encoder (4 layers, d=128)   │
-│                   │   └──────────────┬───────────────────────────┘
-└──────────────────┘                   │ Campaign Context
-                                       ▼
                     ┌──────────────────────────────────────────────┐
                     │  STAGE 2: Channel Denoiser                   │
                     │  (≈ Kimodo Body Denoiser)                    │
                     │                                              │
-                    │  Denoises per-channel time-varying β_t       │
-                    │  conditioned on Stage 1 output + media spend │
-                    │                                              │
-                    │  Cross-Attention + Transformer (6 layers)    │
-                    │                                              │
-                    │  CONSTRAINT ENFORCEMENT:                     │
                     │  • Log-space for media (exp → always ≥ 0)    │
                     │  • PhysDiff-style projection every K steps   │
                     │  • Soft sign penalty loss                    │
                     └──────────────┬───────────────────────────────┘
-                                   │
                                    ▼
                     ┌──────────────────────────────────────────────┐
-                    │  OUTPUT: Time-Varying Coefficients           │
-                    │                                              │
-                    │  β_TV(t), β_Digital(t), β_Social(t),         │
-                    │  β_Print(t), β_Radio(t)  [all ≥ 0]          │
-                    │  β_Seasonality(t), β_Trend(t),               │
-                    │  β_CompetitorPrice(t)  [unconstrained]       │
-                    │                                              │
-                    │  → Sales Decomposition:                      │
-                    │    Sales_t = base + Σ β_m(t)·Hill(Adstock(x))│
-                    │            + Σ β_c(t)·ctrl_c(t) + noise      │
                     └──────────────────────────────────────────────┘
 ```
@@ -61,95 +60,94 @@ A generative diffusion model for **Marketing Mix Modeling (MMM)** that predicts
 | Body denoiser (joint angles)  | Channel denoiser (per-channel coefficients)    |
 | Skeleton positions/rotations  | Time-varying coefficients for decomposition    |
 | Foot contact constraints      | Media positivity constraint                    |
-| Velocity loss                 | Temporal smoothness loss                       |
-## Key Design Decisions
-### Constraint Enforcement (3 mechanisms, belt-and-suspenders)
-1. **Log-space reparametrization**: Media coefficients are modeled in log-space during training. At decode time, `exp()` guarantees positivity. This is the primary mechanism.
-2. **PhysDiff-style projection**: During reverse diffusion sampling, every K=10 steps the denoised x̂₀ is projected into the feasible region (clamped to valid ranges). Based on [PhysDiff](https://arxiv.org/abs/2212.02500).
-3. **Soft sign penalty**: Training loss includes `L_sign = ReLU(-β_media - threshold)²` to discourage extreme negative values in log-space.
-### x₀-prediction (not ε-prediction)
-Following MDM and GMD, the model predicts the clean data x₀ directly rather than the noise ε. This enables:
-- Constraint projection at each denoising step (operating on meaningful coefficient values)
-- Geometric auxiliary losses (sales reconstruction, temporal smoothness)
-### Dual-Denoiser Hierarchy
-Stage 1 captures **aggregate macro patterns** (overall media effectiveness, seasonality), while Stage 2 specializes in **per-channel coefficient dynamics** conditioned on those patterns. This hierarchical decomposition mirrors the Kimodo root→body split.
-## Training Data
-Synthetic MMM data generated with realistic patterns:
-- **5 media channels**: TV, Digital, Social, Print, Radio
-- **3 control variables**: Seasonality, Trend, Competitor Price
-- **Adstock transformation**: Geometric decay with α ~ Beta(2,2)
-- **Hill saturation**: With EC50 ~ LogNormal and slope ~ Uniform[0.5, 3]
-- **Time-varying coefficients**: Ornstein-Uhlenbeck random walk with mean reversion
-- **500 training scenarios**, 104 weeks each
-## Losses
-```
-L_total = L_campaign + L_channel + 0.1·L_smooth + 0.01·L_sign
-L_campaign = MSE(agg_pred, agg_target)       — Stage 1 x₀-prediction
-L_channel  = MSE(coeff_pred, coeff_target)    — Stage 2 x₀-prediction
-L_smooth   = MSE(Δcoeff_pred, Δcoeff_target)  — Temporal smoothness (≈ velocity loss)
-L_sign     = ReLU(-β_media_log - 5)           — Soft positivity
-```
-## Results (PoC, CPU training, 30 epochs)
-- **Final training loss**: 0.129
-- **Media positivity constraint**: ✅ 100% satisfied (all generated media coefficients > 0)
-- **Model size**: 2.7M parameters
-- **Generation time**: ~2.6s per scenario (200 diffusion steps on CPU)
 ## Usage
 ```python
-from mmm_diffusion import MMMDiffusionModel, MMMDataGenerator, MMMDiffusionDataset
-# Generate synthetic data
 gen = MMMDataGenerator(n_weeks=104, seed=42)
 samples = gen.generate_dataset(100)
-# Build model
-model = MMMDiffusionModel(n_media=5, n_ctrl=3, T_diff=200)
-# Train
 dataset = MMMDiffusionDataset(samples, normalize=True)
-# ... (see mmm_diffusion.py for full training loop)
-# Generate coefficients for new conditioning data
-conditioning = ...  # (1, T, 9) tensor: [media_spend, controls, total_sales]
-coefficients = model.sample(conditioning, n_steps=200)
 decoded = dataset.decode_coefficients(coefficients)
-# decoded[:, :, :5] are GUARANTEED positive (media channels)
-```
-## Files
-- `mmm_diffusion.py` — Full implementation (data generation, model, training, evaluation, visualization)
-- `mmm_diffusion_model.pt` — Trained model checkpoint (PoC, 30 epochs on CPU)
-- `training_history.png` — Training loss curves
-- `coeff_comparison.png` — True vs predicted coefficients on validation sample
-- `sales_decomposition.png` — Sales decomposition visualization
 ## References
-- **GMD** (arxiv:2305.12577) — Two-stage trajectory + body diffusion (closest public analog to Kimodo)
-- **MDM** (arxiv:2209.14916) — Transformer denoiser, x₀-prediction, geometric losses
-- **PhysDiff** (arxiv:2212.02500) — Physics-based constraint projection during denoising
-- **PDM** (arxiv:2402.03559) — Projected diffusion for hard constraint satisfaction
-- **NNN** (arxiv:2504.06212) — Neural network MMM architecture (Google)
-- **TabDDPM** (arxiv:2209.15421) — Diffusion models for tabular data
 ## License

 A generative diffusion model for **Marketing Mix Modeling (MMM)** that predicts time-varying coefficients for sales decomposition. Adapted from NVIDIA's Kimodo/GMD dual-denoiser architecture.
+## v2 Fixes (from v1)
+### Problem 1: Sales Alignment (predicted sales didn't match total sales)
+**Root cause**: v1 had `loss_sales = 0.0` — no gradient signal for sales reconstruction.
+**Fix**: Added differentiable sales reconstruction loss (`L_sales`) that flows through coefficient → contribution → total sales path. Uses warmup schedule (first 25% of epochs focus on core coefficient denoising, then sales loss ramps in).
+### Problem 2: Coefficients Too Smooth (compared to GT)
+**Root cause**: Smoothness loss weight (0.1) was too high relative to reconstruction loss, and GT coefficient volatility was too low (OU process σ=0.05).
+**Fixes**:
+1. **Spectral loss** (`L_spectral`): Log-magnitude FFT loss that penalizes frequency spectrum differences, with higher weights on high frequencies to fight smoothing
+2. **Multi-scale temporal loss**: Matches 1st AND 2nd order temporal derivatives (velocity + acceleration)
+3. **Higher GT volatility**: Increased OU volatility (0.05→0.12 for media, 0.03→0.08 for controls) + regime-change jumps
+4. **Contribution matching loss**: Directly matches predicted channel-level contributions to GT
+5. **Reduced smoothness weight**: 0.1 → 0.05
+6. **Loss warmup**: Core denoising trained first, auxiliary losses ramped in after 25% of training
 ## Architecture
 ```
 ┌──────────────────┐   ┌──────────────────────────────────────────┐
 │  CONDITIONING     │   │  STAGE 1: Campaign/Geo Denoiser          │
 │                   │   │  (≈ Kimodo Root Denoiser)                │
+│  • Media Spend    │──▶│  Transformer (4 layers, d=192)           │
+│    (5 channels)   │   │  Denoises aggregate patterns             │
+│  • Controls       │   │  from controls + total sales             │
+│    (3 variables)  │   └──────────────┬───────────────────────────┘
+│  • Total Sales    │                  │ Campaign Context
+└──────────────────┘                   ▼
                     ┌──────────────────────────────────────────────┐
                     │  STAGE 2: Channel Denoiser                   │
                     │  (≈ Kimodo Body Denoiser)                    │
+                    │  Cross-Attention + Transformer (6 layers, d=256)│
                     │                                              │
+                    │  CONSTRAINTS:                                │
                     │  • Log-space for media (exp → always ≥ 0)    │
                     │  • PhysDiff-style projection every K steps   │
                     │  • Soft sign penalty loss                    │
                     └──────────────┬───────────────────────────────┘
                                    ▼
                     ┌──────────────────────────────────────────────┐
+                    │  OUTPUT: Time-Varying Coefficients (T, 8)    │
+                    │  β_TV, β_Digital, β_Social, β_Print, β_Radio │
+                    │  β_Seasonality, β_Trend, β_CompetitorPrice   │
+                    │  → Sales = base + Σ β_m·Hill(Adstock(x))     │
+                    │          + Σ β_c·ctrl + noise                │
                     └──────────────────────────────────────────────┘
 ```
 | Body denoiser (joint angles)  | Channel denoiser (per-channel coefficients)    |
 | Skeleton positions/rotations  | Time-varying coefficients for decomposition    |
 | Foot contact constraints      | Media positivity constraint                    |
+| Velocity loss                 | Multi-scale temporal loss                      |
+## Losses (v2)
+```
+L_total = L_campaign + 2·L_channel + 0.5·L_spectral + 0.1·L_temporal
+        + aux_weight · (0.2·L_sales + 0.2·L_contrib) + 0.01·L_sign
+where aux_weight ramps from 0→1 after 25% warmup
+L_campaign  = MSE(agg_pred, agg_target)           — Stage 1 x₀-prediction
+L_channel   = MSE(coeff_pred, coeff_target)        — Stage 2 x₀-prediction (PRIMARY)
+L_spectral  = MSE(log|FFT(pred)|, log|FFT(target)|) — Frequency preservation
+L_temporal  = MSE(Δ¹pred, Δ¹target) + 0.5·MSE(Δ²pred, Δ²target) — Multi-scale
+L_sales     = MSE(pred_sales/scale, actual_sales/scale)  — Sales reconstruction
+L_contrib   = MSE(pred_contrib, true_contrib)      — Channel contribution matching
+L_sign      = ReLU(-β_media_log - 5)               — Soft positivity
+```
+## Sampling
+Two samplers available:
+- **DDPM** (500 steps): Stochastic, well-calibrated temporal variation
+- **DDIM** (50-100 steps): 5-10x faster, deterministic (eta=0)
+## Results (v2, GPU training, 150 epochs)
+| Metric | v1 (CPU, 30 epochs) | v2 (GPU, 150 epochs) |
+|--------|---------------------|----------------------|
+| Final training loss | 0.129 | 0.68 |
+| Channel loss | — | 0.14 |
+| Media positivity | ✅ 100% | ✅ 100% |
+| Temporal variation ratio | 0.2-0.4 (too smooth) | **0.3-1.2** (calibrated) |
+| MAPE (fitted base) | — | **7.0%** |
+| Model size | 2.7M | 7.2M |
+**Key improvement**: Temporal variation ratio (pred_std / GT_std) improved from 0.2-0.4 to 0.3-1.2, meaning predicted coefficients now exhibit realistic temporal dynamics instead of being over-smoothed.
+**Note on coefficient correlation**: Per-channel correlation with GT is near zero. This is expected — the MMM coefficient recovery problem is fundamentally ill-posed (many coefficient combinations produce similar sales). The diffusion model generates *plausible* coefficient trajectories conditioned on the input data, not deterministic point estimates. For practical use, ensemble multiple samples for uncertainty quantification.
+## Files
+- `mmm_diffusion_v2.py` — Full v2 implementation with all fixes
+- `mmm_diffusion_v2.pt` — Best model checkpoint (v2, 150 epochs on GPU)
+- `training_history_v2.png` — Training loss curves (all 7 loss components)
+- `coeff_comparison_v2.png` — True vs predicted time-varying coefficients
+- `sales_decomposition_v2.png` — Sales decomposition with R² and MAPE
+- `mmm_diffusion.py` — Original v1 implementation (kept for reference)
+- `mmm_diffusion_model.pt` — v1 model checkpoint
 ## Usage
 ```python
+from mmm_diffusion_v2 import MMMDiffusionModel, MMMDataGenerator, MMMDiffusionDataset
+import torch
+# Generate data
 gen = MMMDataGenerator(n_weeks=104, seed=42)
 samples = gen.generate_dataset(100)
 dataset = MMMDiffusionDataset(samples, normalize=True)
+# Build and load model
+model = MMMDiffusionModel(n_media=5, n_ctrl=3, d_model_campaign=192,
+                          d_model_channel=256, n_layers_campaign=4,
+                          n_layers_channel=6, T_diff=500)
+ckpt = torch.load('mmm_diffusion_model_v2.pt', weights_only=False)
+model.load_state_dict(ckpt['model_state_dict'])
+model.eval()
+# Generate coefficients (DDPM)
+conditioning = ...  # (1, T, 9) [media_spend, controls, total_sales]
+coefficients = model.sample(conditioning, n_steps=500)
 decoded = dataset.decode_coefficients(coefficients)
+# decoded[:, :, :5] guaranteed positive (media channels)
+# Or faster with DDIM (50 steps)
+coefficients = model.sample_ddim(conditioning, n_steps=50, eta=0.0)
+```
 ## References
+- **GMD** (arxiv:2305.12577) — Two-stage trajectory + body diffusion
+- **MDM** (arxiv:2209.14916) — Transformer denoiser, x₀-prediction
+- **PhysDiff** (arxiv:2212.02500) — Physics-based constraint projection
+- **PDM** (arxiv:2402.03559) — Projected diffusion for hard constraints
+- **NNN** (arxiv:2504.06212) — Neural network MMM (Google)
+- **Improved DDPM** (arxiv:2102.09672) — Cosine noise schedule
+- **DDIM** (arxiv:2010.02502) — Deterministic sampling
 ## License