v2: Updated README with fixes and results
README.md
CHANGED
@@ -2,52 +2,51 @@

 A generative diffusion model for **Marketing Mix Modeling (MMM)** that predicts time-varying coefficients for sales decomposition. Adapted from NVIDIA's Kimodo/GMD dual-denoiser architecture.

 ## Architecture

 ```
-┌─────────────────────────────────────────────┐
-│         MMM-Diffusion Architecture          │
-│   (Adapted from Kimodo/GMD Dual-Denoiser)   │
-└─────────────────────────────────────────────┘
-
 ┌──────────────────┐   ┌──────────────────────────────────────────┐
 │   CONDITIONING   │   │      STAGE 1: Campaign/Geo Denoiser      │
 │                  │   │         (≈ Kimodo Root Denoiser)         │
-│ • Media Spend    │──▶│
-│   (5 channels)   │   │  Denoises aggregate
-│ • Controls       │   │  from
-│   (3 variables)  │
-│ • Total Sales    │
-
-└──────────────────┘                  │ Campaign Context
-                                      ▼
 ┌──────────────────────────────────────────────────┐
 │            STAGE 2: Channel Denoiser             │
 │             (≈ Kimodo Body Denoiser)             │
 │                                                  │
-│
-│ conditioned on Stage 1 output + media spend      │
-│                                                  │
-│ Cross-Attention + Transformer (6 layers)         │
-│                                                  │
-│ CONSTRAINT ENFORCEMENT:                          │
 │  • Log-space for media (exp → always ≥ 0)        │
 │  • PhysDiff-style projection every K steps       │
 │  • Soft sign penalty loss                        │
 └──────────────┬───────────────────────────────────┘
-               │
                ▼
 ┌──────────────────────────────────────────────────┐
-│ OUTPUT: Time-Varying Coefficients
-│
-│ β
-│
-│
-│  β_CompetitorPrice(t) [unconstrained]            │
-│                                                  │
-│  → Sales Decomposition:                          │
-│    Sales_t = base + Σ β_m(t)·Hill(Adstock(x))    │
-│                   + Σ β_c(t)·ctrl_c(t) + noise   │
 └──────────────────────────────────────────────────┘
 ```

@@ -61,95 +60,94 @@ A generative diffusion model for **Marketing Mix Modeling (MMM)** that predicts
 | Body denoiser (joint angles) | Channel denoiser (per-channel coefficients) |
 | Skeleton positions/rotations | Time-varying coefficients for decomposition |
 | Foot contact constraints | Media positivity constraint |
-| Velocity loss |
-
-## Key Design Decisions
-
-### Constraint Enforcement (3 mechanisms, belt-and-suspenders)
-
-1. **Log-space reparametrization**: Media coefficients are modeled in log-space during training. At decode time, `exp()` guarantees positivity. This is the primary mechanism.
-
-2. **PhysDiff-style projection**: During reverse diffusion sampling, every K=10 steps the denoised x̂₀ is projected into the feasible region (clamped to valid ranges). Based on [PhysDiff](https://arxiv.org/abs/2212.02500).
-
-3. **Soft sign penalty**: Training loss includes `L_sign = ReLU(-β_media - threshold)²` to discourage extreme negative values in log-space.
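
The three constraint mechanisms listed in these removed lines can be illustrated with a small, dependency-free sketch. It is not the code from `mmm_diffusion.py`: the function names, the clamp bounds `lo`/`hi`, and the default `threshold=5.0` are assumptions made for illustration only.

```python
import math


def decode_media(log_beta):
    """Log-space reparametrization: exp() maps any real value to a positive one."""
    return [math.exp(v) for v in log_beta]


def project_x0(x0_hat, lo=-5.0, hi=3.0):
    """Projection simplified to clamping into [lo, hi] (hypothetical bounds)."""
    return [min(max(v, lo), hi) for v in x0_hat]


def should_project(step, k=10):
    """Apply the projection only on every k-th reverse-diffusion step."""
    return step % k == 0


def sign_penalty(log_beta, threshold=5.0):
    """Mean of ReLU(-beta - threshold)^2; zero unless beta < -threshold."""
    return sum(max(-v - threshold, 0.0) ** 2 for v in log_beta) / len(log_beta)


log_beta = [-7.0, -0.5, 0.0, 4.2]      # made-up log-space media coefficients
decoded = decode_media(log_beta)       # always positive
projected = project_x0(log_beta)       # clamped into the assumed valid range
penalty = sign_penalty(log_beta)       # only the -7.0 entry contributes
```

Because `exp()` already guarantees positivity, the projection and the penalty act as secondary safeguards that keep log-space values from drifting to extremes.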
-
-### x₀-prediction (not ε-prediction)

-
-- Constraint projection at each denoising step (operating on meaningful coefficient values)
-- Geometric auxiliary losses (sales reconstruction, temporal smoothness)

-

-

-

-
-- **5 media channels**: TV, Digital, Social, Print, Radio
-- **3 control variables**: Seasonality, Trend, Competitor Price
-- **Adstock transformation**: Geometric decay with α ~ Beta(2,2)
-- **Hill saturation**: With EC50 ~ LogNormal and slope ~ Uniform[0.5, 3]
-- **Time-varying coefficients**: Ornstein-Uhlenbeck random walk with mean reversion
-- **500 training scenarios**, 104 weeks each
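
A mean-reverting Ornstein-Uhlenbeck coefficient path, as named in the list above, could be simulated roughly as follows. This is a minimal sketch using a weekly Euler step; `simulate_ou` and its parameters (`mu`, `theta`, `sigma`) are hypothetical names, and the defaults are illustrative rather than the generator's actual settings.

```python
import random


def simulate_ou(n_steps, mu=1.0, theta=0.1, sigma=0.05, x0=None, seed=0):
    """Discretized OU process: x <- x + theta * (mu - x) + sigma * N(0, 1)."""
    rng = random.Random(seed)
    x = mu if x0 is None else x0
    path = []
    for _ in range(n_steps):
        # Pull toward the long-run mean, then add Gaussian volatility.
        x = x + theta * (mu - x) + sigma * rng.gauss(0.0, 1.0)
        path.append(x)
    return path


# One coefficient trajectory over 104 weeks (values are illustrative).
beta_tv = simulate_ou(104, mu=1.0, theta=0.1, sigma=0.05, seed=42)
```

A larger `sigma` gives more week-to-week variation, while a larger `theta` pulls the path back to `mu` faster.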

-

-
-L_total = L_campaign + L_channel + 0.1·L_smooth + 0.01·L_sign

-
-L_channel = MSE(coeff_pred, coeff_target)     ← Stage 2 x₀-prediction
-L_smooth  = MSE(Δcoeff_pred, Δcoeff_target)   ← Temporal smoothness (≈ velocity loss)
-L_sign    = ReLU(-β_media_log - 5)            ← Soft positivity
-```

-##

--
--
--
--

 ## Usage

 ```python
-from

-# Generate
 gen = MMMDataGenerator(n_weeks=104, seed=42)
 samples = gen.generate_dataset(100)
-
-# Build model
-model = MMMDiffusionModel(n_media=5, n_ctrl=3, T_diff=200)
-
-# Train
 dataset = MMMDiffusionDataset(samples, normalize=True)
-# ... (see mmm_diffusion.py for full training loop)

-#
-
-
 decoded = dataset.decode_coefficients(coefficients)
-# decoded[:, :, :5]
-```

-#
-
-
-- `mmm_diffusion_model.pt`: Trained model checkpoint (PoC, 30 epochs on CPU)
-- `training_history.png`: Training loss curves
-- `coeff_comparison.png`: True vs predicted coefficients on validation sample
-- `sales_decomposition.png`: Sales decomposition visualization

 ## References

-- **GMD** (arxiv:2305.12577): Two-stage trajectory + body diffusion
-- **MDM** (arxiv:2209.14916): Transformer denoiser, x₀-prediction
-- **PhysDiff** (arxiv:2212.02500): Physics-based constraint projection
-- **PDM** (arxiv:2402.03559): Projected diffusion for hard
-- **NNN** (arxiv:2504.06212): Neural network MMM
-- **

 ## License

@@ -2,52 +2,51 @@

 A generative diffusion model for **Marketing Mix Modeling (MMM)** that predicts time-varying coefficients for sales decomposition. Adapted from NVIDIA's Kimodo/GMD dual-denoiser architecture.

+## v2 Fixes (from v1)
+
+### Problem 1: Sales Alignment (predicted sales did not match total sales)
+**Root cause**: v1 had `loss_sales = 0.0`, so there was no gradient signal for sales reconstruction.
+**Fix**: Added a differentiable sales reconstruction loss (`L_sales`) whose gradient flows through the coefficient → contribution → total sales path. It uses a warmup schedule: the first 25% of epochs focus on core coefficient denoising, then the sales loss ramps in.
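
One way to picture the warmup is a weight that stays at zero for the first 25% of epochs and then rises to one. The sketch below assumes a linear ramp over a further 25% of training; the README does not state the ramp shape or length, so `ramp_frac`, `aux_weight`, and `loss_with_sales` are illustrative names and choices, not the implementation in `mmm_diffusion_v2.py`.

```python
def aux_weight(epoch, n_epochs, warmup_frac=0.25, ramp_frac=0.25):
    """0 during warmup, then a linear ramp to 1 (ramp shape is an assumption)."""
    warmup_end = warmup_frac * n_epochs
    ramp_len = max(ramp_frac * n_epochs, 1e-9)
    if epoch < warmup_end:
        return 0.0
    return min((epoch - warmup_end) / ramp_len, 1.0)


def loss_with_sales(core_loss, l_sales, epoch, n_epochs, sales_coef=0.2):
    """Core denoising loss plus the sales term, gated by the warmup weight."""
    return core_loss + aux_weight(epoch, n_epochs) * sales_coef * l_sales


# Weight at a few points of a 100-epoch run.
schedule = [aux_weight(e, 100) for e in (0, 24, 25, 37.5, 50, 99)]
```

Gating the sales term this way lets the denoiser learn plausible coefficients before the reconstruction gradient starts pulling on them.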
+
+### Problem 2: Coefficients Too Smooth (compared to ground truth, GT)
+**Root cause**: The smoothness loss weight (0.1) was too high relative to the reconstruction loss, and the GT coefficient volatility was too low (OU process σ=0.05).
+**Fixes**:
+1. **Spectral loss** (`L_spectral`): Log-magnitude FFT loss that penalizes differences in the frequency spectrum, with higher weights on high frequencies to counteract smoothing
+2. **Multi-scale temporal loss**: Matches both 1st- and 2nd-order temporal derivatives (velocity and acceleration)
+3. **Higher GT volatility**: Increased OU volatility (0.05→0.12 for media, 0.03→0.08 for controls) and added regime-change jumps
+4. **Contribution matching loss**: Directly matches predicted channel-level contributions to GT
+5. **Reduced smoothness weight**: 0.1 → 0.05
+6. **Loss warmup**: Core denoising is trained first; auxiliary losses ramp in after 25% of training
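
The spectral loss in fix 1 can be sketched without any dependencies by using a naive DFT in place of an FFT. The linear frequency weighting (`hf_boost`) and the `eps` stabilizer are assumptions; the README only says that high frequencies get higher weights.

```python
import cmath
import math


def dft_magnitudes(x):
    """Magnitudes of the non-redundant DFT bins 0..n//2 (as a real FFT would give)."""
    n = len(x)
    mags = []
    for k in range(n // 2 + 1):
        s = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        mags.append(abs(s))
    return mags


def spectral_loss(pred, target, hf_boost=2.0, eps=1e-8):
    """Weighted mean squared difference of log-magnitude spectra."""
    p, g = dft_magnitudes(pred), dft_magnitudes(target)
    n_bins = len(p)
    total, weight_sum = 0.0, 0.0
    for k in range(n_bins):
        # Weight grows linearly from 1 at DC to 1 + hf_boost at Nyquist.
        w = 1.0 + hf_boost * k / max(n_bins - 1, 1)
        total += w * (math.log(p[k] + eps) - math.log(g[k] + eps)) ** 2
        weight_sum += w
    return total / weight_sum


# A target with a fast component, and an over-smoothed prediction without it.
target = [math.sin(0.3 * t) + 0.3 * math.sin(2.5 * t) for t in range(64)]
smooth = [math.sin(0.3 * t) for t in range(64)]
loss_smooth = spectral_loss(smooth, target)
loss_exact = spectral_loss(target, target)
```

A prediction that drops the fast component is penalized even when its pointwise error is small, which is the behavior the fix relies on.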
+
 ## Architecture

 ```
 ┌──────────────────┐   ┌──────────────────────────────────────────┐
 │   CONDITIONING   │   │      STAGE 1: Campaign/Geo Denoiser      │
 │                  │   │         (≈ Kimodo Root Denoiser)         │
+│ • Media Spend    │──▶│  Transformer (4 layers, d=192)           │
+│   (5 channels)   │   │  Denoises aggregate patterns             │
+│ • Controls       │   │  from controls + total sales             │
+│   (3 variables)  │   └──────────────┬───────────────────────────┘
+│ • Total Sales    │                  │ Campaign Context
+└──────────────────┘                  ▼
 ┌──────────────────────────────────────────────────┐
 │            STAGE 2: Channel Denoiser             │
 │             (≈ Kimodo Body Denoiser)             │
+│ Cross-Attention + Transformer (6 layers, d=256)  │
 │                                                  │
+│ CONSTRAINTS:                                     │
 │  • Log-space for media (exp → always ≥ 0)        │
 │  • PhysDiff-style projection every K steps       │
 │  • Soft sign penalty loss                        │
 └──────────────┬───────────────────────────────────┘
                ▼
 ┌──────────────────────────────────────────────────┐
+│ OUTPUT: Time-Varying Coefficients (T, 8)         │
+│  β_TV, β_Digital, β_Social, β_Print, β_Radio     │
+│  β_Seasonality, β_Trend, β_CompetitorPrice       │
+│  → Sales = base + Σ β_m·Hill(Adstock(x))         │
+│                 + Σ β_c·ctrl + noise             │
 └──────────────────────────────────────────────────┘
 ```
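
The sales equation in the output box can be made concrete with a short sketch. It assumes geometric adstock with carry-over rate `alpha` and the common Hill form `x^s / (x^s + EC50^s)`; the function names and the list-of-lists data layout are hypothetical, and the noise term is left out so the result is the deterministic part of the decomposition.

```python
def geometric_adstock(spend, alpha):
    """Carry-over effect: adstock_t = spend_t + alpha * adstock_{t-1}."""
    carry, out = 0.0, []
    for x in spend:
        carry = x + alpha * carry
        out.append(carry)
    return out


def hill(x, ec50, slope):
    """Saturation curve in [0, 1); equals 0.5 when x == ec50."""
    if x <= 0:
        return 0.0
    return x ** slope / (x ** slope + ec50 ** slope)


def reconstruct_sales(base, media_spend, beta_media, controls, beta_ctrl,
                      alpha, ec50, slope):
    """Sales_t = base + sum_m beta_m(t) * Hill(Adstock(x_m))_t + sum_c beta_c(t) * ctrl_c(t)."""
    n_media, n_weeks = len(media_spend), len(media_spend[0])
    saturated = [
        [hill(a, ec50[m], slope[m]) for a in geometric_adstock(media_spend[m], alpha[m])]
        for m in range(n_media)
    ]
    sales = []
    for t in range(n_weeks):
        media_part = sum(beta_media[m][t] * saturated[m][t] for m in range(n_media))
        ctrl_part = sum(beta_ctrl[c][t] * controls[c][t] for c in range(len(controls)))
        sales.append(base + media_part + ctrl_part)
    return sales


# Toy example: one media channel, one control, three weeks.
sales = reconstruct_sales(
    base=10.0,
    media_spend=[[1.0, 0.0, 0.0]], beta_media=[[2.0, 2.0, 2.0]],
    controls=[[1.0, 1.0, 1.0]], beta_ctrl=[[3.0, 3.0, 3.0]],
    alpha=[0.5], ec50=[1.0], slope=[1.0],
)
```

Each `beta_media[m][t] * saturated[m][t]` term is a per-channel contribution, which is what the decomposition reports.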

@@ -61,95 +60,94 @@ A generative diffusion model for **Marketing Mix Modeling (MMM)** that predicts
 | Body denoiser (joint angles) | Channel denoiser (per-channel coefficients) |
 | Skeleton positions/rotations | Time-varying coefficients for decomposition |
 | Foot contact constraints | Media positivity constraint |
+| Velocity loss | Multi-scale temporal loss |

+## Losses (v2)

+```
+L_total = L_campaign + 2·L_channel + 0.5·L_spectral + 0.1·L_temporal
+        + aux_weight · (0.2·L_sales + 0.2·L_contrib) + 0.01·L_sign
+
+where aux_weight ramps from 0→1 after 25% warmup
+
+L_campaign = MSE(agg_pred, agg_target)                          ← Stage 1 x₀-prediction
+L_channel  = MSE(coeff_pred, coeff_target)                      ← Stage 2 x₀-prediction (PRIMARY)
+L_spectral = MSE(log|FFT(pred)|, log|FFT(target)|)              ← Frequency preservation
+L_temporal = MSE(Δ¹pred, Δ¹target) + 0.5·MSE(Δ²pred, Δ²target)  ← Multi-scale
+L_sales    = MSE(pred_sales/scale, actual_sales/scale)          ← Sales reconstruction
+L_contrib  = MSE(pred_contrib, true_contrib)                    ← Channel contribution matching
+L_sign     = ReLU(-β_media_log - 5)                             ← Soft positivity
+```
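
As a rough illustration of the listing above, the multi-scale temporal term and the weighted total can be written in plain Python. The weights are copied from the listing; the helper names (`mse`, `diff`, `temporal_loss`, `total_loss`) and the dictionary of loss components are assumptions made for this sketch.

```python
def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def diff(x):
    """First-order finite difference along time."""
    return [x[i + 1] - x[i] for i in range(len(x) - 1)]


def temporal_loss(pred, target):
    """MSE on first differences plus 0.5 * MSE on second differences."""
    d1p, d1t = diff(pred), diff(target)
    d2p, d2t = diff(d1p), diff(d1t)
    return mse(d1p, d1t) + 0.5 * mse(d2p, d2t)


def total_loss(parts, aux_weight):
    """Combine the seven components with the weights from the listing."""
    return (parts["campaign"] + 2.0 * parts["channel"]
            + 0.5 * parts["spectral"] + 0.1 * parts["temporal"]
            + aux_weight * (0.2 * parts["sales"] + 0.2 * parts["contrib"])
            + 0.01 * parts["sign"])


flat_vs_trend = temporal_loss([0.0, 0.0, 0.0, 0.0], [0.0, 1.0, 2.0, 3.0])
ones = {k: 1.0 for k in ("campaign", "channel", "spectral", "temporal",
                         "sales", "contrib", "sign")}
during_warmup = total_loss(ones, aux_weight=0.0)
after_warmup = total_loss(ones, aux_weight=1.0)
```

Because the temporal term compares differences rather than levels, a constant offset between prediction and target costs nothing here; level errors are left to `L_channel`.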

+## Sampling

+Two samplers are available:
+- **DDPM** (500 steps): Stochastic, well-calibrated temporal variation
+- **DDIM** (50-100 steps): 5-10x faster, deterministic (eta=0)
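
To show why DDIM needs fewer steps, here is a minimal sketch of the deterministic (eta=0) update for a model that predicts x₀, run over a 50-step subsequence of 500 training timesteps. The cosine schedule follows the Improved DDPM formula; whether the repository uses exactly this schedule is an assumption, and `fake_denoiser` is a stand-in for the trained network.

```python
import math


def _cosine_f(t, T, s):
    return math.cos((t / T + s) / (1.0 + s) * math.pi / 2.0) ** 2


def cosine_alpha_bar(t, T, s=0.008):
    """Cumulative signal level, normalized so that alpha_bar(0) == 1."""
    return _cosine_f(t, T, s) / _cosine_f(0, T, s)


def ddim_timesteps(T, n_steps):
    """Evenly spaced, decreasing subsequence of timesteps, e.g. 500, 490, ..., 10."""
    return [T * (i + 1) // n_steps for i in reversed(range(n_steps))]


def ddim_step(x_t, x0_hat, t, t_prev, T):
    """Deterministic DDIM update: recover eps from (x_t, x0_hat), then re-noise to t_prev."""
    a_t, a_prev = cosine_alpha_bar(t, T), cosine_alpha_bar(t_prev, T)
    out = []
    for xt, x0 in zip(x_t, x0_hat):
        eps = (xt - math.sqrt(a_t) * x0) / math.sqrt(1.0 - a_t)
        out.append(math.sqrt(a_prev) * x0 + math.sqrt(1.0 - a_prev) * eps)
    return out


def fake_denoiser(x_t, t):
    """Stand-in for the network: always predicts a clean value of 0.5."""
    return [0.5 for _ in x_t]


T = 500
ts = ddim_timesteps(T, 50)
x = [1.3, -0.7, 0.2]                       # pretend this is pure noise at t = T
for t, t_prev in zip(ts, ts[1:] + [0]):
    x = ddim_step(x, fake_denoiser(x, t), t, t_prev, T)
```

With eta=0 no fresh noise is injected, so the same starting noise always yields the same trajectory; DDPM would instead add noise at each of its 500 steps.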

+## Results (v2, GPU training, 150 epochs)

+| Metric | v1 (CPU, 30 epochs) | v2 (GPU, 150 epochs) |
+|--------|---------------------|----------------------|
+| Final training loss | 0.129 | 0.68 |
+| Channel loss | n/a | 0.14 |
+| Media positivity | ✅ 100% | ✅ 100% |
+| Temporal variation ratio | 0.2-0.4 (too smooth) | **0.3-1.2** (calibrated) |
+| MAPE (fitted base) | n/a | **7.0%** |
+| Model size | 2.7M | 7.2M |

+**Key improvement**: The temporal variation ratio (pred_std / GT_std, where 1.0 means predicted variation matches ground truth) moved from 0.2-0.4 to 0.3-1.2, so predicted coefficients now exhibit realistic temporal dynamics instead of being over-smoothed.
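
The ratio itself is simple to compute. The sketch below assumes the standard deviation is taken over time for a single coefficient trajectory; the README does not say how it is aggregated across channels, and the function name is hypothetical.

```python
import statistics


def temporal_variation_ratio(pred, gt):
    """pred_std / GT_std over time; values well below 1 indicate over-smoothing."""
    return statistics.pstdev(pred) / statistics.pstdev(gt)


gt = [1.0, 1.2, 0.8, 1.1, 0.9]
# An over-smoothed prediction: same mean, only 30% of the deviation.
over_smoothed = [1.0 + 0.3 * (g - 1.0) for g in gt]
ratio = temporal_variation_ratio(over_smoothed, gt)
```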

+**Note on coefficient correlation**: Per-channel correlation with GT is near zero. This is expected: the MMM coefficient recovery problem is fundamentally ill-posed (many coefficient combinations produce similar sales). The diffusion model generates *plausible* coefficient trajectories conditioned on the input data, not deterministic point estimates. For practical use, draw multiple samples and aggregate them as an ensemble for uncertainty quantification.
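
A simple way to turn several sampled trajectories into an uncertainty band is a per-timestep mean with empirical quantiles. In this sketch the trajectories are hard-coded toy values; in practice each one would come from a separate sampling run. `summarize_ensemble` and the nearest-rank quantile rule are illustrative choices.

```python
import statistics


def summarize_ensemble(samples, lower_q=0.05, upper_q=0.95):
    """Per-timestep mean and nearest-rank quantile band across sampled trajectories."""
    n_t = len(samples[0])
    mean, lo, hi = [], [], []
    for t in range(n_t):
        vals = sorted(s[t] for s in samples)
        last = len(vals) - 1
        mean.append(statistics.fmean(vals))
        lo.append(vals[round(lower_q * last)])
        hi.append(vals[round(upper_q * last)])
    return mean, lo, hi


# Five toy trajectories (one coefficient, two timesteps each).
samples = [[1.0, 2.0], [2.0, 2.5], [3.0, 1.5], [4.0, 2.2], [5.0, 1.8]]
mean, lo, hi = summarize_ensemble(samples)
```

A wide band at a given week signals that the data do not pin down that coefficient, which matches the ill-posedness described above.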

+## Files

+- `mmm_diffusion_v2.py`: Full v2 implementation with all fixes
+- `mmm_diffusion_v2.pt`: Best model checkpoint (v2, 150 epochs on GPU)
+- `training_history_v2.png`: Training loss curves (all 7 loss components)
+- `coeff_comparison_v2.png`: True vs predicted time-varying coefficients
+- `sales_decomposition_v2.png`: Sales decomposition with R² and MAPE
+- `mmm_diffusion.py`: Original v1 implementation (kept for reference)
+- `mmm_diffusion_model.pt`: v1 model checkpoint

 ## Usage

 ```python
+from mmm_diffusion_v2 import MMMDiffusionModel, MMMDataGenerator, MMMDiffusionDataset
+import torch

+# Generate data
 gen = MMMDataGenerator(n_weeks=104, seed=42)
 samples = gen.generate_dataset(100)
 dataset = MMMDiffusionDataset(samples, normalize=True)

+# Build and load model
+model = MMMDiffusionModel(n_media=5, n_ctrl=3, d_model_campaign=192,
+                          d_model_channel=256, n_layers_campaign=4,
+                          n_layers_channel=6, T_diff=500)
+# Path as written here; the Files section lists the v2 checkpoint as `mmm_diffusion_v2.pt`
+ckpt = torch.load('mmm_diffusion_model_v2.pt', weights_only=False)
+model.load_state_dict(ckpt['model_state_dict'])
+model.eval()
+
+# Generate coefficients (DDPM)
+conditioning = ...  # (1, T, 9) [media_spend, controls, total_sales]
+coefficients = model.sample(conditioning, n_steps=500)
 decoded = dataset.decode_coefficients(coefficients)
+# decoded[:, :, :5] guaranteed positive (media channels)

+# Or faster with DDIM (50 steps)
+coefficients = model.sample_ddim(conditioning, n_steps=50, eta=0.0)
+```

 ## References

+- **GMD** (arxiv:2305.12577): Two-stage trajectory + body diffusion
+- **MDM** (arxiv:2209.14916): Transformer denoiser, x₀-prediction
+- **PhysDiff** (arxiv:2212.02500): Physics-based constraint projection
+- **PDM** (arxiv:2402.03559): Projected diffusion for hard constraints
+- **NNN** (arxiv:2504.06212): Neural network MMM (Google)
+- **Improved DDPM** (arxiv:2102.09672): Cosine noise schedule
+- **DDIM** (arxiv:2010.02502): Deterministic sampling

 ## License
