sujimenon committed on
Commit a1d427a · verified · 1 Parent(s): 0d85d4b

v2: Updated README with fixes and results

  1. README.md +93 -95
README.md CHANGED
@@ -2,52 +2,51 @@
  A generative diffusion model for **Marketing Mix Modeling (MMM)** that predicts time-varying coefficients for sales decomposition. Adapted from NVIDIA's Kimodo/GMD dual-denoiser architecture.

  ## Architecture

  ```
- ┌──────────────────────────────────────────────────────────────────┐
- │                    MMM-Diffusion Architecture                    │
- │             (Adapted from Kimodo/GMD Dual-Denoiser)              │
- └──────────────────────────────────────────────────────────────────┘
-
  ┌──────────────────┐   ┌───────────────────────────────────────────┐
  │   CONDITIONING   │   │  STAGE 1: Campaign/Geo Denoiser           │
  │                  │   │  (≈ Kimodo Root Denoiser)                 │
- │ • Media Spend    │──▶│                                           │
- │   (5 channels)   │   │  Denoises aggregate-level patterns        │
- │ • Controls       │   │  from non-marketing vars + total sales    │
- │   (3 variables)  │   │                                           │
- │ • Total Sales    │   │  Transformer Encoder (4 layers, d=128)    │
- │                  │   └──────────────┬────────────────────────────┘
- └──────────────────┘                  │ Campaign Context
-                                       ▼
                         ┌───────────────────────────────────────────────────┐
                         │  STAGE 2: Channel Denoiser                        │
                         │  (≈ Kimodo Body Denoiser)                         │
                         │                                                   │
-                        │  Denoises per-channel time-varying β_t            │
-                        │  conditioned on Stage 1 output + media spend      │
-                        │                                                   │
-                        │  Cross-Attention + Transformer (6 layers)         │
-                        │                                                   │
-                        │  CONSTRAINT ENFORCEMENT:                          │
                         │  • Log-space for media (exp → always ≥ 0)         │
                         │  • PhysDiff-style projection every K steps        │
                         │  • Soft sign penalty loss                         │
                         └──────────────┬────────────────────────────────────┘
-                                       │
                                        ▼
                         ┌───────────────────────────────────────────────────┐
-                        │  OUTPUT: Time-Varying Coefficients                │
-                        │                                                   │
-                        │  β_TV(t), β_Digital(t), β_Social(t),              │
-                        │  β_Print(t), β_Radio(t)   [all ≥ 0]               │
-                        │  β_Seasonality(t), β_Trend(t),                    │
-                        │  β_CompetitorPrice(t)   [unconstrained]           │
-                        │                                                   │
-                        │  → Sales Decomposition:                           │
-                        │    Sales_t = base + Σ β_m(t)·Hill(Adstock(x))     │
-                        │                   + Σ β_c(t)·ctrl_c(t) + noise    │
                         └───────────────────────────────────────────────────┘
  ```

@@ -61,95 +60,94 @@ A generative diffusion model for **Marketing Mix Modeling (MMM)** that predicts
  | Body denoiser (joint angles) | Channel denoiser (per-channel coefficients) |
  | Skeleton positions/rotations | Time-varying coefficients for decomposition |
  | Foot contact constraints | Media positivity constraint |
- | Velocity loss | Temporal smoothness loss |
-
- ## Key Design Decisions
-
- ### Constraint Enforcement (3 mechanisms, belt-and-suspenders)
-
- 1. **Log-space reparametrization**: Media coefficients are modeled in log-space during training. At decode time, `exp()` guarantees positivity. This is the primary mechanism.
-
- 2. **PhysDiff-style projection**: During reverse diffusion sampling, every K=10 steps the denoised x̂₀ is projected into the feasible region (clamped to valid ranges). Based on [PhysDiff](https://arxiv.org/abs/2212.02500).
-
- 3. **Soft sign penalty**: Training loss includes `L_sign = ReLU(-β_media - threshold)²` to discourage extreme negative values in log-space.
-
- ### x₀-prediction (not ε-prediction)

- Following MDM and GMD, the model predicts the clean data x₀ directly rather than the noise ε. This enables:
- - Constraint projection at each denoising step (operating on meaningful coefficient values)
- - Geometric auxiliary losses (sales reconstruction, temporal smoothness)

- ### Dual-Denoiser Hierarchy

- Stage 1 captures **aggregate macro patterns** (overall media effectiveness, seasonality), while Stage 2 specializes in **per-channel coefficient dynamics** conditioned on those patterns. This hierarchical decomposition mirrors the Kimodo root→body split.

- ## Training Data

- Synthetic MMM data generated with realistic patterns:
- - **5 media channels**: TV, Digital, Social, Print, Radio
- - **3 control variables**: Seasonality, Trend, Competitor Price
- - **Adstock transformation**: Geometric decay with α ~ Beta(2,2)
- - **Hill saturation**: With EC50 ~ LogNormal and slope ~ Uniform[0.5, 3]
- - **Time-varying coefficients**: Ornstein-Uhlenbeck random walk with mean reversion
- - **500 training scenarios**, 104 weeks each

- ## Losses

- ```
- L_total = L_campaign + L_channel + 0.1·L_smooth + 0.01·L_sign

- L_campaign = MSE(agg_pred, agg_target)        # Stage 1 x₀-prediction
- L_channel  = MSE(coeff_pred, coeff_target)    # Stage 2 x₀-prediction
- L_smooth   = MSE(Δcoeff_pred, Δcoeff_target)  # Temporal smoothness (≈ velocity loss)
- L_sign     = ReLU(-β_media_log - 5)           # Soft positivity
- ```

- ## Results (PoC, CPU training, 30 epochs)

- - **Final training loss**: 0.129
- - **Media positivity constraint**: ✅ 100% satisfied (all generated media coefficients > 0)
- - **Model size**: 2.7M parameters
- - **Generation time**: ~2.6s per scenario (200 diffusion steps on CPU)

  ## Usage

  ```python
- from mmm_diffusion import MMMDiffusionModel, MMMDataGenerator, MMMDiffusionDataset

- # Generate synthetic data
  gen = MMMDataGenerator(n_weeks=104, seed=42)
  samples = gen.generate_dataset(100)
-
- # Build model
- model = MMMDiffusionModel(n_media=5, n_ctrl=3, T_diff=200)
-
- # Train
  dataset = MMMDiffusionDataset(samples, normalize=True)
- # ... (see mmm_diffusion.py for full training loop)

- # Generate coefficients for new conditioning data
- conditioning = ... # (1, T, 9) tensor: [media_spend, controls, total_sales]
- coefficients = model.sample(conditioning, n_steps=200)
  decoded = dataset.decode_coefficients(coefficients)
- # decoded[:, :, :5] are GUARANTEED positive (media channels)
- ```

- ## Files
-
- - `mmm_diffusion.py`: Full implementation (data generation, model, training, evaluation, visualization)
- - `mmm_diffusion_model.pt`: Trained model checkpoint (PoC, 30 epochs on CPU)
- - `training_history.png`: Training loss curves
- - `coeff_comparison.png`: True vs predicted coefficients on validation sample
- - `sales_decomposition.png`: Sales decomposition visualization

  ## References

- - **GMD** (arxiv:2305.12577): Two-stage trajectory + body diffusion (closest public analog to Kimodo)
- - **MDM** (arxiv:2209.14916): Transformer denoiser, x₀-prediction, geometric losses
- - **PhysDiff** (arxiv:2212.02500): Physics-based constraint projection during denoising
- - **PDM** (arxiv:2402.03559): Projected diffusion for hard constraint satisfaction
- - **NNN** (arxiv:2504.06212): Neural network MMM architecture (Google)
- - **TabDDPM** (arxiv:2209.15421): Diffusion models for tabular data

  ## License

 
  A generative diffusion model for **Marketing Mix Modeling (MMM)** that predicts time-varying coefficients for sales decomposition. Adapted from NVIDIA's Kimodo/GMD dual-denoiser architecture.

+ ## v2 Fixes (from v1)
+
+ ### Problem 1: Sales Alignment (predicted sales didn't match total sales)
+ **Root cause**: v1 had `loss_sales = 0.0`, so sales reconstruction received no gradient signal.
+ **Fix**: Added a differentiable sales reconstruction loss (`L_sales`) whose gradient flows through the coefficient → contribution → total sales path. It uses a warmup schedule: the first 25% of epochs focus on core coefficient denoising, then the sales loss ramps in.
+
+ ### Problem 2: Coefficients Too Smooth (compared to ground truth, GT)
+ **Root cause**: The smoothness loss weight (0.1) was too high relative to the reconstruction loss, and GT coefficient volatility was too low (OU process σ=0.05).
+ **Fixes**:
+ 1. **Spectral loss** (`L_spectral`): Log-magnitude FFT loss that penalizes frequency spectrum differences, with higher weights on high frequencies to counter over-smoothing
+ 2. **Multi-scale temporal loss**: Matches both 1st- and 2nd-order temporal differences (velocity and acceleration)
+ 3. **Higher GT volatility**: Increased OU volatility (0.05→0.12 for media, 0.03→0.08 for controls) and added regime-change jumps
+ 4. **Contribution matching loss**: Directly matches predicted channel-level contributions to GT
+ 5. **Reduced smoothness weight**: 0.1 → 0.05
+ 6. **Loss warmup**: Core denoising is trained first; auxiliary losses ramp in after 25% of training
+
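The loss warmup described in the fixes above could be scheduled roughly as in the sketch below. The function name `aux_weight`, the linear ramp shape, and the `ramp_frac` parameter are assumptions made for illustration; the README only states that auxiliary losses ramp in after the first 25% of training.

```python
def aux_weight(epoch: int, n_epochs: int, warmup_frac: float = 0.25,
               ramp_frac: float = 0.25) -> float:
    """Hypothetical schedule: 0 during warmup, then a linear ramp up to 1."""
    warmup_end = warmup_frac * n_epochs
    ramp_len = max(ramp_frac * n_epochs, 1e-8)  # guard against a zero-length ramp
    progress = (epoch - warmup_end) / ramp_len
    return min(max(progress, 0.0), 1.0)         # clamp to [0, 1]


# With 150 epochs the weight stays at 0 through epoch 37,
# then rises linearly and reaches 1 at epoch 75.
weights = [aux_weight(e, 150) for e in (0, 37, 56, 75, 149)]
```

The returned value would multiply the auxiliary terms (sales and contribution losses) while the core denoising losses stay at full weight throughout.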
  ## Architecture

  ```
  ┌──────────────────┐   ┌───────────────────────────────────────────┐
  │   CONDITIONING   │   │  STAGE 1: Campaign/Geo Denoiser           │
  │                  │   │  (≈ Kimodo Root Denoiser)                 │
+ │ • Media Spend    │──▶│  Transformer (4 layers, d=192)            │
+ │   (5 channels)   │   │  Denoises aggregate patterns              │
+ │ • Controls       │   │  from controls + total sales              │
+ │   (3 variables)  │   └──────────────┬────────────────────────────┘
+ │ • Total Sales    │                  │ Campaign Context
+ └──────────────────┘                  ▼
                         ┌───────────────────────────────────────────────────┐
                         │  STAGE 2: Channel Denoiser                        │
                         │  (≈ Kimodo Body Denoiser)                         │
+                        │  Cross-Attention + Transformer (6 layers, d=256)  │
                         │                                                   │
+                        │  CONSTRAINTS:                                     │
                         │  • Log-space for media (exp → always ≥ 0)         │
                         │  • PhysDiff-style projection every K steps        │
                         │  • Soft sign penalty loss                         │
                         └──────────────┬────────────────────────────────────┘
                                        ▼
                         ┌───────────────────────────────────────────────────┐
+                        │  OUTPUT: Time-Varying Coefficients (T, 8)         │
+                        │  β_TV, β_Digital, β_Social, β_Print, β_Radio      │
+                        │  β_Seasonality, β_Trend, β_CompetitorPrice        │
+                        │  → Sales = base + Σ β_m·Hill(Adstock(x))          │
+                        │                 + Σ β_c·ctrl + noise              │
                         └───────────────────────────────────────────────────┘
  ```
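The sales decomposition in the OUTPUT box can be illustrated with a small pure-Python sketch. The function names, the geometric adstock recursion, and the Hill form `x^s / (x^s + EC50^s)` are assumptions chosen for illustration (the noise term is omitted); the repository's own implementation may parametrize these differently.

```python
def geometric_adstock(spend, alpha):
    """Carry over a fraction `alpha` of the previous adstocked value."""
    out, carry = [], 0.0
    for x in spend:
        carry = x + alpha * carry
        out.append(carry)
    return out


def hill(x, ec50, slope):
    """Hill saturation for x >= 0; equals 0.5 when x == ec50."""
    return x ** slope / (x ** slope + ec50 ** slope)


def reconstruct_sales(base, betas_media, spends, alphas, ec50s, slopes,
                      betas_ctrl, controls):
    """Sales_t = base + sum_m beta_m(t)*Hill(Adstock(x_m))_t + sum_c beta_c(t)*ctrl_c(t)."""
    sales = [base] * len(spends[0])
    for beta, x, a, e, s in zip(betas_media, spends, alphas, ec50s, slopes):
        saturated = [hill(v, e, s) for v in geometric_adstock(x, a)]
        sales = [y + b * h for y, b, h in zip(sales, beta, saturated)]
    for beta, ctrl in zip(betas_ctrl, controls):
        sales = [y + b * c for y, b, c in zip(sales, beta, ctrl)]
    return sales


# One media channel and one control over three weeks (toy numbers).
sales = reconstruct_sales(
    base=10.0,
    betas_media=[[2.0, 2.0, 2.0]], spends=[[100, 0, 0]],
    alphas=[0.5], ec50s=[50], slopes=[1],
    betas_ctrl=[[1.0, 1.0, 1.0]], controls=[[0, 1, 0]],
)
```

Because each media coefficient multiplies a non-negative saturated spend, keeping the media coefficients positive keeps every media contribution non-negative.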

  | Body denoiser (joint angles) | Channel denoiser (per-channel coefficients) |
  | Skeleton positions/rotations | Time-varying coefficients for decomposition |
  | Foot contact constraints | Media positivity constraint |
+ | Velocity loss | Multi-scale temporal loss |

+ ## Losses (v2)

+ ```
+ L_total = L_campaign + 2·L_channel + 0.5·L_spectral + 0.1·L_temporal
+         + aux_weight · (0.2·L_sales + 0.2·L_contrib) + 0.01·L_sign
+
+ where aux_weight ramps from 0→1 after 25% warmup
+
+ L_campaign = MSE(agg_pred, agg_target)                          # Stage 1 x₀-prediction
+ L_channel  = MSE(coeff_pred, coeff_target)                      # Stage 2 x₀-prediction (PRIMARY)
+ L_spectral = MSE(log|FFT(pred)|, log|FFT(target)|)              # Frequency preservation
+ L_temporal = MSE(Δ¹pred, Δ¹target) + 0.5·MSE(Δ²pred, Δ²target)  # Multi-scale
+ L_sales    = MSE(pred_sales/scale, actual_sales/scale)          # Sales reconstruction
+ L_contrib  = MSE(pred_contrib, true_contrib)                    # Channel contribution matching
+ L_sign     = ReLU(-β_media_log - 5)                             # Soft positivity
+ ```
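As a rough illustration of `L_spectral` and `L_temporal` above, the following NumPy sketch computes both on `(T, C)` coefficient arrays. The linear frequency weighting, the `hf_boost` and `eps` parameters, and the function names are assumptions; the README says only that high frequencies receive higher weights. A training version would need a differentiable FFT (for example `torch.fft.rfft`) instead of NumPy.

```python
import numpy as np


def spectral_loss(pred, target, eps=1e-8, hf_boost=1.0):
    """Weighted MSE between log-magnitude spectra along the time axis.

    Frequency bin k gets weight 1 + hf_boost * k / (n_bins - 1), so
    differences at high frequencies are penalized more (assumed weighting).
    """
    log_p = np.log(np.abs(np.fft.rfft(pred, axis=0)) + eps)
    log_t = np.log(np.abs(np.fft.rfft(target, axis=0)) + eps)
    n_bins = log_p.shape[0]
    weights = 1.0 + hf_boost * np.linspace(0.0, 1.0, n_bins)[:, None]
    return float(np.mean(weights * (log_p - log_t) ** 2))


def temporal_loss(pred, target, accel_weight=0.5):
    """MSE on first differences plus accel_weight * MSE on second differences."""
    d1 = np.mean((np.diff(pred, axis=0) - np.diff(target, axis=0)) ** 2)
    d2 = np.mean((np.diff(pred, n=2, axis=0) - np.diff(target, n=2, axis=0)) ** 2)
    return float(d1 + accel_weight * d2)


rng = np.random.default_rng(0)
target = rng.normal(size=(104, 8))  # 104 weeks, 8 coefficients
# A 3-point moving average stands in for an over-smoothed prediction.
smooth = (target + np.roll(target, 1, axis=0) + np.roll(target, -1, axis=0)) / 3.0

loss_same = spectral_loss(target, target)
loss_smooth = spectral_loss(smooth, target)
```

Both losses vanish when prediction and target coincide, and an over-smoothed prediction is penalized because its high-frequency magnitudes fall below those of the target.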

+ ## Sampling

+ Two samplers are available:
+ - **DDPM** (500 steps): Stochastic, well-calibrated temporal variation
+ - **DDIM** (50-100 steps): 5-10x faster, deterministic (eta=0)
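For intuition about the deterministic sampler, here is a minimal NumPy sketch of a DDIM loop with eta = 0 for a model that predicts x₀. The `predict_x0` callable, the function names, and the exact cosine schedule constants are assumptions for illustration; this is not the repository's `sample_ddim` implementation.

```python
import numpy as np


def cosine_alpha_bar(n_steps, s=0.008):
    """Cumulative signal level alpha_bar[t] for t = 0..n_steps (cosine schedule)."""
    t = np.arange(n_steps + 1) / n_steps
    f = np.cos((t + s) / (1 + s) * np.pi / 2) ** 2
    return f / f[0]


def ddim_sample(predict_x0, shape, alpha_bar, n_sample_steps, rng):
    """Deterministic DDIM (eta = 0) using a strided subset of timesteps."""
    T = len(alpha_bar) - 1
    timesteps = np.linspace(T, 0, n_sample_steps + 1).round().astype(int)
    x = rng.normal(size=shape)  # start from pure noise
    for t, t_prev in zip(timesteps[:-1], timesteps[1:]):
        x0_hat = predict_x0(x, t)
        # Noise implied by the current sample and the x0 estimate.
        eps_hat = (x - np.sqrt(alpha_bar[t]) * x0_hat) / np.sqrt(1.0 - alpha_bar[t])
        # Jump to the earlier timestep without injecting fresh noise.
        x = np.sqrt(alpha_bar[t_prev]) * x0_hat + np.sqrt(1.0 - alpha_bar[t_prev]) * eps_hat
    return x


# Toy check: a "denoiser" that always returns the same clean signal.
target = np.ones((104, 8))
sample = ddim_sample(lambda x, t: target, target.shape,
                     cosine_alpha_bar(500), 50, np.random.default_rng(0))
```

Skipping from 500 training timesteps to 50 sampling steps is where the speed-up comes from; because no noise is added between steps, repeated calls with the same starting noise give the same trajectory.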

+ ## Results (v2, GPU training, 150 epochs)

+ | Metric | v1 (CPU, 30 epochs) | v2 (GPU, 150 epochs) |
+ |--------|---------------------|----------------------|
+ | Final training loss | 0.129 | 0.68 |
+ | Channel loss | n/a | 0.14 |
+ | Media positivity | ✅ 100% | ✅ 100% |
+ | Temporal variation ratio | 0.2-0.4 (too smooth) | **0.3-1.2** (calibrated) |
+ | MAPE (fitted base) | n/a | **7.0%** |
+ | Model size | 2.7M | 7.2M |

+ **Key improvement**: The temporal variation ratio (pred_std / GT_std) moved from 0.2-0.4 to 0.3-1.2, so predicted coefficients now show realistic temporal dynamics instead of being over-smoothed.
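The temporal variation ratio can be computed per coefficient as in this short NumPy sketch. The function name and the `eps` guard are assumptions; only the definition pred_std / GT_std comes from the text above.

```python
import numpy as np


def temporal_variation_ratio(pred, gt, eps=1e-8):
    """Per-coefficient ratio of standard deviations over time.

    pred, gt: arrays of shape (T, C). Values well below 1 indicate
    over-smoothed predictions; values near 1 indicate matched variability.
    """
    return pred.std(axis=0) / (gt.std(axis=0) + eps)


weeks = np.arange(104)
gt = np.sin(2 * np.pi * weeks / 52)[:, None] * np.ones((1, 8))
pred = 1.0 + 0.3 * gt  # same shape as GT but with 30% of its amplitude
ratio = temporal_variation_ratio(pred, gt)
```

Note that the ratio compares spread only: a prediction can match the ground-truth variability while still being uncorrelated with it.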
 
+ **Note on coefficient correlation**: Per-channel correlation with GT is near zero. This is expected: the MMM coefficient recovery problem is fundamentally ill-posed, because many coefficient combinations produce similar sales. The diffusion model generates *plausible* coefficient trajectories conditioned on the input data, not deterministic point estimates. For practical use, draw multiple samples and ensemble them for uncertainty quantification.
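One way to ensemble multiple samples is to summarize them with a mean and a percentile band per week and coefficient. The sketch below uses random draws as a stand-in for repeated sampling and decoding; `summarize_ensemble` and the 5th/95th percentile choice are assumptions, not part of the repository.

```python
import numpy as np


def summarize_ensemble(samples, lower=5.0, upper=95.0):
    """Summarize decoded coefficient draws of shape (n_samples, T, C)."""
    samples = np.asarray(samples)
    return {
        "mean": samples.mean(axis=0),
        "lower": np.percentile(samples, lower, axis=0),
        "upper": np.percentile(samples, upper, axis=0),
    }


# Stand-in for 32 decoded draws from the stochastic sampler
# (exp keeps the toy values positive, like the media coefficients).
rng = np.random.default_rng(0)
draws = np.exp(rng.normal(size=(32, 104, 8)))
summary = summarize_ensemble(draws)
```

A wide band for a channel signals that the data constrain its coefficient weakly, which is consistent with the ill-posedness noted above.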
 
 
 
 

+ ## Files

+ - `mmm_diffusion_v2.py`: Full v2 implementation with all fixes
+ - `mmm_diffusion_v2.pt`: Best model checkpoint (v2, 150 epochs on GPU)
+ - `training_history_v2.png`: Training loss curves (all 7 loss components)
+ - `coeff_comparison_v2.png`: True vs predicted time-varying coefficients
+ - `sales_decomposition_v2.png`: Sales decomposition with R² and MAPE
+ - `mmm_diffusion.py`: Original v1 implementation (kept for reference)
+ - `mmm_diffusion_model.pt`: v1 model checkpoint

  ## Usage

  ```python
+ from mmm_diffusion_v2 import MMMDiffusionModel, MMMDataGenerator, MMMDiffusionDataset
+ import torch

+ # Generate data
  gen = MMMDataGenerator(n_weeks=104, seed=42)
  samples = gen.generate_dataset(100)
  dataset = MMMDiffusionDataset(samples, normalize=True)

+ # Build and load model
+ model = MMMDiffusionModel(n_media=5, n_ctrl=3, d_model_campaign=192,
+                           d_model_channel=256, n_layers_campaign=4,
+                           n_layers_channel=6, T_diff=500)
+ ckpt = torch.load('mmm_diffusion_model_v2.pt', weights_only=False)
+ model.load_state_dict(ckpt['model_state_dict'])
+ model.eval()
+
+ # Generate coefficients (DDPM)
+ conditioning = ... # (1, T, 9) [media_spend, controls, total_sales]
+ coefficients = model.sample(conditioning, n_steps=500)
  decoded = dataset.decode_coefficients(coefficients)
+ # decoded[:, :, :5] guaranteed positive (media channels)

+ # Or faster with DDIM (50 steps)
+ coefficients = model.sample_ddim(conditioning, n_steps=50, eta=0.0)
+ ```
 
 
 
 

  ## References

+ - **GMD** (arxiv:2305.12577): Two-stage trajectory + body diffusion
+ - **MDM** (arxiv:2209.14916): Transformer denoiser, x₀-prediction
+ - **PhysDiff** (arxiv:2212.02500): Physics-based constraint projection
+ - **PDM** (arxiv:2402.03559): Projected diffusion for hard constraints
+ - **NNN** (arxiv:2504.06212): Neural network MMM (Google)
+ - **Improved DDPM** (arxiv:2102.09672): Cosine noise schedule
+ - **DDIM** (arxiv:2010.02502): Deterministic sampling

  ## License