MMM-Diffusion: Marketing Mix Modeling via Dual-Denoiser Diffusion
A generative diffusion model for Marketing Mix Modeling (MMM) that predicts time-varying coefficients for sales decomposition. Adapted from NVIDIA's Kimodo/GMD dual-denoiser architecture.
Architecture
┌─────────────────────────────────────────────┐
│         MMM-Diffusion Architecture          │
│   (Adapted from Kimodo/GMD Dual-Denoiser)   │
└─────────────────────────────────────────────┘
┌──────────────────┐    ┌──────────────────────────────────────────┐
│  CONDITIONING    │    │  STAGE 1: Campaign/Geo Denoiser          │
│                  │    │  (≈ Kimodo Root Denoiser)                │
│  • Media Spend   │───▶│                                          │
│    (5 channels)  │    │  Denoises aggregate-level patterns       │
│  • Controls      │    │  from non-marketing vars + total sales   │
│    (3 variables) │    │                                          │
│  • Total Sales   │    │  Transformer Encoder (4 layers, d=128)   │
│                  │    └──────────────┬───────────────────────────┘
└──────────────────┘                   │ Campaign Context
                                       ▼
                        ┌──────────────────────────────────────────────┐
                        │  STAGE 2: Channel Denoiser                   │
                        │  (≈ Kimodo Body Denoiser)                    │
                        │                                              │
                        │  Denoises per-channel time-varying β_t       │
                        │  conditioned on Stage 1 output + media spend │
                        │                                              │
                        │  Cross-Attention + Transformer (6 layers)    │
                        │                                              │
                        │  CONSTRAINT ENFORCEMENT:                     │
                        │  • Log-space for media (exp → always ≥ 0)    │
                        │  • PhysDiff-style projection every K steps   │
                        │  • Soft sign penalty loss                    │
                        └──────────────┬───────────────────────────────┘
                                       │
                                       ▼
                        ┌──────────────────────────────────────────────┐
                        │  OUTPUT: Time-Varying Coefficients           │
                        │                                              │
                        │  β_TV(t), β_Digital(t), β_Social(t),         │
                        │  β_Print(t), β_Radio(t)      [all ≥ 0]       │
                        │  β_Seasonality(t), β_Trend(t),               │
                        │  β_CompetitorPrice(t)     [unconstrained]    │
                        │                                              │
                        │  → Sales Decomposition:                      │
                        │  Sales_t = base + Σ β_m(t)·Hill(Adstock(x))  │
                        │                 + Σ β_c(t)·ctrl_c(t) + noise │
                        └──────────────────────────────────────────────┘
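The sales decomposition at the bottom of the diagram can be sketched in plain Python. This is a minimal illustration, not the code in mmm_diffusion.py: the function names, the toy numbers, and the specific Hill form x^s / (x^s + EC50^s) are assumptions, and the noise term is omitted.

```python
def geometric_adstock(spend, alpha):
    """Geometric carry-over: each week keeps a fraction `alpha` of the previous adstock."""
    out, carry = [], 0.0
    for x in spend:
        carry = x + alpha * carry
        out.append(carry)
    return out


def hill(x, ec50, slope):
    """Hill saturation (assumed form): 0 at x = 0, 0.5 at x = ec50, approaches 1."""
    if x <= 0.0:
        return 0.0
    return x ** slope / (x ** slope + ec50 ** slope)


def media_contribution(spend, beta, alpha, ec50, slope):
    """beta_m(t) * Hill(Adstock(x)) for one channel, week by week."""
    adstocked = geometric_adstock(spend, alpha)
    return [b * hill(a, ec50, slope) for b, a in zip(beta, adstocked)]


def control_contribution(ctrl, beta):
    """beta_c(t) * ctrl_c(t) for one control variable."""
    return [b * c for b, c in zip(beta, ctrl)]


def decompose(base, media_parts, control_parts):
    """Sum base and all contributions per week (noise omitted)."""
    n_weeks = len(next(iter(media_parts.values())))
    return [
        base
        + sum(part[t] for part in media_parts.values())
        + sum(part[t] for part in control_parts.values())
        for t in range(n_weeks)
    ]


# Toy data: one media channel and one control over four weeks (hypothetical values).
tv_spend = [100.0, 0.0, 50.0, 0.0]
tv_beta = [2.0, 2.0, 2.5, 2.5]
season = [1.0, 0.5, -0.5, -1.0]
season_beta = [3.0, 3.0, 3.0, 3.0]

media_parts = {"TV": media_contribution(tv_spend, tv_beta, alpha=0.5, ec50=50.0, slope=1.0)}
control_parts = {"Seasonality": control_contribution(season, season_beta)}
sales = decompose(10.0, media_parts, control_parts)
print(sales)
```

Because the media coefficients are non-negative and the Hill output lies in [0, 1), every media contribution in this sketch is non-negative, which is what makes the decomposition interpretable per channel.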
Kimodo → MMM Mapping
| Kimodo (Motion Generation) | MMM-Diffusion (Marketing) |
|---|---|
| Text prompts | Media spend, non-marketing vars, total sales |
| Motion/position constraints | Sign constraints (β_media ≥ 0) + prior bounds |
| Root denoiser (trajectory) | Campaign/Geo denoiser (aggregate patterns) |
| Body denoiser (joint angles) | Channel denoiser (per-channel coefficients) |
| Skeleton positions/rotations | Time-varying coefficients for decomposition |
| Foot contact constraints | Media positivity constraint |
| Velocity loss | Temporal smoothness loss |
Key Design Decisions
Constraint Enforcement (3 mechanisms, belt-and-suspenders)
- Log-space reparametrization: Media coefficients are modeled in log-space during training. At decode time, `exp()` guarantees positivity. This is the primary mechanism.
- PhysDiff-style projection: During reverse diffusion sampling, every K=10 steps the denoised x̂₀ is projected into the feasible region (clamped to valid ranges). Based on PhysDiff.
- Soft sign penalty: Training loss includes `L_sign = ReLU(-β_media - threshold)²` to discourage extreme negative values in log-space.
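The first and third mechanisms can be illustrated with a few lines of standard-library Python. This is a sketch under stated assumptions rather than the repository's implementation: the function names are hypothetical, the penalty is averaged over elements, and the squared form follows the sentence above. The Losses section writes the penalty without the square and with a threshold of 5, so the default threshold here is taken from that section.

```python
import math


def decode_media(log_beta):
    """Map log-space values to coefficients; exp() of a finite value is never negative."""
    return [math.exp(v) for v in log_beta]


def sign_penalty(log_beta, threshold=5.0):
    """Mean of ReLU(-v - threshold)^2: nonzero only where v < -threshold."""
    terms = [max(0.0, -v - threshold) ** 2 for v in log_beta]
    return sum(terms) / len(terms)


log_beta = [-7.0, -1.0, 0.0, 2.0]      # hypothetical log-space media coefficients
print(decode_media(log_beta))          # all entries are positive
print(sign_penalty(log_beta))          # only -7.0 is penalised: (7 - 5)^2 / 4 = 1.0
```

Since `exp()` already ensures positivity, the penalty in this reading serves a different purpose: it keeps log-space values from drifting so far negative that the decoded coefficient is effectively zero.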
x₀-prediction (not ε-prediction)
Following MDM and GMD, the model predicts the clean data x₀ directly rather than the noise ε. This enables:
- Constraint projection during denoising (the projection operates on meaningful coefficient values rather than on noise)
- Geometric auxiliary losses (sales reconstruction, temporal smoothness)
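To make the link between x₀-prediction and projection concrete, here is a minimal reverse-diffusion loop in plain Python. It uses the standard DDPM posterior q(x_{t-1} | x_t, x₀) with the model's prediction x̂₀ substituted for x₀. The linear noise schedule, the clamp bounds, the function names, and the toy denoiser are all assumptions made for illustration; the actual sampler lives in mmm_diffusion.py.

```python
import math
import random


def make_schedule(n_steps, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule and cumulative products alpha_bar[t] (assumed schedule)."""
    betas = [beta_start + (beta_end - beta_start) * i / (n_steps - 1) for i in range(n_steps)]
    alpha_bars, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        alpha_bars.append(prod)
    return betas, alpha_bars


def project(x0_hat, lo=-5.0, hi=5.0):
    """Clamp the predicted clean sample into a feasible box (hypothetical bounds)."""
    return [min(max(v, lo), hi) for v in x0_hat]


def sample(denoiser, dim, n_steps=200, project_every=10, rng=None):
    """Ancestral sampling where `denoiser(x_t, t)` returns a prediction of x0."""
    rng = rng or random.Random(0)
    betas, alpha_bars = make_schedule(n_steps)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for t in reversed(range(n_steps)):
        x0_hat = denoiser(x, t)
        if t % project_every == 0:
            x0_hat = project(x0_hat)  # projection acts on the clean-data estimate
        ab_t = alpha_bars[t]
        ab_prev = alpha_bars[t - 1] if t > 0 else 1.0
        # Posterior mean and standard deviation of q(x_{t-1} | x_t, x0).
        coef_x0 = math.sqrt(ab_prev) * betas[t] / (1.0 - ab_t)
        coef_xt = math.sqrt(1.0 - betas[t]) * (1.0 - ab_prev) / (1.0 - ab_t)
        sigma = math.sqrt(betas[t] * (1.0 - ab_prev) / (1.0 - ab_t))
        x = [coef_x0 * h + coef_xt * v + sigma * rng.gauss(0.0, 1.0)
             for h, v in zip(x0_hat, x)]
    return x


def toy_denoiser(x_t, t):
    """Stand-in for the trained network: always predicts the same x0."""
    return [8.0, -0.5, 1.0]


print(sample(toy_denoiser, dim=3, n_steps=50))
```

At the final step (t = 0) the posterior collapses onto x̂₀, so the returned sample is the projected prediction. With ε-prediction the same clamp would have to be applied to a quantity that first needs converting back to x₀.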
Dual-Denoiser Hierarchy
Stage 1 captures aggregate macro patterns (overall media effectiveness, seasonality), while Stage 2 specializes in per-channel coefficient dynamics conditioned on those patterns. This hierarchical decomposition mirrors the Kimodo root→body split.
Training Data
Synthetic MMM data generated with realistic patterns:
- 5 media channels: TV, Digital, Social, Print, Radio
- 3 control variables: Seasonality, Trend, Competitor Price
- Adstock transformation: Geometric decay with α ~ Beta(2,2)
- Hill saturation: With EC50 ~ LogNormal and slope ~ Uniform[0.5, 3]
- Time-varying coefficients: Ornstein-Uhlenbeck process (a random walk with mean reversion)
- 500 training scenarios, 104 weeks each
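As an illustration of the time-varying coefficients in the list above, the sketch below simulates a discretised Ornstein-Uhlenbeck path with an Euler step of one week. The function name, the parameter values, and the choice to simulate in log-space before exponentiating are assumptions; this README does not state the parameters that MMMDataGenerator uses.

```python
import math
import random


def ou_path(n_weeks, mean, theta, sigma, x0=None, seed=0):
    """Euler-discretised OU process: x <- x + theta * (mean - x) + sigma * N(0, 1)."""
    rng = random.Random(seed)
    x = mean if x0 is None else x0
    path = []
    for _ in range(n_weeks):
        path.append(x)
        x = x + theta * (mean - x) + sigma * rng.gauss(0.0, 1.0)
    return path


# Hypothetical settings: the log-coefficient fluctuates around log(2) and is pulled back
# toward it at rate theta per week.
log_beta_tv = ou_path(n_weeks=104, mean=math.log(2.0), theta=0.1, sigma=0.05, seed=42)
beta_tv = [math.exp(v) for v in log_beta_tv]  # positive by construction
print(min(beta_tv), max(beta_tv))
```

Larger `theta` pulls the path back to its mean faster, while `sigma` controls week-to-week variation. With `sigma=0` the path decays geometrically toward `mean`.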
Losses
L_total = L_campaign + L_channel + 0.1·L_smooth + 0.01·L_sign
L_campaign = MSE(agg_pred, agg_target) ← Stage 1 x₀-prediction
L_channel = MSE(coeff_pred, coeff_target) ← Stage 2 x₀-prediction
L_smooth = MSE(Δcoeff_pred, Δcoeff_target) ← Temporal smoothness (≈ velocity loss)
L_sign = ReLU(-β_media_log - 5) ← Soft positivity
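The way these terms combine can be sketched for a single coefficient series in plain Python. The helper names are hypothetical and the inputs are toy values. The sketch assumes Δ denotes the first difference along the time axis, and it takes L_sign as a precomputed number.

```python
def mse(pred, target):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def diff(series):
    """First differences along time: series[t + 1] - series[t]."""
    return [b - a for a, b in zip(series, series[1:])]


def total_loss(agg_pred, agg_target, coeff_pred, coeff_target, l_sign,
               w_smooth=0.1, w_sign=0.01):
    """L_campaign + L_channel + 0.1 * L_smooth + 0.01 * L_sign."""
    l_campaign = mse(agg_pred, agg_target)
    l_channel = mse(coeff_pred, coeff_target)
    l_smooth = mse(diff(coeff_pred), diff(coeff_target))
    return l_campaign + l_channel + w_smooth * l_smooth + w_sign * l_sign


# Toy values: the prediction matches the target except for the last week.
loss = total_loss(
    agg_pred=[0.0, 0.0], agg_target=[1.0, 1.0],
    coeff_pred=[1.0, 2.0, 4.0], coeff_target=[1.0, 2.0, 3.0],
    l_sign=1.0,
)
print(loss)
```

Matching differences rather than levels penalises a prediction whose week-to-week changes deviate from the target's, even when the level error is small.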
Results (PoC, CPU training, 30 epochs)
- Final training loss: 0.129
- Media positivity constraint: 100% satisfied (all generated media coefficients > 0)
- Model size: 2.7M parameters
- Generation time: ~2.6s per scenario (200 diffusion steps on CPU)
Usage
from mmm_diffusion import MMMDiffusionModel, MMMDataGenerator, MMMDiffusionDataset
# Generate synthetic data
gen = MMMDataGenerator(n_weeks=104, seed=42)
samples = gen.generate_dataset(100)
# Build model
model = MMMDiffusionModel(n_media=5, n_ctrl=3, T_diff=200)
# Train
dataset = MMMDiffusionDataset(samples, normalize=True)
# ... (see mmm_diffusion.py for full training loop)
# Generate coefficients for new conditioning data
conditioning = ... # (1, T, 9) tensor: [media_spend, controls, total_sales]
coefficients = model.sample(conditioning, n_steps=200)
decoded = dataset.decode_coefficients(coefficients)
# decoded[:, :, :5] are GUARANTEED positive (media channels)
Files
- `mmm_diffusion.py`: Full implementation (data generation, model, training, evaluation, visualization)
- `mmm_diffusion_model.pt`: Trained model checkpoint (PoC, 30 epochs on CPU)
- `training_history.png`: Training loss curves
- `coeff_comparison.png`: True vs predicted coefficients on validation sample
- `sales_decomposition.png`: Sales decomposition visualization
References
- GMD (arXiv:2305.12577): Two-stage trajectory + body diffusion (closest public analog to Kimodo)
- MDM (arXiv:2209.14916): Transformer denoiser, x₀-prediction, geometric losses
- PhysDiff (arXiv:2212.02500): Physics-based constraint projection during denoising
- PDM (arXiv:2402.03559): Projected diffusion for hard constraint satisfaction
- NNN (arXiv:2504.06212): Neural network MMM architecture (Google)
- TabDDPM (arXiv:2209.15421): Diffusion models for tabular data
License
MIT