Update Smoothing + Snoo

TL;DR: Combining #128 and #129 decreases iters to 5590 (-20 from #129), and p-values are much more robust.

This PR combines #128 (Snoo optimizer) and #129 (EMA on top of Muon). Both PRs are in some way “smoothing out” the updates: #129 smooths the Muon update, and #128 applies a lookahead smoothing wrapper to the entire optimizer. Here, we just apply #128 to #129. After combining the two, the total number of iterations decreases to 5590.

More detail on method

#129 smooths out the Muon updates:

muon_update = NS(EMA(grads))
final_update = EMA(muon_update)

Here, unlike in #129, we use a constant EMA coefficient of 0.2.
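To make the two formulas above concrete, here is a minimal plain-Python sketch of one smoothed Muon step. It is not the code from #129: the class name `SmoothedMuon`, the helper names, and the gradient momentum `grad_beta=0.95` are hypothetical, and I am assuming the 0.2 coefficient weights the previous average (the text does not say which side it applies to). `newton_schulz` uses the quintic iteration with the coefficients commonly used for Muon.

```python
import math

def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def blend(prev, new, beta):
    """Elementwise EMA step: beta * prev + (1 - beta) * new."""
    return [[beta * p + (1 - beta) * n for p, n in zip(rp, rn)]
            for rp, rn in zip(prev, new)]

def newton_schulz(g, steps=5, eps=1e-7):
    """Approximately orthogonalize g with a quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315
    tall = len(g) > len(g[0])
    # Work on the wide orientation so x @ x.T is the smaller Gram matrix.
    x = transpose(g) if tall else [row[:] for row in g]
    norm = math.sqrt(sum(v * v for row in x for v in row)) + eps
    x = [[v / norm for v in row] for row in x]
    for _ in range(steps):
        gram = matmul(x, transpose(x))
        gram2 = matmul(gram, gram)
        poly = [[b * p + c * q for p, q in zip(r1, r2)]
                for r1, r2 in zip(gram, gram2)]
        px = matmul(poly, x)
        x = [[a * p + q for p, q in zip(rx, rpx)] for rx, rpx in zip(x, px)]
    return transpose(x) if tall else x

class SmoothedMuon:
    """Hypothetical holder for the two EMA buffers of one weight matrix."""

    def __init__(self, rows, cols, grad_beta=0.95, update_beta=0.2):
        # grad_beta is an assumed value; update_beta=0.2 is the constant
        # coefficient from the text, assumed to weight the old average.
        self.grad_beta = grad_beta
        self.update_beta = update_beta
        self.grad_ema = [[0.0] * cols for _ in range(rows)]
        self.update_ema = [[0.0] * cols for _ in range(rows)]

    def step(self, grad):
        # muon_update = NS(EMA(grads))
        self.grad_ema = blend(self.grad_ema, grad, self.grad_beta)
        muon_update = newton_schulz(self.grad_ema)
        # final_update = EMA(muon_update)
        self.update_ema = blend(self.update_ema, muon_update, self.update_beta)
        return self.update_ema

opt = SmoothedMuon(2, 2)
final_update = opt.step([[3.0, 0.0], [0.0, 1.0]])
```

The returned `final_update` would then be scaled by the learning rate and subtracted from the weights; that part is left out because it does not change between #129 and this PR.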

#128 applies a lookahead step to the updates: run an inner optimizer for K iterations, and treat the parameter displacement as a “gradient” for an outer SGD optimizer. Note that if K=1 and the outer SGD optimizer does not employ Nesterov momentum, then I think the lookahead step and the update EMA of #129 are equivalent, with the exception that #128 works on every parameter rather than just the Muon parameters.
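The K=1 claim can be checked on a toy problem. The sketch below is not the Snoo implementation from #128: it assumes a stateless inner optimizer (plain gradient descent on the hypothetical loss 0.5 * theta**2), an outer SGD with heavy-ball momentum in the form `buf = mu * buf + g`, and arbitrary learning rates. Under those assumptions, lookahead with K=1 follows the same trajectory as applying an EMA of the inner updates, up to a rescaling of the step size by 1 / (1 - mu).

```python
def grad(theta):
    # Gradient of the toy loss 0.5 * theta**2 (hypothetical stand-in for a model).
    return theta

def lookahead_run(theta, steps, inner_lr=0.1, outer_lr=1.0, mu=0.5, k=1):
    """K inner steps, then heavy-ball SGD on the parameter displacement."""
    slow, buf = theta, 0.0
    for _ in range(steps):
        fast = slow  # fast weights restart from the slow weights
        for _ in range(k):
            fast -= inner_lr * grad(fast)
        pseudo_grad = slow - fast  # displacement treated as a "gradient"
        buf = mu * buf + pseudo_grad  # non-Nesterov momentum
        slow -= outer_lr * buf
    return slow

def ema_run(theta, steps, inner_lr=0.1, outer_lr=1.0, mu=0.5):
    """Apply an EMA of the raw updates, rescaled to match the momentum buffer."""
    ema = 0.0
    for _ in range(steps):
        update = inner_lr * grad(theta)
        ema = mu * ema + (1 - mu) * update
        theta -= outer_lr / (1 - mu) * ema
    return theta

a = lookahead_run(1.0, steps=20)
b = ema_run(1.0, steps=20)
print(abs(a - b))
```

With a stateful inner optimizer the equivalence is less clean, and with K > 1 the outer step averages over whole K-step displacements rather than single updates, so the two methods are no longer the same thing.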

Here, we simply use the smoothed Muon updates of #129 as the inner optimizer for #128, in addition to importing some learning rate tuning from #129.

Overall, the total iterations can be decreased to 5590 (from 5610 in #129 or 5640 in #128). I was also more stringent with the p-value criterion, so it’s likely there is a bit more “slack” in this submission than in either #128 or #129.

Baselines (80 runs each)

I have noticed that there is substantial variance in the p-values for these runs, so I ran 80 runs of each baseline, and then created 1000 bootstrap samples of size 40 to compute the fraction of times the p-value was less than 0.01. I’m not a real statistician, but I feel better about this methodology than the one employed in #129 to estimate the probability of seeing a p-value below 0.01.
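For reference, the bootstrap procedure described above can be sketched roughly as follows. This is not the script that produced the numbers below. It assumes resampling with replacement, a one-sided one-sample t-test against the 2.92 target, and it computes the Student-t CDF by numerical integration so that only the standard library is needed. The function names are hypothetical, and the `losses` list is synthetic placeholder data, not results from any run.

```python
import math
import random
import statistics

def t_cdf(t, df, intervals=400):
    """Student-t CDF via Simpson's rule on the density."""
    if t == 0:
        return 0.5
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))

    def pdf(x):
        return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

    h = abs(t) / intervals
    total = pdf(0.0) + pdf(abs(t))
    for i in range(1, intervals):
        total += (4 if i % 2 else 2) * pdf(i * h)
    area = min(total * h / 3, 0.5)  # integral of the density from 0 to |t|
    return 0.5 + math.copysign(area, t)

def one_sided_p(losses, target=2.92):
    """p-value for H1: mean loss < target (small means below the target)."""
    n = len(losses)
    se = statistics.stdev(losses) / math.sqrt(n)
    t = (statistics.mean(losses) - target) / se
    return t_cdf(t, n - 1)

def bootstrap_pass_rate(losses, n_boot=1000, size=40, alpha=0.01, seed=0):
    """Fraction of resamples (with replacement) whose p-value is below alpha."""
    rng = random.Random(seed)
    p_values = [one_sided_p(rng.choices(losses, k=size)) for _ in range(n_boot)]
    return sum(p < alpha for p in p_values) / n_boot, p_values

# Synthetic placeholder data standing in for 80 validation losses.
rng = random.Random(123)
losses = [rng.gauss(2.9198, 0.0008) for _ in range(80)]
rate, p_values = bootstrap_pass_rate(losses)
print(f"Mean p-value: {statistics.mean(p_values):.6f}")
print(f"Percentage of p-values below 0.01: {100 * rate:.2f}%")
```

The pass rate is the more useful summary here, since it directly estimates how often a fresh batch of 40 runs would clear the p < 0.01 bar.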

#129:

--- Val Loss Stats ---
mean: 	2.919815
std:  	0.000751
val loss 99% confidence interval: (2.919594 - 2.920037)
val_loss t-test p=0.015461 (small means <2.92)
--- Bootstrap p-value analysis --- (1000 samples of size 40)
Mean p-value: 0.139028
Variance of p-values: 0.029849
Percentage of p-values below 0.01: 21.00%
--- Training Time Stats ---
train time (minutes): mean=23.4811, std=0.1983
train time 99% confidence interval: (23.4227 - 23.5396)
avg ms per iteration: 251.1352. 99%% confidence interval: (250.5097 - 251.7608)

#128 (here I use the current configuration with 5640 steps)

--- Val Loss Stats ---
mean: 	2.919738
std:  	0.000884
val loss 99% confidence interval: (2.919477 - 2.919999)
val_loss t-test p=0.004818 (small means <2.92)

--- Bootstrap p-value analysis --- (1000 samples of size 40)
Mean p-value: 0.092580
Variance of p-values: 0.018915
Percentage of p-values below 0.01: 32.10%

--- Training Time Stats ---
train time (minutes): mean=23.6421, std=0.1916
train time 99% confidence interval: (23.5856 - 23.6986)
avg ms per iteration: 251.5118. 99%% confidence interval: (250.9105 - 252.1131)

So, from this we see that both of these baselines have a reasonable chance of hitting the required p-value in 40 samples (21.00% and 32.10% of bootstrap samples, respectively). The “mean p-value” for the bootstrap analysis is very high because the distribution of p-values is skewed, so the mean is dominated by the larger values.

This PR

I ran 160 runs for the new changes in order to have more data, and from these again created 1000 bootstrapped samples of size 40 each to get an idea for the variance in the p-value calculation. Over these samples, we see:

--- Val Loss stats over all 160 runs --- 
mean: 	2.919547
std:  	0.000798
val loss 99% confidence interval: (2.919383 - 2.919712)
val_loss t-test p=0.000000 (small means <2.92)

--- Bootstrap p-value analysis (1000 samples of size 40 each) ---
Mean p-value: 0.006984
Max p-value: 0.262882
Variance of p-values: 0.000433
Percentage of p-values below 0.01: 85.40%

--- Training Time Stats ---
train time (minutes): mean=23.4283, std=0.1866
train time 99% confidence interval: (23.3899 - 23.4668)
avg ms per iteration: 251.4670. 99%% confidence interval: (251.0542 - 251.8799)

More Aggressive run with 5580 iterations:

I also checked 120 runs of 5580 iterations. As expected, this still hits the target, but the p-value is a bit less robust.

--- Val loss stats over all 120 runs ---
mean: 	2.919583
std:  	0.000897
val loss 99% confidence interval: (2.919368 - 2.919797)
val_loss t-test p=0.000001 (small means <2.92)

--- Bootstrap p-value analysis (1000 samples of size 40 each) ---
Mean p-value: 0.018333
Max p-value: 0.376981
Variance of p-values: 0.001668
Percentage of p-values below 0.01: 67.90%

--- Training time stats ---
train time (minutes): mean=23.4492, std=0.2011
train time 99% confidence interval: (23.4012 - 23.4973)
avg ms per iteration: 252.1423. 99%% confidence interval: (251.6256 - 252.6591)

Ablation

To make sure that the improvement over #128 is not just from the new LR tuning, I turned off the update smoothing, but kept the LR tuning. I also increased the number of iterations to 5600, which I guessed would more than make up for any improved time-per-step:

--- Val Loss Stats ---
mean: 	2.920357
std:  	0.000802
val loss 99% confidence interval: (2.920120 - 2.920593)
val_loss t-test p=0.999924 (small means <2.92)
--- Bootstrap p-value analysis --- (1000 samples of size 40)
Mean p-value: 0.973404
Max p-value: 1.000000
Variance of p-values: 0.003703
Percentage of p-values below 0.01: 0.00%
--- Training Time Stats ---
train time (minutes): mean=23.5413, std=0.2211
train time 99% confidence interval: (23.4761 - 23.6065)
avg ms per iteration: 252.2285. 99%% confidence interval: (251.5297 - 252.9272)

So, it does not seem to hit the target without the smoothing.

I also tried tuning the LR cooldown fraction a bit (both with and without smoothing) as suggested by @YouJiacheng in a comment on #129, but did not find any improvement from this.

A list of all 120 validation losses:

A list of all 160 timings: