Fizzarolli committed on
Commit 2e743e5 · verified · 1 Parent(s): 6c1bfe3

Update README.md

Files changed (1)
  1. README.md +2 -0
README.md CHANGED
@@ -52,6 +52,8 @@ please join [our discord](https://discord.gg/PPBMhF2vgC) or [our matrix](https:/
52
  This model was trained over the course of ~18 hours on an A100 node. We used 8-bit AdamW and the Cosine LR scheduler, as well as both gradient clipping and weight decay for regularization.
53
  Before training, we [converted the original model to the Qwen 2 architecture](https://huggingface.co/allura-forge/MiMo-7B-Base-Qwenified) by removing the MTP weights and custom modelling code, and slightly modifying the `config.json`. This allowed us to use CCE and Liger, which made training much faster than it would otherwise have been.
54
 
55
+ We decided to keep the final model in the converted Qwen 2 format because it is better supported by community software such as EXL2, EXL3, and Aphrodite, and because the original architecture's MTP weights would likely be much less effective after finetuning without them.
56
+
57
  ### [WandB](https://wandb.ai/new-eden/Koto-Small/runs/zk8t6oq6/workspace)
58
  ![image/png](https://cdn-uploads.huggingface.co/production/uploads/634262af8d8089ebaefd410e/Fc-Dvakg3lSwk2co7jHIM.png)
59
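
For readers unfamiliar with the conversion the README text describes (dropping the MTP weights and adjusting `config.json`), here is a minimal sketch of what such a step could look like. It is not the authors' actual script. The tensor-name prefix `model.mtp_layers.` and the config keys `auto_map` and `num_nextn_predict_layers` are assumptions made for illustration, and the functions work on plain dictionaries rather than real checkpoint files.

```python
# Hypothetical prefix for MTP tensors; the real checkpoint's names may differ.
MTP_PREFIX = "model.mtp_layers."


def strip_mtp_weights(state_dict, prefix=MTP_PREFIX):
    """Return a copy of the weight mapping without entries under `prefix`."""
    return {name: tensor for name, tensor in state_dict.items()
            if not name.startswith(prefix)}


def qwenify_config(config):
    """Return a copy of a config dict that points at the stock Qwen 2 classes."""
    new_config = dict(config)
    # Assumed keys: custom-code hook and MTP layer count (illustrative only).
    new_config.pop("auto_map", None)
    new_config.pop("num_nextn_predict_layers", None)
    # Qwen2ForCausalLM / "qwen2" are the identifiers Transformers uses for Qwen 2.
    new_config["architectures"] = ["Qwen2ForCausalLM"]
    new_config["model_type"] = "qwen2"
    return new_config


if __name__ == "__main__":
    weights = {
        "model.layers.0.self_attn.q_proj.weight": "tensor-a",
        "model.mtp_layers.0.input_proj.weight": "tensor-b",
        "lm_head.weight": "tensor-c",
    }
    config = {
        "architectures": ["MiMoForCausalLM"],
        "model_type": "mimo",
        "auto_map": {"AutoModelForCausalLM": "modeling_mimo.MiMoForCausalLM"},
        "num_nextn_predict_layers": 1,
        "hidden_size": 4096,
    }
    print(sorted(strip_mtp_weights(weights)))
    print(qwenify_config(config))
```

In practice the weight mapping would come from the checkpoint shards and the config from the model's `config.json`. Both functions return new dictionaries and leave their inputs unchanged, so the original files can be kept for comparison.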