---
license: mit
language:
- en
base_model:
- XiaomiMiMo/MiMo-7B-Base
library_name: transformers
tags:
- writing
- creative-writing
---
# Koto Small 7B (Pretrained)

![482629.png](https://cdn-uploads.huggingface.co/production/uploads/634262af8d8089ebaefd410e/9Bnn2AnIjfQFWBGkhDNmI.png)

Koto-Small-7B-PT is a version of MiMo-7B-Base trained on almost a billion tokens of creative writing data.

## Usage

This model is not intended for use outside of raw text completion settings, such as cowriting. Instruct will *not* work. Multi-turn roleplay will *not* work.

It was trained at 32k, but as not all samples were this long, we expect that in the best case you can get ~16k effective context.

We found that 1.25 temperature and 0.05 min_p worked best, but YMMV!

## Datasets

Some of the data used to train this model includes:

- Most of [The Anarchist Library](https://theanarchistlibrary.org/), a repository for anarchist manifestos and writing (see [allura-org/the-anarchist-library](https://huggingface.co/datasets/allura-org/the-anarchist-library))
- A random sample of public domain books from Project Gutenberg
- Furry (anthro and feral) storytelling and smut
- A small subset of known high-quality books and story data

## Acknowledgements

- thank you to [unk] for drawing the art used in the model card!
- thank you very much to [mango/deltavector](https://huggingface.co/Delta-Vector) for providing the compute used to train this model
- thanks to curse for testing, ideas
- thanks to toasty for some data, ideas
- thanks to everyone else in allura for moral support

ilya <3

## Call for Help

if you would like to help build on this model (instruct/RP SFT, further annealing on higher quality data, etc)... please join [our discord](https://discord.gg/PPBMhF2vgC) or [our matrix](https://matrix.to/#/#allura:allura.moe)! <3

## Technical Appendix
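The sampling settings recommended in the Usage section can be sketched with `transformers` as a plain text-completion call. This is a minimal sketch, not a tested recipe: the repo id `allura-org/Koto-Small-7B-PT`, the helper names, and the `max_new_tokens` default are assumptions for illustration, and `min_p` needs a reasonably recent `transformers` release.

```python
# Sampling settings reported in the Usage section of this card.
SAMPLING = {"do_sample": True, "temperature": 1.25, "min_p": 0.05}


def build_generation_kwargs(max_new_tokens: int = 256) -> dict:
    """Combine the card's sampling settings with a length limit (our choice)."""
    return {**SAMPLING, "max_new_tokens": max_new_tokens}


def complete(prompt: str,
             model_id: str = "allura-org/Koto-Small-7B-PT",  # assumed repo id
             max_new_tokens: int = 256) -> str:
    """Continue `prompt` as raw text: no chat template, no instruct format."""
    # Heavy imports live here so the helpers above work without a GPU stack.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    device = "cuda" if torch.cuda.is_available() else "cpu"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16
    ).to(device)

    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    output_ids = model.generate(**inputs, **build_generation_kwargs(max_new_tokens))
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


if __name__ == "__main__":
    # Downloads the weights on first run.
    print(complete("The lighthouse keeper had not spoken to anyone in"))
```

The prompt is passed as-is because the model is meant for raw completion; wrapping it in a chat template would work against the guidance above.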
### Training Notes

This model was trained over the course of ~18 hours on an A100 node. We used 8-bit AdamW and a cosine LR scheduler, as well as both gradient clipping and weight decay for regularization.

Before training, we [converted the original model to the Qwen 2 architecture](https://huggingface.co/allura-forge/MiMo-7B-Base-Qwenified) by removing the MTP weights and custom modelling code, and slightly modifying the `config.json`. This allowed us to use CCE (Cut Cross Entropy) and Liger, which made training much faster than it would have been otherwise.

### [WandB](https://wandb.ai/new-eden/Koto-Small/runs/zk8t6oq6/workspace)

![image/png](https://cdn-uploads.huggingface.co/production/uploads/634262af8d8089ebaefd410e/Fc-Dvakg3lSwk2co7jHIM.png)

### Finetuning Notes

This model has had ChatML tokens already added by Xiaomi. Please use this format when finetuning to ensure compatibility with the rest of the ecosystem.

### Axolotl Config

```yaml
## model
base_model: allura-forge/MiMo-7B-Base-Qwenified
trust_remote_code: true

## qlora COPE!!!
load_in_8bit: false
load_in_4bit: false
strict: false

## data
datasets:
  - path: estrogen/bookscpt2
    type: completion
    field: text
shuffle_merged_datasets: true
dataset_prepared_path: dataset_prepareds
val_set_size: 0.0
output_dir: ./MiMo-Pretrain

## Liger + CCE
plugins:
  - axolotl.integrations.liger.LigerPlugin
  - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
liger_rope: true
liger_rms_norm: true
liger_layer_norm: true
liger_glu_activation: true
liger_fused_linear_cross_entropy: false
cut_cross_entropy: true

## CTX settings
sequence_len: 32768
sample_packing: true
eval_sample_packing: false
pad_to_sequence_len: true

## max grad norm
max_grad_norm: 1.0

## WandB
wandb_project: Koto-Small
wandb_entity:
wandb_watch:
wandb_name: MiMo-7b_1e-5_adamw-8bit
wandb_log_model:

## hoe params
gradient_accumulation_steps: 4 # ???
micro_batch_size: 4
num_epochs: 1
lr_scheduler: cosine
learning_rate: 1e-5
optimizer: adamw_bnb_8bit # Options: "paged_ademamix_8bit", "adamw_bnb_8bit", "paged_adamw_8bit"
deepcompile: true

train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false

gradient_checkpointing: offload
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true
s2_attention:

warmup_steps: 50
saves_per_epoch: 2
debug:
deepspeed: ./deepspeed_configs/zero2.json
weight_decay: 0.0025
fsdp:
fsdp_config:
```
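As a rough sanity check on the batch settings in the config above, the arithmetic below estimates how many tokens one optimizer step covers. It is a back-of-the-envelope sketch: the GPU count of 8 is an assumption (the card only says "an A100 node"), the total is rounded up from "almost a billion" tokens, and the per-step figure is an upper bound because packed sequences can contain padding.

```python
# Values copied from the Axolotl config.
SEQUENCE_LEN = 32768
MICRO_BATCH_SIZE = 4
GRAD_ACCUM_STEPS = 4


def tokens_per_step(num_gpus: int) -> int:
    """Upper bound on tokens per optimizer step, assuming fully packed sequences."""
    return SEQUENCE_LEN * MICRO_BATCH_SIZE * GRAD_ACCUM_STEPS * num_gpus


def approx_steps(total_tokens: int, num_gpus: int) -> int:
    """Optimizer steps needed for one epoch over `total_tokens` (rounded up)."""
    per_step = tokens_per_step(num_gpus)
    return -(-total_tokens // per_step)  # ceiling division


if __name__ == "__main__":
    NUM_GPUS = 8                  # assumption: not stated in the card
    TOTAL_TOKENS = 1_000_000_000  # rounded from "almost a billion"
    print("tokens per step:", tokens_per_step(NUM_GPUS))
    print("approx. steps per epoch:", approx_steps(TOTAL_TOKENS, NUM_GPUS))
```

Under these assumptions a single epoch comes to a few hundred optimizer steps, which puts `warmup_steps: 50` and `saves_per_epoch: 2` in context; a different GPU count scales the numbers proportionally.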