---
license: mit
language:
- en
base_model:
- XiaomiMiMo/MiMo-7B-Base
library_name: transformers
tags:
- writing
- creative-writing
---

# Koto Small 7B (Pretrained)

![482629.png](https://cdn-uploads.huggingface.co/production/uploads/634262af8d8089ebaefd410e/9Bnn2AnIjfQFWBGkhDNmI.png)

Koto-Small-7B-PT is a version of MiMo-7B-Base trained on almost a billion tokens of creative writing data.

## Usage

This model is not intended for use outside of raw text completion settings, such as cowriting. Instruct will *not* work. Multi-turn roleplay will *not* work.

It was trained at 32k, but as not all samples were this long, we expect that in the best case you can get ~16k effective context.

We found that 1.25 temperature and 0.05 min_p worked best, but YMMV!

## Datasets

Some of the data used to train this model includes:
- Most of [The Anarchist Library](https://theanarchistlibrary.org/), a repository for anarchist manifestos and writing (see [allura-org/the-anarchist-library](https://huggingface.co/datasets/allura-org/the-anarchist-library))
- A random sample of public domain books from Project Gutenberg
- Furry (anthro and feral) storytelling and smut
- A small subset of known high-quality books and story data

## Acknowledgements
- thank you to [unk] for drawing the art used in the model card!
- thank you very much to [mango/deltavector](https://huggingface.co/Delta-Vector) for providing the compute used to train this model
- thanks to curse for testing, ideas
- thanks to toasty for some data, ideas
- thanks to everyone else in allura for moral support

ilya <3

## Call for Help
if you would like to help build on this model (instruct/RP SFT, further annealing on higher quality data, etc)...  
please join [our discord](https://discord.gg/PPBMhF2vgC) or [our matrix](https://matrix.to/#/#allura:allura.moe)! <3

## Technical Appendix
<details>

### Training Notes
This model was trained over the course of ~18 hours on an A100 node. We used 8-bit AdamW and the Cosine LR scheduler, as well as both gradient clipping and weight decay for regularization.
Before training, we [converted the original model to the Qwen 2 architecture](https://huggingface.co/allura-forge/MiMo-7B-Base-Qwenified) by removing the MTP weights and custom modelling code, and slightly modifying the `config.json`. This allowed us to use CCE and Liger which let the train go much faster than it would have otherwise.

### [WandB](https://wandb.ai/new-eden/Koto-Small/runs/zk8t6oq6/workspace)
![image/png](https://cdn-uploads.huggingface.co/production/uploads/634262af8d8089ebaefd410e/Fc-Dvakg3lSwk2co7jHIM.png)

### Finetuning Notes
This model has had ChatML tokens already added by Xiaomi. Please use this format when finetuning to ensure compatibility with the rest of the ecosystem.

### Axolotl Config
```yaml
## model
base_model: allura-forge/MiMo-7B-Base-Qwenified
trust_remote_code: true
## qlora COPE!!!
load_in_8bit: false
load_in_4bit: false
strict: false

## data 
datasets:
datasets:
  - path: estrogen/bookscpt2
    type: completion
    field: text

shuffle_merged_datasets: true
dataset_prepared_path: dataset_prepareds
val_set_size: 0.0
output_dir: ./MiMo-Pretrain

## Liger + CCE
plugins:
  - axolotl.integrations.liger.LigerPlugin
  - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
liger_rope: true
liger_rms_norm: true
liger_layer_norm: true
liger_glu_activation: true
liger_fused_linear_cross_entropy: false
cut_cross_entropy: true

## CTX settings
sequence_len: 32768
sample_packing: true
eval_sample_packing: false
pad_to_sequence_len: true

## max grad norm
max_grad_norm: 1.0

## WandB
wandb_project: Koto-Small
wandb_entity:
wandb_watch:
wandb_name: MiMo-7b_1e-5_adamw-8bit
wandb_log_model:

## hoe params
gradient_accumulation_steps: 4 # ???
micro_batch_size: 4
num_epochs: 1
lr_scheduler: cosine
learning_rate: 1e-5
optimizer: adamw_bnb_8bit  # Options: "paged_ademamix_8bit", "adamw_bnb_8bit", "paged_adamw_8bit"
deepcompile: true
train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false

gradient_checkpointing: offload
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true
s2_attention:

warmup_steps: 50
saves_per_epoch: 2
debug:
deepspeed: ./deepspeed_configs/zero2.json
weight_decay: 0.0025
fsdp:
fsdp_config:
```

</details>