| license: mit | |
| language: | |
| - en | |
| base_model: | |
| - XiaomiMiMo/MiMo-7B-Base | |
| library_name: transformers | |
| tags: | |
| - writing | |
| - creative-writing | |
| # Koto Small 7B (Pretrained) | |
|  | |
| Koto-Small-7B-PT is a version of MiMo-7B-Base trained on almost a billion tokens of creative writing data. | |
| **Please check out [Aurore-Reveil/Koto-Small-7B-IT](https://huggingface.co/Aurore-Reveil/Koto-Small-7B-IT), it's the official RP and instruct tune!** | |
| ## Usage | |
| This model is not intended for use outside of raw text completion settings, such as cowriting. Instruct will *not* work. Multi-turn roleplay will *not* work. | |
| It was trained with a 32k-token sequence length, but because not all training samples were that long, we expect roughly 16k tokens of effective context in the best case. | |
| We found that 1.25 temperature and 0.05 min_p worked best, but YMMV! | |
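The usage notes above can be sketched with Transformers roughly as follows. This is a minimal sketch, not an official recipe: it feeds the model a plain text opening (no chat template), samples with temperature 1.25 and min_p 0.05, and keeps only the most recent ~16k tokens of the prompt. The helper names, the `max_new_tokens` value, and reading "~16k" as 16 × 1024 tokens are assumptions made for illustration, and `min_p` needs a reasonably recent Transformers release.

```python
# Sampling settings suggested in the card; max_new_tokens is an arbitrary choice.
SAMPLING = {"do_sample": True, "temperature": 1.25, "min_p": 0.05, "max_new_tokens": 256}

# ASSUMPTION: "~16k effective context" is taken to mean 16 * 1024 tokens.
EFFECTIVE_CONTEXT = 16 * 1024


def clip_to_context(token_ids, budget=EFFECTIVE_CONTEXT):
    """Keep only the most recent `budget` tokens of a prompt."""
    return token_ids[-budget:]


def continue_text(prompt, model_id="allura-org/Koto-Small-7B-PT"):
    """Continue `prompt` as raw text (no chat template, no instruct formatting)."""
    # Imported lazily so the helpers above work without torch/transformers installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    ids = clip_to_context(tokenizer(prompt)["input_ids"])
    input_ids = torch.tensor([ids], device=model.device)
    output = model.generate(
        input_ids,
        attention_mask=torch.ones_like(input_ids),
        **SAMPLING,
    )
    # Decode only the newly generated tokens, not the prompt.
    return tokenizer.decode(output[0][len(ids):], skip_special_tokens=True)
```

Calling `continue_text("The lighthouse keeper had not spoken to anyone in eleven years, until")` downloads the weights and returns a sampled continuation. The prompt is written as the start of a passage rather than as an instruction, since the model only does text completion.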
| ## Datasets | |
| Some of the data used to train this model includes: | |
| - Most of [The Anarchist Library](https://theanarchistlibrary.org/), a repository for anarchist manifestos and writing (see [allura-org/the-anarchist-library](https://huggingface.co/datasets/allura-org/the-anarchist-library)) | |
| - A random sample of public domain books from Project Gutenberg | |
| - Furry (anthro and feral) storytelling and smut | |
| - A small subset of known high-quality books and story data | |
| ## Acknowledgements | |
| - thank you to [unk] for drawing the art used in the model card! | |
| - thank you very much to [mango/deltavector](https://huggingface.co/Delta-Vector) for providing the compute used to train this model | |
| - thanks to curse for testing, ideas | |
| - thanks to toasty for some data, ideas | |
| - thanks to everyone else in allura for moral support | |
| ilya <3 | |
| ## Call for Help | |
| if you would like to help build on this model (instruct/RP SFT, further annealing on higher quality data, etc)... | |
| please join [our discord](https://discord.gg/PPBMhF2vgC) or [our matrix](https://matrix.to/#/#allura:allura.moe)! <3 | |
| ## Technical Appendix | |
| <details> | |
| ### Training Notes | |
| This model was trained over the course of ~18 hours on an A100 node. We used 8-bit AdamW with a cosine LR schedule, plus gradient clipping and weight decay for regularization. | |
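For readers unfamiliar with the schedule, here is a small sketch of a cosine learning-rate curve with linear warmup, using the `learning_rate` (1e-5) and `warmup_steps` (50) values from the Axolotl config in this appendix. It follows the common "linear warmup, then cosine decay to zero" shape; the trainer's exact implementation may differ, and `total_steps` is a hypothetical parameter because the card does not state the step count.

```python
import math

PEAK_LR = 1e-5     # learning_rate in the Axolotl config
WARMUP_STEPS = 50  # warmup_steps in the Axolotl config


def lr_at(step, total_steps, peak_lr=PEAK_LR, warmup_steps=WARMUP_STEPS):
    """Learning rate at `step` for linear warmup followed by cosine decay to zero."""
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate.
        return peak_lr * step / warmup_steps
    # Fraction of the post-warmup phase that has elapsed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

With a hypothetical `total_steps=1000`, the rate rises to 1e-5 at step 50, passes through roughly half of that midway through the decay, and reaches zero at the final step.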
| Before training, we [converted the original model to the Qwen 2 architecture](https://huggingface.co/allura-forge/MiMo-7B-Base-Qwenified) by removing the MTP weights and custom modelling code, and slightly modifying the `config.json`. This allowed us to use CCE (Cut Cross Entropy) and Liger, which made training much faster than it would otherwise have been. | |
| We decided to keep the final model in the converted Qwen 2 format for two reasons: it is better supported by community software such as EXL2, EXL3, and Aphrodite, and the original architecture's MTP weights would likely be much less effective after finetuning without them. | |
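The conversion described above can be pictured with the following sketch, which works on plain Python dictionaries standing in for a checkpoint's tensors and its `config.json`. It is not the script that was actually used: the `"mtp"` name marker, the removal of `auto_map`, and the specific config values are assumptions for illustration, so inspect the real checkpoint and config before relying on any of them.

```python
def strip_mtp_weights(state_dict, marker="mtp"):
    """Return a copy of `state_dict` without tensors whose name contains `marker`.

    ASSUMPTION: MTP tensors can be recognised by a substring of their name.
    """
    return {name: tensor for name, tensor in state_dict.items() if marker not in name}


def qwenify_config(config):
    """Return a copy of a config dict that points at the stock Qwen 2 classes.

    ASSUMPTION: dropping `auto_map` (the hook for custom modelling code) and
    setting these two fields is enough; the real edit may have touched more.
    """
    new_config = dict(config)
    new_config.pop("auto_map", None)
    new_config["architectures"] = ["Qwen2ForCausalLM"]
    new_config["model_type"] = "qwen2"
    return new_config
```

Both helpers return new dictionaries and leave their inputs untouched, which makes it easy to diff the before and after states when checking a conversion.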
| ### [WandB](https://wandb.ai/new-eden/Koto-Small/runs/zk8t6oq6/workspace) | |
|  | |
| ### Finetuning Notes | |
| This model has had ChatML tokens already added by Xiaomi. Please use this format when finetuning to ensure compatibility with the rest of the ecosystem. | |
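As a reminder of what the ChatML layout looks like when preparing finetuning data, here is a minimal sketch. The `to_chatml` helper is hypothetical and shows only the general ChatML layout; the chat template shipped with the tokenizer should be treated as authoritative if the two disagree.

```python
def to_chatml(messages, add_generation_prompt=False):
    """Render a list of {"role", "content"} dicts in the general ChatML layout."""
    text = "".join(
        f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        for message in messages
    )
    if add_generation_prompt:
        # Open an assistant turn for the model to complete.
        text += "<|im_start|>assistant\n"
    return text


example = to_chatml(
    [
        {"role": "system", "content": "You are a co-writer."},
        {"role": "user", "content": "Continue the scene."},
    ],
    add_generation_prompt=True,
)
```

Each turn is wrapped in `<|im_start|>` and `<|im_end|>`, with the role on the first line and the content after it.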
| ### Axolotl Config | |
| ```yaml | |
| ## model | |
| base_model: allura-forge/MiMo-7B-Base-Qwenified | |
| trust_remote_code: true | |
| ## qlora COPE!!! | |
| load_in_8bit: false | |
| load_in_4bit: false | |
| strict: false | |
| ## data | |
| datasets: | |
|   - path: estrogen/bookscpt2 | |
|     type: completion | |
|     field: text | |
| shuffle_merged_datasets: true | |
| dataset_prepared_path: dataset_prepareds | |
| val_set_size: 0.0 | |
| output_dir: ./MiMo-Pretrain | |
| ## Liger + CCE | |
| plugins: | |
| - axolotl.integrations.liger.LigerPlugin | |
| - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin | |
| liger_rope: true | |
| liger_rms_norm: true | |
| liger_layer_norm: true | |
| liger_glu_activation: true | |
| liger_fused_linear_cross_entropy: false | |
| cut_cross_entropy: true | |
| ## CTX settings | |
| sequence_len: 32768 | |
| sample_packing: true | |
| eval_sample_packing: false | |
| pad_to_sequence_len: true | |
| ## max grad norm | |
| max_grad_norm: 1.0 | |
| ## WandB | |
| wandb_project: Koto-Small | |
| wandb_entity: | |
| wandb_watch: | |
| wandb_name: MiMo-7b_1e-5_adamw-8bit | |
| wandb_log_model: | |
| ## hoe params | |
| gradient_accumulation_steps: 4 # ??? | |
| micro_batch_size: 4 | |
| num_epochs: 1 | |
| lr_scheduler: cosine | |
| learning_rate: 1e-5 | |
| optimizer: adamw_bnb_8bit # Options: "paged_ademamix_8bit", "adamw_bnb_8bit", "paged_adamw_8bit" | |
| deepcompile: true | |
| train_on_inputs: false | |
| group_by_length: false | |
| bf16: auto | |
| fp16: | |
| tf32: false | |
| gradient_checkpointing: offload | |
| early_stopping_patience: | |
| resume_from_checkpoint: | |
| local_rank: | |
| logging_steps: 1 | |
| xformers_attention: | |
| flash_attention: true | |
| s2_attention: | |
| warmup_steps: 50 | |
| saves_per_epoch: 2 | |
| debug: | |
| deepspeed: ./deepspeed_configs/zero2.json | |
| weight_decay: 0.0025 | |
| fsdp: | |
| fsdp_config: | |
| ``` | |
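To put the batch settings in the config above into perspective, this sketch works out how many tokens go into one optimizer step. The per-GPU figures come straight from the config (`sequence_len`, `micro_batch_size`, `gradient_accumulation_steps`); the GPU count is not stated in the card, so `n_gpus` is a hypothetical parameter. Because the config enables sample packing and padding to the full sequence length, the result is best read as an upper bound on real (non-padding) tokens.

```python
SEQUENCE_LEN = 32768             # sequence_len
MICRO_BATCH_SIZE = 4             # micro_batch_size
GRADIENT_ACCUMULATION_STEPS = 4  # gradient_accumulation_steps


def tokens_per_step(n_gpus=1):
    """Upper bound on tokens consumed per optimizer step across `n_gpus` devices."""
    sequences_per_gpu = MICRO_BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS
    return SEQUENCE_LEN * sequences_per_gpu * n_gpus


per_gpu = tokens_per_step(1)     # 32768 * 4 * 4 = 524288
eight_gpus = tokens_per_step(8)  # hypothetical 8-GPU node: 4194304
```

Each GPU therefore contributes up to 524,288 tokens per optimizer step, and the global figure scales linearly with however many GPUs the node actually had.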
| </details> |