Instructions to use allura-org/Koto-Small-7B-PT with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use allura-org/Koto-Small-7B-PT with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="allura-org/Koto-Small-7B-PT")
# This is a base model, so pass a raw text prompt rather than chat messages
prompt = "The lighthouse keeper had not spoken to anyone in"
pipe(prompt, max_new_tokens=40)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("allura-org/Koto-Small-7B-PT")
model = AutoModelForCausalLM.from_pretrained("allura-org/Koto-Small-7B-PT")
# This is a base model, so tokenize a raw text prompt rather than applying a chat template
prompt = "The lighthouse keeper had not spoken to anyone in"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
# Decode only the newly generated tokens
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))

Local Apps

vLLM

How to use allura-org/Koto-Small-7B-PT with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "allura-org/Koto-Small-7B-PT"
# Call the server using curl (OpenAI-compatible completions API; raw prompt, since this is a base model):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "allura-org/Koto-Small-7B-PT",
		"prompt": "The lighthouse keeper had not spoken to anyone in",
		"max_tokens": 40
	}'

Use Docker

docker model run hf.co/allura-org/Koto-Small-7B-PT

SGLang

How to use allura-org/Koto-Small-7B-PT with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "allura-org/Koto-Small-7B-PT" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible completions API; raw prompt, since this is a base model):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "allura-org/Koto-Small-7B-PT",
		"prompt": "The lighthouse keeper had not spoken to anyone in",
		"max_tokens": 40
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "allura-org/Koto-Small-7B-PT" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible completions API; raw prompt, since this is a base model):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "allura-org/Koto-Small-7B-PT",
		"prompt": "The lighthouse keeper had not spoken to anyone in",
		"max_tokens": 40
	}'

Docker Model Runner
How to use allura-org/Koto-Small-7B-PT with Docker Model Runner:
```
docker model run hf.co/allura-org/Koto-Small-7B-PT
```

Koto-Small-7B-PT / README.md

Fizzarolli

Update README.md

a43bf30 verified 9 months ago


	---
	license: mit
	language:
	- en
	base_model:
	- XiaomiMiMo/MiMo-7B-Base
	library_name: transformers
	tags:
	- writing
	- creative-writing
	---

	# Koto Small 7B (Pretrained)

	![482629.png](https://cdn-uploads.huggingface.co/production/uploads/634262af8d8089ebaefd410e/9Bnn2AnIjfQFWBGkhDNmI.png)

	Koto-Small-7B-PT is a version of MiMo-7B-Base trained on almost a billion tokens of creative writing data.

	Please check out [Aurore-Reveil/Koto-Small-7B-IT](https://huggingface.co/Aurore-Reveil/Koto-Small-7B-IT), the official RP and instruct tune!

	## Usage

	This model is not intended for use outside of raw text completion settings, such as cowriting. Instruct will not work. Multi-turn roleplay will not work.

	It was trained with a 32k-token sequence length, but because not all samples were that long, we expect roughly 16k tokens of effective context in the best case.
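
One way to act on that estimate is to trim long prompts before generation. The following is a minimal sketch, not something this card prescribes: the function name, the 16,384-token budget, and the `reserve_for_output` allowance are all assumptions chosen for illustration.

```python
def trim_to_effective_context(token_ids, max_context=16384, reserve_for_output=512):
    """Keep only the most recent tokens so prompt + output fit in max_context.

    max_context and reserve_for_output are illustrative assumptions,
    not values taken from the model card.
    """
    budget = max_context - reserve_for_output
    if budget <= 0:
        raise ValueError("reserve_for_output must be smaller than max_context")
    # For story continuation the most recent text matters most,
    # so tokens are dropped from the front of the prompt.
    return token_ids[-budget:] if len(token_ids) > budget else token_ids
```

Dropping from the front keeps the text closest to the continuation point, which is usually what a cowriting setup wants.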

	We found that a temperature of 1.25 and a min_p of 0.05 worked best, but YMMV!
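
To make those two settings concrete, here is a small pure-Python sketch of what temperature scaling followed by a min_p cutoff does to a token distribution. It assumes temperature is applied before the min_p filter; inference backends differ in how they order samplers, so treat this as an illustration rather than the behaviour of any particular engine. The function name is hypothetical.

```python
import math

def temperature_min_p(logits, temperature=1.25, min_p=0.05):
    """Return renormalised probabilities after temperature scaling and min_p filtering."""
    # Temperature > 1 flattens the distribution (more varied continuations).
    scaled = [x / temperature for x in logits]
    # Numerically stable softmax.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # min_p keeps tokens whose probability is at least min_p times the top token's.
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    norm = sum(kept)
    return [p / norm for p in kept]

probs = temperature_min_p([5.0, 4.0, 0.0, -5.0])
```

In this toy example the two unlikely tokens fall below 5% of the top token's probability and are removed, while the two plausible tokens stay in play. In recent versions of Transformers, the same settings can be passed as `do_sample=True, temperature=1.25, min_p=0.05` to `generate`.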

	## Datasets

	Some of the data used to train this model includes:
	- Most of [The Anarchist Library](https://theanarchistlibrary.org/), a repository for anarchist manifestos and writing (see [allura-org/the-anarchist-library](https://huggingface.co/datasets/allura-org/the-anarchist-library))
	- A random sample of public domain books from Project Gutenberg
	- Furry (anthro and feral) storytelling and smut
	- A small subset of known high-quality books and story data

	## Acknowledgements
	- thank you to [unk] for drawing the art used in the model card!
	- thank you very much to [mango/deltavector](https://huggingface.co/Delta-Vector) for providing the compute used to train this model
	- thanks to curse for testing, ideas
	- thanks to toasty for some data, ideas
	- thanks to everyone else in allura for moral support

	ilya <3

	## Call for Help
	if you would like to help build on this model (instruct/RP SFT, further annealing on higher quality data, etc)...
	please join [our discord](https://discord.gg/PPBMhF2vgC) or [our matrix](https://matrix.to/#/#allura:allura.moe)! <3

	## Technical Appendix
	<details>

	### Training Notes
	This model was trained over the course of ~18 hours on an A100 node. We used 8-bit AdamW and the Cosine LR scheduler, as well as both gradient clipping and weight decay for regularization.
	Before training, we [converted the original model to the Qwen 2 architecture](https://huggingface.co/allura-forge/MiMo-7B-Base-Qwenified) by removing the MTP weights and custom modelling code, and slightly modifying the `config.json`. This allowed us to use CCE (Cut Cross-Entropy) and Liger, which made training much faster than it would otherwise have been.

	We decided to keep the final model in the converted Qwen 2 format because it is better supported by community software such as EXL2, EXL3, and Aphrodite, and because the original architecture's MTP weights would likely be much less effective after finetuning without them.
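
The conversion described above can be sketched on plain dictionaries. This is not the script we used; it is an illustration under stated assumptions: the `mtp_layers` substring used to spot MTP weights, the `auto_map` key, and the exact config fields changed are guesses, and the function name is hypothetical. Only `Qwen2ForCausalLM` and `qwen2` are standard Transformers identifiers.

```python
import copy

def qwenify(state_dict, config, mtp_marker="mtp_layers"):
    """Drop MTP weights and point the config at the Qwen 2 architecture.

    mtp_marker is an ASSUMED substring identifying MTP weight names.
    """
    # Keep every tensor whose name does not belong to the MTP head.
    kept = {name: w for name, w in state_dict.items() if mtp_marker not in name}

    new_config = copy.deepcopy(config)
    new_config["architectures"] = ["Qwen2ForCausalLM"]
    new_config["model_type"] = "qwen2"
    # ASSUMPTION: custom modelling code is referenced through auto_map.
    new_config.pop("auto_map", None)
    return kept, new_config

# Toy inputs standing in for real checkpoint contents.
toy_weights = {
    "model.layers.0.self_attn.q_proj.weight": "tensor-a",
    "model.mtp_layers.0.input_proj.weight": "tensor-b",
}
toy_config = {"architectures": ["MiMoForCausalLM"], "model_type": "mimo", "auto_map": {}}
weights, config = qwenify(toy_weights, toy_config)
```

The original inputs are left untouched, so the result can be compared against them before anything is written to disk.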

	### [WandB](https://wandb.ai/new-eden/Koto-Small/runs/zk8t6oq6/workspace)
	![image/png](https://cdn-uploads.huggingface.co/production/uploads/634262af8d8089ebaefd410e/Fc-Dvakg3lSwk2co7jHIM.png)

	### Finetuning Notes
	This model has had ChatML tokens already added by Xiaomi. Please use this format when finetuning to ensure compatibility with the rest of the ecosystem.
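
For reference, a minimal sketch of rendering a conversation in ChatML for finetuning data. The helper name and its `add_generation_prompt` flag are hypothetical; in practice the tokenizer's own chat template or your training framework's ChatML option is the safer route.

```python
def to_chatml(messages, add_generation_prompt=False):
    """Render a list of {"role", "content"} dicts as a ChatML string."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    if add_generation_prompt:
        # Open an assistant turn for the model to complete.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

example = to_chatml(
    [
        {"role": "system", "content": "You are a co-writer."},
        {"role": "user", "content": "Continue the scene."},
    ],
    add_generation_prompt=True,
)
```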

	### Axolotl Config
	```yaml
	## model
	base_model: allura-forge/MiMo-7B-Base-Qwenified
	trust_remote_code: true
	## qlora COPE!!!
	load_in_8bit: false
	load_in_4bit: false
	strict: false

	## data
	datasets:
	  - path: estrogen/bookscpt2
	    type: completion
	    field: text

	shuffle_merged_datasets: true
	dataset_prepared_path: dataset_prepareds
	val_set_size: 0.0
	output_dir: ./MiMo-Pretrain

	## Liger + CCE
	plugins:
	- axolotl.integrations.liger.LigerPlugin
	- axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
	liger_rope: true
	liger_rms_norm: true
	liger_layer_norm: true
	liger_glu_activation: true
	liger_fused_linear_cross_entropy: false
	cut_cross_entropy: true

	## CTX settings
	sequence_len: 32768
	sample_packing: true
	eval_sample_packing: false
	pad_to_sequence_len: true

	## max grad norm
	max_grad_norm: 1.0

	## WandB
	wandb_project: Koto-Small
	wandb_entity:
	wandb_watch:
	wandb_name: MiMo-7b_1e-5_adamw-8bit
	wandb_log_model:

	## hoe params
	gradient_accumulation_steps: 4 # ???
	micro_batch_size: 4
	num_epochs: 1
	lr_scheduler: cosine
	learning_rate: 1e-5
	optimizer: adamw_bnb_8bit # Options: "paged_ademamix_8bit", "adamw_bnb_8bit", "paged_adamw_8bit"
	deepcompile: true
	train_on_inputs: false
	group_by_length: false
	bf16: auto
	fp16:
	tf32: false

	gradient_checkpointing: offload
	early_stopping_patience:
	resume_from_checkpoint:
	local_rank:
	logging_steps: 1
	xformers_attention:
	flash_attention: true
	s2_attention:

	warmup_steps: 50
	saves_per_epoch: 2
	debug:
	deepspeed: ./deepspeed_configs/zero2.json
	weight_decay: 0.0025
	fsdp:
	fsdp_config:
	```
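
As a reading aid for the config above, the sketch below works out how many tokens one optimizer step covers and what a linear-warmup plus cosine-decay schedule looks like with `learning_rate: 1e-5` and `warmup_steps: 50`. The GPU count and the total step count are not stated in this card, so `num_gpus` and `total_steps` are placeholder assumptions, and the schedule is the common textbook form rather than a copy of the trainer's implementation.

```python
import math

def tokens_per_optimizer_step(micro_batch_size=4, gradient_accumulation_steps=4,
                              sequence_len=32768, num_gpus=1):
    """Tokens consumed per optimizer step; num_gpus is an assumption."""
    # With sample packing and pad_to_sequence_len, every sequence is full length.
    return micro_batch_size * gradient_accumulation_steps * sequence_len * num_gpus

def cosine_lr(step, total_steps, base_lr=1e-5, warmup_steps=50):
    """Linear warmup to base_lr, then cosine decay to zero at total_steps."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))

per_gpu = tokens_per_optimizer_step()        # 4 * 4 * 32768 = 524288
peak = cosine_lr(50, total_steps=250)        # total_steps=250 is hypothetical
halfway = cosine_lr(150, total_steps=250)
```

Per GPU, each optimizer step therefore covers 524,288 tokens; multiply by the number of GPUs in the node to get the global figure.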

	</details>