---
base_model:
- moonshotai/Kimi-K2.6
tags:
- kimi
- fp8
- vllm
- compressed-tensors
name: RedHatAI/Kimi-K2.6-FP8-BLOCK
---

# FP8 Block-Quantized RedHatAI/Kimi-K2.6-FP8-BLOCK

This is a preliminary version (subject to change) of the FP8 block-quantized [moonshotai/Kimi-K2.6](https://huggingface.co/Kimi/Kimi-K2.6) model, compatible with [DeepGEMM](https://github.com/deepseek-ai/DeepGEMM) FP8 kernels (which must be installed separately). Both weights and activations are quantized to FP8 with [vllm-project/llm-compressor](https://github.com/vllm-project/llm-compressor).

The model is compatible with and tested against vLLM v0.20.0. Deploy it via `vllm serve` using the recipes at https://recipes.vllm.ai/moonshotai/Kimi-K2.6.

# Creation Script:

Kimi K2.6 support will land in https://github.com/vllm-project/llm-compressor/pull/2662. The script used to create the checkpoint is shown below:

```python
from compressed_tensors.entrypoints.convert import CompressedTensorsDequantizer
from llmcompressor import model_free_ptq

# moonshotai/Kimi-K2.6 checkpoint is published in compressed-tensors format.
# This script will upconvert to bfloat16 so that the model can be compressed
# to FP8_BLOCK

MODEL_ID = "moonshotai/Kimi-K2.6"
SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + "-FP8-BLOCK"

ignore = [
    "re:.*mlp.gate$",
    "re:.*lm_head",
    "re:.*kv_a_proj_with_mqa$",
    "re:.*q_a_proj$",
    "re:.*vision_tower.*",
    "re:.*embed_tokens$",
    "re:.*norm$",
    # ignore anything not in language_model
    "re:.*mm_projector.*",
    "re:.*vision.*",
]

model_free_ptq(
    model_stub=MODEL_ID,
    save_directory=SAVE_DIR,
    scheme="FP8_BLOCK",
    ignore=ignore,
    converter=CompressedTensorsDequantizer(
        MODEL_ID,
        quant_config_key="text_config.quantization_config",
        ignore=ignore,
    ),
    max_workers=2,
    device="cuda:0",
)
```

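To get a feel for what the `ignore` list above keeps out of FP8 quantization, the patterns can be tried against a few module names. This is only a minimal sketch: the module names are hypothetical examples (not read from the checkpoint), and it assumes that a `re:` prefix marks a regular expression matched from the start of the name, approximated here with Python's `re.match`. The library's actual matching may differ in detail.

```python
import re

# Same patterns as the `ignore` list in the creation script.
IGNORE = [
    "re:.*mlp.gate$",
    "re:.*lm_head",
    "re:.*kv_a_proj_with_mqa$",
    "re:.*q_a_proj$",
    "re:.*vision_tower.*",
    "re:.*embed_tokens$",
    "re:.*norm$",
    "re:.*mm_projector.*",
    "re:.*vision.*",
]


def is_ignored(name: str, patterns=IGNORE) -> bool:
    """Return True if `name` matches any ignore pattern.

    Assumption: "re:" marks a regex matched from the start of the name;
    any other pattern is compared as an exact string.
    """
    for pattern in patterns:
        if pattern.startswith("re:"):
            if re.match(pattern[3:], name):
                return True
        elif pattern == name:
            return True
    return False


# Hypothetical module names, for illustration only.
examples = [
    "language_model.model.layers.3.mlp.gate",
    "language_model.model.layers.3.mlp.experts.0.gate_proj",
    "language_model.model.layers.3.self_attn.q_a_proj",
    "language_model.model.layers.3.self_attn.q_b_proj",
    "language_model.lm_head",
    "vision_tower.encoder.blocks.0.attn.qkv",
]

for name in examples:
    status = "skipped" if is_ignored(name) else "quantized"
    print(f"{name}: {status}")
```

Under these assumptions, the MoE router (`mlp.gate`) is skipped while an expert's `gate_proj` is still quantized, because the first pattern is anchored with `$` at the end of the name.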
# Preliminary Evaluations

1) GSM8K Platinum:

```
lm_eval --model local-chat-completions \
  --tasks gsm8k_platinum_cot_llama \
  --model_args "model=RedHatAI/Kimi-K2.6-FP8-BLOCK,max_length=262144,base_url=http://0.0.0.0:8000/v1/chat/completions,num_concurrent=32,max_retries=3,tokenized_requests=False,tokenizer_backend=None,timeout=1200" \
  --num_fewshot 0 \
  --apply_chat_template \
  --gen_kwargs "do_sample=True,temperature=1.0,top_p=0.95,top_k=20,min_p=0.0,max_gen_toks=64000,presence_penalty=1.5,repetition_penalty=1.0,seed=5678"
```

Recovery:

| | moonshotai/Kimi-K2.6 (original in W4A16) | RedHatAI/Kimi-K2.6-FP8-BLOCK (this model) |
| -------- | :--------------------: | :------------------------------------: |
| Accuracy | 94.29 | 93.55 |
| Recovery | \- | 99.2% |

**Note**: More rigorous evaluations are currently in progress and will be available soon.
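
The recovery figure in the table is consistent with a simple ratio of the two accuracies. The sketch below assumes that definition (quantized accuracy divided by baseline accuracy, as a percentage), which the card does not state explicitly. The function name `recovery_percent` is introduced here for illustration only.

```python
def recovery_percent(quantized: float, baseline: float) -> float:
    """Quantized score as a percentage of the baseline score (assumed definition)."""
    return 100.0 * quantized / baseline


baseline_acc = 94.29   # moonshotai/Kimi-K2.6 (original in W4A16)
fp8_block_acc = 93.55  # RedHatAI/Kimi-K2.6-FP8-BLOCK (this model)

recovery = recovery_percent(fp8_block_acc, baseline_acc)
print(f"Recovery: {recovery:.1f}%")  # prints "Recovery: 99.2%"
```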