Really awesome speeds! Running at 256k context.

#11
by mtcl - opened
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="124150054817792" timestamp=1776966243 id_slot=0 id_task=0 p0=0
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="124150054817792" timestamp=1776966252 id_slot=0 id_task=0 p0=8192
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="124150054817792" timestamp=1776966262 id_slot=0 id_task=0 p0=16384
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="124150054817792" timestamp=1776966273 id_slot=0 id_task=0 p0=24576
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="124150054817792" timestamp=1776966285 id_slot=0 id_task=0 p0=32768
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="124150054817792" timestamp=1776966299 id_slot=0 id_task=0 p0=40960
slot print_timing: id  0 | task 0 | 
prompt eval time =   67835.81 ms / 45538 tokens (    1.49 ms per token,   671.30 tokens per second)
       eval time =    8202.57 ms /   139 tokens (   59.01 ms per token,    16.95 tokens per second)
      total time =   76038.38 ms / 45677 tokens

600-700 tk/sec prompt processing and ~17 tk/sec token generation, on 1x 6000 Pro.
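For anyone who wants to check the numbers, here is a minimal sketch that recomputes the throughput from the log above. The (timestamp, p0) pairs and the timing totals are copied from the log lines; the function names are made up for illustration. It assumes each "kv cache rm" line is logged when a batch starts, so the gap between two consecutive timestamps approximates the time spent on one 8192-token batch. The timestamps only have 1-second resolution, so the per-batch rates are rough.

```python
# (timestamp in seconds, p0) pairs copied from the "kv cache rm" log lines.
BATCHES = [
    (1776966243, 0),
    (1776966252, 8192),
    (1776966262, 16384),
    (1776966273, 24576),
    (1776966285, 32768),
    (1776966299, 40960),
]


def tokens_per_second(tokens, elapsed_ms):
    """Throughput from a token count and an elapsed time in milliseconds."""
    return tokens / (elapsed_ms / 1000.0)


def per_batch_rates(batches):
    """Approximate prompt-processing rate for each batch, keyed by its start offset.

    Assumes each log line marks the start of a batch, so the next line's
    timestamp marks (roughly) the end of it.
    """
    rates = []
    for (t0, p0), (t1, p1) in zip(batches, batches[1:]):
        rates.append((p0, (p1 - p0) / (t1 - t0)))
    return rates


if __name__ == "__main__":
    # Totals copied from the print_timing lines.
    print("prompt eval tk/s:", round(tokens_per_second(45538, 67835.81), 2))
    print("generation tk/s:", round(tokens_per_second(139, 8202.57), 2))
    for p0, rate in per_batch_rates(BATCHES):
        print(f"batch starting at p0={p0}: ~{rate:.0f} tk/s")
```

The per-batch rates fall as p0 grows, which is consistent with the 600-700 tk/sec range quoted for the whole prompt.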

CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES="2" ./build/bin/llama-server \
  --model /media/mukul/data/models/ubergarm/Kimi-K2.6-GGUF/smol-IQ2_KL/Kimi-K2.6-smol-IQ2_KL-00001-of-00009.gguf \
  --chat-template-file /media/mukul/data/models/ubergarm/Kimi-K2.6-GGUF/chat-template-kimi-k-2.6.jinja \
  --alias ubergarm/Kimi-K2.6 \
  --ctx-size 262144 \
  -ctk q8_0 \
  -amb 512 \
  -mla 3 \
  -muge \
  --merge-qkv \
  -b 8192 -ub 8192 \
  -ot "blk\.([0-9]|1[0-1])\.ffn_.*=CUDA0" \
  -ot exps=CPU \
  -ngl 99 \
  --warmup-batch \
  --no-mmap \
  --jinja \
  --parallel 1 \
  --threads 56 \
  --threads-batch 56 \
  --host 0.0.0.0 \
  --port 10002
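A quick way to see which layers the first -ot pattern picks up is to run the regex against some tensor names. This is only a sketch: the tensor name format `blk.N.ffn_up_exps.weight` is a hypothetical example, and it assumes the pattern is applied with search semantics (a match anywhere in the name), which I have not verified against the server code.

```python
import re

# Pattern copied from the -ot argument (the part before "=CUDA0").
PATTERN = re.compile(r"blk\.([0-9]|1[0-1])\.ffn_.*")


def matched_layers(n_layers):
    """Return the layer indices whose (hypothetical) FFN tensor name matches."""
    hits = []
    for i in range(n_layers):
        name = f"blk.{i}.ffn_up_exps.weight"  # hypothetical tensor name
        if PATTERN.search(name):
            hits.append(i)
    return hits


if __name__ == "__main__":
    print(matched_layers(20))
```

Under those assumptions the pattern covers layers 0 through 11 only: `[0-9]` handles the single-digit layers and `1[0-1]` handles 10 and 11, while `blk.12.` and above fall through to the `exps=CPU` rule.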

Here is the chat template that I am using: chat-template-kimi-k-2.6.jinja

{%- set preserve_thinking = true %}
{%- macro render_content(msg) -%}
    {%- set c = msg.get('content') -%}
    {%- if c is string -%}
      {{ c }}
    {%- elif c is not none -%}
      {% for content in c -%}
        {% if content['type'] == 'image' or content['type'] == 'image_url' -%}
          <|media_begin|>image<|media_content|><|media_pad|><|media_end|>
        {% elif content['type'] == 'video' or content['type'] == 'video_url' -%}
          <|kimi_k25_video_placeholder|>
        {% else -%}
          {{ content['text'] }}
        {%- endif -%}
      {%- endfor -%}
    {%- endif -%}
{%- endmacro -%}

{% macro set_roles(message) -%}
  {%- set role_name =  message.get('name') or  message['role'] -%}
  {%- if message['role'] == 'user' -%}
    <|im_user|>{{role_name}}<|im_middle|>
  {%- elif message['role'] == 'assistant' -%}
    <|im_assistant|>{{role_name}}<|im_middle|>
  {%- else -%}
    <|im_system|>{{role_name}}<|im_middle|>
  {%- endif -%}
{%- endmacro -%}


{%- macro render_toolcalls(message) -%}
  <|tool_calls_section_begin|>
  {%- for tool_call in message['tool_calls'] -%}
    {%- set formatted_id = tool_call['id'] -%}
    <|tool_call_begin|>{{ formatted_id }}<|tool_call_argument_begin|>{% if tool_call['function']['arguments'] is string %}{{ tool_call['function']['arguments'] }}{% else %}{{ tool_call['function']['arguments'] | tojson }}{% endif %}<|tool_call_end|>
  {%- endfor -%}
  <|tool_calls_section_end|>
{%- endmacro -%}


{%- set preserve_thinking = preserve_thinking | default(false) -%}
{# Find last non-tool-call assistant message. If preserve_thinking, keep -1 so hist is empty and all msgs use suffix (retain reasoning). #}
{%- set ns = namespace(last_non_tool_call_assistant_msg=-1) -%}
{%- if not preserve_thinking -%}
{%- for idx in range(messages|length-1, -1, -1) -%}
    {%- if messages[idx]['role'] == 'assistant' and not messages[idx].get('tool_calls') -%}
        {%- set ns.last_non_tool_call_assistant_msg = idx -%}
        {%- break -%}
    {%- endif -%}
{%- endfor -%}
{%- endif -%}

{# split all messages into history & suffix, reasoning_content in suffix should be reserved.#}
{%- set hist_msgs = messages[:ns.last_non_tool_call_assistant_msg+1] -%}
{%- set suffix_msgs = messages[ns.last_non_tool_call_assistant_msg+1:] -%}

{%- if tools -%}
  {%- if tools_ts_str -%}
    <|im_system|>tool_declare<|im_middle|>{{ tools_ts_str }}<|im_end|>
  {%- else -%}
    <|im_system|>tool_declare<|im_middle|>{{ tools | tojson(separators=(',', ':')) }}<|im_end|>
  {%- endif -%}
{%- endif -%}

  
{%- for message in hist_msgs -%}
  {{set_roles(message)}}
  {%- if message['role'] == 'assistant' -%}
    <think></think>{{render_content(message)}}
    {%- if message.get('tool_calls') -%}
      {{render_toolcalls(message)}}
    {%- endif -%}
  {%- elif message['role'] == 'tool' -%}
    {%- set tool_call_id = message.tool_call_id -%}
    ## Return of {{ tool_call_id }}
{{render_content(message)}}
  {%- elif message['content'] is not none -%}
    {{render_content(message)}}
  {%- endif -%}
  <|im_end|>
{%- endfor -%}

{%- for message in suffix_msgs -%}
  {{set_roles(message)}}
  {%- if message['role'] == 'assistant' -%}
    {%- if thinking is defined and thinking is false and preserve_thinking is false -%}
    <think></think>{{render_content(message)}}
    {%- else -%}
    {%- set rc = message.get('reasoning', message.get('reasoning_content', '')) -%}
    <think>{{rc}}</think>{{render_content(message)}}
    {%- endif -%}
    {%- if message.get('tool_calls') -%}
     {{render_toolcalls(message)}}
    {%- endif -%}
  {%- elif message['role'] == 'tool' -%}
    {%- set tool_call_id = message.tool_call_id -%}
    ## Return of {{ tool_call_id }}
{{render_content(message)}}
  {%- elif message['content'] is not none -%}
    {{render_content(message)}}
  {%- endif -%}
  <|im_end|>
{%- endfor -%}


{%- if add_generation_prompt -%}
  <|im_assistant|>assistant<|im_middle|>
  {%- if thinking is defined and thinking is false -%}
  <think></think>
  {%- endif -%}
{%- endif -%}
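In case the history/suffix split in the template is hard to follow, here is a rough plain-Python sketch of the same logic. The function name `split_messages` is made up for illustration; the behaviour is meant to mirror the `last_non_tool_call_assistant_msg` loop and the two slices in the template, not to replace it.

```python
def split_messages(messages, preserve_thinking):
    """Split messages into (history, suffix) the way the template does.

    History messages are rendered with an empty <think></think>; suffix
    messages keep their reasoning. With preserve_thinking the index stays
    at -1, so history is empty and every message lands in the suffix.
    """
    last = -1
    if not preserve_thinking:
        # Walk backwards to the last assistant message without tool calls.
        for idx in range(len(messages) - 1, -1, -1):
            msg = messages[idx]
            if msg["role"] == "assistant" and not msg.get("tool_calls"):
                last = idx
                break
    return messages[: last + 1], messages[last + 1 :]


if __name__ == "__main__":
    chat = [
        {"role": "user", "content": "hi"},
        {"role": "assistant", "content": "hello", "reasoning_content": "greet"},
        {"role": "user", "content": "list files"},
        {"role": "assistant", "content": "", "tool_calls": [{"id": "call_0"}]},
        {"role": "tool", "content": "a.txt", "tool_call_id": "call_0"},
    ]
    hist, suffix = split_messages(chat, preserve_thinking=False)
    print(len(hist), len(suffix))
    hist, suffix = split_messages(chat, preserve_thinking=True)
    print(len(hist), len(suffix))
```

Since the template sets `preserve_thinking = true` on its first line, the second case is the one that applies here: all assistant turns keep their reasoning in the prompt.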

Thank you @ubergarm for all the tweaks that you mentioned. Hopefully it helps someone!

Very nice! Thanks for the results.

Zero pressure to try, but surprisingly some reports suggested -muge was slowing them down. In theory it should always help as I understand it, but it might be worth a run without it if you're still tweaking.

Finally, given this is an MLA-style model that already compresses attention into latent space, and you're running the kv-cache on GPU, consider leaving -ctk at f16 for best long-context performance. But probably not a huge difference.

Cheers!
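For a rough sense of what the -ctk choice costs in memory, here is a back-of-the-envelope sketch. The context size comes from the command above, and the q8_0 layout (32 int8 values plus one 2-byte scale, so 34 bytes per block) is the standard GGML one. Everything else is an assumption: the layer count (61) and latent width (512 + 64 = 576) are DeepSeek-V3-style values that nobody in this thread has confirmed for this model, and the sketch assumes the cache holds one compressed latent vector per token per layer.

```python
def mla_kv_cache_bytes(n_ctx, n_layer, latent_dim, block_size, bytes_per_block):
    """Approximate MLA cache size for a given cache element type.

    Assumes one latent vector of latent_dim elements per token per layer,
    stored in blocks of block_size elements that take bytes_per_block bytes.
    """
    if latent_dim % block_size != 0:
        raise ValueError("latent_dim must be a multiple of block_size")
    per_token_per_layer = latent_dim // block_size * bytes_per_block
    return n_ctx * n_layer * per_token_per_layer


N_CTX = 262144     # from --ctx-size in the command
N_LAYER = 61       # ASSUMPTION: not stated in this thread
LATENT_DIM = 576   # ASSUMPTION: 512 latent + 64 rope dims, not stated in this thread

f16 = mla_kv_cache_bytes(N_CTX, N_LAYER, LATENT_DIM, block_size=1, bytes_per_block=2)
q8_0 = mla_kv_cache_bytes(N_CTX, N_LAYER, LATENT_DIM, block_size=32, bytes_per_block=34)

if __name__ == "__main__":
    print(f"f16 : {f16 / 2**30:.2f} GiB")
    print(f"q8_0: {q8_0 / 2**30:.2f} GiB")
    print(f"q8_0 / f16 = {q8_0 / f16:.3f}")
```

Whatever the real dimensions are, the ratio between the two only depends on the element types: q8_0 takes 34/64 of the f16 size, so f16 costs a bit under twice the cache memory.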

I do have -muge in my command already :D

@mtcl

I do have -muge in my command already :D

Yes, I saw. Just curious whether removing it would make your setup faster or not. Just an experiment, no worries!

Oh man, DSV4 is out, but I think we need some more work in llama.cpp first to support the fancy attention and convert it.

Ah I see it now. I'll try to run that experiment later today :)

I know right!!! So pumped about the flash version! That might fit in my 2x6000 pros!
