fix chat template to avoid empty historical `<think>` blocks

#8

This fixes a chat template issue where historical assistant turns are rendered with an empty <think>...</think> block whenever reasoning_content is empty.

That matters because these empty historical <think> blocks change the serialized prompt without adding any useful information.

The fix is a one-line change in the template:

from:

{%- if loop.index0 > ns.last_query_index %}

to:

{%- if loop.index0 > ns.last_query_index and reasoning_content %}
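To make the effect of the extra condition concrete, here is a minimal Python sketch of the branch above. It is not the actual template: the function name `render_assistant_turn`, the `guard` flag, and the exact newline handling are assumptions made for illustration, and the role header is reduced to a plain `assistant` line.

```python
def render_assistant_turn(index, last_query_index, content,
                          reasoning_content, guard=True):
    """Render one assistant turn roughly the way the template branch does.

    guard=False mimics the old condition (index check only);
    guard=True mimics the new condition (index check AND non-empty reasoning).
    All names here are hypothetical, chosen only for this sketch.
    """
    reasoning = reasoning_content or ""  # treat None like an empty string
    wrap = index > last_query_index
    if guard:
        wrap = wrap and bool(reasoning)
    if wrap:
        body = "<think>\n" + reasoning.strip("\n") + "\n</think>\n\n" + content.lstrip("\n")
    else:
        body = content
    return "assistant\n" + body


# Same turn, no reasoning content, rendered under both conditions.
old_render = render_assistant_turn(2, 0, "<tool_call>...", "", guard=False)
new_render = render_assistant_turn(2, 0, "<tool_call>...", "", guard=True)

# A turn that does carry reasoning is rendered identically either way.
kept_old = render_assistant_turn(2, 0, "<tool_call>...", "plan the call", guard=False)
kept_new = render_assistant_turn(2, 0, "<tool_call>...", "plan the call", guard=True)
```

With the old condition the empty wrapper is emitted; with the guard the turn collapses to the content alone, while turns that have real reasoning are unchanged.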

Why this is important:

  • it reduces unnecessary prompt drift
  • it improves prefix-cache reuse and avoids needless cache misses
  • it reduces extra token processing caused by equivalent histories rendering differently

In practice, this means less wasted compute and better cache stability, especially in longer multi-turn or tool-using conversations.
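The caching argument can be sketched in a few lines of Python. This is only an illustration under stated assumptions: characters stand in for tokens, the strings are made up, and the live turn is assumed to have been generated without a <think> block, so the cache holds the prompt followed by the bare tool call.

```python
def common_prefix_len(a, b):
    """Length of the shared prefix of two sequences (characters stand in for tokens)."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Hypothetical first request: prompt plus the turn as the model actually produced it.
prompt = "user\nWhat is the weather?\n"
cached = prompt + "assistant\n<tool_call>..."

# Follow-up request: the same history is re-rendered, then new turns are appended.
followup = "\ntool\n{...}\nassistant\n"
rerender_old = prompt + "assistant\n<think>\n\n</think>\n\n<tool_call>..." + followup
rerender_new = prompt + "assistant\n<tool_call>..." + followup

reuse_old = common_prefix_len(cached, rerender_old)  # diverges inside the assistant turn
reuse_new = common_prefix_len(cached, rerender_new)  # covers the whole cached sequence
```

In this toy setup the guarded rendering reuses the entire cached sequence, while the unguarded one diverges as soon as the injected wrapper starts, so everything after that point would have to be processed again.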

The change is intentionally minimal:

  • keep the historical <think> wrapper when reasoning_content is actually present
  • do not emit an empty <think> block when there is no reasoning content

Without this guard, the template can produce prior turns like:

assistant
<think>

</think>

<tool_call>...

instead of rendering just the assistant content or tool call directly.

So this change preserves real reasoning content while avoiding empty reasoning scaffolding that can hurt caching behavior.

Edit: made a video explaining the bug
https://www.youtube.com/watch?v=3g70-ToSgr0

I think the template should be the same as the schema used in training.

@siberiamark this change isn’t altering the live generation format the model relies on. It only avoids re-injecting empty historical <think></think> wrappers on later turns when there is no reasoning content there.

In practice, this was causing prompt drift and unnecessary cache invalidation across follow-up requests, while the model was already completing the original turns correctly.

more context here as well:
https://www.reddit.com/r/LocalLLaMA/

latent-variable changed pull request title from fix chat template to avoid empty historical `<think>` blocks to fix historical assistant turn rendering in chat_template.jinja

Small update after more testing: I tried the stricter version that removes historical <think> blocks entirely, but I think that one is too aggressive.

It seems better for cache reuse, but it may affect reasoning behavior / separation in some cases.

So I’m reverting these PRs to the safer minimal fix:

{%- if loop.index0 > ns.last_query_index and reasoning_content %}

That still fixes the empty historical wrapper issue without changing historical turns as aggressively.

latent-variable changed pull request title from fix historical assistant turn rendering in chat_template.jinja to fix chat template to avoid empty historical `<think>` blocks