Broken chat_template tool calling and thinking

#9
by livepeer-ren - opened

I was having a lot of problems with silent ends of my coding agents. It seems to do with chat_template as per https://www.reddit.com/r/Qwen_AI/comments/1stt081/fixed_jinja_chat_templates_for_qwen_35_and_36/
https://github.com/abysslover/qwen36_tool_calling_failure/tree/main
So I think it is worth to include this fixes into the repo.

This comment has been hidden (marked as Low Quality)

Thanks for the report and for pointing to the Qwen3.6 chat template/tool-calling issue.

I reviewed the shipped chat_template.jinja and confirmed that these Qwen3.6-27B-family NVFP4 repos were still using the upstream Qwen3.6 template, so they could inherit the reported coding-agent/tool-calling and thinking-toggle problems.

I have prepared and rolled out a conservative patched chat_template.jinja for the affected Qwen3.6-27B-family releases. The patch keeps Qwen's existing XML-like tool-call format and preserves the original vision/text rendering, while adding fixes for:

  • leading developer messages by folding them into the system block
  • <|think_on|> / <|think_off|> handling without exposing those control tokens to the model
  • enable_thinking=false behavior
  • historical assistant.tool_calls serialization for both mapping and JSON-string arguments
  • including the tool/function name in tool responses when available

I also tested the template rendering locally for normal chat, VLM/image content, tool calls with mapping arguments, tool calls with JSON-string arguments, and thinking-off cases before applying it.

Thanks again for catching this.

— Tonoken3 / LNA-LAB

Thanks for the report and for pointing to the Qwen3.6 chat template/tool-calling issue.

I reviewed the shipped chat_template.jinja and confirmed that these Qwen3.6-27B-family NVFP4 repos were still using the upstream Qwen3.6 template, so they could inherit the reported coding-agent/tool-calling and thinking-toggle problems.

I have prepared and rolled out a conservative patched chat_template.jinja for the affected Qwen3.6-27B-family releases. The patch keeps Qwen's existing XML-like tool-call format and preserves the original vision/text rendering, while adding fixes for:

  • leading developer messages by folding them into the system block
  • <|think_on|> / <|think_off|> handling without exposing those control tokens to the model
  • enable_thinking=false behavior
  • historical assistant.tool_calls serialization for both mapping and JSON-string arguments
  • including the tool/function name in tool responses when available

I also tested the template rendering locally for normal chat, VLM/image content, tool calls with mapping arguments, tool calls with JSON-string arguments, and thinking-off cases before applying it.

Thanks again for catching this.

— Tonoken3 / LNA-LAB

Thanks for help!! Does it also include fix for this? https://huggingface.co/Qwen/Qwen3.6-27B/discussions/16

Quick follow-up: I also checked the Qwen3.6-35B-A3B-family NVFP4 releases.

The official Qwen/Qwen3.6-35B-A3B chat_template.jinja is byte-identical to the Qwen3.6-27B template, so the same tool-calling / thinking-toggle issues can be inherited there as well. I have now rolled out the same conservative patched template to the affected 35B-A3B-family repos too.

This keeps the upstream vision/text rendering and Qwen XML-like tool-call format, while adding the same fixes for leading developer messages, <|think_on|> / <|think_off|>, enable_thinking=false, historical tool-call argument serialization, and function names in tool responses.

— Tonoken3 / LNA-LAB

does it also apply this? https://huggingface.co/Qwen/Qwen3.6-27B/discussions/20 It is known bug that qwen halucinate block into from time to time and this should be addressed as well.

This comment has been hidden (marked as Low Quality)
This comment has been hidden (marked as Low Quality)
livepeer-ren changed discussion status to closed

Quick follow-up: I also checked the Qwen3.6-35B-A3B-family NVFP4 releases.

The official Qwen/Qwen3.6-35B-A3B chat_template.jinja is byte-identical to the Qwen3.6-27B template, so the same tool-calling / thinking-toggle issues can be inherited there as well. I have now rolled out the same conservative patched template to the affected 35B-A3B-family repos too.

This keeps the upstream vision/text rendering and Qwen XML-like tool-call format, while adding the same fixes for leading developer messages, <|think_on|> / <|think_off|>, enable_thinking=false, historical tool-call argument serialization, and function names in tool responses.

— Tonoken3 / LNA-LAB

i am testing your fixed chat template and so far i see MUCH more robust chats, no silent drops yet!! Thanks

image

livepeer-ren changed discussion status to open

I had issues with lack of prompt caching... 0%. I changed over to this chat template from https://huggingface.co/RedHatAI/Qwen3.6-35B-A3B-NVFP4 and all is good now.
Edit: idk now it is fine prompt caching is at 80% so probably skill issue on my end.
Edit; redhat template is stopping silently as wel..
Now using this fixed template https://github.com/allanchan339/vLLM-Qwen3.5-27B/blob/main/qwen3.5-enhanced.jinja
and --tool-call-parser qwen3_xml . Going for 2 hrs without silent stop on opencode.

Hi @livepeer-ren ,

Thank you for the screenshot and for the Reddit reference. I read through the follow-up post as well.

The distinction around preserve_thinking is especially useful. It sounds like there are two related but different serving paths that should not be mixed accidentally:

  1. The current Qwen3.6 / vLLM nightly path using reasoning_parser=qwen3, tool_call_parser=qwen3_coder, and reasoning separation.

  2. The qwen3.5-enhanced.jinja path, where preserve_thinking=false is mandatory, and qwen3_coder is preferred for Qwen3.6 because it can catch tool calls even when the model leaves a thinking block unclosed.

That is a very helpful clarification. I’ll avoid documenting this as a single universal recipe until I compare both paths more carefully.

The driver / NCCL notes from the Reddit post are also important, especially the warning that NVIDIA Studio Driver 595.79 deadlocks can look like tool-calling failures. I’ll keep that in mind when testing and documenting the recommended setup.

For now, I’ll treat the latest vLLM nightly Docker image plus qwen3_coder as the most promising direction, and I’ll test whether the model repos should keep the conservative template, adopt the enhanced template, or document both variants depending on preserve_thinking.

Thanks again. This is exactly the kind of real agentic workflow report that helps make the model card recommendations practical.

Best,
Tonoken3 / LNA-LAB

I just run another test. With qwen_xml and this worked 180k context agentic task in opencode without any hiccups. chat_template.jinja from https://huggingface.co/froggeric/Qwen-Fixed-Chat-Templates/discussions/2 with this fix to also work with openclaw. Works well finally! latest image: vllm/vllm-openai:nightly is mandatory! also with --preserve thinking true

Quick follow-up: I also checked the Qwen3.6-35B-A3B-family NVFP4 releases.

The official Qwen/Qwen3.6-35B-A3B chat_template.jinja is byte-identical to the Qwen3.6-27B template, so the same tool-calling / thinking-toggle issues can be inherited there as well. I have now rolled out the same conservative patched template to the affected 35B-A3B-family repos too.

This keeps the upstream vision/text rendering and Qwen XML-like tool-call format, while adding the same fixes for leading developer messages, <|think_on|> / <|think_off|>, enable_thinking=false, historical tool-call argument serialization, and function names in tool responses.

— Tonoken3 / LNA-LAB

I can confirm that the base model also has this issue. It's kinda rare - I see it maybe 2 to 6 times per 1000 requests.

Sign up or log in to comment