Can thinking be toggled per response in a multi-turn conversation?

#2
by lostmsu - opened

The model card says the presence of a specific token in the system prompt triggers thinking.

But that implies I can't disable thinking temporarily. Sometimes I want an instant response.

Hey, you can pass {"enable_thinking": false} via chat-template-kwargs to achieve this. For example, with llama.cpp's llama-server:

llama-server [...] --chat-template-kwargs '{"enable_thinking": false}'
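The flag above applies to the whole server process. For toggling per response, a rough sketch is to send the same setting in each request body instead. This assumes the server accepts a chat_template_kwargs field on its OpenAI-compatible /v1/chat/completions endpoint (llama-server documents this), that it listens on http://localhost:8080, and that this model's chat template honors enable_thinking. The helper names are made up for illustration.

```python
import json
import urllib.request


def build_chat_request(messages, enable_thinking):
    """Build a chat completion body carrying a per-request thinking flag."""
    return {
        "messages": list(messages),
        # Assumption: the server forwards this object to the chat template,
        # and the template reads enable_thinking.
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }


def send_chat_request(body, base_url="http://localhost:8080"):
    """POST the body to an OpenAI-compatible endpoint (assumed URL) and return parsed JSON."""
    req = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With a running server, send_chat_request(build_chat_request(history, False)) would ask for an instant reply on one turn, and passing True on a later turn would turn thinking back on, since the flag travels with each request rather than with the server.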

There's a YouTube video by Fahd Mirza, "Gemma 4 Was Broken for Agents - Google Just Fixed It", saying that this model's chat template has a problem affecting agentic use and that the fix is to run the model with a fixed chat template. Fahd doesn't clearly explain how to do this, though the video does discuss the fix. Do I have to extract the chat template somehow, or could I just override the preserve_thinking setting in a similar way to what jaeb does above?
