Helcyon-Grok 12b v4 Q4_K_M unending output?

#2
by LucreKresnik - opened

Heya!

For whatever reason, no matter which frontend I use, the listed model provides endless output.

If I give it any sort of message, it provides an unending response, including the ChatML markup for its own turn, then a user turn, and then a reply to that user turn.

Any ideas what's happening?

XeyonAI org
edited 19 days ago

Hey! So I've looked into this and found the issue... it's a mismatch in the GGUF metadata between the chat template and the tokenizer.

The chat template is set to ChatML (which uses <|im_end|> as the stop token), but the tokenizer's actual EOS token is </s> (the default Mistral one), and <|im_end|> doesn't exist in the vocabulary as a special token. So the model has no way to signal when to stop generating... it just keeps going.
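To make the mismatch concrete, here's a rough sketch of the check in Python. It assumes the GGUF metadata has already been read into a plain dict (reading the file is left out), and that the EOS token is </s>, the usual Mistral one. The key names are the standard GGUF tokenizer keys, but the values in toy_metadata are made up for illustration and are not taken from the actual Helcyon file. check_stop_marker is a hypothetical helper, not part of any library.

```python
def check_stop_marker(metadata: dict, stop_marker: str) -> dict:
    """Compare the chat template's stop marker with the tokenizer's EOS token."""
    tokens = metadata["tokenizer.ggml.tokens"]
    eos_token = tokens[metadata["tokenizer.ggml.eos_token_id"]]
    return {
        "eos_token": eos_token,
        # Does the chat template end turns with this marker?
        "marker_in_template": stop_marker in metadata["tokenizer.chat_template"],
        # Is the marker a single vocabulary entry the model could emit as one token?
        "marker_in_vocab": stop_marker in tokens,
        # Generation only stops by itself when the model emits the EOS token.
        "marker_is_eos": eos_token == stop_marker,
    }


# Illustrative values only (assumption): a tiny fake vocab and a simplified ChatML template.
toy_metadata = {
    "tokenizer.chat_template": (
        "{% for m in messages %}<|im_start|>{{ m.role }}\n"
        "{{ m.content }}<|im_end|>\n{% endfor %}"
    ),
    "tokenizer.ggml.tokens": ["<unk>", "<s>", "</s>", "Hello"],
    "tokenizer.ggml.eos_token_id": 2,
}

report = check_stop_marker(toy_metadata, "<|im_end|>")
print(report)
```

With the toy values, the template uses <|im_end|> but the EOS token is </s> and the marker isn't in the vocab at all, which is the same shape of problem described above.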

For now, the quickest fix on your end is to add <|im_end|> as a stop string / stopping sequence in your frontend's generation settings. That should sort it out immediately regardless of which frontend you're using.
I'll be fixing this properly in the GGUF itself so it stops cleanly out of the box. Appreciate you flagging it!
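If you're calling a backend from your own script rather than through a frontend, the same workaround looks roughly like this. It's a minimal sketch: the "stop" and "max_tokens" field names follow the common OpenAI-style completions convention and are an assumption here, so check what your backend actually calls them. truncate_at_stop and build_request are hypothetical helpers that show what a stop string does, and also work as a client-side fallback if the backend ignores the setting.

```python
from typing import Sequence

STOP_STRINGS = ["<|im_end|>"]


def truncate_at_stop(text: str, stop_strings: Sequence[str]) -> str:
    """Cut the text at the earliest occurrence of any stop string."""
    cut = len(text)
    for stop in stop_strings:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]


def build_request(prompt: str, stop_strings: Sequence[str], max_tokens: int = 512) -> dict:
    """Build a completion payload (field names are assumed, OpenAI-style)."""
    return {"prompt": prompt, "stop": list(stop_strings), "max_tokens": max_tokens}


# What a runaway generation looks like: the model writes the end marker as
# plain text and then carries on with a made-up user turn.
raw_output = "Hi there!<|im_end|>\n<|im_start|>user\nAnd then?<|im_end|>\n"

payload = build_request("<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\n", STOP_STRINGS)
clean = truncate_at_stop(raw_output, STOP_STRINGS)
print(clean)
```

Setting max_tokens as well is a sensible safety net, since it caps the output even if the stop string is never produced.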

XeyonAI org

Just to follow up: I've done a deep dive into this and it's a known limitation with Mistral Nemo's Tekken tokenizer. <|im_end|> can't be registered as a real special token without breaking the model, so the GGUF's EOS metadata stays as </s>, which the model never actually emits under ChatML.

So yes, the fix is what I mentioned before... add <|im_end|> as a stop string in your frontend. That's how all Mistral-based ChatML finetunes handle it.

Good to hear! I managed to find that in the frontend I was using, so it's not an issue anymore, thankfully.

Thanks for the rapid support!
