Tokenizer mismatch

#1
by enriquezaf - opened

Hi,

The chat template is missing three \n.

Was the finetune done with the wrong template?

Also, why not trim whitespace?

Grupo de investigación en Sistemas Inteligentes de Acceso a la Información (SINAI) de la Universidad de Jaén org

Thanks for the feedback! The model was fine-tuned using exactly this chat template, so it is internally consistent: the template reflects the actual format used during training. Using a different template at inference time (e.g. one with extra \n) may lead to slightly inconsistent behavior at turn boundaries. Regarding whitespace, "add_prefix_space: true" is inherited from the LLaMA SentencePiece tokenizer; it is a tokenizer-level setting and does not directly affect the chat template output. Have you actually observed leading spaces in the decoded outputs?

Thanks for answering my question.

Good to know that chat_template.jinja is correct and that the template in tokenizer_config.json is the one that is wrong:

"chat_template": "{{- bos_token }}{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}",

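For anyone following along, here is a minimal pure-Python sketch of what the template string quoted above renders to. The function name `render_chatml` and the `<s>` BOS value are assumptions for illustration only; the real BOS token comes from the tokenizer. Note that message content is inserted verbatim, so any whitespace in it reaches the model untouched.

```python
def render_chatml(messages, bos_token="<s>", add_generation_prompt=False):
    """Mirror the quoted tokenizer_config.json template in plain Python.

    bos_token="<s>" is an assumed placeholder, not the model's confirmed value.
    """
    out = bos_token
    for message in messages:
        # One "\n" after the role and one after <|im_end|>, as in the template.
        out += "<|im_start|>" + message["role"] + "\n" + message["content"] + "<|im_end|>" + "\n"
    if add_generation_prompt:
        out += "<|im_start|>assistant\n"
    return out


messages = [{"role": "user", "content": "¿Cuál es la constitución española?"}]
prompt = render_chatml(messages, add_generation_prompt=True)
print(repr(prompt))  # repr makes every "\n" visible
```

Printing with `repr` is a deliberate choice here: it makes it easy to count newlines and compare them against what chat_template.jinja produces.
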
By 'trim whitespace' I didn't mean the prefix space; sorry for the unclear explanation. I was referring to the chat template.

Regarding leading spaces in the output, yes, that's what prompted me to check the tokenizer.

[Image: example_01]

You can reproduce it by running a few generations with this prompt:

¿Cuál es la constitución española?

I checked the datasets and found some examples with leading spaces, but I'm not sure there are enough of them to bias the model like that.
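
A rough way to quantify this, for either decoded generations or dataset responses, is to measure how often a string starts with whitespace. This is only a sketch: `leading_whitespace_rate` is a hypothetical helper, and the sample strings below are made up for illustration, not real model outputs.

```python
def leading_whitespace_rate(texts):
    """Return the fraction of non-empty strings that start with whitespace."""
    non_empty = [t for t in texts if t]
    if not non_empty:
        return 0.0
    hits = sum(1 for t in non_empty if t[0].isspace())
    return hits / len(non_empty)


# Made-up stand-ins for decoded generations (illustration only).
samples = [" La Constitución...", "La Constitución...", "\nLa Constitución..."]
rate = leading_whitespace_rate(samples)
print(f"{rate:.2%} of samples start with whitespace")
```

Running the same function over the training responses and over a batch of generations would give two numbers that can be compared directly.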

Grupo de investigación en Sistemas Inteligentes de Acceso a la Información (SINAI) de la Universidad de Jaén org

Hi again! Thanks for the clarification about whitespace trimming.

We have now updated the chat_template in tokenizer_config.json to match the chat_template.jinja used during fine-tuning. This should fix the leading spaces in the output.

Regarding whitespace trimming in the template, the chat_template.jinja already uses {%- and -%} tags where appropriate to avoid unwanted whitespace. Now that both templates are in sync, this should be consistent.
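
To double-check that the two templates stay in sync, something like the following could be used. It is a sketch using only the standard library; `template_diff` is a hypothetical helper, and the exact-match comparison is an assumption (a trailing newline at the end of the .jinja file, for example, would show up as a difference).

```python
import difflib
import json


def template_diff(config_json_text, jinja_text):
    """Diff the chat_template in tokenizer_config.json against a .jinja file.

    Returns an empty list when the two template strings are identical.
    """
    config_template = json.loads(config_json_text).get("chat_template", "")
    if config_template == jinja_text:
        return []
    return list(
        difflib.unified_diff(
            config_template.splitlines(keepends=True),
            jinja_text.splitlines(keepends=True),
            fromfile="tokenizer_config.json:chat_template",
            tofile="chat_template.jinja",
        )
    )


# Toy stand-ins; in practice, read both files from the model repo, e.g.
# pathlib.Path("tokenizer_config.json").read_text(encoding="utf-8").
same = template_diff(json.dumps({"chat_template": "{{ 'a' + '\n' }}"}), "{{ 'a' + '\n' }}")
different = template_diff(json.dumps({"chat_template": "{{ 'a' }}"}), "{{ 'a' + '\n' }}")
print(same == [], len(different) > 0)
```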

Thanks again for taking the time to investigate and report this so thoroughly!

scarrasc changed discussion status to closed
