Good work
I built the same quant this morning in much the same way with what looks like the same tooling. Going to run some benchmarks on it after I'm sure it's configured right.
Mine seems to lose coherence after a few turns , sometimes fails to answer, but it could be the chat template or something. How is yours performing ?
Morning! Nice to see another local NVFP4 of this one π
Ours is holding up well. I just ran a 5-turn, context-dependent chat through it (vLLM --quantization modelopt, Liquid's sampling: temperature=0.2 / top_k=80 / repetition_penalty=1.05, prompts built with apply_chat_template): it stayed coherent and on-context every turn β at turn 4 it correctly flipped its recommendation when I changed the priority, and summarized both sides at turn 5. No dropped/empty answers, ~2β5 s/turn on one 16 GB Blackwell. So I don't think the quant itself is the issue.
"Loses coherence after a few turns / sometimes fails to answer" really points at chat-template / reasoning handling. Things worth checking:
- It's a reasoning model (
<think>β¦</think>then the answer). Don't let prior turns'<think>accumulate in the history. LiquidAI's template strips thinking from earlier turns when you build the prompt withapply_chat_template(or feed the full assistant content back so it can strip it) β if your client hand-concatenates or keeps the raw think blocks, context rots fast. In vLLM, serve with--reasoning-parser deepseek_r1and keep onlycontent(notreasoning_content) in history. - "Fails to answer" is usually max_tokens β it thinks first, so too small a budget means it never exits
<think>. Give it generousmax_tokens(I used 1024). We hit exactly this with a 512 cap. - Sampling:
temperature=0.2, top_k=80, repetition_penalty=1.05. Defaults (temp ~1, no rep penalty) can degenerate. - Confirm
--quantization modeloptand that yourchat_template.jinjais byte-identical to LiquidAI's.
A couple of quant-build gotchas in case they bit you: transformers' batched experts export as experts.{i}.gate_proj/up_proj/down_proj, but vLLM's lfm2_moe expects w1/w3/w2 (mismatch = KeyError on load); and modelopt calibration on this arch can hit a RoPE NaN at masked positions β we patched the max-calibrator to drop non-finite values before reduce_amax.
What's your serving stack, sampling params, and how are you assembling the multi-turn prompt? Happy to compare configs.