Good work

by chris-hc - opened 18 days ago

•

I built the same quant this morning in much the same way with what looks like the same tooling. Going to run some benchmarks on it after I'm sure it's configured right.
Mine seems to lose coherence after a few turns , sometimes fails to answer, but it could be the chat template or something. How is yours performing ?

sakamakismile

Owner 18 days ago

Morning! Nice to see another local NVFP4 of this one 🙂

Ours is holding up well. I just ran a 5-turn, context-dependent chat through it (vLLM --quantization modelopt, Liquid's sampling: temperature=0.2 / top_k=80 / repetition_penalty=1.05, prompts built with apply_chat_template): it stayed coherent and on-context every turn — at turn 4 it correctly flipped its recommendation when I changed the priority, and summarized both sides at turn 5. No dropped/empty answers, ~2–5 s/turn on one 16 GB Blackwell. So I don't think the quant itself is the issue.

"Loses coherence after a few turns / sometimes fails to answer" really points at chat-template / reasoning handling. Things worth checking:

It's a reasoning model (<think>…</think> then the answer). Don't let prior turns' <think> accumulate in the history. LiquidAI's template strips thinking from earlier turns when you build the prompt with apply_chat_template (or feed the full assistant content back so it can strip it) — if your client hand-concatenates or keeps the raw think blocks, context rots fast. In vLLM, serve with --reasoning-parser deepseek_r1 and keep only content (not reasoning_content) in history.
"Fails to answer" is usually max_tokens — it thinks first, so too small a budget means it never exits <think>. Give it generous max_tokens (I used 1024). We hit exactly this with a 512 cap.
Sampling: temperature=0.2, top_k=80, repetition_penalty=1.05. Defaults (temp ~1, no rep penalty) can degenerate.
Confirm --quantization modelopt and that your chat_template.jinja is byte-identical to LiquidAI's.

A couple of quant-build gotchas in case they bit you: transformers' batched experts export as experts.{i}.gate_proj/up_proj/down_proj, but vLLM's lfm2_moe expects w1/w3/w2 (mismatch = KeyError on load); and modelopt calibration on this arch can hit a RoPE NaN at masked positions — we patched the max-calibrator to drop non-finite values before reduce_amax.

What's your serving stack, sampling params, and how are you assembling the multi-turn prompt? Happy to compare configs.

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment