Model Card for Model ID

This is a mixed BF16/INT8 AWQ quantization (applied per layer) produced with llmcompressor, with working MTP (multi-token prediction, used for speculative decoding).

Model Details

The "UC" in the name refers to the "HuggingFaceH4/ultrachat_200k" dataset, which was used as calibration data for this quant.

The chat_template was fixed using "froggeric/Qwen-Fixed-Chat-Templates".

MTP works with the following vLLM flag:

--speculative-config '{"method":"mtp","num_speculative_tokens":2}'

Tested with vLLM 0.19.1 and transformers 5.6.2.

Recommended flags:

--enable-auto-tool-choice
--reasoning-parser qwen3
--tool-call-parser qwen3_xml
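
The flags above can be combined into a single launch command. The following is a minimal sketch that assembles them in Python, assuming the standard `vllm serve` entry point. The `build_vllm_command` helper and the `your-org/your-model` repository name are hypothetical placeholders, not part of this card.

```python
import json
import shlex


def build_vllm_command(model_id: str, num_speculative_tokens: int = 2) -> list[str]:
    """Assemble a `vllm serve` argument list from the flags listed in this card."""
    # Compact separators reproduce the JSON exactly as written in the card.
    speculative_config = json.dumps(
        {"method": "mtp", "num_speculative_tokens": num_speculative_tokens},
        separators=(",", ":"),
    )
    return [
        "vllm", "serve", model_id,
        "--speculative-config", speculative_config,
        "--enable-auto-tool-choice",
        "--reasoning-parser", "qwen3",
        "--tool-call-parser", "qwen3_xml",
    ]


if __name__ == "__main__":
    # "your-org/your-model" is a placeholder; substitute the real repository name.
    print(shlex.join(build_vllm_command("your-org/your-model")))
```

`shlex.join` takes care of quoting the JSON value, so the printed line can be pasted into a POSIX shell as-is.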

| Rank | Model / Dataset | HumanEval (Code) ↑ | Winogrande (Logic) ↑ | HellaSwag (Context) ↑ | WikiText (PPL) ↓ | Verdict |
|---|---|---|---|---|---|---|
| 1st | AWQ-UC (Ultrachat) | 0.6524 | 0.7459 | 0.7842 | 9.5979 | The Context King |
| 2nd | AWQ-CK (CyanKiwi) | 0.6524 | 0.7474 | 0.7839 | 9.5986 | Best for Pure Logic |
| 3rd | AWQ-NM (NeuralMagic) | 0.6585 | 0.7443 | 0.7833 | 9.5974 | Best for Code & PPL |
| 4th | Base (BF16) | 0.6524 | 0.7498 | 0.7843 | 9.5951 | Reference (Slow) |

Maximum Fidelity (UC): The UltraChat 200k quant stayed within 0.0001 of the Base model's HellaSwag score (0.7842 vs. 0.7843), the closest of the three quantized versions. This makes it a strong choice where fidelity to the Base model matters, such as document processing.
Coding Optimization (NM): NeuralMagic was the only dataset whose quant scored above the Base model on HumanEval (+0.0061, which corresponds to one additional problem out of HumanEval's 164). It also had the lowest WikiText perplexity among the quantized versions, suggesting a good fit for algorithmic tasks.
Logical Stability (CK): The CyanKiwi quant kept the highest Winogrande accuracy among the quantized versions (0.7474 vs. 0.7498 for Base), which matters for reasoning-heavy workflows.
Efficiency: Quantization reduced execution time by about 30% (from 00:40 to 00:28 on HumanEval), while WikiText perplexity rose by less than 0.004 over Base for every quant.
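
The figures quoted in these notes can be rechecked from the table. The sketch below copies the scores from this card and recomputes the deltas. It assumes the 00:40 and 00:28 timings are minutes:seconds (40 s and 28 s), and the helper names are illustrative only.

```python
# Scores copied from the benchmark table in this card.
SCORES = {
    "AWQ-UC": {"humaneval": 0.6524, "winogrande": 0.7459, "hellaswag": 0.7842, "wikitext_ppl": 9.5979},
    "AWQ-CK": {"humaneval": 0.6524, "winogrande": 0.7474, "hellaswag": 0.7839, "wikitext_ppl": 9.5986},
    "AWQ-NM": {"humaneval": 0.6585, "winogrande": 0.7443, "hellaswag": 0.7833, "wikitext_ppl": 9.5974},
    "Base": {"humaneval": 0.6524, "winogrande": 0.7498, "hellaswag": 0.7843, "wikitext_ppl": 9.5951},
}
QUANTIZED = [name for name in SCORES if name != "Base"]


def delta_vs_base(model: str, metric: str) -> float:
    """Difference between a model's score and the Base score, rounded to 4 places."""
    return round(SCORES[model][metric] - SCORES["Base"][metric], 4)


def best_quantized(metric: str, higher_is_better: bool = True) -> str:
    """Name of the quantized model with the best score on a metric."""
    pick = max if higher_is_better else min
    return pick(QUANTIZED, key=lambda name: SCORES[name][metric])


def runtime_reduction(before_s: float, after_s: float) -> float:
    """Fractional reduction in runtime."""
    return (before_s - after_s) / before_s


if __name__ == "__main__":
    print("UC HellaSwag delta:", delta_vs_base("AWQ-UC", "hellaswag"))
    print("NM HumanEval delta:", delta_vs_base("AWQ-NM", "humaneval"))
    print("Best quantized Winogrande:", best_quantized("winogrande"))
    print("Lowest quantized PPL:", best_quantized("wikitext_ppl", higher_is_better=False))
    # Assumes 00:40 and 00:28 mean 40 and 28 seconds.
    print("Runtime reduction:", f"{runtime_reduction(40, 28):.0%}")
```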

Safetensors

Model size: 10B params

Tensor types: I64, I32, BF16
