---
license: mit
language:
- en
base_model:
- WeiboAI/VibeThinker-3B
base_model_relation: quantized
tags:
- math
- code
- reasoning
- quantized
- gptq
- w4a16
- compressed-tensors
- vllm
pipeline_tag: text-generation
library_name: transformers
---

# VibeThinker-3B-W4A16 (4-bit GPTQ, group size 128)

A 4-bit weight-quantized version of [WeiboAI/VibeThinker-3B](https://huggingface.co/WeiboAI/VibeThinker-3B), produced so the model fits and runs on a **6 GB consumer GPU**.

> **TL;DR:** The original BF16 release is ~5.8 GB; its weights alone don't fit in 6 GB of VRAM.
> Quantized to **W4A16 (INT4, group size 128)** with
> [`llmcompressor`](https://github.com/vllm-project/llm-compressor), the weights shrink to **~2.0 GB**
> and the model serves happily in vLLM on an **RTX 3050 6 GB Laptop GPU** at **~67 tok/s**, with the
> step-by-step math reasoning intact.

## The story

I wanted to run VibeThinker-3B, a small but strong math/STEM reasoning model, on a laptop with an RTX 3050 (6 GB VRAM, 50 W). The problem:

| Format | Weights | Fits 6 GB? |
|---|---|---|
| BF16 (original) | ~5.8 GB | ❌ weights alone fill the card; nothing left for KV cache + overhead |
| **W4A16 (this repo)** | **~2.0 GB** | ✅ leaves ~2.5 GB for KV cache (≈74k tokens) |

So I quantized it locally, then loaded the result back into vLLM and confirmed it produces correct, fully-reasoned answers, all on the 6 GB laptop GPU.

## What was done

- **Method:** GPTQ, `W4A16` scheme (4-bit symmetric INT weights, FP16 activations), **group size 128**, `lm_head` left unquantized.
- **Tool:** `llmcompressor` 0.12, output in `compressed-tensors` (`pack-quantized`) format.
- **Calibration:** 512 samples from `HuggingFaceH4/ultrachat_200k` @ 2048 tokens.
- **Hardware used for quantization:** RTX 3050 6 GB. The model was loaded on CPU and quantized layer-by-layer (sequential GPU on-loading), keeping peak VRAM ~3.2 GB.
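
For readers who want to reproduce something similar, the settings listed above map roughly onto the standard `llmcompressor` one-shot GPTQ flow. The sketch below is not the exact script used for this repo. It assumes `llmcompressor`, `transformers`, and `datasets` are installed, that the base model ships a chat template, and that the `W4A16` preset defaults to group size 128. The `quantize` function name, the output directory, and the shuffle seed are illustrative choices.

```python
# Minimal sketch of a W4A16 GPTQ run with llmcompressor (assumptions noted inline).
MODEL_ID = "WeiboAI/VibeThinker-3B"
CALIBRATION_DATASET = "HuggingFaceH4/ultrachat_200k"
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048


def quantize(output_dir="VibeThinker-3B-W4A16-G128"):
    """Quantize the base model to W4A16 and save it in compressed-tensors format."""
    # Heavy imports are kept inside the function so the module imports cheaply.
    from datasets import load_dataset
    from llmcompressor import oneshot
    from llmcompressor.modifiers.quantization import GPTQModifier
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # No device_map is passed, so the weights start on CPU; llmcompressor's
    # sequential pipeline is assumed to move layers to the GPU as it goes.
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

    # Take the first 512 chat samples as calibration data (seed is arbitrary).
    ds = load_dataset(
        CALIBRATION_DATASET, split=f"train_sft[:{NUM_CALIBRATION_SAMPLES}]"
    )
    ds = ds.shuffle(seed=42)

    def to_text(example):
        # Render each conversation with the model's own chat template.
        return {
            "text": tokenizer.apply_chat_template(
                example["messages"], tokenize=False
            )
        }

    def tokenize(sample):
        # Truncate each calibration sample to 2048 tokens.
        return tokenizer(
            sample["text"],
            padding=False,
            max_length=MAX_SEQUENCE_LENGTH,
            truncation=True,
            add_special_tokens=False,
        )

    ds = ds.map(to_text)
    ds = ds.map(tokenize, remove_columns=ds.column_names)

    # Quantize every Linear layer to 4-bit weights, leaving lm_head untouched.
    recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

    oneshot(
        model=model,
        dataset=ds,
        recipe=recipe,
        max_seq_length=MAX_SEQUENCE_LENGTH,
        num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    )

    # save_compressed=True writes packed weights in compressed-tensors format.
    model.save_pretrained(output_dir, save_compressed=True)
    tokenizer.save_pretrained(output_dir)
```

Calling `quantize()` downloads the base model and calibration data, so expect it to take a while on a laptop GPU. Peak VRAM will depend on the `llmcompressor` version and its sequential-pipeline defaults.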

## Measured results (RTX 3050 6 GB, vLLM 0.22)

| Metric | Value |
|---|---|
| Weights on GPU | 1.99 GiB |
| KV cache available | 2.55 GiB (74,144 tokens) |
| Max concurrency @ 8192 ctx | 9.05× |
| Throughput | ~67 tok/s (eager) |
| Kernel | Marlin W4A16 + FlashAttention 2 |

## Usage (vLLM)

```bash
vllm serve syedazeez/VibeThinker-3B-W4A16-G128 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.92 \
  --dtype float16
```

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="syedazeez/VibeThinker-3B-W4A16-G128",
    max_model_len=8192,
    gpu_memory_utilization=0.92,
    dtype="float16",
)

out = llm.chat(
    [[{"role": "user", "content": "What is the remainder when 7^100 is divided by 13?"}]],
    SamplingParams(temperature=0.6, max_tokens=1500),
)
print(out[0].outputs[0].text)
```

> This is a **reasoning** model: it emits a `...` trace before the final answer, so give it
> a generous `max_tokens` (≥1500) or the answer may be truncated.

## Notes & limitations

- Quantization is lossy; on hard benchmarks expect a small quality drop vs the BF16 original. For most math/STEM prompts the reasoning quality is preserved in informal testing.
- Inherits the base model's scope: tuned for **verifiable** tasks (competition math, coding, STEM). Not trained for tool-calling/agents or broad open-domain knowledge.
- KV cache fits ~74k tokens at this config, so `--max-model-len` can be pushed well above 8192 on 6 GB.

## License & attribution

This derivative is released under the **MIT License**, the same license as the original [WeiboAI/VibeThinker-3B](https://huggingface.co/WeiboAI/VibeThinker-3B). All credit for the model itself goes to the original authors. Only the 4-bit quantization was performed here.

```bibtex
@misc{xu2026vibethinker3bexploringfrontierverifiable,
  title={VibeThinker-3B: Exploring the Frontier of Verifiable Reasoning in Small Language Models},
  author={Sen Xu and Shixi Liu and Wei Wang and Jixin Min and Yingwei Dai and Zhibin Yin and Yirong Chen and Xin Zhou and Junlin Zhang},
  year={2026},
  eprint={2606.16140},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2606.16140}
}
```