---
license: mit
language:
- en
base_model:
- WeiboAI/VibeThinker-3B
base_model_relation: quantized
tags:
- math
- code
- reasoning
- quantized
- gptq
- w4a16
- compressed-tensors
- vllm
pipeline_tag: text-generation
library_name: transformers
---

# VibeThinker-3B-W4A16 (4-bit GPTQ, group size 128)

A 4-bit weight-quantized version of [WeiboAI/VibeThinker-3B](https://huggingface.co/WeiboAI/VibeThinker-3B),
produced so the model fits and runs on a **6 GB consumer GPU**.

> **TL;DR:** The original BF16 release is ~5.8 GB, so its weights alone fill a 6 GB card and leave no room
> for the KV cache. Quantized to **W4A16 (INT4, group size 128)** with
> [`llmcompressor`](https://github.com/vllm-project/llm-compressor), the weights shrink to **~2.0 GB**
> and the model serves happily in vLLM on an **RTX 3050 6 GB Laptop GPU** at **~67 tok/s**, with the
> step-by-step math reasoning intact.

## The story

I wanted to run VibeThinker-3B, a small but strong math/STEM reasoning model, on a laptop with an
RTX 3050 (6 GB VRAM, 50 W). The problem:

| Format | Weights | Fits 6 GB? |
|---|---|---|
| BF16 (original) | ~5.8 GB | ❌ weights alone fill the card; nothing left for KV cache + overhead |
| **W4A16 (this repo)** | **~2.0 GB** | ✅ leaves ~2.5 GB for KV cache (≈74k tokens) |

So I quantized it locally, then loaded the result back into vLLM and confirmed it produces correct,
fully-reasoned answers — all on the 6 GB laptop GPU.
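
A rough back-of-the-envelope check on why 4-bit weights with group size 128 shrink so much. This sketch assumes one 16-bit scale is stored per group of 128 weights and no zero points (symmetric scheme); the constant names are illustrative, not taken from any library.

```python
# Effective storage cost per quantized weight under W4A16 with group size 128.
# Assumption: one 16-bit scale per group and no zero points (symmetric scheme).
WEIGHT_BITS = 4
SCALE_BITS = 16
GROUP_SIZE = 128

bits_per_weight = WEIGHT_BITS + SCALE_BITS / GROUP_SIZE  # 4 + 16/128
shrink_vs_16bit = 16 / bits_per_weight

print(f"{bits_per_weight:.3f} bits/weight, {shrink_vs_16bit:.2f}x smaller than 16-bit")
```

The whole-checkpoint ratio (~5.8 GB down to ~2.0 GB, roughly 2.9x) is lower than this per-layer figure. That is expected, since `lm_head` is left unquantized and, typically, the embedding table also stays in 16-bit.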

## What was done

- **Method:** GPTQ, `W4A16` scheme (4-bit symmetric INT weights, FP16 activations), **group size 128**,
  `lm_head` left unquantized.
- **Tool:** `llmcompressor` 0.12, output in `compressed-tensors` (`pack-quantized`) format.
- **Calibration:** 512 samples from `HuggingFaceH4/ultrachat_200k` @ 2048 tokens.
- **Hardware used for quantization:** RTX 3050 6 GB. The model was loaded on CPU and quantized
  layer by layer (sequential GPU onloading), which kept peak VRAM at ~3.2 GB.
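
The settings above roughly correspond to the following `llmcompressor` one-shot run. This is a sketch modeled on the library's public GPTQ example, not the exact script used here: the output directory name, the `train_sft` split, and taking the first 512 rows are assumptions, and import paths may differ between `llmcompressor` versions.

```python
# Sketch of a GPTQ W4A16 run with llmcompressor; not the author's exact script.
MODEL_ID = "WeiboAI/VibeThinker-3B"
SAVE_DIR = "VibeThinker-3B-W4A16-G128"  # hypothetical output directory
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048


def quantize():
    # Heavy imports live inside the function so the module itself imports cheaply.
    from datasets import load_dataset
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from llmcompressor import oneshot
    from llmcompressor.modifiers.quantization import GPTQModifier

    # No device_map is passed, so the model is loaded on CPU.
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

    # Assumption: the first 512 rows of the train_sft split serve as calibration data.
    ds = load_dataset("HuggingFaceH4/ultrachat_200k",
                      split=f"train_sft[:{NUM_CALIBRATION_SAMPLES}]")

    def to_tokens(example):
        # Render the chat with the model's template, then tokenize and truncate.
        text = tokenizer.apply_chat_template(example["messages"], tokenize=False)
        return tokenizer(text, max_length=MAX_SEQUENCE_LENGTH,
                         truncation=True, add_special_tokens=False)

    ds = ds.map(to_tokens, remove_columns=ds.column_names)

    # The W4A16 preset means 4-bit symmetric weights with group size 128.
    recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

    oneshot(model=model, dataset=ds, recipe=recipe,
            max_seq_length=MAX_SEQUENCE_LENGTH,
            num_calibration_samples=NUM_CALIBRATION_SAMPLES)

    # Write the checkpoint in compressed-tensors format.
    model.save_pretrained(SAVE_DIR, save_compressed=True)
    tokenizer.save_pretrained(SAVE_DIR)
```

Calling `quantize()` downloads the base model and calibration data, so expect it to take a while on a laptop GPU.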

## Measured results (RTX 3050 6 GB, vLLM 0.22)

| Metric | Value |
|---|---|
| Weights on GPU | 1.99 GiB |
| KV cache available | 2.55 GiB (74,144 tokens) |
| Max concurrency @ 8192 ctx | 9.05× |
| Throughput | ~67 tok/s (eager mode) |
| Kernel | Marlin W4A16 + FlashAttention 2 |
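
The concurrency figure follows directly from the KV cache size. A small sketch of the arithmetic, using only the numbers reported in the table (variable names are illustrative):

```python
# Figures copied from the table above.
KV_CACHE_TOKENS = 74_144
KV_CACHE_GIB = 2.55
MAX_MODEL_LEN = 8192

# How many full-length sequences the cache can hold at once.
max_concurrency = KV_CACHE_TOKENS / MAX_MODEL_LEN

# Implied KV cache cost per token, in KiB (1 GiB = 1024 * 1024 KiB).
kib_per_token = KV_CACHE_GIB * 1024 * 1024 / KV_CACHE_TOKENS

print(f"max concurrency: {max_concurrency:.2f}x")
print(f"KV cache per token: {kib_per_token:.1f} KiB")
```

The per-token cost is handy for estimating how the token budget changes if `--gpu-memory-utilization` is lowered or raised.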

## Usage (vLLM)

```bash
vllm serve syedazeez/VibeThinker-3B-W4A16-G128 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.92 \
  --dtype float16
```

```python
from vllm import LLM, SamplingParams

llm = LLM(model="syedazeez/VibeThinker-3B-W4A16-G128",
          max_model_len=8192, gpu_memory_utilization=0.92, dtype="float16")
out = llm.chat(
    [[{"role": "user", "content": "What is the remainder when 7^100 is divided by 13?"}]],
    SamplingParams(temperature=0.6, max_tokens=1500),
)
print(out[0].outputs[0].text)
```

> This is a **reasoning** model: it emits a `<think>...</think>` trace before the final answer, so give it
> a generous `max_tokens` (≥1500) or the answer may be truncated.
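
If only the final answer is needed, the trace can be separated from it after generation. A minimal sketch, assuming the completion uses literal `<think>` and `</think>` tags as described above; `split_reasoning` is a hypothetical helper, not part of vLLM.

```python
def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) from a raw completion string."""
    head, sep, tail = text.partition("</think>")
    if not sep:
        if "<think>" in text:
            # The trace was cut off before the answer, e.g. max_tokens was too low.
            return text.replace("<think>", "", 1).strip(), ""
        # No trace at all: treat the whole text as the answer.
        return "", text.strip()
    # Drop the opening tag if present; some chat templates emit it in the prompt instead.
    reasoning = head.replace("<think>", "", 1).strip()
    return reasoning, tail.strip()


reasoning, answer = split_reasoning("<think>7^2 = 49</think>The answer is 9.")
print(answer)
```

An empty answer alongside a non-empty trace is a useful signal that `max_tokens` should be raised.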

## Notes & limitations

- Quantization is lossy; on hard benchmarks expect a small quality drop vs the BF16 original. In informal
  testing, reasoning quality was preserved on most math/STEM prompts.
- Inherits the base model's scope: tuned for **verifiable** tasks (competition math, coding, STEM).
  Not trained for tool-calling/agents or broad open-domain knowledge.
- KV cache fits ~74k tokens at this config, so `--max-model-len` can be pushed well above 8192 on 6 GB.

## License & attribution

This derivative is released under the **MIT License**, the same license as the original
[WeiboAI/VibeThinker-3B](https://huggingface.co/WeiboAI/VibeThinker-3B). All credit for the model itself
goes to the original authors. Only the 4-bit quantization was performed here.

```bibtex
@misc{xu2026vibethinker3bexploringfrontierverifiable,
      title={VibeThinker-3B: Exploring the Frontier of Verifiable Reasoning in Small Language Models},
      author={Sen Xu and Shixi Liu and Wei Wang and Jixin Min and Yingwei Dai and Zhibin Yin and Yirong Chen and Xin Zhou and Junlin Zhang},
      year={2026},
      eprint={2606.16140},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2606.16140}
}
```