---
license: mit
language:
- en
base_model:
- WeiboAI/VibeThinker-3B
base_model_relation: quantized
tags:
- math
- code
- reasoning
- quantized
- gptq
- w4a16
- compressed-tensors
- vllm
pipeline_tag: text-generation
library_name: transformers
---
# VibeThinker-3B-W4A16 (4-bit GPTQ, group size 128)
A 4-bit weight-quantized version of [WeiboAI/VibeThinker-3B](https://huggingface.co/WeiboAI/VibeThinker-3B),
produced so the model fits and runs on a **6 GB consumer GPU**.
> **TL;DR:** The original BF16 release is ~5.8 GB, so its weights alone all but fill 6 GB of VRAM.
> Quantized to **W4A16 (INT4, group size 128)** with
> [`llmcompressor`](https://github.com/vllm-project/llm-compressor), the weights shrink to **~2.0 GB**
> and the model serves happily in vLLM on an **RTX 3050 6 GB Laptop GPU** at **~67 tok/s**, with the
> step-by-step math reasoning intact.
## The story
I wanted to run VibeThinker-3B, a small but strong math/STEM reasoning model, on a laptop with an
RTX 3050 (6 GB VRAM, 50 W). The problem:
| Format | Weights | Fits 6 GB? |
|---|---|---|
| BF16 (original) | ~5.8 GB | ❌ weights alone fill the card; nothing left for KV cache + overhead |
| **W4A16 (this repo)** | **~2.0 GB** | ✅ leaves ~2.5 GB for KV cache (≈74k tokens) |
So I quantized it locally, then loaded the result back into vLLM and confirmed it produces correct,
fully-reasoned answers, all on the 6 GB laptop GPU.
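The sizes in the table follow from bits-per-weight arithmetic, sketched below. The parameter count is an assumption back-derived from the ~5.8 GB BF16 figure rather than an official number, and the sketch assumes one 16-bit scale per group of 128 weights with no zero-points (symmetric quantization). `weight_gb` is a hypothetical helper.

```python
def weight_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Assumption: parameter count back-derived from the ~5.8 GB BF16 checkpoint.
NUM_PARAMS = 5.8e9 / 2  # 2 bytes per BF16 weight -> ~2.9e9 parameters

GROUP_SIZE = 128
# Assumption: each group of 128 4-bit weights shares one 16-bit scale.
INT4_BITS = 4 + 16 / GROUP_SIZE  # 4.125 bits per weight

bf16_gb = weight_gb(NUM_PARAMS, 16)
int4_gb = weight_gb(NUM_PARAMS, INT4_BITS)

print(f"BF16:                              {bf16_gb:.2f} GB")
print(f"INT4 g128, every weight quantized: {int4_gb:.2f} GB")
```

Under these assumptions the fully quantized estimate comes out near 1.5 GB. The reported ~2.0 GB is larger, which is consistent with some tensors staying at 16-bit, such as the unquantized `lm_head`.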
## What was done
- **Method:** GPTQ, `W4A16` scheme (4-bit symmetric INT weights, FP16 activations), **group size 128**,
`lm_head` left unquantized.
- **Tool:** `llmcompressor` 0.12, output in `compressed-tensors` (`pack-quantized`) format.
- **Calibration:** 512 samples from `HuggingFaceH4/ultrachat_200k` @ 2048 tokens.
- **Hardware used for quantization:** RTX 3050 6 GB. The model was loaded on CPU and quantized
layer-by-layer (sequential GPU onloading), keeping peak VRAM at ~3.2 GB.
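The settings above can be sketched as an `llmcompressor` one-shot run. This is not the script used for this repo: it follows the general pattern of the `llmcompressor` GPTQ examples, and the `train_sft` split, the chat-template preprocessing, and `SAVE_DIR` are assumptions. Device placement is left to library defaults because the source does not show how CPU loading and sequential onloading were configured.

```python
# Sketch only: mirrors the settings listed above, not the author's actual script.
MODEL_ID = "WeiboAI/VibeThinker-3B"
DATASET_ID = "HuggingFaceH4/ultrachat_200k"
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048
SAVE_DIR = "VibeThinker-3B-W4A16-G128"  # hypothetical output directory


def quantize() -> None:
    # Heavy imports stay inside the function so the settings above can be
    # read without llmcompressor, transformers, or datasets installed.
    from datasets import load_dataset
    from llmcompressor import oneshot
    from llmcompressor.modifiers.quantization import GPTQModifier
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

    # Assumption: calibration samples are the first 512 rows of train_sft.
    ds = load_dataset(DATASET_ID, split=f"train_sft[:{NUM_CALIBRATION_SAMPLES}]")

    def to_text(example):
        # Render each conversation with the model's own chat template.
        text = tokenizer.apply_chat_template(example["messages"], tokenize=False)
        return {"text": text}

    def tokenize(example):
        return tokenizer(
            example["text"],
            max_length=MAX_SEQUENCE_LENGTH,
            truncation=True,
            add_special_tokens=False,
        )

    ds = ds.map(to_text)
    ds = ds.map(tokenize, remove_columns=ds.column_names)

    # W4A16 preset: 4-bit symmetric integer weights, group size 128.
    # lm_head is excluded, matching the description above.
    recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

    oneshot(
        model=model,
        dataset=ds,
        recipe=recipe,
        max_seq_length=MAX_SEQUENCE_LENGTH,
        num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    )

    # Write the packed weights in compressed-tensors format.
    model.save_pretrained(SAVE_DIR, save_compressed=True)
    tokenizer.save_pretrained(SAVE_DIR)


if __name__ == "__main__":
    quantize()
```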
## Measured results (RTX 3050 6 GB, vLLM 0.22)
| Metric | Value |
|---|---|
| Weights on GPU | 1.99 GiB |
| KV cache available | 2.55 GiB (74,144 tokens) |
| Max concurrency @ 8192 ctx | 9.05× |
| Throughput | ~67 tok/s (eager) |
| Kernel | Marlin W4A16 + FlashAttention 2 |
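The concurrency figure is the KV-cache token capacity divided by the context length. The short sketch below reproduces it from the table values and estimates the KV-cache cost per token; `concurrency_at` is a hypothetical helper, and the per-token figure is approximate because the 2.55 GiB input is rounded.

```python
GIB = 2**30

kv_cache_bytes = 2.55 * GIB  # "KV cache available" from the table (rounded)
kv_cache_tokens = 74_144     # token capacity from the table


def concurrency_at(context_len: int) -> float:
    """How many full-length sequences fit in the KV cache at once."""
    return kv_cache_tokens / context_len


bytes_per_token = kv_cache_bytes / kv_cache_tokens

print(f"~{bytes_per_token / 1024:.1f} KiB of KV cache per token")
print(f"{concurrency_at(8192):.2f}x concurrency at 8192 tokens")
print(f"{concurrency_at(32768):.2f}x concurrency at 32768 tokens")
```

The same division shows the trade-off when raising `--max-model-len`: a longer context lowers the number of full-length requests that fit at once.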
## Usage (vLLM)
```bash
vllm serve syedazeez/VibeThinker-3B-W4A16-G128 \
--max-model-len 8192 \
--gpu-memory-utilization 0.92 \
--dtype float16
```
```python
from vllm import LLM, SamplingParams
llm = LLM(model="syedazeez/VibeThinker-3B-W4A16-G128",
max_model_len=8192, gpu_memory_utilization=0.92, dtype="float16")
out = llm.chat(
[[{"role": "user", "content": "What is the remainder when 7^100 is divided by 13?"}]],
SamplingParams(temperature=0.6, max_tokens=1500),
)
print(out[0].outputs[0].text)
```
> This is a **reasoning** model: it emits a `<think>...</think>` trace before the final answer, so give it
> a generous `max_tokens` (≥1500) or the answer may be truncated.
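To show only the final answer, the trace can be separated from the completion text. The sketch below is a minimal approach, assuming the output contains a closing `</think>` tag and possibly an opening one; `split_reasoning` is a hypothetical helper, not part of vLLM. It needs Python 3.9 or newer.

```python
def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) from a completion with a think trace."""
    head, sep, tail = text.partition("</think>")
    reasoning = head.strip().removeprefix("<think>").strip()
    if not sep:
        # No closing tag: the trace was probably cut off by max_tokens,
        # so there is no final answer to return.
        return reasoning, ""
    return reasoning, tail.strip()


sample = "<think>7^12 = 1 mod 13, so 7^100 = 7^4 = 2401 = 9 mod 13</think>\nThe remainder is 9."
reasoning, answer = split_reasoning(sample)
print(answer)
```

An empty answer is a useful signal that `max_tokens` was too small for the prompt.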
## Notes & limitations
- Quantization is lossy; on hard benchmarks expect a small quality drop vs the BF16 original. For most
math/STEM prompts the reasoning quality is preserved in informal testing.
- Inherits the base model's scope: tuned for **verifiable** tasks (competition math, coding, STEM).
Not trained for tool-calling/agents or broad open-domain knowledge.
- KV cache fits ~74k tokens at this config, so `--max-model-len` can be pushed well above 8192 on 6 GB.
## License & attribution
This derivative is released under the **MIT License**, the same license as the original
[WeiboAI/VibeThinker-3B](https://huggingface.co/WeiboAI/VibeThinker-3B). All credit for the model itself
goes to the original authors. Only the 4-bit quantization was performed here.
```bibtex
@misc{xu2026vibethinker3bexploringfrontierverifiable,
title={VibeThinker-3B: Exploring the Frontier of Verifiable Reasoning in Small Language Models},
author={Sen Xu and Shixi Liu and Wei Wang and Jixin Min and Yingwei Dai and Zhibin Yin and Yirong Chen and Xin Zhou and Junlin Zhang},
year={2026},
eprint={2606.16140},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2606.16140}
}
```