---
license: mit
base_model: 1bitLLM/bitnet_b1_58-3B
tags:
- text-generation
- ternary
- quantized
- edge-ai
- on-device
language:
- en
library_name: vlut.cpp
pipeline_tag: text-generation
---
# bitnet_b1_58-3B-vlut-gguf
This repository contains **state-of-the-art ternary-packed versions** of [bitnet_b1_58-3B](https://huggingface.co/1bitLLM/bitnet_b1_58-3B) in GGUF format, optimized for efficient on-device inference using the [Vec-LUT](https://arxiv.org/abs/2512.06443) method.
### Key Features
- **🎯 SOTA Compression**: Achieves BPW (bits per weight) as low as **1.60** through **lossless** sub-2-bit ternary packing.
- **⚡ SOTA Performance**: Delivers higher throughput (a reported **4.2x speedup**; see the paper for the baseline and setup) in **parallel inference** scenarios via vector lookup tables (LUTs).
- **🔌 Drop-in Ready**: Seamless integration with [vlut.cpp](https://github.com/Cipherxzc/vlut.cpp) for immediate deployment on edge devices.
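
As a rough illustration of how lossless packing at 1.60 BPW is possible: five ternary weights have 3^5 = 243 combinations, which fit in one 8-bit byte, giving 8 / 5 = 1.60 bits per weight. The bash sketch below shows that base-3 idea only. The function names `pack5` and `unpack5` are hypothetical, and the actual `I1_V` memory layout used by vlut.cpp may differ.

```bash
# Sketch only: base-3 packing of five ternary weights into one byte.
# Assumption: weights are in {-1, 0, 1}; this is NOT the vlut.cpp layout.

# Pack five weights into an integer in the range 0..242.
pack5() {
  local byte=0 w
  for w in "$@"; do
    byte=$(( byte * 3 + (w + 1) ))   # map -1/0/1 to base-3 digit 0/1/2
  done
  echo "$byte"
}

# Recover the five weights from a packed byte, in the original order.
unpack5() {
  local byte=$1 i out=()
  for i in 4 3 2 1 0; do
    out[i]=$(( byte % 3 - 1 ))       # last weight is the lowest digit
    byte=$(( byte / 3 ))
  done
  echo "${out[*]}"
}

pack5 -1 0 1 1 -1   # prints 51
unpack5 51          # prints -1 0 1 1 -1
```

Because every 5-weight group maps to a distinct byte value and back, no information is lost, which is what "lossless" means here.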
## Available Model Variants
Models are named as `ggml-model-{PACKING}_{TILE}.gguf`; the `_{TILE}` suffix is omitted when the tile size is 1:
| File Name | Packing (BPW) | Tile Size | Comment |
|---------|---------|--------|------|
| `ggml-model-I1_V.gguf` | `I1_V` (1.60) | 1 | |
| `ggml-model-I1_V_2.gguf` | `I1_V` (1.60) | 2 | Recommended |
| `ggml-model-I2_V.gguf` | `I2_V` (2.00) | 1 | |
| `ggml-model-I2_V_4.gguf` | `I2_V` (2.00) | 4 | Recommended |
| `ggml-model-I2_V_8.gguf` | `I2_V` (2.00) | 8 | |
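
The naming pattern in the table can be captured in a small shell helper, which may be handy in download or benchmark scripts. This is a minimal sketch: `variant_file` is a hypothetical name, and the rule that tile size 1 carries no suffix is inferred from the file names listed above.

```bash
# Hypothetical helper: build a model file name from packing and tile size.
# Usage: variant_file PACKING [TILE]   (TILE defaults to 1)
variant_file() {
  local packing=$1 tile=${2:-1}
  if [ "$tile" -eq 1 ]; then
    # Tile size 1 has no suffix in the table above.
    echo "ggml-model-${packing}.gguf"
  else
    echo "ggml-model-${packing}_${tile}.gguf"
  fi
}

variant_file I1_V 2   # prints ggml-model-I1_V_2.gguf
variant_file I2_V     # prints ggml-model-I2_V.gguf
```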
### Selection Guide
- **BPW vs. Speed**: `I1_V` uses less memory than `I2_V` but is not always faster.
- **Tiling Trade-off**: Tiled variants (tile size > 1) deliver higher throughput but require larger cache capacity.
- **Starting Point**: Use `I1_V_2` or `I2_V_4` as a starting point.
For detailed tiling parameter analysis, see [Evaluation.md](https://github.com/Cipherxzc/vlut.cpp/blob/master/evaluation/Evaluation.md#tiling-parameters) and the paper.
## Usage
### Prerequisites
Build [vlut.cpp](https://github.com/Cipherxzc/vlut.cpp) from source (these models require vlut.cpp, **not** vanilla llama.cpp):
```bash
git clone https://github.com/Cipherxzc/vlut.cpp.git
cd vlut.cpp
cmake -B build && cmake --build build --config Release -j4
```
### Download & Run
```bash
# Download the recommended variant, e.g., I2_V_4
# (replace <repo_id> with the ID of this repository)
hf download <repo_id> \
  ggml-model-I2_V_4.gguf --local-dir ./models

# Run parallel inference
./build/bin/llama-batched \
  -m ./models/ggml-model-I2_V_4.gguf \
  -p "I believe the meaning of life is" \
  -np 32 -n 16 -t 1 --temp 0.5 --repeat-penalty 1.5

# Benchmark performance
./build/bin/llama-bench \
  -m ./models/ggml-model-I2_V_4.gguf \
  -t 1 -p 128 -n 0
```
For comprehensive usage instructions, refer to the [vlut.cpp Quick Start Guide](https://github.com/Cipherxzc/vlut.cpp/blob/master/README.md#quick-start).
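
Since the faster variant can depend on the device, it may be worth benchmarking both recommended files on the target hardware. The sketch below wraps the `llama-bench` invocation shown above in a loop. The function name `bench_variants` and the `LLAMA_BENCH` override variable are hypothetical, and the flags are copied from the example above rather than tuned.

```bash
# Hypothetical helper: run llama-bench on every model file given as an
# argument, skipping files that have not been downloaded yet.
bench_variants() {
  # LLAMA_BENCH is an assumed override; defaults to the build path above.
  local bench=${LLAMA_BENCH:-./build/bin/llama-bench} model
  for model in "$@"; do
    if [ ! -f "$model" ]; then
      echo "skip: $model not found" >&2
      continue
    fi
    echo "== $model =="
    "$bench" -m "$model" -t 1 -p 128 -n 0
  done
}
```

For example, `bench_variants ./models/ggml-model-I1_V_2.gguf ./models/ggml-model-I2_V_4.gguf` runs the same prompt-processing benchmark on both recommended variants so their results can be compared side by side.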
## Citation
If you use these models, please cite our [paper](https://arxiv.org/abs/2512.06443):
```bibtex
@article{li2025veclut,
title={Vec-LUT: Vector Table Lookup for Parallel Ultra-Low-Bit LLM Inference on Edge Devices},
author={Li, Xiangyu and Yin, Chengyu and Wang, Weijun and Wei, Jianyu and Cao, Ting and Liu, Yunxin},
journal={arXiv preprint arXiv:2512.06443},
year={2025},
url={https://arxiv.org/abs/2512.06443}
}
```