XXXXyu committed on
Commit 9758ca8 · verified · 1 Parent(s): 0e57268

Update README.md

Files changed (1)
  1. README.md +91 -3
README.md CHANGED
@@ -1,3 +1,91 @@
- ---
- license: mit
- ---
+ ---
+ license: mit
+ base_model: 1bitLLM/bitnet_b1_58-3B
+ tags:
+ - text-generation
+ - ternary
+ - quantized
+ - edge-ai
+ - on-device
+ language:
+ - en
+ library_name: vlut.cpp
+ pipeline_tag: text-generation
+ ---
+
+ # bitnet_b1_58-3B-vlut-gguf
+
+ This repository contains **state-of-the-art ternary-packed versions** of [bitnet_b1_58-3B](https://huggingface.co/1bitLLM/bitnet_b1_58-3B) in GGUF format, optimized for efficient on-device inference using the [Vec-LUT](https://arxiv.org/abs/2512.06443) method.
+
+ ### Key Features
+
+ - **🎯 SOTA Compression**: Achieves BPW (bits per weight) as low as **1.60** through **lossless** sub-2-bit ternary packing.
+ - **⚡ SOTA Performance**: Delivers superior throughput (**4.2x speedup**) in **parallel inference** scenarios via vector lookup table (LUT).
+ - **🔌 Drop-in Ready**: Seamless integration with [vlut.cpp](https://github.com/Cipherxzc/vlut.cpp) for immediate deployment on edge devices.
+
+ ## Available Model Variants
+
+ Models are named as `ggml-model-{PACKING}_{TILE}.gguf`:
+
+ | File Name | Packing (BPW) | Tile Size | Comment |
+ |---------|---------|--------|------|
+ | `ggml-model-I1_V.gguf` | `I1_V` (1.60) | 1 | |
+ | `ggml-model-I1_V_2.gguf` | `I1_V` (1.60) | 2 | Recommended |
+ | `ggml-model-I2_V.gguf` | `I2_V` (2.00) | 1 | |
+ | `ggml-model-I2_V_4.gguf` | `I2_V` (2.00) | 4 | Recommended |
+ | `ggml-model-I2_V_8.gguf` | `I2_V` (2.00) | 8 | |
+
+ ### Selection Guide
+
+ - **BPW vs. Speed**: `I1_V` achieves lower memory usage but may not always outperform `I2_V` in speed.
+ - **Tiling Trade-off**: Tiled variants (tile size > 1) deliver higher throughput but require larger cache capacity.
+ - **Starting Point**: Use `I1_V_2` or `I2_V_4` as a starting point.
+
+ For detailed tiling parameter analysis, see [Evaluation.md](https://github.com/Cipherxzc/vlut.cpp/blob/master/evaluation/Evaluation.md#tiling-parameters) and the paper.
+
+ ## Usage
+
+ ### Prerequisites
+
+ Install [vlut.cpp](https://github.com/Cipherxzc/vlut.cpp) (these models require vlut.cpp, **not** vanilla llama.cpp):
+
+ ```bash
+ git clone https://github.com/Cipherxzc/vlut.cpp.git
+ cd vlut.cpp
+ cmake -B build && cmake --build build --config Release -j4
+ ```
+
+ ### Download & Run
+
+ ```bash
+ # Download the recommended variant, e.g., I2_V_4
+ hf download <repo_id> \
+ ggml-model-I2_V_4.gguf --local-dir ./models
+
+ # Run parallel inference
+ ./build/bin/llama-batched \
+ -m ./models/ggml-model-I2_V_4.gguf \
+ -p "I believe the meaning of life is" \
+ -np 32 -n 16 -t 1 --temp 0.5 --repeat-penalty 1.5
+
+ # Benchmark performance
+ ./build/bin/llama-bench \
+ -m ./models/ggml-model-I2_V_4.gguf \
+ -t 1 -p 128 -n 0
+ ```
+
+ For comprehensive usage instructions, refer to the [vlut.cpp Quick Start Guide](https://github.com/Cipherxzc/vlut.cpp/blob/master/README.md#quick-start).
+
+ ## Citation
+
+ If you use these models, please cite our [paper](https://arxiv.org/abs/2512.06443):
+
+ ```bibtex
+ @article{li2025veclut,
+ title={Vec-LUT: Vector Table Lookup for Parallel Ultra-Low-Bit LLM Inference on Edge Devices},
+ author={Li, Xiangyu and Yin, Chengyu and Wang, Weijun and Wei, Jianyu and Cao, Ting and Liu, Yunxin},
+ journal={arXiv preprint arXiv:2512.06443},
+ year={2025},
+ url={https://arxiv.org/abs/2512.06443}
+ }
+ ```
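
A note on the BPW figures in the README added by this commit: 1.60 bits per weight is what you get when five ternary weights share one byte, since 3^5 = 243 fits in the 256 values of a byte and 8 / 5 = 1.6, while 2.00 corresponds to a plain two bits per weight. The sketch below illustrates why such packing can be lossless, using a simple base-3 encoding. It is an assumption for illustration only: the actual `I1_V` byte layout in vlut.cpp is not described in this commit and may differ, and the helper names `pack5` and `unpack5` are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not part of vlut.cpp): pack five ternary weights,
# each -1, 0, or 1, into a single byte by treating them as base-3 digits.

pack5() {
  local byte=0 w
  for w in "$@"; do
    # Shift the existing digits up by one base-3 place, then add the new
    # digit, mapping -1/0/1 to 0/1/2.
    byte=$(( byte * 3 + w + 1 ))
  done
  printf '%s\n' "$byte"
}

unpack5() {
  local byte=$1 out="" i
  for i in 1 2 3 4 5; do
    # Peel off the least significant base-3 digit and map 0/1/2 back to -1/0/1.
    out="$(( byte % 3 - 1 )) $out"
    byte=$(( byte / 3 ))
  done
  # Drop the trailing space left by the loop.
  printf '%s\n' "${out% }"
}

packed=$(pack5 1 -1 0 0 1)
echo "packed byte: $packed"
echo "round trip:  $(unpack5 "$packed")"
```

The largest value this encoding produces is 242 (five weights of 1), so every group of five weights fits in one byte and decodes back to exactly the same weights.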