Commit 1cf5c2d by XXXXyu (verified) · 1 Parent(s): bae6c38

Update README.md

Files changed (1): README.md +105 -5
README.md CHANGED
@@ -1,5 +1,105 @@
- ---
- license: other
- license_name: falcon-llm-license
- license_link: https://falconllm.tii.ae/falcon-terms-and-conditions.html
- ---
+ ---
+ license: other
+ license_name: falcon-llm-license
+ license_link: https://falconllm.tii.ae/falcon-terms-and-conditions.html
+ base_model: tiiuae/Falcon3-1B-Instruct-1.58bit
+ tags:
+ - text-generation
+ - conversational
+ - ternary
+ - quantized
+ - edge-ai
+ - on-device
+ language:
+ - en
+ library_name: vlut.cpp
+ pipeline_tag: text-generation
+ ---
+
+ # Falcon3-1B-Instruct-1.58bit-vlut-gguf
+
+ This repository contains **state-of-the-art ternary-packed versions** of [Falcon3-1B-Instruct-1.58bit](https://huggingface.co/tiiuae/Falcon3-1B-Instruct-1.58bit) in GGUF format, optimized for efficient on-device inference using the [Vec-LUT](https://arxiv.org/abs/2512.06443) method.
+
+ ### Key Features
+
+ - **🎯 SOTA Compression**: Achieves BPW (bits per weight) as low as **1.60** through **lossless** sub-2-bit ternary packing.
+ - **⚡ SOTA Performance**: Delivers superior throughput (**4.2x speedup**) in **parallel inference** scenarios via a vector lookup table (LUT).
+ - **🔌 Drop-in Ready**: Seamless integration with [vlut.cpp](https://github.com/Cipherxzc/vlut.cpp) for immediate deployment on edge devices.
+
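As a rough sanity check on the 1.60 BPW figure, the arithmetic below shows why a lossless ternary packing can go below 2 bits per weight. It assumes five ternary values are stored per byte; whether `I1_V` uses exactly this layout is an assumption, not something this README states. The second case matches the 2.00 BPW listed for `I2_V`.

```latex
\begin{align*}
\text{information per ternary weight} &= \log_2 3 \approx 1.585 \text{ bits} \\
\text{lossless packing of } k \text{ weights into } b \text{ bits} &\iff 3^k \le 2^b \\
k = 5,\ b = 8: \quad 3^5 = 243 \le 256 = 2^8 &\implies \tfrac{8}{5} = 1.60 \text{ BPW} \\
k = 1,\ b = 2: \quad 3^1 = 3 \le 4 = 2^2 &\implies \tfrac{2}{1} = 2.00 \text{ BPW}
\end{align*}
```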
+ ## Available Model Variants
+
+ Models are named `ggml-model-{PACKING}_{TILE}.gguf`; the `_{TILE}` suffix is omitted when the tile size is 1:
+
+ | File Name | Packing (BPW) | Tile Size | Comment |
+ |---------|---------|--------|------|
+ | `ggml-model-I1_V.gguf` | `I1_V` (1.60) | 1 | |
+ | `ggml-model-I1_V_2.gguf` | `I1_V` (1.60) | 2 | Recommended |
+ | `ggml-model-I2_V.gguf` | `I2_V` (2.00) | 1 | |
+ | `ggml-model-I2_V_4.gguf` | `I2_V` (2.00) | 4 | Recommended |
+ | `ggml-model-I2_V_8.gguf` | `I2_V` (2.00) | 8 | |
+
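The naming scheme in the table can be decoded mechanically, which helps when scripting over several variants. The helper below is a minimal sketch: `parse_variant` is a hypothetical name, not part of vlut.cpp, and it assumes file names follow the table exactly, with a missing tile suffix meaning tile size 1.

```bash
# Split a model file name into its packing format and tile size.
# Hypothetical helper; assumes names of the form ggml-model-{PACKING}[_{TILE}].gguf.
parse_variant() {
  pv_name="${1##*/}"                 # drop any directory prefix
  case "$pv_name" in
    ggml-model-*.gguf) ;;
    *) echo "unrecognized file name: $pv_name" >&2; return 1 ;;
  esac
  pv_stem="${pv_name#ggml-model-}"   # e.g. I2_V_4.gguf
  pv_stem="${pv_stem%.gguf}"         # e.g. I2_V_4
  case "$pv_stem" in
    *_V)   echo "$pv_stem 1" ;;                        # no suffix: tile size 1
    *_V_*) echo "${pv_stem%_*} ${pv_stem##*_}" ;;      # packing, then tile size
    *) echo "unrecognized variant: $pv_stem" >&2; return 1 ;;
  esac
}
```

Calling `parse_variant ./models/ggml-model-I2_V_4.gguf` prints the packing and the tile size separated by a space.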
+ ### Selection Guide
+
+ - **BPW vs. Speed**: `I1_V` achieves lower memory usage but may not always outperform `I2_V` in speed.
+ - **Tiling Trade-off**: Tiled variants (tile size > 1) deliver higher throughput but require larger cache capacity.
+ - **Starting Point**: Start with `I1_V_2` or `I2_V_4`, the variants marked Recommended in the table above.
+
+ For detailed tiling parameter analysis, see [Evaluation.md](https://github.com/Cipherxzc/vlut.cpp/blob/master/evaluation/Evaluation.md#tiling-parameters) and the paper.
+
+ ## Usage
+
+ ### Prerequisites
+
+ Install [vlut.cpp](https://github.com/Cipherxzc/vlut.cpp) (these models require vlut.cpp, **not** vanilla llama.cpp):
+
+ ```bash
+ git clone https://github.com/Cipherxzc/vlut.cpp.git
+ cd vlut.cpp
+ cmake -B build && cmake --build build --config Release -j4
+ ```
+
+ ### Download & Run
+
+ ```bash
+ # Download the recommended variant, e.g., I2_V_4
+ hf download <repo_id> \
+   ggml-model-I2_V_4.gguf --local-dir ./models
+
+ # Run parallel inference
+ ./build/bin/llama-batched \
+   -m ./models/ggml-model-I2_V_4.gguf \
+   -p "I believe the meaning of life is" \
+   -np 32 -n 16 -t 1 --temp 0.5 --repeat-penalty 1.5
+
+ # Benchmark performance
+ ./build/bin/llama-bench \
+   -m ./models/ggml-model-I2_V_4.gguf \
+   -t 1 -p 128 -n 0
+ ```
+
+ For comprehensive usage instructions, refer to the [vlut.cpp Quick Start Guide](https://github.com/Cipherxzc/vlut.cpp/blob/master/README.md#quick-start).
+
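Because the faster packing and tile size depend on the device, it can be useful to repeat the benchmark for every downloaded variant. The function below is a sketch that reuses only the `llama-bench` flags shown in the commands above; `bench_all` and the `LLAMA_BENCH` override are hypothetical names introduced here for illustration.

```bash
# Run the benchmark configuration from this README on every variant in a directory.
# bench_all and LLAMA_BENCH are illustrative names, not part of vlut.cpp.
bench_all() {
  ba_dir="${1:-./models}"
  ba_bin="${LLAMA_BENCH:-./build/bin/llama-bench}"
  for ba_model in "$ba_dir"/ggml-model-*.gguf; do
    [ -e "$ba_model" ] || continue   # skip when the glob matched nothing
    echo "== $ba_model =="
    "$ba_bin" -m "$ba_model" -t 1 -p 128 -n 0
  done
}
```

Run `bench_all ./models` from the vlut.cpp checkout. Setting `LLAMA_BENCH=echo` gives a dry run that prints each command's arguments instead of executing the benchmark.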
+ ## Citation
+
+ If you use these models, please cite our [paper](https://arxiv.org/abs/2512.06443):
+
+ ```bibtex
+ @article{li2025veclut,
+   title={Vec-LUT: Vector Table Lookup for Parallel Ultra-Low-Bit LLM Inference on Edge Devices},
+   author={Li, Xiangyu and Yin, Chengyu and Wang, Weijun and Wei, Jianyu and Cao, Ting and Liu, Yunxin},
+   journal={arXiv preprint arXiv:2512.06443},
+   year={2025},
+   url={https://arxiv.org/abs/2512.06443}
+ }
+ ```
+
+ And the original Falcon3 work:
+
+ ```bibtex
+ @misc{Falcon3,
+   title = {The Falcon 3 family of Open Models},
+   author = {TII Team},
+   month = {December},
+   year = {2024}
+ }
+ ```