---
license: other
license_name: falcon-llm-license
license_link: https://falconllm.tii.ae/falcon-terms-and-conditions.html
base_model: tiiuae/Falcon3-1B-Instruct-1.58bit
tags:
- text-generation
- conversational
- ternary
- quantized
- edge-ai
- on-device
language:
- en
library_name: vlut.cpp
pipeline_tag: text-generation
---

# Falcon3-1B-Instruct-1.58bit-vlut-gguf

This repository contains **state-of-the-art ternary-packed versions** of [Falcon3-1B-Instruct-1.58bit](https://huggingface.co/tiiuae/Falcon3-1B-Instruct-1.58bit) in GGUF format, optimized for efficient on-device inference using the [Vec-LUT](https://arxiv.org/abs/2512.06443) method.

### Key Features

- **🎯 SOTA Compression**: Achieves BPW (bits per weight) as low as **1.60** through **lossless** sub-2-bit ternary packing.
- **⚡ SOTA Performance**: Delivers superior throughput (**4.2x speedup**) in **parallel inference** scenarios via vector lookup table (LUT).
- **🔌 Drop-in Ready**: Seamless integration with [vlut.cpp](https://github.com/Cipherxzc/vlut.cpp) for immediate deployment on edge devices.

## Available Model Variants

Models are named as `ggml-model-{PACKING}_{TILE}.gguf`:

| File Name | Packing (BPW) | Tile Size | Comment |
|-----------|---------------|-----------|---------|
| `ggml-model-I1_V.gguf` | `I1_V` (1.60) | 1 | |
| `ggml-model-I1_V_2.gguf` | `I1_V` (1.60) | 2 | Recommended |
| `ggml-model-I2_V.gguf` | `I2_V` (2.00) | 1 | |
| `ggml-model-I2_V_4.gguf` | `I2_V` (2.00) | 4 | Recommended |
| `ggml-model-I2_V_8.gguf` | `I2_V` (2.00) | 8 | |

### Selection Guide

- **BPW vs. Speed**: `I1_V` achieves lower memory usage but may not always outperform `I2_V` in speed.
- **Tiling Trade-off**: Tiled variants (tile size > 1) deliver higher throughput but require larger cache capacity.
- **Starting Point**: Use `I1_V_2` or `I2_V_4` as a starting point.

For detailed tiling parameter analysis, see [Evaluation.md](https://github.com/Cipherxzc/vlut.cpp/blob/master/evaluation/Evaluation.md#tiling-parameters) and the paper.

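
To compare the variants on your own hardware, one option is to run the same `llama-bench` settings over every file in the table. The following is a minimal sketch, not part of vlut.cpp: it assumes vlut.cpp has been built as described under Usage and that the GGUF files sit in `./models`. `BENCH`, `MODEL_DIR`, and the `variant_file` helper are illustrative names introduced here. Variants that are not present locally are skipped.

```bash
#!/usr/bin/env bash
# Sketch: benchmark every variant from the table with identical settings.
# BENCH and MODEL_DIR are assumptions about the local layout; adjust as needed.
BENCH="${BENCH:-./build/bin/llama-bench}"
MODEL_DIR="${MODEL_DIR:-./models}"

# Build a file name from a packing and a tile size.
# Tile size 1 carries no suffix, matching the names in the table.
variant_file() {
  local packing=$1 tile=$2
  if [ "$tile" -eq 1 ]; then
    echo "ggml-model-${packing}.gguf"
  else
    echo "ggml-model-${packing}_${tile}.gguf"
  fi
}

# packing:tile pairs taken from the variants table
for spec in I1_V:1 I1_V:2 I2_V:1 I2_V:4 I2_V:8; do
  file="$MODEL_DIR/$(variant_file "${spec%%:*}" "${spec##*:}")"
  if [ -x "$BENCH" ] && [ -f "$file" ]; then
    # Same flags as the benchmark command in the Usage section
    "$BENCH" -m "$file" -t 1 -p 128 -n 0
  else
    echo "skipping $file (model or llama-bench not found)"
  fi
done
```

Comparing the reported throughput across variants should indicate which packing and tile size suits the target device; raising `-t` repeats the comparison with more threads.
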
## Usage

### Prerequisites

Install [vlut.cpp](https://github.com/Cipherxzc/vlut.cpp) (these models require vlut.cpp, **not** vanilla llama.cpp):

```bash
git clone https://github.com/Cipherxzc/vlut.cpp.git
cd vlut.cpp
cmake -B build && cmake --build build --config Release -j4
```

### Download & Run

```bash
# Download the recommended variant, e.g., I2_V_4
hf download \
  ggml-model-I2_V_4.gguf --local-dir ./models

# Run parallel inference
./build/bin/llama-batched \
  -m ./models/ggml-model-I2_V_4.gguf \
  -p "I believe the meaning of life is" \
  -np 32 -n 16 -t 1 --temp 0.5 --repeat-penalty 1.5

# Benchmark performance
./build/bin/llama-bench \
  -m ./models/ggml-model-I2_V_4.gguf \
  -t 1 -p 128 -n 0
```

For comprehensive usage instructions, refer to the [vlut.cpp Quick Start Guide](https://github.com/Cipherxzc/vlut.cpp/blob/master/README.md#quick-start).

## Citation

If you use these models, please cite our [paper](https://arxiv.org/abs/2512.06443):

```bibtex
@article{li2025veclut,
  title={Vec-LUT: Vector Table Lookup for Parallel Ultra-Low-Bit LLM Inference on Edge Devices},
  author={Li, Xiangyu and Yin, Chengyu and Wang, Weijun and Wei, Jianyu and Cao, Ting and Liu, Yunxin},
  journal={arXiv preprint arXiv:2512.06443},
  year={2025},
  url={https://arxiv.org/abs/2512.06443}
}
```

And the original Falcon3 work:

```bibtex
@misc{Falcon3,
  title = {The Falcon 3 family of Open Models},
  author = {TII Team},
  month = {December},
  year = {2024}
}
```