---
license: other
license_name: falcon-llm-license
license_link: https://falconllm.tii.ae/falcon-terms-and-conditions.html
language:
- en
tags:
- ternary
- quantized
- atlas
- tq1
- cpu-optimized
- falcon3
- llm
- cpu-inference
- bitnet
base_model: tiiuae/Falcon3-3B-Base
pipeline_tag: text-generation
library_name: atlas
---

# Falcon3-3B-Base-1.58bit-ATLAS (v2.10.0)

This repository contains a highly optimized **TQ1 quantized version** of the official `tiiuae/Falcon3-3B-Base` model for the **ATLAS Engine** ecosystem, designed for native, ultra-low-latency CPU inference without any GPU requirement.

> Packed using the unified `pack_to_atlas.py` toolchain (v2.10.0) with BF16 weight scale correction.

---

## Engine Specifications

| Property | Value |
|---|---|
| **Format** | ATLAS Binary (`.atlas`), format_version=2 |
| **Quantization** | TQ1.0: Ternary Weight Packing (Base-3, ~1.58 bits/weight) |
| **Target** | Native CPU: Intel AVX2 (Haswell 2013+), no GPU needed |
| **File Size** | 2.11 GB |
| **Inference Speed** | 7.1 tok/s (hybrid) |
| **Description** | 22 layers, 3072 hidden, 9216 intermediate (TII Base variant) |

### Architecture

| Component | Detail |
|---|---|
| Base Model | `tiiuae/Falcon3-3B-Base` |
| Architecture | falcon3 |
| Layers | 22 |
| Hidden Size | 3072 |
| Intermediate Size | 9216 |
| Attention Heads | 12 (GQA, 4 KV heads) |
| Head Dim | 256 |
| RoPE Theta | 1000042.0 |
| Vocabulary | 131072 |
| Context Window | 4096 (NTK-scalable up to 8192) |

### Verification

During pre-release evaluation (v2.10.0), this quantized derivative demonstrated correct convergence:

- **T=0 (argmax):** `"The capital of France is Paris."` (correct deterministic output)
- **T=0.7 (sampling):** Coherent structured generation with sensible continuation

> *Note on scale mathematics:* the legacy dequantization path divides by the scale factor rather than multiplying.
> Assuming the factor is a positive constant across all logits for any given output row, the ranking of the logits is preserved, so greedy (T=0) output is unaffected. Under sampling, a uniform scale on the logits behaves like a change of temperature rather than leaving the softmax distribution strictly identical.

---

## Prompt Format

This is a **Base model**: it generates raw text continuation without instruction-following. Simply provide your prompt:

```
{prompt}
```

---

## Usage

### Python

```bash
git clone https://github.com/xxxn3m3s1sxxx/ATLAS-TQ1_0.git
```

```python
from atlas_infer import AtlasModel

model = AtlasModel("Falcon3-3B-Base-tq1.atlas")
output = model.generate_c(
    "What is the capital of France?",
    max_new_tokens=100,
    temperature=0.7,
    top_k=40,
)
print(output)
```

### C++ CLI (standalone, no Python required)

```bash
atlas --model Falcon3-3B-Base-tq1.atlas --prompt "What is the capital of France?" --max-tokens 100
```

### SSE Web Server

```bash
python atlas_server.py --model Falcon3-3B-Base-tq1.atlas --port 8080

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is the capital of France?", "max_tokens": 100}'
```

---

## What is ATLAS?

**ATLAS** is a CPU inference engine for BitNet b1.58 ternary-quantized models. It repacks Hugging Face safetensors into the **TQ1.0 format** (5 trits per byte, Base-3 encoding, ~1.58 bits/weight) and runs fast inference via a C++ DLL + Python wrapper.

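
The "5 trits per byte" packing works because 3^5 = 243 fits in the 256 values of one byte, which gives 8/5 = 1.6 stored bits per weight, close to the log2(3) ≈ 1.585 bits a ternary value carries. The sketch below illustrates the idea with a plain base-3 positional encoding. It is not taken from the ATLAS source: the function names `pack_trits` and `unpack_trits`, the mapping of -1/0/+1 to digits 0/1/2, and the least-significant-trit-first order are all assumptions made for illustration, and the real `.atlas` byte layout may differ.

```python
TRITS_PER_BYTE = 5  # 3**5 = 243 <= 256, so five trits fit in one byte


def pack_trits(weights):
    """Pack ternary weights (-1, 0, +1) into bytes, five weights per byte.

    Hypothetical helper: the digit mapping and order are assumptions,
    not the documented ATLAS layout.
    """
    packed = bytearray()
    for start in range(0, len(weights), TRITS_PER_BYTE):
        group = weights[start:start + TRITS_PER_BYTE]
        value = 0
        # Least-significant trit first; map -1/0/+1 to base-3 digits 0/1/2.
        for position, w in enumerate(group):
            if w not in (-1, 0, 1):
                raise ValueError(f"not a ternary weight: {w}")
            value += (w + 1) * 3 ** position
        packed.append(value)
    return bytes(packed)


def unpack_trits(packed, count):
    """Recover `count` ternary weights from bytes produced by pack_trits."""
    weights = []
    for byte in packed:
        for _ in range(TRITS_PER_BYTE):
            byte, digit = divmod(byte, 3)
            weights.append(digit - 1)
    # Drop the padding trits decoded from a partially filled last byte.
    return weights[:count]


weights = [1, -1, 0, 0, 1, -1, 1]
packed = pack_trits(weights)
assert unpack_trits(packed, len(weights)) == weights  # round-trip check
```

The caller has to track the original weight count, since a partially filled final byte decodes to extra padding values that `unpack_trits` trims off.
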
| Feature | Description |
|---|---|
| **No GPU required** | Runs on any x86-64 CPU with AVX2 (Intel Haswell 2013+, AMD Excavator 2015+) |
| **Hybrid matmul** | FFN tensors in int8, QKV/O in TQ1-packed, per-tensor dispatch |
| **int4 FFN mode** | Halves FFN memory bandwidth for 18-26% speedup (7B/10B) |
| **f32 bypass** | Auto-enabled for small models (≤1B) and SubLN architectures |
| **Ring buffer KV cache** | Extended context via NTK-aware RoPE scaling |
| **Standalone C++ CLI** | No Python or PyTorch required at runtime |
| **SSE web server** | FastAPI-based `/v1/chat/completions` with prompt caching |

### Links

- **Engine source code**: [github.com/xxxn3m3s1sxxx/ATLAS-TQ1_0](https://github.com/xxxn3m3s1sxxx/ATLAS-TQ1_0)
- **Original model**: [`tiiuae/Falcon3-3B-Base`](https://huggingface.co/tiiuae/Falcon3-3B-Base)

---

## License & Usage Restrictions

This is a **quantized derivative work** based on the **Falcon3** series (original model by the **Technology Innovation Institute (TII)**), originally released under the **Falcon-LLM License**.

By downloading or utilizing this file, you agree to be bound by the **Falcon-LLM License**:

1. **Attribution:** Any usage or secondary deployment must credit the Technology Innovation Institute (TII).
2. **Non-Commercial & Small Commercial Use:** Free for academic research, personal projects, and commercial entities with **annual revenue under $1,000,000 USD**.
3. **Commercial Hosting:** Entities intending to provide shared, managed hosting of the model or its derivatives as a service must enter into a separate license arrangement with TII.

*Disclaimer: This quantized file is provided "as-is". The ATLAS engine itself is **Apache 2.0 licensed**.*