Falcon3-7B-Instruct-ATLAS (v2.10.0)

This repository contains a TQ1-quantized version of the official tiiuae/Falcon3-7B-Instruct model. It is packaged for the ATLAS Engine ecosystem and intended for native CPU inference with no GPU required.

Packed using the unified pack_to_atlas.py toolchain (v2.10.0) with BF16 weight scale correction.

Engine Specifications

| Property | Value |
| --- | --- |
| Format | ATLAS Binary (`.atlas`), format_version=2 |
| Quantization | TQ1.0: Ternary Weight Packing (Base-3, ~1.58 bits/weight) |
| Target | Native CPU: Intel AVX2 (Haswell 2013+), no GPU needed |
| File Size | 2.75 GB |
| Inference Speed | 3.2 tok/s (int4 FFN) |
| Description | 28 layers, hidden size 3072, intermediate size 23040 |

Architecture

| Component | Detail |
| --- | --- |
| Base Model | `tiiuae/Falcon3-7B-Instruct` |
| Architecture | falcon3 |
| Layers | 28 |
| Hidden Size | 3072 |
| Intermediate Size | 23040 |
| Attention Heads | 12 (GQA, 4 KV heads) |
| Head Dim | 256 |
| RoPE Theta | 1000042.0 |
| Vocabulary | 131080 |
| Context Window | 4096 (NTK-scalable up to 8192) |
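
The figures in the table can be cross-checked with a little arithmetic. The sketch below derives the attention projection widths and a rough KV cache size. The 2-byte cache element size is an assumption made for illustration; this card does not state which dtype the ATLAS engine uses for its cache.

```python
# Shape arithmetic derived from the architecture table.
LAYERS = 28
HIDDEN = 3072
Q_HEADS = 12
KV_HEADS = 4
HEAD_DIM = 256
CONTEXT = 4096

# ASSUMPTION: 2-byte (fp16/bf16) cache entries; the cache dtype is not documented here.
BYTES_PER_ELEM = 2

q_width = Q_HEADS * HEAD_DIM      # width of the query projection
kv_width = KV_HEADS * HEAD_DIM    # width of the key (or value) projection
gqa_group = Q_HEADS // KV_HEADS   # query heads sharing one KV head


def kv_cache_bytes(tokens: int) -> int:
    """Bytes needed to cache keys and values for `tokens` positions."""
    # Factor 2 covers one key tensor plus one value tensor per layer.
    return 2 * LAYERS * kv_width * BYTES_PER_ELEM * tokens


print("query width matches hidden size:", q_width == HIDDEN)
print("query heads per KV head:", gqa_group)
print("KV cache per token (bytes):", kv_cache_bytes(1))
print("KV cache at full context (MiB):", kv_cache_bytes(CONTEXT) / 2**20)
```

Under that assumption, grouped-query attention with 4 KV heads keeps the cache at one third of what 12 full KV heads would need.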

Verification

In pre-release sanity checks (v2.10.0), this quantized derivative produced correct, coherent output:

T=0 (argmax): "The capital of France is Paris." — correct deterministic output
T=0.7 (sampling): Coherent structured generation with sensible continuation
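
The two settings above differ only in how the next token is chosen from the model's logits. A minimal sketch of temperature and top-k sampling follows. The function name `sample_token` is hypothetical and this is a generic illustration of the technique, not the ATLAS engine's sampler.

```python
import math
import random


def sample_token(logits, temperature=0.7, top_k=40, rng=random):
    """Pick a token index from raw logits (generic sketch, not ATLAS code)."""
    if temperature <= 0.0:
        # T=0 is greedy decoding: always the highest-scoring token, hence deterministic.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Keep only the top_k highest-scoring candidates.
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    kept = ranked[:max(1, top_k)]
    # Softmax over the kept logits; subtracting the peak keeps exp() stable.
    peak = logits[kept[0]]
    weights = [math.exp((logits[i] - peak) / temperature) for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]
```

With `temperature=0.0` the same prompt always yields the same token, which is why the T=0 check is reproducible while the T=0.7 check can only be judged for coherence.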

Prompt Template (Falcon3 Instruct)

Use the standard Falcon3 Instruct chat template:

<|role|>
{content}
<|endoftext|>

Example Sequence

<|user|>
Explain quantum computing in one sentence.
<|assistant|>
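
As a sketch, the example sequence can be assembled from a list of messages. The helper name `build_falcon3_prompt` is hypothetical, and the sketch assumes each turn is rendered as a role tag, a newline, the content, and a newline, exactly as in the example above. It does not place `<|endoftext|>` after completed assistant turns, so check the base model's chat template before relying on it for multi-turn conversations.

```python
def build_falcon3_prompt(messages):
    """Render chat messages in the layout of the example sequence.

    ASSUMPTION: one "<|role|>\\n{content}\\n" block per message, followed by an
    open "<|assistant|>" tag that asks the model to generate the reply.
    """
    parts = []
    for msg in messages:
        parts.append(f"<|{msg['role']}|>\n{msg['content']}\n")
    parts.append("<|assistant|>\n")
    return "".join(parts)


prompt = build_falcon3_prompt(
    [{"role": "user", "content": "Explain quantum computing in one sentence."}]
)
print(prompt)
```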

Usage

Python

Clone the ATLAS repository (shell):

git clone https://github.com/xxxn3m3s1sxxx/ATLAS-TQ1_0.git

Then run inference through the Python wrapper:

from atlas_infer import AtlasModel

model = AtlasModel("falcon3-7B-Instruct-tq1.atlas")
output = model.generate_c(
    "What is the capital of France?",
    max_new_tokens=100,
    temperature=0.7,
    top_k=40,
)
print(output)

C++ CLI (standalone, no Python required)

atlas --model falcon3-7B-Instruct-tq1.atlas --prompt "What is the capital of France?" --max-tokens 100

SSE Web Server

Start the server:

python atlas_server.py --model falcon3-7B-Instruct-tq1.atlas --port 8080

Then query it from a second terminal:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is the capital of France?", "max_tokens": 100}'
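
The same request can be sent from Python using only the standard library. This is a sketch under assumptions: the endpoint and JSON fields are copied from the curl command above, but the response schema is not documented in this card, so the code only yields the raw payload of each `data:` line. The helper names are hypothetical.

```python
import json
import urllib.request


def build_request(prompt, max_tokens=100, base_url="http://localhost:8080"):
    """Build the POST request that mirrors the curl example."""
    payload = json.dumps({"prompt": prompt, "max_tokens": max_tokens}).encode("utf-8")
    return urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def iter_sse_data(lines):
    """Yield the payload of each `data:` line in a server-sent event stream."""
    for raw in lines:
        line = raw.decode("utf-8").rstrip("\r\n")
        if line.startswith("data:"):
            yield line[len("data:"):].strip()


def stream_completion(prompt, max_tokens=100):
    """Send the prompt and yield raw event payloads as they arrive.

    ASSUMPTION: the server streams SSE `data:` lines; their content is not parsed here.
    """
    with urllib.request.urlopen(build_request(prompt, max_tokens)) as resp:
        yield from iter_sse_data(resp)
```

With the server running, `for chunk in stream_completion("What is the capital of France?"): print(chunk)` prints each event payload as it arrives.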

What is ATLAS?

ATLAS is a CPU inference engine for BitNet b1.58 ternary-quantized models. It repacks Hugging Face safetensors into the TQ1.0 format (Base-3 encoding with 5 trits per byte, which stores 1.6 bits/weight, close to the log2(3) ≈ 1.58-bit theoretical minimum) and runs fast inference through a C++ DLL with a Python wrapper.
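
To illustrate why five ternary values fit in one byte (3^5 = 243 ≤ 256), here is a minimal sketch of Base-3 packing. The digit order and the mapping of -1/0/+1 to digits 0/1/2 are assumptions made for illustration; the actual TQ1.0 byte layout used by ATLAS is not documented in this card.

```python
def pack_trits(trits):
    """Pack ternary weights (-1, 0, +1) into bytes, five per byte, in base 3.

    ASSUMPTION: the first weight of each group is the least significant digit
    and a final partial group is padded with zero digits.
    """
    out = bytearray()
    for start in range(0, len(trits), 5):
        value = 0
        for pos, t in enumerate(trits[start:start + 5]):
            value += (t + 1) * 3 ** pos   # map -1/0/+1 to digit 0/1/2
        out.append(value)                 # at most 3**5 - 1 = 242, fits in a byte
    return bytes(out)


def unpack_trits(data, count):
    """Inverse of pack_trits; `count` trims the padding of the last byte."""
    trits = []
    for byte in data:
        for _ in range(5):
            trits.append(byte % 3 - 1)
            byte //= 3
    return trits[:count]


weights = [-1, 0, 1, 1, -1, 0, 0]
packed = pack_trits(weights)
print(len(packed), unpack_trits(packed, len(weights)) == weights)
```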

| Feature | Description |
| --- | --- |
| No GPU required | Runs on any x86-64 CPU with AVX2 (Intel Haswell 2013+, AMD Excavator 2015+) |
| Hybrid matmul | FFN tensors in int8, QKV/O in TQ1-packed, per-tensor dispatch |
| int4 FFN mode | Halves FFN memory bandwidth for 18-26% speedup (7B/10B) |
| f32 bypass | Auto-enabled for small models (≤1B) and SubLN architectures |
| Ring buffer KV cache | Extended context via NTK-aware RoPE scaling |
| Standalone C++ CLI | No Python or PyTorch required at runtime |
| SSE web server | FastAPI-based `/v1/chat/completions` with prompt caching |
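
For context on the NTK-aware RoPE scaling entry, the sketch below applies the commonly used NTK-aware rule, which enlarges the RoPE base so that the lowest rotary frequency is divided by the context scale factor. Whether ATLAS uses exactly this formula is an assumption; the function names are hypothetical and the inputs (theta 1000042.0, head dim 256, 4096 to 8192 tokens) come from the tables in this card.

```python
def ntk_scaled_theta(base_theta, head_dim, scale):
    """NTK-aware base adjustment: theta' = theta * scale ** (d / (d - 2))."""
    return base_theta * scale ** (head_dim / (head_dim - 2))


def rope_inv_freqs(theta, head_dim):
    """Inverse rotary frequencies, one per pair of head dimensions."""
    return [theta ** (-2 * i / head_dim) for i in range(head_dim // 2)]


# Extending the 4096-token window to 8192 is a scale factor of 2.
scaled_theta = ntk_scaled_theta(1000042.0, 256, 8192 / 4096)
original = rope_inv_freqs(1000042.0, 256)
extended = rope_inv_freqs(scaled_theta, 256)

# The highest frequency is unchanged; the lowest is halved.
print(original[0], extended[0])
print(extended[-1] / original[-1])
```

Leaving the high frequencies untouched preserves short-range position resolution, while stretching the low frequencies lets the longest wavelengths cover the extended window.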

License & Usage Restrictions

This is a quantized derivative work based on Falcon3 architectures developed by the Technology Innovation Institute (TII).

By downloading or utilizing this file, you agree to be bound by the TII Falcon-LLM License 2.0:

Attribution: Any usage or secondary deployment must credit the Technology Innovation Institute (TII).
Non-Commercial & Small Commercial Use: Free for academic research, personal projects, and commercial entities with annual revenue under $1,000,000 USD.
Commercial Royalty Terms: Entities exceeding the $1M annual revenue threshold are subject to a 10% licensing fee on revenue exceeding that amount, as specified in the master Falcon3 license terms.

The ATLAS engine itself is Apache 2.0 licensed — see github.com/xxxn3m3s1sxxx/ATLAS-TQ1_0.
