---
license: other
license_name: falcon-llm-license
license_link: https://falconllm.tii.ae/falcon-terms-and-conditions.html
language:
- en
tags:
- ternary
- quantized
- atlas
- tq1
- cpu-optimized
- falcon-e
- llm
- edge
- cpu-inference
- bitnet
- cpu-llm
- edge-ai
- no-gpu
- efficient-inference
base_model: tiiuae/Falcon-E-1B-Instruct
pipeline_tag: text-generation
library_name: atlas
---
# Falcon-E-1B-Instruct-1.58bit-ATLAS (v2.10.0)

This repository contains a **TQ1-quantized version** of the official `tiiuae/Falcon-E-1B-Instruct` model, packed for the **ATLAS Engine** and intended for native, low-latency CPU inference with no GPU required.

> Packed using the unified `pack_to_atlas.py` toolchain (v2.10.0) with BF16 weight scale correction.

---

## Engine Specifications

| Property | Value |
|---|---|
| **Format** | ATLAS Binary (`.atlas`), format_version=2 |
| **Quantization** | TQ1.0 — Ternary Weight Packing (Base-3, ~1.58 bits/weight) |
| **Target** | Native CPU — Intel AVX2 (Haswell 2013+), no GPU needed |
| **File Size** | 0.56 GB |
| **Inference Speed** | 13 tok/s (hybrid) |
| **Description** | 24 layers, 2048 hidden, 9216 intermediate — lightweight Falcon Edge series |

### Architecture

| Component | Detail |
|---|---|
| Base Model | `tiiuae/Falcon-E-1B-Instruct` |
| Architecture | llama |
| Layers | 24 |
| Hidden Size | 2048 |
| Intermediate Size | 9216 |
| Attention Heads | 16 (GQA, 2 KV heads) |
| Head Dim | 128 |
| RoPE Theta | 1000000 |
| Vocabulary | 32768 |
| Context Window | 32768 (NTK-scalable up to 65536) |
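The attention figures in the table determine how much memory the KV cache needs per token. The sketch below works through that arithmetic. It assumes a 16-bit cache element, which is an assumption (the cache dtype ATLAS actually uses is not stated here), and `kv_cache_bytes` is a hypothetical helper name, not part of the engine.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_elem=2):
    """Bytes needed to store keys and values for n_tokens positions.

    bytes_per_elem=2 assumes a 16-bit cache; this is an assumption,
    not a documented ATLAS setting.
    """
    # Factor 2 covers one key vector and one value vector per KV head.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * n_tokens


# Values taken from the architecture table: 24 layers, 2 KV heads, head dim 128.
full = kv_cache_bytes(n_layers=24, n_kv_heads=2, head_dim=128, n_tokens=32768)
print(f"{full / 2**20:.0f} MiB")  # 768 MiB
```

With only 2 KV heads, GQA keeps the cache at one eighth of what 16 full KV heads would need under the same assumptions.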


### Verification

In pre-release spot checks (v2.10.0), this quantized derivative produced correct, coherent output:
- **T=0 (argmax):** `"The capital of France is Paris."` — correct deterministic output
- **T=0.7 (sampling):** Coherent structured generation with sensible continuation

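> *Note on scale mathematics:* the legacy dequantization path divides by the scale factor rather than multiplying. Provided the factor is a single positive constant shared by all logits of a forward pass, the ordering of the logits is unchanged, so greedy decoding (T=0) yields the same tokens. Softmax is invariant to additive shifts but not to rescaling, so under sampling (T>0) the rescaling behaves like a change in effective temperature rather than leaving the distribution identical.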
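A quick numeric check of how a uniform rescaling of logits interacts with softmax may help here. This is a standalone sketch, not ATLAS code: the logits and the `scale` value are made up for illustration, and it assumes one positive factor shared by every logit.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(xs):
    return max(range(len(xs)), key=xs.__getitem__)


logits = [2.0, 1.0, 0.5]  # illustrative values only
scale = 4.0               # hypothetical positive scale factor

multiplied = [x * scale for x in logits]
divided = [x / scale for x in logits]

# The top-ranked token is the same whichever way the scale is applied.
same_argmax = argmax(multiplied) == argmax(divided) == argmax(logits)
print("same argmax:", same_argmax)

# The probabilities differ: dividing flattens the distribution,
# multiplying sharpens it, like a temperature change.
p_mul = softmax(multiplied)
p_div = softmax(divided)
print([round(p, 3) for p in p_mul])
print([round(p, 3) for p in p_div])
```

The ordering is preserved, so greedy decoding is unaffected. The sampled distribution is not: the two paths differ by an effective temperature factor of `scale ** 2`.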

---

## Prompt Template

Format prompts with the chat template below. Prompts that deviate from it can degrade output quality:

```
<|im_start|>system
{system_message}<|im_end|>
<|im_start|>user
{user_message}<|im_end|>
<|im_start|>assistant
```

### Example Sequence

```
<|im_start|>user
Explain quantum computing in one sentence.<|im_end|>
<|im_start|>assistant
```
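The template can be rendered programmatically. The helper below is a minimal sketch: `build_prompt` is a hypothetical name, not part of the ATLAS API. It also assumes the system block may be left out, as in the example sequence, and that the prompt ends with a newline after the assistant tag.

```python
def build_prompt(user_message, system_message=None):
    """Render a single-turn prompt in the chat template shown above."""
    parts = []
    if system_message is not None:
        parts.append(f"<|im_start|>system\n{system_message}<|im_end|>\n")
    parts.append(f"<|im_start|>user\n{user_message}<|im_end|>\n")
    # Leave the assistant turn open so the model writes the reply.
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = build_prompt("Explain quantum computing in one sentence.")
print(prompt)
```

The resulting string can be passed as the prompt argument in the usage examples that follow.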

---

## Usage

### Python

```bash
git clone https://github.com/xxxn3m3s1sxxx/ATLAS-TQ1_0.git
```

```python
from atlas_infer import AtlasModel

model = AtlasModel("Falcon-E-1B-Instruct-1.58bit-ATLAS.tq1.atlas")
output = model.generate_c(
    "What is the capital of France?",
    max_new_tokens=100,
    temperature=0.7,
    top_k=40,
)
print(output)
```

### C++ CLI (standalone, no Python required)

```bash
atlas --model Falcon-E-1B-Instruct-1.58bit-ATLAS.tq1.atlas --prompt "What is the capital of France?" --max-tokens 100
```

### SSE Web Server

```bash
# Start the server (it keeps running in this terminal)
python atlas_server.py --model Falcon-E-1B-Instruct-1.58bit-ATLAS.tq1.atlas --port 8080

# From a second terminal, send a request
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is the capital of France?", "max_tokens": 100}'
```
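Because the server streams Server-Sent Events, a client has to split the response into `data:` payloads. The sketch below shows only that parsing step, following the generic SSE framing rules. The JSON shape of each event (`{"text": ...}`) is a made-up placeholder, since the actual payload format of `atlas_server.py` is not documented here.

```python
import json


def iter_sse_data(lines):
    """Yield the data payload of each SSE event from an iterable of text lines."""
    buffer = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("data:"):
            value = line[5:]
            if value.startswith(" "):
                value = value[1:]  # SSE strips one leading space after the colon
            buffer.append(value)
        elif line == "" and buffer:
            # A blank line ends the event; multiple data lines join with "\n".
            yield "\n".join(buffer)
            buffer = []
    if buffer:
        # Lenient choice: flush a trailing event that lacks a final blank line.
        yield "\n".join(buffer)


# Hypothetical event payloads, for illustration only.
sample = ['data: {"text": "Par"}\n', "\n", 'data: {"text": "is"}\n', "\n"]
chunks = [json.loads(payload)["text"] for payload in iter_sse_data(sample)]
print("".join(chunks))  # Paris
```

In a real client, `lines` would be the decoded lines of the HTTP response body rather than a hard-coded list.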

---

## What is ATLAS?

**ATLAS** is a CPU inference engine for BitNet b1.58 ternary-quantized models. It repacks Hugging Face safetensors checkpoints into the **TQ1.0 format** (5 trits per byte in base-3 encoding, i.e. 1.6 bits per weight as stored, close to the log2(3) ≈ 1.58 bits of information in a ternary value) and runs inference through a C++ shared library (DLL) with a Python wrapper.
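The base-3 packing idea can be illustrated in a few lines. This is a sketch of the general technique only. The digit order, the mapping of -1/0/+1 to digits 0/1/2, and the zero padding are assumptions, so the real `.atlas` byte layout may differ.

```python
def pack_trits(trits):
    """Pack ternary weights (-1, 0, +1) into bytes, 5 per byte, base-3."""
    out = bytearray()
    for i in range(0, len(trits), 5):
        group = list(trits[i:i + 5])
        group += [0] * (5 - len(group))  # pad the last group with zeros
        value = 0
        for t in reversed(group):
            value = value * 3 + (t + 1)  # map -1/0/+1 to base-3 digits 0/1/2
        out.append(value)  # at most 3**5 - 1 = 242, so it fits in one byte
    return bytes(out)


def unpack_trits(data, count):
    """Inverse of pack_trits; count drops the padding of the last group."""
    trits = []
    for byte in data:
        for _ in range(5):
            trits.append(byte % 3 - 1)  # least significant digit first
            byte //= 3
    return trits[:count]


weights = [1, -1, 0, 0, 1, -1, 1]
packed = pack_trits(weights)
assert unpack_trits(packed, len(weights)) == weights
print(len(packed), "bytes for", len(weights), "weights")  # 2 bytes for 7 weights
```

Five trits fit in a byte because 3^5 = 243 ≤ 256, while six would need 729 values.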

| Feature | Description |
|---|---|
| **No GPU required** | Runs on any x86-64 CPU with AVX2 (Intel Haswell 2013+, AMD Excavator 2015+) |
| **Hybrid matmul** | FFN tensors in int8, QKV/O in TQ1-packed, per-tensor dispatch |
| **int4 FFN mode** | Halves FFN memory bandwidth for 18-26% speedup (7B/10B) |
| **f32 bypass** | Auto-enabled for small models (≤1B) and SubLN architectures |
| **Ring buffer KV cache** | Extended context via NTK-aware RoPE scaling |
| **Standalone C++ CLI** | No Python or PyTorch required at runtime |
| **SSE web server** | FastAPI-based `/v1/chat/completions` with prompt caching |
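For the extended-context row, the sketch below shows the commonly used NTK-aware rule for rescaling the RoPE base. It assumes the widely circulated formula `theta * scale ** (d / (d - 2))`. Whether ATLAS applies exactly this rule is not confirmed here, and the function names are hypothetical.

```python
def ntk_scaled_theta(theta, head_dim, scale):
    """Rescale the RoPE base for a context window extended by `scale`.

    Uses the common NTK-aware rule; assumed, not confirmed, for ATLAS.
    """
    return theta * scale ** (head_dim / (head_dim - 2))


def rope_inv_freq(theta, head_dim):
    """Per-pair rotation frequencies used by RoPE."""
    return [theta ** (-2 * i / head_dim) for i in range(head_dim // 2)]


base = 1_000_000.0         # RoPE theta from the architecture table
factor = 65536 / 32768     # extending 32768 tokens to 65536
scaled = ntk_scaled_theta(base, head_dim=128, scale=factor)

old = rope_inv_freq(base, 128)
new = rope_inv_freq(scaled, 128)
print(f"scaled theta: {scaled:.0f}")
print("highest frequency unchanged:", old[0] == new[0])
print("lowest frequency ratio:", old[-1] / new[-1])
```

Under this rule the highest frequency is untouched, which preserves short-range positional detail, while the lowest frequency is slowed by the full scale factor to cover the longer window.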

### Links

- **Engine source code**: [github.com/xxxn3m3s1sxxx/ATLAS-TQ1_0](https://github.com/xxxn3m3s1sxxx/ATLAS-TQ1_0)
- **Original model**: [`tiiuae/Falcon-E-1B-Instruct`](https://huggingface.co/tiiuae/Falcon-E-1B-Instruct)

---

## License
| Component | License |
|-----------|---------|
| Base Model (`tiiuae/Falcon-E-1B-Instruct`) | [Falcon LLM License](https://falconllm.tii.ae/falcon-terms-and-conditions.html) |
| ATLAS Engine | [Apache 2.0](https://github.com/xxxn3m3s1sxxx/ATLAS-TQ1_0) |
| This Quantized Derivative | [Falcon LLM License](https://falconllm.tii.ae/falcon-terms-and-conditions.html), as declared in this repository's metadata |