TriLM-3.9B-ATLAS

This repository contains a TQ1-quantized version of the official SpectraSuite/TriLM_3.9B_Unpacked model, packaged for the ATLAS Engine ecosystem. It targets native, low-latency CPU inference and requires no GPU.

The model was packed with the unified pack_to_atlas.py toolchain (v2.10.0), with BF16 weight scale correction applied.

Engine Specifications

Property	Value
Format	ATLAS Binary (`.atlas`), format_version=2
Quantization	TQ1.0 — Ternary Weight Packing (Base-3, ~1.58 bits/weight)
Target	Native CPU — Intel AVX2 (Haswell 2013+), no GPU needed
File Size	1.32 GB
Inference Speed	~10 tok/s (hybrid+int8)
Description	30 layers, 3072 hidden, 9216 intermediate — standard Llama (no SubLN)

Architecture

Component	Detail
Base Model	`SpectraSuite/TriLM_3.9B_Unpacked`
Architecture	trilm
Layers	30
Hidden Size	3072
Intermediate Size	9216
Attention Heads	24 (GQA configuration with 24 KV heads, equivalent to standard multi-head attention)
Head Dim	128
RoPE Theta	10000
Vocabulary	50688
Context Window	4096
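
The shapes in the table can be cross-checked with some quick arithmetic. The sketch below is a rough estimate only: it assumes a Llama-style block (Q, K, V, O projections plus a gated FFN with three matrices), untied input and output embeddings, and it ignores norm weights. None of these assumptions is confirmed by the table itself. The last step uses the 5-trits-per-byte packing that TQ1.0 is described as using.

```python
# Shape values copied from the Architecture table.
LAYERS, HIDDEN, INTERMEDIATE = 30, 3072, 9216
HEADS, KV_HEADS, VOCAB = 24, 24, 50688

head_dim = HIDDEN // HEADS      # 3072 / 24
kv_dim = KV_HEADS * head_dim    # width of the K and V projections

# Assumption: Llama-style block with Q, K, V, O projections and a gated FFN
# (gate, up, down). Norm weights are small and ignored here.
attn_params = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * kv_dim
ffn_params = 3 * HIDDEN * INTERMEDIATE
linear_params = LAYERS * (attn_params + ffn_params)

# Assumption: untied input embedding and output head.
embed_params = 2 * VOCAB * HIDDEN
total_params = linear_params + embed_params

# At 5 trits per byte, the ternary linear weights need ceil(n / 5) bytes.
packed_linear_bytes = -(-linear_params // 5)

print(f"head_dim            = {head_dim}")
print(f"linear parameters   = {linear_params / 1e9:.2f} B")
print(f"total parameters    = {total_params / 1e9:.2f} B")
print(f"packed linear bytes = {packed_linear_bytes / 1e9:.2f} GB")
```

Under these assumptions the head dimension comes out at 128, matching the table, and the total lands close to the nominal 3.9B. The remainder of the file size would be embeddings, scales, and metadata, whose storage format is not stated here.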

Verification

During pre-release evaluation, this quantized derivative produced correct output in the following sanity checks:

T=0 (argmax): "The capital of France is Paris." — correct deterministic output
T=0.7 (sampling): Coherent structured generation with sensible continuation

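Note on scale mathematics: the legacy dequantization path divides by the scale factor rather than multiplying. Where this amounts to a single positive constant applied to every logit of an output row, the ranking of the logits is unchanged, so T=0 (argmax) output is unaffected. Softmax is invariant to adding a constant, not to multiplying by one, so for sampling the constant acts like a change in effective temperature rather than leaving the distribution strictly identical.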
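
How a constant positive factor on the logits interacts with argmax and softmax can be checked numerically. The following is a minimal sketch in plain Python; the logit values and the factor `c` are made up for illustration and are not taken from the model or from the ATLAS code.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    return max(range(len(values)), key=values.__getitem__)

logits = [2.0, 1.0, 0.5, -1.0]   # illustrative values, not model output
c = 4.0                          # hypothetical constant positive factor

scaled_logits = [c * x for x in logits]    # multiplicative constant
shifted_logits = [x + c for x in logits]   # additive constant

same_argmax = argmax(logits) == argmax(scaled_logits)
p = softmax(logits)
p_scaled = softmax(scaled_logits)
p_shifted = softmax(shifted_logits)
p_temp = softmax(logits, temperature=1.0 / c)

print("argmax preserved under scaling:", same_argmax)
print("softmax(logits)     =", [round(v, 4) for v in p])
print("softmax(c * logits) =", [round(v, 4) for v in p_scaled])
print("softmax(logits + c) =", [round(v, 4) for v in p_shifted])
```

The ranking survives the multiplication, the additive shift leaves the probabilities untouched, and `softmax(c * logits)` equals `softmax(logits)` evaluated at temperature `1 / c`.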

Prompt Template

This is a base model — no chat template. Use raw text continuation:

The capital of France is

Usage

Python

Clone the engine repository first (shell):

git clone https://github.com/xxxn3m3s1sxxx/ATLAS-TQ1_0.git

Then run inference from Python:

from atlas_infer import AtlasModel

model = AtlasModel("TriLM-3.9B-ATLAS.tq1.atlas")
output = model.generate_c(
    "The capital of France is",
    max_new_tokens=50,
    temperature=0.7,
    top_k=40,
)
print(output)
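
For readers unfamiliar with the `temperature` and `top_k` arguments, the sketch below shows what these parameters conventionally mean in a sampler. It is a generic illustration and not the ATLAS implementation; `sample_top_k` and the toy logits are hypothetical names and values introduced here.

```python
import math
import random

def sample_top_k(logits, temperature=0.7, top_k=40, rng=random):
    """Pick a token id from raw logits using temperature and top-k filtering."""
    if temperature <= 0:
        # T=0 is treated as greedy decoding (argmax).
        return max(range(len(logits)), key=logits.__getitem__)
    # Keep the ids of the k highest-scoring tokens.
    k = min(top_k, len(logits))
    top_ids = sorted(range(len(logits)), key=logits.__getitem__, reverse=True)[:k]
    # Unnormalized softmax weights over the survivors at the given temperature.
    peak = logits[top_ids[0]]
    weights = [math.exp((logits[i] - peak) / temperature) for i in top_ids]
    return rng.choices(top_ids, weights=weights, k=1)[0]

toy_logits = [0.1, 3.2, 2.9, -1.0, 0.5]   # illustrative values, not model output
greedy_id = sample_top_k(toy_logits, temperature=0.0)
sampled_id = sample_top_k(toy_logits, temperature=0.7, top_k=2, rng=random.Random(0))

print("greedy:", greedy_id, "sampled:", sampled_id)
```

Lower temperatures concentrate probability on the highest logits, and `top_k` discards everything outside the k best candidates before sampling.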

C++ CLI (standalone, no Python required)

atlas --model TriLM-3.9B-ATLAS.tq1.atlas --prompt "The capital of France is" --max-tokens 50

SSE Web Server

python atlas_server.py --model TriLM-3.9B-ATLAS.tq1.atlas --port 8080
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"prompt": "The capital of France is", "max_tokens": 50}'
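
The same request can be sent from Python with only the standard library. This is a hedged sketch: it mirrors the URL and JSON body of the curl command, reads the response as generic Server-Sent Events (`data:` lines), and makes no assumption about the payload schema, which is not documented here. The helper names `build_request`, `iter_sse_data`, and `stream_completion` are hypothetical.

```python
import json
import urllib.request

def build_request(prompt, max_tokens=50,
                  url="http://localhost:8080/v1/chat/completions"):
    """Build the same POST request as the curl command."""
    body = json.dumps({"prompt": prompt, "max_tokens": max_tokens}).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )

def iter_sse_data(lines):
    """Yield the payload of each `data:` line from an iterable of byte lines."""
    for raw in lines:
        line = raw.decode("utf-8").rstrip("\r\n")
        if line.startswith("data:"):
            value = line[len("data:"):]
            # Per the SSE format, one leading space after the colon is dropped.
            yield value[1:] if value.startswith(" ") else value

def stream_completion(prompt, max_tokens=50):
    """Print raw event payloads as they arrive (requires a running server)."""
    with urllib.request.urlopen(build_request(prompt, max_tokens)) as resp:
        for payload in iter_sse_data(resp):
            print(payload, flush=True)
```

With the server running, calling `stream_completion("The capital of France is")` prints each event payload as received; decoding the payloads further depends on the server's actual response format.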

What is ATLAS?

ATLAS is a CPU inference engine for BitNet b1.58 ternary-quantized models. It repacks Hugging Face safetensors into the TQ1.0 format (5 trits per byte in base-3 encoding, which stores 1.6 bits per weight against the theoretical log2(3) ≈ 1.58) and runs inference through a C++ DLL with a Python wrapper.
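
The base-3 packing idea can be illustrated in a few lines of Python. This is a minimal sketch of packing five ternary weights into one byte, possible because 3^5 = 243 fits in the 0..255 range. The digit order (first weight as least-significant digit) and the function names are assumptions for illustration; the actual TQ1.0 byte layout may differ.

```python
TRITS_PER_BYTE = 5  # 3**5 = 243 distinct values fit in one byte

def pack_trits(weights):
    """Pack ternary weights (-1, 0, +1) into bytes, 5 per byte, base-3."""
    out = bytearray()
    for start in range(0, len(weights), TRITS_PER_BYTE):
        group = weights[start:start + TRITS_PER_BYTE]
        value = 0
        # Assumption: the first weight is the least-significant base-3 digit.
        for pos, w in enumerate(group):
            if w not in (-1, 0, 1):
                raise ValueError(f"not a ternary weight: {w}")
            value += (w + 1) * 3 ** pos   # map -1/0/+1 to digits 0/1/2
        out.append(value)
    return bytes(out)

def unpack_trits(packed, count):
    """Inverse of pack_trits; `count` trims padding from the last byte."""
    weights = []
    for byte in packed:
        for _ in range(TRITS_PER_BYTE):
            byte, digit = divmod(byte, 3)
            weights.append(digit - 1)
    return weights[:count]

weights = [1, -1, 0, 0, 1, -1, 1]
packed = pack_trits(weights)
restored = unpack_trits(packed, len(weights))
print(list(packed), restored == weights)
```

Seven weights fit in two bytes here. A production kernel would typically decode through lookup tables or SIMD arithmetic rather than per-digit division.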

Feature	Description
No GPU required	Runs on any x86-64 CPU with AVX2 (Intel Haswell 2013+, AMD Excavator 2015+)
Hybrid matmul	FFN tensors in int8, QKV/O in TQ1-packed, per-tensor dispatch
int4 FFN mode	Halves FFN memory bandwidth for 18-26% speedup (7B/10B)
f32 bypass	Auto-enabled for small models (≤1B) and SubLN architectures
Ring buffer KV cache	Extended context via NTK-aware RoPE scaling
Standalone C++ CLI	No Python or PyTorch required at runtime
SSE web server	FastAPI-based `/v1/chat/completions` with prompt caching
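
As background for the NTK-aware RoPE scaling entry, the commonly used formulation rescales the RoPE base so that the lowest rotary frequency is stretched by the context extension factor while the highest is left untouched. The sketch below uses that widely circulated formula with the RoPE Theta (10000) and Head Dim (128) listed for this model. Whether ATLAS uses exactly this variant is an assumption, and the `scale` value is hypothetical.

```python
def ntk_scaled_base(base, head_dim, scale):
    """Common NTK-aware rule: base' = base * scale ** (d / (d - 2))."""
    return base * scale ** (head_dim / (head_dim - 2))

def rope_inv_freqs(base, head_dim):
    """Standard RoPE inverse frequencies, one per pair of dimensions."""
    return [base ** (-i / head_dim) for i in range(0, head_dim, 2)]

BASE, HEAD_DIM = 10000.0, 128   # values listed for this model
scale = 2.0                     # hypothetical extension factor, e.g. 4096 -> 8192

new_base = ntk_scaled_base(BASE, HEAD_DIM, scale)
orig = rope_inv_freqs(BASE, HEAD_DIM)
scaled = rope_inv_freqs(new_base, HEAD_DIM)

print(f"scaled base         = {new_base:.1f}")
print(f"highest freq ratio  = {scaled[0] / orig[0]:.4f}")
print(f"lowest freq ratio   = {scaled[-1] / orig[-1]:.4f}")
```

The highest frequency keeps a ratio of 1, preserving short-range positional resolution, while the lowest frequency is divided by `scale`, which is what lets longer positions fall within the range seen during training.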

License

Component	License
Base Model (SpectraSuite/TriLM_3.9B_Unpacked)	Apache 2.0
ATLAS Engine	Apache 2.0
This Quantized Derivative	Apache 2.0
