TriLM-190M-ATLAS

This repository contains a TQ1.0-quantized build of the official SpectraSuite/TriLM_190M_Unpacked model for the ATLAS Engine ecosystem, designed for native, low-latency CPU inference with no GPU required.

Packed using the unified pack_to_atlas.py toolchain (v2.10.0) with BF16 weight scale correction.

Engine Specifications

Property	Value
Format	ATLAS Binary (`.atlas`), format_version=2
Quantization	TQ1.0 — Ternary Weight Packing (Base-3, ~1.58 bits/weight)
Target	Native CPU — Intel AVX2 (Haswell 2013+), no GPU needed
File Size	0.17 GB
Inference Speed	~25 tok/s (f32 bypass)
Description	16 layers, 768 hidden, 2048 intermediate

Architecture

Component	Detail
Base Model	`SpectraSuite/TriLM_190M_Unpacked`
Architecture	trilm-99m
Layers	16
Hidden Size	768
Intermediate Size	2048
Attention Heads	12 (12 KV heads, so GQA reduces to standard multi-head attention)
Head Dim	64
RoPE Theta	10000
Vocabulary	50304
Context Window	4096
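
As a rough sanity check, the figures in the table above come close to the 190M in the model name. The sketch below is only an estimate: it assumes LLaMA-style blocks (four attention projections plus a gated FFN with three projections), untied input and output embeddings, and it ignores normalization weights. None of these assumptions is stated in this card.

```python
# Back-of-the-envelope parameter count from the architecture table.
# Assumptions (not stated in this card): LLaMA-style blocks with a gated
# FFN (gate, up, down), untied embeddings, normalization weights ignored.
LAYERS = 16
HIDDEN = 768
INTERMEDIATE = 2048
HEADS = 12
VOCAB = 50304

head_dim = HIDDEN // HEADS                      # per-head width
attn_params = 4 * HIDDEN * HIDDEN               # Q, K, V, O projections
ffn_params = 3 * HIDDEN * INTERMEDIATE          # gate, up, down projections
block_params = LAYERS * (attn_params + ffn_params)
embed_params = 2 * VOCAB * HIDDEN               # input embedding + output head

total_params = block_params + embed_params
print(f"head_dim={head_dim}")
print(f"transformer blocks: {block_params / 1e6:.1f}M")
print(f"embeddings:         {embed_params / 1e6:.1f}M")
print(f"total:              {total_params / 1e6:.1f}M")
```

Under these assumptions the total is about 190.5M parameters, of which roughly 113M sit in the transformer blocks.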

f32 bypass active: hidden=768 ≤ 2048, auto-enabled

Verification

In pre-release evaluation, this quantized derivative behaved reasonably for its size:

T=0 (argmax): Coherent English output, but factual recall limited by model capacity
T=0.7 (sampling): Structured generation with sensible continuation
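
The two settings above differ only in how the next token is chosen from the model's logits. The following is a generic sketch of greedy (T=0) versus temperature plus top-k sampling; `sample_token` is a hypothetical helper written for illustration and is not taken from the ATLAS code base.

```python
import math
import random

def sample_token(logits, temperature=0.7, top_k=40, rng=random):
    """Pick a token id from raw logits (illustrative, not ATLAS code)."""
    if temperature <= 0:
        # T=0: greedy decoding, always the highest-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Keep only the top_k highest-scoring token ids.
    k = min(top_k, len(logits))
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the kept logits after temperature scaling.
    scaled = [logits[i] / temperature for i in top]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]

logits = [0.1, 2.0, -1.0, 1.5]
print(sample_token(logits, temperature=0))             # greedy choice
print(sample_token(logits, temperature=0.7, top_k=2))  # one of the two best ids
```

Lower temperatures concentrate probability on the best-scoring tokens, which is why T=0 output is deterministic while T=0.7 varies between runs.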

Note: Models below 390M parameters have limited capacity for factual knowledge. Larger TriLM variants (1.5B+) show reliable factual recall.

Prompt Template

This is a base model — no chat template. Use raw text continuation:

The capital of France is

Usage

Python

git clone https://github.com/xxxn3m3s1sxxx/ATLAS-TQ1_0.git

from atlas_infer import AtlasModel

model = AtlasModel("TriLM-190M-ATLAS.tq1.atlas")
output = model.generate_c(
    "The capital of France is",
    max_new_tokens=50,
    temperature=0.7,
    top_k=40,
)
print(output)

C++ CLI (standalone, no Python required)

atlas --model TriLM-190M-ATLAS.tq1.atlas --prompt "The capital of France is" --max-tokens 50

SSE Web Server

python atlas_server.py --model TriLM-190M-ATLAS.tq1.atlas --port 8080
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"prompt": "The capital of France is", "max_tokens": 50}'
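
The same request can be sent from Python with only the standard library. This is a sketch under assumptions: the request body mirrors the curl example, but the response format is not documented in this card, so the parser below assumes standard Server-Sent Events `data:` lines and treats the OpenAI-style `[DONE]` sentinel as optional. The helper names are hypothetical.

```python
import json
import urllib.request

def build_request(prompt, max_tokens=50, base_url="http://localhost:8080"):
    """Build the same POST request as the curl example."""
    body = json.dumps({"prompt": prompt, "max_tokens": max_tokens}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def iter_sse_data(lines):
    """Yield the payload of each SSE 'data:' line.

    Assumes standard SSE framing; '[DONE]' is the OpenAI convention and
    may not be what atlas_server.py actually emits.
    """
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        yield payload

def stream_completion(prompt, max_tokens=50):
    """Print streamed payloads; requires a running atlas_server.py."""
    with urllib.request.urlopen(build_request(prompt, max_tokens)) as resp:
        for payload in iter_sse_data(resp):
            print(payload)
```

With the server from the previous command running, `stream_completion("The capital of France is")` prints each streamed payload as it arrives.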

What is ATLAS?

ATLAS is a CPU inference engine for BitNet b1.58 ternary-quantized models. It repacks HuggingFace safetensors into the TQ1.0 format (five trits per byte, Base-3 encoding, ~1.58 bits/weight) and runs inference through a C++ DLL with a Python wrapper.
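
Base-3 packing works because 3^5 = 243 fits in one byte, so five ternary weights cost 8 bits, or 1.6 bits per weight, close to the log2(3) ≈ 1.585-bit minimum. The sketch below illustrates the idea only; the digit order, the mapping of -1/0/+1 to digits, and the zero padding are assumptions and may not match the actual `.atlas` byte layout.

```python
def pack_trits(weights):
    """Pack ternary weights (-1, 0, +1) into bytes, five per byte, base 3."""
    out = bytearray()
    for start in range(0, len(weights), 5):
        group = list(weights[start:start + 5])
        group += [0] * (5 - len(group))      # pad the last group with zeros
        value = 0
        for w in reversed(group):
            value = value * 3 + (w + 1)      # map -1/0/+1 to digits 0/1/2
        out.append(value)                    # at most 3**5 - 1 = 242
    return bytes(out)

def unpack_trits(data, count):
    """Inverse of pack_trits; returns the first `count` weights."""
    weights = []
    for byte in data:
        for _ in range(5):
            weights.append(byte % 3 - 1)     # lowest digit holds the first weight
            byte //= 3
    return weights[:count]

original = [1, -1, 0, 0, 1, -1, -1]
packed = pack_trits(original)
assert unpack_trits(packed, len(original)) == original
print(len(original), "weights packed into", len(packed), "bytes")
```

A production kernel would typically decode through a lookup table or SIMD arithmetic rather than per-digit division; the loop here is kept simple for readability.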

Feature	Description
No GPU required	Runs on any x86-64 CPU with AVX2 (Intel Haswell 2013+, AMD Excavator 2015+)
Hybrid matmul	FFN tensors in int8, QKV/O in TQ1-packed, per-tensor dispatch
int4 FFN mode	Halves FFN memory bandwidth for an 18-26% speedup on 7B/10B models
f32 bypass	Auto-enabled for small models (≤1B) and SubLN architectures
Ring buffer KV cache	Extended context via NTK-aware RoPE scaling
Standalone C++ CLI	No Python or PyTorch required at runtime
SSE web server	FastAPI-based `/v1/chat/completions` with prompt caching
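
For readers unfamiliar with NTK-aware RoPE scaling, the commonly used formulation enlarges the rotary base so that the slowest frequency is stretched by the context scale factor while the fastest one is left alone. The sketch below uses this card's head dimension (64) and RoPE theta (10000) as defaults. Whether ATLAS applies exactly this formula is an assumption, and `rope_inv_freq` is a hypothetical helper.

```python
def rope_inv_freq(head_dim=64, theta=10000.0, scale=1.0):
    """Per-pair rotary frequencies with optional NTK-aware base scaling."""
    # NTK-aware scaling: raise the base so the lowest frequency shrinks by
    # `scale` while the highest frequency (index 0) stays at 1.0.
    adjusted_theta = theta * scale ** (head_dim / (head_dim - 2))
    return [adjusted_theta ** (-2 * i / head_dim) for i in range(head_dim // 2)]

base = rope_inv_freq()              # unscaled frequencies
extended = rope_inv_freq(scale=4.0) # hypothetical 4x context extension
print(len(base), "frequency pairs")
print("lowest frequency ratio:", base[-1] / extended[-1])
```

With `scale=1.0` the function reduces to the standard RoPE frequencies, so the same code path can serve both the native and the extended context.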

License

Component	License
Base Model (SpectraSuite/TriLM_190M_Unpacked)	Apache 2.0
ATLAS Engine	Apache 2.0
This Quantized Derivative	Apache 2.0
