TriLM-3.9B-ATLAS

This repository contains a TQ1-quantized build of the official SpectraSuite/TriLM_3.9B_Unpacked model for the ATLAS Engine ecosystem, designed for native, low-latency CPU inference with no GPU required.

Packed using the unified pack_to_atlas.py toolchain (v2.10.0) with BF16 weight scale correction.


Engine Specifications

| Property | Value |
|---|---|
| Format | ATLAS Binary (.atlas), format_version=2 |
| Quantization | TQ1.0: Ternary Weight Packing (Base-3, ~1.58 bits/weight) |
| Target | Native CPU: Intel AVX2 (Haswell 2013+), no GPU needed |
| File Size | 1.32 GB |
| Inference Speed | ~10 tok/s (hybrid+int8) |
| Description | 30 layers, 3072 hidden, 9216 intermediate; standard Llama (no SubLN) |

Architecture

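| Component | Detail |
|---|---|
| Base Model | SpectraSuite/TriLM_3.9B_Unpacked |
| Architecture | trilm |
| Layers | 30 |
| Hidden Size | 3072 |
| Intermediate Size | 9216 |
| Attention Heads | 24 (GQA configuration with 24 KV heads, which is equivalent to standard multi-head attention) |
| Head Dim | 128 |
| RoPE Theta | 10000 |
| Vocabulary | 50688 |
| Context Window | 4096 |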
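
As a rough cross-check of the figures above, the weight count of the transformer blocks can be derived from the table. This is a sketch under assumptions that the card does not confirm: a Llama-style block with four square attention projections (Q, K, V, O) and a gated FFN with three matrices, with norm weights ignored. It also does not account for embeddings, scale factors, or file metadata, so it is not expected to reproduce the 1.32 GB file size.

```python
# Architecture values copied from the table above.
LAYERS = 30
HIDDEN = 3072
INTERMEDIATE = 9216
VOCAB = 50688

# Assumption: Q, K, V and O are each HIDDEN x HIDDEN
# (24 heads x 128 head dim = 3072, and KV heads = query heads).
attn_params = 4 * HIDDEN * HIDDEN

# Assumption: gated FFN with gate, up and down projections.
ffn_params = 3 * HIDDEN * INTERMEDIATE

block_params = LAYERS * (attn_params + ffn_params)

# One embedding matrix; whether the output head is tied is not stated.
embed_params = VOCAB * HIDDEN

# Five ternary weights per byte, as described for TQ1.0.
packed_block_bytes = block_params // 5

print(f"transformer blocks: {block_params:,} weights")
print(f"embedding matrix:   {embed_params:,} weights")
print(f"blocks at 5 trits/byte: {packed_block_bytes / 1e9:.2f} GB")
```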

Verification

During pre-release evaluation, this quantized derivative produced correct output:

  • T=0 (argmax): "The capital of France is Paris." (correct deterministic output)
  • T=0.7 (sampling): coherent, structured generation with sensible continuation

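Note on scale mathematics: the legacy dequantization path divides by the scale factor rather than multiplying. Because the factor is constant across all logits in a given output row, and assuming it is positive, the ordering of the logits is preserved, so greedy (T=0) decoding is unaffected. Strictly, softmax is invariant to adding a constant, not to multiplying by one, so a uniform factor acts like a temperature change on sampled output; the evaluation above showed no effect on output quality.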
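
How a uniform factor interacts with softmax can be checked numerically. The sketch below uses made-up logits and a hypothetical scale value (neither comes from the ATLAS code) and compares multiplying against dividing. Where the scale actually enters the engine's computation is not specified in this card, so this only illustrates the general property.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    return max(range(len(values)), key=values.__getitem__)

logits = [2.0, 1.0, 0.5, -1.0]  # made-up example values
scale = 0.5                      # hypothetical positive scale factor

multiplied = [x * scale for x in logits]
divided = [x / scale for x in logits]

# The winning token is the same either way, so greedy decoding agrees.
print("argmax:", argmax(multiplied), argmax(divided))

# The probabilities differ: a uniform factor behaves like a temperature.
print("p(multiplied):", [round(p, 3) for p in softmax(multiplied)])
print("p(divided):   ", [round(p, 3) for p in softmax(divided)])

# An additive constant, by contrast, leaves softmax unchanged.
shifted = [x + 3.0 for x in logits]
print("shift-invariant:", all(
    abs(a - b) < 1e-12 for a, b in zip(softmax(logits), softmax(shifted))
))
```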


Prompt Template

This is a base model with no chat template. Use raw text continuation:

The capital of France is

Usage

Python

Clone the engine repository (shell):

git clone https://github.com/xxxn3m3s1sxxx/ATLAS-TQ1_0.git

Then, in Python:

from atlas_infer import AtlasModel

model = AtlasModel("TriLM-3.9B-ATLAS.tq1.atlas")
output = model.generate_c(
    "The capital of France is",
    max_new_tokens=50,
    temperature=0.7,
    top_k=40,
)
print(output)

C++ CLI (standalone, no Python required)

atlas --model TriLM-3.9B-ATLAS.tq1.atlas --prompt "The capital of France is" --max-tokens 50

SSE Web Server

python atlas_server.py --model TriLM-3.9B-ATLAS.tq1.atlas --port 8080
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"prompt": "The capital of France is", "max_tokens": 50}'
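
The same request can be sent from Python with only the standard library. This is a minimal sketch that mirrors the curl command above (same URL, header, and JSON body). The helper names `build_request` and `complete` are made up for illustration, and the response schema is not documented here, so the body is returned as raw text rather than parsed.

```python
import json
import urllib.request

def build_request(prompt, max_tokens=50, base_url="http://localhost:8080"):
    """Build the POST request shown in the curl example."""
    payload = json.dumps({"prompt": prompt, "max_tokens": max_tokens})
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def complete(prompt, **kwargs):
    """Send the request and return the raw response body as text.

    Assumption: the server eventually closes the response; for a
    long-lived SSE stream, read the response line by line instead.
    """
    with urllib.request.urlopen(build_request(prompt, **kwargs)) as resp:
        return resp.read().decode("utf-8")

# Usage (requires atlas_server.py to be running on port 8080):
# print(complete("The capital of France is"))
```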

What is ATLAS?

ATLAS is a CPU inference engine for BitNet b1.58 ternary-quantized models. It repacks Hugging Face safetensors into the TQ1.0 format (5 trits per byte in Base-3 encoding, which stores 1.6 bits per weight, close to the log2(3) ≈ 1.58-bit information content of a ternary value) and runs inference through a C++ DLL with a Python wrapper.
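
The idea behind packing five trits per byte can be sketched in a few lines of Python: five base-3 digits span 3^5 = 243 values, which fits in one byte. The digit order, the mapping of -1/0/+1 to digits, and the padding rule below are assumptions chosen for illustration; the actual TQ1.0 layout may differ.

```python
def pack_trits(weights):
    """Pack a list of ternary weights (-1, 0, +1) five per byte, base-3."""
    packed = bytearray()
    for i in range(0, len(weights), 5):
        group = list(weights[i:i + 5])
        group += [0] * (5 - len(group))  # pad the last group with zeros
        value = 0
        for w in group:
            # Map -1/0/+1 to digits 0/1/2, most significant digit first.
            value = value * 3 + (w + 1)
        packed.append(value)  # at most 3**5 - 1 = 242, so it fits in a byte
    return bytes(packed)

def unpack_trits(packed, count):
    """Inverse of pack_trits; `count` trims the zero padding."""
    weights = []
    for byte in packed:
        digits = []
        for _ in range(5):
            digits.append(byte % 3 - 1)  # least significant digit first
            byte //= 3
        weights.extend(reversed(digits))
    return weights[:count]

sample = [1, -1, 0, 0, 1, 1, -1]
blob = pack_trits(sample)
print(len(blob), "bytes for", len(sample), "weights")
print(unpack_trits(blob, len(sample)) == sample)
```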

| Feature | Description |
|---|---|
| No GPU required | Runs on any x86-64 CPU with AVX2 (Intel Haswell 2013+, AMD Excavator 2015+) |
| Hybrid matmul | FFN tensors in int8, QKV/O in TQ1-packed, per-tensor dispatch |
| int4 FFN mode | Halves FFN memory bandwidth for 18-26% speedup (7B/10B) |
| f32 bypass | Auto-enabled for small models (≤1B) and SubLN architectures |
| Ring buffer KV cache | Extended context via NTK-aware RoPE scaling |
| Standalone C++ CLI | No Python or PyTorch required at runtime |
| SSE web server | FastAPI-based /v1/chat/completions with prompt caching |
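
To illustrate the ring buffer KV cache named in the feature list, here is a generic sketch of the data structure: a fixed number of slots in which the newest entry overwrites the oldest once the cache is full. The class name `RingKVCache` and its methods are hypothetical, and the keys and values are placeholders. How ATLAS lays out its cache, or combines it with NTK-aware RoPE scaling, is not described in this card.

```python
class RingKVCache:
    """Fixed-capacity cache that overwrites the oldest entry once full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = [None] * capacity
        self.count = 0  # total number of entries ever appended

    def append(self, key, value):
        # The write position wraps around instead of growing the buffer.
        self.slots[self.count % self.capacity] = (key, value)
        self.count += 1

    def window(self):
        """Return the cached (key, value) pairs, oldest first."""
        if self.count <= self.capacity:
            return self.slots[:self.count]
        start = self.count % self.capacity
        return self.slots[start:] + self.slots[:start]

cache = RingKVCache(capacity=3)
for position in range(5):
    cache.append(f"k{position}", f"v{position}")

# Only the three most recent positions remain.
print(cache.window())
```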

License

| Component | License |
|---|---|
| Base Model (SpectraSuite/TriLM_3.9B_Unpacked) | Apache 2.0 |
| ATLAS Engine | Apache 2.0 |
| This Quantized Derivative | Apache 2.0 |