BCJR-QAT-Llama-3.2-1B-2bit

Trained 2-bit quantized weight snapshots for Llama-3.2-1B at single decoder layers, produced by BCJR-QAT, a differentiable relaxation of trellis-coded weight quantization. Companion artifacts to the paper:

BCJR-QAT: A Differentiable Relaxation of Trellis-Coded Weight Quantization (V. Iyengar, 2026; arXiv preprint, pending)

Code: github.com/Venugopalan2610/quant-2bit

Headline results (WikiText-2 PPL)

| Configuration | PPL | Δ vs QTIP-PTQ |
| --- | --- | --- |
| FP16 baseline (no quantization) | 9.7000 | n/a |
| QTIP-PTQ at L4 only (2 bpw) | 10.2189 | n/a |
| BCJR-QAT at L4 only, skip-high-T (2 bpw) | 10.1347 | −0.0842 |
| QTIP-PTQ at L8 only (2 bpw) | 10.3083 | n/a |
| BCJR-QAT at L8 only, naive schedule (2 bpw) | 10.3302 | +0.022 (overshoot) |
| QTIP-PTQ at $[L_4, L_8]$ joint | 10.9134 | n/a |
| BCJR-QAT at $[L_4, L_8]$ joint (mixed schedules) | 10.8364 | −0.0770 (super-additive) |

All remaining decoder layers stay at FP16: 15 of the model's 16 layers in the single-layer configurations, and 14 in the two-layer configurations. The 2 bpw rate refers to the named layer(s) only.
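The Δ column can be re-derived directly from the PPL column. The sketch below recomputes each delta from the table values and compares the joint delta against the sum of the two single-layer deltas. Treating "super-additive" as "the joint improvement exceeds the sum of the single-layer deltas" is an assumed reading of the label, not a definition taken from the paper; the dictionary keys are illustrative names.

```python
# PPL values copied from the headline table.
ppl = {
    "ptq_l4": 10.2189, "qat_l4": 10.1347,
    "ptq_l8": 10.3083, "qat_l8": 10.3302,
    "ptq_joint": 10.9134, "qat_joint": 10.8364,
}


def delta(tag: str) -> float:
    """BCJR-QAT PPL minus QTIP-PTQ PPL for one configuration (negative is better)."""
    return round(ppl[f"qat_{tag}"] - ppl[f"ptq_{tag}"], 4)


d_l4, d_l8, d_joint = delta("l4"), delta("l8"), delta("joint")

# Sum of the two single-layer deltas, used as the "additive" reference point.
sum_single = round(d_l4 + d_l8, 4)

# Assumed reading of "super-additive": the joint run improves on PTQ by more
# than the single-layer deltas would predict if they simply added up.
is_super_additive = d_joint < sum_single

print(f"L4: {d_l4:+.4f}  L8: {d_l8:+.4f}  joint: {d_joint:+.4f}  sum: {sum_single:+.4f}")
```

Under this reading the L8 overshoot (which rounds to the +0.022 shown in the table) is more than offset in the joint run.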

What's in this repo

| Path | Size | Contents |
| --- | --- | --- |
| layer_04_skipT_wq.pt | 232 MB | Hardened-Viterbi snapshot for L4 under the T_init=0.3 schedule. The headline winner. |
| layer_04_skipT_trajectory/ | 1.2 GB | 5 per-step W_latent checkpoints (steps 2/4/6/8/10) for trajectory analysis. |
| layer_04_naive_wq.pt | 232 MB | L4 hardened snapshot under T_init=1.0 (the schedule-overshoot example). |
| layer_08_naive_wq.pt | 232 MB | L8 hardened snapshot under T_init=1.0, used in the multi-layer compounding test. |
| bench/30step/ | 1.6 GB | LR=2e-5 30-step reference run (sub-threshold drift; no codeword movement). |
| results/*.json | ~10 KB | Trajectory-eval and multi-layer-eval JSONs reproducing the paper's tables. |
| bootstrap/perwin_bcjr_n4.npz | 3 KB | OLMoE per-window NLLs for bootstrap analysis (companion to the OLMoE results in the paper). |
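For the per-window NLL file, a paired bootstrap over evaluation windows is one common way to put an interval on a perplexity difference. The sketch below is a generic illustration, not the paper's analysis script: the function names are hypothetical, it assumes every window contributes the same number of tokens (so perplexity is the exponential of the mean per-window NLL), and it runs on synthetic numbers because the array names inside the .npz file are not documented here.

```python
import math
import random


def ppl_from_nlls(nlls):
    """Perplexity from per-window mean NLLs, assuming equal-length windows."""
    return math.exp(sum(nlls) / len(nlls))


def paired_bootstrap_ppl_delta(nll_a, nll_b, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile interval for PPL(b) - PPL(a).

    The same resampled window indices are used for both systems, which keeps
    the comparison paired.
    """
    if len(nll_a) != len(nll_b) or not nll_a:
        raise ValueError("need two non-empty NLL lists of equal length")
    rng = random.Random(seed)
    n = len(nll_a)
    deltas = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        a = ppl_from_nlls([nll_a[i] for i in idx])
        b = ppl_from_nlls([nll_b[i] for i in idx])
        deltas.append(b - a)
    deltas.sort()
    lo = deltas[int((alpha / 2) * n_resamples)]
    hi = deltas[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi


# Synthetic stand-in data: system b is uniformly worse by 0.01 nats per window.
nll_a = [2.0 + 0.01 * i for i in range(50)]
nll_b = [x + 0.01 for x in nll_a]
lo, hi = paired_bootstrap_ppl_delta(nll_a, nll_b)
print(f"95% interval for PPL(b) - PPL(a): [{lo:.4f}, {hi:.4f}]")
```

An interval that excludes zero suggests the difference is not an artifact of which windows happened to be sampled; with real data the two lists would come from the arrays stored in the .npz file.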

How to load and evaluate

The snapshots are PyTorch .pt files saved as a dict {"attn_q_proj": tensor, "attn_k_proj": tensor, ..., "mlp_down_proj": tensor}, one entry per quantized linear in the wrapped layer (7 entries: q/k/v/o/gate/up/down). They install in-place into a fresh FP16 Llama-3.2-1B as follows:

import torch
from transformers import AutoModelForCausalLM
from src.qat.eval_llama_layer import install_layer_weights

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B", torch_dtype=torch.float16, device_map="cuda")
snap = torch.load("layer_04_skipT_wq.pt", weights_only=True)
install_layer_weights(model, layer_idx=4, snap_or_fn=snap, dtype=torch.float16)
# now model has 2-bit BCJR-QAT-trained weights at layer 4, FP16 elsewhere

The install_layer_weights helper is in src/qat/eval_llama_layer.py in the companion repo.
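To make the snapshot layout more concrete, the sketch below maps a snapshot key to the corresponding module path in the Hugging Face Llama implementation (`model.layers.<i>.self_attn.<name>_proj` and `model.layers.<i>.mlp.<name>_proj`). It is not the code of `install_layer_weights`: the function name is hypothetical, and the key convention (`attn_` or `mlp_` prefix followed by the projection name) is inferred from the example keys listed above.

```python
def snapshot_key_to_module_path(key: str, layer_idx: int) -> str:
    """Map a snapshot key such as 'attn_q_proj' to a Hugging Face Llama module path.

    Assumed key convention: '<block>_<name>_proj' with block in {'attn', 'mlp'}.
    """
    block, _, proj = key.partition("_")
    hf_block = {"attn": "self_attn", "mlp": "mlp"}.get(block)
    if hf_block is None or not proj.endswith("_proj"):
        raise ValueError(f"unrecognized snapshot key: {key!r}")
    return f"model.layers.{layer_idx}.{hf_block}.{proj}"


# Example: where the layer-4 query projection from the snapshot would land.
print(snapshot_key_to_module_path("attn_q_proj", 4))
```

With a loaded model, `model.get_submodule(path).weight` is the parameter each entry would presumably overwrite, so comparing its shape against `snap[key].shape` is a cheap sanity check before installing.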

Reproduce the headline result from scratch

The headline result took about 5 hours to produce on a single H100 SXM:

git clone https://github.com/Venugopalan2610/quant-olmoe
cd quant-olmoe
bash scripts/vast_setup_llama.sh
bash scripts/vast_train_llama_skip_highT.sh
python -m scripts.eval_llama_trajectory \
    --ckpt-dir cache/llama_bcjr_skipT \
    --target-layer 4 \
    --output results/llama_skipT_trajectory.json \
    --skip-baselines --ppl-fp-cached 9.70 --ppl-ptq-cached 10.2189

Recipe: PTQ-init, $\eta = 2 \times 10^{-4}$, $N = 10$ steps, $T$ schedule $0.3 \to 0.05$, BCJR chunk 16, sequence length 1024, batch size 1.
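The recipe gives only the endpoints of the temperature schedule. As an illustration, the sketch below generates one temperature per step between $0.3$ and $0.05$ over $N = 10$ steps. The geometric (exponential) decay shape and the function name are assumptions; the training script may interpolate differently.

```python
def temperature_schedule(t_init: float = 0.3, t_final: float = 0.05,
                         n_steps: int = 10) -> list[float]:
    """Geometric decay from t_init to t_final over n_steps (assumed shape)."""
    if n_steps < 2:
        return [t_init]
    # Constant ratio between consecutive steps so the last step lands on t_final.
    ratio = (t_final / t_init) ** (1.0 / (n_steps - 1))
    return [t_init * ratio**k for k in range(n_steps)]


schedule = temperature_schedule()
for step, temp in enumerate(schedule, start=1):
    print(f"step {step:2d}: T = {temp:.4f}")
```

Starting at $T = 0.3$ rather than $1.0$ is what distinguishes the skip-high-T run from the naive schedule listed in the file table.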

Citation

(Publication pending)

@article{iyengar2026bcjrqat,
  title   = {BCJR-QAT: A Differentiable Relaxation of Trellis-Coded
             Weight Quantization},
  author  = {Iyengar, Venugopalan},
  year    = {2026},
  journal = {arXiv preprint}
}

Please also cite the upstream Llama-3.2 base model (Llama 3.2 Community License, Meta) and the QTIP paper that this work builds on:

@inproceedings{tseng2024qtip,
  title     = {QTIP: Quantization with Trellises and Incoherence Processing},
  author    = {Tseng, Albert and Sun, Qingyao and Hou, David and
               De Sa, Christopher},
  booktitle = {NeurIPS},
  year      = {2024}
}

License

Apache-2.0 for the trained weights and quantization metadata in this repository. Use of the underlying Llama-3.2-1B base model weights is governed by the Llama 3.2 Community License.
