BCJR-QAT-Llama-3.2-1B-2bit

Trained 2-bit quantized weight snapshots for individual decoder layers of Llama-3.2-1B, produced by BCJR-QAT, a differentiable relaxation of trellis-coded weight quantization. These are companion artifacts to the paper:

BCJR-QAT: A Differentiable Relaxation of Trellis-Coded Weight Quantization (V. Iyengar, 2026; arXiv preprint, pending)

Code: github.com/Venugopalan2610/quant-2bit

Headline results (WikiText-2 PPL)

Configuration	PPL	Δ vs QTIP-PTQ
FP16 baseline (no quantization)	9.7000	—
QTIP-PTQ at L4 only (2 bpw)	10.2189	—
BCJR-QAT at L4 only, skip-high-T (2 bpw)	10.1347	−0.0842
QTIP-PTQ at L8 only (2 bpw)	10.3083	—
BCJR-QAT at L8 only, naive schedule (2 bpw)	10.3302	+0.0219 (overshoot)
QTIP-PTQ at $[L_4, L_8]$ joint	10.9134	—
BCJR-QAT at $[L_4, L_8]$ joint (mixed schedules)	10.8364	−0.0770 (super-additive)

All other decoder layers stay at FP16: 15 of the model's 16 layers in the single-layer configurations and 14 in the two-layer configuration. The 2 bpw rate refers to the named layer(s) only.
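The Δ column can be rechecked directly from the perplexities in the table. The sketch below also compares the joint delta against the sum of the two single-layer deltas, which is one plausible reading of "super-additive"; that reading, and the dictionary layout, are assumptions made for illustration.

```python
# WikiText-2 perplexities copied from the headline table above.
PPL = {
    "L4": {"ptq": 10.2189, "qat": 10.1347},
    "L8": {"ptq": 10.3083, "qat": 10.3302},
    "L4+L8": {"ptq": 10.9134, "qat": 10.8364},
}


def delta(config: str) -> float:
    """BCJR-QAT perplexity minus QTIP-PTQ perplexity (negative is better)."""
    return round(PPL[config]["qat"] - PPL[config]["ptq"], 4)


deltas = {name: delta(name) for name in PPL}

# Assumed reading of "super-additive": the joint delta is more negative
# than the sum of the two single-layer deltas.
sum_of_singles = round(deltas["L4"] + deltas["L8"], 4)
extra_gain = round(deltas["L4+L8"] - sum_of_singles, 4)

for name, value in deltas.items():
    print(f"{name}: {value:+.4f}")
print(f"sum of single-layer deltas: {sum_of_singles:+.4f}")
print(f"joint minus sum: {extra_gain:+.4f}")
```

Under this reading, the single-layer deltas sum to −0.0623 while the joint run reaches −0.0770, so the two-layer configuration gains about 0.0147 PPL beyond what the single-layer numbers would predict.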

What's in this repo

Path	Size	Contents
`layer_04_skipT_wq.pt`	232 MB	Hardened-Viterbi snapshot for L4 under `T_init=0.3` schedule. The headline winner.
`layer_04_skipT_trajectory/`	1.2 GB	5 per-step `W_latent` checkpoints (steps 2/4/6/8/10) for trajectory analysis.
`layer_04_naive_wq.pt`	232 MB	L4 hardened snapshot under `T_init=1.0` (the schedule-overshoot example).
`layer_08_naive_wq.pt`	232 MB	L8 hardened snapshot under `T_init=1.0`, used in multi-layer compounding test.
`bench/30step/`	1.6 GB	LR=2e-5 30-step reference run (sub-threshold drift; no codeword movement).
`results/*.json`	~10 KB	Trajectory-eval and multi-layer-eval JSONs reproducing the paper's tables.
`bootstrap/perwin_bcjr_n4.npz`	3 KB	OLMoE per-window NLLs for bootstrap analysis (companion to OLMoE results in the paper).
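As a sanity check on the file sizes, the parameter count of one Llama-3.2-1B decoder layer can be worked out from the public model config (hidden size 2048, MLP width 8192, 8 key/value heads of dimension 64). The arithmetic below suggests that a 232 MB snapshot matches the seven linear weights stored as dequantized FP32 tensors, not as packed 2-bit codes. That conclusion is an inference from the numbers, and it assumes "MB" here means MiB; the repo does not state the storage dtype.

```python
# Llama-3.2-1B decoder-layer dimensions, taken from the public model config.
HIDDEN = 2048
INTERMEDIATE = 8192
NUM_KV_HEADS = 8
HEAD_DIM = 64
KV_DIM = NUM_KV_HEADS * HEAD_DIM  # grouped-query attention: 512

# (out_features, in_features) for the seven linears in one decoder layer.
LINEAR_SHAPES = {
    "q": (HIDDEN, HIDDEN),
    "k": (KV_DIM, HIDDEN),
    "v": (KV_DIM, HIDDEN),
    "o": (HIDDEN, HIDDEN),
    "gate": (INTERMEDIATE, HIDDEN),
    "up": (INTERMEDIATE, HIDDEN),
    "down": (HIDDEN, INTERMEDIATE),
}

params = sum(out_f * in_f for out_f, in_f in LINEAR_SHAPES.values())

MIB = 2**20
fp32_mib = params * 4 / MIB          # 4 bytes per weight
fp16_mib = params * 2 / MIB          # 2 bytes per weight
packed_2bit_mib = params * 2 / 8 / MIB  # 2 bits per weight, codes only

print(f"parameters per layer: {params:,}")
print(f"FP32: {fp32_mib:.1f} MiB, FP16: {fp16_mib:.1f} MiB, "
      f"packed 2-bit: {packed_2bit_mib:.1f} MiB")
```

The five-checkpoint trajectory directory is also consistent with this estimate, since five FP32 copies come to roughly 1.1 GiB before serialization overhead.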

How to load and evaluate

The snapshots are PyTorch `.pt` files, each saved as a dict `{"attn_q_proj": tensor, "attn_k_proj": tensor, ..., "mlp_down_proj": tensor}` with one entry per quantized linear in the wrapped layer (7 entries: q/k/v/o/gate/up/down). They install in place into a fresh FP16 Llama-3.2-1B as follows:

import torch
from transformers import AutoModelForCausalLM
from src.qat.eval_llama_layer import install_layer_weights

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B", torch_dtype=torch.float16, device_map="cuda")
snap = torch.load("layer_04_skipT_wq.pt", weights_only=True)
install_layer_weights(model, layer_idx=4, snap_or_fn=snap, dtype=torch.float16)
# now model has 2-bit BCJR-QAT-trained weights at layer 4, FP16 elsewhere

The `install_layer_weights` helper is in `src/qat/eval_llama_layer.py` in the companion repo.
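The real helper is not reproduced here. As a rough illustration of what an in-place install of this kind might involve, the sketch below copies each snapshot tensor into the matching linear of one decoder layer. The names `resolve_linear` and `install_snapshot_sketch` are hypothetical, and the mapping from the `attn_` and `mlp_` key prefixes to the Hugging Face submodules `self_attn` and `mlp` is an assumption based on the key names listed above; the companion repo may do this differently.

```python
# Assumed mapping from snapshot key prefix to the decoder-layer submodule
# name used by Hugging Face Llama models.
PREFIX_TO_SUBMODULE = {"attn": "self_attn", "mlp": "mlp"}


def resolve_linear(layer, key):
    """Map a snapshot key such as 'attn_q_proj' to layer.self_attn.q_proj."""
    prefix, _, proj_name = key.partition("_")
    if prefix not in PREFIX_TO_SUBMODULE or not proj_name:
        raise KeyError(f"unrecognized snapshot key: {key!r}")
    submodule = getattr(layer, PREFIX_TO_SUBMODULE[prefix])
    return getattr(submodule, proj_name)


def install_snapshot_sketch(model, layer_idx, snap):
    """Overwrite the linear weights of one decoder layer from a snapshot dict.

    Hypothetical stand-in for the repo's install_layer_weights helper.
    """
    layer = model.model.layers[layer_idx]
    for key, tensor in snap.items():
        linear = resolve_linear(layer, key)
        if tuple(linear.weight.shape) != tuple(tensor.shape):
            raise ValueError(
                f"{key}: snapshot shape {tuple(tensor.shape)} does not match "
                f"model shape {tuple(linear.weight.shape)}"
            )
        # Writing through .data keeps autograd out of the picture; for
        # PyTorch tensors, copy_ casts to the destination dtype and device.
        linear.weight.data.copy_(tensor)
```

Checking shapes before copying makes a snapshot installed at the wrong layer index or into the wrong model size fail loudly instead of silently broadcasting.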

Reproduce the headline result from scratch

The headline result took about 5 hours to produce on a single H100 SXM:

git clone https://github.com/Venugopalan2610/quant-olmoe
cd quant-olmoe
bash scripts/vast_setup_llama.sh
bash scripts/vast_train_llama_skip_highT.sh
python -m scripts.eval_llama_trajectory \
    --ckpt-dir cache/llama_bcjr_skipT \
    --target-layer 4 \
    --output results/llama_skipT_trajectory.json \
    --skip-baselines --ppl-fp-cached 9.70 --ppl-ptq-cached 10.2189

Recipe: PTQ-init, $\eta = 2 \times 10^{-4}$, $N = 10$ steps, $T$ schedule $0.3 \to 0.05$, BCJR chunk 16, sequence length 1024, batch size 1.
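The recipe gives only the endpoints of the temperature schedule, not how $T$ moves between them. The sketch below assumes geometric (log-linear) annealing, a common choice for temperature-controlled relaxations; both the interpolation rule and the function name `geometric_temperature_schedule` are assumptions, and the training script in the companion repo is the authority on the actual schedule.

```python
def geometric_temperature_schedule(t_init, t_final, n_steps):
    """Return n_steps temperatures decaying geometrically from t_init to t_final.

    Assumed form: T_k = t_init * (t_final / t_init) ** (k / (n_steps - 1)).
    """
    if n_steps < 2:
        raise ValueError("n_steps must be at least 2")
    if t_init <= 0 or t_final <= 0:
        raise ValueError("temperatures must be positive")
    ratio = (t_final / t_init) ** (1.0 / (n_steps - 1))
    return [t_init * ratio**step for step in range(n_steps)]


# Endpoints and step count from the recipe: T from 0.3 to 0.05 over N = 10 steps.
skip_high_t = geometric_temperature_schedule(0.3, 0.05, 10)

for step, temperature in enumerate(skip_high_t, start=1):
    print(f"step {step:2d}: T = {temperature:.4f}")
```

Starting at `T_init=0.3` instead of `T_init=1.0` is what the file names call skip-high-T: the run begins with the relaxation already fairly close to the hard quantizer.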

Citation

(Publication pending)

@article{iyengar2026bcjrqat,
  title   = {BCJR-QAT: A Differentiable Relaxation of Trellis-Coded
             Weight Quantization},
  author  = {Iyengar, Venugopalan},
  year    = {2026},
  journal = {arXiv preprint}
}

Please also cite the upstream Llama-3.2 base model (Llama 3.2 Community License, Meta) and the QTIP paper that this work builds on:

@inproceedings{tseng2024qtip,
  title     = {QTIP: Quantization with Trellises and Incoherence Processing},
  author    = {Tseng, Albert and Sun, Qingyao and Hou, David and
               De Sa, Christopher},
  booktitle = {NeurIPS},
  year      = {2024}
}

License

Apache-2.0 for the trained weights and quantization metadata in this repository. Use of the underlying Llama-3.2-1B base model weights is governed by the Llama 3.2 Community License.
