Sarvam-30B Spiral

Release v1.0 — Spiral_Q4 scheme

Calibration-free INT4 compression of Sarvam-30B that beats the bf16 baseline on wikitext-2 perplexity. Spiral_Q4 lands at PPL 11.6398, lower than both the bf16 reference (11.8490) and Sarvam AI's official imatrix-calibrated Q4_K_M GGUF (11.9627), with no calibration data, no fine-tuning, and no representative samples. Production-stable on Apple Silicon (Metal). The CUDA build is supported; validation is pending.

This is the v1.0 release of Spiral compression applied to sarvamai/sarvam-30b — the 30B Mixture-of-Experts model from Sarvam AI, trained for 22 Indian languages plus English. Architecture: sarvam_moe (registered in llama.cpp as bailingmoe2).

Quality

Perplexity (wikitext-2-raw-v1)

Measured with llama-perplexity from the Spiral fork of llama.cpp, ctx=512:

Method	bpw	Calibration data	NLL (nats/token)	PPL	Gap vs bf16 (NLL, nats/token)
bf16 (reference, 64.3 GB)	16.00	—	2.4723	11.8490 ± 0.10	0
Q4_K_M (Sarvam team, imatrix English)	~4.50	English wikitext	2.4818	11.9627 ± 0.10	+0.0096
Spiral_Q4 (this model)	4.56	None	2.4541	11.6398 ± 0.10	−0.018

Spiral_Q4 is the first calibration-free quantization scheme on Sarvam-30B that measurably beats the bf16 baseline on wikitext-2 perplexity, at a bit budget comparable to Q4_K_M. Each PPL figure carries a ±0.10 uncertainty, so both the negative gap vs bf16 and the margin over imatrix Q4_K_M (which calibrates on English text) should be read against that noise level. The results are reproducible from the artifact and sidecar in this repository.
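
The NLL, PPL, and gap columns are tied together by PPL = exp(NLL), so the gap is a difference of log-perplexities. A minimal Python sketch that recomputes the gap from the published PPL values follows; the dictionary and function names are chosen here for illustration, and last-digit differences from the table are expected because the published PPL values are rounded.

```python
import math

# PPL values copied from the table above.
ppl = {"bf16": 11.8490, "Q4_K_M": 11.9627, "Spiral_Q4": 11.6398}

def nll_from_ppl(p: float) -> float:
    """Mean negative log-likelihood in nats/token, since PPL = exp(NLL)."""
    return math.log(p)

def gap_vs_reference(method: str, reference: str = "bf16") -> float:
    """NLL difference in nats/token; negative means lower perplexity than the reference."""
    return nll_from_ppl(ppl[method]) - nll_from_ppl(ppl[reference])

for name in ppl:
    print(f"{name:10s} NLL={nll_from_ppl(ppl[name]):.4f}  gap={gap_vs_reference(name):+.4f}")
```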

Indic evaluations

Not measured in v1.0. Sarvam-30B's design center is 22 Indian languages, and wikitext-2 is English-only. Indic-specific perplexity (Sangraha or similar) is on the roadmap and is expected to be a strength of the calibration-free approach — Q4_K_M's imatrix is calibrated on English text, which may bias it against Indic activations.

Cross-platform validation

The same (.gguf, .spiralcb) pair runs on Metal (Apple Silicon) and CUDA (NVIDIA). Generated text on identical prompts is semantically equivalent across platforms; minor token-level drift comes from fp16 ordering differences in the deep stack, same as for stock K-quant GGUFs.

CUDA validation in this release: the build is verified on RTX PRO 6000 Blackwell and H100. Inference throughput numbers are pending; decode rates of 3-5× Metal are expected, based on prior Spiral CUDA work on similar-sized MoE models.

Performance

Apple M2 Max

Configuration	Decode	Prefill
f16 KV + flash attention	~30 tok/s	70-470 tok/s (cache-state dependent)

NVIDIA (CUDA)

Validation runs pending; numbers will be published in v1.1.

Memory footprint

Sarvam-30B weights + 8K context KV cache:

Method	Total
Spiral_Q4	~19 GB
Q4_K_M	~19 GB
bf16	~62 GB

Both 4-bit variants fit comfortably in 24 GB of unified memory or VRAM for the 8K-context shipping config. The 30B MoE has only 4 KV heads (heavy GQA), so the KV cache stays small even at long contexts.

Files

File	Size
`sarvam-30b-Spiral_Q4.gguf`	18.32 GB
`sarvam-30b-Spiral_Q4.spiralcb`	78 MB

Both files are required. The .spiralcb sidecar is loaded by the Spiral runtime alongside the GGUF — it contains the per-tensor rotation matrices and Lloyd-Max codebooks that the kernel uses to decode INT4 weights.
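
As a rough consistency check, a file size and an average bits-per-weight together imply a weight count. The sketch below assumes that GB means 10^9 bytes and that the whole GGUF is weight data at the average bpw from the perplexity table; metadata and the sidecar are ignored. Both are simplifications for the arithmetic, not statements about the file layout, and the function name is illustrative.

```python
def implied_weights(size_gb: float, bpw: float) -> float:
    """Number of weights implied by a file size (decimal GB) and an average bits-per-weight."""
    return size_gb * 1e9 * 8 / bpw

spiral = implied_weights(18.32, 4.56)   # Spiral_Q4 GGUF size and bpw
bf16 = implied_weights(64.3, 16.0)      # bf16 reference GGUF size and bpw

print(f"Spiral_Q4: {spiral / 1e9:.2f}B weights, bf16: {bf16 / 1e9:.2f}B weights")
print(f"size ratio bf16 / Spiral_Q4: {64.3 / 18.32:.2f}x")
```

Both estimates come out near 32 billion weights, so the two published sizes agree with the stated 4.56 bpw under these assumptions.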

Quick start

Spiral ships with three wrappers — spiral-download, spiral-chat, spiral-serve — that handle the model fetch and inference flags.

Install

brew install reinforceai/spiral/spiral

The brew formula installs the Spiral fork of llama.cpp plus the wrappers. Standard upstream llama.cpp will not load Spiral-compressed models.

Interactive chat

spiral-chat --model sarvam-30b-spiral

First run auto-downloads the GGUF and codebook to ~/.spiral/models/sarvam-30b-spiral/. Subsequent runs use the local cache.

Single prompt

spiral-chat --model sarvam-30b-spiral \
    --prompt "भारत के बारे में कुछ बताओ"

OpenAI-compatible API server

spiral-serve --model sarvam-30b-spiral --port 8080

curl http://localhost:8080/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -d '{
      "messages": [{"role": "user", "content": "Who are you?"}]
    }'
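
The same request can be sent from Python using only the standard library. This is a sketch that assumes spiral-serve is listening on localhost:8080 as above and returns the usual OpenAI chat-completions response shape (choices[0].message.content); the function names are illustrative and not part of Spiral.

```python
import json
import urllib.request

def build_request(prompt: str, base_url: str = "http://localhost:8080") -> urllib.request.Request:
    """Build a POST to /v1/chat/completions carrying a single user message."""
    payload = {"messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def chat(prompt: str, base_url: str = "http://localhost:8080", timeout: float = 120.0) -> str:
    """Send the prompt and return the assistant reply (requires a running server)."""
    with urllib.request.urlopen(build_request(prompt, base_url), timeout=timeout) as resp:
        body = json.load(resp)
    # Assumed OpenAI-style response layout.
    return body["choices"][0]["message"]["content"]
```

With the server running, print(chat("Who are you?")) prints the model's reply.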

Build from source

For CUDA, or to develop against the framework:

git clone https://github.com/ReinforceAI/spiral
cd spiral
cmake -B build -DGGML_METAL=ON   # or -DGGML_CUDA=ON
cmake --build build -j --target llama-cli llama-perplexity llama-server

Then run llama-cli directly with SPIRAL_CODEBOOK_PATH set to the codebook file:

SPIRAL_CODEBOOK_PATH=sarvam-30b-Spiral_Q4.spiralcb \
./build/bin/llama-cli \
    -m sarvam-30b-Spiral_Q4.gguf \
    -ctk f16 -ctv f16 -fa on \
    --jinja -cnv -c 8192 --temp 0.7 --top-p 0.95

Recommended sampling params (temp=0.7, top-p=0.95) match Sarvam AI's published defaults for Sarvam-30B.

How it works

Spiral applies a learned geometric transform to weight matrices before scalar quantization. The transform produces weight distributions that are far more amenable to low-bit encoding than the original basis — outlier channels (the primary source of quantization error) are eliminated analytically rather than handled with calibration data.
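
Spiral's transform is not specified in this card, so the following is only a generic illustration of why a rotation can help low-bit scalar quantization. It swaps in a fixed orthonormal Walsh-Hadamard rotation for the learned transform and a uniform absmax 4-bit grid for the Lloyd-Max codebooks; the function names and the synthetic weight row are assumptions made for the demo.

```python
import math
import random

def hadamard_rotate(x):
    """Orthonormal fast Walsh-Hadamard transform (self-inverse); len(x) must be a power of two."""
    y = list(x)
    h = 1
    while h < len(y):
        for i in range(0, len(y), 2 * h):
            for j in range(i, i + h):
                y[j], y[j + h] = y[j] + y[j + h], y[j] - y[j + h]
        h *= 2
    scale = 1.0 / math.sqrt(len(y))
    return [v * scale for v in y]

def quantize_int4(x):
    """Round to a symmetric 4-bit grid scaled by the largest magnitude, then dequantize."""
    scale = max(abs(v) for v in x) / 7.0
    return [max(-8, min(7, round(v / scale))) * scale for v in x]

def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)

random.seed(0)
# Synthetic weight row: small Gaussian weights plus one large outlier.
w = [random.gauss(0.0, 0.02) for _ in range(1024)]
w[17] = 1.0

direct = quantize_int4(w)
# Rotate, quantize in the rotated basis, then rotate back.
rotated = hadamard_rotate(quantize_int4(hadamard_rotate(w)))

print(f"MSE without rotation: {mse(w, direct):.2e}")
print(f"MSE with rotation:    {mse(w, rotated):.2e}")
```

On this synthetic row the rotated path should show a clearly lower error, because the outlier no longer sets the quantization step for every other weight.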

No calibration data is required. The artifact is a deterministic function of the model weights. Anyone with the upstream weights reconstructs bit-identical files. The calibration-free property buys reproducibility, domain neutrality (no English bias from a wikitext calibration corpus), and audit-friendliness for multilingual deployment.
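
Bit-identical reproduction can be checked by comparing file digests. A small sketch using SHA-256 follows; the function names are illustrative, and since this card publishes no reference digests, the comparison is between a locally rebuilt file and the downloaded one.

```python
import hashlib
from pathlib import Path

def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so multi-GB GGUFs do not need to fit in memory."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def same_artifact(path_a, path_b):
    """True if the two files are byte-for-byte identical (up to SHA-256 collisions)."""
    return sha256_of(path_a) == sha256_of(path_b)
```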

Source: github.com/ReinforceAI/spiral

Limitations

Indic perplexity not measured in v1.0. The headline number is English wikitext only. Multilingual perplexity is on the v1.1 roadmap.
HumanEval / MMLU not measured. Sarvam-30B is not primarily a code or knowledge-recall model; behavioral evals are deferred.
CUDA throughput pending. Build verified, perplexity validation pending. v1.1 will publish H100 numbers.
KV cache compression not in this release. Long-context workloads run at f16 KV. Future releases may revisit.
Custom llama.cpp build required. Stock upstream llama.cpp does not load Spiral models — the SPIRAL_INT4 ggml type and rotation hooks live in the Spiral fork.
Research-grade release. APIs may change before 2.0.

Methodology

Perplexity

wikitext-2-raw-v1 test split
llama-perplexity from the Spiral fork of llama.cpp, ctx=512
Same tokenizer source across all three rows of the comparison table
Spiral_Q4 measured against the official Sarvam-team Q4_K_M (multi-shard) and a bf16 GGUF converted via convert_hf_to_gguf.py from the upstream HF weights
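
For readers reproducing the numbers, perplexity is the exponential of the mean per-token negative log-likelihood. The sketch below computes it from a list of token log-probabilities and attaches a delta-method uncertainty. It is a generic formulation with illustrative names, not a transcription of llama-perplexity's internals, and it assumes at least two scored tokens.

```python
import math

def perplexity(token_logprobs):
    """Return (ppl, uncertainty) from natural-log probabilities of the scored tokens."""
    n = len(token_logprobs)
    nll = -sum(token_logprobs) / n                      # mean NLL, nats/token
    var = sum((-lp - nll) ** 2 for lp in token_logprobs) / (n - 1)
    stderr = math.sqrt(var / n)                         # standard error of the mean NLL
    ppl = math.exp(nll)
    return ppl, ppl * stderr                            # delta method: d(exp(x)) = exp(x) dx

# Toy input: every token assigned probability 1/4 gives PPL 4 with no spread.
ppl, unc = perplexity([math.log(0.25)] * 8)
print(f"PPL = {ppl:.4f} +/- {unc:.4f}")
```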

Reproduction recipe and scripts: github.com/ReinforceAI/spiral

Acknowledgments

Sarvam AI — Sarvam-30B base model under Apache 2.0. The base model's quality, multilingual coverage, and architectural choices (fused QKV, GQA-16:1, 19-layer MoE with 128 routed experts) made this release possible.
llama.cpp — inference engine, GGUF format, Metal and CUDA backends.
bailingmoe2 — upstream llama.cpp support for the bailingmoe2 architecture family, which Sarvam-30B uses.
The broader open-source ML community — quantization theory (GPTQ, AWQ, QuIP#), rotation methods (QuIP, SpinQuant), and Lloyd-Max optimal quantization research laid the groundwork that Spiral builds upon.

Citation

@misc{spiral2026,
  title={Spiral: Geometric Compression of Rotated Transformers},
  author={Deshwal, Viraj},
  year={2026},
  publisher={ReinforceAI},
  url={https://github.com/ReinforceAI/spiral}
}

License

Inference engine: Based on llama.cpp (MIT)
Spiral compression framework: ReinforceAI
Model weights: Apache 2.0 (inherited from Sarvam-30B)
