Sarvam-30B Spiral
Release v1.0: Spiral_Q4 scheme
Calibration-free INT4 compression of Sarvam-30B that beats the bf16 baseline. Spiral_Q4 lands at wikitext-2 PPL 11.6398, lower than both the bf16 reference (11.8490) and Sarvam AI's official Q4_K_M imatrix-calibrated GGUF (11.9627), with no calibration data, no fine-tuning, and no representative samples. Production-stable on Apple Silicon (Metal). CUDA build supported (validation pending).
This is the v1.0 release of Spiral compression applied to sarvamai/sarvam-30b, the 30B Mixture-of-Experts model from Sarvam AI, trained for 22 Indian languages plus English. Architecture: sarvam_moe (registered in llama.cpp as bailingmoe2).
Quality
Perplexity (wikitext-2-raw-v1)
Measured with llama-perplexity from the Spiral fork of llama.cpp, ctx=512:
| Method | bpw | Calibration data | NLL (nats/token) | PPL | Gap vs bf16 (nats/token) |
|---|---|---|---|---|---|
| bf16 (reference, 64.3 GB) | 16.00 | n/a | 2.4723 | 11.8490 ± 0.10 | 0 |
| Q4_K_M (Sarvam team, imatrix English) | ~4.50 | English wikitext | 2.4818 | 11.9627 ± 0.10 | +0.0096 |
| Spiral_Q4 (this model) | 4.56 | None | 2.4541 | 11.6398 ± 0.10 | -0.018 |
Spiral_Q4 is the first calibration-free quantization scheme on Sarvam-30B that measurably beats the bf16 baseline on wikitext-2 perplexity, at a bit budget comparable to Q4_K_M. Its margin over bf16 (0.21 PPL) and over imatrix Q4_K_M (0.32 PPL, calibrated on English text) both exceed the ±0.10 PPL uncertainty on each measurement, although the bf16 margin is narrow: the two error bars are separated by only about 0.01 PPL. The numbers are reproducible from the artifact and sidecar in this repository.
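The NLL and PPL columns are tied together by PPL = exp(NLL), so the gap column is the difference in NLL, or equivalently ln(PPL_quant / PPL_bf16). A minimal Python sketch of that arithmetic, using only the PPL values from the table above (the helper name `nll_gap` is illustrative and not part of Spiral or llama.cpp):

```python
import math

def nll_gap(ppl_quant: float, ppl_ref: float) -> float:
    """Gap in nats/token between two runs, given their perplexities.

    PPL = exp(NLL), so NLL_quant - NLL_ref = ln(PPL_quant / PPL_ref).
    """
    return math.log(ppl_quant / ppl_ref)

# PPL values copied from the table above.
PPL_BF16 = 11.8490
PPL_Q4_K_M = 11.9627
PPL_SPIRAL_Q4 = 11.6398

gap_q4_k_m = nll_gap(PPL_Q4_K_M, PPL_BF16)     # positive: worse than bf16
gap_spiral = nll_gap(PPL_SPIRAL_Q4, PPL_BF16)  # negative: better than bf16

print(f"Q4_K_M    gap: {gap_q4_k_m:+.4f} nats/token")
print(f"Spiral_Q4 gap: {gap_spiral:+.4f} nats/token")
```

A negative gap means the quantized model assigns a higher average likelihood to the test text than the reference does.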
Indic evaluations
Not measured in v1.0. Sarvam-30B's design center is 22 Indian languages, and wikitext-2 is English-only. Indic-specific perplexity (Sangraha or similar) is on the roadmap and is expected to be a strength of the calibration-free approach: Q4_K_M's imatrix is calibrated on English text, which may bias it against Indic activations.
Cross-platform validation
The same (.gguf, .spiralcb) pair runs on Metal (Apple Silicon) and CUDA (NVIDIA). Generated text on identical prompts is semantically equivalent across platforms; minor token-level drift comes from fp16 ordering differences in the deep stack, same as for stock K-quant GGUFs.
CUDA validation in this release: build verified on RTX PRO 6000 Blackwell and H100. Inference throughput numbers are pending; expect 3-5× Metal decode rates, based on prior Spiral CUDA work on similar-sized MoE models.
Performance
Apple M2 Max
| Configuration | Decode | Prefill |
|---|---|---|
| f16 KV + flash attention | ~30 tok/s | 70-470 tok/s (cache-state dependent) |
NVIDIA (CUDA)
Validation runs pending; numbers will be published in v1.1.
Memory footprint
Sarvam-30B weights + 8K context KV cache:
| Method | Total |
|---|---|
| Spiral_Q4 | ~19 GB |
| Q4_K_M | ~19 GB |
| bf16 | ~65 GB |
Fits comfortably in 24 GB of unified memory / VRAM for the 8K-context shipping config. The 30B MoE has only 4 KV heads (heavy GQA), so KV cache stays small even at long contexts.
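A rough way to see why the KV cache stays small: its size scales with layers × KV heads × head dimension × context length. The sketch below uses the 4 KV heads and 19 layers stated in this card. The head dimension of 128 is an assumption for illustration (check the model config), and `kv_cache_bytes` is a hypothetical helper, not a Spiral or llama.cpp API.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_ctx: int, bytes_per_elem: int = 2) -> int:
    """Size of an attention KV cache: one K and one V tensor per layer."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * n_ctx

N_LAYERS = 19    # from this card ("19-layer MoE")
N_KV_HEADS = 4   # from this card
HEAD_DIM = 128   # ASSUMPTION: not stated in this card
F16_BYTES = 2    # f16 KV, as in the shipping config

for n_ctx in (8192, 32768):
    size = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, n_ctx, F16_BYTES)
    print(f"ctx={n_ctx:>6}: {size / 1e9:.2f} GB")
```

Under these assumptions the 8K cache is a few hundred megabytes, small next to the 18.32 GB weight file.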
Files
| File | Size |
|---|---|
| sarvam-30b-Spiral_Q4.gguf | 18.32 GB |
| sarvam-30b-Spiral_Q4.spiralcb | 78 MB |
Both files are required. The .spiralcb sidecar is loaded by the Spiral runtime alongside the GGUF; it contains the per-tensor rotation matrices and Lloyd-Max codebooks that the kernel uses to decode INT4 weights.
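Because the GGUF cannot be decoded without its sidecar, it can be worth checking that both files are present before launching. A small sketch, assuming the two files sit in the same directory and share a stem as in the table above (`find_spiral_pair` is a hypothetical helper, not part of the Spiral tooling):

```python
from pathlib import Path

def find_spiral_pair(gguf_path: str) -> tuple[Path, Path]:
    """Return (gguf, sidecar) paths, raising if either file is missing.

    ASSUMPTION: the sidecar lives next to the GGUF with a .spiralcb suffix.
    """
    gguf = Path(gguf_path).expanduser()
    sidecar = gguf.with_suffix(".spiralcb")
    missing = [str(p) for p in (gguf, sidecar) if not p.is_file()]
    if missing:
        raise FileNotFoundError("missing Spiral file(s): " + ", ".join(missing))
    return gguf, sidecar
```

Calling `find_spiral_pair("sarvam-30b-Spiral_Q4.gguf")` either returns both paths or fails before any model loading starts.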
Quick start
Spiral ships with three wrappers (spiral-download, spiral-chat, spiral-serve) that handle the model fetch and inference flags.
Install
brew install reinforceai/spiral/spiral
The brew formula installs the Spiral fork of llama.cpp plus the wrappers. Standard upstream llama.cpp will not load Spiral-compressed models.
Interactive chat
spiral-chat --model sarvam-30b-spiral
First run auto-downloads the GGUF and codebook to ~/.spiral/models/sarvam-30b-spiral/. Subsequent runs use the local cache.
Single prompt
spiral-chat --model sarvam-30b-spiral \
--prompt "भारत के बारे में कुछ बताओ"
OpenAI-compatible API server
spiral-serve --model sarvam-30b-spiral --port 8080
curl http://localhost:8080/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"messages": [{"role": "user", "content": "Who are you?"}]
}'
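The same request can be issued from Python with only the standard library. This is a sketch that mirrors the curl call above; it assumes the server returns the usual OpenAI chat-completions response shape (`choices[0].message.content`), and `build_chat_request` and `ask` are illustrative names rather than part of any Spiral client.

```python
import json
import urllib.request

def build_chat_request(prompt: str,
                       base_url: str = "http://localhost:8080") -> urllib.request.Request:
    """Build a POST to the OpenAI-style chat completions route."""
    payload = {"messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

def ask(prompt: str, base_url: str = "http://localhost:8080") -> str:
    """Send the prompt and return the assistant text.

    ASSUMPTION: the reply follows the OpenAI schema
    (choices[0].message.content).
    """
    with urllib.request.urlopen(build_chat_request(prompt, base_url)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With spiral-serve running on port 8080, `print(ask("Who are you?"))` should print the model's reply.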
Build from source
For CUDA, or to develop against the framework:
git clone https://github.com/ReinforceAI/spiral
cd spiral
cmake -B build -DGGML_METAL=ON # or -DGGML_CUDA=ON
cmake --build build -j --target llama-cli llama-perplexity llama-server
Then run llama-cli directly with SPIRAL_CODEBOOK_PATH set to the codebook file:
SPIRAL_CODEBOOK_PATH=sarvam-30b-Spiral_Q4.spiralcb \
./build/bin/llama-cli \
-m sarvam-30b-Spiral_Q4.gguf \
-ctk f16 -ctv f16 -fa on \
--jinja -cnv -c 8192 --temp 0.7 --top-p 0.95
Recommended sampling params (temp=0.7, top-p=0.95) match Sarvam AI's published defaults for Sarvam-30B.
How it works
Spiral applies a learned geometric transform to weight matrices before scalar quantization. The transform produces weight distributions that are far more amenable to low-bit encoding than those in the original basis: outlier channels (the primary source of quantization error) are eliminated analytically rather than handled with calibration data.
No calibration data is required. The artifact is a deterministic function of the model weights, so anyone with the upstream weights can reconstruct bit-identical files. The calibration-free property buys reproducibility, domain neutrality (no English bias from a wikitext calibration corpus), and audit-friendliness for multilingual deployment.
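The general idea of rotating before scalar quantization can be sketched with textbook components. The example below is not Spiral's transform: it substitutes a fixed Hadamard rotation for the per-tensor rotation, and fits a 16-level Lloyd-Max codebook to a synthetic matrix with one outlier channel. All function names are illustrative.

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Orthonormal Sylvester-Hadamard matrix; n must be a power of two."""
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)

def lloyd_max(x: np.ndarray, levels: int = 16, iters: int = 25) -> np.ndarray:
    """1-D Lloyd-Max codebook: alternate nearest-level assignment and centroid update."""
    cb = np.quantile(x, (np.arange(levels) + 0.5) / levels)  # quantile init
    for _ in range(iters):
        idx = np.abs(x[:, None] - cb[None, :]).argmin(axis=1)
        for k in range(levels):
            members = x[idx == k]
            if members.size:
                cb[k] = members.mean()
    return cb

def encode(w: np.ndarray, cb: np.ndarray) -> np.ndarray:
    """Map each weight to the index of its nearest level (4 bits for 16 levels)."""
    return np.abs(w[..., None] - cb).argmin(axis=-1).astype(np.uint8)

rng = np.random.default_rng(0)
n = 64
w = rng.standard_normal((n, n))
w[:, 3] = 100.0                      # synthetic outlier channel

rot = hadamard(n)                    # stand-in rotation, NOT Spiral's transform
w_rot = w @ rot                      # spread the outlier across all channels
codebook = lloyd_max(w_rot.ravel())
codes = encode(w_rot, codebook)      # what an INT4 file would store
w_hat = codebook[codes] @ rot.T      # decode: lookup, then inverse rotation

peak_before = np.abs(w).max()
peak_after = np.abs(w_rot).max()
rel_err = np.linalg.norm(w - w_hat) ** 2 / np.linalg.norm(w) ** 2
print(f"peak |w|: {peak_before:.1f} -> {peak_after:.1f}, relative error {rel_err:.2e}")
```

Every step here is a deterministic function of the weights (the seed only generates the toy matrix), which is the property that makes a calibration-free artifact reproducible. The toy shows the mechanics only and does not reproduce the perplexity results above.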
Source: github.com/ReinforceAI/spiral
Limitations
- Indic perplexity not measured in v1.0. Headline number is English wikitext only. Multilingual perplexity is on the v1.1 roadmap.
- HumanEval / MMLU not measured. Sarvam-30B is not primarily a code or knowledge-recall model; behavioral evals are deferred.
- CUDA throughput pending. Build verified, perplexity validation pending. v1.1 will publish H100 numbers.
- KV cache compression not in this release. Long-context workloads run at f16 KV. Future releases may revisit.
- Custom llama.cpp build required. Stock upstream llama.cpp does not load Spiral models; the SPIRAL_INT4 ggml type and rotation hooks live in the Spiral fork.
- Research-grade release. APIs may change before 2.0.
Methodology
Perplexity
- wikitext-2-raw-v1 test split
- llama-perplexity from the Spiral fork of llama.cpp, ctx=512
- Same tokenizer source across all three rows of the comparison table
- Spiral_Q4 measured against the official Sarvam-team Q4_K_M (multi-shard) and a bf16 GGUF converted via convert_hf_to_gguf.py from the upstream HF weights
Reproduction recipe and scripts: github.com/ReinforceAI/spiral
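Until you consult the repository scripts, a perplexity run can be approximated as follows. This is a sketch under stated assumptions: that the fork's llama-perplexity accepts the standard upstream flags (`-m`, `-f`, `-c`), that it reads `SPIRAL_CODEBOOK_PATH` the same way llama-cli does, and that the wikitext-2 test split is saved locally as `wiki.test.raw`. The helper names are hypothetical.

```python
import os
import subprocess

def perplexity_command(gguf: str, text_file: str, ctx: int = 512,
                       binary: str = "./build/bin/llama-perplexity") -> list[str]:
    """argv for a llama-perplexity run at the context length used in this card."""
    return [binary, "-m", gguf, "-f", text_file, "-c", str(ctx)]

def run_perplexity(gguf: str, codebook: str, text_file: str) -> None:
    """Run the measurement with the codebook path exported in the environment."""
    # ASSUMPTION: llama-perplexity honors SPIRAL_CODEBOOK_PATH like llama-cli.
    env = dict(os.environ, SPIRAL_CODEBOOK_PATH=codebook)
    subprocess.run(perplexity_command(gguf, text_file), env=env, check=True)
```

For example, `run_perplexity("sarvam-30b-Spiral_Q4.gguf", "sarvam-30b-Spiral_Q4.spiralcb", "wiki.test.raw")` would launch the run from the repository root after a source build.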
Acknowledgments
- Sarvam AI: Sarvam-30B base model under Apache 2.0. The base model's quality, multilingual coverage, and architectural choices (fused QKV, GQA-16:1, 19-layer MoE with 128 routed experts) made this release possible.
- llama.cpp: inference engine, GGUF format, Metal and CUDA backends.
- bailingmoe2: upstream llama.cpp support for the bailingmoe2 architecture family, which Sarvam-30B uses.
- The broader open-source ML community: quantization theory (GPTQ, AWQ, QuIP#), rotation methods (QuIP, SpinQuant), and Lloyd-Max optimal quantization research laid the groundwork that Spiral builds upon.
Citation
@misc{spiral2026,
title={Spiral: Geometric Compression of Rotated Transformers},
author={Deshwal, Viraj},
year={2026},
publisher={ReinforceAI},
url={https://github.com/ReinforceAI/spiral}
}
License
- Inference engine: Based on llama.cpp (MIT)
- Spiral compression framework: ReinforceAI
- Model weights: Apache 2.0 (inherited from Sarvam-30B)