Sarvam-30B Spiral

Release v1.0: Spiral_Q4 scheme

Calibration-free INT4 compression of Sarvam-30B that beats the bf16 baseline. Spiral_Q4 lands at wikitext-2 PPL 11.6398, lower than both the bf16 reference (11.8490) and Sarvam AI's official imatrix-calibrated Q4_K_M GGUF (11.9627), with no calibration data, no fine-tuning, and no representative samples. Production-stable on Apple Silicon (Metal). The CUDA build is supported; validation is pending.

This is the v1.0 release of Spiral compression applied to sarvamai/sarvam-30b, the 30B Mixture-of-Experts model from Sarvam AI, trained for 22 Indian languages plus English. Architecture: sarvam_moe (registered in llama.cpp as bailingmoe2).


Quality

Perplexity (wikitext-2-raw-v1)

Measured with llama-perplexity from the Spiral fork of llama.cpp, ctx=512:

Method | bpw | Calibration data | NLL (nats/token) | PPL | Gap vs bf16 (nats/token)
bf16 (reference, 64.3 GB) | 16.00 | n/a | 2.4723 | 11.8490 ± 0.10 | 0
Q4_K_M (Sarvam team, imatrix English) | ~4.50 | English wikitext | 2.4818 | 11.9627 ± 0.10 | +0.0096
Spiral_Q4 (this model) | 4.56 | None | 2.4541 | 11.6398 ± 0.10 | −0.018

Spiral_Q4 is the first calibration-free quantization scheme on Sarvam-30B to score below the bf16 baseline on wikitext-2 perplexity, at a bit budget comparable to Q4_K_M. Each PPL figure carries ±0.10 of measurement noise, so both the negative gap versus bf16 and the margin over the imatrix Q4_K_M (which calibrates on English text) should be read against that error bar. The results are reproducible from the artifact and sidecar in this repository.
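The NLL and PPL columns are two views of the same measurement: perplexity is the exponential of the mean per-token negative log-likelihood. A minimal Python sketch that re-derives the PPL and gap columns from the NLL values in the table follows; the dictionary and variable names are illustrative and not part of any Spiral tooling.

```python
import math

# Mean negative log-likelihood (nats/token), copied from the table above.
nll = {"bf16": 2.4723, "Q4_K_M": 2.4818, "Spiral_Q4": 2.4541}

# Perplexity is exp(mean NLL).
ppl = {name: math.exp(value) for name, value in nll.items()}

# Gap vs bf16 in nats/token; negative means lower loss than the reference.
gap = {name: value - nll["bf16"] for name, value in nll.items()}

for name in nll:
    print(f"{name:10s} PPL={ppl[name]:.2f} gap={gap[name]:+.4f}")
```

Because the NLL values are rounded to four decimals, the recomputed perplexities agree with the reported ones only to roughly two decimal places.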

Indic evaluations

Not measured in v1.0. Sarvam-30B's design center is 22 Indian languages, and wikitext-2 is English-only. Indic-specific perplexity (Sangraha or similar) is on the roadmap and is expected to be a strength of the calibration-free approach: Q4_K_M's imatrix is calibrated on English text, which may bias it against Indic activations.


Cross-platform validation

The same (.gguf, .spiralcb) pair runs on Metal (Apple Silicon) and CUDA (NVIDIA). Generated text on identical prompts is semantically equivalent across platforms; minor token-level drift comes from differences in fp16 operation ordering across the deep layer stack, as with stock K-quant GGUFs.

CUDA validation in this release: build verified on RTX PRO 6000 Blackwell and H100. Inference throughput numbers are pending; expect 3-5× Metal decode rates, based on prior Spiral CUDA work on similar-sized MoE models.


Performance

Apple M2 Max

Configuration | Decode | Prefill
f16 KV + flash attention | ~30 tok/s | 70-470 tok/s (cache-state dependent)
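As a rough way to read these rates, end-to-end response time is approximately the prompt length divided by the prefill rate, plus the output length divided by the decode rate. The sketch below does that arithmetic, assuming the rates stay constant over the request; the function name and the example token counts are illustrative.

```python
def estimate_latency_s(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """Seconds to process the prompt and then generate the output, at fixed rates."""
    return prompt_tokens / prefill_tps + output_tokens / decode_tps

# Hypothetical request: 1,000-token prompt, 200 generated tokens.
# Prefill bounds (70 and 470 tok/s) and decode (~30 tok/s) come from the table above.
best = estimate_latency_s(1000, 200, prefill_tps=470, decode_tps=30)
worst = estimate_latency_s(1000, 200, prefill_tps=70, decode_tps=30)
print(f"{best:.1f} s to {worst:.1f} s")
```

At these rates decode dominates once the prompt cache is warm, while a cold cache makes prefill the larger term.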

NVIDIA (CUDA)

Validation runs pending; numbers will be published in v1.1.


Memory footprint

Sarvam-30B weights + 8K context KV cache:

Method | Total
Spiral_Q4 | ~19 GB
Q4_K_M | ~19 GB
bf16 | ~62 GB

Fits comfortably in 24 GB of unified memory / VRAM for the 8K-context shipping config. The 30B MoE has only 4 KV heads (heavy GQA), so KV cache stays small even at long contexts.
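The footprint can be sanity-checked with back-of-envelope arithmetic: the KV cache holds one K and one V vector per layer, per KV head, per position. The sketch below uses the layer count, KV head count, and file sizes quoted in this card. The head dimension is not stated here, so `head_dim=128` is an assumption for illustration, and the function name is hypothetical.

```python
GB = 1e9  # decimal gigabytes

def kv_cache_bytes(layers, kv_heads, head_dim, context, bytes_per_value=2):
    """K and V tensors for every layer, KV head, and position (f16 = 2 bytes)."""
    return 2 * layers * kv_heads * head_dim * context * bytes_per_value

# From this card: 19 layers, 4 KV heads, 8K context, f16 KV.
# head_dim=128 is an ASSUMPTION for illustration; check the model config.
kv = kv_cache_bytes(layers=19, kv_heads=4, head_dim=128, context=8192)

weights = 18.32 * GB  # sarvam-30b-Spiral_Q4.gguf
codebook = 78e6       # sarvam-30b-Spiral_Q4.spiralcb
total_gb = (weights + codebook + kv) / GB

# Implied bits per weight, taking the parameter count from the 64.3 GB bf16 file
# (treats every bf16 tensor as exactly 16 bits, which is only approximately true).
params = 64.3 * GB * 8 / 16
implied_bpw = weights * 8 / params

print(f"KV cache {kv / GB:.2f} GB, total {total_gb:.2f} GB, ~{implied_bpw:.2f} bpw")
```

Under these assumptions the KV cache is a small fraction of the total, which is consistent with the weights dominating the ~19 GB figure.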


Files

File | Size
sarvam-30b-Spiral_Q4.gguf | 18.32 GB
sarvam-30b-Spiral_Q4.spiralcb | 78 MB

Both files are required. The .spiralcb sidecar is loaded by the Spiral runtime alongside the GGUF; it contains the per-tensor rotation matrices and Lloyd-Max codebooks that the kernel uses to decode INT4 weights.
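For readers unfamiliar with Lloyd-Max codebooks, the sketch below shows the textbook algorithm on synthetic data: 16 levels (4 bits) are fitted by alternating nearest-level assignment and centroid updates, after which each weight is stored as an index and decoded by table lookup. This is a generic illustration only. The .spiralcb layout, the per-tensor rotation step, and Spiral's actual fitting procedure are not described in this card and are not reproduced here.

```python
import random

def lloyd_max(values, levels=16, iters=25):
    """Fit a scalar codebook by alternating assignment and centroid update."""
    lo, hi = min(values), max(values)
    # Start from evenly spaced levels across the observed range.
    codebook = [lo + (hi - lo) * (i + 0.5) / levels for i in range(levels)]
    for _ in range(iters):
        buckets = [[] for _ in range(levels)]
        for v in values:
            idx = min(range(levels), key=lambda i: abs(v - codebook[i]))
            buckets[idx].append(v)
        # Move each level to the mean of its bucket; keep levels with empty buckets.
        codebook = [sum(b) / len(b) if b else c for b, c in zip(buckets, codebook)]
    return sorted(codebook)

def encode(values, codebook):
    """Map each value to the index of its nearest codebook level."""
    return [min(range(len(codebook)), key=lambda i: abs(v - codebook[i]))
            for v in values]

def decode(indices, codebook):
    """Table lookup: the only work needed at inference time."""
    return [codebook[i] for i in indices]

rng = random.Random(0)
weights = [rng.gauss(0.0, 1.0) for _ in range(2000)]  # synthetic stand-in weights
codebook = lloyd_max(weights)
indices = encode(weights, codebook)                   # each index fits in 4 bits
restored = decode(indices, codebook)
mse = sum((w - r) ** 2 for w, r in zip(weights, restored)) / len(weights)
print(f"16-level codebook, reconstruction MSE = {mse:.4f}")
```

The levels end up denser where the weights are concentrated, which is the advantage of a fitted codebook over an evenly spaced grid.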


Quick start

Spiral ships with three wrappers (spiral-download, spiral-chat, spiral-serve) that handle the model fetch and inference flags.

Install

brew install reinforceai/spiral/spiral

The brew formula installs the Spiral fork of llama.cpp plus the wrappers. Standard upstream llama.cpp will not load Spiral-compressed models.

Interactive chat

spiral-chat --model sarvam-30b-spiral

First run auto-downloads the GGUF and codebook to ~/.spiral/models/sarvam-30b-spiral/. Subsequent runs use the local cache.
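To confirm that both artifacts are present in the cache before launching, a small check along these lines may help. It assumes the cached file names match those in the Files table, which this card does not state explicitly; the helper name is hypothetical.

```python
from pathlib import Path

# File names are ASSUMED to match the "Files" table; the wrappers may differ.
EXPECTED = ("sarvam-30b-Spiral_Q4.gguf", "sarvam-30b-Spiral_Q4.spiralcb")

def missing_files(cache_dir):
    """Return the expected files that are absent from cache_dir."""
    cache_dir = Path(cache_dir).expanduser()
    return [name for name in EXPECTED if not (cache_dir / name).is_file()]

missing = missing_files("~/.spiral/models/sarvam-30b-spiral")
print("cache complete" if not missing else f"missing: {missing}")
```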

Single prompt

# Hindi prompt: "Tell me something about India"
spiral-chat --model sarvam-30b-spiral \
    --prompt "भारत के बारे में कुछ बताओ"

OpenAI-compatible API server

spiral-serve --model sarvam-30b-spiral --port 8080
curl http://localhost:8080/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -d '{
      "messages": [{"role": "user", "content": "Who are you?"}]
    }'
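The same request can be issued from Python with only the standard library. The sketch below mirrors the curl call above. The response parsing assumes the usual OpenAI chat-completions shape (`choices[0].message.content`), which has not been verified against spiral-serve, and the function names are illustrative.

```python
import json
import urllib.request

def build_chat_request(prompt, base_url="http://localhost:8080"):
    """Build (but do not send) a POST to the /v1/chat/completions endpoint."""
    payload = {"messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def send(request, timeout=120):
    """Send the request and return the assistant's reply text.

    ASSUMES the standard OpenAI response shape; adjust if the server differs.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]

request = build_chat_request("Who are you?")
# With spiral-serve running on port 8080: print(send(request))
```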

Build from source

For CUDA, or to develop against the framework:

git clone https://github.com/ReinforceAI/spiral
cd spiral
cmake -B build -DGGML_METAL=ON   # or -DGGML_CUDA=ON
cmake --build build -j --target llama-cli llama-perplexity llama-server

Then run llama-cli directly with SPIRAL_CODEBOOK_PATH set to the codebook file:

SPIRAL_CODEBOOK_PATH=sarvam-30b-Spiral_Q4.spiralcb \
./build/bin/llama-cli \
    -m sarvam-30b-Spiral_Q4.gguf \
    -ctk f16 -ctv f16 -fa on \
    --jinja -cnv -c 8192 --temp 0.7 --top-p 0.95

Recommended sampling params (temp=0.7, top-p=0.95) match Sarvam AI's published defaults for Sarvam-30B.


How it works

Spiral applies a learned geometric transform to weight matrices before scalar quantization. The transform produces weight distributions that are far more amenable to low-bit encoding than the original basis: outlier channels (the primary source of quantization error) are eliminated analytically rather than handled with calibration data.
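Why a rotation helps can be seen in a toy example. The sketch below substitutes a fixed orthonormal Hadamard transform for Spiral's learned transform, which is not specified in this card, and a plain 16-level uniform grid for the quantizer. It shows how one outlier channel forces a coarse grid in the original basis, while the rotated basis spreads that energy across all coordinates.

```python
import math

def hadamard_rotate(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    y = list(x)
    n = len(y)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = y[i], y[i + h]
                y[i], y[i + h] = a + b, a - b
        h *= 2
    return [v / math.sqrt(n) for v in y]

def uniform_quantize(x, levels=16):
    """Round to `levels` evenly spaced values spanning [-max|x|, +max|x|]."""
    peak = max(abs(v) for v in x)
    step = 2 * peak / (levels - 1)
    return [round((v + peak) / step) * step - peak for v in x]

def mse(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)

# Toy weight row: small values plus one large outlier channel.
row = [0.01 * ((i * 37) % 11 - 5) for i in range(64)]
row[7] = 4.0

direct = uniform_quantize(row)
rotated = hadamard_rotate(row)
# The orthonormal Hadamard matrix is symmetric, so applying it again inverts it.
via_rotation = hadamard_rotate(uniform_quantize(rotated))

print(f"direct MSE  = {mse(row, direct):.5f}")
print(f"rotated MSE = {mse(row, via_rotation):.5f}")
```

Because the rotation is orthonormal, quantization error measured in the rotated basis carries over unchanged to the original basis, so shrinking the peak value directly shrinks the reconstruction error.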

No calibration data is required. The artifact is a deterministic function of the model weights. Anyone with the upstream weights reconstructs bit-identical files. The calibration-free property buys reproducibility, domain neutrality (no English bias from a wikitext calibration corpus), and audit-friendliness for multilingual deployment.
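One way to check the bit-identical claim is to hash an independently rebuilt artifact and compare it with the downloaded one. The sketch below is a generic SHA-256 comparison; this card does not publish reference checksums, and the paths in the usage note are hypothetical.

```python
import hashlib

def sha256_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so multi-GB artifacts never sit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def same_artifact(path_a, path_b):
    """True when two files are byte-for-byte identical, judged by SHA-256."""
    return sha256_file(path_a) == sha256_file(path_b)
```

For example, `same_artifact("sarvam-30b-Spiral_Q4.gguf", "rebuilt/sarvam-30b-Spiral_Q4.gguf")` would compare the published file with a local rebuild stored under a hypothetical `rebuilt/` directory.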

Source: github.com/ReinforceAI/spiral


Limitations

  • Indic perplexity not measured in v1.0. Headline number is English wikitext only. Multi-lingual perplexity is on the v1.1 roadmap.
  • HumanEval / MMLU not measured. Sarvam-30B is not primarily a code or knowledge-recall model; behavioral evals are deferred.
  • CUDA throughput pending. Build verified, perplexity validation pending. v1.1 will publish H100 numbers.
  • KV cache compression not in this release. Long-context workloads run at f16 KV. Future releases may revisit.
  • Custom llama.cpp build required. Stock upstream llama.cpp does not load Spiral models: the SPIRAL_INT4 ggml type and rotation hooks live in the Spiral fork.
  • Research-grade release. APIs may change before 2.0.

Methodology

Perplexity

  • wikitext-2-raw-v1 test split
  • llama-perplexity from the Spiral fork of llama.cpp, ctx=512
  • Same tokenizer source across all three rows of the comparison table
  • Spiral_Q4 measured against the official Sarvam-team Q4_K_M (multi-shard) and a bf16 GGUF converted via convert_hf_to_gguf.py from the upstream HF weights

Reproduction recipe and scripts: github.com/ReinforceAI/spiral


Acknowledgments

  • Sarvam AI: Sarvam-30B base model under Apache 2.0. The base model's quality, multilingual coverage, and architectural choices (fused QKV, GQA-16:1, 19-layer MoE with 128 routed experts) made this release possible.
  • llama.cpp: inference engine, GGUF format, Metal and CUDA backends.
  • bailingmoe2: upstream llama.cpp support for the bailingmoe2 architecture family, which Sarvam-30B uses.
  • The broader open-source ML community: quantization theory (GPTQ, AWQ, QuIP#), rotation methods (QuIP, SpinQuant), and Lloyd-Max optimal quantization research laid the groundwork that Spiral builds upon.

Citation

@misc{spiral2026,
  title={Spiral: Geometric Compression of Rotated Transformers},
  author={Deshwal, Viraj},
  year={2026},
  publisher={ReinforceAI},
  url={https://github.com/ReinforceAI/spiral}
}

License

  • Inference engine: Based on llama.cpp (MIT)
  • Spiral compression framework: ReinforceAI
  • Model weights: Apache 2.0 (inherited from Sarvam-30B)