You need to agree to share your contact information to access this model

This repository is publicly accessible, but you have to accept the conditions to access its files and content.

Log in or Sign Up to review the conditions and access this model content.

Sarvam-30B · Godspeed (W4A16, energy-optimized)

A compressed, energy-efficient build of sarvamai/sarvam-30b, prepared for the Resilient AI Challenge — Track 1 (text-to-text). The objective of that track is the lowest inference energy (joules over the full benchmark suite) on a single A100-80GB served via stock vllm serve, while retaining ≥80 % quality recovery in every evaluation category.

This model preserves the base model's reasoning quality while cutting per-token memory traffic — the dominant driver of decode-time energy in a batch-1, memory-bandwidth-bound regime — through careful, serving-safe 4-bit weight quantization.


Model details

Developed from sarvamai/sarvam-30b (SarvamMoEForCausalLM)
Architecture Mixture-of-Experts decoder — 19 layers (1 dense + 18 MoE), 128 routed + 1 shared expert, top-6 routing, 4 KV heads, hybrid "thinking" chat template
Precision W4A16 (4-bit weights, 16-bit activations), compressed-tensors / Marlin-ready
Serving vllm serve . --config vllm_config.yaml (stock vLLM, trust_remote_code)
Languages English + major Indic languages (Hindi, Tamil, Bengali, Telugu, Kannada, Malayalam, Marathi, Gujarati, Punjabi, …)

Quantization recipe

The recipe is deliberately quality-first and serving-safe:

  • GPTQ (calibrated, 4-bit) on the attention projections (query_key_value, dense) and the shared-expert MLP — the quality-sensitive, always-active paths — calibrated on a multilingual + math/STEM mix derived from sarvamai/indivibe.
  • RTN (round-to-nearest, 4-bit) on the 128 routed experts — data-free and fast, suitable for the sparsely-activated expert bank.
  • Kept in bf16 for stability and serving safety: the LM head (vLLM rejects a quantized lm_head), the router gates (routing must stay precise), and layer 0 (input-sensitive).

The original thinking chat template is retained unchanged, so the model behaves exactly like the base model under the evaluator's chat interface.


Why this is energy-efficient

At batch-1 decode the GPU is memory-bandwidth-bound: energy per generated token is governed by the number of bytes read from HBM per forward pass. Moving the attention, shared-expert and routed-expert weights from 16-bit to 4-bit reduces those bytes substantially, lowering joules-per-token without inflating the token count — quality is preserved, so the model does not "think longer" to compensate (a failure mode that erases byte savings). The model passes every category quality floor with margin, which is the prerequisite for being ranked on energy at all.


Intended use & limitations

  • Intended use: efficient text generation / reasoning for the challenge benchmark and general multilingual assistant tasks.
  • Limitations: inherits the base model's knowledge cutoff, biases and capabilities. W4A16 is the practical floor for stock-vLLM serving — sub-4-bit or lm_head quantization were evaluated and rejected because they either fail to serve on stock vLLM or degrade quality below the floors.

How to serve

vllm serve . --config vllm_config.yaml

The included vllm_config.yaml is stripped of infrastructure-specific parameters (no tensor-parallel-size, swap_space, etc.), per the challenge configuration-hygiene requirement, and sets max_model_len: 32768, trust_remote_code: true.


Prepared for the Resilient AI Challenge. Base model © Sarvam AI; please refer to the base model card for its license and usage terms.

Downloads last month
-
Safetensors
Model size
6B params
Tensor type
I64
·
I32
·
BF16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for Shankara-A-S/sarvam-30b-godspeed

Quantized
(27)
this model