Sarvam-30B · Godspeed (W4A16, energy-optimized)
A compressed, energy-efficient build of sarvamai/sarvam-30b,
prepared for the Resilient AI Challenge — Track 1 (text-to-text). The objective of that track is the
lowest inference energy (joules over the full benchmark suite) on a single A100-80GB served via stock
vllm serve, while retaining ≥80 % quality recovery in every evaluation category.
This model preserves the base model's reasoning quality while cutting per-token memory traffic — the dominant driver of decode-time energy in a batch-1, memory-bandwidth-bound regime — through careful, serving-safe 4-bit weight quantization.
Model details
| Developed from | sarvamai/sarvam-30b (SarvamMoEForCausalLM) |
| Architecture | Mixture-of-Experts decoder — 19 layers (1 dense + 18 MoE), 128 routed + 1 shared expert, top-6 routing, 4 KV heads, hybrid "thinking" chat template |
| Precision | W4A16 (4-bit weights, 16-bit activations), compressed-tensors / Marlin-ready |
| Serving | vllm serve . --config vllm_config.yaml (stock vLLM, trust_remote_code) |
| Languages | English + major Indic languages (Hindi, Tamil, Bengali, Telugu, Kannada, Malayalam, Marathi, Gujarati, Punjabi, …) |
Quantization recipe
The recipe is deliberately quality-first and serving-safe:
- GPTQ (calibrated, 4-bit) on the attention projections (
query_key_value,dense) and the shared-expert MLP — the quality-sensitive, always-active paths — calibrated on a multilingual + math/STEM mix derived fromsarvamai/indivibe. - RTN (round-to-nearest, 4-bit) on the 128 routed experts — data-free and fast, suitable for the sparsely-activated expert bank.
- Kept in bf16 for stability and serving safety: the LM head (vLLM rejects a quantized
lm_head), the router gates (routing must stay precise), and layer 0 (input-sensitive).
The original thinking chat template is retained unchanged, so the model behaves exactly like the base model under the evaluator's chat interface.
Why this is energy-efficient
At batch-1 decode the GPU is memory-bandwidth-bound: energy per generated token is governed by the number of bytes read from HBM per forward pass. Moving the attention, shared-expert and routed-expert weights from 16-bit to 4-bit reduces those bytes substantially, lowering joules-per-token without inflating the token count — quality is preserved, so the model does not "think longer" to compensate (a failure mode that erases byte savings). The model passes every category quality floor with margin, which is the prerequisite for being ranked on energy at all.
Intended use & limitations
- Intended use: efficient text generation / reasoning for the challenge benchmark and general multilingual assistant tasks.
- Limitations: inherits the base model's knowledge cutoff, biases and capabilities. W4A16 is the practical
floor for stock-vLLM serving — sub-4-bit or
lm_headquantization were evaluated and rejected because they either fail to serve on stock vLLM or degrade quality below the floors.
How to serve
vllm serve . --config vllm_config.yaml
The included vllm_config.yaml is stripped of infrastructure-specific parameters (no tensor-parallel-size,
swap_space, etc.), per the challenge configuration-hygiene requirement, and sets max_model_len: 32768,
trust_remote_code: true.
Prepared for the Resilient AI Challenge. Base model © Sarvam AI; please refer to the base model card for its license and usage terms.
- Downloads last month
- -
Model tree for Shankara-A-S/sarvam-30b-godspeed
Base model
sarvamai/sarvam-30b