Sarvam 30B GPTQ W4A16 — INT4 Quantization (V1)
Submitted to the UNESCO Resilient AI Challenge 2026 — Track: Efficient & Sustainable AI
This is a post-training quantized version of the Sarvam 30B MoE model, compressed to INT4 weights using GPTQ W4A16 via llmcompressor. The model achieves a 6.3x size reduction while maintaining strong multilingual performance across English and 9 Indian languages.
Model Description
| Property | Value |
|---|---|
| Base Model | Sarvam 30B MoE |
| Architecture | BailingMoE (custom SarvamMoE) |
| Quantization | GPTQ W4A16 (INT4 weights, FP16 activations) |
| Weight group size | 128 |
| Model size | ~19 GB |
| FP16 baseline size | ~120 GB |
| Compression ratio | 6.3x smaller |
| Languages | English + 9 Indian languages |
Training & Quantization Pipeline
Sarvam 30B FP16 (base)
|
v
QLoRA Fine-tuning
(Indic language alignment)
|
v
Knowledge Distillation (KD)
(LoRA adapter trained with KD loss)
|
v
LoRA + KD Merge -> sarvam-30b-lora-merged
|
v
GPTQ W4A16 Quantization
(llmcompressor + SmoothQuant)
|
v
sarvam-30b-gptq-w4a16 (this model)
Quantization Details
Tool: llmcompressor==0.10.0.2
Recipe:
- SmoothQuantModifier applied first to smooth activation outliers before GPTQ
- GPTQModifier with INT4 group-wise quantization, group size 128, symmetric
Calibration Data:
- 1,200 samples total
- Mix of 1,142 Indic multilingual samples + 58 English reasoning samples
- Max sequence length: 2,048 tokens
Layers kept in FP16 (skipped from quantization to preserve model quality):
- Embedding layer: embed_tokens
- Language model head: lm_head
- Attention projections: q_proj, k_proj, v_proj, o_proj
- MoE routing layers: router, moe.*gate, gate_logits, gate_score
Efficiency Gains vs FP16 Baseline
Evaluated on 200 prompts across Math, Medical, Questions, and Writing categories using vLLM on NVIDIA A100 80GB:
| Metric | Improvement over FP16 |
|---|---|
| Model size | 6.3x smaller (120 GB to 19 GB) |
| Inference throughput | ~45% faster (tok/s) |
| Average latency | ~28% lower |
| Energy consumption | ~33% less CO2eq |
How to Use
Requires vLLM for inference (https://github.com/vllm-project/vllm).
from vllm import LLM, SamplingParams
llm = LLM(
model="meghanamakkapati/sarvam30b_INT4_quantisation",
trust_remote_code=True,
dtype="float16",
max_model_len=8192,
gpu_memory_utilization=0.90,
)
sampling_params = SamplingParams(temperature=0, max_tokens=2048)
outputs = llm.generate(["Your prompt here"], sampling_params)
print(outputs[0].outputs[0].text)
Thinking Mode (Chain-of-Thought)
The model supports reasoning via thinking tokens. Pass enable_thinking=True in chat_template_kwargs when using the chat template.
Hardware Requirements
- Minimum: 1x NVIDIA A100 80GB GPU
- Recommended inference engine: vLLM
Supported Languages
English and 9 Indian languages: Hindi, Telugu, Kannada, Malayalam, Tamil, Marathi, Bengali, Gujarati, Punjabi.
Intended Use
- Multilingual text generation for Indian language tasks
- English and Indic reasoning
- Efficient deployment on single A100 GPU
- Research on sustainable and efficient AI for low-resource languages
Environmental Impact
This quantized model directly addresses AI sustainability goals:
- 33% reduction in energy consumption vs FP16 baseline
- 6.3x smaller disk and memory footprint
- Enables single-GPU deployment vs multi-GPU for FP16
Submission
This model is submitted as part of the UNESCO Resilient AI Challenge 2026, in collaboration with Sarvam AI, focused on building efficient and sustainable AI systems for multilingual and low-resource settings.
License
Apache 2.0
- Downloads last month
- 44
Model tree for meghanamakkapati/sarvam30b_INT4_quantisation
Base model
sarvamai/sarvam-1-v0.5