You need to agree to share your contact information to access this model

This repository is publicly accessible, but you have to accept the conditions to access its files and content.

Sarvam 30B GPTQ W4A16 — INT4 Quantization (V1)

Submitted to the UNESCO Resilient AI Challenge 2026 — Track: Efficient & Sustainable AI

This is a post-training quantized version of the Sarvam 30B MoE model, compressed to INT4 weights using GPTQ W4A16 via llmcompressor. The model achieves a 6.3x size reduction while maintaining strong multilingual performance across English and 9 Indian languages.

Model Description

Property	Value
Base Model	Sarvam 30B MoE
Architecture	BailingMoE (custom SarvamMoE)
Quantization	GPTQ W4A16 (INT4 weights, FP16 activations)
Weight group size	128
Model size	~19 GB
FP16 baseline size	~120 GB
Compression ratio	6.3x smaller
Languages	English + 9 Indian languages

Training & Quantization Pipeline

Sarvam 30B FP16 (base)
        |
        v
   QLoRA Fine-tuning
   (Indic language alignment)
        |
        v
   Knowledge Distillation (KD)
   (LoRA adapter trained with KD loss)
        |
        v
   LoRA + KD Merge -> sarvam-30b-lora-merged
        |
        v
   GPTQ W4A16 Quantization
   (llmcompressor + SmoothQuant)
        |
        v
   sarvam-30b-gptq-w4a16 (this model)

Quantization Details

Tool: llmcompressor==0.10.0.2

Recipe:

SmoothQuantModifier applied first to smooth activation outliers before GPTQ
GPTQModifier with INT4 group-wise quantization, group size 128, symmetric

Calibration Data:

1,200 samples total
Mix of 1,142 Indic multilingual samples + 58 English reasoning samples
Max sequence length: 2,048 tokens

Layers kept in FP16 (skipped from quantization to preserve model quality):

Embedding layer: embed_tokens
Language model head: lm_head
Attention projections: q_proj, k_proj, v_proj, o_proj
MoE routing layers: router, moe.*gate, gate_logits, gate_score

Efficiency Gains vs FP16 Baseline

Evaluated on 200 prompts across Math, Medical, Questions, and Writing categories using vLLM on NVIDIA A100 80GB:

Metric	Improvement over FP16
Model size	6.3x smaller (120 GB to 19 GB)
Inference throughput	~45% faster (tok/s)
Average latency	~28% lower
Energy consumption	~33% less CO2eq

How to Use

Requires vLLM for inference (https://github.com/vllm-project/vllm).

from vllm import LLM, SamplingParams

llm = LLM(
    model="meghanamakkapati/sarvam30b_INT4_quantisation",
    trust_remote_code=True,
    dtype="float16",
    max_model_len=8192,
    gpu_memory_utilization=0.90,
)

sampling_params = SamplingParams(temperature=0, max_tokens=2048)
outputs = llm.generate(["Your prompt here"], sampling_params)
print(outputs[0].outputs[0].text)

Thinking Mode (Chain-of-Thought)

The model supports reasoning via thinking tokens. Pass enable_thinking=True in chat_template_kwargs when using the chat template.

Hardware Requirements

Minimum: 1x NVIDIA A100 80GB GPU
Recommended inference engine: vLLM

Supported Languages

English and 9 Indian languages: Hindi, Telugu, Kannada, Malayalam, Tamil, Marathi, Bengali, Gujarati, Punjabi.

Intended Use

Multilingual text generation for Indian language tasks
English and Indic reasoning
Efficient deployment on single A100 GPU
Research on sustainable and efficient AI for low-resource languages

Environmental Impact

This quantized model directly addresses AI sustainability goals:

33% reduction in energy consumption vs FP16 baseline
6.3x smaller disk and memory footprint
Enables single-GPU deployment vs multi-GPU for FP16

Submission

This model is submitted as part of the UNESCO Resilient AI Challenge 2026, in collaboration with Sarvam AI, focused on building efficient and sustainable AI systems for multilingual and low-resource settings.

License

Apache 2.0

Downloads last month: 44

Safetensors

Model size

6B params

Tensor type

I64

I32

BF16

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for meghanamakkapati/sarvam30b_INT4_quantisation

Base model

sarvamai/sarvam-1-v0.5

Quantized

(8)

this model