You need to agree to share your contact information to access this model

This repository is publicly accessible, but you have to accept the conditions to access its files and content.

Log in or Sign Up to review the conditions and access this model content.

Sarvam 30B GPTQ W4A16 — INT4 Quantization (V1)

Submitted to the UNESCO Resilient AI Challenge 2026 — Track: Efficient & Sustainable AI

This is a post-training quantized version of the Sarvam 30B MoE model, compressed to INT4 weights using GPTQ W4A16 via llmcompressor. The model achieves a 6.3x size reduction while maintaining strong multilingual performance across English and 9 Indian languages.


Model Description

Property Value
Base Model Sarvam 30B MoE
Architecture BailingMoE (custom SarvamMoE)
Quantization GPTQ W4A16 (INT4 weights, FP16 activations)
Weight group size 128
Model size ~19 GB
FP16 baseline size ~120 GB
Compression ratio 6.3x smaller
Languages English + 9 Indian languages

Training & Quantization Pipeline

Sarvam 30B FP16 (base)
        |
        v
   QLoRA Fine-tuning
   (Indic language alignment)
        |
        v
   Knowledge Distillation (KD)
   (LoRA adapter trained with KD loss)
        |
        v
   LoRA + KD Merge -> sarvam-30b-lora-merged
        |
        v
   GPTQ W4A16 Quantization
   (llmcompressor + SmoothQuant)
        |
        v
   sarvam-30b-gptq-w4a16 (this model)

Quantization Details

Tool: llmcompressor==0.10.0.2

Recipe:

  • SmoothQuantModifier applied first to smooth activation outliers before GPTQ
  • GPTQModifier with INT4 group-wise quantization, group size 128, symmetric

Calibration Data:

  • 1,200 samples total
  • Mix of 1,142 Indic multilingual samples + 58 English reasoning samples
  • Max sequence length: 2,048 tokens

Layers kept in FP16 (skipped from quantization to preserve model quality):

  • Embedding layer: embed_tokens
  • Language model head: lm_head
  • Attention projections: q_proj, k_proj, v_proj, o_proj
  • MoE routing layers: router, moe.*gate, gate_logits, gate_score

Efficiency Gains vs FP16 Baseline

Evaluated on 200 prompts across Math, Medical, Questions, and Writing categories using vLLM on NVIDIA A100 80GB:

Metric Improvement over FP16
Model size 6.3x smaller (120 GB to 19 GB)
Inference throughput ~45% faster (tok/s)
Average latency ~28% lower
Energy consumption ~33% less CO2eq

How to Use

Requires vLLM for inference (https://github.com/vllm-project/vllm).

from vllm import LLM, SamplingParams

llm = LLM(
    model="meghanamakkapati/sarvam30b_INT4_quantisation",
    trust_remote_code=True,
    dtype="float16",
    max_model_len=8192,
    gpu_memory_utilization=0.90,
)

sampling_params = SamplingParams(temperature=0, max_tokens=2048)
outputs = llm.generate(["Your prompt here"], sampling_params)
print(outputs[0].outputs[0].text)

Thinking Mode (Chain-of-Thought)

The model supports reasoning via thinking tokens. Pass enable_thinking=True in chat_template_kwargs when using the chat template.


Hardware Requirements

  • Minimum: 1x NVIDIA A100 80GB GPU
  • Recommended inference engine: vLLM

Supported Languages

English and 9 Indian languages: Hindi, Telugu, Kannada, Malayalam, Tamil, Marathi, Bengali, Gujarati, Punjabi.


Intended Use

  • Multilingual text generation for Indian language tasks
  • English and Indic reasoning
  • Efficient deployment on single A100 GPU
  • Research on sustainable and efficient AI for low-resource languages


Environmental Impact

This quantized model directly addresses AI sustainability goals:

  • 33% reduction in energy consumption vs FP16 baseline
  • 6.3x smaller disk and memory footprint
  • Enables single-GPU deployment vs multi-GPU for FP16

Submission

This model is submitted as part of the UNESCO Resilient AI Challenge 2026, in collaboration with Sarvam AI, focused on building efficient and sustainable AI systems for multilingual and low-resource settings.


License

Apache 2.0

Downloads last month
44
Safetensors
Model size
6B params
Tensor type
I64
·
I32
·
BF16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for meghanamakkapati/sarvam30b_INT4_quantisation

Quantized
(8)
this model