---
language:
- en
tags:
- hymba
- mamba
- moe
- 4-bit
- gptq
- hybrid-quantization
license: other
base_model:
- nvidia/Hymba-1.5B-Base
---

# Hymba-1.5B-Eigen-Hybrid-4bit

This is a **Hybrid 4-bit Quantized** version of [nvidia/Hymba-1.5B-Base](https://huggingface.co/nvidia/Hymba-1.5B-Base).
It applies GPTQ (4-bit, group size 64) to the **Mamba backbone** and **MoE Experts**, drastically reducing VRAM usage (to ~1GB) while maintaining coherence.

## Benchmarks
- **VRAM Usage:** ~1.01 GB (vs ~3.5 GB for Base)
- **Speed:** ~10 tokens/sec (Consumer GPU)
- **Perplexity:** 6.43 (WikiText-2)

## CRITICAL USAGE NOTE
This model uses a **Hybrid Quantization Strategy** that standard AutoGPTQ loaders do not support automatically.
**You MUST use the custom python script below to load this model.**

## How to Run

```python
import torch
import os
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM
from auto_gptq.utils.accelerate_utils import load_checkpoint_in_model
from auto_gptq.nn_modules.qlinear.qlinear_cuda import QuantLinear # Requires AutoGPTQ

# 1. Setup
model_path = "krishhx/Hymba-1.5B-Eigen-Hybrid-4bit" # CHANGE THIS TO YOUR REPO ID
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)

# 2. Build Empty Skeleton (CPU)
with torch.device("cpu"):
    model = AutoModelForCausalLM.from_config(config, trust_remote_code=True)

# 3. MANUAL SURGERY: Inject 4-bit Layers
targets = ["mamba.in_proj", "mamba.out_proj", "moe.experts.0.gate_proj", "moe.experts.0.down_proj", "moe.experts.0.up_proj"]

def replace_linear_with_quant(module, name_path=""):
    for name, child in module.named_children():
        full_name = f"{name_path}.{name}" if name_path else name
        is_target = any(t in full_name for t in targets)
        if is_target and isinstance(child, torch.nn.Linear):
            # Skip odd-dimension layers (kept in FP16 for stability)
            if child.in_features % 32 != 0 or child.out_features % 32 != 0: continue
            
            new_layer = QuantLinear(bits=4, group_size=64, infeatures=child.in_features, outfeatures=child.out_features, bias=child.bias is not None)
            setattr(module, name, new_layer)
        else:
            replace_linear_with_quant(child, full_name)

replace_linear_with_quant(model)

# 4. Force-Load Weights
import os
from huggingface_hub import hf_hub_download
checkpoint = hf_hub_download(repo_id=model_path, filename="gptq_model-4bit-64g.safetensors")

load_checkpoint_in_model(model, checkpoint=checkpoint, device_map=None, dtype=torch.float16)
model.to("cuda")

# 5. Patch Cache & Run
model_module = torch.utils.sys.modules[model.__class__.__module__]
if hasattr(model_module, "HybridMambaAttentionDynamicCache"):
    CacheClass = getattr(model_module, "HybridMambaAttentionDynamicCache")
    if not hasattr(CacheClass, "layers"): CacheClass.layers = property(lambda self: [None] * 32)
    if not hasattr(CacheClass, "get_usable_length"): CacheClass.get_usable_length = lambda self, i, l=None: self.get_seq_length(l)
    if not hasattr(CacheClass, "seen_tokens"): CacheClass.seen_tokens = property(lambda self: self.get_seq_length())
    if not hasattr(CacheClass, "get_max_length"): CacheClass.get_max_length = lambda self: model.config.max_position_embeddings

input_ids = tokenizer("The future of AI is", return_tensors="pt").to("cuda")["input_ids"]
output = model.generate(input_ids, max_new_tokens=50)
print(tokenizer.decode(output[0]))
```

⚠️ LIMITATION: Long-Context Retrieval While this model retains high reasoning capabilities (Perplexity 6.43), the aggressive 4-bit quantization of the Mamba backbone limits effective memory retrieval to <4k tokens. For tasks requiring >4k context, we recommend waiting for our upcoming v1.1 Mixed-Precision release.