---
language:
- en
tags:
- hymba
- mamba
- moe
- 4-bit
- gptq
- hybrid-quantization
license: other
base_model:
- nvidia/Hymba-1.5B-Base
---

# Hymba-1.5B-Eigen-Hybrid-4bit

This is a **Hybrid 4-bit Quantized** version of [nvidia/Hymba-1.5B-Base](https://huggingface.co/nvidia/Hymba-1.5B-Base).

It applies GPTQ (4-bit, group size 64) to the **Mamba backbone** and **MoE Experts**, reducing VRAM usage from ~3.5 GB to ~1 GB while keeping generated text coherent.

## Benchmarks

- **VRAM Usage:** ~1.01 GB (vs ~3.5 GB for Base)
- **Speed:** ~10 tokens/sec (Consumer GPU)
- **Perplexity:** 6.43 (WikiText-2)

## CRITICAL USAGE NOTE

This model uses a **Hybrid Quantization Strategy** that standard AutoGPTQ loaders do not support automatically. **You MUST use the custom Python script below to load this model.**

## How to Run

```python
import sys

import torch
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM
from auto_gptq.utils.accelerate_utils import load_checkpoint_in_model
from auto_gptq.nn_modules.qlinear.qlinear_cuda import QuantLinear  # Requires AutoGPTQ

# 1. Setup
model_path = "krishhx/Hymba-1.5B-Eigen-Hybrid-4bit"  # CHANGE THIS TO YOUR REPO ID
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)

# 2. Build Empty Skeleton (CPU)
with torch.device("cpu"):
    model = AutoModelForCausalLM.from_config(config, trust_remote_code=True)

# 3. MANUAL SURGERY: Inject 4-bit Layers
targets = [
    "mamba.in_proj",
    "mamba.out_proj",
    "moe.experts.0.gate_proj",
    "moe.experts.0.down_proj",
    "moe.experts.0.up_proj",
]


def replace_linear_with_quant(module, name_path=""):
    """Recursively swap targeted nn.Linear layers for 4-bit QuantLinear layers."""
    for name, child in module.named_children():
        full_name = f"{name_path}.{name}" if name_path else name
        is_target = any(t in full_name for t in targets)
        if is_target and isinstance(child, torch.nn.Linear):
            # Skip odd-dimension layers (kept in FP16 for stability)
            if child.in_features % 32 != 0 or child.out_features % 32 != 0:
                continue
            new_layer = QuantLinear(
                bits=4,
                group_size=64,
                infeatures=child.in_features,
                outfeatures=child.out_features,
                bias=child.bias is not None,
            )
            setattr(module, name, new_layer)
        else:
            replace_linear_with_quant(child, full_name)


replace_linear_with_quant(model)

# 4. Force-Load Weights
checkpoint = hf_hub_download(repo_id=model_path, filename="gptq_model-4bit-64g.safetensors")
load_checkpoint_in_model(model, checkpoint=checkpoint, device_map=None, dtype=torch.float16)
model.to("cuda")

# 5. Patch Cache & Run
# Look up the remote-code module that defines the model class, then add any
# cache attributes that generate() expects but the cache class does not define.
model_module = sys.modules[model.__class__.__module__]
if hasattr(model_module, "HybridMambaAttentionDynamicCache"):
    CacheClass = getattr(model_module, "HybridMambaAttentionDynamicCache")
    if not hasattr(CacheClass, "layers"):
        CacheClass.layers = property(lambda self: [None] * 32)
    if not hasattr(CacheClass, "get_usable_length"):
        CacheClass.get_usable_length = lambda self, i, l=None: self.get_seq_length(l)
    if not hasattr(CacheClass, "seen_tokens"):
        CacheClass.seen_tokens = property(lambda self: self.get_seq_length())
    if not hasattr(CacheClass, "get_max_length"):
        CacheClass.get_max_length = lambda self: model.config.max_position_embeddings

input_ids = tokenizer("The future of AI is", return_tensors="pt").to("cuda")["input_ids"]
output = model.generate(input_ids, max_new_tokens=50)
print(tokenizer.decode(output[0]))
```

## ⚠️ LIMITATION: Long-Context Retrieval

While this model retains good language-modeling quality (perplexity 6.43 on WikiText-2), the aggressive 4-bit quantization of the Mamba backbone limits effective memory retrieval to <4k tokens. For tasks requiring >4k context, we recommend waiting for our upcoming v1.1 Mixed-Precision release.
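One way to stay inside the limit described above is to count prompt tokens before generating and keep only the most recent ones. The sketch below is a minimal illustration and is not part of this repository: the 4096-token budget is an assumed reading of "<4k", and `clamp_to_budget` is a hypothetical helper name. It operates on plain lists of token IDs, so it could be applied to the tokenizer's `input_ids` before they are turned into a tensor.

```python
from typing import List

# Assumption: "<4k tokens" from the limitation note is read as a 4096-token budget.
MAX_RELIABLE_TOKENS = 4096


def clamp_to_budget(token_ids: List[int], max_new_tokens: int,
                    budget: int = MAX_RELIABLE_TOKENS) -> List[int]:
    """Keep the most recent prompt tokens so prompt + generation fits the budget."""
    if max_new_tokens >= budget:
        raise ValueError("max_new_tokens must be smaller than the token budget")
    room = budget - max_new_tokens
    # Left-truncate: drop the oldest tokens and keep the tail of the prompt.
    return token_ids[-room:] if len(token_ids) > room else token_ids


# Example with dummy token IDs standing in for real tokenizer output.
prompt_ids = list(range(5000))
clamped = clamp_to_budget(prompt_ids, max_new_tokens=50)
print(len(clamped))  # 4046
```

Left truncation keeps the text closest to the generation point, but it also drops any beginning-of-sequence token at the start of the prompt, so it may be worth re-prepending that token if the tokenizer adds one.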