---
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
---

# MoBE: Mixture-of-Basis-Experts for Compressing MoE-based LLMs

This repository contains the model checkpoints for **MoBE (Mixture-of-Basis-Experts)**, a novel model compression technique for MoE-based LLMs. For usage instructions and further details, see the GitHub repository: https://github.com/inclusionAI/MoBE

The Qwen3-specific implementation is available at: https://github.com/Bobchenyx/MoBE/tree/Qwen3

## 📘 Introduction

**MoBE (Mixture-of-Basis-Experts)** is a novel compression technique for MoE LLMs, developed by the **AGI Center, Ant Group Research**. It reduces the parameter count by factorizing each expert's weight matrix as:

$$
\mathbf{W} = \mathbf{A}\mathbf{B}, \quad \text{where} \quad \mathbf{B} = \sum_{i=1}^m \alpha_i \mathbf{B}_i
$$

- $\mathbf{A}$: Expert-specific matrix
- $\mathbf{B}$: Linear combination of **basis matrices** $\mathbf{B}_i$ across all experts, weighted by coefficients $\alpha_i$

The factorization is learned by minimizing the **reconstruction error** between the original and compressed weight matrices.

### 🔍 Key Results

MoBE significantly outperforms prior compression methods with minimal accuracy degradation:

- Reduces parameter count by **24%–30%** in leading open-source models
- Incurs only a **1%–2% absolute accuracy drop** (≈2% relative)
- Demonstrated on **Qwen3-235B**, **DeepSeek-V3 (671B)**, and **Kimi-K2-Instruct (1T)**

## 💡 MoBE Generate Example

```python
import torch
from transformers import AutoTokenizer

# MoBE model classes (local `models` package, not part of transformers)
from models.modeling_deepseek_v3_mobe import DeepseekV3MoBEForCausalLM
from models.modeling_qwen3_mobe import Qwen3MoBEForCausalLM
from models.modeling_kimi_k2_mobe import KimiK2MoBEForCausalLM

model_name = "Bobchenyx/Qwen3-235B-A22B-2507-MoBE"  # Replace with your local path or repo name
offload_folder = "./offload_dir"

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Memory budget: up to 120 GiB on each of 8 GPUs, plus 1200 GiB of CPU RAM
max_memory = {i: "120GiB" for i in range(8)}
max_memory["cpu"] = "1200GiB"

# Pick the model class from the checkpoint name
if 'Qwen' in model_name:
    model_cls = Qwen3MoBEForCausalLM
elif 'DeepSeek' in model_name:
    model_cls = DeepseekV3MoBEForCausalLM
else:
    model_cls = KimiK2MoBEForCausalLM

model = model_cls.from_pretrained(
    model_name,
    device_map="auto",
    offload_folder=offload_folder,
    offload_state_dict=True,
    torch_dtype=torch.bfloat16,
    max_memory=max_memory
)

input_text = "Artificial intelligence is"
inputs = tokenizer(input_text, return_tensors="pt").to("cuda" if torch.cuda.is_available() else "cpu")

# Sample up to 128 new tokens without tracking gradients
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=128,
        do_sample=True,
        temperature=0.7,
        pad_token_id=tokenizer.eos_token_id
    )

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("Generated text:")
print(generated_text)
```

## 📚 Citation

If you find MoBE useful in your research or application, please consider citing our work:

```bibtex
@misc{chen2025mobemixtureofbasisexpertscompressingmoebased,
      title={MoBE: Mixture-of-Basis-Experts for Compressing MoE-based LLMs},
      author={Xiaodong Chen and Mingming Ha and Zhenzhong Lan and Jing Zhang and Jianguo Li},
      year={2025},
      eprint={2508.05257},
      archivePrefix={arXiv},
}
```
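
As a rough illustration of why the factorization $\mathbf{W} = \mathbf{A}\mathbf{B}$ from the Introduction can shrink a model, the sketch below counts parameters for one MoE layer before and after factorization. The function names, the matrix shapes, and the toy sizes are all assumptions made for this example; they are not taken from the released checkpoints or the MoBE code, and the resulting ratio is not meant to reproduce the reported 24%–30% reduction.

```python
def dense_params(num_experts: int, rows: int, cols: int) -> int:
    """Parameters when every expert stores a full rows x cols matrix W."""
    return num_experts * rows * cols


def mobe_params(num_experts: int, rows: int, cols: int,
                rank: int, num_bases: int) -> int:
    """Parameters under W = A @ B with B = sum_i alpha_i * B_i.

    Assumed shapes (hypothetical, for illustration only):
      A     : rows x rank, one matrix per expert
      alpha : num_bases coefficients per expert
      B_i   : rank x cols, num_bases matrices shared by all experts
    """
    per_expert = rows * rank + num_bases   # A plus its mixing coefficients
    shared = num_bases * rank * cols       # basis matrices, stored once
    return num_experts * per_expert + shared


# Toy sizes chosen only to make the arithmetic easy to follow.
n, d, p, r, m = 128, 512, 2048, 512, 16
dense = dense_params(n, d, p)
mobe = mobe_params(n, d, p, r, m)
print(f"dense: {dense:,}  factorized: {mobe:,}  saved: {1 - mobe / dense:.1%}")
```

The saving comes from storing the basis matrices once per layer instead of once per expert; it grows as the number of bases becomes small relative to the number of experts.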