How to use from
Ollama
ollama run hf.co/FreedomAISVR/GLM-4.7-Flash-MXFP4-MOE-GGUF:MXFP4_MOE
Quick Links

GLM-4.7-Flash-MXFP4 MOE-GGUF

GGUF quantization of zai-org/GLM-4.7-Flash โ€” a 30B-parameter Mixture-of-Experts language model with ~3.2B active parameters per token, built on the DeepSeek2 architecture with Multi-head Latent Attention (MLA) and 64 routed experts.

Quantized to MXFP4 MOE format for efficient inference with minimal quality loss.

About MXFP4 MOE

MXFP4 (Microscaling FP4, E2M1) is an open standard 4-bit format under the OCP Microscaling Formats (MX) specification. In MXFP4_MOE mode, expert weights are stored in MXFP4 while non-expert tensors (attention, embeddings, norms) remain at Q8_0, balancing quality and compression for Mixture-of-Experts models. Works on any GPU or CPU without hardware-specific acceleration.

Files

Filename Type Size Description
glm-4.7-flash-mxfp4_moe.gguf GGUF (MXFP4 MOE) 15.8 GB Quantized model weights
README.md Markdown - Model card

Quantization Details

Property Value
Format MXFP4 MOE
Bits Per Weight 4.53 BPW
File Size 15.8 GB
Tensor Count 844
Architecture DeepSeek2 (custom for GLM-4.7-Flash)

Model Description

  • Developer: Zhipu AI
  • Architecture: Mixture-of-Experts (MoE) with DeepSeek2-style MLA
  • Parameters: ~30B total, ~3.2B active per token
  • Context Length: 200,000 tokens
  • Layers: 47 transformer layers
  • Attention: Multi-head Latent Attention (q_lora_rank=768, kv_lora_rank=512)
  • Experts: 64 routed experts (4 per token) + 1 shared expert
  • Vocab Size: 151,936
  • Languages: English, Chinese
  • Thinking: Enabled by default (native <think>/</think> tokens, hidden in history for clean multi-turn reasoning)
  • Pipeline: text-generation only (no vision encoder)

Usage

llama.cpp

# Basic generation
./llama-cli -m glm-4.7-flash-mxfp4_moe.gguf \
  -p "Hello, how are you?" \
  -n 256

# With thinking/reasoning controlled
./llama-cli -m glm-4.7-flash-mxfp4_moe.gguf \
  -p "Solve this step by step: 23 * 47" \
  -n 512 \
  -no-cnv

HuggingFace Hub

from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="FreedomAISVR/GLM-4.7-Flash-MXFP4-MOE-GGUF",
    filename="glm-4.7-flash-mxfp4_moe.gguf",
    repo_type="model"
)

Pipeline Commands

Source: zai-org/GLM-4.7-Flash (58 GB, 48 safetensor shards)

  1. F16 GGUF Conversion:

    python convert_hf_to_gguf.py D:\AI_MODELS\glm-4.7-src --outfile glm-4.7-f16.gguf --outtype f16
    

    Output: 55.79 GB, 844 tensors (DeepSeek2 arch, Glm4MoeLiteModel)

  2. MXFP4 MOE Quantization:

    llama-quantize.exe glm-4.7-f16.gguf glm-4.7-flash-mxfp4_moe.gguf MXFP4_MOE
    

    Duration: ~310s on RTX 5060 Ti

Hardware

Component Specification
GPU NVIDIA RTX 5060 Ti 16 GB (Blackwell)
System RAM 64 GB
Storage D: (NVMe)

License

MIT โ€” same as the original zai-org/GLM-4.7-Flash.

Downloads last month
429
GGUF
Model size
30B params
Architecture
deepseek2
Hardware compatibility
Log In to add your hardware

4-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. ๐Ÿ™‹ Ask for provider support

Model tree for FreedomAISVR/GLM-4.7-Flash-MXFP4-MOE-GGUF

Quantized
(87)
this model