How to use from
llama.cpp
Install (macOS, Linux)
curl -LsSf https://llama.app/install.sh | sh
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf noctrex/GLM-4.7-Flash-i1-MXFP4_MOE_XL-exp-GGUF:MXFP4_MOE_XL
# Run inference directly in the terminal:
llama cli -hf noctrex/GLM-4.7-Flash-i1-MXFP4_MOE_XL-exp-GGUF:MXFP4_MOE_XL
Install from WinGet (Windows)
winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf noctrex/GLM-4.7-Flash-i1-MXFP4_MOE_XL-exp-GGUF:MXFP4_MOE_XL
# Run inference directly in the terminal:
llama cli -hf noctrex/GLM-4.7-Flash-i1-MXFP4_MOE_XL-exp-GGUF:MXFP4_MOE_XL
Use pre-built binary
# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf noctrex/GLM-4.7-Flash-i1-MXFP4_MOE_XL-exp-GGUF:MXFP4_MOE_XL
# Run inference directly in the terminal:
./llama-cli -hf noctrex/GLM-4.7-Flash-i1-MXFP4_MOE_XL-exp-GGUF:MXFP4_MOE_XL
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf noctrex/GLM-4.7-Flash-i1-MXFP4_MOE_XL-exp-GGUF:MXFP4_MOE_XL
# Run inference directly in the terminal:
./build/bin/llama-cli -hf noctrex/GLM-4.7-Flash-i1-MXFP4_MOE_XL-exp-GGUF:MXFP4_MOE_XL
Use Docker
docker model run hf.co/noctrex/GLM-4.7-Flash-i1-MXFP4_MOE_XL-exp-GGUF:MXFP4_MOE_XL
Quick Links

This is an experimental MXFP4_MOE quantization of the model GLM-4.7-Flash.

I have created an importance-aware MXFP4_MOE quantization that dynamically allocates precision based on tensor importance scores from an imatrix I created with code_tiny.
This is a coding optimized quantization and is slightly larger than the mainline MXFP4_MOE, and the way it works is that it keeps a better quantization depending on the importance of each tensor.

Quantization Types

  • BF16 (16-bit) for highly important tensors (>75% importance)
  • Q8_0 (8-bit) for moderately important tensors (>60% importance)
  • MXFP4 (4-bit) for less important tensors (<50% importance)

Quantization per Layer Count

As I've mentioned it is experimental, and still not have done any benchmark on it, to see if it's any better than mainline, but you are freely to try it out and report back!

Downloads last month
67
GGUF
Model size
30B params
Architecture
deepseek2
Hardware compatibility
Log In to add your hardware

4-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. ๐Ÿ™‹ Ask for provider support

Model tree for noctrex/GLM-4.7-Flash-i1-MXFP4_MOE_XL-exp-GGUF

Quantized
(87)
this model