Instructions to use hyperspaceai/GLM-4.7-Flash-asym-2bitexp-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use hyperspaceai/GLM-4.7-Flash-asym-2bitexp-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="hyperspaceai/GLM-4.7-Flash-asym-2bitexp-GGUF",
	filename="GLM-4.7-Flash-asym-2bitexp.gguf",
)

llm.create_chat_completion(
	messages = "No input example has been defined for this model task."
)

Notebooks
Google Colab
Kaggle
Local Apps Settings

llama.cpp

How to use hyperspaceai/GLM-4.7-Flash-asym-2bitexp-GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf hyperspaceai/GLM-4.7-Flash-asym-2bitexp-GGUF
# Run inference directly in the terminal:
llama-cli -hf hyperspaceai/GLM-4.7-Flash-asym-2bitexp-GGUF

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf hyperspaceai/GLM-4.7-Flash-asym-2bitexp-GGUF
# Run inference directly in the terminal:
llama-cli -hf hyperspaceai/GLM-4.7-Flash-asym-2bitexp-GGUF

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf hyperspaceai/GLM-4.7-Flash-asym-2bitexp-GGUF
# Run inference directly in the terminal:
./llama-cli -hf hyperspaceai/GLM-4.7-Flash-asym-2bitexp-GGUF

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf hyperspaceai/GLM-4.7-Flash-asym-2bitexp-GGUF
# Run inference directly in the terminal:
./build/bin/llama-cli -hf hyperspaceai/GLM-4.7-Flash-asym-2bitexp-GGUF

Use Docker

docker model run hf.co/hyperspaceai/GLM-4.7-Flash-asym-2bitexp-GGUF

LM Studio
Jan
Ollama
How to use hyperspaceai/GLM-4.7-Flash-asym-2bitexp-GGUF with Ollama:
```
ollama run hf.co/hyperspaceai/GLM-4.7-Flash-asym-2bitexp-GGUF
```

Unsloth Studio

How to use hyperspaceai/GLM-4.7-Flash-asym-2bitexp-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for hyperspaceai/GLM-4.7-Flash-asym-2bitexp-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for hyperspaceai/GLM-4.7-Flash-asym-2bitexp-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for hyperspaceai/GLM-4.7-Flash-asym-2bitexp-GGUF to start chatting

How to use hyperspaceai/GLM-4.7-Flash-asym-2bitexp-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf hyperspaceai/GLM-4.7-Flash-asym-2bitexp-GGUF

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "hyperspaceai/GLM-4.7-Flash-asym-2bitexp-GGUF"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent new

How to use hyperspaceai/GLM-4.7-Flash-asym-2bitexp-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf hyperspaceai/GLM-4.7-Flash-asym-2bitexp-GGUF

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default hyperspaceai/GLM-4.7-Flash-asym-2bitexp-GGUF

Run Hermes

hermes

Atomic Chat new
Docker Model Runner
How to use hyperspaceai/GLM-4.7-Flash-asym-2bitexp-GGUF with Docker Model Runner:
```
docker model run hf.co/hyperspaceai/GLM-4.7-Flash-asym-2bitexp-GGUF
```

Lemonade

How to use hyperspaceai/GLM-4.7-Flash-asym-2bitexp-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull hyperspaceai/GLM-4.7-Flash-asym-2bitexp-GGUF

Run and chat with the model

lemonade run user.GLM-4.7-Flash-asym-2bitexp-GGUF-{{QUANT_TAG}}

List all available models

lemonade list

GLM-4.7-Flash — Asymmetric 2-bit-Expert GGUF

An asymmetric, imatrix-calibrated GGUF quant of GLM-4.7-Flash (30B-A3B MoE, glm4moe arch — internally served via the DeepSeek2 MLA + MoE path).

The design goal: push the routed experts (the overwhelming majority of the weights, but only ~3B active per token) down to 2–3 bits, while keeping every component that is touched on every token at high precision. The result fits a 16 GB GPU with room for a useful context window.

Asymmetric scheme

Component	Tensor pattern	Type	Rationale
Routed experts — gate, up	`ffn_gate_exps`, `ffn_up_exps`	IQ2_S	bulk of weights, sparsely active
Routed experts — down	`ffn_down_exps`	IQ3_S	down-proj is more quant-sensitive
Shared expert	`ffn_*_shexp`	Q6_K	active every token
Dense block-0 FFN	`blk.0.ffn_{gate,up,down}`	Q6_K	dense layer, active every token
Attention (MLA)	`attn_*`	Q4_K (`attn_k_b`→Q5_0, 192-col fallback)	small, latency-critical
Token embedding	`token_embd`	Q6_K	shared in/out vocabulary
Output head	`output`	Q6_K	logit quality
Base / everything else	—	IQ3_S

Built with the Hyperspace prism fork's llama-quantize using a repeatable --tensor-type REGEX=TYPE plan + --imatrix.

Provenance

Source: unsloth/GLM-4.7-Flash-GGUF → BF16 (BF16/GLM-4.7-Flash-BF16-*.gguf, the highest-precision GGUF in the repo).
imatrix: bartowski calibration_datav3.txt, 125 chunks @ ctx 512, computed on the BF16 source (imatrix.dat included).
Quantize: base IQ3_S + the per-tensor overrides above; --token-embedding-type q6_K --output-tensor-type q6_K.

Size & quality

	This quant (asym 2-bit-exp)	Q4_K_M baseline
On-disk	10.67 GB (3.06 BPW)	18.31 GB
Wikitext-2 PPL (200 chunks, ctx 512)	10.7749	10.0863

PPL delta: +6.83% for a ~42% smaller file.

16 GB fit

GLM-4.7-Flash uses MLA, so the KV cache is unusually small (compressed latent ~576 elems/layer × 47 layers):

Weights: 10.67 GB
KV @ 16k ctx (f16): ~0.89 GB
KV @ 32k ctx (f16): ~1.77 GB
- compute/context buffers: ~1–2 GB

→ ~12.5–13 GB total at 16k ctx, comfortably inside 16 GB VRAM (32k also fits).

Usage (thinking model)

GLM-4.7-Flash is a reasoning model. To disable the thinking trace, pass chat_template_kwargs: {"enable_thinking": false} with --jinja:

llama-server -m GLM-4.7-Flash-asym-2bitexp.gguf -ngl 99 -c 16384 --jinja
# then POST /v1/chat/completions with:
#   "chat_template_kwargs": {"enable_thinking": false}

Coherence verified on a coding prompt (correct memoized fib, fib(10)=55) and a short reasoning prompt.

Caveats

2-bit routed experts carry a measurable quality cost vs Q4_K_M (see PPL). On adversarial logic riddles the model can occasionally slip; for general coding/chat/reasoning under a tight VRAM budget it stays coherent. Use Q4_K_M or higher if you have the memory.

Downloads last month: 259

GGUF

Model size

30B params

Architecture

deepseek2

Hardware compatibility

We're not able to determine the quantization variants.

View all variants

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for hyperspaceai/GLM-4.7-Flash-asym-2bitexp-GGUF

Base model

zai-org/GLM-4.7-Flash

Quantized

(85)

this model