Upload README.md with huggingface_hub
---
license: apache-2.0
base_model: Qwen/Qwen3-30B-A3B-Instruct-2507
base_model_relation: quantized
tags:
- gguf
- qwen3moe
- imatrix
- asymmetric-quantization
- 2-bit
- moe
pipeline_tag: text-generation
---

# Qwen3-30B-A3B-Instruct-2507: Asymmetric 2-bit-Expert GGUF (imatrix)

An **asymmetric, expert-aware** quantization of
[Qwen/Qwen3-30B-A3B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-30B-A3B-Instruct-2507)
(arch `qwen3moe`, 128 routed experts, 8 active per token, ~3B active params,
48 layers).

The idea (the "antirez" insight): in a routed-MoE model the bulk of the weights
live in the expert FFNs, and each expert is active for only a small share of
tokens (8 of 128 experts per token here). Push the routed experts to **2-bit**
where the model is most redundant, keep the
**attention and the embedding/output weights at higher precision** where error
is most damaging, and guide the quantization itself with an **importance
matrix (imatrix)**. The result fits in **16 GB** with only a modest
perplexity cost versus the standard 4-bit baseline.

## Asymmetric quantization scheme

| Tensor group | Type | Count |
|---|---|---|
| Routed expert **gate** (`ffn_gate_exps`) | `IQ2_S` | 48 |
| Routed expert **up** (`ffn_up_exps`) | `IQ2_S` | 48 |
| Routed expert **down** (`ffn_down_exps`) | `IQ3_S` | 48 |
| Attention `q/k/v/output` | `Q4_K` | 192 |
| `token_embd` | `Q6_K` | 1 |
| `output` (lm_head) | `Q6_K` | 1 |

Notes:
- **`down` experts get an extra bit (`IQ3_S`)**: they are more error-sensitive
  than `gate`/`up`, so they are protected.
- This architecture has **no shared expert**: all FFN experts are routed, so
  there is no always-on expert to hold separately at high precision.
- Quantization was guided by an **imatrix** computed over
  `bartowski/calibration_datav3.txt` (128 chunks, ctx 512). `imatrix.dat` is
  included in this repo.

**Effective rate: 2.99 BPW**, on-disk **11.4 GB** (about 10.6 GiB).
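
As a rough cross-check of the effective rate, the per-group types in the table can be weighted by parameter count. This is only a sketch: the model dimensions below (hidden size 2048, expert FFN width 768, 32 query heads, vocabulary 151936) are assumptions taken from the upstream model config rather than from this README, the bits-per-weight values are the nominal ggml block sizes, the router tensors are assumed to stay at F32, and the small norm tensors are ignored.

```bash
# Assumed dimensions (upstream Qwen3-30B-A3B config; not stated in this README).
layers=48; experts=128; hidden=2048; expert_ffn=768
q_dim=4096; kv_dim=512; vocab=151936

# Parameter counts per tensor group.
exps=$(( layers * experts * hidden * expert_ffn ))               # one of gate/up/down
attn=$(( layers * (2 * hidden * q_dim + 2 * hidden * kv_dim) ))  # q, output, k, v
embd_out=$(( 2 * vocab * hidden ))                               # token_embd + output
router=$(( layers * hidden * experts ))                          # assumed kept at F32

# Nominal bits per weight x 10000, so the shell can stay in integer math:
# IQ2_S 2.5625, IQ3_S 3.4375, Q4_K 4.5, Q6_K 6.5625, F32 32.
bits_x1e4=$(( 2 * exps * 25625 + exps * 34375 + attn * 45000 + embd_out * 65625 + router * 320000 ))
params=$(( 3 * exps + attn + embd_out + router ))

bpw_milli=$(( bits_x1e4 / params / 10 ))        # BPW x 1000
size_mb=$(( bits_x1e4 / 10000 / 8 / 1000000 ))  # decimal megabytes

printf 'params: %d\n' "$params"
printf 'BPW:    %d.%03d\n' $(( bpw_milli / 1000 )) $(( bpw_milli % 1000 ))
printf 'size:   %d MB\n' "$size_mb"
```

Under these assumptions the estimate should land close to the 2.99 BPW and 11.4 GB figures reported here; GGUF metadata and any tensors that fall back to another type are not modeled.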

## Quality (perplexity, wikitext-2 raw test, 200 chunks @ ctx 512)

| Model | PPL | Δ vs Q4_K_M |
|---|---|---|
| This asym 2-bit-expert (2.99 BPW, 11.4 GB) | **7.62** | +0.31 (+4.2%) |
| Standard `Q4_K_M` (~4.8 BPW, 18.6 GB) | **7.32** | baseline |

PPL measured with the same harness and chunk count for both. Lower is better.
The asym build trades a small PPL increase for a **~38% smaller** file that
clears the 16 GB bar.
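
A measurement along these lines can be sketched with `llama-perplexity`, using the chunk count and context size quoted in the section heading. The helper name `measure_ppl`, the `-ngl 99` offload setting, and the file names in the example are illustrative assumptions; the README does not show the exact invocation that produced the table.

```bash
# Hypothetical helper: run llama-perplexity with 200 chunks at ctx 512.
# $1 = path to a GGUF model, $2 = path to the wikitext-2 raw test file.
measure_ppl() {
  llama-perplexity -m "$1" -f "$2" -c 512 --chunks 200 -ngl 99
}

# Example (file names are placeholders; run once per model being compared):
#   measure_ppl Qwen3-30B-A3B-Instruct-2507-asym-2bitexp.gguf wiki.test.raw
```

Running the same helper on both GGUF files keeps the harness and chunk count identical, which is what makes the two PPL values comparable.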

## 16 GB fit

- Weights on disk / in VRAM: **11.4 GB**.
- KV cache (48 layers, GQA `n_head_kv = 4`, `head_dim = 128`) at f16 is
  2 × 48 × 4 × 128 × 2 bytes = 96 KiB/token (~0.094 MiB), so a **16K-token**
  context adds ~**1.5 GiB** (~1.6 GB).
- 11.4 GB weights + ~1.6 GB KV (16K ctx) + runtime overhead ≈ **14 GB < 16 GB**. ✅
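
The KV-cache figure follows directly from the shape parameters in the list above. A minimal sketch of that arithmetic, assuming one K and one V vector per KV head per layer and 2 bytes per f16 element:

```bash
layers=48; n_head_kv=4; head_dim=128
bytes_per_elem=2   # f16
n_ctx=16384        # 16K-token context

# K and V each store layers * n_head_kv * head_dim elements per token.
kv_bytes_per_token=$(( 2 * layers * n_head_kv * head_dim * bytes_per_elem ))
kv_bytes_total=$(( kv_bytes_per_token * n_ctx ))

echo "per token: $(( kv_bytes_per_token / 1024 )) KiB"
echo "16K ctx:   $(( kv_bytes_total / 1024 / 1024 )) MiB"
```

Changing `n_ctx` shows how the cache scales linearly with context length while the weight footprint stays fixed.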

Use a quantized KV cache (`-ctk q8_0 -ctv q8_0`) to push context further.
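
To get a feel for how much further, the sketch below estimates the per-token cache size at `q8_0` and how many tokens fit in the budget that an f16 cache needs for 16K tokens. It assumes the nominal `q8_0` block layout (34 bytes per 32 elements) and the same shape parameters as above; real runs may differ slightly because of allocation overhead.

```bash
layers=48; n_head_kv=4; head_dim=128

# Elements stored per token across K and V.
elems_per_token=$(( 2 * layers * n_head_kv * head_dim ))

f16_bytes_per_token=$(( elems_per_token * 2 ))
q8_bytes_per_token=$(( elems_per_token / 32 * 34 ))   # q8_0: 32 int8 values + one f16 scale

# Budget: what an f16 cache uses for a 16K-token context.
budget=$(( f16_bytes_per_token * 16384 ))
tokens_in_budget=$(( budget / q8_bytes_per_token ))

echo "q8_0 per token: $q8_bytes_per_token bytes"
echo "tokens in the f16 16K budget: $tokens_in_budget"
```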

## Usage (llama.cpp)

```bash
# Instruct (non-thinking) variant: no <think> blocks.
llama-server -m Qwen3-30B-A3B-Instruct-2507-asym-2bitexp.gguf -ngl 99 -c 16384
```
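
Once the server is up, it can be queried through its OpenAI-compatible chat endpoint. The sketch below assumes the default listen address `http://localhost:8080`; the helper names `build_payload` and `ask` and the `LLAMA_URL` variable are illustrative, not part of llama.cpp.

```bash
# Build a minimal chat request body for a single user prompt.
# The prompt is inserted verbatim, so it must not contain double quotes or backslashes.
build_payload() {
  printf '{"messages":[{"role":"user","content":"%s"}],"max_tokens":%d}' "$1" "${2:-128}"
}

# POST the request to a running llama-server instance.
ask() {
  curl -s "${LLAMA_URL:-http://localhost:8080}/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d "$(build_payload "$1")"
}

# Example (requires the server started above):
#   ask "What is the capital of France?"
```

For prompts with quotes or newlines, a JSON-aware tool is a safer way to build the request body than `printf`.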

## Provenance / reproducibility

- **Source:** `unsloth/Qwen3-30B-A3B-Instruct-2507-GGUF` at `Q8_0`
  (near-lossless), requantized with `--allow-requantize`.
- **imatrix corpus:** `bartowski/calibration_datav3.txt`, 128 chunks @ ctx 512.
- **Tooling:** `llama-quantize` with repeatable `--tensor-type REGEX=TYPE`
  overrides plus `--token-embedding-type Q6_K --output-tensor-type Q6_K`,
  base type `IQ3_S`, imatrix-guided.

```bash
llama-quantize --allow-requantize --imatrix imatrix.dat \
  --tensor-type "ffn_gate_exps=IQ2_S" --tensor-type "ffn_up_exps=IQ2_S" \
  --tensor-type "ffn_down_exps=IQ3_S" \
  --tensor-type "attn_q=Q4_K" --tensor-type "attn_k=Q4_K" \
  --tensor-type "attn_v=Q4_K" --tensor-type "attn_output=Q4_K" \
  --token-embedding-type Q6_K --output-tensor-type Q6_K \
  src-Q8_0.gguf Qwen3-30B-A3B-Instruct-2507-asym-2bitexp.gguf IQ3_S
```

Coherence verified on a coding task (`merge_intervals`) and a chickens/rabbits
reasoning problem (35 heads / 94 legs → 23 chickens, 12 rabbits).
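
The expected answer to the chickens/rabbits problem can be checked with a few lines of integer arithmetic. This is only a reference solution for the puzzle, assuming two legs per chicken and four per rabbit; it is not the prompt or harness used for the coherence check.

```bash
heads=35; legs=94

# Every animal has at least 2 legs; each rabbit contributes 2 more.
rabbits=$(( (legs - 2 * heads) / 2 ))
chickens=$(( heads - rabbits ))

echo "chickens=$chickens rabbits=$rabbits"
```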