Thireus committed
Commit 0d05512 · Parent(s): 19e3e1e
Update README
README.md CHANGED
@@ -5,12 +5,13 @@ license: mit
 
 ## 🤔 What is this [HuggingFace repository](https://huggingface.co/Thireus/Qwen3-235B-A22B-Instruct-2507-THIREUS-BF16-SPECIAL_SPLIT/) about?
 
-This repository provides **GGUF-quantized tensors** for the Qwen3-235B-A22B-Instruct-2507 model (official repo: https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507). These GGUF shards are designed to be used with **Thireus’ GGUF Tool Suite** (https://gguf.thireus.com), a collection of tools that automatically finds the perplexity-optimal mix of quantizations for any given VRAM and RAM target. With
+This repository provides **GGUF-quantized tensors** for the Qwen3-235B-A22B-Instruct-2507 model (official repo: https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507). These GGUF shards are designed to be used with **Thireus’ GGUF Tool Suite** (https://gguf.thireus.com), a collection of tools that automatically finds the perplexity-optimal mix of quantizations for any given VRAM and RAM target. With this GGUF Tool Suite, you can produce your own Dynamic 3.0 Quants recipes and achieve optimum accuracy & SOTA quantization performance.
 
 - 📖 Read more: https://github.com/Thireus/GGUF-Tool-Suite
-- 🔍 Example
--
--
+- 🔍 Example of GGUF recipes: https://github.com/Thireus/GGUF-Tool-Suite/tree/main/recipe_examples
+- ☁️ Download GGUF models from recipe files: https://gguf.thireus.com/quant_downloader.html
+- 🛠️ Create your own recipes: https://colab.research.google.com/github/Thireus/GGUF-Tool-Suite/blob/main/quant_recipe_pipeline.ipynb
+- 📂 Browse available models: https://gguf.thireus.com
 
 *tl;dr: Expand the details section below*
 <details>
@@ -33,7 +34,7 @@ cd ..
 # Obtain Thireus' GGUF-Tool-Suite
 GIT_LFS_SKIP_SMUDGE=1 git clone https://github.com/Thireus/GGUF-Tool-Suite
 
-# Download model quant mix from recipe file:
+# Download model quant mix from recipe file - you can also try the web version: https://gguf.thireus.com/quant_downloader.html
 cd GGUF-Tool-Suite
 rm -f download.conf # Make sure to copy the relevant download.conf for the model before running quant_assign.py
 cp -f models/DeepSeek-R1-0528/download.conf . # Use the download.conf of the chosen model
@@ -46,7 +47,7 @@ mkdir -p kitchen && cd kitchen
 ulimit -n 9999 # Lifts "too many open files" limitation on Linux
 ~/ik_llama.cpp/build/bin/llama-cli \
 -m DeepSeek-R1-0528-THIREUS-BF16-SPECIAL_TENSOR-00001-of-01148.gguf \
--mla 3 -fa -amb 512 -
+-mla 3 -fa auto -amb 512 -ctk f16 -c 4096 -ngl 99 \
 -ot "blk\.(3|4|5|6)\.ffn_.*=CUDA0" \
 -ot "blk\.(7|8|9|10)\.ffn_.*=CUDA1" \
 -ot exps=CPU -b 2048 -ub 1024 --warmup-batch --no-mmap --threads 36 \
@@ -86,7 +87,7 @@ Check out the [GGUF Tool Suite README](https://github.com/Thireus/GGUF-Tool-Suit
 
 1. ⚠️ **Requirements** – Which `ik_llama.cpp` (or `llama.cpp`) version to use and how to compile.
 - Windows binaries (no patching needed) at: https://github.com/Thireus/ik_llama.cpp/releases
-2. 📥 **Download Model Shards** – Use `quant_downloader.sh` to fetch GGUF shards from any recipe.
+2. 📥 **Download Model Shards** – Use `quant_downloader.sh` or [quant_downloader.html](https://gguf.thireus.com/quant_downloader.html) to fetch GGUF shards from any recipe.
 - Recipe examples: https://github.com/Thireus/GGUF-Tool-Suite/tree/main/recipe_examples
 3. 🧠 **Run a Downloaded Model** – Sample usage with `llama-cli`.
 4. 🛠️ **Generate a Custom Recipe** – Produce recipes tailored to your VRAM/RAM target usage for optimum perplexity.
@@ -103,7 +104,7 @@ Supported models are listed under `models/` in the [Tool Suite Github repo](http
 
 No, because I believe in **tailored quantization** for each user’s hardware. If you prefer ready-made shards, you are welcome to merge them via `llama-gguf-split --merge`, or request someone to publish them, or rely on generic GGUF dynamic quants such as [unsloth](https://huggingface.co/unsloth)'s.
 
-Instead, I prefer to share examples of recipes so users can see exactly how they were produced (command included inside these recipe files) and tweak them for their own rigs. The `quant_downloader.sh` script handles automatic fetching and verification of each shard. Note that recipes provided by [Ubergarm](https://huggingface.co/ubergarm) on his model cards are also compatible with `quant_downloader.sh`.
+Instead, I prefer to share examples of recipes so users can see exactly how they were produced (command included inside these recipe files) and tweak them for their own rigs. The `quant_downloader.sh` script or [quant_downloader.html](https://gguf.thireus.com/quant_downloader.html) (web port of this script) handles automatic fetching and verification of each shard. Note that recipes provided by [Ubergarm](https://huggingface.co/ubergarm) on his model cards are also compatible with `quant_downloader.sh` and [quant_downloader.html](https://gguf.thireus.com/quant_downloader.html), providing a "SPECIAL_SPLIT" version of these models exists (see https://gguf.thireus.com/).
 
 Users who don’t trust the GGUF shards on HuggingFace can also quantize their own by passing recipe lines to `llama-quantize --custom-q` ([see example](https://github.com/Thireus/GGUF-Tool-Suite/blob/main/models/DeepSeek-R1-0528/DeepSeek-R1-0528-THIREUS-ANY-SPECIAL.sh#L482-L486)). Run `llama-quantize --help` to list compatible quants for `quant_assign.py`. This approach is especially useful if you prefer `llama.cpp` over `ik_llama.cpp`.
 
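
The `-ot` arguments in the `llama-cli` command of this diff each pair a regular expression with a device (`pattern=device`). The sketch below is only an illustration of how such rules could map tensor names to devices; it is not code from `ik_llama.cpp` or the GGUF Tool Suite. It assumes that rules are tried in command-line order and that the first regex found anywhere in the tensor name wins. The tensor names used are hypothetical examples, and `parse_rule` and `assign_device` are names made up for this sketch.

```python
import re

# The three -ot rules from the README command, in command-line order.
RAW_RULES = [
    r"blk\.(3|4|5|6)\.ffn_.*=CUDA0",
    r"blk\.(7|8|9|10)\.ffn_.*=CUDA1",
    r"exps=CPU",
]


def parse_rule(raw):
    """Split 'pattern=device' on the last '=' into a compiled regex and a device."""
    pattern, _, device = raw.rpartition("=")
    return re.compile(pattern), device


RULES = [parse_rule(raw) for raw in RAW_RULES]


def assign_device(tensor_name, rules=RULES):
    """Return the device of the first matching rule, or None if no rule matches.

    Assumption: first match wins and the regex may match anywhere in the name.
    """
    for regex, device in rules:
        if regex.search(tensor_name):
            return device
    return None


if __name__ == "__main__":
    # Hypothetical tensor names, chosen only to exercise each rule.
    for name in [
        "blk.3.ffn_up_exps.weight",
        "blk.10.ffn_down_exps.weight",
        "blk.13.ffn_up_exps.weight",
        "blk.13.attn_q.weight",
    ]:
        print(f"{name} -> {assign_device(name)}")
```

Under these assumptions the rule order matters: the catch-all `exps=CPU` comes last, so expert tensors in blocks 3 to 10 are claimed by the two CUDA rules first, and only the remaining expert tensors fall through to the CPU.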