Instructions to use nisten/mixtral8x22-imatrix-gguf with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use nisten/mixtral8x22-imatrix-gguf with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="nisten/mixtral8x22-imatrix-gguf",
	filename="mix2k-noimatrix-but-usable-reference.gguf",
)

output = llm(
	"Once upon a time,",
	max_tokens=512,
	echo=True
)
print(output)

Local Apps

llama.cpp

How to use nisten/mixtral8x22-imatrix-gguf with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf nisten/mixtral8x22-imatrix-gguf
# Run inference directly in the terminal:
llama-cli -hf nisten/mixtral8x22-imatrix-gguf

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf nisten/mixtral8x22-imatrix-gguf
# Run inference directly in the terminal:
llama-cli -hf nisten/mixtral8x22-imatrix-gguf

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf nisten/mixtral8x22-imatrix-gguf
# Run inference directly in the terminal:
./llama-cli -hf nisten/mixtral8x22-imatrix-gguf

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf nisten/mixtral8x22-imatrix-gguf
# Run inference directly in the terminal:
./build/bin/llama-cli -hf nisten/mixtral8x22-imatrix-gguf

Use Docker

docker model run hf.co/nisten/mixtral8x22-imatrix-gguf

Ollama
How to use nisten/mixtral8x22-imatrix-gguf with Ollama:
```
ollama run hf.co/nisten/mixtral8x22-imatrix-gguf
```

Unsloth Studio

How to use nisten/mixtral8x22-imatrix-gguf with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for nisten/mixtral8x22-imatrix-gguf to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for nisten/mixtral8x22-imatrix-gguf to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for nisten/mixtral8x22-imatrix-gguf to start chatting

Docker Model Runner
How to use nisten/mixtral8x22-imatrix-gguf with Docker Model Runner:
```
docker model run hf.co/nisten/mixtral8x22-imatrix-gguf
```

Lemonade

How to use nisten/mixtral8x22-imatrix-gguf with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull nisten/mixtral8x22-imatrix-gguf

Run and chat with the model

lemonade run user.mixtral8x22-imatrix-gguf-{{QUANT_TAG}}

List all available models

lemonade list

Importance-Matrix quantizations of Mixtral-8x22B-v0.1 💫

The imatrix.dat file was calculated over 1000 chunks of wikitext.train.raw (included).

I wrote a bit of custom C++ to avoid quantizing certain layers. The files are tested and fully compatible with llama.cpp as of 10 April 2024.

To combine the chunks into a single file (this is not needed with llama.cpp, which autodetects the chunks, but it can help when troubleshooting Ollama):

cat mix4ns-0000* > mix4ns.gguf

Careful: this can take 5 minutes, or up to 10-15 minutes on slow instances. Check progress with ls -la.
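Where `cat` is not available, or a per-shard progress readout is preferred over polling with ls -la, the same concatenation can be sketched in Python. This is a minimal sketch, assuming the shards follow the mix4ns-0000* naming used above and that sorting by file name gives the correct part order; `merge_shards` is a hypothetical helper name, not part of llama.cpp.

```python
import glob
import shutil


def merge_shards(pattern, out_path, chunk_bytes=64 * 1024 * 1024):
    """Concatenate all files matching `pattern`, in name order, into `out_path`."""
    shards = sorted(glob.glob(pattern))
    if not shards:
        raise FileNotFoundError(f"no files match {pattern!r}")
    with open(out_path, "wb") as dst:
        for i, shard in enumerate(shards, start=1):
            with open(shard, "rb") as src:
                # Stream in large blocks so the shard never has to fit in RAM.
                shutil.copyfileobj(src, dst, chunk_bytes)
            print(f"[{i}/{len(shards)}] appended {shard}")
    return shards


# Equivalent of `cat mix4ns-0000* > mix4ns.gguf` (uncomment to run):
# merge_shards("mix4ns-0000*", "mix4ns.gguf")
```

The output name must not match the glob pattern, otherwise a second run would pick up the merged file as an input shard.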

Run with llama.cpp

git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp/ && make -j

./main -m ~/mix4ns-00001-of-00005.gguf -n 256 -t 64 --temp 0.2 --color -p "How to build a city on mars via aldrin cycler orbits?"

Perplexity benchmarks

This is the command I used to run these on a 48-core, CPU-only machine. You can add -ngl 16 to offload 16 layers (or more) to the GPU.

./perplexity -m ~/mix4xs.gguf -f wiki.test.raw --chunks 12 -t 48
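When running the benchmark over several files, the CPU-only command and its `-ngl` variant can be assembled from one place. The following is a sketch only: `perplexity_cmd` is a hypothetical helper, and it uses just the flags that appear in this README (-m, -f, --chunks, -t, -ngl) with the values above as defaults.

```python
import shlex


def perplexity_cmd(model, text_file="wiki.test.raw", chunks=12, threads=48,
                   gpu_layers=0, binary="./perplexity"):
    """Build the argument list for a perplexity run; add -ngl only when offloading."""
    cmd = [binary, "-m", model, "-f", text_file,
           "--chunks", str(chunks), "-t", str(threads)]
    if gpu_layers > 0:
        cmd += ["-ngl", str(gpu_layers)]
    return cmd


# Print the command line for a run that offloads 16 layers to the GPU.
print(shlex.join(perplexity_cmd("mix4xs.gguf", gpu_layers=16)))
```

The returned list can be passed directly to `subprocess.run` if the binary and the files are present.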

The results are interesting. Converting from the hf-bf16 folder to an f16 GGUF adds a bit of loss (increases perplexity). I've noticed on smaller models that going straight from the Hugging Face repo folder to 8-bit with python convert.py --outtype q8_0 produces lower perplexity than going hf -> f16 -> q8_0. What's even more interesting is that quantizing TWICE (hf -> q8_0, then q8_0 -> imatrix) also produces better perplexity than the regular f16 GGUF -> imatrix route.

All you need to pay attention to is the final value, PPL = 2.2585 in this case, which is that of a regular 8-bit quant.
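To pull that final value out of a saved log instead of reading it by eye, a small parser is enough. This is a minimal sketch, assuming the log ends with a line in the "Final estimate: PPL = X +/- Y" shape shown in this section; `parse_final_ppl` is a hypothetical helper name.

```python
import re

# Matches the summary line printed at the end of a perplexity run.
_FINAL = re.compile(r"Final estimate: PPL = ([0-9.]+) \+/- ([0-9.]+)")


def parse_final_ppl(log_text):
    """Return (ppl, error) from the 'Final estimate' line of a perplexity log."""
    match = _FINAL.search(log_text)
    if match is None:
        raise ValueError("no 'Final estimate' line found")
    return float(match.group(1)), float(match.group(2))


sample = """[11]2.2693,[12]2.2585,
Final estimate: PPL = 2.2585 +/- 0.06534"""
print(parse_final_ppl(sample))
```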

NOT ALL 8-BIT QUANTS ARE CREATED EQUAL: this took 9 hours to convert to 8-bit on a 64-core CPU with 256 GB of RAM (8-channel DDR5).

Even though the file is a tiny bit slower, it gets a tiny bit lower perplexity. Over 12 chunks the difference looks like nothing (2.2584 for mix8ns vs 2.2585 for regular q8_0), but past testing on smaller models with 100+ chunks has shown this difference to be a bit more pronounced.
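One way to judge how small a gap is at 12 chunks is to compare it with the reported +/- values. The sketch below is an illustration under stated assumptions, not the author's method: it treats each +/- as a standard error and combines the two in quadrature as if the runs were independent. Both runs use the same test text, so the errors are in fact correlated and this is only a rough yardstick. `ppl_gap_in_sigmas` is a hypothetical helper name.

```python
import math


def ppl_gap_in_sigmas(ppl_a, err_a, ppl_b, err_b):
    """Absolute PPL difference divided by the quadrature-combined error."""
    combined = math.hypot(err_a, err_b)  # sqrt(err_a**2 + err_b**2)
    return abs(ppl_a - ppl_b) / combined


# Values copied from the perplexity runs listed in this README.
regular_q8 = (2.2585, 0.06534)
slow_q8 = (2.2584, 0.06514)
iq4_xs = (2.4367, 0.07218)

print(f"q8_0 vs slow q8_0: {ppl_gap_in_sigmas(*regular_q8, *slow_q8):.4f}")
print(f"q8_0 vs iq4_xs:    {ppl_gap_in_sigmas(*regular_q8, *iq4_xs):.4f}")
```

Under these assumptions the two 8-bit files are far closer than the measurement noise, which is why more chunks are needed before the difference shows.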

perplexity regular q8_0 (from f16): 126.35 seconds per pass - ETA 6.32 minutes
[1]2.6256,[2]3.1043,[3]3.6463,[4]3.2092,[5]2.6847,[6]2.4791,[7]2.3112,[8]2.2502,[9]2.2858,[10]2.2690,[11]2.2693,[12]2.2585,
Final estimate: PPL = 2.2585 +/- 0.06534

perplexity q8_0 (slow convert.py from hf): 96.86 seconds per pass - ETA 4.83 minutes
[1]2.6191,[2]3.1045,[3]3.6551,[4]3.2302,[5]2.6990,[6]2.4908,[7]2.3167,[8]2.2541,[9]2.2877,[10]2.2682,[11]2.2685,[12]2.2584,
Final estimate: PPL = 2.2584 +/- 0.06514

perplexity regular iq4_xs (no imatrix): 91.53 seconds per pass 
[1]2.6966,[2]3.1749,[3]3.6972,[4]3.2577,[5]2.7905,[6]2.6097,[7]2.4536,[8]2.4001,[9]2.4469,[10]2.4219,[11]2.4366,[12]2.4367,
Final estimate: PPL = 2.4367 +/- 0.07218

perplexity regular q4_km (no imatrix): 108.59 seconds per pass 
[1]2.6100,[2]3.1304,[3]3.6897,[4]3.3500,[5]2.8118,[6]2.5992,[7]2.4349,[8]2.3816,[9]2.4174,[10]2.3959,[11]2.3988,[12]2.3976,
Final estimate: PPL = 2.3976 +/- 0.07111

perplexity EdgeQuant iq4-ns (no imatrix) 84.45 seconds per pass - FILESIZE 77258 MB 
[1]2.7195,[2]3.1821,[3]3.7177,[4]3.3017,[5]2.8012,[6]2.6034,[7]2.4318,[8]2.3747,[9]2.4160,[10]2.3931,[11]2.4023,[12]2.4013,
Final estimate: PPL = 2.4013 +/- 0.07116

perplexity EdgeQuant iq4-ns (WITH imatrix) 82.76 seconds per pass - FILESIZE 73636 MB ( mix4ns.gguf ) //BEST ONE FOR 80GB CARD
[1]2.7166,[2]3.1720,[3]3.6988,[4]3.3195,[5]2.7949,[6]2.5862,[7]2.4186,[8]2.3621,[9]2.3981,[10]2.3876,[11]2.3971,[12]2.3973,
Final estimate: PPL = 2.3973 +/- 0.07080

perplexity EdgeQuant mix3ns (WITH imatrix) FILESIZE 60826 MB //BEST ONE FOR 64GB MACHINE
[1]2.7921,[2]3.2356,[3]3.8254,[4]3.3874,[5]2.9992,[6]2.8053,[7]2.7000,[8]2.6565,[9]2.7085,[10]2.7248,[11]2.7627,[12]2.7589,
Final estimate: PPL = 2.7589 +/- 0.08399

perplexity 2K (no imatrix) 207.70 seconds per pass - FILESIZE 47564MB (mix2k-noimatrix-but-usable-reference.gguf)
[1]2.9401,[2]3.4224,[3]4.0174,[4]3.8503,[5]3.5607,[6]3.4449,[7]3[9]3.5589,[10]3.6546,[11]3.7810,[12]3.7733,
Final estimate: PPL = 3.7733 +/- 0.13299

perplexity EdgeQuant mix2ns (WITH imatrix) FILESIZE 44024 MB //BEST ONE FOR 48GB LIMIT
[1]2.9890,[2]3.4809,[3]4.0181,[4]4.1660,[5]4.0785,[6]3.9915,[7]4.0004,[8]3.9970,[9]4.0762,[10]4.1886,[11]4.3717,[12]4.3661,
Final estimate: PPL = 4.3661 +/- 0.16065
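The "BEST ONE FOR ..." labels above can be read as a simple rule: among the files that fit a memory budget, take the one with the lowest perplexity. The sketch below encodes that rule with the file sizes and PPL values listed in this README. Two things are assumptions made for illustration: 1 GB is counted as 1000 MB, and `headroom_mb=3000` is an arbitrary allowance for context and runtime buffers, not a measured requirement. `best_fit` is a hypothetical helper name.

```python
# (label, file size in MB, final PPL over 12 chunks), copied from the runs above.
QUANTS = [
    ("iq4-ns, no imatrix", 77258, 2.4013),
    ("mix4ns, imatrix", 73636, 2.3973),
    ("mix3ns, imatrix", 60826, 2.7589),
    ("mix2k, no imatrix", 47564, 3.7733),
    ("mix2ns, imatrix", 44024, 4.3661),
]


def best_fit(budget_gb, headroom_mb=3000, quants=QUANTS):
    """Lowest-PPL quant whose file fits in the budget minus headroom, or None."""
    usable_mb = budget_gb * 1000 - headroom_mb
    fitting = [q for q in quants if q[1] <= usable_mb]
    if not fitting:
        return None
    return min(fitting, key=lambda q: q[2])


for gb in (80, 64, 48):
    print(gb, "GB ->", best_fit(gb))
```

With zero headroom the 48 GB case would select the 47564 MB mix2k file instead of mix2ns, so the result depends on how much room is reserved beyond the file itself.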

The command used to run these models was:

./main -m mix4ns.gguf -n 256 -t 48 --temp 0.5 --color -p "How to build a city on mars via shipping through aldrin cycler orbits?"


GGUF

Model size: 141B params

Architecture: llama


Model tree for nisten/mixtral8x22-imatrix-gguf

Base model

mistral-community/Mixtral-8x22B-v0.1
