Importance-Matrix quantizations of Mixtral-8x22B-v0.1
The imatrix.dat file was calculated over 1000 chunks of wikitext.train.raw (included).
Wrote a bit of custom C++ to avoid quantizing certain layers; tested as fully compatible with llama.cpp as of 10 April 2024.
To join the parts into a single file (not needed with llama.cpp, which autodetects the chunks, but it can help when troubleshooting Ollama):
cat mix4ns-0000* > mix4ns.gguf
Careful: this can take 5 minutes, or up to 10-15 on slow instances. Check progress with ls -la.
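For machines where `cat` is not available, the same join can be sketched in Python. This is a minimal sketch, assuming the parts are plain byte-splits of one file (which is what the `cat` command above implies) and that sorting the file names gives the right order. The function name `concat_shards` and the buffer size are illustrative choices, not part of this repo.

```python
import glob
import shutil


def concat_shards(pattern, out_path, chunk_bytes=64 * 1024 * 1024):
    """Append every file matching `pattern`, in sorted name order, to `out_path`.

    Returns the list of shard paths that were joined.
    """
    # Sorted order matters: 00001 must come before 00002, and so on.
    shards = sorted(p for p in glob.glob(pattern) if p != out_path)
    if not shards:
        raise FileNotFoundError(f"no files match {pattern!r}")
    with open(out_path, "wb") as out:
        for shard in shards:
            with open(shard, "rb") as src:
                # Stream in fixed-size chunks so the shard is never held in RAM.
                shutil.copyfileobj(src, out, chunk_bytes)
            print(f"appended {shard}")
    return shards


# Example (equivalent to: cat mix4ns-0000* > mix4ns.gguf)
# concat_shards("mix4ns-0000*", "mix4ns.gguf")
```

The per-shard print gives a rough progress indicator, similar to watching the output size grow with ls -la.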
Run with llama.cpp
git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp/ && make -j
./main -m ~/mix4ns-00001-of-00005.gguf -n 256 -t 64 --temp 0.2 --color -p "How to build a city on mars via aldrin cycler orbits?"
Perplexity benchmarks
This is the command I used to run these on a 48-core, CPU-only machine. You can add -ngl 16 to offload 16 layers (or more) to a GPU.
./perplexity -m ~/mix4xs.gguf -f wiki.test.raw --chunks 12 -t 48
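The perplexity tool prints a running estimate after each chunk, in the form `[1]2.6256,[2]3.1043,...`, followed by a `Final estimate: PPL = ... +/- ...` line. Perplexity itself is the exponential of the mean negative log-likelihood per token. The sketch below shows that formula and two small parsers for the log lines; the function names are illustrative and not part of llama.cpp.

```python
import math
import re


def perplexity(token_logprobs):
    """Perplexity = exp(-mean(log p(token))), using natural logs."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def parse_running_ppl(line):
    """Turn '[1]2.6256,[2]3.1043,' into [(1, 2.6256), (2, 3.1043)]."""
    pairs = re.findall(r"\[(\d+)\](\d+\.\d+)", line)
    return [(int(chunk), float(value)) for chunk, value in pairs]


def parse_final_estimate(line):
    """Turn 'Final estimate: PPL = 2.2585 +/- 0.06534' into (2.2585, 0.06534)."""
    match = re.search(r"PPL = (\d+\.\d+) \+/- (\d+\.\d+)", line)
    if match is None:
        raise ValueError(f"no final estimate found in {line!r}")
    return float(match.group(1)), float(match.group(2))
```

In the logs below, the last running value matches the final estimate, so the final line is the one to compare across files.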
The results are interesting. Converting from the hf bf16 folder to an f16 GGUF adds a bit of loss (increases perplexity). I've noticed on smaller models that going straight from the Hugging Face repo folder to 8-bit via python convert.py --outtype q8_0 produces lower perplexity than going hf -> f16 -> q8_0. What's even more interesting is that quantizing TWICE (hf -> q8_0, then q8_0 -> imatrix) also produces better perplexity than the regular f16 GGUF -> imatrix route.
All you need to pay attention to is the final value, PPL = 2.2585 in this case, which is that of a regular 8-bit quant.
NOT ALL 8 BIT ARE CREATED EQUAL: this took 9 hours to convert to 8-bit on a 64-core CPU with 256 GB RAM (8-channel DDR5).
Even though the file is a tiny bit slower, it gets a tiny bit lower perplexity. It looks like nothing here over 12 chunks (2.2584 for mix8ns vs 2.2585 for the regular q8_0), but past testing on smaller models and 100+ chunks has shown this difference to be a bit more pronounced.
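To put the 2.2584 vs 2.2585 gap in context, it can be compared against the reported +/- error bars. The sketch below is a rough check only: it treats the two runs as independent, which is my assumption and not how the tool reports things. Both runs score the same text, so their errors are correlated and a paired comparison over more chunks would be more sensitive.

```python
import math


def z_score(ppl_a, err_a, ppl_b, err_b):
    """Gap between two perplexities in units of their combined standard error.

    Assumes the two error estimates are independent (a simplification).
    """
    return abs(ppl_a - ppl_b) / math.sqrt(err_a ** 2 + err_b ** 2)


# Final estimates reported below for the two q8_0 files.
regular_q8 = (2.2585, 0.06534)   # q8_0 made from f16
slow_q8 = (2.2584, 0.06514)      # q8_0 made directly from hf via convert.py

z = z_score(*regular_q8, *slow_q8)
print(f"gap = {regular_q8[0] - slow_q8[0]:.4f}, z = {z:.4f}")
```

A z value far below 1 means the 12-chunk run cannot separate the two files on its own, which is why the longer runs mentioned above are needed.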
perplexity regular q8_0 (from f16): 126.35 seconds per pass - ETA 6.32 minutes
[1]2.6256,[2]3.1043,[3]3.6463,[4]3.2092,[5]2.6847,[6]2.4791,[7]2.3112,[8]2.2502,[9]2.2858,[10]2.2690,[11]2.2693,[12]2.2585,
Final estimate: PPL = 2.2585 +/- 0.06534
perplexity q8_0 (slow convert.py from hf): 96.86 seconds per pass - ETA 4.83 minutes
[1]2.6191,[2]3.1045,[3]3.6551,[4]3.2302,[5]2.6990,[6]2.4908,[7]2.3167,[8]2.2541,[9]2.2877,[10]2.2682,[11]2.2685,[12]2.2584,
Final estimate: PPL = 2.2584 +/- 0.06514
perplexity regular iq4_xs (no imatrix): 91.53 seconds per pass
[1]2.6966,[2]3.1749,[3]3.6972,[4]3.2577,[5]2.7905,[6]2.6097,[7]2.4536,[8]2.4001,[9]2.4469,[10]2.4219,[11]2.4366,[12]2.4367,
Final estimate: PPL = 2.4367 +/- 0.07218
perplexity regular q4_km (no imatrix): 108.59 seconds per pass
[1]2.6100,[2]3.1304,[3]3.6897,[4]3.3500,[5]2.8118,[6]2.5992,[7]2.4349,[8]2.3816,[9]2.4174,[10]2.3959,[11]2.3988,[12]2.3976,
Final estimate: PPL = 2.3976 +/- 0.07111
perplexity EdgeQuant iq4-ns (no imatrix) 84.45 seconds per pass - FILESIZE 77258 MB
[1]2.7195,[2]3.1821,[3]3.7177,[4]3.3017,[5]2.8012,[6]2.6034,[7]2.4318,[8]2.3747,[9]2.4160,[10]2.3931,[11]2.4023,[12]2.4013,
Final estimate: PPL = 2.4013 +/- 0.07116
perplexity EdgeQuant iq4-ns (WITH imatrix) 82.76 seconds per pass - FILESIZE 73636 MB ( mix4ns.gguf ) //BEST ONE FOR 80GB CARD
[1]2.7166,[2]3.1720,[3]3.6988,[4]3.3195,[5]2.7949,[6]2.5862,[7]2.4186,[8]2.3621,[9]2.3981,[10]2.3876,[11]2.3971,[12]2.3973,
Final estimate: PPL = 2.3973 +/- 0.07080
perplexity EdgeQuant mix3ns (WITH imatrix) FILESIZE 60826 MB //BEST ONE FOR 64GB MACHINE
[1]2.7921,[2]3.2356,[3]3.8254,[4]3.3874,[5]2.9992,[6]2.8053,[7]2.7000,[8]2.6565,[9]2.7085,[10]2.7248,[11]2.7627,[12]2.7589,
Final estimate: PPL = 2.7589 +/- 0.08399
perplexity 2K (no imatrix) 207.70 seconds per pass - FILESIZE 47564MB (mix2k-noimatrix-but-usable-reference.gguf)
[1]2.9401,[2]3.4224,[3]4.0174,[4]3.8503,[5]3.5607,[6]3.4449,[7]3[9]3.5589,[10]3.6546,[11]3.7810,[12]3.7733,
Final estimate: PPL = 3.7733 +/- 0.13299
perplexity EdgeQuant mix2ns (WITH imatrix) FILESIZE 44024 MB //BEST ONE FOR 48GB LIMIT
[1]2.9890,[2]3.4809,[3]4.0181,[4]4.1660,[5]4.0785,[6]3.9915,[7]4.0004,[8]3.9970,[9]4.0762,[10]4.1886,[11]4.3717,[12]4.3661,
Final estimate: PPL = 4.3661 +/- 0.16065
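The size and perplexity numbers above can be lined up to see the trade-off at a glance. The sketch below copies the final estimates and file sizes from the results that report both, and uses the regular q8_0 value (2.2585) as the baseline. The helper `best_that_fits` is a hypothetical name; it only compares file size against a budget and ignores context/KV-cache and runtime overhead, so treat its answer as an upper bound on what fits.

```python
BASELINE_PPL = 2.2585  # regular q8_0 (from f16), final estimate

# (label, final PPL, file size in MB), copied from the results above.
RESULTS = [
    ("iq4-ns, no imatrix", 2.4013, 77258),
    ("mix4ns, imatrix", 2.3973, 73636),
    ("mix3ns, imatrix", 2.7589, 60826),
    ("2K, no imatrix", 3.7733, 47564),
    ("mix2ns, imatrix", 4.3661, 44024),
]


def relative_increase(ppl, baseline=BASELINE_PPL):
    """Fractional perplexity increase over the baseline (0.10 means +10%)."""
    return (ppl - baseline) / baseline


def best_that_fits(results, budget_mb):
    """Lowest-perplexity entry whose file size is within budget_mb, else None."""
    fitting = [row for row in results if row[2] <= budget_mb]
    return min(fitting, key=lambda row: row[1]) if fitting else None


for label, ppl, size_mb in RESULTS:
    print(f"{label:<20} {size_mb:>6} MB  PPL {ppl:.4f}  "
          f"({relative_increase(ppl):+.1%} vs q8_0)")
```

For example, `best_that_fits(RESULTS, 62000)` returns the mix3ns entry, since it is the lowest-perplexity file at or under that size.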
The command used to run these models was:
./main -m mix4ns.gguf -n 256 -t 48 --temp 0.5 --color -p "How to build a city on mars via shipping through aldrin cycler orbits?"
Base model: mistral-community/Mixtral-8x22B-v0.1
