Phi-3.5-MoE Q8_0 with CPU Offloading
This is a Q8_0 quantization of Microsoft's Phi-3.5-MoE-Instruct model with MoE (Mixture of Experts) CPU offloading capability enabled via Rust bindings for llama.cpp.
Model Details
- Base Model: microsoft/Phi-3.5-MoE-instruct
- Quantization: Q8_0 (8-bit)
- File Size: 42 GB (from 79 GB F16)
- Architecture: Mixture of Experts (MoE)
- License: MIT
- Feature: MoE expert CPU offloading support
Performance Benchmarks
Tested on Lambda Cloud GH200 (96GB VRAM, 480GB RAM, CUDA 12.8) with shimmy v1.6.0:
| Configuration | VRAM Usage | VRAM Saved | Reduction |
|---|---|---|---|
| All GPU (baseline) | 41.91 GB | - | - |
| CPU Offload (--cpu-moe) | 2.46 GB | 39.45 GB | 94.1% |
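The "VRAM Saved" and "Reduction" columns follow directly from the two measured VRAM figures. A minimal sketch of that arithmetic, using only the numbers reported in the table (the variable names are illustrative):

```python
# VRAM figures copied from the benchmark table (GB).
baseline_gb = 41.91  # all weights on the GPU
offload_gb = 2.46    # run with --cpu-moe

saved_gb = baseline_gb - offload_gb
reduction_pct = 100.0 * saved_gb / baseline_gb

print(f"saved: {saved_gb:.2f} GB, reduction: {reduction_pct:.1f}%")
# prints "saved: 39.45 GB, reduction: 94.1%"
```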
Key Metrics
- VRAM Reduction: 94.1% with CPU offloading enabled
- Generation Quality: Near-F16 quality, minimal degradation
- Average Tokens Generated: 73 tokens per test (N=3)
- Test Prompt: "Explain quantum computing in simple terms"
What is MoE CPU Offloading?
Mixture of Experts models activate only a subset of their experts for each token (sparse activation). This release is paired with Rust bindings that expose llama.cpp's MoE CPU offloading feature, which keeps the expert weights in system RAM instead of VRAM so that only the remaining tensors occupy the GPU.
Note: The core MoE CPU offloading algorithm was implemented in llama.cpp (PR #15077, August 2025). This release provides Rust language bindings and production integration for that functionality.
Usage
With shimmy CLI
# Download the model
huggingface-cli download MikeKuykendall/phi-3.5-moe-q8-0-cpu-offload-gguf \
phi-3.5-moe-Q8_0.gguf --local-dir ./models
# Run with CPU offloading (uses ~2.5 GB VRAM)
shimmy serve \
--model-dirs ./models \
--cpu-moe \
--bind 127.0.0.1:11435
# Run without offloading (uses ~42 GB VRAM)
shimmy serve \
--model-dirs ./models \
--bind 127.0.0.1:11435
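Once the server is listening, it can be queried over HTTP. The sketch below assumes shimmy serves an OpenAI-style chat completions route at `/v1/chat/completions` and that the model is addressable as `phi-3.5-moe-Q8_0`. Both are assumptions, not confirmed by this card, so check the route and the model name your server actually reports. Only the address comes from the `--bind` value above.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:11435"  # matches the --bind value used above
ENDPOINT = "/v1/chat/completions"    # assumption: OpenAI-style route
MODEL_NAME = "phi-3.5-moe-Q8_0"      # hypothetical name, verify against the server


def build_request(prompt, max_tokens=128):
    """Build (but do not send) a chat completion request."""
    payload = {
        "model": MODEL_NAME,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        BASE_URL + ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_request(prompt), timeout=300) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(ask("Explain quantum computing in simple terms"))` would send the same test prompt used in the benchmarks.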
With llama-cpp-2 (Rust)
use std::num::NonZeroU32;

use llama_cpp_2::context::params::LlamaContextParams;
use llama_cpp_2::llama_backend::LlamaBackend;
use llama_cpp_2::model::params::LlamaModelParams;
use llama_cpp_2::model::LlamaModel;

fn main() {
    let backend = LlamaBackend::init().unwrap();

    // Enable MoE CPU offloading
    let model_params = LlamaModelParams::default()
        .with_cpu_moe_all(); // Keep all MoE expert weights on the CPU

    let model = LlamaModel::load_from_file(
        &backend,
        "phi-3.5-moe-Q8_0.gguf",
        &model_params,
    ).unwrap();

    // with_n_ctx takes an Option<NonZeroU32>
    let ctx_params = LlamaContextParams::default()
        .with_n_ctx(NonZeroU32::new(2048));

    let mut ctx = model.new_context(&backend, ctx_params).unwrap();
    // ... tokenize and generate as normal
}
With llama.cpp (C++)
# Build llama.cpp with CUDA support
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release
# Run with CPU offloading
./build/bin/llama-cli \
-m phi-3.5-moe-Q8_0.gguf \
-p "Explain quantum computing" \
--cpu-moe
When to Use This Quantization
Use Q8_0 if you want:
- Highest quality: Near-F16 accuracy with minimal quality loss
- Production critical: Quality-sensitive applications
- Still save VRAM: 94% VRAM reduction with CPU offloading (2.5 GB vs 42 GB)
- Best of both worlds: High quality + VRAM savings
Consider alternatives if:
- Smaller size needed → Use Q4_K_M variant (24 GB, good balance)
- Maximum compression → Use Q2_K variant (15 GB, 1.3 GB VRAM)
- Absolute precision → Use F16 base model (79 GB, no quantization)
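The choice above can be reduced to a size check. The following is a rough sketch, not a sizing tool: it takes the file sizes quoted in this card and returns the largest variant that fits a memory budget. Treating the file size as the memory needed for the weights is an assumption, and `pick_variant` is a hypothetical helper name.

```python
# (name, file size in GB) as quoted in this card.
VARIANTS = [("F16", 79), ("Q8_0", 42), ("Q4_K_M", 24), ("Q2_K", 15)]


def pick_variant(memory_budget_gb):
    """Return the largest variant whose file fits in the budget, or None."""
    fitting = [(size, name) for name, size in VARIANTS if size <= memory_budget_gb]
    return max(fitting)[1] if fitting else None


for budget in (96, 64, 32, 16, 8):
    print(budget, "GB ->", pick_variant(budget))
```

With `--cpu-moe`, most of that budget is system RAM rather than VRAM, since the expert weights stay on the CPU side.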
Quantization Details
- Method: 8-bit quantization (Q8_0)
- Bits per weight: 8 bits, plus one 16-bit scale per block of 32 weights (about 8.5 bits effective)
- Quantization tool: llama-quantize (llama.cpp b6686)
- Source: F16 version of microsoft/Phi-3.5-MoE-instruct
- Trade-off: Larger size, nearly lossless quality
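To make the "nearly lossless" trade-off concrete, here is a simplified sketch of Q8_0-style block quantization. Each block of 32 weights shares one scale, and each weight is stored as a signed 8-bit integer. It is an illustration, not the llama-quantize implementation: the real format stores the scale as a 16-bit float, and Python's `round` resolves exact ties differently from C's `roundf`. The sample values in `xs` are made up.

```python
BLOCK = 32  # weights per Q8_0 block


def quantize_block(xs):
    """Quantize one block of 32 floats to (scale, 32 signed 8-bit ints)."""
    assert len(xs) == BLOCK
    amax = max(abs(x) for x in xs)
    d = amax / 127.0  # one shared scale per block
    if d == 0.0:
        return 0.0, [0] * BLOCK
    return d, [round(x / d) for x in xs]


def dequantize_block(d, qs):
    """Reconstruct approximate floats from the scale and the 8-bit ints."""
    return [d * q for q in qs]


# Made-up sample weights in the range [-2, 2).
xs = [((i * 37) % 64 - 32) / 16.0 for i in range(BLOCK)]
d, qs = quantize_block(xs)
err = max(abs(a - b) for a, b in zip(xs, dequantize_block(d, qs)))

# 2-byte scale + 32 one-byte ints per block of 32 weights.
bits_per_weight = (2 + BLOCK) * 8 / BLOCK
print(f"max abs error: {err:.4f} (scale {d:.4f})")
print(f"bits per weight: {bits_per_weight}")
print(f"estimated size from 79 GB F16: {79 * bits_per_weight / 16:.1f} GB")
```

The per-weight error is bounded by half the block scale, and the 8.5 bits per weight estimate is consistent with the 42 GB file size reported above.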
Technical Notes
MoE Architecture
Phi-3.5-MoE uses a sparse Mixture of Experts architecture where only a subset of experts are activated per token. This allows the model to have high capacity (many parameters) while maintaining efficiency (sparse activation).
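Sparse activation can be illustrated with a generic top-k router. This is a toy sketch, not Phi-3.5-MoE's actual routing code. The figures of 16 experts with 2 active per token are an assumption taken from the base model's description rather than from this card, and the scalar "experts" stand in for full feed-forward blocks.

```python
import math


def route_top_k(router_logits, k=2):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    # Softmax over the selected logits only, so the k weights sum to 1.
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_layer(x, experts, router_logits, k=2):
    """Weighted sum over the selected experts; the others are never called."""
    return sum(w * experts[i](x) for i, w in route_top_k(router_logits, k))


# Toy setup: 16 scalar experts, where expert i multiplies its input by i.
experts = [lambda x, i=i: i * x for i in range(16)]
logits = [0.0] * 16
logits[3] = 2.0
logits[7] = 2.0

print(route_top_k(logits))
print(moe_layer(1.0, experts, logits))
```

Only 2 of the 16 experts run for this input, which is why the bulk of the expert weights can live outside VRAM without being touched for every token.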
CPU Offloading Implementation
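The --cpu-moe flag (or with_cpu_moe_all() in Rust) tells llama.cpp to:
- Keep the MoE expert weight tensors in system RAM rather than VRAM
- Keep the remaining tensors (attention, embeddings, router) and the KV cache on the GPU
- Evaluate the expert layers from system RAM during generation
In the benchmark above, this reduced VRAM usage from 41.91 GB to 2.46 GB. Expect lower generation speed than the all-GPU configuration, because the expert layers no longer run from VRAM.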
VRAM Breakdown (CPU Offload Mode)
- Model buffer: ~1.3 GB (weights kept on the GPU)
- KV cache: 0.51 GB
- Compute buffer: 0.10 GB
- Total: ~2.5 GB measured (the three buffers listed sum to about 1.9 GB; the remainder is not itemized here)
Sample Output
Prompt: "Explain quantum computing in simple terms"
Response:
Quantum computing is a type of computing that uses the principles of quantum mechanics, a branch of physics that describes the behavior of particles at the smallest scales. Unlike classical computers that use bits (0s and 1s) to process information...
(High-quality response, near-F16 quality)
Citation
If you use this model in your work, please cite the original Phi-3.5 paper and acknowledge the quantization:
@article{phi3.5,
title={Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone},
author={Microsoft Research},
year={2024}
}
Links
- Original Model: microsoft/Phi-3.5-MoE-instruct
- shimmy Project: github.com/utilityai/shimmy
- llama.cpp: github.com/ggerganov/llama.cpp
- Other Quantizations:
License: MIT (inherited from base model)
Quantized by: MikeKuykendall
Date: October 2025