Instructions for using MikeKuykendall/phi-3.5-moe-q8-0-cpu-offload-gguf with libraries and local apps.

Libraries

How to use MikeKuykendall/phi-3.5-moe-q8-0-cpu-offload-gguf with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="MikeKuykendall/phi-3.5-moe-q8-0-cpu-offload-gguf",
	filename="phi-3.5-moe-Q8_0.gguf",
)

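response = llm.create_chat_completion(
	messages=[
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)
# The return value follows the OpenAI chat-completion schema.
print(response["choices"][0]["message"]["content"])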

Local Apps

llama.cpp

How to use MikeKuykendall/phi-3.5-moe-q8-0-cpu-offload-gguf with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf MikeKuykendall/phi-3.5-moe-q8-0-cpu-offload-gguf:Q8_0
# Run inference directly in the terminal:
llama-cli -hf MikeKuykendall/phi-3.5-moe-q8-0-cpu-offload-gguf:Q8_0

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf MikeKuykendall/phi-3.5-moe-q8-0-cpu-offload-gguf:Q8_0
# Run inference directly in the terminal:
llama-cli -hf MikeKuykendall/phi-3.5-moe-q8-0-cpu-offload-gguf:Q8_0

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf MikeKuykendall/phi-3.5-moe-q8-0-cpu-offload-gguf:Q8_0
# Run inference directly in the terminal:
./llama-cli -hf MikeKuykendall/phi-3.5-moe-q8-0-cpu-offload-gguf:Q8_0

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf MikeKuykendall/phi-3.5-moe-q8-0-cpu-offload-gguf:Q8_0
# Run inference directly in the terminal:
./build/bin/llama-cli -hf MikeKuykendall/phi-3.5-moe-q8-0-cpu-offload-gguf:Q8_0

Use Docker

docker model run hf.co/MikeKuykendall/phi-3.5-moe-q8-0-cpu-offload-gguf:Q8_0

vLLM

How to use MikeKuykendall/phi-3.5-moe-q8-0-cpu-offload-gguf with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "MikeKuykendall/phi-3.5-moe-q8-0-cpu-offload-gguf"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "MikeKuykendall/phi-3.5-moe-q8-0-cpu-offload-gguf",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Ollama
How to use MikeKuykendall/phi-3.5-moe-q8-0-cpu-offload-gguf with Ollama:
```
ollama run hf.co/MikeKuykendall/phi-3.5-moe-q8-0-cpu-offload-gguf:Q8_0
```

Unsloth Studio

How to use MikeKuykendall/phi-3.5-moe-q8-0-cpu-offload-gguf with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for MikeKuykendall/phi-3.5-moe-q8-0-cpu-offload-gguf to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for MikeKuykendall/phi-3.5-moe-q8-0-cpu-offload-gguf to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for MikeKuykendall/phi-3.5-moe-q8-0-cpu-offload-gguf to start chatting

Docker Model Runner
How to use MikeKuykendall/phi-3.5-moe-q8-0-cpu-offload-gguf with Docker Model Runner:
```
docker model run hf.co/MikeKuykendall/phi-3.5-moe-q8-0-cpu-offload-gguf:Q8_0
```

Lemonade

How to use MikeKuykendall/phi-3.5-moe-q8-0-cpu-offload-gguf with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull MikeKuykendall/phi-3.5-moe-q8-0-cpu-offload-gguf:Q8_0

Run and chat with the model

lemonade run user.phi-3.5-moe-q8-0-cpu-offload-gguf-Q8_0

List all available models

lemonade list

Phi-3.5-MoE Q8_0 with CPU Offloading

This is a Q8_0 quantization of Microsoft's Phi-3.5-MoE-Instruct model with MoE (Mixture of Experts) CPU offloading capability enabled via Rust bindings for llama.cpp.

Model Details

Base Model: microsoft/Phi-3.5-MoE-instruct
Quantization: Q8_0 (8-bit)
File Size: 42 GB (from 79 GB F16)
Architecture: Mixture of Experts (MoE)
License: MIT
Feature: MoE expert CPU offloading support

Performance Benchmarks

Tested on Lambda Cloud GH200 (96GB VRAM, 480GB RAM, CUDA 12.8) with shimmy v1.6.0:

| Configuration | VRAM Usage | VRAM Saved | Reduction |
|---|---|---|---|
| All GPU (baseline) | 41.91 GB | - | - |
| CPU Offload (`--cpu-moe`) | 2.46 GB | 39.45 GB | 94.1% |
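The "VRAM Saved" and "Reduction" columns follow directly from the two measured values. A minimal sketch of that arithmetic, using only the figures from the table (the function name is illustrative):

```python
# Figures copied from the benchmark table (GB of VRAM).
BASELINE_GB = 41.91   # everything on the GPU
OFFLOAD_GB = 2.46     # with --cpu-moe


def vram_reduction(baseline_gb: float, offload_gb: float) -> tuple[float, float]:
    """Return (GB saved, percent reduction) relative to the baseline."""
    saved = baseline_gb - offload_gb
    return saved, 100.0 * saved / baseline_gb


saved_gb, reduction_pct = vram_reduction(BASELINE_GB, OFFLOAD_GB)
print(f"saved {saved_gb:.2f} GB ({reduction_pct:.1f}% reduction)")
# prints: saved 39.45 GB (94.1% reduction)
```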

Key Metrics

VRAM Reduction: 94.1% with CPU offloading enabled
Generation Quality: Near-F16 quality, minimal degradation
Average Tokens Generated: 73 tokens per test (N=3)
Test Prompt: "Explain quantum computing in simple terms"

What is MoE CPU Offloading?

Mixture of Experts models activate only a subset of their parameters per token (sparse activation). llama.cpp's MoE CPU offloading takes advantage of this by keeping the expert weight tensors in system RAM instead of VRAM. A GGUF file cannot contain bindings itself; this release pairs the quantized model with Rust bindings that expose the feature.

Note: The core MoE CPU offloading algorithm was implemented in llama.cpp (PR #15077, August 2025). This release provides Rust language bindings and production integration for that functionality.

Usage

With shimmy CLI

# Download the model
huggingface-cli download MikeKuykendall/phi-3.5-moe-q8-0-cpu-offload-gguf \
  phi-3.5-moe-Q8_0.gguf --local-dir ./models

# Run with CPU offloading (uses ~2.5 GB VRAM)
shimmy serve \
  --model-dirs ./models \
  --cpu-moe \
  --bind 127.0.0.1:11435

# Run without offloading (uses ~42 GB VRAM)
shimmy serve \
  --model-dirs ./models \
  --bind 127.0.0.1:11435
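Once the server is running, it can be queried over HTTP. The sketch below assumes shimmy serves an OpenAI-compatible `/v1/chat/completions` endpoint on the bind address used above; the endpoint path and the model name `phi-3.5-moe-Q8_0` are assumptions, so check what your server actually lists before relying on them.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:11435"   # matches --bind in the commands above
MODEL_NAME = "phi-3.5-moe-Q8_0"       # hypothetical: use the name your server reports


def build_chat_request(prompt: str, max_tokens: int = 128) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": MODEL_NAME,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/chat/completions",  # assumed endpoint path
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=300) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Example (requires a running server):
# print(ask("Explain quantum computing in simple terms"))
```

Building the request is kept separate from sending it so the payload can be inspected without a server running.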

With llama-cpp-2 (Rust)

use llama_cpp_2::context::params::LlamaContextParams;
use llama_cpp_2::llama_backend::LlamaBackend;
use llama_cpp_2::model::params::LlamaModelParams;
use llama_cpp_2::model::LlamaModel;

fn main() {
    let backend = LlamaBackend::init().unwrap();
    
    // Enable MoE CPU offloading
    let model_params = LlamaModelParams::default()
        .with_cpu_moe_all();  // Keep all MoE expert weights in system RAM
    
    let model = LlamaModel::load_from_file(
        &backend,
        "phi-3.5-moe-Q8_0.gguf",
        &model_params
    ).unwrap();
    
    // with_n_ctx takes an Option<NonZeroU32>, not a bare integer
    let ctx_params = LlamaContextParams::default()
        .with_n_ctx(std::num::NonZeroU32::new(2048));
    
    let mut ctx = model.new_context(&backend, ctx_params).unwrap();
    
    // ... tokenize and generate as normal
}

With llama.cpp (C++)

# Build llama.cpp with CUDA support
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release

# Run with CPU offloading
./build/bin/llama-cli \
  -m phi-3.5-moe-Q8_0.gguf \
  -p "Explain quantum computing" \
  --cpu-moe
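To check the effect of `--cpu-moe` on your own hardware, one option is to poll `nvidia-smi` while the model is loaded, once with the flag and once without. The model card does not say how its VRAM figures were measured, so this is only one possible approach; the sample string below is made up for illustration and is not a measurement.

```python
import subprocess


def parse_memory_used(smi_output: str) -> list[int]:
    """Parse `memory.used` values in MiB, one line per GPU."""
    return [int(line.strip()) for line in smi_output.splitlines() if line.strip()]


def query_vram_used_mib() -> list[int]:
    """Ask nvidia-smi for the memory currently in use on each GPU."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return parse_memory_used(out)


# Parsing shown on a made-up two-GPU sample, not real output from this model.
sample = "2519\n0\n"
print(parse_memory_used(sample))  # prints: [2519, 0]
```

Call `query_vram_used_mib()` on a machine with an NVIDIA driver installed to get live values.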

When to Use This Quantization

✅ Use Q8_0 if you want:

Highest quality: near-F16 accuracy with minimal quality loss
Production-critical use: quality-sensitive applications
VRAM savings: 94% VRAM reduction with CPU offloading (2.5 GB vs 42 GB)
Best of both worlds: high quality plus VRAM savings

❌ Consider alternatives if:

Smaller size needed → Use Q4_K_M variant (24 GB, good balance)
Maximum compression → Use Q2_K variant (15 GB, 1.3 GB VRAM)
Absolute precision → Use F16 base model (79 GB, no quantization)

Quantization Details

Method: 8-bit quantization (Q8_0)
Bits per weight: 8 bits per quantized value (about 8.5 bits per weight once the per-block scale is included)
Quantization tool: llama-quantize (llama.cpp b6686)
Source: F16 version of microsoft/Phi-3.5-MoE-instruct
Trade-off: Larger size, nearly lossless quality
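The file sizes quoted in this card are consistent with the Q8_0 block layout in ggml, which stores 32 weights as 8-bit integers plus one 16-bit scale per block. The sketch below is a rough estimate that assumes every tensor is quantized to Q8_0, which is a simplification, since some small tensors are typically kept at higher precision.

```python
BLOCK_WEIGHTS = 32          # weights per Q8_0 block
BLOCK_BYTES = 32 * 1 + 2    # 32 int8 values + one 16-bit scale


def q8_0_bits_per_weight() -> float:
    """Effective storage cost per weight, including the block scale."""
    return BLOCK_BYTES * 8 / BLOCK_WEIGHTS


def estimate_q8_0_size_gb(f16_size_gb: float) -> float:
    """Scale an F16 file size by the Q8_0 / F16 bits-per-weight ratio."""
    return f16_size_gb * q8_0_bits_per_weight() / 16.0


print(q8_0_bits_per_weight())                  # prints: 8.5
print(round(estimate_q8_0_size_gb(79.0), 1))   # prints: 42.0
```

The estimate lands on the 42 GB listed under Model Details, starting from the 79 GB F16 file.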

Technical Notes

MoE Architecture

Phi-3.5-MoE uses a sparse Mixture of Experts architecture in which only a subset of experts is activated per token. This gives the model high capacity (many parameters) while keeping per-token compute low (sparse activation).
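To make sparse activation concrete, here is a toy sketch of generic top-k softmax routing. It is not the router Phi-3.5-MoE actually uses, and the expert count and `TOP_K` value are illustrative constants rather than figures from this card.

```python
import math
import random

NUM_EXPERTS = 16   # illustrative; not taken from this model card
TOP_K = 2          # illustrative number of experts used per token


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route(router_logits, k=TOP_K):
    """Pick the k highest-scoring experts and renormalise their weights."""
    order = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    weights = softmax([router_logits[i] for i in top])
    return list(zip(top, weights))


random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]  # stand-in router scores
chosen = route(logits)
print(chosen)  # (expert index, mixing weight) pairs
print(f"{len(chosen)}/{NUM_EXPERTS} experts run for this token")
```

Only the selected experts' feed-forward weights are needed for a given token, which is why the bulk of the expert parameters can live outside VRAM.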

CPU Offloading Implementation

The --cpu-moe flag (or with_cpu_moe_all() in Rust) tells llama.cpp to:

Keep all MoE expert weight tensors in system RAM
Place the remaining weights according to the usual GPU offload settings
Evaluate only the experts the router selects for each token

This dramatically reduces VRAM usage with a manageable performance trade-off.

VRAM Breakdown (CPU Offload Mode)

Model buffer: ~1.3 GB (weights kept on the GPU)
KV cache: 0.51 GB
Compute buffer: 0.10 GB
Total: ~2.5 GB measured (the three buffers listed sum to about 1.9 GB; the remainder is not itemized)

Sample Output

Prompt: "Explain quantum computing in simple terms"

Response:

Quantum computing is a type of computing that uses the principles of quantum mechanics, a branch of physics that describes the behavior of particles at the smallest scales. Unlike classical computers that use bits (0s and 1s) to process information...

(High-quality response, near-F16 quality)

Citation

If you use this model in your work, please cite the original Phi-3.5 paper and acknowledge the quantization:

@article{phi3.5,
  title={Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone},
  author={Microsoft Research},
  year={2024}
}
