gemma-4-12B-it — FP8 + DSpark speculative decoding
A load-and-go FP8 quantization of google/gemma-4-12B-it,
plus a reproducible recipe for running it with DeepSeek DSpark
speculative decoding (draft head: deepseek-ai/dspark_gemma4_12b_block7)
— on a single 32 GB Blackwell GPU (validated on an RTX 5090).
Two recipes are documented here, both starting from the same FP8 target:
| recipe | speed | context | notes |
|---|---|---|---|
| Fast (torch.compile max-autotune) | ≈150 tok/s on code (≈90 tok/s on prose) | ~32 k | short/medium chat + code |
| Long-context (windowed KV cache) | ~40–55 tok/s | 128 k = 26.6 GB, 256 k = 28.7 GB, in-VRAM | full 256 k on 32 GB |
The weights in this repo are the FP8 target. The DSpark draft, DeepSpec loop, and the two recipes are
described below (code in recipe/).
The FP8 quantization (this repo's weights)
The weights use torchao dynamic-activation / dynamic-weight float8 quantization, per-row, forced onto torch's native
_scaled_mm kernel. On Blackwell (sm_120) the _scaled_mm fp8 matmul is ~2.5× as fast as a bf16 matmul. The default
KernelPreference.AUTO instead tries a cutlass kernel that doesn't load on sm_120/py3.12 and silently
falls back to a slow dequant path, so KernelPreference.TORCH is essential.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load-and-go: the fp8 quantization_config is baked into config.json — no config needed at load.
model = AutoModelForCausalLM.from_pretrained(
"skibare87/gemma-4-12B-it-FP8-DSpark",
dtype=torch.bfloat16, device_map="cuda", attn_implementation="sdpa",
).eval()
tok = AutoTokenizer.from_pretrained("skibare87/gemma-4-12B-it-FP8-DSpark")
To re-quantize google/gemma-4-12B-it yourself instead of using these weights:
import torch
from transformers import AutoModelForCausalLM, TorchAoConfig
from torchao.quantization import Float8DynamicActivationFloat8WeightConfig, PerRow
from torchao.quantization.quantize_.common.kernel_preference import KernelPreference
cfg = Float8DynamicActivationFloat8WeightConfig(
granularity=PerRow(), kernel_preference=KernelPreference.TORCH, # native _scaled_mm, not AUTO
)
model = AutoModelForCausalLM.from_pretrained(
"google/gemma-4-12B-it", quantization_config=TorchAoConfig(cfg),
dtype=torch.bfloat16, device_map="cuda", attn_implementation="sdpa",
).eval()
# model.save_pretrained("gemma-4-12B-it-FP8") # <- produces the checkpoint in this repo
FP8 target ≈ 13 GB (vs ~24 GB bf16). Draft head ≈ 7 GB.
Recipe 1 — Fast (torch.compile max-autotune), ~150 tok/s on code
DSpark verifies the draft's proposals against the FP8 target. Speculation shines where the draft is
predictable: on code, accept-length ≈ 5 and 150 tok/s (136–171 measured, ~2× a plain 12B); on
open-ended prose accept-length ≈ 2.5 (90 tok/s). Speed comes from
torch.compile(mode="max-autotune-no-cudagraphs", dynamic=True) on the target, which fuses the fp8
activation-quant + _scaled_mm into proper Triton kernels.
- Compile is slow (30 min cold). Persist it: set TORCHINDUCTOR_CACHE_DIR off /tmp, and use torch 2.11's mega-cache (torch.compiler.save_cache_artifacts() / load_cache_artifacts()) so restarts are a cache hit (4 min) instead of a recompile. torch._dynamo.config.caching_precompile does not work with torchao fp8 (it can't serialize Float8Tensor guards).
- (AOTInductor gives ~21 s startup / ~173 tok/s but its static cache caps context at ~32 k on 32 GB; fine for short context, superseded by Recipe 2 for long context.)
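The accept-length figures above come from a draft-then-verify step. As a rough illustration, here is a generic greedy verification step in plain Python. It is not DSpark's actual loop: the function name `verify_draft` and the toy token values are hypothetical, and real implementations work on logits rather than token lists.

```python
def verify_draft(draft_tokens, target_tokens):
    """Generic greedy draft-and-verify step (illustrative only).

    draft_tokens:  tokens proposed by the draft for the next positions.
    target_tokens: the target's greedy choice at each of those positions,
                   plus one extra choice after the last draft token,
                   so len(target_tokens) == len(draft_tokens) + 1.
    Returns the tokens committed by this step.
    """
    committed = []
    for proposed, wanted in zip(draft_tokens, target_tokens):
        if proposed != wanted:
            # First mismatch: keep the target's token and stop.
            committed.append(wanted)
            return committed
        committed.append(proposed)
    # Every draft token matched, so the target's extra token is free.
    committed.append(target_tokens[len(draft_tokens)])
    return committed


# Predictable continuation (code-like): the whole draft is accepted.
print(verify_draft([1, 2, 3], [1, 2, 3, 4]))
# Less predictable continuation: the draft diverges at the third token.
print(verify_draft([1, 2, 3], [1, 2, 9, 4]))
```

Each step costs one target forward pass regardless of how many tokens are committed, which is why a longer accepted prefix translates into higher tok/s.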
Recipe 2 — Long-context (128 k–256 k in-VRAM on 32 GB)
gemma-4-12B has 40 sliding-attention layers (window 1024) + 8 full-attention layers. A plain
DynamicCache stores the sliding layers full-length, so 256 k KV would be 90 GB. Windowing the
sliding layers makes 256 k KV **5 GB**. The pieces (patches in recipe/, applied to a
DeepSpec checkout):
- windowed_cache.py (SpecSlidingLayer): a crop-safe sliding cache. It stores window + pad so a speculative reject never eats into the real window, and get_mask_sizes reports the true stored length so gemma's sliding mask stays aligned. Validated logit-exact vs the full forward past 1024 tokens, through crop cycles.
- Chunked prefill (base_evaluator.patch): prefill long prompts in chunks and keep only the draft's target hidden-state layers, with a rolling window so target_hidden_states never materializes full-length. This was the real memory lever (128 k: 38.7 → 26.6 GB).
- Windowed draft context (evaluator.patch): the draft only proposes, so its context is windowed (DSPARK_DRAFT_CTX_WINDOW, default 16384); a cumulative-offset trick keeps absolute positions correct.
- Efficient SDPA backend: the 8 full-attn layers have head_dim=512; flash-attn caps at 256, so force torch.nn.attention.sdpa_kernel([EFFICIENT_ATTENTION, MATH]). The math backend uses 32 GB for one such attention, efficient uses 4.4 GB.
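To make the "window + pad" idea from the first bullet concrete, here is a toy model of a crop-safe sliding cache in plain Python. It is a sketch of the idea only, not the code in windowed_cache.py: the class `CropSafeWindow`, its methods, and the tiny window and pad sizes are all hypothetical, and integers stand in for KV entries.

```python
class CropSafeWindow:
    """Toy sliding cache that keeps up to window + pad entries (illustrative)."""

    def __init__(self, window, pad):
        self.window = window
        self.pad = pad
        self.items = []   # stored entries, oldest first
        self.length = 0   # absolute sequence length seen so far

    def append(self, new_items):
        self.items.extend(new_items)
        self.length += len(new_items)
        overflow = len(self.items) - (self.window + self.pad)
        if overflow > 0:
            del self.items[:overflow]  # evict only the oldest entries

    def crop(self, new_length):
        """Roll back to new_length after speculative tokens are rejected."""
        drop = self.length - new_length
        if drop > 0:
            del self.items[-drop:]
            self.length = new_length

    def visible(self):
        """Entries sliding attention should see: the last `window` stored ones."""
        return self.items[-self.window:]


def run(pad):
    cache = CropSafeWindow(window=4, pad=pad)
    cache.append(list(range(10)))  # committed tokens 0..9
    cache.append([10, 11])         # two speculative tokens
    cache.crop(10)                 # both rejected
    return cache.visible()


print(run(pad=2))  # [6, 7, 8, 9]: the full window survives the reject
print(run(pad=0))  # [8, 9]: a window-sized cache already evicted 6 and 7
```

In this toy, the pad has to be at least as large as the number of tokens that can be rejected in one step, otherwise the rollback still cuts into the window.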
Measured on the RTX 5090: 128 k = 26.6 GB / 56 s, 256 k = 28.7 GB / 181 s, both fully in-VRAM.
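The effect of windowing the sliding layers can be sanity-checked by counting cached positions per layer. The sketch below uses the layer counts and window from this recipe and assumes "256 k" means 262144 tokens. It counts positions rather than bytes, because the byte totals quoted above also depend on per-layer KV head counts and head dims, which are not modelled here. The function name `cached_positions` is hypothetical.

```python
def cached_positions(seq_len, n_sliding, n_full, window=None):
    """Count KV positions held across all layers.

    window=None models a cache that stores sliding layers full-length.
    """
    per_sliding = seq_len if window is None else min(seq_len, window)
    return n_sliding * per_sliding + n_full * seq_len


SEQ = 256 * 1024  # assumption: "256 k" = 262144 tokens

full_length = cached_positions(SEQ, n_sliding=40, n_full=8)
windowed = cached_positions(SEQ, n_sliding=40, n_full=8, window=1024)

print(full_length)  # 12582912
print(windowed)     # 2138112
```

With windowing, almost all remaining positions belong to the 8 full-attention layers; the 40 sliding layers contribute only 40 × 1024 positions.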
Heads-up: a transformers bug you'll hit at long context
DynamicCache(config=...) (and get_head_shapes) do layer_types[:-num_kv_shared_layers]; gemma-4-12B
has num_kv_shared_layers = 0, so [:-0] is an empty list and the sliding-window cache layers are
silently never created (everything becomes full-storage). The workaround (build the cache layers
manually) is in windowed_cache.py; details + a minimal repro in
recipe/transformers-num-kv-shared-layers-bug.md.
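The slicing pitfall is plain Python behaviour and can be reproduced without transformers. In the sketch below the layer list is a toy stand-in and `drop_shared_layers` is a hypothetical helper, not the library's fix.

```python
# Toy stand-in for a config's per-layer types (illustrative values only).
layer_types = ["sliding_attention"] * 5 + ["full_attention"]
num_kv_shared_layers = 0

# -0 == 0, so this is layer_types[:0], which is an empty list.
buggy = layer_types[:-num_kv_shared_layers]


def drop_shared_layers(types, num_shared):
    """Drop the last num_shared entries; safe when num_shared is 0."""
    return types[: len(types) - num_shared]


safe = drop_shared_layers(layer_types, num_kv_shared_layers)

print(len(buggy), len(safe))  # 0 6
```

Computing the end index explicitly avoids the special case, since `len(types) - 0` keeps the whole list.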
Serving it
recipe/server.py is a self-contained OpenAI-compatible /v1/chat/completions
shim wrapping DeepSpec's Gemma4DSparkEvaluator, with env knobs for both recipes
(DSPARK_COMPILE=1 → fast path; DSPARK_DRAFT_CTX_WINDOW / DSPARK_PREFILL_CHUNK → long context).
It supports real token-by-token streaming: a stream_callback added to DSpark's generate loop
pushes accepted tokens as speculation commits them (see the patches). It also supports thinking: gemma4 reasons
in a <|channel>thought … <channel|> channel, which the shim exposes as OpenAI-style reasoning_content,
streamed separately from the answer content (toggle with DSPARK_THINKING=0). The setup works through a
LiteLLM gateway into Open WebUI.
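On the client side, a streamed response with separate reasoning and answer text can be reassembled roughly as follows. This is a sketch under assumptions: it assumes each chunk follows the usual OpenAI streaming shape (`choices[0].delta`) with `reasoning_content` carried in the delta, which the text above implies but does not spell out. The helper `split_stream` and the sample chunks are hypothetical.

```python
def split_stream(chunks):
    """Accumulate reasoning and answer text from already-parsed stream chunks."""
    reasoning, answer = [], []
    for chunk in chunks:
        for choice in chunk.get("choices", []):
            delta = choice.get("delta", {})
            # Assumed field layout: thinking text arrives as reasoning_content.
            if delta.get("reasoning_content"):
                reasoning.append(delta["reasoning_content"])
            if delta.get("content"):
                answer.append(delta["content"])
    return "".join(reasoning), "".join(answer)


# Hand-written sample chunks, not captured server output.
sample = [
    {"choices": [{"delta": {"reasoning_content": "Paris is "}}]},
    {"choices": [{"delta": {"reasoning_content": "the capital."}}]},
    {"choices": [{"delta": {"content": "Paris"}}]},
    {"choices": [{"delta": {"content": "."}}]},
]

thought, reply = split_stream(sample)
print(thought)  # Paris is the capital.
print(reply)    # Paris.
```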
Vision — gemma-4-12B is a VLM, and speculation carries images. Send OpenAI multimodal content
(image_url) and the shim runs it through the processor, then passes pixel_values (+ mm_token_type_ids,
image_position_ids) into the DSpark prefill via a prefill_mm param threaded through
generate_decoding_sample. The target embeds the image, the KV cache carries it, and the draft
speculates over image-aware hidden states — vision on the same speculative loop as text (accept-len
~3.2, ~32 tok/s), and it reasons about the image when thinking is on.
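A request body with OpenAI multimodal content can be built like this. The message layout is the standard OpenAI `image_url` content format; the helper name `build_vision_request`, the placeholder image URL, and the model string passed to the shim are assumptions for illustration.

```python
import json


def build_vision_request(model, question, image_url, stream=False):
    """Build a chat-completions body with one image and one text part."""
    return {
        "model": model,
        "stream": stream,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text", "text": question},
                ],
            }
        ],
    }


payload = build_vision_request(
    model="skibare87/gemma-4-12B-it-FP8-DSpark",  # assumed model name for the shim
    question="What is in this image?",
    image_url="https://example.com/cat.jpg",      # placeholder URL
)
body = json.dumps(payload)  # POST this to the shim's /v1/chat/completions
print(body[:60])
```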
Gotchas
- Use the -it (instruct) variant. The base pattern-completes and never stops; -it ships its chat template and eos_token_id: [1, 106, 50]. gemma-4 uses a <|channel|>/<|think|> (harmony-style) format, not <start_of_turn>.
- Validated stack: torch 2.11 + torchao 0.17 + transformers 5.x, CUDA 12.8, RTX 5090 (sm_120), WSL2.
- attn_implementation="sdpa" (not flash: head_dim 512).
Attribution & license
- Base model: google/gemma-4-12B-it — © Google, distributed under the Gemma Terms of Use. This is a derivative (FP8 quantization); the Gemma Terms and use restrictions apply. "Gemma" is a trademark of Google.
- Draft head + speculative-decoding method: DeepSeek DSpark (deepseek-ai/dspark_gemma4_12b_block7, DeepSpec).
- Quantization: torchao. Serving: transformers.
Recipe assembled while getting DSpark + gemma-4-12B running at long context on a single 5090; shared so
others don't have to rediscover the Blackwell fp8 kernel choice, the mega-cache persistence, or the
sliding-cache/num_kv_shared_layers interactions.