How to use Thump604/DeepSeek-V4-Flash-MLX-Q2-mixed-gs128-affine with MLX:

```python
# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm
# if on a CUDA device, also pip install mlx[cuda]

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("Thump604/DeepSeek-V4-Flash-MLX-Q2-mixed-gs128-affine")

prompt = "Once upon a time in"
text = generate(model, tokenizer, prompt=prompt, verbose=True)
```
How to use Thump604/DeepSeek-V4-Flash-MLX-Q2-mixed-gs128-affine with MLX LM:

Generate or start a chat session

```bash
# Install MLX LM
uv tool install mlx-lm

# Generate some text
mlx_lm.generate --model "Thump604/DeepSeek-V4-Flash-MLX-Q2-mixed-gs128-affine" --prompt "Once upon a time"
```

DeepSeek V4 Flash MLX Q2 Mixed
This is an MLX conversion of deepseek-ai/DeepSeek-V4-Flash.
Source
- Base model: `deepseek-ai/DeepSeek-V4-Flash`
- Source revision: `6e763230a9d263eca2023f1d4a5ce1bfe126cf48`
- Architecture: `DeepseekV4ForCausalLM`
- Model type: `deepseek_v4`
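
To start from the same inputs as this conversion, the base model can be pinned to the source revision listed above. The following is a minimal sketch, assuming the `huggingface_hub` package is installed. The helper names `download_kwargs` and `fetch_source` and the optional `local_dir` argument are illustrative and were not part of the original conversion workflow.

```python
# Values copied from the Source list above.
BASE_MODEL = "deepseek-ai/DeepSeek-V4-Flash"
SOURCE_REVISION = "6e763230a9d263eca2023f1d4a5ce1bfe126cf48"


def download_kwargs(local_dir=None):
    """Build keyword arguments for huggingface_hub.snapshot_download."""
    kwargs = {"repo_id": BASE_MODEL, "revision": SOURCE_REVISION}
    if local_dir is not None:
        # Hypothetical target directory chosen by the caller.
        kwargs["local_dir"] = local_dir
    return kwargs


def fetch_source(local_dir=None):
    """Download the pinned snapshot (needs network access and ample disk space)."""
    # Imported lazily so the constants above can be reused without the package.
    from huggingface_hub import snapshot_download

    return snapshot_download(**download_kwargs(local_dir))


if __name__ == "__main__":
    # Only prints the arguments; call fetch_source() to actually download.
    print(download_kwargs())
```

Pinning the full commit hash rather than a branch name keeps the download reproducible even if the upstream repository is updated later.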
Conversion Recipe
- Tooling branch: `Thump604/mlx-lm`, branch `deepseek-v4-support-fixes`
- Minimum tooling commit for generation: `9c990f4`
- Output path during conversion: `/Volumes/Lexar/mlx_models/DeepSeek-V4-Flash-MLX-Q2-mixed-gs128-affine`
- Quantization recipe: `mixed_2_6`
- Quantization mode: `affine`
- Group size: `128`
- Effective bits per weight reported by MLX: `2.992`
- Shards: `23`
- Indexed MLX tensor size: `106,355,393,628` bytes
The mixed recipe uses 2-bit affine quantization for lower-risk routed expert paths and 6-bit affine quantization for sensitive paths including embeddings, LM head, attention projections, compressed-attention/indexer components, shared experts, and selected down projections.
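
The reported figures can be cross-checked with some rough arithmetic. The sketch below assumes that affine quantization stores one 16-bit scale and one 16-bit bias per group of 128 weights, and that tensors left unquantized (for example norms) are negligible. Both are assumptions made for illustration, so the results are estimates and not measurements of this artifact.

```python
# Values copied from the Conversion Recipe list.
GROUP_SIZE = 128
EFFECTIVE_BPW = 2.992
INDEXED_BYTES = 106_355_393_628
SHARDS = 23

# Assumption: each group carries a 16-bit scale and a 16-bit bias.
OVERHEAD_BITS_PER_GROUP = 32


def stored_bits_per_weight(bits, group_size=GROUP_SIZE):
    """Payload bits plus the per-group scale/bias overhead, per weight."""
    return bits + OVERHEAD_BITS_PER_GROUP / group_size


def high_bit_fraction(effective, low=2, high=6):
    """Fraction of weights at `high` bits implied by a two-level mix."""
    lo = stored_bits_per_weight(low)
    hi = stored_bits_per_weight(high)
    return (effective - lo) / (hi - lo)


low_bpw = stored_bits_per_weight(2)   # 2.25
high_bpw = stored_bits_per_weight(6)  # 6.25
frac6 = high_bit_fraction(EFFECTIVE_BPW)

# Approximate weight count implied by the indexed size and effective bits.
approx_weights = INDEXED_BYTES * 8 / EFFECTIVE_BPW
avg_shard_gb = INDEXED_BYTES / SHARDS / 1e9

print(f"2-bit path: {low_bpw} bits/weight, 6-bit path: {high_bpw} bits/weight")
print(f"implied 6-bit share of weights: {frac6:.1%}")
print(f"implied weight count: {approx_weights:.3e}")
print(f"average shard size: {avg_shard_gb:.2f} GB")
```

Under these assumptions, a bit under one fifth of the weights would sit on the 6-bit paths, with the rest on the 2-bit routed expert paths.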
Validation
- Conversion completed successfully.
- Lazy MLX load completed successfully on a 128GB Mac Studio.
- Raw prompt generation smoke test completed successfully with `--max-tokens 2 --max-kv-size 1024`.
- Observed smoke test numbers: `54.59s` real time, `74.5GB` max RSS, `106.94GB` peak footprint, zero swaps.
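
A comparable smoke run could be scripted as follows. This is a sketch and not the exact command used for the numbers above. The `--max-tokens` and `--max-kv-size` values come from the list, while the prompt text is a placeholder. Wrapping the call in macOS `/usr/bin/time -l`, which reports real time, maximum resident set size, peak memory footprint, and swaps, is an assumption about how such figures can be collected.

```python
import shlex
import subprocess

MODEL = "Thump604/DeepSeek-V4-Flash-MLX-Q2-mixed-gs128-affine"


def smoke_command(prompt="Once upon a time", max_tokens=2, max_kv_size=1024):
    """Return the argv list for a short generation run with resource accounting."""
    generate = [
        "mlx_lm.generate",
        "--model", MODEL,
        "--prompt", prompt,  # placeholder prompt, not from the original run
        "--max-tokens", str(max_tokens),
        "--max-kv-size", str(max_kv_size),
    ]
    # Assumption: macOS /usr/bin/time -l is used to collect memory and swap stats.
    return ["/usr/bin/time", "-l", *generate]


def run_smoke():
    """Execute the smoke run; raises CalledProcessError on a non-zero exit."""
    cmd = smoke_command()
    print("running:", shlex.join(cmd))
    return subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Print the command only; call run_smoke() on a machine with enough memory.
    print(shlex.join(smoke_command()))
```

Keeping the token budget at two tokens makes the run a load-and-decode check only; it says nothing about output quality.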
This artifact is a low-bit local fallback. It is not quality-qualified for production writing or coding lanes. Treat quality, long-context behavior, and sparse compressed-attention parity as open until evaluated with a real task suite.
Notes
DeepSeek V4 support in MLX is still under active development. This artifact was produced with local DeepSeek V4 support fixes, including FP4/FP8 checkpoint handling, F8_E8M0 scale metadata reinterpretation as raw uint8 exponent bytes before sanitizer decode, attention sink dtype handling, and quantized grouped output projection support.
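
To illustrate the scale handling mentioned above: in the E8M0 format defined by the OCP Microscaling specification, each byte is an unsigned exponent with a bias of 127 and no sign or mantissa bits, and `0xFF` is reserved for NaN. The sketch below decodes raw bytes under that convention. It is not the code from the tooling branch, the function name `decode_e8m0` is illustrative, and whether the checkpoint follows this convention exactly is an assumption.

```python
import math

E8M0_BIAS = 127   # exponent bias in the OCP Microscaling E8M0 format
E8M0_NAN = 0xFF   # reserved NaN encoding


def decode_e8m0(raw: bytes) -> list[float]:
    """Decode raw E8M0 bytes into power-of-two scale factors."""
    scales = []
    for b in raw:  # iterating over bytes yields ints in 0..255, i.e. uint8 values
        if b == E8M0_NAN:
            scales.append(math.nan)
        else:
            # The scale is 2 ** (b - bias); ldexp avoids rounding error.
            scales.append(math.ldexp(1.0, b - E8M0_BIAS))
    return scales


print(decode_e8m0(bytes([126, 127, 128, 130])))  # [0.5, 1.0, 2.0, 8.0]
```

Treating the stored values as plain `uint8` before decoding matters because a loader that interprets the bytes as some other 8-bit float type would produce different scales.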