DeepSeek V4 Flash MLX Q2 Mixed

This is an MLX conversion of deepseek-ai/DeepSeek-V4-Flash.

Source

  • Base model: deepseek-ai/DeepSeek-V4-Flash
  • Source revision: 6e763230a9d263eca2023f1d4a5ce1bfe126cf48
  • Architecture: DeepseekV4ForCausalLM
  • Model type: deepseek_v4

Conversion Recipe

  • Tooling: Thump604/mlx-lm, branch deepseek-v4-support-fixes
  • Minimum tooling commit for generation: 9c990f4
  • Output path during conversion: /Volumes/Lexar/mlx_models/DeepSeek-V4-Flash-MLX-Q2-mixed-gs128-affine
  • Quantization recipe: mixed_2_6
  • Quantization mode: affine
  • Group size: 128
  • Effective bits per weight reported by MLX: 2.992
  • Shards: 23
  • Indexed MLX tensor size: 106,355,393,628 bytes
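
As a rough illustration, the recipe above might map onto a Python call like the sketch below. It assumes the fork keeps upstream mlx-lm's `convert` function with its `hf_path`, `mlx_path`, `revision`, `quantize`, `q_group_size`, and `quant_predicate` keyword arguments, and that affine is the default quantization mode; none of this has been checked against the deepseek-v4-support-fixes branch. The helper names `convert_kwargs` and `run_conversion` are hypothetical.

```python
# Settings copied from the Source and Conversion Recipe lists.
SOURCE_REPO = "deepseek-ai/DeepSeek-V4-Flash"
SOURCE_REVISION = "6e763230a9d263eca2023f1d4a5ce1bfe126cf48"


def convert_kwargs(mlx_path: str) -> dict:
    """Build keyword arguments for mlx_lm.convert (assumed upstream signature)."""
    return {
        "hf_path": SOURCE_REPO,
        "mlx_path": mlx_path,
        "revision": SOURCE_REVISION,
        "quantize": True,
        "q_group_size": 128,
        # Assumption: the named mixed recipe is accepted as a string predicate.
        "quant_predicate": "mixed_2_6",
    }


def run_conversion(mlx_path: str) -> None:
    """Start the conversion; needs MLX, mlx-lm, and room for the checkpoint."""
    # Imported lazily so the settings can be inspected without MLX installed.
    from mlx_lm import convert

    convert(**convert_kwargs(mlx_path))
```

Calling `run_conversion` with an output directory of your choice would start the actual conversion; `convert_kwargs` alone only assembles the settings, which makes it easy to compare them against the list above.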

The mixed recipe applies 2-bit affine quantization to the lower-risk routed expert paths and 6-bit affine quantization to sensitive paths, including the embeddings, LM head, attention projections, compressed-attention/indexer components, shared experts, and selected down projections.
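
The reported figures can be cross-checked with some back-of-the-envelope arithmetic. The sketch below is an estimate, not a measurement. It assumes that every weight is quantized at either 2 or 6 bits, that each group of 128 weights carries one 16-bit scale and one 16-bit bias, and that the reported effective bits per weight is total tensor bits divided by parameter count. Unquantized tensors are ignored, so the results are approximate.

```python
# Figures reported in the Conversion Recipe section.
TENSOR_BYTES = 106_355_393_628
EFFECTIVE_BPW = 2.992
SHARDS = 23
GROUP_SIZE = 128

# Assumption: one 16-bit scale plus one 16-bit bias per quantization group.
OVERHEAD_BPW = 2 * 16 / GROUP_SIZE  # extra bits per weight


def implied_param_count() -> float:
    """Parameter count implied by total size and effective bits per weight."""
    return TENSOR_BYTES * 8 / EFFECTIVE_BPW


def low_bit_fraction(low_bits: int = 2, high_bits: int = 6) -> float:
    """Share of weights at the low bit width needed to hit EFFECTIVE_BPW."""
    low = low_bits + OVERHEAD_BPW
    high = high_bits + OVERHEAD_BPW
    return (high - EFFECTIVE_BPW) / (high - low)


def average_shard_bytes() -> float:
    """Mean size of one shard, assuming the tensors are spread evenly."""
    return TENSOR_BYTES / SHARDS


print(f"implied parameters: {implied_param_count() / 1e9:.1f}B")
print(f"estimated 2-bit share: {low_bit_fraction():.1%}")
print(f"average shard size: {average_shard_bytes() / 1e9:.2f} GB")
```

Under these assumptions the implied parameter count comes out near 284B and roughly four fifths of the weights sit at 2 bits, which fits the description of routed experts dominating the parameter budget.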

Validation

  • Conversion completed successfully.
  • Lazy MLX load completed successfully on a 128 GB Mac Studio.
  • A raw-prompt generation smoke test completed successfully with --max-tokens 2 --max-kv-size 1024.
  • Observed smoke-test numbers: 54.59 s real time, 74.5 GB max RSS, 106.94 GB peak footprint, zero swaps.
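
A comparable smoke test can be sketched through the mlx-lm Python API. This is a minimal sketch, not the exact command used for the run above. It assumes the fork exposes upstream's `load` (with its `lazy` argument) and `generate`, and that `generate` forwards `max_tokens` and `max_kv_size` to the generation loop. The `run_smoke` helper and the default prompt are hypothetical.

```python
# Generation limits taken from the smoke test described above.
SMOKE_SETTINGS = {"max_tokens": 2, "max_kv_size": 1024}


def run_smoke(model_path: str, prompt: str = "Hello") -> str:
    """Load the model lazily and generate a couple of tokens from a raw prompt."""
    # Imported lazily so this file can be read or tested without MLX installed.
    from mlx_lm import generate, load

    # lazy=True defers materializing the weights until they are first used.
    model, tokenizer = load(model_path, lazy=True)

    # The prompt is passed as a plain string, so no chat template is applied.
    return generate(model, tokenizer, prompt=prompt, **SMOKE_SETTINGS)
```

Two tokens are enough to show that the weights load and a forward pass runs, but they say nothing about output quality.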

This artifact is a low-bit local fallback. It has not been quality-qualified for production writing or coding lanes. Treat quality, long-context behavior, and sparse compressed-attention parity as open questions until they are evaluated with a real task suite.

Notes

DeepSeek V4 support in MLX is still under active development. This artifact was produced with local DeepSeek V4 support fixes, including FP4/FP8 checkpoint handling, F8_E8M0 scale metadata reinterpretation as raw uint8 exponent bytes before sanitizer decode, attention sink dtype handling, and quantized grouped output projection support.
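
To illustrate the scale handling mentioned above: an E8M0 value is a bare 8-bit exponent with no sign or mantissa, so a raw byte e decodes to 2^(e - 127), with 0xFF reserved for NaN. The sketch below decodes such bytes in plain Python. It follows the E8M0 definition from the OCP Microscaling formats and is not the fork's sanitizer code; `decode_e8m0` and `decode_scales` are hypothetical names.

```python
import math

E8M0_BIAS = 127   # exponent bias of the E8M0 format
E8M0_NAN = 0xFF   # the all-ones byte encodes NaN


def decode_e8m0(exponent_byte: int) -> float:
    """Decode one raw E8M0 byte into its power-of-two scale."""
    if not 0 <= exponent_byte <= 0xFF:
        raise ValueError("E8M0 values must fit in one unsigned byte")
    if exponent_byte == E8M0_NAN:
        return math.nan
    # ldexp(1.0, k) computes 1.0 * 2**k exactly.
    return math.ldexp(1.0, exponent_byte - E8M0_BIAS)


def decode_scales(raw: bytes) -> list[float]:
    """Decode a buffer of raw uint8 exponent bytes into float scales."""
    return [decode_e8m0(b) for b in raw]
```

Reading the stored bytes as uint8 before decoding matters because interpreting the same bits as any other 8-bit float type would produce different scale values.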
