---
tags:
- gguf
- llama.cpp
- text-generation
- moe
- quantized
- bailing
license: apache-2.0
language:
- en
- zh
pipeline_tag: text-generation
base_model: inclusionAI/Ling-2.6-flash
base_model_relation: quantized
---

# Ling-2.6-flash GGUF

Quantized GGUF of [inclusionAI/Ling-2.6-flash](https://huggingface.co/inclusionAI/Ling-2.6-flash), a 104B-parameter MoE model (7.4B active) with a hybrid MLA/GLA architecture.

## Files

| File | Size | Format |
|------|------|--------|
| `Ling-2.6-flash-IQ4_NL-bailing_hybrid-20260505-LJ.gguf` | ~57 GB | IQ4_NL |

## Running in llama.cpp

**This model requires a custom llama.cpp branch with Bailing Hybrid architecture support:**

*https://github.com/ljubomirj/llama.cpp/tree/LJ-Ling-2.6-flash-r2*

MTP works (llama-server accepts `--spec-type mtp`), but at the moment it actually slows down decoding, so the speed tests below were run *without* MTP. I don't know why MTP does not help. Possible reasons: the MTP implementation is poor or buggy; Ling-2.6 has only 1 extra head, which gives only 1 extra token and may not be enough; or the quantization is detrimental.

### Build

```bash
# Fetch the fork and switch to the branch with Bailing Hybrid support
git clone https://github.com/ljubomirj/llama.cpp.git
cd llama.cpp
git checkout LJ-Ling-2.6-flash-r2

# Configure with Metal enabled and build only the tools used below
mkdir -p build && cd build
cmake .. -DLLAMA_METAL=ON
make -j llama-cli llama-server llama-batched-bench
```

### CLI

```bash
./bin/llama-cli \
  -m Ling-2.6-flash-IQ4_NL-bailing_hybrid-20260505-LJ.gguf \
  -st -p "The capital of France is"
```

```bash
ggml_metal_device_init: tensor API disabled for pre-M5 and pre-A19 devices
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.013 sec
ggml_metal_rsets_init: creating a residency set collection (keep_alive = 180 s)
ggml_metal_device_init: GPU name: MTL0 (Apple M2 Max)
ggml_metal_device_init: GPU family: MTLGPUFamilyApple8 (1008)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal3 (5001)
ggml_metal_device_init: simdgroup reduction = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory = true
ggml_metal_device_init: has bfloat = true
ggml_metal_device_init: has tensor = false
ggml_metal_device_init: use residency sets = true
ggml_metal_device_init: use shared buffers = true
ggml_metal_device_init: recommendedMaxWorkingSetSize = 92274.69 MB

Loading model...

> The capital of France is

The capital of France is Paris.

[ Prompt: 96.1 t/s | Generation: 33.3 t/s ]

Exiting...
common_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
common_memory_breakdown_print: | - MTL0 (Apple M2 Max) | 88000 = 27848 + (59447 = 58324 + 632 + 490) + 704 |
common_memory_breakdown_print: | - Host | 653 = 345 + 0 + 308 |
ggml_metal_free: deallocating
```

### Server

```bash
./bin/llama-server \
  -m Ling-2.6-flash-IQ4_NL-bailing_hybrid-20260505-LJ.gguf \
  -c 4096 -fa on -ngl 99
```

## Performance (MacBook Pro M2 Max, 96 GB)

- Prefill: ~250-440 tok/s (dropping as the prompt grows from 512 to 32768 tokens)
- Generation: ~28-47 tok/s

```bash
./bin/llama-batched-bench \
  -m ~/llama.cpp/models/Ling-2.6-flash-IQ4_NL-bailing_hybrid-20260505-LJ.gguf \
  -npp 512,1024,2048,4096,8192,16384,32768 -ntg 128 -npl 1 -c 36000
```

```bash
main: n_kv_max = 36096, n_batch = 2048, n_ubatch = 512, flash_attn = -1, is_pp_shared = 0, is_tg_separate = 0, n_gpu_layers = -1, n_threads = 8, n_threads_batch = 8

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|   512 |    128 |    1 |    640 |    1.169 |   437.96 |    2.739 |    46.73 |    3.908 |   163.75 |
|  1024 |    128 |    1 |   1152 |    2.855 |   358.72 |    3.534 |    36.22 |    6.389 |   180.32 |
|  2048 |    128 |    1 |   2176 |    6.073 |   337.25 |    3.535 |    36.20 |    9.608 |   226.48 |
|  4096 |    128 |    1 |   4224 |   12.564 |   326.00 |    3.753 |    34.10 |   16.318 |   258.86 |
|  8192 |    128 |    1 |   8320 |   26.474 |   309.43 |    3.938 |    32.50 |   30.412 |   273.57 |
| 16384 |    128 |    1 |  16512 |   57.800 |   283.46 |    4.252 |    30.10 |   62.052 |   266.10 |
| 32768 |    128 |    1 |  32896 |  131.884 |   248.46 |    4.631 |    27.64 |  136.515 |   240.97 |

llama_perf_context_print: load time = 7196.80 ms
llama_perf_context_print: prompt eval time = 239042.77 ms / 65040 tokens ( 3.68 ms per token, 272.09 tokens per second)
llama_perf_context_print: eval time = 26374.75 ms / 896 runs ( 29.44 ms per token, 33.97 tokens per second)
llama_perf_context_print: total time = 272401.59 ms / 65936 tokens
llama_perf_context_print: graphs reused = 889
```

## Implementation Notes

### Reference: `bailing_hybrid.py`

The [`docs/bailing_hybrid.py`](https://github.com/ljubomirj/llama.cpp/blob/LJ-Ling-2.6-flash-r2/docs/bailing_hybrid.py) in the llama.cpp fork is the original MLX model implementation from [mlx-lm PR #1227](https://github.com/ml-explore/mlx-lm/pull/1227). It was the primary reference for porting the Bailing Hybrid architecture to llama.cpp, covering MLA attention, GLA (Gated Linear Attention) with the recurrent state kernel, MoE expert routing, and the MTP speculative decoding head.

### GLA Slope Fix

The upstream model had an [off-by-one bug in the GLA decay slope](https://huggingface.co/inclusionAI/Ling-2.6-flash/commit/7c60792051a885a3f14a75576f01f7f5cb6a08ff): `(self.layer_idx - 1)` was used instead of `self.layer_idx` in the layer-dependent decay scaling. This caused incorrect decay rates for GLA layers, with the most severe effect on layer 0 (which got a negative slope). Our llama.cpp implementation used the correct formula from the start: `layer_factor = 1.0 - il / (n_layer - 1) + 1e-5`.

### MTP (Multi-Token Prediction)

The MTP speculative decoding head works (100% draft acceptance with greedy sampling) but provides no speedup for this model. Ling-2.6 has only 1 MTP head (`nextn_predict_layers=1`), which limits speculative decoding to 1 draft token per trunk verification pass. With the MTP head on CPU, the extra draft overhead exceeds any trunk pass savings. Models with multiple MTP heads would likely benefit more.

## Quantization Method

This GGUF quantization was developed entirely by AI coding agents reading the [`bailing_hybrid.py`](https://github.com/ml-explore/mlx-lm/pull/1227) implementation from [mlx-lm#1227](https://github.com/ml-explore/mlx-lm/pull/1227) and adapting it for llama.cpp compatibility.

Agents / LLMs used to make this run on my M2 Max:

- **Claude / GLM-5.1**
- **OpenCode / Kimi-K2.6**
- **OpenCode / DeepSeek-V4-Pro**

## Credits

- The OG [llama.cpp](https://github.com/ggml-org/llama.cpp) making all this possible!
- Original model [inclusionAI/Ling-2.6-flash](https://huggingface.co/inclusionAI/Ling-2.6-flash)
- The original [`bailing_hybrid.py`](https://github.com/ml-explore/mlx-lm/pull/1227) implementation from [mlx-lm#1227](https://github.com/ml-explore/mlx-lm/pull/1227)
- MLX reference implementation: [mlx-community/Ling-2.6-flash-mlx-4bit-DWQ](https://huggingface.co/mlx-community/Ling-2.6-flash-mlx-4bit-DWQ)
- Custom llama.cpp fork [ljubomirj/llama.cpp @ LJ-Ling-2.6-flash-r2](https://github.com/ljubomirj/llama.cpp/tree/LJ-Ling-2.6-flash-r2)
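
A footnote to the GLA Slope Fix section: the two variants of the layer factor can be compared with a small Python sketch. It evaluates only the expression quoted in that section, `1.0 - il / (n_layer - 1) + 1e-5`, next to the same expression with `il - 1` substituted. The function names and `n_layer = 32` are hypothetical, chosen only for illustration. The per-head base slopes and their sign convention are not reproduced here, so this is not the fork's actual code.

```python
def layer_factor(il: int, n_layer: int) -> float:
    """Layer-dependent decay scaling, as quoted in the GLA Slope Fix section."""
    return 1.0 - il / (n_layer - 1) + 1e-5


def layer_factor_off_by_one(il: int, n_layer: int) -> float:
    """Same expression with (il - 1) in place of il, mirroring the upstream bug."""
    return 1.0 - (il - 1) / (n_layer - 1) + 1e-5


if __name__ == "__main__":
    n_layer = 32  # hypothetical layer count, NOT taken from the model config
    for il in (0, 1, n_layer - 1):
        good = layer_factor(il, n_layer)
        bad = layer_factor_off_by_one(il, n_layer)
        print(f"layer {il:2d}: correct={good:.5f}  off_by_one={bad:.5f}")
```

In this sketch the off-by-one variant evaluates each layer with the previous layer's factor, so every factor is shifted up by `1 / (n_layer - 1)`.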