---
tags:
- gguf
- llama.cpp
- text-generation
- moe
- quantized
- bailing
license: apache-2.0
language:
- en
- zh
pipeline_tag: text-generation
base_model: inclusionAI/Ling-2.6-flash
base_model_relation: quantized
---

# Ling-2.6-flash GGUF

Quantized GGUF of [inclusionAI/Ling-2.6-flash](https://huggingface.co/inclusionAI/Ling-2.6-flash) — a 104B parameter MoE model (7.4B active) with hybrid MLA/GLA architecture.

## Files

| File | Size | Format |
|------|------|--------|
| `Ling-2.6-flash-IQ4_NL-bailing_hybrid-20260505-LJ.gguf` | ~57 GB | IQ4_NL |

## Running in llama.cpp

**This model requires a custom llama.cpp branch with Bailing Hybrid architecture support:**

*https://github.com/ljubomirj/llama.cpp/tree/LJ-Ling-2.6-flash-r2*

While the mtp works (llama-server accepts '--spec-type mtp') atm it actually slows down the decode. So the speed test below are *without* mtp in. IDK why mtp does not help. (can think o reasonsf: the mtp implementation is poor or buggy, Ling-2.6 has only 1 extra head, giving only 1 extra token - does not suffice, or maybe the quantisation is detremental)

### Build

```bash
git clone https://github.com/ljubomirj/llama.cpp.git
cd llama.cpp
git checkout LJ-Ling-2.6-flash-r2
mkdir -p build && cd build
cmake .. -DLLAMA_METAL=ON
make -j llama-cli llama-server llama-batched-bench
```

### CLI

```bash
./bin/llama-cli \
  -m Ling-2.6-flash-IQ4_NL-bailing_hybrid-20260505-LJ.gguf \
  -st -p "The capital of France is"
```

```bash
ggml_metal_device_init: tensor API disabled for pre-M5 and pre-A19 devices
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.013 sec
ggml_metal_rsets_init: creating a residency set collection (keep_alive = 180 s)
ggml_metal_device_init: GPU name:   MTL0 (Apple M2 Max)
ggml_metal_device_init: GPU family: MTLGPUFamilyApple8  (1008)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: has tensor            = false
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 92274.69 MB

Loading model...

> The capital of France is

The capital of France is Paris.

[ Prompt: 96.1 t/s | Generation: 33.3 t/s ]

Exiting...
common_memory_breakdown_print: | memory breakdown [MiB]  | total    free     self   model   context   compute    unaccounted |
common_memory_breakdown_print: |   - MTL0 (Apple M2 Max) | 88000 = 27848 + (59447 = 58324 +     632 +     490) +         704 |
common_memory_breakdown_print: |   - Host                |                    653 =   345 +       0 +     308                |
ggml_metal_free: deallocating
```

### Server

```bash
./bin/llama-server \
  -m Ling-2.6-flash-IQ4_NL-bailing_hybrid-20260505-LJ.gguf \
  -ctx 4096 -fa -ngl 99
```

## Performance (MacBook Pro M2 Max, 96 GB)

- Prefill: ~250-400 tok/s
- Generation: ~30-45 tok/s

```bash
./bin/llama-batched-bench -m ~/llama.cpp/models/Ling-2.6-flash-IQ4_NL-bailing_hybrid-20260505-LJ.gguf -npp 512,1024,2048,4096,8192,16384,32768 -ntg 128 -npl 1 -c 36000
```

```bash
main: n_kv_max = 36096, n_batch = 2048, n_ubatch = 512, flash_attn = -1, is_pp_shared = 0, is_tg_separate = 0, n_gpu_layers = -1, n_threads = 8, n_threads_batch = 8

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|   512 |    128 |    1 |    640 |    1.169 |   437.96 |    2.739 |    46.73 |    3.908 |   163.75 |
|  1024 |    128 |    1 |   1152 |    2.855 |   358.72 |    3.534 |    36.22 |    6.389 |   180.32 |
|  2048 |    128 |    1 |   2176 |    6.073 |   337.25 |    3.535 |    36.20 |    9.608 |   226.48 |
|  4096 |    128 |    1 |   4224 |   12.564 |   326.00 |    3.753 |    34.10 |   16.318 |   258.86 |
|  8192 |    128 |    1 |   8320 |   26.474 |   309.43 |    3.938 |    32.50 |   30.412 |   273.57 |
| 16384 |    128 |    1 |  16512 |   57.800 |   283.46 |    4.252 |    30.10 |   62.052 |   266.10 |
| 32768 |    128 |    1 |  32896 |  131.884 |   248.46 |    4.631 |    27.64 |  136.515 |   240.97 |

llama_perf_context_print:        load time =    7196.80 ms
llama_perf_context_print: prompt eval time =  239042.77 ms / 65040 tokens (    3.68 ms per token,   272.09 tokens per second)
llama_perf_context_print:        eval time =   26374.75 ms /   896 runs   (   29.44 ms per token,    33.97 tokens per second)
llama_perf_context_print:       total time =  272401.59 ms / 65936 tokens
llama_perf_context_print:    graphs reused =        889
```

## Implementation Notes

### Reference: `bailing_hybrid.py`

The [`docs/bailing_hybrid.py`](https://github.com/ljubomirj/llama.cpp/blob/LJ-Ling-2.6-flash-r2/docs/bailing_hybrid.py) in the llama.cpp fork is the original MLX model implementation from [mlx-lm PR #1227](https://github.com/ml-explore/mlx-lm/pull/1227). It was the primary reference for porting the Bailing Hybrid architecture to llama.cpp — covering MLA attention, GLA (Gated Linear Attention) with the recurrent state kernel, MoE expert routing, and the MTP speculative decoding head.

### GLA Slope Fix

The upstream model had an [off-by-one bug in the GLA decay slope](https://huggingface.co/inclusionAI/Ling-2.6-flash/commit/7c60792051a885a3f14a75576f01f7f5cb6a08ff): `(self.layer_idx - 1)` was used instead of `self.layer_idx` in the layer-dependent decay scaling. This caused incorrect decay rates for GLA layers, with the most severe effect on layer 0 (which got a negative slope). Our llama.cpp implementation used the correct formula from the start: `layer_factor = 1.0 - il / (n_layer - 1) + 1e-5`.

### MTP (Multi-Token Prediction)

The MTP speculative decoding head works (100% draft acceptance with greedy sampling) but provides no speedup for this model. Ling-2.6 has only 1 MTP head (`nextn_predict_layers=1`), limiting speculative decoding to 1 draft per trunk verification pass. With the MTP head on CPU, the extra draft overhead exceeds any trunk pass savings. Models with multiple MTP heads (e.g. DeepSeek-V3 with 3 heads) would benefit more.

## Quantization Method

This GGUF quantization was developed entirely by AI coding agents reading the [`bailing_hybrid.py`](https://github.com/ml-explore/mlx-lm/pull/1227) implementation from [mlx-lm#1227](https://github.com/ml-explore/mlx-lm/pull/1227) and adapting it for llama.cpp compatibility.

Agents / LLMs used to make this run on my M2 Max:
- **Claude / GLM-5.1**
- **OpenCode / Kimi-K2.6**
- **OpenCode / DeepSeek-V4-Pro**

## Credits

- The OG [llama.cpp](https://github.com/ggml-org/llama.cpp) making all this possible!
- Original model [inclusionAI/Ling-2.6-flash](https://huggingface.co/inclusionAI/Ling-2.6-flash)
- The original [`bailing_hybrid.py`](https://github.com/ml-explore/mlx-lm/pull/1227) implementation from [mlx-lm#1227](https://github.com/ml-explore/mlx-lm/pull/1227)
- MLX reference implementation: [mlx-community/Ling-2.6-flash-mlx-4bit-DWQ](https://huggingface.co/mlx-community/Ling-2.6-flash-mlx-4bit-DWQ)
- Custom llama.cpp fork [ljubomirj/llama.cpp @ LJ-Ling-2.6-flash-r2](https://github.com/ljubomirj/llama.cpp/tree/LJ-Ling-2.6-flash-r2)