Instructions for using ljupco/Ling-2.6-flash-GGUF with libraries and local apps.

Libraries

How to use ljupco/Ling-2.6-flash-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="ljupco/Ling-2.6-flash-GGUF",
	filename="Ling-2.6-flash-IQ4_NL-bailing_hybrid-20260505-LJ.gguf",
)

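response = llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)
# Print only the assistant's reply text from the OpenAI-style response dict.
print(response["choices"][0]["message"]["content"])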

Local Apps

llama.cpp

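How to use ljupco/Ling-2.6-flash-GGUF with llama.cpp (per the model card below, this model requires a custom llama.cpp branch with Bailing Hybrid support, so the stock builds in these snippets may not load it):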

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf ljupco/Ling-2.6-flash-GGUF:IQ4_NL
# Run inference directly in the terminal:
llama-cli -hf ljupco/Ling-2.6-flash-GGUF:IQ4_NL

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf ljupco/Ling-2.6-flash-GGUF:IQ4_NL
# Run inference directly in the terminal:
llama-cli -hf ljupco/Ling-2.6-flash-GGUF:IQ4_NL

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf ljupco/Ling-2.6-flash-GGUF:IQ4_NL
# Run inference directly in the terminal:
./llama-cli -hf ljupco/Ling-2.6-flash-GGUF:IQ4_NL

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf ljupco/Ling-2.6-flash-GGUF:IQ4_NL
# Run inference directly in the terminal:
./build/bin/llama-cli -hf ljupco/Ling-2.6-flash-GGUF:IQ4_NL

Use Docker

docker model run hf.co/ljupco/Ling-2.6-flash-GGUF:IQ4_NL

vLLM

How to use ljupco/Ling-2.6-flash-GGUF with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "ljupco/Ling-2.6-flash-GGUF"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ljupco/Ling-2.6-flash-GGUF",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Ollama
How to use ljupco/Ling-2.6-flash-GGUF with Ollama:
```
ollama run hf.co/ljupco/Ling-2.6-flash-GGUF:IQ4_NL
```

Unsloth Studio

How to use ljupco/Ling-2.6-flash-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for ljupco/Ling-2.6-flash-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for ljupco/Ling-2.6-flash-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for ljupco/Ling-2.6-flash-GGUF to start chatting

Pi

How to use ljupco/Ling-2.6-flash-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf ljupco/Ling-2.6-flash-GGUF:IQ4_NL

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "ljupco/Ling-2.6-flash-GGUF:IQ4_NL"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use ljupco/Ling-2.6-flash-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf ljupco/Ling-2.6-flash-GGUF:IQ4_NL

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default ljupco/Ling-2.6-flash-GGUF:IQ4_NL

Run Hermes

hermes

Docker Model Runner
How to use ljupco/Ling-2.6-flash-GGUF with Docker Model Runner:
```
docker model run hf.co/ljupco/Ling-2.6-flash-GGUF:IQ4_NL
```

Lemonade

How to use ljupco/Ling-2.6-flash-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull ljupco/Ling-2.6-flash-GGUF:IQ4_NL

Run and chat with the model

lemonade run user.Ling-2.6-flash-GGUF-IQ4_NL

List all available models

lemonade list

Ling-2.6-flash GGUF

Quantized GGUF of inclusionAI/Ling-2.6-flash, a 104B-parameter MoE model (7.4B active) with a hybrid MLA/GLA architecture.

Files

| File | Size | Format |
|------|------|--------|
| `Ling-2.6-flash-IQ4_NL-bailing_hybrid-20260505-LJ.gguf` | ~57 GB | IQ4_NL |

Running in llama.cpp

This model requires a custom llama.cpp branch with Bailing Hybrid architecture support:

https://github.com/ljubomirj/llama.cpp/tree/LJ-Ling-2.6-flash-r2

MTP works (llama-server accepts '--spec-type mtp'), but at the moment it actually slows down decoding, so the speed tests below were run without MTP. I don't know why MTP does not help. Possible reasons: the MTP implementation is poor or buggy; Ling-2.6 has only 1 extra head, which yields only 1 extra token and may not be enough; or the quantization is detrimental.

Build

git clone https://github.com/ljubomirj/llama.cpp.git
cd llama.cpp
git checkout LJ-Ling-2.6-flash-r2
mkdir -p build && cd build
cmake .. -DLLAMA_METAL=ON
make -j llama-cli llama-server llama-batched-bench

CLI

./bin/llama-cli \
  -m Ling-2.6-flash-IQ4_NL-bailing_hybrid-20260505-LJ.gguf \
  -st -p "The capital of France is"

ggml_metal_device_init: tensor API disabled for pre-M5 and pre-A19 devices
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.013 sec
ggml_metal_rsets_init: creating a residency set collection (keep_alive = 180 s)
ggml_metal_device_init: GPU name:   MTL0 (Apple M2 Max)
ggml_metal_device_init: GPU family: MTLGPUFamilyApple8  (1008)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: has tensor            = false
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 92274.69 MB

Loading model...

> The capital of France is

The capital of France is Paris.

[ Prompt: 96.1 t/s | Generation: 33.3 t/s ]

Exiting...
common_memory_breakdown_print: | memory breakdown [MiB]  | total    free     self   model   context   compute    unaccounted |
common_memory_breakdown_print: |   - MTL0 (Apple M2 Max) | 88000 = 27848 + (59447 = 58324 +     632 +     490) +         704 |
common_memory_breakdown_print: |   - Host                |                    653 =   345 +       0 +     308                |
ggml_metal_free: deallocating
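
As a rough cross-check of the figures above, the sketch below estimates the average bits stored per weight from the "model" column of the memory breakdown (58324 MiB). It assumes that figure covers all model weights, and it tries both parameter counts that appear on this page (104B in the description, 107B in the page metadata). The function name is illustrative only.

```python
# Rough bits-per-weight estimate from numbers quoted on this page.
MIB = 1024 * 1024  # bytes per MiB


def bits_per_weight(model_mib: float, n_params: float) -> float:
    """Average number of stored bits per parameter."""
    return model_mib * MIB * 8 / n_params


model_mib = 58324  # "model" column of the Metal memory breakdown
for label, n_params in (("104B", 104e9), ("107B", 107e9)):
    print(f"{label} params: {bits_per_weight(model_mib, n_params):.2f} bits/weight")
```

Either way the average lands somewhat above 4 bits per weight, which fits a 4-bit quantization type that also stores per-block scales.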

Server

./bin/llama-server \
  -m Ling-2.6-flash-IQ4_NL-bailing_hybrid-20260505-LJ.gguf \
  -c 4096 -fa on -ngl 99
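
Once the server is up, it can be queried over its OpenAI-compatible HTTP API. The following is a minimal client sketch using only the Python standard library. It assumes the server listens on llama-server's default port 8080 (the same base URL used in the Pi and Hermes snippets earlier on this page). The helper names `build_chat_request` and `chat` are illustrative, not part of llama.cpp.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080/v1"  # assumption: default llama-server port


def build_chat_request(prompt: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a POST request for the /chat/completions endpoint."""
    payload = {"messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, base_url: str = BASE_URL, timeout: float = 120.0) -> str:
    """Send the prompt to a running server and return the assistant's reply text."""
    request = build_chat_request(prompt, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("The capital of France is"))` should print the model's reply. Nothing is sent over the network until `chat` is called.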

Performance (MacBook Pro M2 Max, 96 GB)

Prefill: ~250-400 tok/s
Generation: ~30-45 tok/s

./bin/llama-batched-bench -m ~/llama.cpp/models/Ling-2.6-flash-IQ4_NL-bailing_hybrid-20260505-LJ.gguf -npp 512,1024,2048,4096,8192,16384,32768 -ntg 128 -npl 1 -c 36000

main: n_kv_max = 36096, n_batch = 2048, n_ubatch = 512, flash_attn = -1, is_pp_shared = 0, is_tg_separate = 0, n_gpu_layers = -1, n_threads = 8, n_threads_batch = 8

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|   512 |    128 |    1 |    640 |    1.169 |   437.96 |    2.739 |    46.73 |    3.908 |   163.75 |
|  1024 |    128 |    1 |   1152 |    2.855 |   358.72 |    3.534 |    36.22 |    6.389 |   180.32 |
|  2048 |    128 |    1 |   2176 |    6.073 |   337.25 |    3.535 |    36.20 |    9.608 |   226.48 |
|  4096 |    128 |    1 |   4224 |   12.564 |   326.00 |    3.753 |    34.10 |   16.318 |   258.86 |
|  8192 |    128 |    1 |   8320 |   26.474 |   309.43 |    3.938 |    32.50 |   30.412 |   273.57 |
| 16384 |    128 |    1 |  16512 |   57.800 |   283.46 |    4.252 |    30.10 |   62.052 |   266.10 |
| 32768 |    128 |    1 |  32896 |  131.884 |   248.46 |    4.631 |    27.64 |  136.515 |   240.97 |

llama_perf_context_print:        load time =    7196.80 ms
llama_perf_context_print: prompt eval time =  239042.77 ms / 65040 tokens (    3.68 ms per token,   272.09 tokens per second)
llama_perf_context_print:        eval time =   26374.75 ms /   896 runs   (   29.44 ms per token,    33.97 tokens per second)
llama_perf_context_print:       total time =  272401.59 ms / 65936 tokens
llama_perf_context_print:    graphs reused =        889
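
To make the table columns easier to read, here is a small sketch that recomputes the throughput columns from the token counts and timings of three rows. It assumes S_PP = PP / T_PP, S_TG = TG / T_TG and S = (PP + TG) / T, which is how the numbers above appear to be related. The function name is illustrative.

```python
# (PP, TG, T_PP seconds, T_TG seconds) copied from three rows of the benchmark table
rows = [
    (512, 128, 1.169, 2.739),
    (8192, 128, 26.474, 3.938),
    (32768, 128, 131.884, 4.631),
]


def throughput(pp: int, tg: int, t_pp: float, t_tg: float):
    """Return (prefill t/s, generation t/s, overall t/s) for one benchmark row."""
    s_pp = pp / t_pp
    s_tg = tg / t_tg
    s_total = (pp + tg) / (t_pp + t_tg)
    return s_pp, s_tg, s_total


for pp, tg, t_pp, t_tg in rows:
    s_pp, s_tg, s_total = throughput(pp, tg, t_pp, t_tg)
    print(f"PP={pp:>6}  prefill={s_pp:7.2f} t/s  gen={s_tg:5.2f} t/s  total={s_total:7.2f} t/s")
```

The recomputed values agree with the table to within the rounding of the printed timings. They show prefill speed dropping by a bit less than half between 512 and 32768 prompt tokens, while generation speed drops by roughly 40%.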

Implementation Notes

Reference: `bailing_hybrid.py`

The docs/bailing_hybrid.py in the llama.cpp fork is the original MLX model implementation from mlx-lm PR #1227. It was the primary reference for porting the Bailing Hybrid architecture to llama.cpp — covering MLA attention, GLA (Gated Linear Attention) with the recurrent state kernel, MoE expert routing, and the MTP speculative decoding head.

GLA Slope Fix

The upstream model had an off-by-one bug in the GLA decay slope: (self.layer_idx - 1) was used instead of self.layer_idx in the layer-dependent decay scaling. This caused incorrect decay rates for GLA layers, with the most severe effect on layer 0 (which got a negative slope). Our llama.cpp implementation used the correct formula from the start: layer_factor = 1.0 - il / (n_layer - 1) + 1e-5.
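
The effect of the index shift on the layer factor can be seen with a few lines of arithmetic. This sketch evaluates only the layer_factor term quoted above, once with the layer index il and once with il - 1. The layer count of 32 is a hypothetical value chosen for illustration, and how the factor is combined with the per-head base slopes is not shown here because the text does not spell it out.

```python
def layer_factor(il: int, n_layer: int) -> float:
    """Layer-dependent decay scaling as quoted in the text."""
    return 1.0 - il / (n_layer - 1) + 1e-5


def layer_factor_off_by_one(il: int, n_layer: int) -> float:
    """Same formula evaluated with the shifted index (il - 1)."""
    return layer_factor(il - 1, n_layer)


N_LAYER = 32  # hypothetical layer count, for illustration only

for il in (0, 1, N_LAYER - 1):
    print(
        f"layer {il:>2}: correct={layer_factor(il, N_LAYER):.5f}  "
        f"shifted={layer_factor_off_by_one(il, N_LAYER):.5f}"
    )
```

With the correct index the factor runs from just above 1 at layer 0 down to 1e-5 at the last layer. With the shifted index, every layer receives the factor meant for the layer before it, and layer 0 falls outside that range.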

MTP (Multi-Token Prediction)

The MTP speculative decoding head works (100% draft acceptance with greedy sampling) but provides no speedup for this model. Ling-2.6 has only 1 MTP head (nextn_predict_layers=1), limiting speculative decoding to 1 draft per trunk verification pass. With the MTP head on CPU, the extra draft overhead exceeds any trunk pass savings. Models with multiple MTP heads (e.g. DeepSeek-V3 with 3 heads) would benefit more.
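
A simplified cost model shows why a single draft token can fail to pay off even at 100% acceptance. The sketch assumes one draft token per verification pass and measures all costs in units of one ordinary single-token decode pass. The cost values in the example are hypothetical, not measurements from this model.

```python
def expected_speedup(acceptance: float, verify_cost: float, draft_cost: float) -> float:
    """Speedup over plain decoding with one draft token per verification pass.

    acceptance  : probability that the draft token is accepted (0..1)
    verify_cost : cost of the trunk pass that verifies the draft
    draft_cost  : cost of producing the draft with the MTP head
    Costs are relative to one normal single-token decode pass.
    """
    tokens_per_pass = 1.0 + acceptance  # one token from the trunk, plus the draft if accepted
    return tokens_per_pass / (verify_cost + draft_cost)


# Ideal case: verification costs the same as a normal pass and the draft is free.
print(expected_speedup(acceptance=1.0, verify_cost=1.0, draft_cost=0.0))

# Hypothetical costs where drafting and verifying together cost more than two passes.
print(expected_speedup(acceptance=1.0, verify_cost=1.6, draft_cost=0.6))
```

Under this model, one draft token can at best double throughput, and any case where verify_cost + draft_cost exceeds 1 + acceptance is a net slowdown.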

Quantization Method

This GGUF quantization was developed entirely by AI coding agents reading the bailing_hybrid.py implementation from mlx-lm#1227 and adapting it for llama.cpp compatibility.

Agents / LLMs used to make this run on my M2 Max:

Claude / GLM-5.1
OpenCode / Kimi-K2.6
OpenCode / DeepSeek-V4-Pro

Credits

The OG llama.cpp making all this possible!
Original model inclusionAI/Ling-2.6-flash
The original bailing_hybrid.py implementation from mlx-lm#1227
MLX reference implementation: mlx-community/Ling-2.6-flash-mlx-4bit-DWQ
Custom llama.cpp fork ljubomirj/llama.cpp @ LJ-Ling-2.6-flash-r2

Downloads last month: 1,321

GGUF metadata

Model size: 107B params
Architecture: bailing_hybrid
Available quantization: 4-bit
