---
library_name: mlc-llm
base_model: Qwen/Qwen3.5-2B
tags:
- mlc-llm
- qwen3.5
- gated-delta-net
- hybrid-attention
license: mit
---

# Qwen3.5-2B-q4f16_1-MLC

This is the [Qwen3.5-2B](https://huggingface.co/Qwen/Qwen3.5-2B) model in MLC format `q4f16_1`.

Qwen3.5 uses a hybrid architecture: 75% of its layers are GatedDeltaNet recurrent (linear-attention) layers and 25% are standard softmax attention layers with grouped-query attention (GQA). Running it requires the `kHybrid` `KVStateKind` in MLC-LLM, which manages a `PagedKVCache` and an `RNNState` simultaneously.
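
The 75/25 split can be pictured with a small sketch. It is illustrative only and rests on assumptions not stated in this card: that every fourth layer is a full-attention layer, and a placeholder depth of 24 layers. The function name `hybrid_layer_kinds` and its parameters are hypothetical and are not part of the MLC-LLM API.

```python
from collections import Counter


def hybrid_layer_kinds(num_layers: int, full_attention_interval: int = 4) -> list[str]:
    """Return an assumed per-layer kind for a hybrid decoder stack.

    Assumption: every `full_attention_interval`-th layer uses softmax
    attention and all other layers use GatedDeltaNet linear attention.
    """
    return [
        "full_attention" if (i + 1) % full_attention_interval == 0 else "linear_attention"
        for i in range(num_layers)
    ]


# 24 is a placeholder depth chosen for illustration, not the model's real layer count.
kinds = hybrid_layer_kinds(num_layers=24)
counts = Counter(kinds)

print(counts["linear_attention"] / len(kinds))  # 0.75
print(counts["full_attention"] / len(kinds))    # 0.25
```

Under this reading, the linear-attention layers would keep their recurrent state in the `RNNState`, while the softmax attention layers would store keys and values in the `PagedKVCache`, which is why a single state kind has to cover both.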

Compiled with [mlc-llm](https://github.com/mlc-ai/mlc-llm) using the hybrid KVStateKind branch.

## Usage

### Python API

```python
from mlc_llm import MLCEngine

model = "HF://kinjani/Qwen3.5-2B-q4f16_1-MLC"
# "metal" targets Apple GPUs; pick another backend (e.g. "cuda" or "vulkan") on other platforms.
engine = MLCEngine(model, device="metal")

for response in engine.chat.completions.create(
    messages=[{"role": "user", "content": "What is the meaning of life?"}],
    model=model,
    stream=True,
):
    for choice in response.choices:
        print(choice.delta.content, end="", flush=True)
print()

engine.terminate()
```

### Chat CLI

```bash
mlc_llm chat HF://kinjani/Qwen3.5-2B-q4f16_1-MLC
```

## Model Details

| Parameter | Value |
|-----------|-------|
| Base model | [Qwen3.5-2B](https://huggingface.co/Qwen/Qwen3.5-2B) |
| Architecture | Qwen3.5 GatedDeltaNet (hybrid recurrent + attention) |
| Quantization | q4f16_1 |
| KV state kind | hybrid (PagedKVCache + RNNState) |
| Context window | 1024 tokens (compile-time setting) |
| Conversation template | chatml |