---
library_name: mlc-llm
base_model: Qwen/Qwen3.5-2B
tags:
- mlc-llm
- qwen3.5
- gated-delta-net
- hybrid-attention
license: mit
---

# Qwen3.5-2B-q4f16_1-MLC

This is the [Qwen3.5-2B](https://huggingface.co/Qwen/Qwen3.5-2B) model in MLC format `q4f16_1`.

Qwen3.5 is a hybrid architecture: 75% of its layers are GatedDeltaNet recurrent linear attention layers and 25% are standard GQA softmax attention layers. Running it requires the `kHybrid` KVStateKind in MLC-LLM, which manages a PagedKVCache and an RNNState simultaneously.

Compiled with [mlc-llm](https://github.com/mlc-ai/mlc-llm) using the hybrid KVStateKind branch.

## Usage

### Python API

```python
from mlc_llm import MLCEngine

model = "HF://kinjani/Qwen3.5-2B-q4f16_1-MLC"

# "metal" targets Apple GPUs; change it to match your hardware.
engine = MLCEngine(model, device="metal")

# Stream the reply and print each chunk as it arrives.
for response in engine.chat.completions.create(
    messages=[{"role": "user", "content": "What is the meaning of life?"}],
    model=model,
    stream=True,
):
    for choice in response.choices:
        print(choice.delta.content, end="", flush=True)
print()

# Shut down the engine and release its resources.
engine.terminate()
```

### Chat CLI

```bash
mlc_llm chat HF://kinjani/Qwen3.5-2B-q4f16_1-MLC
```

## Model Details

| Parameter | Value |
|-----------|-------|
| Base model | [Qwen3.5-2B](https://huggingface.co/Qwen/Qwen3.5-2B) |
| Architecture | Qwen3.5 GatedDeltaNet (hybrid recurrent + attention) |
| Quantization | q4f16_1 |
| KV state kind | hybrid (PagedKVCache + RNNState) |
| Context window | 1024 tokens (compile-time setting) |
| Conversation template | chatml |
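
To make the 75% / 25% hybrid split described above more concrete, here is a minimal illustrative sketch of how a periodic layer layout maps each layer to the kind of state it needs. The function `layer_state_kinds`, the layer count of 24, and the interval of 4 are assumptions chosen for illustration; they are not read from the model config and are not part of the MLC-LLM API.

```python
from collections import Counter


def layer_state_kinds(num_layers: int, full_attention_interval: int = 4) -> list[str]:
    """Map each layer to the state it would need under a periodic hybrid layout.

    Hypothetical helper: assumes every `full_attention_interval`-th layer is a
    softmax attention layer (needs a KV cache) and all other layers are
    recurrent linear attention layers (need a fixed-size recurrent state).
    """
    kinds = []
    for i in range(num_layers):
        if (i + 1) % full_attention_interval == 0:
            kinds.append("PagedKVCache")  # GQA softmax attention layer
        else:
            kinds.append("RNNState")  # GatedDeltaNet recurrent layer
    return kinds


# Assumed layer count, for illustration only.
kinds = layer_state_kinds(num_layers=24)
counts = Counter(kinds)
print(counts["RNNState"] / len(kinds))      # 0.75
print(counts["PagedKVCache"] / len(kinds))  # 0.25
```

Under this assumed layout, only a quarter of the layers grow a KV cache with sequence length, while the rest keep a constant-size recurrent state, which is why a single state manager has to track both kinds at once.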