🤖 Qwen3.5 4B AWQ (INT4) - Optimized for Tool Calling

Qwen3.5-4B-awq-int4-optimized is a highly specialized, quantized version of the Qwen3.5-4B-Chat model. It has been compressed to 4-bit precision using AWQ (Activation-aware Weight Quantization) to maximize inference throughput and minimize VRAM usage, making it perfect for edge deployments and high-concurrency serving.

🚀 Quickstart with vLLM

This model is fully optimized for high-performance serving with vLLM.

Installation

pip install vllm

Serving the Model

You can immediately deploy this model as an OpenAI-compatible API server:

python -m vllm.entrypoints.openai.api_server \
    --model Faustus-Faber/Qwen3.5-4B-awq-int4-optimized \
    --quantization awq \
    --dtype auto \
    --max-model-len 4096

🧠 Model Details

Base Model: Qwen3.5-4B-Chat
Quantization Method: AWQ (Activation-aware Weight Quantization)
Precision: INT4 (Group Size: 128)
Primary Use Case: Agentic workflows, tool-calling, and highly structured JSON generation.

🛠️ Intended Use

This model is designed to act as the reasoning engine for AI agents. The quantization profile exhibits high fidelity in generating structured JSON tool calls and following strict system prompts, even at extreme 4-bit compression.

📜 License & Citation

This model is subject to the Tongyi Qianwen LICENSE agreement.

Downloads last month: -; Downloads are not tracked for this model. How to track

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for FaustusFaber31/Qwen3.5-4B-awq-int4-xlam

Base model

Qwen/Qwen1.5-4B-Chat

Finetuned

(3)

this model