π€ Qwen3.5 4B AWQ (INT4) - Optimized for Tool Calling
Qwen3.5-4B-awq-int4-optimized is a highly specialized, quantized version of the Qwen3.5-4B-Chat model. It has been compressed to 4-bit precision using AWQ (Activation-aware Weight Quantization) to maximize inference throughput and minimize VRAM usage, making it perfect for edge deployments and high-concurrency serving.
π Quickstart with vLLM
This model is fully optimized for high-performance serving with vLLM.
Installation
pip install vllm
Serving the Model
You can immediately deploy this model as an OpenAI-compatible API server:
python -m vllm.entrypoints.openai.api_server \
--model Faustus-Faber/Qwen3.5-4B-awq-int4-optimized \
--quantization awq \
--dtype auto \
--max-model-len 4096
π§ Model Details
- Base Model: Qwen3.5-4B-Chat
- Quantization Method: AWQ (Activation-aware Weight Quantization)
- Precision: INT4 (Group Size: 128)
- Primary Use Case: Agentic workflows, tool-calling, and highly structured JSON generation.
π οΈ Intended Use
This model is designed to act as the reasoning engine for AI agents. The quantization profile exhibits high fidelity in generating structured JSON tool calls and following strict system prompts, even at extreme 4-bit compression.
π License & Citation
This model is subject to the Tongyi Qianwen LICENSE agreement.
Model tree for FaustusFaber31/Qwen3.5-4B-awq-int4-xlam
Base model
Qwen/Qwen1.5-4B-Chat