# Qwen3.5-122B-A10B-abliterated-NVFP4
NVFP4 (4-bit floating point) quantized derivative of wangzhang/Qwen3.5-122B-A10B-abliterated, which itself is derived from Qwen/Qwen3.5-122B-A10B.
This repository provides a modified derivative checkpoint for local inference and serving. The primary changes in this repository are NVFP4 quantization, weight repacking / export formatting, and serving compatibility adjustments.
## Model Details
| Property | Value |
|---|---|
| Intermediate Base Model | wangzhang/Qwen3.5-122B-A10B-abliterated |
| Original Base Model | Qwen/Qwen3.5-122B-A10B |
| Architecture | Qwen3.5 MoE (256 routed experts, 10B active) |
| Quantization | NVFP4 (compressed-tensors, nvfp4-pack-quantized) |
| Original Size | 228 GB (BF16) |
| Quantized Size | 71.2 GB (69% reduction) |
| Format | safetensors (2 shards) |
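The size reduction in the table can be checked with a few lines of arithmetic. This is only a sanity check on the two sizes listed above; the variable names are illustrative, and the ratio does not depend on whether the sizes are decimal or binary gigabytes.

```python
# Sizes taken from the Model Details table (same unit for both).
ORIGINAL_SIZE = 228.0   # BF16 checkpoint
QUANTIZED_SIZE = 71.2   # NVFP4 checkpoint

# Fraction of the original size that was removed by quantization.
reduction = 1.0 - QUANTIZED_SIZE / ORIGINAL_SIZE
# How many times smaller the quantized checkpoint is.
compression_ratio = ORIGINAL_SIZE / QUANTIZED_SIZE

print(f"reduction: {reduction:.1%}, compression: {compression_ratio:.2f}x")
```

The ratio stays below the 4× that pure 4-bit storage would suggest, which is consistent with block scales being stored and several components staying in BF16.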
## Quantization Method
This model was quantized using a template-based weight replacement approach:
- Reference Template: RedHatAI/Qwen3.5-122B-A10B-NVFP4, a calibrated NVFP4 checkpoint of the original (non-abliterated) Qwen3.5-122B-A10B, produced by llm-compressor with proper calibration data.
- Weight Replacement: Each quantized tensor (`weight_packed` and `weight_scale`) was regenerated from the abliterated BF16 weights using the reference checkpoint's `weight_global_scale` and `input_global_scale` values.
- Format Preservation: The reference checkpoint's `config.json`, `quantization_config`, global scales, and all metadata were preserved unchanged, ensuring full compatibility with vLLM's CUTLASS NVFP4 MoE kernel.
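To make the weight replacement step more concrete, here is a simplified, pure-Python sketch of NVFP4-style block quantization with a fixed global scale. It is not the script used to build this checkpoint. It assumes the common NVFP4 convention of 16-value blocks, FP4 E2M1 codes, and a stored block scale equal to `global_scale * block_amax / 6`; the function names are hypothetical, and the FP8 rounding of the block scale and the bit packing into `weight_packed` are omitted.

```python
# Magnitudes representable in FP4 E2M1 (sign is handled separately).
FP4_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
FP4_MAX = 6.0
BLOCK = 16  # assumption: one scale per 16 consecutive values


def nearest_fp4(x):
    """Snap x to the nearest FP4 value (ties go to the smaller magnitude,
    a simplification of hardware rounding)."""
    mag = min(FP4_GRID, key=lambda g: abs(abs(x) - g))
    return -mag if x < 0 else mag


def fake_quantize_nvfp4(weights, global_scale):
    """Quantize then dequantize a flat list of floats.

    Returns (codes, dequantized). `global_scale` plays the role of the
    reused per-tensor scale; block scales are recomputed from the new weights.
    """
    if len(weights) % BLOCK:
        raise ValueError("length must be a multiple of the block size")
    codes, dequantized = [], []
    for start in range(0, len(weights), BLOCK):
        block = weights[start:start + BLOCK]
        amax = max(abs(v) for v in block)
        # Value that would be stored as the block scale (FP8 in a real file).
        stored_scale = global_scale * amax / FP4_MAX
        # Effective step size after the global scale is divided back out.
        step = stored_scale / global_scale if amax > 0 else 1.0
        for v in block:
            code = nearest_fp4(v / step)
            codes.append(code)
            dequantized.append(code * step)
    return codes, dequantized
```

In this float-only sketch the global scale cancels out of the effective step size. In a real checkpoint it matters because it determines the range in which the block scales are stored, which is why reusing the reference values keeps the tensor layout compatible with the reference checkpoint.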
## What is Quantized
| Component | Format | Notes |
|---|---|---|
| Routed experts (gate/up/down_proj) | NVFP4 | 256 experts × 48 layers × 3 projections |
| Shared experts | NVFP4 | 48 layers × 3 projections |
| Self-attention (q/k/v/o_proj) | NVFP4 | 12 full-attention layers |
| Linear attention | BF16 | 36 layers, kept at full precision |
| Embeddings, norms, gates | BF16 | Kept at full precision |
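The counts in the table multiply out as follows. This is a small sketch that only restates the table; it assumes one shared expert per layer, which is how the "48 layers × 3 projections" row reads, and the constant names are illustrative.

```python
# Figures taken from the "What is Quantized" table.
LAYERS = 48
ROUTED_EXPERTS = 256
PROJECTIONS_PER_EXPERT = 3   # gate_proj, up_proj, down_proj
FULL_ATTENTION_LAYERS = 12
LINEAR_ATTENTION_LAYERS = 36
ATTENTION_PROJECTIONS = 4    # q_proj, k_proj, v_proj, o_proj

routed = ROUTED_EXPERTS * LAYERS * PROJECTIONS_PER_EXPERT
shared = LAYERS * PROJECTIONS_PER_EXPERT  # assumption: one shared expert per layer
attention = FULL_ATTENTION_LAYERS * ATTENTION_PROJECTIONS
total_quantized = routed + shared + attention

print(f"routed={routed}, shared={shared}, attention={attention}, total={total_quantized}")
```

The routed experts account for almost all quantized projections, so they dominate the size savings.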
## Serving with vLLM
This model requires a text-only compatibility patch for vLLM since Qwen3.5 MoE is a multimodal architecture but this checkpoint contains only text weights.
### Quick Start
```bash
# 1. Download the model
huggingface-cli download bjk110/Qwen3.5-122B-A10B-abliterated-NVFP4
# 2. Apply the text-only patch before starting vLLM
python vllm_patches/patch_qwen35_moe_text.py
# 3. Serve with vLLM
vllm serve /path/to/model \
  --served-model-name Qwen3.5-122B-A10B-abliterated-NVFP4 \
  --max-model-len 131072 \
  --max-num-seqs 4 \
  --gpu-memory-utilization 0.90 \
  --trust-remote-code \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --reasoning-parser qwen3
```
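Once the server is running, it can be queried through vLLM's OpenAI-compatible chat endpoint. The following is a minimal standard-library sketch. It assumes the server listens on `localhost:8000` (vLLM's default when no `--port` is given) and uses the name passed to `--served-model-name` above; the helper names `build_chat_request` and `chat` are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumption: default vLLM host and port
MODEL_NAME = "Qwen3.5-122B-A10B-abliterated-NVFP4"  # matches --served-model-name


def build_chat_request(prompt, max_tokens=256):
    """Build a POST request for the /chat/completions endpoint."""
    payload = {
        "model": MODEL_NAME,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, timeout=600):
    """Send one user message and return the assistant's reply text."""
    request = build_chat_request(prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Example call (requires the server from the Quick Start to be running):
# print(chat("What is the capital of France?"))
```

The generous timeout is a deliberate choice: at the throughput reported for a single DGX Spark, long replies can take a minute or more.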
### Docker Compose (Recommended)
A complete Docker Compose setup is provided in the `serving/` directory:
```bash
# Copy serving files
cp -r serving/ /path/to/your/vllm-setup/
# Edit .env to set MODEL_PATH
vim serving/.env
# Start
cd serving && docker compose --profile head up -d
```

See the `serving/` directory for:
- `docker-compose.yml`: Full vLLM serving configuration
- `.env.example`: Environment variables template
- `entrypoint.sh`: Entrypoint with automatic patch application
## Hardware Requirements
| Configuration | Memory | max_model_len | Notes |
|---|---|---|---|
| 1× NVIDIA DGX Spark (GB10) | 121 GiB unified | 131,072 (128K) | Tested and verified |
| 1× GPU with 80+ GB VRAM | 80 GiB | ~65,536 | Estimated |
## Performance (DGX Spark, TP=1)
| Metric | Value |
|---|---|
| Throughput | 14.5 tok/s average, 16.8 tok/s peak |
| KV Cache | 222K tokens (20.4 GiB) |
| Max Concurrency | 6.16× at 128K context |
| Model Loading | ~13 min (2 shards) |
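The figures above translate into rough per-request numbers. The sketch below derives an approximate generation time and the average KV cache footprint per token. It assumes "222K" means 222,000 tokens and that throughput stays near the reported average, so treat the results as estimates; the function names are illustrative.

```python
# Figures taken from the performance table.
AVG_TOKENS_PER_SECOND = 14.5
PEAK_TOKENS_PER_SECOND = 16.8
KV_CACHE_GIB = 20.4
KV_CACHE_TOKENS = 222_000  # assumption: "222K" read as 222,000


def generation_seconds(num_tokens, tokens_per_second=AVG_TOKENS_PER_SECOND):
    """Estimated decode time for a reply of num_tokens tokens."""
    return num_tokens / tokens_per_second


def kv_bytes_per_token():
    """Average KV cache bytes per cached token."""
    return KV_CACHE_GIB * 2**30 / KV_CACHE_TOKENS


print(f"1000-token reply: about {generation_seconds(1000):.0f} s at average speed")
print(f"KV cache: about {kv_bytes_per_token() / 1024:.0f} KiB per token")
```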
## Referenced Models
- Base model: wangzhang/Qwen3.5-122B-A10B-abliterated — Abliterated (uncensored) version of Qwen3.5-122B-A10B
- Original model: Qwen/Qwen3.5-122B-A10B — Official Qwen3.5 MoE model
- Quantization template: RedHatAI/Qwen3.5-122B-A10B-NVFP4 — Used as format reference for NVFP4 quantization structure
- FP8 variant: bjk110/Qwen3.5-122B-A10B-abliterated-FP8 — FP8 block-wise quantized version (116 GB, requires TP=2)