Huihui-Qwen3.5-35B-A3B-Claude-4.6-Opus-abliterated-FP8
A fast, vision-capable, FP8-quantized, abliterated and distilled Qwen3.5-35B model made for Nvidia DGX Spark. About 80 GB of VRAM is needed for full functionality.
Model Lineage
- It started with Qwen/Qwen3.5-35B-A3B (BF16).
- Then Jackrong created a text-only version that is less chatty and better with tools: Jackrong/Qwen3.5-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled
- Then Huihui removed all refusals and put back the vision capabilities in huihui-ai/Huihui-Qwen3.5-35B-A3B-Claude-4.6-Opus-abliterated
- Then I quantized it to FP8 using the conservative approach demonstrated by the Qwen team in Qwen/Qwen3.5-35B-A3B-FP8
Performance
The conservative approach to FP8 quantization caused minimal quality loss while still bumping the speed from 31 t/s to 51 t/s on DGX Spark. With 262k context and some space for KV cache, it uses only 80 GB of VRAM.
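As a quick sanity check on these numbers, the sketch below computes the relative speedup and a rough weight footprint. It assumes about 35 billion parameters (read off the model name), 2 bytes per BF16 weight and 1 byte per FP8 weight. It ignores the modules kept in BF16, the quantization scales and the KV cache, so the sizes are ballpark figures only.

```python
# Back-of-the-envelope figures. APPROX_PARAMS is an assumption taken from
# the "35B" in the model name; the exact parameter count may differ.
SPEED_BEFORE = 31.0    # tokens/s reported before FP8 quantization
SPEED_AFTER = 51.0     # tokens/s reported after FP8 quantization
APPROX_PARAMS = 35e9


def speedup(before: float, after: float) -> float:
    """Ratio of the new throughput to the old one."""
    return after / before


def weight_gb(num_params: float, bytes_per_param: int) -> float:
    """Approximate weight storage in gigabytes (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9


ratio = speedup(SPEED_BEFORE, SPEED_AFTER)
print(f"speedup: {ratio:.2f}x")
print(f"BF16 weights: ~{weight_gb(APPROX_PARAMS, 2):.0f} GB")
print(f"FP8 weights:  ~{weight_gb(APPROX_PARAMS, 1):.0f} GB")
```

Halving the bytes per weight is what leaves room for the long context and the KV cache within the VRAM budget mentioned above.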
Currently this is the best and fastest abliterated model for Nvidia DGX Spark, and it keeps all visual layers untouched.
I have not found a case where this model refuses to answer. It is especially fun to use with pictures ;). It also has the best "tooling" skills I have seen so far: it really likes to Google stuff first even if it knows the answer.
I plan to test the quality of the model's output later and update this page.
Quantization Details
Quantized using the FP8_DYNAMIC scheme from llmcompressor (>=0.10) with compressed-tensors serialization.
Method
FP8_DYNAMIC is a data-free quantization scheme — no calibration dataset required. Weights are statically quantized to FP8 (per-channel, symmetric), while activations are dynamically quantized to FP8 (per-token, symmetric) at inference time.
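The scheme described above can be sketched in plain Python. This is an illustration only, not the llmcompressor implementation: it assumes the FP8 format is E4M3 (largest finite value 448), treats "per-channel" as one scale per weight row, and simulates FP8 rounding by snapping to a table of representable values. All function names here are hypothetical.

```python
import bisect

E4M3_MAX = 448.0  # largest finite value in FP8 E4M3 (assumed format)


def e4m3_grid():
    """All non-negative finite values representable in FP8 E4M3."""
    values = {0.0}
    for mantissa in range(1, 8):              # subnormal values
        values.add(mantissa / 8 * 2.0 ** -6)
    for exponent in range(1, 16):             # normal values, exponent bias 7
        for mantissa in range(8):
            if exponent == 15 and mantissa == 7:
                continue                      # this bit pattern encodes NaN
            values.add((1 + mantissa / 8) * 2.0 ** (exponent - 7))
    return sorted(values)


GRID = e4m3_grid()


def round_to_e4m3(x):
    """Clamp to the FP8 range and snap to the nearest representable value.

    Ties go to the smaller magnitude here; real hardware rounds to even.
    """
    magnitude = min(abs(x), E4M3_MAX)
    i = bisect.bisect_left(GRID, magnitude)
    nearest = min(GRID[max(i - 1, 0): i + 1], key=lambda g: abs(g - magnitude))
    return nearest if x >= 0 else -nearest


def quantize_symmetric(vector):
    """Symmetric quantization: one scale per vector, no zero point."""
    amax = max(abs(v) for v in vector)
    scale = amax / E4M3_MAX if amax > 0 else 1.0
    return [round_to_e4m3(v / scale) for v in vector], scale


def dequantize(quantized, scale):
    return [q * scale for q in quantized]


# Static weights: one scale per row, computed once at quantization time.
weights = [[0.5, -2.0, 1.0], [0.01, 0.03, -0.02]]
quantized_weights = [quantize_symmetric(row) for row in weights]


# Dynamic activations: one scale per token, recomputed on every forward pass.
def quantize_activations(tokens):
    return [quantize_symmetric(token) for token in tokens]
```

Because every scale is derived from the tensor itself (the maximum absolute value), no calibration data is needed, which is what makes the scheme data-free.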
Modules Excluded from Quantization
Matching the conservative strategy from Qwen/Qwen3.5-35B-A3B-FP8:
| Module | Reason |
|---|---|
| `lm_head` | Output head (precision-sensitive) |
| `embed_tokens` | Embedding layer |
| `linear_attn.conv1d`, `linear_attn.in_proj_a/b` | Linear attention layers |
| `mlp.gate`, `mlp.shared_expert_gate` | MoE router gates (routing precision matters) |
| `model.visual.*` | Entire visual encoder kept at BF16 |
| `mtp.*` | Multi-token prediction layers |
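To show how such an exclusion list can be applied, here is a minimal sketch that checks module names against regular expressions. The patterns are my reading of the table, and the example module names are hypothetical; the actual conversion scripts may express the ignore list differently.

```python
import re

# Patterns derived from the exclusion table. Full-string matching keeps
# "mlp.gate" from also catching "mlp.gate_proj".
IGNORE_PATTERNS = [
    r"lm_head",
    r".*\.embed_tokens",
    r".*\.linear_attn\.conv1d",
    r".*\.linear_attn\.in_proj_[ab]",
    r".*\.mlp\.gate",
    r".*\.mlp\.shared_expert_gate",
    r"model\.visual\..*",
    r"mtp\..*",
]


def is_excluded(module_name: str) -> bool:
    """True if the module should stay in BF16 instead of being quantized."""
    return any(re.fullmatch(p, module_name) for p in IGNORE_PATTERNS)


# Hypothetical module names, for illustration only.
examples = [
    "lm_head",
    "model.language_model.layers.0.mlp.gate",
    "model.language_model.layers.0.mlp.experts.3.gate_proj",
    "model.language_model.layers.0.self_attn.q_proj",
    "model.visual.blocks.0.attn.qkv",
]
for name in examples:
    print(f"{name}: {'keep BF16' if is_excluded(name) else 'quantize to FP8'}")
```

Matching the full name rather than a substring matters here: the router gate is excluded, while the much larger expert projections still get quantized.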
Post-processing
The model was quantized via AutoModelForCausalLM (the only loader proven to work with llmcompressor for this architecture), then post-processed:
- Weight key renaming: `model.layers.X` → `model.language_model.layers.X` to match the `ConditionalGeneration` format expected by vLLM
- Visual encoder restoration: BF16 vision encoder weights copied from the source model (since `AutoModelForCausalLM` strips them)
- Config restructuring: `config.json` rebuilt from the source model's nested structure with the quantization config injected
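The three steps above can be sketched with plain dictionaries standing in for the checkpoint and config files. This is a simplified illustration, not the linked conversion scripts: the function names are hypothetical, only the `model.layers.` prefix is renamed, and placing the quantization config under a top-level `quantization_config` key is an assumption.

```python
import copy
import re


def rename_key(key: str) -> str:
    """Move decoder layer weights under the language_model prefix."""
    return re.sub(r"^model\.layers\.", "model.language_model.layers.", key)


def restore_visual(quantized: dict, source: dict) -> dict:
    """Rename quantized keys, then copy the visual encoder from the source."""
    merged = {rename_key(k): v for k, v in quantized.items()}
    merged.update(
        {k: v for k, v in source.items() if k.startswith("model.visual.")}
    )
    return merged


def inject_quant_config(source_config: dict, quant_config: dict) -> dict:
    """Keep the source model's nested config and add the quantization block."""
    config = copy.deepcopy(source_config)
    config["quantization_config"] = quant_config  # assumed key placement
    return config


# Toy stand-ins for real tensors and configs.
quantized = {"model.layers.0.self_attn.q_proj.weight": "fp8-tensor"}
source = {
    "model.language_model.layers.0.self_attn.q_proj.weight": "bf16-tensor",
    "model.visual.blocks.0.attn.qkv.weight": "bf16-visual-tensor",
}
merged = restore_visual(quantized, source)
config = inject_quant_config(
    {"text_config": {}, "vision_config": {}}, {"quant_method": "compressed-tensors"}
)
```

In a real run the dictionaries would hold tensors loaded from safetensors shards, and the merged result would be written back out alongside the rebuilt `config.json`.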
Resources
- Conversion scripts: github.com/ageev/AI/tree/main/converters/qwen35
- Spark recipe for spark-vllm-docker: github.com/ageev/AI/tree/main/spark-recipes
Disclaimer
It's an abliterated model. DO NOT use it if you think that all AIs need to be politically correct and boring.