Libraries

How to use batsclamp/Huihui-Qwen3.5-35B-A3B-Claude-4.6-Opus-abliterated-FP8 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="batsclamp/Huihui-Qwen3.5-35B-A3B-Claude-4.6-Opus-abliterated-FP8")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("batsclamp/Huihui-Qwen3.5-35B-A3B-Claude-4.6-Opus-abliterated-FP8")
model = AutoModelForMultimodalLM.from_pretrained("batsclamp/Huihui-Qwen3.5-35B-A3B-Claude-4.6-Opus-abliterated-FP8")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use batsclamp/Huihui-Qwen3.5-35B-A3B-Claude-4.6-Opus-abliterated-FP8 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "batsclamp/Huihui-Qwen3.5-35B-A3B-Claude-4.6-Opus-abliterated-FP8"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "batsclamp/Huihui-Qwen3.5-35B-A3B-Claude-4.6-Opus-abliterated-FP8",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
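
The same request can be sent from Python. The following is a minimal sketch using only the standard library, assuming the vLLM server from the step above is listening on `localhost:8000`; the helper names `build_chat_payload` and `chat` are made up for illustration and are not part of vLLM.

```python
import json
import urllib.request

MODEL_ID = "batsclamp/Huihui-Qwen3.5-35B-A3B-Claude-4.6-Opus-abliterated-FP8"


def build_chat_payload(prompt, image_url, model=MODEL_ID):
    """Build an OpenAI-style chat request with one text part and one image part."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


def chat(payload, base_url="http://localhost:8000"):
    """POST the payload to /v1/chat/completions and return the reply text."""
    request = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # OpenAI-compatible responses carry the text in choices[0].message.content.
    return body["choices"][0]["message"]["content"]
```

With a server running, `print(chat(build_chat_payload("Describe this image in one sentence.", image_url)))` should print the model's reply. For the SGLang server shown further down, pass `base_url="http://localhost:30000"` instead.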

Use Docker

docker model run hf.co/batsclamp/Huihui-Qwen3.5-35B-A3B-Claude-4.6-Opus-abliterated-FP8

SGLang

How to use batsclamp/Huihui-Qwen3.5-35B-A3B-Claude-4.6-Opus-abliterated-FP8 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "batsclamp/Huihui-Qwen3.5-35B-A3B-Claude-4.6-Opus-abliterated-FP8" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "batsclamp/Huihui-Qwen3.5-35B-A3B-Claude-4.6-Opus-abliterated-FP8",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "batsclamp/Huihui-Qwen3.5-35B-A3B-Claude-4.6-Opus-abliterated-FP8" \
        --host 0.0.0.0 \
        --port 30000

Huihui-Qwen3.5-35B-A3B-Claude-4.6-Opus-abliterated-FP8

A fast, vision-capable, FP8-quantized, abliterated and distilled Qwen3.5-35B-A3B model, made for the NVIDIA DGX Spark. About 80GB of VRAM is needed for full functionality.

Model Lineage

It started with Qwen/Qwen3.5-35B-A3B (BF16).

Then Jackrong created a text-only version that is less chatty and better with tools: Jackrong/Qwen3.5-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled
Then Huihui removed all refusals and restored the vision capabilities in huihui-ai/Huihui-Qwen3.5-35B-A3B-Claude-4.6-Opus-abliterated
Finally, I quantized it to FP8 using the conservative approach demonstrated by the Qwen team in Qwen/Qwen3.5-35B-A3B-FP8

Performance

The conservative approach to FP8 quantization should keep quality loss minimal (output quality has not been benchmarked yet), while raising generation speed from 31 t/s to 51 t/s on DGX Spark. With the full 262k context and some room left for the KV cache, it uses only about 80GB of VRAM.
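
To put those numbers in perspective, here is a small back-of-the-envelope sketch. The throughput figures come from the paragraph above; the weight-size estimate is only a rough assumption (35B parameters at 1 byte each for FP8 versus 2 bytes each for BF16) that ignores the modules kept in BF16, the quantization scales and the KV cache.

```python
def speedup(before_tps, after_tps):
    """Ratio of the new generation speed to the old one."""
    return after_tps / before_tps


def weight_gb(num_params, bytes_per_param):
    """Rough weight footprint in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


ratio = speedup(31.0, 51.0)
print(f"speedup: {ratio:.2f}x (+{(ratio - 1) * 100:.0f}%)")  # speedup: 1.65x (+65%)

# Idealized bounds: every parameter stored in FP8 vs. every parameter in BF16.
print(f"all-FP8 weights:  ~{weight_gb(35e9, 1):.0f} GB")
print(f"all-BF16 weights: ~{weight_gb(35e9, 2):.0f} GB")
```

The real checkpoint sits somewhere between the two bounds, because several modules stay in BF16.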

As far as I know, this is currently the best and fastest abliterated model for the NVIDIA DGX Spark, and it keeps all visual layers untouched.

I have not found a case where this model refuses to answer. It is especially fun to use with pictures ;). It also has the best "tooling" skills I have seen so far: it really likes to Google things first, even when it already knows the answer.

I plan to test the quality of the model's output later and update this page.

Quantization Details

Quantized using the FP8_DYNAMIC scheme from llmcompressor (>=0.10) with compressed-tensors serialization.

Method

FP8_DYNAMIC is a data-free quantization scheme: no calibration dataset is required. Weights are statically quantized to FP8 (per-channel, symmetric), while activations are dynamically quantized to FP8 (per-token, symmetric) at inference time.
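
The idea can be illustrated with a small pure-Python simulation. This is only a sketch of the arithmetic, not llmcompressor's implementation: it assumes the E4M3 flavour of FP8 (3 mantissa bits, largest finite value 448), and the helper names are made up for illustration. One scale is computed per row, so the same function models both per-channel weights (rows are output channels) and per-token activations (rows are tokens).

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite magnitude in E4M3 (assumed format)


def round_to_e4m3(x):
    """Round x to the nearest value on a simulated E4M3 grid, saturating at the max."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), FP8_E4M3_MAX)
    # frexp gives mag = m * 2**e with 0.5 <= m < 1, so floor(log2(mag)) == e - 1.
    exponent = max(math.frexp(mag)[1] - 1, -6)  # -6 is the smallest normal exponent
    step = 2.0 ** (exponent - 3)                # 3 mantissa bits: 8 steps per binade
    return sign * round(mag / step) * step


def quantize_rows(rows):
    """Symmetric quantization (no zero point) with one scale per row."""
    quantized, scales = [], []
    for row in rows:
        amax = max(abs(v) for v in row)
        scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
        quantized.append([round_to_e4m3(v / scale) for v in row])
        scales.append(scale)
    return quantized, scales


def dequantize_rows(quantized, scales):
    """Map quantized values back to the original range."""
    return [[q * s for q in row] for row, s in zip(quantized, scales)]


# Weights: scales are computed once, offline, and stored with the checkpoint.
weight = [[0.5, -1.0, 0.25], [2.0, 0.1, -4.0]]
w_q, w_scales = quantize_rows(weight)

# Activations: scales are recomputed from the live values at inference time.
activations = [[0.3, -0.7, 1.2]]
a_q, a_scales = quantize_rows(activations)

print(dequantize_rows(w_q, w_scales))
print(dequantize_rows(a_q, a_scales))
```

Because the scales depend only on the tensor being quantized, no calibration data is involved, which is what makes the scheme data-free.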

Modules Excluded from Quantization

Matching the conservative strategy from Qwen/Qwen3.5-35B-A3B-FP8:

| Module | Reason |
|---|---|
| `lm_head` | Output head; precision-sensitive |
| `embed_tokens` | Embedding layer |
| `linear_attn.conv1d`, `linear_attn.in_proj_a/b` | Linear attention layers |
| `mlp.gate`, `mlp.shared_expert_gate` | MoE router gates; routing precision matters |
| `model.visual.*` | Entire visual encoder kept at BF16 |
| `mtp.*` | Multi-token prediction layers |
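
As an illustration of how such an exclusion list behaves, the sketch below matches module names against regular expressions derived from the table above. The patterns and the sample module names are assumptions made for this example; the exact ignore list used for this model lives in the conversion scripts linked under Resources.

```python
import re

# Hypothetical patterns mirroring the exclusion table; not the author's exact recipe.
EXCLUDE_PATTERNS = [
    r"(^|\.)lm_head$",
    r"(^|\.)embed_tokens$",
    r"\.linear_attn\.conv1d$",
    r"\.linear_attn\.in_proj_[ab]$",
    r"\.mlp\.gate$",
    r"\.mlp\.shared_expert_gate$",
    r"^model\.visual\.",
    r"^mtp\.",
]
_COMPILED = [re.compile(p) for p in EXCLUDE_PATTERNS]


def is_excluded(module_name):
    """Return True if the module should stay in its original precision."""
    return any(p.search(module_name) for p in _COMPILED)


# Illustrative module names only; real names depend on the checkpoint layout.
sample_modules = [
    "lm_head",
    "model.embed_tokens",
    "model.layers.0.linear_attn.in_proj_a",
    "model.layers.0.mlp.gate",
    "model.layers.0.mlp.experts.3.gate_proj",
    "model.visual.blocks.0.attn.qkv",
    "mtp.layers.0.mlp.experts.0.up_proj",
]
for name in sample_modules:
    print(f"{name}: {'keep as-is' if is_excluded(name) else 'quantize to FP8'}")
```

Note that the router gate `mlp.gate` is skipped while an expert's `gate_proj` is still quantized, because the pattern is anchored at the end of the name.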

Post-processing

The model was quantized via AutoModelForCausalLM (the only loader proven to work with llmcompressor for this architecture), then post-processed:

Weight key renaming — model.layers.X → model.language_model.layers.X to match the ConditionalGeneration format expected by vLLM
Visual encoder restoration — BF16 vision encoder weights copied from the source model (since AutoModelForCausalLM strips them)
Config restructuring — config.json rebuilt from the source model's nested structure with the quantization config injected
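
The weight key renaming step can be sketched as follows. This is a minimal illustration of the `model.layers.X` to `model.language_model.layers.X` mapping described above, operating on plain key strings; the function names are hypothetical, and the actual conversion scripts may rename additional keys.

```python
import re

# Matches keys such as "model.layers.12.self_attn.q_proj.weight".
_LAYER_KEY = re.compile(r"^model\.layers\.(\d+)\.")


def rename_key(key):
    """Move a decoder-layer key under the language_model prefix; leave others unchanged."""
    return _LAYER_KEY.sub(r"model.language_model.layers.\1.", key)


def rename_state_dict_keys(state_dict):
    """Return a new dict with renamed keys and the same values."""
    return {rename_key(key): value for key, value in state_dict.items()}


# Placeholder values stand in for tensors in this example.
example = {
    "model.layers.0.mlp.gate.weight": "tensor-0",
    "model.layers.17.self_attn.q_proj.weight": "tensor-1",
    "lm_head.weight": "tensor-2",
}
for new_key in rename_state_dict_keys(example):
    print(new_key)
```

The pattern is anchored at the start of the key, so keys that already carry the `model.language_model.` prefix pass through unchanged and the renaming is safe to run twice.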

Resources

Conversion scripts: github.com/ageev/AI/tree/main/converters/qwen35
Spark recipe for spark-vllm-docker: github.com/ageev/AI/tree/main/spark-recipes

Disclaimer

It's an abliterated model. DO NOT use it if you think that all AIs need to be politically correct and boring.
