Instructions to use huihui-ai/Huihui-Qwen3-Coder-Next-Opus-4.6-Reasoning-Distilled-abliterated with libraries, inference providers, and local apps.

Libraries

How to use huihui-ai/Huihui-Qwen3-Coder-Next-Opus-4.6-Reasoning-Distilled-abliterated with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="huihui-ai/Huihui-Qwen3-Coder-Next-Opus-4.6-Reasoning-Distilled-abliterated")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
# This is a text-only causal LM, so AutoModelForCausalLM is the matching auto class
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("huihui-ai/Huihui-Qwen3-Coder-Next-Opus-4.6-Reasoning-Distilled-abliterated")
model = AutoModelForCausalLM.from_pretrained("huihui-ai/Huihui-Qwen3-Coder-Next-Opus-4.6-Reasoning-Distilled-abliterated")
messages = [
    {"role": "user", "content": "Who are you?"},
]
# Render the chat template and move the resulting tensors to the model's device
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))


vLLM

How to use huihui-ai/Huihui-Qwen3-Coder-Next-Opus-4.6-Reasoning-Distilled-abliterated with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "huihui-ai/Huihui-Qwen3-Coder-Next-Opus-4.6-Reasoning-Distilled-abliterated"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "huihui-ai/Huihui-Qwen3-Coder-Next-Opus-4.6-Reasoning-Distilled-abliterated",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/huihui-ai/Huihui-Qwen3-Coder-Next-Opus-4.6-Reasoning-Distilled-abliterated

SGLang

How to use huihui-ai/Huihui-Qwen3-Coder-Next-Opus-4.6-Reasoning-Distilled-abliterated with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "huihui-ai/Huihui-Qwen3-Coder-Next-Opus-4.6-Reasoning-Distilled-abliterated" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "huihui-ai/Huihui-Qwen3-Coder-Next-Opus-4.6-Reasoning-Distilled-abliterated",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "huihui-ai/Huihui-Qwen3-Coder-Next-Opus-4.6-Reasoning-Distilled-abliterated" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "huihui-ai/Huihui-Qwen3-Coder-Next-Opus-4.6-Reasoning-Distilled-abliterated",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use huihui-ai/Huihui-Qwen3-Coder-Next-Opus-4.6-Reasoning-Distilled-abliterated with Docker Model Runner:
```
docker model run hf.co/huihui-ai/Huihui-Qwen3-Coder-Next-Opus-4.6-Reasoning-Distilled-abliterated
```

Heads-up: BF16 weights appear to produce degenerate outputs (logits collapsed)

by sakamakismile - opened Apr 27

Discussion

sakamakismile

Apr 27

Hi huihui-ai team — long-time fan of the abliterated line, wanted to flag something we ran into while preparing an NVFP4 variant of this release for Blackwell. Posting it here as a friendly heads-up, not a complaint — totally up to you whether to investigate.

What we observed

When loading the BF16 weights via AutoModelForCausalLM.from_pretrained(..., dtype=torch.bfloat16, trust_remote_code=True, device_map="auto"), the model appears to load cleanly (no missing/unexpected keys), but:

Output logits magnitude is collapsed to roughly [-0.08, +0.08] (healthy Qwen3-Next-80B BF16 logits typically span at least ±10).
Greedy generation produces only ! tokens (and occasional fragments like whole, BUFFER, journal, InlineData):

Prompt: "Hello, who are you?"
Output: "!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!cribe!!! whole! whole!!!module!now!Le! whole! whole! whole! whole! whole"

Prompt: "Write a Python function that computes the factorial of n."
Output: "!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! journal!!!!!!!!!!!!!!!!!!!!!!!!!BUFFER!InlineData!BUFFER!!!!!!!!!"

Prompt: "List the noble gases:"
Output: "!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!\x1a!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!"

NVFP4 calibration via nvidia-modelopt fails with NaN amax at the very first attention o_proj input — consistent with the upstream activations being near-zero and the o_proj input collapsing once any quantization scale tries to track it.
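
The first two symptoms above can be checked mechanically. The sketch below is a minimal illustration, not the tooling used for this report: the function names and the `collapse_threshold` of 1.0 are assumptions, chosen only to separate the reported ±0.08 range from the ±10 range described as healthy.

```python
from collections import Counter


def logit_spread(logits, collapse_threshold=1.0):
    """Summarize a 1-D sequence of next-token logits.

    collapse_threshold is an assumed cutoff, not a value from the report.
    """
    values = [float(v) for v in logits]
    lo, hi = min(values), max(values)
    abs_max = max(abs(lo), abs(hi))
    return {"min": lo, "max": hi, "abs_max": abs_max,
            "collapsed": abs_max < collapse_threshold}


def dominant_char_ratio(text):
    """Return the most common non-space character and its share of the text."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return "", 0.0
    char, count = Counter(chars).most_common(1)[0]
    return char, count / len(chars)


# With a loaded model, the inputs could come from something like:
#   logits = model(**inputs).logits[0, -1].float().tolist()
#   text = tokenizer.decode(generated_ids, skip_special_tokens=True)
print(logit_spread([-0.08, 0.01, 0.08]))
print(dominant_char_ratio("!!!!!!!! journal!!!!BUFFER!!!!"))
```

A collapsed logit range together with a single character dominating the decoded text matches the behaviour described here. Either signal alone may have other causes.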

Test environment (clean room, fast path active)

Container: nvidia/cuda:13.0 base, torch 2.11, transformers 5.5.4
flash-linear-attention + causal-conv1d installed (no fast-path warning printed during forward)
3× RTX PRO 6000 Blackwell, device_map="auto" sharding
BF16 dtype, no quantization, no abliteration step from us — just from_pretrained + generate

So the "fast path is not available" fallback is not in play, and the issue is reproducible from a clean transformers load.

What we don't know

We only tested this single release, so we can't tell whether the cause sits in:

the abliteration pass over samuelcardillo/Qwen3-Coder-Next-Opus-4.6-Reasoning-Distilled (your step), or
something already present in the Reasoning-Distilled base, or
some interaction between the two.

We didn't run a comparison against samuelcardillo/Qwen3-Coder-Next-Opus-4.6-Reasoning-Distilled to bisect the cause — happy to do that if it would help.
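
The comparison offered above could be sketched as follows: load each checkpoint the same way and compare the magnitude of the last-token logits for one prompt. This is an illustration under stated assumptions, not something that was run for this report. `last_token_abs_max` and `locate_collapse` are hypothetical helpers, the 1.0 cutoff is arbitrary, and loading either checkpoint in BF16 needs a multi-GPU host like the one described in the test environment.

```python
ABLITERATED = "huihui-ai/Huihui-Qwen3-Coder-Next-Opus-4.6-Reasoning-Distilled-abliterated"
BASE = "samuelcardillo/Qwen3-Coder-Next-Opus-4.6-Reasoning-Distilled"


def last_token_abs_max(model_id, prompt="Hello, who are you?"):
    """Load model_id in BF16 and return max |logit| at the last prompt position."""
    # Imported lazily so the pure-Python helper below works without torch installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, dtype=torch.bfloat16, trust_remote_code=True, device_map="auto"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1].float()
    return logits.abs().max().item()


def locate_collapse(base_abs_max, abliterated_abs_max, cutoff=1.0):
    """Map the two measurements onto the possibilities listed above."""
    base_bad = base_abs_max < cutoff  # cutoff is an assumed threshold
    abl_bad = abliterated_abs_max < cutoff
    if base_bad and abl_bad:
        return "already present in the base"
    if abl_bad:
        return "introduced after the base (abliteration or an interaction)"
    if base_bad:
        return "base collapsed but abliterated healthy (unexpected)"
    return "not reproduced"


# Example on a multi-GPU host (not executed here):
#   print(locate_collapse(last_token_abs_max(BASE), last_token_abs_max(ABLITERATED)))
```

A single prompt cannot distinguish the abliteration pass itself from an interaction between the two steps, so the second outcome only narrows the search to "after the base".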

Why we mention it

35K-DL-tier repos like yours are often the entry point for the local-LLM crowd, and BF16 generating only ! is the kind of thing that'll create a wave of confused issues. Wanted to surface it early so you have the option to investigate before that happens. We've stopped our NVFP4 path on this release accordingly.

Always grateful for the abliterated line — it's been the foundation of much of our Blackwell fast-path work this year. Let me know if there's any diagnostic data I can share that would speed up triage.

— Tonoken3 / Lna-Lab

huihui-ai

Owner Apr 27

We are very grateful for your support and feedback. We have not tested the method you mentioned, but we have added code testing to our processes. You may want to give it a try.
https://huggingface.co/huihui-ai/Huihui-Qwen3-Coder-Next-Opus-4.6-Reasoning-Distilled-abliterated#usage
