Heads-up: BF16 weights appear to produce degenerate outputs (logits collapsed)
Hi huihui-ai team, long-time fan of the abliterated line here. We wanted to flag something we ran into while preparing an NVFP4 variant of this release for Blackwell. Posting it here as a friendly heads-up, not a complaint; it's totally up to you whether to investigate.
What we observed
When loading the BF16 weights via AutoModelForCausalLM.from_pretrained(..., dtype=torch.bfloat16, trust_remote_code=True, device_map="auto"), the model appears to load cleanly (no missing/unexpected keys), but:
- Output logit magnitude is collapsed to roughly [-0.08, +0.08] (healthy Qwen3-Next-80B BF16 logits typically span at least ±10).
- Greedy generation produces only "!" tokens (and occasional fragments like whole, BUFFER, journal, InlineData):
Prompt: "Hello, who are you?"
Output: "!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!cribe!!! whole! whole!!!module!now!Le! whole! whole! whole! whole! whole"
Prompt: "Write a Python function that computes the factorial of n."
Output: "!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! journal!!!!!!!!!!!!!!!!!!!!!!!!!BUFFER!InlineData!BUFFER!!!!!!!!!"
Prompt: "List the noble gases:"
Output: "!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!\x1a!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!"
- NVFP4 calibration via nvidia-modelopt fails with NaN amax at the very first attention o_proj input, consistent with the upstream activations being near-zero and the o_proj input collapsing once any quantization scale tries to track it.
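For anyone who wants to check the first two symptoms themselves, here is a minimal sketch of how such a probe could look. It is not the script used for the report above: the helper names (logit_span, looks_collapsed, dominant_char_ratio, probe_model) and the width threshold of 1.0 are illustrative assumptions, and only the from_pretrained arguments are taken from the load call quoted earlier.

```python
from collections import Counter


def logit_span(logits):
    """Return (min, max) of a flat sequence of logit values."""
    values = [float(v) for v in logits]
    return min(values), max(values)


def looks_collapsed(span, min_healthy_width=1.0):
    """True if the logit range is narrower than an assumed healthy width.

    min_healthy_width is an illustrative threshold, not a calibrated one.
    """
    low, high = span
    return (high - low) < min_healthy_width


def dominant_char_ratio(text):
    """Fraction of the text taken up by its single most frequent character."""
    if not text:
        return 0.0
    return Counter(text).most_common(1)[0][1] / len(text)


def probe_model(model_id, prompt="Hello, who are you?", max_new_tokens=64):
    """Load a model in BF16 and report last-token logit span and greedy output."""
    # Heavy imports stay local so the helpers above work without torch installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, dtype=torch.bfloat16, trust_remote_code=True, device_map="auto"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        # Logits for the next token after the prompt.
        last_logits = model(**inputs).logits[0, -1].float().cpu().tolist()
        # Greedy decoding, matching the generation mode described above.
        generated = model.generate(
            **inputs, max_new_tokens=max_new_tokens, do_sample=False
        )
    new_tokens = generated[0][inputs["input_ids"].shape[-1]:]
    text = tokenizer.decode(new_tokens, skip_special_tokens=True)
    span = logit_span(last_logits)
    return {
        "span": span,
        "collapsed": looks_collapsed(span),
        "dominant_ratio": dominant_char_ratio(text),
        "text": text,
    }
```

A span close to [-0.08, +0.08] together with a dominant_ratio near 1.0 would match the behaviour described above; a healthy checkpoint should show a much wider span and varied text.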
Test environment (clean room, fast path active)
- Container: nvidia/cuda:13.0 base, torch 2.11, transformers 5.5.4
- flash-linear-attention + causal-conv1d installed (no fast-path warning printed during forward)
- 3× RTX PRO 6000 Blackwell, device_map="auto" sharding
- BF16 dtype, no quantization, no abliteration step from us: just from_pretrained + generate
So the "fast path is not available" fallback is not in play, and the issue is reproducible from a clean transformers load.
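To make reports like this easier to compare, the relevant package versions can be dumped with the standard library alone. The snippet below is a sketch under assumptions: the import names fla and causal_conv1d are how these packages are commonly imported and are not taken from this thread, and the function names are hypothetical.

```python
import importlib.util
from importlib import metadata

# Distribution name -> import name. The import names are assumptions;
# adjust them if your installation differs.
PACKAGES = {
    "torch": "torch",
    "transformers": "transformers",
    "flash-linear-attention": "fla",
    "causal-conv1d": "causal_conv1d",
}


def package_status(dist_name, import_name):
    """Return (installed version or None, importable flag) for one package."""
    try:
        version = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        version = None
    importable = importlib.util.find_spec(import_name) is not None
    return version, importable


def environment_report(packages=PACKAGES):
    """Collect the status of every package in the mapping."""
    return {dist: package_status(dist, mod) for dist, mod in packages.items()}


if __name__ == "__main__":
    for dist, (version, importable) in environment_report().items():
        print(f"{dist}: version={version} importable={importable}")
```

Note that an installed package only shows the fast path could be used; the absence of the fallback warning during the forward pass is what indicates it was actually active.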
What we don't know
We only tested this single release, so we can't tell whether the cause sits in:
- the abliteration pass over samuelcardillo/Qwen3-Coder-Next-Opus-4.6-Reasoning-Distilled (your step), or
- something already present in the Reasoning-Distilled base, or
- some interaction between the two.
We didn't run a comparison against samuelcardillo/Qwen3-Coder-Next-Opus-4.6-Reasoning-Distilled to bisect the cause, but we're happy to do that if it would help.
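The bisect itself reduces to a small decision table once a last-token logit span has been measured for both checkpoints under the same prompt and environment. The sketch below is hypothetical: the classify function, its return strings, and the 1.0 width threshold are illustrative, and no such comparison has been run here. With only two measurements, the abliteration pass and an interaction between the two steps cannot be told apart, so they share one verdict.

```python
def classify(base_span, abliterated_span, min_healthy_width=1.0):
    """Map two (min, max) logit spans to the hypothesis they support.

    min_healthy_width is an assumed threshold, not a calibrated value.
    """
    def collapsed(span):
        return (span[1] - span[0]) < min_healthy_width

    base_bad = collapsed(base_span)
    abliterated_bad = collapsed(abliterated_span)

    if base_bad and abliterated_bad:
        return "already present in the base"
    if abliterated_bad:
        return "introduced by abliteration or its interaction with the base"
    if base_bad:
        return "base collapsed but abliterated healthy: recheck the setup"
    return "neither collapsed: not reproduced"


if __name__ == "__main__":
    # Made-up numbers for illustration only, not measurements.
    print(classify((-12.0, 15.0), (-0.08, 0.08)))
```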
Why we mention it
35K-DL-tier repos like yours are often the entry point for the local-LLM crowd, and BF16 generating only ! is the kind of thing that'll create a wave of confused issues. Wanted to surface it early so you have the option to investigate before that happens. We've stopped our NVFP4 path on this release accordingly.
Always grateful for the abliterated line; it's been the foundation of much of our Blackwell fast-path work this year. Let me know if there's any diagnostic data I can share that would speed up triage.
-- Tonoken3 / Lna-Lab
We are very grateful for your support and feedback. We have not tested the method you mentioned, but we have added code testing to our process. You may want to give it a try:
https://huggingface.co/huihui-ai/Huihui-Qwen3-Coder-Next-Opus-4.6-Reasoning-Distilled-abliterated#usage