Instructions for using aifeifei798/QiMing-Moe-20B-MXFP4 with libraries, inference providers, notebooks, and local apps.

Libraries

How to use aifeifei798/QiMing-Moe-20B-MXFP4 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="aifeifei798/QiMing-Moe-20B-MXFP4")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("aifeifei798/QiMing-Moe-20B-MXFP4")
model = AutoModelForCausalLM.from_pretrained("aifeifei798/QiMing-Moe-20B-MXFP4")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use aifeifei798/QiMing-Moe-20B-MXFP4 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "aifeifei798/QiMing-Moe-20B-MXFP4"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "aifeifei798/QiMing-Moe-20B-MXFP4",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/aifeifei798/QiMing-Moe-20B-MXFP4

SGLang

How to use aifeifei798/QiMing-Moe-20B-MXFP4 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "aifeifei798/QiMing-Moe-20B-MXFP4" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "aifeifei798/QiMing-Moe-20B-MXFP4",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "aifeifei798/QiMing-Moe-20B-MXFP4" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "aifeifei798/QiMing-Moe-20B-MXFP4",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
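
The curl calls above can also be issued from Python. The following is a minimal sketch using only the standard library. The helper names `build_chat_request` and `send_chat_request` are illustrative and not part of vLLM or SGLang, and the default base URL assumes the SGLang port shown above (30000); the reply parsing assumes the server answers in the usual OpenAI chat-completions shape.

```python
import json
import urllib.request

MODEL_ID = "aifeifei798/QiMing-Moe-20B-MXFP4"


def build_chat_request(prompt, base_url="http://localhost:30000", model=MODEL_ID):
    """Build (but do not send) a POST request for the OpenAI-compatible chat route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request, timeout=60):
    """Send the request and return the assistant's reply text.

    Assumption: the response body follows the OpenAI chat-completions layout.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With a server running, `send_chat_request(build_chat_request("What is the capital of France?"))` should return the reply text. Pass `base_url="http://localhost:8000"` to target the vLLM server instead.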

Unsloth Studio

How to use aifeifei798/QiMing-Moe-20B-MXFP4 with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for aifeifei798/QiMing-Moe-20B-MXFP4 to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for aifeifei798/QiMing-Moe-20B-MXFP4 to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for aifeifei798/QiMing-Moe-20B-MXFP4 to start chatting

Load model with FastModel

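# Install Unsloth first (run this in a shell, not in Python): pip install unsloth
from unsloth import FastModel

# Load the model and tokenizer; max_seq_length caps the context length
model, tokenizer = FastModel.from_pretrained(
    model_name="aifeifei798/QiMing-Moe-20B-MXFP4",
    max_seq_length=2048,
)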

Docker Model Runner
How to use aifeifei798/QiMing-Moe-20B-MXFP4 with Docker Model Runner:
```
docker model run hf.co/aifeifei798/QiMing-Moe-20B-MXFP4
```

QiMing

An AI that rewrites its own rules for greater intelligence.

Result = Model Content × Math²

"Logic is the soul of a model, for it defines:

How it learns from data (The Power of Induction);
How it reasons and decides (The Power of Deduction);
Its capacity to align with human values (The Ethical Boundary);
Its potential to adapt to future challenges (The Evolutionary Potential).

If a model pursues nothing but sheer scale or computational power, ignoring the depth and breadth of its logic, it risks becoming a "paper tiger"—imposing on the surface, yet hollow at its core. Conversely, a model built upon elegant logic, even with fewer parameters, can unleash its true vitality in our complex world."

DISCLAIMER

The content generated by this model is for reference purposes only. Users are advised to verify its accuracy independently before use.

This is a 20-billion-parameter (20B) foundation model. Its output may be incomplete or inaccurate and may include hallucinations.

If you find this AI too human-like, please remember: it is merely a more intelligent model — not an actual person.

Thanks

mradermacher: For creating the GGUF versions of these models.

https://huggingface.co/mradermacher/QiMing-Moe-20B-MXFP4-GGUF

https://huggingface.co/mradermacher/QiMing-Moe-20B-MXFP4-i1-GGUF

OpenAI: For developing the foundational model (openai/gpt-oss-20b) used in this project.

https://huggingface.co/openai

unsloth.ai (Unsloth): For their work enabling these models to run smoothly on standard hardware such as a Google Colab T4 with 16 GB of VRAM.

https://unsloth.ai

Thanks to Google Colab for the T4 16 GB GPU.

Highlights

Permissive Apache 2.0 license: Build freely without copyleft restrictions or patent risk—ideal for experimentation, customization, and commercial deployment.
Configurable reasoning effort: Easily adjust the reasoning effort (low, medium, high) based on your specific use case and latency needs.
Full chain-of-thought: Gain complete access to the model’s reasoning process, facilitating easier debugging and increased trust in outputs. It’s not intended to be shown to end users.
Fine-tunable: Fully customize models to your specific use case through parameter fine-tuning.
Agentic capabilities: Use the models’ native capabilities for function calling, web browsing, Python code execution, and Structured Outputs.
MXFP4 quantization: The models were post-trained with MXFP4 quantization of the MoE weights, allowing QiMing-Moe-20B-MXFP4 to run within 16 GB of memory. All evals were performed with the same MXFP4 quantization.
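
As a rough illustration of why MXFP4 helps the model fit in 16 GB, the sketch below estimates weight storage only. It assumes the nominal 20B parameter count and, as a simplification, treats every weight as MXFP4 (4-bit values plus one shared 8-bit scale per block of 32). Since only the MoE weights are quantized, and activations and the KV cache also need memory, the real footprint will be somewhat higher.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate weight storage in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 20e9  # nominal "20B" figure; the exact count may differ

# MXFP4: 4-bit elements plus one 8-bit scale shared by each block of 32,
# which averages 4 + 8/32 = 4.25 bits per weight.
MXFP4_BITS = 4 + 8 / 32
BF16_BITS = 16

mxfp4_gb = weight_memory_gb(NUM_PARAMS, MXFP4_BITS)
bf16_gb = weight_memory_gb(NUM_PARAMS, BF16_BITS)
print(f"MXFP4: ~{mxfp4_gb:.1f} GB, BF16: ~{bf16_gb:.1f} GB")
```

Under these assumptions the quantized weights come to a little over 10 GB, compared with 40 GB for the same parameter count in BF16.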

Inference examples

Transformers

You can use QiMing-Moe-20B-MXFP4 with Transformers. If you use the Transformers chat template, it automatically applies the harmony response format. If you call model.generate directly, you need to apply the harmony format yourself, either through the chat template or with the openai-harmony package.

To get started, install the necessary dependencies to set up your environment:

pip install -U transformers kernels torch

Once set up, you can run the model with the snippet below:

from transformers import pipeline
import torch

model_id = "aifeifei798/QiMing-Moe-20B-MXFP4"

pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype="auto",
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Explain quantum mechanics clearly and concisely."},
]

outputs = pipe(
    messages,
    max_new_tokens=256,
)
print(outputs[0]["generated_text"][-1])

Alternatively, you can run the model via Transformers Serve to spin up an OpenAI-compatible webserver:

transformers serve
transformers chat localhost:8000 --model-name-or-path aifeifei798/QiMing-Moe-20B-MXFP4

Learn more about how to use gpt-oss with Transformers.

vLLM

vLLM recommends using uv for Python dependency management. You can use vLLM to spin up an OpenAI-compatible webserver. The following command will automatically download the model and start the server.

uv pip install --pre vllm==0.10.1+gptoss \
    --extra-index-url https://wheels.vllm.ai/gpt-oss/ \
    --extra-index-url https://download.pytorch.org/whl/nightly/cu128 \
    --index-strategy unsafe-best-match

vllm serve aifeifei798/QiMing-Moe-20B-MXFP4

Learn more about how to use gpt-oss with vLLM.

PyTorch / Triton

To learn about how to use this model with PyTorch and Triton, check out our reference implementations in the gpt-oss repository.

LM Studio

If you are using LM Studio, you can use the following command to download the model.

# QiMing-Moe-20B-MXFP4
lms get aifeifei798/QiMing-Moe-20B-MXFP4

Check out our awesome list for a broader collection of gpt-oss resources and inference partners.

Download the model

You can download the model with the Hugging Face CLI:

# QiMing-Moe-20B-MXFP4
huggingface-cli download aifeifei798/QiMing-Moe-20B-MXFP4 --local-dir QiMing-Moe-20B-MXFP4/
pip install gpt-oss
python -m gpt_oss.chat QiMing-Moe-20B-MXFP4/

Reasoning levels

You can choose among three reasoning levels to suit your task:

Low: Fast responses for general dialogue.
Medium: Balanced speed and detail.
High: Deep and detailed analysis.

The reasoning level can be set in the system prompt, e.g., "Reasoning: high".
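
A minimal sketch of setting the level from Python follows. The helper `with_reasoning_level` is a hypothetical name introduced here for illustration; it only builds the message list and does not call the model.

```python
VALID_LEVELS = ("low", "medium", "high")


def with_reasoning_level(user_prompt, level="medium"):
    """Return chat messages whose system prompt requests the given reasoning level."""
    if level not in VALID_LEVELS:
        raise ValueError(f"level must be one of {VALID_LEVELS}, got {level!r}")
    return [
        {"role": "system", "content": f"Reasoning: {level}"},
        {"role": "user", "content": user_prompt},
    ]


messages = with_reasoning_level(
    "Explain quantum mechanics clearly and concisely.",
    level="high",
)
```

The resulting `messages` list can be passed to the Transformers pipeline shown earlier, or sent as the "messages" field of an OpenAI-compatible request.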

Tool use

The gpt-oss models are excellent for:

Web browsing (using built-in browsing tools)
Function calling with defined schemas
Agentic operations like browser tasks
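
To make "function calling with defined schemas" concrete, here is a sketch of a request body in the OpenAI-compatible `tools` format. The `get_weather` tool and the `make_function_tool` helper are hypothetical, and whether a given server accepts the `tools` field depends on how it was launched, so treat this as an assumption to check against your server's documentation.

```python
import json


def make_function_tool(name, description, properties, required):
    """Describe one callable function using a JSON Schema for its arguments."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": required,
            },
        },
    }


# Hypothetical tool, used only for illustration.
get_weather_tool = make_function_tool(
    name="get_weather",
    description="Look up the current weather for a city.",
    properties={"city": {"type": "string", "description": "City name"}},
    required=["city"],
)

request_body = {
    "model": "aifeifei798/QiMing-Moe-20B-MXFP4",
    "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
    "tools": [get_weather_tool],
}
print(json.dumps(request_body, indent=2))
```

If the model decides to call the tool, the application is responsible for running the function and returning its result in a follow-up message.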

Fine-tuning

QiMing-Moe-20B-MXFP4 can be fine-tuned for a variety of specialized use cases.

Because it is a relatively small model, it can be fine-tuned on consumer hardware.

Safetensors

Model size: 2B params
Tensor type: BF16

Model tree for aifeifei798/QiMing-Moe-20B-MXFP4

Base model: openai/gpt-oss-20b