QiMing
An AI that rewrites its own rules for greater intelligence.
Result = Model Content × Math²
"Logic is the soul of a model, for it defines:
- How it learns from data (The Power of Induction);
- How it reasons and decides (The Power of Deduction);
- Its capacity to align with human values (The Ethical Boundary);
- Its potential to adapt to future challenges (The Evolutionary Potential).
If a model pursues nothing but sheer scale or computational power, ignoring the depth and breadth of its logic, it risks becoming a "paper tiger"—imposing on the surface, yet hollow at its core. Conversely, a model built upon elegant logic, even with fewer parameters, can unleash its true vitality in our complex world."
DISCLAIMER
The content generated by this model is for reference purposes only. Users are advised to verify its accuracy independently before use.
This is a 20-billion-parameter (20B) foundation model. It may produce incomplete or inaccurate information, including hallucinations.
If you find this AI too human-like, please remember: it is merely a more intelligent model — not an actual person.
Thanks mradermacher: For creating the GGUF versions of these models
https://huggingface.co/mradermacher/QiMing-Moe-20B-MXFP4-GGUF
https://huggingface.co/mradermacher/QiMing-Moe-20B-MXFP4-i1-GGUF
For developing the foundational model (aifeifei798/QiMing-Moe-20B-MXFP4) used in this project.
unsloth.ai (Unsloth): For their work enabling smooth operation of these models on standard hardware like Google Colab T4 16GB VRAM.
Thanks to Google Colab (T4, 16GB).
Highlights
- Permissive Apache 2.0 license: Build freely without copyleft restrictions or patent risk—ideal for experimentation, customization, and commercial deployment.
- Configurable reasoning effort: Easily adjust the reasoning effort (low, medium, high) based on your specific use case and latency needs.
- Full chain-of-thought: Gain complete access to the model’s reasoning process, facilitating easier debugging and increased trust in outputs. It’s not intended to be shown to end users.
- Fine-tunable: Fully customize models to your specific use case through parameter fine-tuning.
- Agentic capabilities: Use the models’ native capabilities for function calling, web browsing, Python code execution, and Structured Outputs.
- MXFP4 quantization: The model was post-trained with MXFP4 quantization of the MoE weights, so QiMing-Moe-20B-MXFP4 runs within 16GB of memory. All evals were performed with the same MXFP4 quantization.
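As a rough sanity check on the 16GB figure, the sketch below estimates weight storage under MXFP4. It assumes the Microscaling layout of 4-bit elements that share one 8-bit scale per block of 32 values. As a simplification it treats all 20B parameters as quantized; in this model only the MoE weights are, so the real footprint will differ. The function name is illustrative.

```python
def mxfp4_weight_gb(n_params: float, block_size: int = 32, scale_bits: int = 8) -> float:
    """Estimate storage in GB (10^9 bytes) for n_params weights stored as MXFP4."""
    # Each weight costs 4 bits plus its share of the block's shared scale.
    bits_per_weight = 4 + scale_bits / block_size
    return n_params * bits_per_weight / 8 / 1e9


# Simplifying assumption: every one of the 20B parameters is stored as MXFP4.
print(f"{mxfp4_weight_gb(20e9):.3f} GB")  # 10.625 GB
```

The estimate leaves headroom below 16GB for activations, the KV cache, and any weights kept at higher precision.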
Inference examples
Transformers
You can use QiMing-Moe-20B-MXFP4 with Transformers. If you use the Transformers chat template, it will automatically apply the harmony response format. If you use model.generate directly, you need to apply the harmony format manually using the chat template or use our openai-harmony package.
To get started, install the necessary dependencies to set up your environment:
pip install -U transformers kernels torch
Once set up, you can run the model with the snippet below:
from transformers import pipeline
import torch

model_id = "aifeifei798/QiMing-Moe-20B-MXFP4"

pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype="auto",
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Explain quantum mechanics clearly and concisely."},
]

outputs = pipe(
    messages,
    max_new_tokens=256,
)
print(outputs[0]["generated_text"][-1])
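When calling model.generate directly instead of the pipeline, the chat template has to be applied explicitly so that the prompt is in the harmony format. A minimal sketch using the standard Transformers loading pattern follows. The helper names `strip_prompt` and `generate_reply` and the `max_new_tokens` default are illustrative choices, not part of any library. Imports are deferred so that defining the helpers does not load the model.

```python
def strip_prompt(output_ids, prompt_len):
    """Drop the echoed prompt tokens so only newly generated tokens remain."""
    return output_ids[prompt_len:]


def generate_reply(messages, model_id="aifeifei798/QiMing-Moe-20B-MXFP4", max_new_tokens=256):
    """Apply the chat template, generate, and decode only the new tokens."""
    # Imported lazily: loading the model downloads the weights and needs suitable hardware.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

    # The chat template renders the messages into the prompt format the model expects.
    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)

    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    prompt_len = inputs["input_ids"].shape[-1]
    return tokenizer.decode(strip_prompt(outputs[0], prompt_len))
```

Calling `generate_reply([{"role": "user", "content": "Who are you?"}])` returns the decoded continuation. Because special tokens are not skipped, the text may still contain format markers.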
Alternatively, you can run the model via Transformers Serve to spin up an OpenAI-compatible webserver:
transformers serve
transformers chat localhost:8000 --model-name-or-path aifeifei798/QiMing-Moe-20B-MXFP4
Learn more about how to use gpt-oss with Transformers.
vLLM
vLLM recommends using uv for Python dependency management. You can use vLLM to spin up an OpenAI-compatible webserver. The following command will automatically download the model and start the server.
uv pip install --pre vllm==0.10.1+gptoss \
    --extra-index-url https://wheels.vllm.ai/gpt-oss/ \
    --extra-index-url https://download.pytorch.org/whl/nightly/cu128 \
    --index-strategy unsafe-best-match
vllm serve aifeifei798/QiMing-Moe-20B-MXFP4
Learn more about how to use gpt-oss with vLLM.
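Once the server is running, any OpenAI-compatible client can talk to it. The sketch below uses only the Python standard library and assumes the server listens on vLLM's default address, http://localhost:8000, with the /v1/chat/completions route. The helper names `build_chat_request` and `ask` are illustrative.

```python
import json
import urllib.request


def build_chat_request(prompt, model="aifeifei798/QiMing-Moe-20B-MXFP4",
                       base_url="http://localhost:8000"):
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, **kwargs):
    """Send the request and return the assistant's reply text (needs a running server)."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server up, `ask("What is the capital of France?")` returns the reply as a string. Pass `base_url=` if the server runs on a different host or port.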
PyTorch / Triton
To learn about how to use this model with PyTorch and Triton, check out our reference implementations in the gpt-oss repository.
LM Studio
If you are using LM Studio, you can download the model with the following command.
# QiMing-Moe-20B-MXFP4
lms get aifeifei798/QiMing-Moe-20B-MXFP4
Check out our awesome list for a broader collection of gpt-oss resources and inference partners.
Download the model
You can download the model weights from the Hugging Face Hub using the Hugging Face CLI, then chat with the local copy:
# QiMing-Moe-20B-MXFP4
huggingface-cli download aifeifei798/QiMing-Moe-20B-MXFP4 --local-dir QiMing-Moe-20B-MXFP4/
pip install gpt-oss
python -m gpt_oss.chat QiMing-Moe-20B-MXFP4/
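The download step can also be scripted from Python. This sketch assumes the `huggingface_hub` package is installed and uses its `snapshot_download` function. The helper names are illustrative, and the target directory mirrors the one used in the CLI command, without the trailing slash.

```python
def local_dir_for(repo_id):
    """Use the repository name (the part after the owner) as the target directory."""
    return repo_id.split("/")[-1]


def download_model(repo_id="aifeifei798/QiMing-Moe-20B-MXFP4"):
    """Download the repository snapshot and return the local path."""
    # Imported lazily so that defining the helper does not require huggingface_hub.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, local_dir=local_dir_for(repo_id))
```

Calling `download_model()` fetches the files into `QiMing-Moe-20B-MXFP4` under the current working directory.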
Reasoning levels
You can choose among three reasoning levels to suit your task:
- Low: Fast responses for general dialogue.
- Medium: Balanced speed and detail.
- High: Deep and detailed analysis.
The reasoning level can be set in the system prompt, e.g., "Reasoning: high".
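A minimal sketch of building a message list with the reasoning level set in the system prompt. The helper `with_reasoning` and the `VALID_LEVELS` constant are illustrative names. The resulting list can be passed to the pipeline or to a chat completions request in place of a plain user message.

```python
VALID_LEVELS = ("low", "medium", "high")


def with_reasoning(user_prompt, level="medium"):
    """Return chat messages whose system prompt sets the reasoning level."""
    if level not in VALID_LEVELS:
        raise ValueError(f"level must be one of {VALID_LEVELS}, got {level!r}")
    return [
        {"role": "system", "content": f"Reasoning: {level}"},
        {"role": "user", "content": user_prompt},
    ]


messages = with_reasoning("Explain quantum mechanics clearly and concisely.", level="high")
print(messages[0]["content"])  # Reasoning: high
```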
Tool use
The gpt-oss models are excellent for:
- Web browsing (using built-in browsing tools)
- Function calling with defined schemas
- Agentic operations like browser tasks
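Function calling with a defined schema can be sketched as follows. Everything here is hypothetical: the `get_weather` tool, its stand-in implementation, and the registry are illustrations of the pattern, not part of the model or any library. The schema uses the JSON-schema style that is common for function-calling tools.

```python
import json

# Hypothetical tool description in the JSON-schema style used for function calling.
GET_WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string", "description": "City name"}},
            "required": ["city"],
        },
    },
}


def get_weather(city):
    """Stand-in implementation; a real tool would query a weather service."""
    return {"city": city, "forecast": "unknown"}


TOOL_REGISTRY = {"get_weather": get_weather}


def dispatch_tool_call(name, arguments_json):
    """Look up the requested tool and call it with the JSON-encoded arguments."""
    if name not in TOOL_REGISTRY:
        raise KeyError(f"unknown tool: {name}")
    return TOOL_REGISTRY[name](**json.loads(arguments_json))


print(dispatch_tool_call("get_weather", '{"city": "Paris"}'))
```

With Transformers, a schema like this can be passed as `tools=[GET_WEATHER_TOOL]` to `tokenizer.apply_chat_template`, assuming the model's chat template supports tools. When the model emits a call, the tool's name and arguments are routed through `dispatch_tool_call`, and the result is returned to the model in a follow-up message.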
Fine-tuning
QiMing-Moe-20B-MXFP4 can be fine-tuned for a variety of specialized use cases.
Because it is a relatively small model, it can be fine-tuned on consumer hardware.