--- license: apache-2.0 language: - en - zh library_name: gguf pipeline_tag: text-generation tags: - gguf - minicpm - minicpm5 - llama - text-generation - long-context - tool-calling - on-device - edge-ai - quantized - llama-cpp - ollama - lm-studio datasets: - openbmb/Ultra-FineWeb - openbmb/Ultra-FineWeb-L3 - openbmb/UltraData-Math - openbmb/UltraData-SFT-2605 ---
MiniCPM Paper | GitHub Repo | 中文 | UltraData | MiniCPM Desk Pet
> This repository hosts the GGUF (llama.cpp) versions of **MiniCPM5-1B**, including **F16**, **Q8_0**, and **Q4_K_M**. For the BF16 Hugging Face weights and the full model card, please refer to [MiniCPM5-1B](https://huggingface.co/openbmb/MiniCPM5-1B). ## Highlights We are releasing **MiniCPM5-1B**, the first model in the **MiniCPM5** series. It is a dense 1B Transformer built for on-device, local deployment, and resource-constrained scenarios, reaching 1B-class open-source SOTA on the benchmark suite. 🏆 **1B-class open-source SOTA**: compared with strong open-source models in the same size class, MiniCPM5-1B reaches SOTA within this comparison set. Its advantage is most visible in agentic tool use, code generation, and difficult reasoning.  🧠 **Dual Mode Reasoning**: built-in `
**Project repo**: [OpenBMB/MiniCPM-Desk-Pet](https://github.com/OpenBMB/MiniCPM-Desk-Pet)
## Model List
Use this directory to choose the model format that matches your runtime:
- **[MiniCPM5-1B](https://huggingface.co/openbmb/MiniCPM5-1B)** · [ModelScope](https://www.modelscope.cn/models/OpenBMB/MiniCPM5-1B) · BF16 final release (post-trained with RL + OPD)
- **[MiniCPM5-1B-SFT](https://huggingface.co/openbmb/MiniCPM5-1B-SFT)** · [ModelScope](https://www.modelscope.cn/models/OpenBMB/MiniCPM5-1B-SFT) · BF16 SFT-only checkpoint (before RL / OPD)
- **[MiniCPM5-1B-Base](https://huggingface.co/openbmb/MiniCPM5-1B-Base)** · [ModelScope](https://www.modelscope.cn/models/OpenBMB/MiniCPM5-1B-Base) · BF16 base checkpoint (pre-training only)
- **[MiniCPM5-1B-GGUF](https://huggingface.co/openbmb/MiniCPM5-1B-GGUF)** · [ModelScope](https://www.modelscope.cn/models/OpenBMB/MiniCPM5-1B-GGUF) · GGUF for llama.cpp / Ollama / LM Studio **👈 you are here**
- **[MiniCPM5-1B-MLX](https://huggingface.co/openbmb/MiniCPM5-1B-MLX)** · [ModelScope](https://www.modelscope.cn/models/OpenBMB/MiniCPM5-1B-MLX) · MLX / 4bit for Apple Silicon
## Model Information
MiniCPM5-1B has the following features:
- **Type**: Causal Language Model
- **Architecture**: Standard `LlamaForCausalLM`
- **Number of Parameters**: 1,080,632,832
- **Number of Non-Embedding Parameters**: 679,552,512
- **Number of Layers**: 24
- **Number of Attention Heads (GQA)**: 16 for Q and 2 for KV
- **Context Length**: 131,072
## Introduction
MiniCPM5-1B is the first checkpoint in the MiniCPM5 series. It is designed for local assistants, coding agents, tool-use workflows, and reasoning scenarios where a compact model is preferred. The model keeps a small deployment footprint while providing native long-context support and both Think / No Think chat modes through the same checkpoint.
## Evaluation Results
We compare MiniCPM5-1B with strong open-source models in the same size class, including **LFM2.5-1.2B-Thinking**, **Qwen3-0.6B/think** and **Qwen3.5-0.8B/think**. These are capable baselines; within this comparison set, MiniCPM5-1B reaches 1B-class open-source SOTA, with its advantage most visible in tool use, code generation, and difficult reasoning. This makes it a practical choice for local coding agents, tool assistants, and reasoning assistants.

## Training Recipe
The training of MiniCPM5-1B is a full-stack practice of **[UltraData Tiered Data Management](https://ultradata.openbmb.cn/)**, covering three stages: base training, mid-training, and post-training.
During **base training**, the model goes through stable training and decay training to build core language capability and training stability. It then enters **mid-training** to further strengthen target capabilities and adapt to the target data distribution. The training corpus is released alongside the model as [Ultra-FineWeb](https://huggingface.co/datasets/openbmb/Ultra-FineWeb), [Ultra-FineWeb-L3](https://huggingface.co/datasets/openbmb/Ultra-FineWeb-L3), and [UltraData-Math](https://huggingface.co/datasets/openbmb/UltraData-Math).
During **post-training**, we proceed in three steps: **SFT**, **RL**, and **OPD**. We first use **200B tokens of deep-thinking SFT** and **200B tokens of hybrid-thinking SFT** to establish deep-thinking, hybrid-thinking, and general chat abilities; the SFT data is released as [UltraData-SFT-2605](https://huggingface.co/datasets/openbmb/UltraData-SFT-2605). We then train specialized **RL teachers** for math, code, closed-book QA, writing, and related domains, and use **On-Policy Distillation (OPD)** to distill these teachers back into one release model.

### What does RL + OPD bring?
**RL + OPD** is a key part of MiniCPM5-1B post-training. On math, code and instruction-following tasks, RL + OPD raises the average score by **↑16 points** while cutting the share of responses that hit the max-tokens budget by **↓29 percentage points**. The figures below show the two-stage Reasoning RL pipeline, score gains, and the drop in overlong responses.
**RL** combines several complementary training signals. Reasoning RL uses [DAPO-Math-17k](https://huggingface.co/datasets/BytedTsinghua-SIA/DAPO-Math-17k) to strengthen mathematical reasoning. Closed-book QA uses [TriviaQA](https://huggingface.co/datasets/mandarjoshi/trivia_qa) and [NQ-Open](https://huggingface.co/datasets/google-research-datasets/nq_open), with a system prompt that encourages the model to acknowledge uncertainty instead of guessing. Writing is trained with [LongWriter-Zero-RLData](https://huggingface.co/datasets/THU-KEG/LongWriter-Zero-RLData); instruction following and long-context comprehension use verifiable RLVR data synthesized from general corpora. For general dialogue, we build pair-wise RLHF signals from anchor responses and use a Generative Reward Model for preference judgment.

**OPD** builds on Thinking Machines Lab's [On-Policy Distillation](https://thinkingmachines.ai/blog/on-policy-distillation/) and incorporates implementation improvements from [Rethinking On-Policy Distillation](https://arxiv.org/pdf/2604.13016). In the RL framework, we use reverse KL divergence as the advantage estimate, replacing the original verification-based advantage. At each response position, we take top-k logits from both the student and teacher models, compute reverse KL on the union of the two token sets, and balance the accuracy of the RKL signal with training efficiency. OPD reuses the in-domain prompts used to train each RL teacher as distillation data, so no additional data curation is required.


## GGUF Files
This repository ships three quantizations of the MiniCPM5-1B 0517 checkpoint. All three are ready to use with vanilla `llama.cpp` / `Ollama` / `LM Studio` / `llama-cpp-python` / `llama-server` — no patches required.
| File | Size | Quantization | Recommended for |
| --- | ---: | --- | --- |
| `MiniCPM5-1B-Q4_K_M.gguf` | 657 MB | Q4_K_M | Laptops / edge devices (start here) |
| `MiniCPM5-1B-Q8_0.gguf` | 1.1 GB | Q8_0 | Near-F16 quality with smaller footprint |
| `MiniCPM5-1B-F16.gguf` | 2.1 GB | F16 | Reference precision; source for further quantization |
## Quickstart
### llama.cpp (CLI)
```bash
llama-cli -m MiniCPM5-1B-Q4_K_M.gguf -n 2048 --temp 0.7 --top-p 0.8 -c 8192
```
### llama.cpp (OpenAI-compatible HTTP server)
```bash
llama-server -m MiniCPM5-1B-Q4_K_M.gguf --port 8080 -c 8192 --jinja
```
```bash
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "MiniCPM5-1B",
"messages": [{"role":"user","content":"Who are you? Please briefly introduce yourself."}],
"temperature": 0.7,
"top_p": 0.8,
"max_tokens": 256
}'
```
### Ollama / LM Studio
Both Ollama and LM Studio can import the GGUF files directly — point them at any of the three files above and pick a model name; the bundled chat template is recognized natively.
### Think / No-Think control
MiniCPM5-1B is a thinking model. The chat template exposes an `enable_thinking` switch via `chat_template_kwargs`:
| Mode | `chat_template_kwargs` | Behaviour |
| --- | --- | --- |
| **Auto** (default) | omit | Model decides whether to use `