We desperately need to stop using multiple choice tests to evaluate LLMs.
Despite surprisingly high scores on very hard tests like MMLU Pro and GPQA, most newer small models like this one reliably get the most basic questions from the covered domains wrong, and fail at the simplest tasks. And while thinking can sometimes help, the models are so weak that there really isn't a net benefit (it mainly just causes looping and a flood of nonsense).
For example, larger models with a 50 MMLU Pro and 25 GPQA Diamond score don't have any problem answering simple questions like "What is the third planet from the sun?". Nor do they have trouble making a simple synonym list for basic words with tons of synonyms (e.g. 10 synonyms for extra). And this model's stories are incoherent, filled with egregious errors like contradictions, and even prone to looping.
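Looping of the kind described here can be flagged mechanically rather than by eye. The following is only a minimal sketch, assuming that "looping" means the same run of words recurring several times; the function names, the 6-word window and the threshold of 3 are illustrative choices, not part of any established benchmark.

```python
from collections import Counter


def max_ngram_repeats(text: str, n: int = 6) -> int:
    """Return how often the most frequent word n-gram occurs in `text`."""
    words = text.lower().split()
    if len(words) < n:
        return 0
    # Count every sliding window of n consecutive words.
    counts = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return max(counts.values())


def looks_like_looping(text: str, n: int = 6, threshold: int = 3) -> bool:
    """Flag text whose most common n-gram repeats at least `threshold` times.

    Both `n` and `threshold` are assumed values and would need tuning
    against real model outputs.
    """
    return max_ngram_repeats(text, n) >= threshold
```

A story that repeats one sentence over and over trips the check, while ordinary prose of the same length normally does not. It will miss loops that paraphrase themselves, so it is a lower bound on how often looping happens.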
Older ~1b models like Llama 3.2, and a couple of newer ones like LFM2.5, do a notably better job at a broad set of simple tasks like synonym lists and stories despite having notably lower test scores. And after testing these models from various angles, analyzing the thinking tokens, and so on, it's clear that the multiple choice nature of the tests is at the heart of why the true capabilities of small models are almost always far worse than what the scores imply.
This is largely because multiple choice tests only require the correct answer to be selected from a displayed list of options, so models don't actually need to know the relevant facts, or solve anything, to get many of the answers correct. Plus contamination, even if accidental, can push multiple choice test scores way up, since a key hint is often enough to pick the correct answer out of a lineup. For these reasons and more, ~1b models that would only score 10 or lower on hard tests like MMLU Pro and GPQA Diamond if they weren't multiple choice are now scoring as high as 50 and 25. Again, we need to stop using multiple choice tests, even if it means using far fewer test questions so they can be economically judged by a SOTA AI model. These test scores don't even roughly represent the real-world performance of this model. The same goes for most other small LLMs.
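The size of the guessing effect can be estimated with simple arithmetic. This is a rough sketch under stated assumptions, not a measurement: it assumes a model truly knows some fraction of the answers and guesses uniformly on the rest, and it assumes 4 options per question for a GPQA-style test and 10 for an MMLU-Pro-style test. The function name and the `eliminated` parameter are made up for illustration.

```python
def expected_mc_score(known: float, options: int, eliminated: int = 0) -> float:
    """Expected multiple choice score (0-100) under a simple guessing model.

    known:      fraction of questions the model can answer without the options.
    options:    number of answer choices shown per question.
    eliminated: wrong choices ruled out on average (e.g. via a remembered
                hint) before guessing among the rest.
    """
    if not 0.0 <= known <= 1.0:
        raise ValueError("known must be between 0 and 1")
    remaining = options - eliminated
    if remaining < 1:
        raise ValueError("at least one option must remain")
    # Known answers are always right; the rest are uniform guesses.
    return 100.0 * (known + (1.0 - known) / remaining)


if __name__ == "__main__":
    # Knows nothing, 4 options: pure chance.
    print(round(expected_mc_score(0.0, 4), 1))
    # Knows 10%, 10 options, no elimination.
    print(round(expected_mc_score(0.10, 10), 1))
    # Knows 10%, 10 options, but can rule out 6 distractors on average.
    print(round(expected_mc_score(0.10, 10, eliminated=6), 1))
```

Under these assumptions a model that knows nothing scores 25 on a four-option test, and a model with 10% real knowledge moves from 19 to 32.5 on a ten-option test just by ruling out distractors. Real models don't guess uniformly, so treat the numbers as an illustration of the mechanism only.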
Thanks for the feedback. This is a good discussion to have.
Indeed, MC-format tests do have their limitations, and we've been exploring different evaluation approaches to examine model capabilities from more angles. At the same time, benchmarks like MMLU-Pro and SuperGPQA remain widely used and valued reference points for assessing foundational model capability in the community. Beyond these, we also track a wide range of non-MC benchmarks across different capability dimensions. For example, we use rule-based evaluation for instruction following (IFEval: 80.4, IFBench: 46.7), agent task completion rate (τ²-Bench: 79.5), code execution pass rate (HumanEval+: 78.7, MBPP+: 63.0, LiveCodeBench v6: 33.5), exact-match grading for math (AIME 2025: 40.4, MATH-500: 91.6), and LLM-as-Judge evaluation for writing quality (WritingBench: 37.1). We want this model to work well across different tasks and real-world usage scenarios, not just look good on leaderboards.
MiniCPM5-1B is a hybrid model that supports both Think and NoThink modes via the same checkpoint. Think mode is designed for more complex tasks. It brings significant gains in math and code: for instance, AIME 2025 improves from 3.5 (NoThink) to 40.4 (Think), and MATH-500 from 54.4 to 91.6. We're also continuously improving thinking efficiency and stability to reduce unnecessary verbosity.
On contamination: we do apply rigorous decontamination to all training data. That said, this is a shared challenge for the entire community, and we're focused on improving the model's generalization ability rather than optimizing solely for benchmark numbers.
As for Llama 3.2 and LFM2.5, both are excellent models from teams we respect, and we track them closely. We've also run direct comparisons on the non-MC benchmarks above:
| Benchmark | Type | MiniCPM5-1B | LFM2.5-1.2B | Llama 3.2 1B |
|---|---|---|---|---|
| HumanEval+ | Code (pass@k) | 78.7 | 61.6 | 36.0 |
| LiveCodeBench v6 | Code (pass@k) | 33.5 | 21.3 | 3.8 |
| AIME 2025 | Math (exact-match) | 40.4 | 31.9 | 0.0 |
| MATH-500 | Math (exact-match) | 91.6 | 89.0 | 15.0 |
| τ²-Bench | Agent (task completion) | 79.5 | 19.6 | - |
| BBH | Reasoning (exact-match) | 71.9 | 57.3 | 26.1 |
| IFBench | Instruction Following (rule) | 46.7 | 41.7 | 22.9 |
These are all generation-based, execution-based, or rule-based evaluations, with no multiple choice involved.
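The table labels the code results as pass@k without stating k or the number of samples per problem. For readers unfamiliar with the metric, here is a sketch of the unbiased pass@k estimator introduced alongside HumanEval. Whether the numbers above were computed exactly this way is an assumption, and the function name is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: completions sampled for the problem.
    c: how many of those completions passed all tests.
    k: sample budget being scored (1 <= k <= n).

    Returns the probability that at least one of k completions drawn
    without replacement from the n samples passes.
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    if n - c < k:
        # Fewer than k failures exist, so any draw of k includes a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

The benchmark score is the mean of this value over all problems, times 100. With `k = 1` it reduces to the plain fraction of passing samples, `c / n`.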
Thanks again for raising these points. For issues like basic factual recall (e.g. "what is the third planet from the sun"), synonym generation, story coherence, and looping, we'd welcome concrete prompts or established benchmarks covering these. This kind of feedback helps us improve the model's coverage and generalization, and we're happy to run further tests and share improved models.
Thanks for the polite and well-reasoned response.
The primary issue I'm noticing is manifesting across all the newer models, esp. the smaller ones, which is why I tried not to single this one out. Namely, math and coding performance are becoming notably better than Llama 3.2 1b, and far better with thinking enabled, but nearly all other tasks are stagnating or regressing.
I use a set of very simple prompts, sans thinking, across all common tasks (e.g. synonym lists, core knowledge nearly all humans know, and obvious grammar/spelling errors) to get a feel for the general capability of a model. Larger models reliably get ~100% of them correct. So far LFM2.5 1.2b did the best, followed by Llama 3.2 1b, and this model did above average.
I like to keep the prompts private, but one is "Make a simple list of 9 single word synonyms for extra." I use it because there are dozens of common synonyms for extra, and weaker models find it hard to exclude multi-word synonyms, stick to 9 rather than defaulting to 10, and so on. This model stuck to 9 and to single words, and did a little better than Qwen3.5, but kept repeating the same 3-4 synonyms, including the word extra itself. Llama 3.2 1b still made mistakes but did better than most ~1b models.
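The formal constraints in that prompt (exactly 9 items, single words only, no repeats, not the target word itself) are the kind of thing a rule-based check can score without a judge model. Below is a minimal sketch, assuming the model answers with one item per line, optionally bulleted or numbered. The function name and the returned keys are hypothetical, and it does not verify that the words really are synonyms, which would need a thesaurus or a judge.

```python
import re


def check_synonym_list(response: str, target: str = "extra", expected: int = 9) -> dict:
    """Check the formal constraints of a 'list N single-word synonyms' reply."""
    items = []
    for line in response.splitlines():
        # Drop leading bullets or numbering such as "1.", "2)", "-", "*".
        item = re.sub(r"^\s*(?:\d+[.)]|[-*•])\s*", "", line)
        # Drop surrounding whitespace and trailing punctuation.
        item = item.strip().strip(".,;").lower()
        if item:
            items.append(item)
    return {
        "count_ok": len(items) == expected,
        "all_single_words": all(len(item.split()) == 1 for item in items),
        "no_repeats": len(set(items)) == len(items),
        "excludes_target": target.lower() not in items,
    }
```

A reply that repeats the same 3-4 words, or includes extra itself, fails `no_repeats` or `excludes_target` even when the count and single-word constraints are met, which matches the failure described above.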
Another simple prompt is "What is the third planet from the sun?", which for some reason nearly all 1b models claim is Mercury, even though many of them, including this one, can correctly list the planets in order from the sun (Mercury, Venus, Earth, Mars...). Yes, thinking sometimes corrects such errors, but it commonly results in looping, takes much longer (~10x more tokens), etc., so it's faster and more accurate to just use a slightly larger model without thinking. Again, thinking only starts providing notable gains in math and coding (e.g. a 3.5 to 40 math score in an example you provided).
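A free-form question like this one can be graded without showing the model any options. The sketch below is one possible strict check, assuming the reply is plain English text: it counts the answer as correct only if Earth is the single planet named. The helper names are illustrative, not from any existing benchmark.

```python
import re

PLANETS = ("mercury", "venus", "earth", "mars",
           "jupiter", "saturn", "uranus", "neptune")


def planets_mentioned(response: str) -> list[str]:
    """Return the distinct planet names in the reply, in order of appearance."""
    seen = []
    for word in re.findall(r"[a-z]+", response.lower()):
        if word in PLANETS and word not in seen:
            seen.append(word)
    return seen


def is_correct_third_planet(response: str) -> bool:
    """Strict grading: correct only if Earth is the one planet named."""
    return planets_mentioned(response) == ["earth"]
```

The strictness is deliberate: a reply that lists several planets is rejected rather than given the benefit of the doubt, since there is no option list to fall back on. A reply such as "Mercury, Venus, then Earth" is marked wrong even though it ends on the right answer, so a softer rule or a judge model would be needed to credit those.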
Anyways, if a model can't perform very simple common tasks that even dumb humans find annoyingly easy, then it's generally useless, regardless of the rising test scores (e.g. a 50+ MMLU Pro score). But to be fair, no ~1b model has come close to achieving the basic competency of slightly larger models. Additionally, tiny changes that don't notably impact the performance of humans and larger models, and which are inevitable in real-world use, such as sub-optimal phrasing and grammar/spelling errors, cause the performance to tank even further. The drop-off in quality below ~7b is sharp and, despite climbing test scores, hasn't notably improved since the days of Llama 3.2. For example, models like LFM2.5 and Llama 3.2 are generally more capable, better at seeing past prompt errors, and so on, despite being inferior to MiniCPM5 in a handful of specific domains, esp. coding and math.