Instructions to use MiniMaxAI/MiniMax-M3-MXFP8 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use MiniMaxAI/MiniMax-M3-MXFP8 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="MiniMaxAI/MiniMax-M3-MXFP8", trust_remote_code=True)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("MiniMaxAI/MiniMax-M3-MXFP8", trust_remote_code=True)
model = AutoModelForMultimodalLM.from_pretrained("MiniMaxAI/MiniMax-M3-MXFP8", trust_remote_code=True)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))


vLLM

How to use MiniMaxAI/MiniMax-M3-MXFP8 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "MiniMaxAI/MiniMax-M3-MXFP8"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "MiniMaxAI/MiniMax-M3-MXFP8",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker

docker model run hf.co/MiniMaxAI/MiniMax-M3-MXFP8

SGLang

How to use MiniMaxAI/MiniMax-M3-MXFP8 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "MiniMaxAI/MiniMax-M3-MXFP8" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "MiniMaxAI/MiniMax-M3-MXFP8",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "MiniMaxAI/MiniMax-M3-MXFP8" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "MiniMaxAI/MiniMax-M3-MXFP8",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Docker Model Runner
How to use MiniMaxAI/MiniMax-M3-MXFP8 with Docker Model Runner:
```
docker model run hf.co/MiniMaxAI/MiniMax-M3-MXFP8
```

xuebi committed about 13 hours ago

Commit

2a60e16

0 Parent(s):

initial commit

This view is limited to 50 files because it contains too many changes.

Files changed (50)

.gitattributes +36 -0
LICENSE +17 -0
README.md +88 -0
added_tokens.json +63 -0
chat_template.jinja +247 -0
config.json +356 -0
configuration_minimax_m3_vl.py +111 -0
figures/benchmark.jpeg +3 -0
figures/efficiency_gqa_vs_msa.png +0 -0
figures/logo.svg +16 -0
generation_config.json +8 -0
image_processor.py +223 -0
merges.txt +0 -0
model-00001-of-00031.safetensors +3 -0
model-00002-of-00031.safetensors +3 -0
model-00003-of-00031.safetensors +3 -0
model-00004-of-00031.safetensors +3 -0
model-00005-of-00031.safetensors +3 -0
model-00006-of-00031.safetensors +3 -0
model-00007-of-00031.safetensors +3 -0
model-00008-of-00031.safetensors +3 -0
model-00009-of-00031.safetensors +3 -0
model-00010-of-00031.safetensors +3 -0
model-00011-of-00031.safetensors +3 -0
model-00012-of-00031.safetensors +3 -0
model-00013-of-00031.safetensors +3 -0
model-00014-of-00031.safetensors +3 -0
model-00015-of-00031.safetensors +3 -0
model-00016-of-00031.safetensors +3 -0
model-00017-of-00031.safetensors +3 -0
model-00018-of-00031.safetensors +3 -0
model-00019-of-00031.safetensors +3 -0
model-00020-of-00031.safetensors +3 -0
model-00021-of-00031.safetensors +3 -0
model-00022-of-00031.safetensors +3 -0
model-00023-of-00031.safetensors +3 -0
model-00024-of-00031.safetensors +3 -0
model-00025-of-00031.safetensors +3 -0
model-00026-of-00031.safetensors +3 -0
model-00027-of-00031.safetensors +3 -0
model-00028-of-00031.safetensors +3 -0
model-00029-of-00031.safetensors +3 -0
model-00030-of-00031.safetensors +3 -0
model-00031-of-00031.safetensors +3 -0
preprocessor_config.json +32 -0
processing_minimax.py +254 -0
special_tokens_map.json +16 -0
tokenizer.json +0 -0
tokenizer_config.json +501 -0
video_processor.py +208 -0

.gitattributes ADDED

	@@ -0,0 +1,36 @@

+*.7z filter=lfs diff=lfs merge=lfs -text
+*.arrow filter=lfs diff=lfs merge=lfs -text
+*.bin filter=lfs diff=lfs merge=lfs -text
+*.bz2 filter=lfs diff=lfs merge=lfs -text
+*.ckpt filter=lfs diff=lfs merge=lfs -text
+*.ftz filter=lfs diff=lfs merge=lfs -text
+*.gz filter=lfs diff=lfs merge=lfs -text
+*.h5 filter=lfs diff=lfs merge=lfs -text
+*.joblib filter=lfs diff=lfs merge=lfs -text
+*.lfs.* filter=lfs diff=lfs merge=lfs -text
+*.mlmodel filter=lfs diff=lfs merge=lfs -text
+*.model filter=lfs diff=lfs merge=lfs -text
+*.msgpack filter=lfs diff=lfs merge=lfs -text
+*.npy filter=lfs diff=lfs merge=lfs -text
+*.npz filter=lfs diff=lfs merge=lfs -text
+*.onnx filter=lfs diff=lfs merge=lfs -text
+*.ot filter=lfs diff=lfs merge=lfs -text
+*.parquet filter=lfs diff=lfs merge=lfs -text
+*.pb filter=lfs diff=lfs merge=lfs -text
+*.pickle filter=lfs diff=lfs merge=lfs -text
+*.pkl filter=lfs diff=lfs merge=lfs -text
+*.pt filter=lfs diff=lfs merge=lfs -text
+*.pth filter=lfs diff=lfs merge=lfs -text
+*.rar filter=lfs diff=lfs merge=lfs -text
+*.safetensors filter=lfs diff=lfs merge=lfs -text
+saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+*.tar.* filter=lfs diff=lfs merge=lfs -text
+*.tar filter=lfs diff=lfs merge=lfs -text
+*.tflite filter=lfs diff=lfs merge=lfs -text
+*.tgz filter=lfs diff=lfs merge=lfs -text
+*.wasm filter=lfs diff=lfs merge=lfs -text
+*.xz filter=lfs diff=lfs merge=lfs -text
+*.zip filter=lfs diff=lfs merge=lfs -text
+*.zst filter=lfs diff=lfs merge=lfs -text
+*tfevents* filter=lfs diff=lfs merge=lfs -text
+figures/benchmark.jpeg filter=lfs diff=lfs merge=lfs -text

LICENSE ADDED

	@@ -0,0 +1,17 @@

+MINIMAX COMMUNITY LICENSE
+Copyright (c) 2026 MiniMax
+Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software for non-commercial purposes, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or provide copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
+1. The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
+2. If the Software (or any derivative works thereof) is used for any Commercial Use for your products or services:
+  1. you shall prominently display “Built with MiniMax M3” on a related website, user interface, blogpost, about page or product documentation.
+  2. you shall obtain a separate, prior written authorization from MiniMax by contacting api@minimax.io with the subject line “M3 licensing - authorization request”, if such products and services generate more than 20 million US dollars (or equivalent in other currencies) in yearly revenue; otherwise, you only need to send a one-time notice to api@minimax.io with the subject “M3 licensing — notice”.
+3. “Commercial Use” means any use of the Software or any derivative work thereof that is primarily intended for commercial advantage or monetary compensation, which includes, without limitation: (i) offering products or services to third parties for a fee, which utilize, incorporate, or rely on the Software or its derivatives, (ii) the commercial use of APIs provided by or for the Software or its derivatives, including to support or enable commercial products, services, or operations, whether in a cloud-based, hosted, or other similar environment, and (iii) the deployment or provision of the Software or its derivatives that have been subjected to post-training, fine-tuning, instruction-tuning, or any other form of modification, for any commercial purpose.
+THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
+Appendix: Prohibited Uses
+You agree you will not use, or allow others to use, the Software or any derivatives of the Software to:
+1. Generate or disseminate content prohibited by applicable laws or regulations.
+2. Assist with, engage in or otherwise support any military purpose.
+3. Exploit, harm, or attempt to exploit or harm minors.
+4. Generate or disseminate false or misleading information with the intent to cause harm.
+5. Promote discrimination, hate speech, or harmful behavior against individuals or groups based on race or ethnic origin, religion, disability, age, nationality and national origin, veteran status, sexual orientation, gender or gender identity, caste, immigration status, or any other characteristic that is associated with systemic discrimination or marginalization.

README.md ADDED

	@@ -0,0 +1,88 @@

+---
+pipeline_tag: image-text-to-text
+license: other
+license_name: minimax-community
+license_link: LICENSE
+library_name: transformers
+tags:
+  - multimodal
+  - moe
+  - agent
+  - coding
+  - video
+---
+<div align="center">
+  <img width="60%" src="figures/logo.svg" alt="MiniMax">
+</div>
+<hr>
+<p align="center">
+  <a href="https://agent.minimax.io/" target="_blank"><img src="https://img.shields.io/badge/MiniMax%20Agent-FF6C37?style=for-the-badge&logo=minimax&logoColor=white" alt="MiniMax Agent"></a>
+  <a href="https://platform.minimax.io/docs/guides/text-generation" target="_blank"><img src="https://img.shields.io/badge/API-FF6C37?style=for-the-badge&logo=minimax&logoColor=white" alt="API"></a>
+  <a href="https://www.minimax.io" target="_blank"><img src="https://img.shields.io/badge/MiniMax%20Website-FF6C37?style=for-the-badge&logo=minimax&logoColor=white" alt="MiniMax Website"></a>
+  <br>
+  <a href="https://platform.minimaxi.com/docs/faq/contact-us" target="_blank"><img src="https://img.shields.io/badge/WeChat-07C160?style=for-the-badge&logo=wechat&logoColor=white" alt="WeChat"></a>
+  <a href="https://discord.com/invite/DPC4AHFCBw" target="_blank"><img src="https://img.shields.io/badge/Discord-5865F2?style=for-the-badge&logo=discord&logoColor=white" alt="Discord"></a>
+  <a href="https://huggingface.co/MiniMaxAI" target="_blank"><img src="https://img.shields.io/badge/Hugging%20Face-FFD21E?style=for-the-badge&logo=huggingface&logoColor=black" alt="Hugging Face"></a>
+  <a href="https://github.com/MiniMax-AI/MiniMax-M3" target="_blank"><img src="https://img.shields.io/badge/GitHub-181717?style=for-the-badge&logo=github&logoColor=white" alt="GitHub"></a>
+  <a href="https://arxiv.org/abs/2606.13392" target="_blank"><img src="https://img.shields.io/badge/arXiv-2606.13392-B31B1B?style=for-the-badge&logo=arxiv&logoColor=white" alt="arXiv Paper"></a>
+  <a href="https://huggingface.co/MiniMaxAI/MiniMax-M3/blob/main/LICENSE" target="_blank"><img src="https://img.shields.io/badge/LICENSE-4CAF50?style=for-the-badge&logo=creativecommons&logoColor=white" alt="LICENSE"></a>
+</p>
+MiniMax-M3 is a native multimodal model with 1M context. It has ~428B parameters and ~23B activated parameters.
+**Highlights:**
+- **Native Multimodality:** M3 undergoes mixed-modality training from the very first step, enabling deeper semantic fusion across text, image, and video.
+- **Context Scaling via Sparse Attention:** M3 introduces MiniMax Sparse Attention (MSA) to improve long context efficiency. M3 delivers 9× prefill and 15× decode speedups compared to M2 at 1M context, reducing per-token compute to 1/20.
+- **Coding & Cowork Capability:** M3 achieves frontier-level performance across long-horizon agentic benchmarks, excelling in both coding and cowork.
+MiniMax-M3-MXFP8 is the MXFP8 quantized variant of [MiniMax-M3](https://huggingface.co/MiniMaxAI/MiniMax-M3), a native multimodal model with 1M context. It has ~428B parameters and ~23B activated parameters.
+<p align="center">
+  <img width="100%" src="figures/benchmark.jpeg">
+</p>
+## MiniMax Sparse Attention (MSA)
+M3 is powered by [**MiniMax Sparse Attention (MSA)**](https://github.com/MiniMax-AI/MSA), a high-performance sparse attention operator designed for million-token contexts. Compared with GQA, MSA dramatically reduces the attention compute and memory footprint while preserving model quality.
+<p align="center">
+  <img width="100%" src="figures/efficiency_gqa_vs_msa.png" alt="GQA vs MSA Efficiency Comparison">
+</p>
+> 📄 Read the technical report: [arXiv:2606.13392](https://arxiv.org/abs/2606.13392) · [Hugging Face Papers](https://huggingface.co/papers/2606.13392)
+## How to Use
+- [MiniMax Agent](https://agent.minimax.io/)
+- [MiniMax API](https://platform.minimax.io/)
+M3 supports two reasoning modes:
+- **thinking** — for complex reasoning, agentic tasks, and long-horizon collaboration.
+- **non-thinking** — for latency-sensitive scenarios such as chat and code completion.
+## Local Deployment
+Download the model:
+```bash
+hf download MiniMaxAI/MiniMax-M3 --local-dir MiniMax-M3
+```
+We recommend the following inference frameworks (listed alphabetically) to serve the model:
+- [SGLang](https://docs.sglang.io/) - see [SGLang cookbook](https://docs.sglang.io/cookbook/autoregressive/MiniMax/MiniMax-M3).
+- [vLLM](https://github.com/vllm-project/vllm) - see [vLLM recipes](https://recipes.vllm.ai/MiniMaxAI/MiniMax-M3).
+- [Transformers](https://github.com/huggingface/transformers) - see [Transformers docs](https://huggingface.co/docs/transformers/model_doc/minimax_m3_vl).
+### Inference Parameters
+We recommend the following parameters for best performance: `temperature=1.0`, `top_p=0.95`, `top_k=40`.
+## Contact Us
+Contact us at [model@minimax.io](mailto:model@minimax.io).
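
Reviewer note: the README recommends `temperature=1.0`, `top_p=0.95`, `top_k=40`, and `chat_template.jinja` in this commit reads a `thinking_mode` variable. The sketch below shows one way these might be combined into a request body for an OpenAI-compatible server. It rests on assumptions: `top_k` and `chat_template_kwargs` are request extensions that vLLM and SGLang accept beyond the core OpenAI schema (worth confirming for the installed version), and `build_payload` is a hypothetical helper that is not part of this repository.

```python
import json

MODEL_ID = "MiniMaxAI/MiniMax-M3-MXFP8"

# Values quoted from the README's "Inference Parameters" section.
RECOMMENDED_SAMPLING = {"temperature": 1.0, "top_p": 0.95, "top_k": 40}

# Values the chat template compares `thinking_mode` against.
THINKING_MODES = ("enabled", "disabled", "adaptive")


def build_payload(prompt, thinking_mode="adaptive"):
    """Build a chat-completions request body (hypothetical helper)."""
    if thinking_mode not in THINKING_MODES:
        raise ValueError(f"unknown thinking_mode: {thinking_mode!r}")
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        **RECOMMENDED_SAMPLING,
        # Assumption: the server forwards these kwargs to the chat template.
        "chat_template_kwargs": {"thinking_mode": thinking_mode},
    }


if __name__ == "__main__":
    print(json.dumps(build_payload("Hello", "disabled"), indent=2))
```

The resulting JSON can be sent as the `--data` body of the `curl` calls shown for vLLM and SGLang. If the server ignores `chat_template_kwargs`, the template falls back to its adaptive branch.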

added_tokens.json ADDED

	@@ -0,0 +1,63 @@

+{
+  "]!p~[": 200000,
+  "<fim_prefix>": 200001,
+  "<fim_middle>": 200002,
+  "<fim_suffix>": 200003,
+  "<fim_pad>": 200004,
+  "<reponame>": 200005,
+  "<filename>": 200006,
+  "<gh_stars>": 200007,
+  "<issue_start>": 200008,
+  "<issue_comment>": 200009,
+  "<issue_closed>": 200010,
+  "<jupyter_start>": 200011,
+  "<jupyter_text>": 200012,
+  "<jupyter_code>": 200013,
+  "<jupyter_output>": 200014,
+  "<empty_output>": 200015,
+  "<commit_before>": 200016,
+  "<commit_msg>": 200017,
+  "<commit_after>": 200018,
+  "]~b]": 200019,
+  "[e~[": 200020,
+  "]!d~[": 200021,
+  "<function_call>": 200022,
+  "<code_interpreter>": 200023,
+  "]<]speech[>[": 200024,
+  "]<]image[>[": 200025,
+  "]<]video[>[": 200026,
+  "]<]start of speech[>[": 200027,
+  "]<]end of speech[>[": 200028,
+  "]<]start of image[>[": 200029,
+  "]<]end of image[>[": 200030,
+  "]<]start of video[>[": 200031,
+  "]<]end of video[>[": 200032,
+  "]<]vision pad[>[": 200033,
+  "]~!b[": 200034,
+  "<jupyter_error>": 200035,
+  "<add_file>": 200036,
+  "<delete_file>": 200037,
+  "<rename_file>": 200038,
+  "<edit_file>": 200039,
+  "<commit_message>": 200040,
+  "<empty_source_file>": 200041,
+  "<repo_struct>": 200042,
+  "<code_context>": 200043,
+  "<file_content>": 200044,
+  "<source_files>": 200045,
+  "<pr_start>": 200046,
+  "<review_comment>": 200047,
+  "<filepath>": 200048,
+  "<file_sep>": 200049,
+  "<think>": 200050,
+  "</think>": 200051,
+  "<tool_call>": 200052,
+  "</tool_call>": 200053,
+  "]<]frame[>[": 200054,
+  "]<]start of frame[>[": 200055,
+  "]<]end of frame[>[": 200056,
+  "<|content_altered_placeholder|>": 200057,
+  "]<]minimax[>[": 200058,
+  "<mm:think>": 200059,
+  "</mm:think>": 200060
+}
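
Reviewer note: the file above maps 61 added tokens to ids 200000 through 200060, while `config.json` in the same commit sets `vocab_size` to 200064. A minimal sanity check for such a mapping might look like the sketch below. `check_added_tokens` is a hypothetical helper, and the dictionary holds only a hand-copied sample of the entries, not the whole file.

```python
# Sample of entries copied from added_tokens.json (not the full file).
ADDED_TOKENS_SAMPLE = {
    "]!p~[": 200000,
    "]~b]": 200019,
    "[e~[": 200020,
    "]<]image[>[": 200025,
    "]<]video[>[": 200026,
    "<think>": 200050,
    "</think>": 200051,
    "]<]minimax[>[": 200058,
    "<mm:think>": 200059,
    "</mm:think>": 200060,
}

VOCAB_SIZE = 200064  # text_config.vocab_size in config.json


def check_added_tokens(tokens, vocab_size):
    """Return a list of problems found in a token -> id mapping."""
    problems = []
    ids = list(tokens.values())
    if len(set(ids)) != len(ids):
        problems.append("duplicate ids")
    for token, token_id in tokens.items():
        # Every id must index a row of the embedding table.
        if not 0 <= token_id < vocab_size:
            problems.append(f"id out of range: {token!r} -> {token_id}")
    return problems


if __name__ == "__main__":
    print(check_added_tokens(ADDED_TOKENS_SAMPLE, VOCAB_SIZE) or "ok")
```

To check the real file, the sample could be swapped for `json.load(open("added_tokens.json"))`.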

chat_template.jinja ADDED

	@@ -0,0 +1,247 @@

+{# ---------- special token variables ---------- #}
+{%- set ns_token               = ']<]minimax[>['                  -%}
+{%- set bod_token              = ']~!b['                          -%}
+{%- set bos_token              = ']~b]'                           -%}
+{%- set eos_token              = '[e~['                           -%}
+{%- set toolcall_begin_token   = ns_token ~ '<tool_call>'         -%}
+{%- set toolcall_end_token     = ns_token ~ '</tool_call>'        -%}
+{%- set think_begin_token      = '<mm:think>'                     -%}
+{%- set think_end_token        = '</mm:think>'                    -%}
+{%- set image_token            = ']<]image[>['                    -%}
+{%- set video_token            = ']<]video[>['                    -%}
+{#- Thinking mode: "enabled" / "disabled" / "adaptive" / not defined -#}
+{#- Recursive XML renderer for tool_call arguments ======================== -#}
+{#- None values are intentionally skipped in mapping iteration so that
+    `<key>null</key>` (which would round-trip to the literal string "null")
+    never appears in the rendered tool_call. The convention is: omit the
+    field entirely. The top-level `_args` loop applies the same rule.
+    The `val is none` branch below is a safety net only — upstream cleaning
+    (drop_none_in_tool_arguments) should ensure no None ever reaches here. -#}
+{%- macro to_xml(val, ns) -%}
+{%- if val is mapping -%}
+{%- for k, v in val.items() if v is not none -%}
+{{ ns }}<{{ k }}>{{ to_xml(v, ns) }}{{ ns }}</{{ k }}>
+{%- endfor -%}
+{%- elif val is iterable and val is not string -%}
+{%- for item in val -%}
+{{ ns }}<item>{{ to_xml(item, ns) }}{{ ns }}</item>
+{%- endfor -%}
+{%- elif val is none -%}
+{#- Should be unreachable when upstream cleaning is applied. -#}
+{%- elif val is boolean -%}
+{{ val | tojson }}
+{%- else -%}
+{{ val }}
+{%- endif -%}
+{%- endmacro -%}
+{#- Tool Rendering Functions ============================================== -#}
+{%- macro render_tool_namespace(namespace_name, tool_list) -%}
+{%- for tool in tool_list -%}
+<tool>{{ tool.function | tojson(ensure_ascii=False) }}</tool>
+{% endfor -%}
+{%- endmacro -%}
+{%- macro visible_text(content) -%}
+    {%- if content is string -%}
+        {{ content }}
+    {%- elif content is iterable and content is not mapping -%}
+        {%- for item in content -%}
+            {%- if item is mapping and item.type == 'text' -%}
+                {{- item.text }}
+            {%- elif item is mapping and item.type == 'image' -%}
+                {{- image_token }}
+            {%- elif item is mapping and item.type == 'video' -%}
+                {{- video_token}}
+            {%- elif item is string -%}
+                {{- item }}
+            {%- endif -%}
+        {%- endfor -%}
+    {%- elif content is none -%}
+        {{- '' }}
+    {%- else -%}
+        {{- content }}
+    {%- endif -%}
+{%- endmacro -%}
+{#- System Message Construction ============================================ -#}
+{%- macro build_system_message(system_message) -%}
+    {%- if system_message and system_message.content -%}
+        {{- visible_text(system_message.content) }}
+    {%- else -%}
+        {{- 'Your model version is MiniMax-M3, developed by MiniMax. Knowledge cutoff: January 2026. Founded in early 2022, MiniMax is a global AI foundation model company committed to advancing the frontiers of AI towards AGI.' }}
+    {%- endif -%}
+    {#- Thinking mode instructions -#}
+    {{- '\n\n<thinking_instructions>\n' }}
+    {{- 'You have a thinking capability that allows you to reason step by step before responding. When thinking is enabled, wrap your reasoning in ' ~ think_begin_token ~ think_end_token ~ ' tags before your response. When thinking is disabled, begin your response directly after the ' ~ think_end_token ~ ' prefix. When thinking is adaptive, decide on your own whether to think for the current turn.\n' }}
+    {%- if thinking_mode is defined -%}
+        {%- if thinking_mode == "enabled" -%}
+            {{- 'Current thinking mode: enabled. You MUST think step by step before every response, including after receiving function/tool results.\n' }}
+        {%- elif thinking_mode == "disabled" -%}
+            {{- 'Current thinking mode: disabled. Do not output any thinking process.\n' }}
+        {%- elif thinking_mode == "adaptive" -%}
+            {{- 'Current thinking mode: adaptive. You are encouraged to think for complex decision-making, multi-step reasoning, or when analyzing function/tool results.\n' }}
+        {%- endif -%}
+    {%- else -%}
+        {{- 'Current thinking mode: adaptive. You are encouraged to think for complex decision-making, multi-step reasoning, or when analyzing function/tool results.\n' }}
+    {%- endif -%}
+    {{- '</thinking_instructions>' }}
+{%- endmacro -%}
+{%- macro build_developer_message(developer_message) -%}
+    {%- if developer_message and developer_message.content -%}
+        {{- visible_text(developer_message.content) }}
+    {%- else -%}
+        {%- if model_identity is not defined -%}
+            {%- set model_identity = "You are a helpful assistant." -%}
+        {%- endif -%}
+        {{- model_identity }}
+    {%- endif -%}
+{%- endmacro -%}
+{#- Main Template Logic ================================================= -#}
+{#- Role mapping: root -> system sp (high priority), system/developer -> developer sp (low priority) -#}
+{%- set system_message = none -%}
+{%- set developer_message = none -%}
+{%- set conversation_messages = messages -%}
+{%- if messages and messages[0].role == "root" -%}
+    {%- set system_message = messages[0] -%}
+    {%- set conversation_messages = messages[1:] -%}
+    {%- if conversation_messages and conversation_messages[0].role in ["system", "developer"] -%}
+        {%- set developer_message = conversation_messages[0] -%}
+        {%- set conversation_messages = conversation_messages[1:] -%}
+    {%- endif -%}
+{%- elif messages and messages[0].role in ["system", "developer"] -%}
+    {%- set developer_message = messages[0] -%}
+    {%- set conversation_messages = messages[1:] -%}
+{%- endif -%}
+{#- Render system sp (higher priority, root role only) -#}
+{{- bod_token ~ bos_token ~ 'system' ~ '\n' }}
+{{- build_system_message(system_message) }}
+{{- eos_token ~ '\n' }}
+{#- Render developer sp (lower priority: system/developer role + tools) -#}
+{{- bos_token ~ 'developer' ~ '\n' }}
+{{- build_developer_message(developer_message) }}
+{%- if tools -%}
+    {{- '\n\n' ~ '# Tools' ~ '\n' ~ 'You may call one or more tools to assist with the user query.\nHere are the tools available in JSONSchema format:' ~ '\n' }}
+    {{- '\n' ~ '<tools>' ~ '\n' }}
+    {{- render_tool_namespace("functions", tools) }}
+    {{- '</tools>' ~ '\n\n' }}
+    {{- 'To call tools, wrap all invocations in a single ' ~ toolcall_begin_token ~ toolcall_end_token ~ ' block. Parameter values containing nested objects or arrays are recursively expanded into XML elements. Example:\n' }}
+    {{- '\n' ~ toolcall_begin_token ~ '\n' }}
+    {{- ns_token + '<invoke name="tool-name-1">' }}
+    {{- ns_token + '<param-1>value-1' + ns_token + '</param-1>' }}
+    {{- ns_token + '<param-2>' }}
+    {{- ns_token + '<item>' }}
+    {{- ns_token + '<key-a>val-a' + ns_token + '</key-a>' }}
+    {{- ns_token + '<key-b>val-b' + ns_token + '</key-b>' }}
+    {{- ns_token + '</item>' }}
+    {{- ns_token + '</param-2>' }}
+    {{- ns_token + '</invoke>\n' }}
+    {{- ns_token + '<invoke name="tool-name-2">' }}
+    {{- ns_token + '<param-1>value-1' + ns_token + '</param-1>' }}
+    {{- ns_token + '</invoke>\n' }}
+    {{- toolcall_end_token }}
+{%- endif -%}
+{{- eos_token ~ '\n' }}
+{#- Render messages -#}
+{%- set last_tool_call = namespace(name=none) -%}
+{%- for message in conversation_messages -%}
+    {%- if message.role == 'assistant' -%}
+        {{- bos_token ~ 'ai' ~ '\n' }}
+        {%- set reasoning_content = '' %}
+        {%- set content = visible_text(message.content) %}
+        {%- if message.reasoning_content is string %}
+            {%- set reasoning_content = message.reasoning_content %}
+        {%- else %}
+            {%- if think_end_token in content %}
+                {%- set reasoning_content = content.split(think_end_token)[0].strip('\n').split(think_begin_token)[-1].strip('\n') %}
+                {%- set content = content.split(think_end_token)[-1].strip('\n') %}
+            {%- endif %}
+        {%- endif %}
+        {%- if reasoning_content -%}
+            {#- Render thinking for every assistant turn (all-turn visible) -#}
+            {{- think_begin_token ~ reasoning_content ~ think_end_token }}
+        {%- else -%}
+            {#- No thinking rendered → prefix with think_end_token -#}
+            {{- think_end_token }}
+        {%- endif -%}
+        {%- if content -%}
+            {{- content }}
+        {%- endif -%}
+        {%- if message.tool_calls -%}
+            {{- toolcall_begin_token ~ '\n' }}
+            {%- for tool_call in message.tool_calls -%}
+                {%- if tool_call.function -%}
+                    {%- set tool_call = tool_call.function -%}
+                {%- endif -%}
+{{- ns_token + '<invoke name="' + tool_call.name + '">' }}
+{%- set _args = tool_call.arguments -%}
+{%- for k, v in _args.items() if v is not none %}
+{{- ns_token + '<' + k + '>' -}}
+{{- to_xml(v, ns_token) -}}
+{{- ns_token + '</' + k + '>' }}
+{%- endfor -%}
+{{- ns_token + '</invoke>' ~ '\n' }}
+            {%- endfor -%}
+            {{- toolcall_end_token }}
+            {%- if message.tool_calls[-1].function -%}
+                {%- set last_tool_call.name = message.tool_calls[-1].function.name -%}
+            {%- else -%}
+                {%- set last_tool_call.name = message.tool_calls[-1].name -%}
+            {%- endif -%}
+        {%- else -%}
+            {%- set last_tool_call.name = none -%}
+        {%- endif -%}
+        {{- eos_token ~ '\n' }}
+    {%- elif message.role == 'tool' -%}
+        {%- if last_tool_call.name is none -%}
+            {{- raise_exception("Message has tool role, but there was no previous assistant message with a tool call!") }}
+        {%- endif -%}
+        {%- if loop.first or (conversation_messages[loop.index0 - 1].role != 'tool') -%}
+            {{- bos_token ~ 'tool' }}
+        {%- endif -%}
+        {{- '\n<response>' }}
+        {%- if message.content is string -%}
+            {{- message.content }}
+        {%- else -%}
+            {%- for tr in message.content -%}
+                {%- if tr is mapping and tr.type is defined and tr.type == 'image' -%}
+                    {{- image_token }}
+                {%- elif tr is mapping and tr.type is defined and tr.type == 'video' -%}
+                    {{- video_token }}
+                {%- else -%}
+                    {{- tr.output if tr.output is defined else (tr.text if tr.type == 'text' and tr.text is defined else tr) }}
+                {%- endif -%}
+            {%- endfor -%}
+        {%- endif -%}
+        {{- '</response>' }}
+        {%- if loop.last or (conversation_messages[loop.index0 + 1].role != 'tool') -%}
+            {{- eos_token ~ '\n' -}}
+        {%- endif -%}
+    {%- elif message.role == 'user' -%}
+        {{- bos_token ~ 'user' ~ '\n' }}
+        {{- visible_text(message.content) }}
+        {{- eos_token ~ '\n' }}
+    {%- endif -%}
+{%- endfor -%}
+{#- Generation prompt -#}
+{%- if add_generation_prompt -%}
+{{- bos_token ~ 'ai' ~ '\n' }}
+{%- if thinking_mode is defined and thinking_mode == "disabled" -%}
+    {{- think_end_token }}
+{%- elif thinking_mode is defined and thinking_mode == "adaptive" -%}
+    {#- adaptive: no prefix, let model decide -#}
+{%- elif thinking_mode is defined and thinking_mode == "enabled" -%}
+    {#- enabled: open the thinking block so the model starts by thinking -#}
+    {{- think_begin_token }}
+{%- else -%}
+    {#- not defined or unrecognized: same as adaptive, no prefix -#}
+{%- endif -%}
+{%- endif -%}
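
Reviewer note: two pieces of the template above are easier to follow outside Jinja. These are the recursive `to_xml` macro used for tool-call arguments and the `split`/`strip` chain that separates reasoning from the visible answer in assistant messages. The Python below is a sketch of my reading of that logic, not the authors' reference implementation. The function names are hypothetical, only `list`/`tuple` are treated as iterables (Jinja's `iterable` test is broader), and the exact whitespace assumes the template's `-` trim markers behave as written.

```python
import json

NS_TOKEN = "]<]minimax[>["
THINK_BEGIN = "<mm:think>"
THINK_END = "</mm:think>"


def to_xml(val, ns=NS_TOKEN):
    """Mirror of the template's to_xml macro (hypothetical port)."""
    if isinstance(val, dict):
        # None values are skipped, so `<key>null</key>` never appears.
        return "".join(
            f"{ns}<{k}>{to_xml(v, ns)}{ns}</{k}>"
            for k, v in val.items()
            if v is not None
        )
    if isinstance(val, (list, tuple)):
        return "".join(f"{ns}<item>{to_xml(item, ns)}{ns}</item>" for item in val)
    if val is None:
        return ""  # safety net, as in the template
    if isinstance(val, bool):
        return json.dumps(val)  # renders true / false, like `tojson`
    return str(val)


def render_invoke(name, arguments, ns=NS_TOKEN):
    """Render one tool call the way the assistant branch does."""
    parts = [f'{ns}<invoke name="{name}">']
    for key, value in arguments.items():
        if value is None:
            continue  # the top-level loop applies the same skip rule
        parts.append(f"{ns}<{key}>{to_xml(value, ns)}{ns}</{key}>")
    parts.append(f"{ns}</invoke>\n")
    return "".join(parts)


def split_reasoning(content):
    """Split assistant text into (reasoning, answer) as the template does."""
    if THINK_END not in content:
        return "", content
    before_end = content.split(THINK_END)[0].strip("\n")
    reasoning = before_end.split(THINK_BEGIN)[-1].strip("\n")
    answer = content.split(THINK_END)[-1].strip("\n")
    return reasoning, answer


if __name__ == "__main__":
    print(render_invoke("get_weather", {"city": "Paris", "units": None}))
    print(split_reasoning("<mm:think>\nplan\n</mm:think>\nHello"))
```

One consequence worth noting: because every tag is prefixed with the `]<]minimax[>[` namespace token, argument values that happen to contain text such as `</city>` cannot be mistaken for the closing tag.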

config.json ADDED

	@@ -0,0 +1,356 @@

+{
+  "architectures": [
+    "MiniMaxM3SparseForConditionalGeneration"
+  ],
+  "auto_map": {
+    "AutoConfig": "configuration_minimax_m3_vl.MiniMaxM3VLConfig"
+  },
+  "model_type": "minimax_m3_vl",
+  "text_config": {
+    "dtype": "bfloat16",
+    "hidden_size": 6144,
+    "intermediate_size": 3072,
+    "num_hidden_layers": 60,
+    "num_attention_heads": 64,
+    "num_key_value_heads": 4,
+    "head_dim": 128,
+    "vocab_size": 200064,
+    "max_position_embeddings": 1048576,
+    "rms_norm_eps": 1e-06,
+    "use_gemma_norm": true,
+    "attention_output_gate": false,
+    "rope_theta": 5000000,
+    "rotary_dim": 64,
+    "partial_rotary_factor": 0.5,
+    "hidden_act": "swigluoai",
+    "use_qk_norm": true,
+    "tie_word_embeddings": false,
+    "dense_intermediate_size": 12288,
+    "shared_intermediate_size": 3072,
+    "num_local_experts": 128,
+    "num_experts_per_tok": 4,
+    "n_shared_experts": 1,
+    "scoring_func": "sigmoid",
+    "use_routing_bias": true,
+    "moe_layer_freq": [
+      0,
+      0,
+      0,
+      1,
+      1,
+      1,
+      1,
+      1,
+      1,
+      1,
+      1,
+      1,
+      1,
+      1,
+      1,
+      1,
+      1,
+      1,
+      1,
+      1,
+      1,
+      1,
+      1,
+      1,
+      1,
+      1,
+      1,
+      1,
+      1,
+      1,
+      1,
+      1,
+      1,
+      1,
+      1,
+      1,
+      1,
+      1,
+      1,
+      1,
+      1,
+      1,
+      1,
+      1,
+      1,
+      1,
+      1,
+      1,
+      1,
+      1,
+      1,
+      1,
+      1,
+      1,
+      1,
+      1,
+      1,
+      1,
+      1,
+      1
+    ],
+    "qk_norm_type": "per_head",
+    "num_mtp_modules": 1,
+    "swiglu_alpha": 1.702,
+    "swiglu_limit": 7.0,
+    "routed_scaling_factor": 2.0,
+    "sparse_attention_config": {
+      "use_sparse_attention": true,
+      "sparse_index_dim": 128,
+      "sparse_num_index_heads": 4,
+      "sparse_topk_blocks": 16,
+      "sparse_block_size": 128,
+      "sparse_disable_index_value": [
+        0,
+        0,
+        0,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1
+      ],
+      "sparse_score_type": "max",
+      "sparse_init_block": 0,
+      "sparse_local_block": 1,
+      "sparse_attention_freq": [
+        0,
+        0,
+        0,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1
+      ]
+    },
+    "architectures": [
+      "MiniMaxM3SparseForCausalLM"
+    ]
+  },
+  "vision_config": {
+    "hidden_size": 1280,
+    "num_attention_heads": 16,
+    "num_hidden_layers": 32,
+    "intermediate_size": 5120,
+    "patch_size": 14,
+    "image_size": 2016,
+    "projection_dim": 6144,
+    "position_embedding_type": "rope",
+    "rope_mode": "3d",
+    "rope_theta": 10000.0,
+    "attention_dropout": 0.0,
+    "hidden_act": "gelu",
+    "initializer_factor": 1.0,
+    "initializer_range": 0.02,
+    "layer_norm_eps": 1e-05,
+    "model_type": "clip_vision_model",
+    "num_channels": 3,
+    "vocab_size": 32000,
+    "img_token_compression_config": {
+      "image_token_compression_method": "patch_merge",
+      "spatial_merge_size": 2,
+      "temporal_patch_size": 2
+    },
+    "vision_segment_max_frames": 4
+  },
+  "img_token_compression_config": {
+    "image_token_compression_method": "patch_merge",
+    "spatial_merge_size": 2,
+    "temporal_patch_size": 2
+  },
+  "image_grid_pinpoints": "[(336, 336), (336, 672), (336, 1008), (336, 1344), (336, 1680), (336, 2016), (672, 336), (672, 672), (672, 1008), (672, 1344), (672, 1680), (672, 2016), (1008, 336), (1008, 672), (1008, 1008), (1008, 1344), (1008, 1680), (1008, 2016), (1344, 336), (1344, 672), (1344, 1008), (1344, 1344), (1344, 1680), (1344, 2016), (1680, 336), (1680, 672), (1680, 1008), (1680, 1344), (1680, 1680), (1680, 2016), (2016, 336), (2016, 672), (2016, 1008), (2016, 1344), (2016, 1680), (2016, 2016)]",
+  "image_seq_length": 576,
+  "image_token_index": 200025,
+  "video_token_index": 200026,
+  "multimodal_projector_bias": true,
+  "num_reward_heads": 0,
+  "process_image_mode": "dynamic_res",
+  "projector_hidden_act": "gelu",
+  "vision_feature_layer": -1,
+  "vision_feature_select_strategy": "full",
+  "torch_dtype": "bfloat16",
+  "transformers_version": "4.52.4",
+  "projector_hidden_size": 6144,
+  "quantization_config": {
+    "quant_method": "mxfp8",
+    "activation_scheme": "dynamic",
+    "weight_block_size": [
+      1,
+      32
+    ],
+    "ignored_layers": [
+      "lm_head",
+      "model.embed_tokens",
+      "vision_tower",
+      "multi_modal_projector",
+      "patch_merge_mlp",
+      "language_model.model.layers.10.block_sparse_moe.gate",
+      "language_model.model.layers.11.block_sparse_moe.gate",
+      "language_model.model.layers.12.block_sparse_moe.gate",
+      "language_model.model.layers.13.block_sparse_moe.gate",
+      "language_model.model.layers.14.block_sparse_moe.gate",
+      "language_model.model.layers.15.block_sparse_moe.gate",
+      "language_model.model.layers.16.block_sparse_moe.gate",
+      "language_model.model.layers.17.block_sparse_moe.gate",
+      "language_model.model.layers.18.block_sparse_moe.gate",
+      "language_model.model.layers.19.block_sparse_moe.gate",
+      "language_model.model.layers.20.block_sparse_moe.gate",
+      "language_model.model.layers.21.block_sparse_moe.gate",
+      "language_model.model.layers.22.block_sparse_moe.gate",
+      "language_model.model.layers.23.block_sparse_moe.gate",
+      "language_model.model.layers.24.block_sparse_moe.gate",
+      "language_model.model.layers.25.block_sparse_moe.gate",
+      "language_model.model.layers.26.block_sparse_moe.gate",
+      "language_model.model.layers.27.block_sparse_moe.gate",
+      "language_model.model.layers.28.block_sparse_moe.gate",
+      "language_model.model.layers.29.block_sparse_moe.gate",
+      "language_model.model.layers.3.block_sparse_moe.gate",
+      "language_model.model.layers.30.block_sparse_moe.gate",
+      "language_model.model.layers.31.block_sparse_moe.gate",
+      "language_model.model.layers.32.block_sparse_moe.gate",
+      "language_model.model.layers.33.block_sparse_moe.gate",
+      "language_model.model.layers.34.block_sparse_moe.gate",
+      "language_model.model.layers.35.block_sparse_moe.gate",
+      "language_model.model.layers.36.block_sparse_moe.gate",
+      "language_model.model.layers.37.block_sparse_moe.gate",
+      "language_model.model.layers.38.block_sparse_moe.gate",
+      "language_model.model.layers.39.block_sparse_moe.gate",
+      "language_model.model.layers.4.block_sparse_moe.gate",
+      "language_model.model.layers.40.block_sparse_moe.gate",
+      "language_model.model.layers.41.block_sparse_moe.gate",
+      "language_model.model.layers.42.block_sparse_moe.gate",
+      "language_model.model.layers.43.block_sparse_moe.gate",
+      "language_model.model.layers.44.block_sparse_moe.gate",
+      "language_model.model.layers.45.block_sparse_moe.gate",
+      "language_model.model.layers.46.block_sparse_moe.gate",
+      "language_model.model.layers.47.block_sparse_moe.gate",
+      "language_model.model.layers.48.block_sparse_moe.gate",
+      "language_model.model.layers.49.block_sparse_moe.gate",
+      "language_model.model.layers.5.block_sparse_moe.gate",
+      "language_model.model.layers.50.block_sparse_moe.gate",
+      "language_model.model.layers.51.block_sparse_moe.gate",
+      "language_model.model.layers.52.block_sparse_moe.gate",
+      "language_model.model.layers.53.block_sparse_moe.gate",
+      "language_model.model.layers.54.block_sparse_moe.gate",
+      "language_model.model.layers.55.block_sparse_moe.gate",
+      "language_model.model.layers.56.block_sparse_moe.gate",
+      "language_model.model.layers.57.block_sparse_moe.gate",
+      "language_model.model.layers.58.block_sparse_moe.gate",
+      "language_model.model.layers.59.block_sparse_moe.gate",
+      "language_model.model.layers.6.block_sparse_moe.gate",
+      "language_model.model.layers.7.block_sparse_moe.gate",
+      "language_model.model.layers.8.block_sparse_moe.gate",
+      "language_model.model.layers.9.block_sparse_moe.gate"
+    ]
+  }
+}
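As a reading aid for the config.json hunk above, the following sketch recomputes a few quantities that the file implies but does not state. The constants are copied from the hunk. Reading `moe_layer_freq` as "0 = dense layer, 1 = MoE layer" is an assumption, inferred from the fact that `ignored_layers` lists a `block_sparse_moe.gate` only for layers 3 to 59. The helper names `summarize` and `moe_gate_names` are made up for this illustration.

```python
# Values copied from config.json above. The interpretations in the comments
# are assumptions made for illustration, not statements from the model authors.
NUM_HIDDEN_LAYERS = 60
NUM_ATTENTION_HEADS = 64
NUM_KEY_VALUE_HEADS = 4
HEAD_DIM = 128
PARTIAL_ROTARY_FACTOR = 0.5

# "moe_layer_freq" as listed: three zeros followed by 57 ones.
MOE_LAYER_FREQ = [0] * 3 + [1] * 57


def summarize() -> dict:
    """Derive layer counts and head ratios from the copied constants."""
    moe_layers = [i for i, flag in enumerate(MOE_LAYER_FREQ) if flag == 1]
    return {
        "dense_layers": NUM_HIDDEN_LAYERS - len(moe_layers),
        "moe_layers": len(moe_layers),
        "query_heads_per_kv_head": NUM_ATTENTION_HEADS // NUM_KEY_VALUE_HEADS,
        "rotary_dim": int(HEAD_DIM * PARTIAL_ROTARY_FACTOR),
    }


def moe_gate_names() -> list[str]:
    """Rebuild the router gate names in the order used by "ignored_layers"."""
    names = [
        f"language_model.model.layers.{i}.block_sparse_moe.gate"
        for i, flag in enumerate(MOE_LAYER_FREQ)
        if flag == 1
    ]
    # Plain string sort, so "layers.10" sorts before "layers.3".
    return sorted(names)


if __name__ == "__main__":
    print(summarize())
    print(moe_gate_names()[:2])
```

The string sort explains why `layers.3` appears between `layers.29` and `layers.30` in the `ignored_layers` list.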

configuration_minimax_m3_vl.py ADDED

	@@ -0,0 +1,111 @@

+"""HuggingFace configs for the MiniMax VL family (M2 VL / M3 VL).
+This file is bundled into every converted HF checkpoint so that loading via
+``AutoConfig.from_pretrained(..., trust_remote_code=True)`` works without any
+runtime dependency on sglang or other internal packages — only stock
+``transformers`` is required.
+The class definitions intentionally mirror
+``sglang.srt.configs.minimax_vl``; if either side changes, keep them in sync.
+The file is named ``configuration_minimax_m3_vl.py`` (matching the legacy
+``model_type="minimax_m3_vl"`` and the converter's ``auto_map`` entry) so
+that ckpts produced by this converter remain loadable by older sglang versions
+that only know the ``MiniMaxM3VL*`` names. The canonical class is
+``MiniMaxM3VLConfig``; the same-named subclass defined at the end of this file
+is a thin BC alias whose only purpose is to be referenced from ``auto_map``.
+"""
+from typing import Optional
+from transformers.configuration_utils import PretrainedConfig
+from transformers.models.auto import CONFIG_MAPPING
+def _coerce_sub_config(
+    sub_config: Optional[dict], default_model_type: str
+) -> Optional[PretrainedConfig]:
+    """Convert a config dict to a ``PretrainedConfig`` instance.
+    If ``model_type`` is registered in HF ``CONFIG_MAPPING`` the corresponding
+    config class is used; otherwise we fall back to a generic
+    ``PretrainedConfig`` so all dict keys still become real attributes (M3's
+    text backbone uses ``model_type="minimax_m2"`` which is not in
+    ``CONFIG_MAPPING``).
+    """
+    if not isinstance(sub_config, dict):
+        return sub_config
+    model_type = sub_config.get("model_type", default_model_type)
+    cls = CONFIG_MAPPING.get(model_type, PretrainedConfig)
+    return cls(**sub_config)
+class MiniMaxVLBaseConfig(PretrainedConfig):
+    """Base config shared by every MiniMax VL variant.
+    Handles vision/text sub-config coercion. Concrete subclasses only need to
+    declare a unique ``model_type`` string.
+    """
+    def __init__(
+        self,
+        vision_config: Optional[dict] = None,
+        text_config: Optional[dict] = None,
+        image_token_index: int = 200025,
+        video_token_index: int = 200026,
+        image_seq_length: int = 576,
+        process_image_mode: str = "dynamic_res",
+        projector_hidden_act: str = "gelu",
+        multimodal_projector_bias: bool = True,
+        vision_feature_layer: int = -1,
+        vision_feature_select_strategy: str = "full",
+        img_token_compression_config: Optional[dict] = None,
+        image_grid_pinpoints: Optional[str] = None,
+        **kwargs,
+    ):
+        self.vision_config = _coerce_sub_config(vision_config, "clip_vision_model")
+        self.text_config = _coerce_sub_config(text_config, "mixtral")
+        self.image_token_index = image_token_index
+        self.video_token_index = video_token_index
+        self.image_seq_length = image_seq_length
+        self.process_image_mode = process_image_mode
+        self.projector_hidden_act = projector_hidden_act
+        self.multimodal_projector_bias = multimodal_projector_bias
+        self.vision_feature_layer = vision_feature_layer
+        self.vision_feature_select_strategy = vision_feature_select_strategy
+        self.img_token_compression_config = img_token_compression_config or {}
+        self.image_grid_pinpoints = image_grid_pinpoints
+        super().__init__(**kwargs)
+    def __post_init__(self, **kwargs):
+        super().__post_init__(**kwargs)
+        if hasattr(self, "vision_config"):
+            self.vision_config = _coerce_sub_config(self.vision_config, "clip_vision_model")
+        if hasattr(self, "text_config"):
+            self.text_config = _coerce_sub_config(self.text_config, "mixtral")
+class MiniMaxM2VLConfig(MiniMaxVLBaseConfig):
+    """MiniMax M2 VL: vision tower + M2 (Mixtral-style MoE) text backbone."""
+    model_type = "minimax_m2_vl"
+class MiniMaxM3VLConfig(MiniMaxVLBaseConfig):
+    """MiniMax M3 VL: vision tower + M3 (mixed sparse/dense MoE) text backbone."""
+    model_type = "minimax_m3_vl"
+class MiniMaxM2MiniVLConfig(MiniMaxM2VLConfig):
+    """Legacy alias kept so old ``model_type="minimax_m2_mini_vl"`` ckpts load."""
+    model_type = "minimax_m2_mini_vl"
+class MiniMaxM3VLConfig(MiniMaxM3VLConfig):
+    """Legacy alias kept so old ``model_type="minimax_m3_vl"`` ckpts load."""
+    model_type = "minimax_m3_vl"
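The sub-config coercion in `_coerce_sub_config` above follows a common pattern: look the `model_type` up in a registry and fall back to a generic class when it is not registered. The sketch below reproduces that pattern with the standard library only, so it runs without `transformers`. `GenericConfig`, `ClipVisionConfig`, and `REGISTRY` are hypothetical stand-ins for `PretrainedConfig`, a registered config class, and `CONFIG_MAPPING`; they are not the real classes.

```python
from typing import Optional, Union


class GenericConfig:
    """Hypothetical stand-in for PretrainedConfig: stores kwargs as attributes."""

    def __init__(self, **kwargs):
        for key, value in kwargs.items():
            setattr(self, key, value)


class ClipVisionConfig(GenericConfig):
    """Hypothetical config class registered for one known model_type."""


# Hypothetical registry playing the role of CONFIG_MAPPING.
REGISTRY = {"clip_vision_model": ClipVisionConfig}


def coerce_sub_config(
    sub_config: Union[dict, GenericConfig, None], default_model_type: str
) -> Optional[GenericConfig]:
    # None or an already-built config object passes through unchanged.
    if not isinstance(sub_config, dict):
        return sub_config
    model_type = sub_config.get("model_type", default_model_type)
    # Unregistered model types fall back to the generic class, so every
    # dict key still becomes an attribute.
    cls = REGISTRY.get(model_type, GenericConfig)
    return cls(**sub_config)


if __name__ == "__main__":
    vision = coerce_sub_config({"hidden_size": 1280}, "clip_vision_model")
    text = coerce_sub_config({"model_type": "unregistered", "hidden_size": 6144}, "mixtral")
    print(type(vision).__name__, vision.hidden_size)
    print(type(text).__name__, text.hidden_size)
```

The pass-through branch is what makes it safe to call the coercion twice, as the base config does in both `__init__` and `__post_init__`.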

figures/benchmark.jpeg ADDED

Git LFS Details

SHA256: b4bc02e54f508f540e71a9286905477c780934bb79c0b17cd5892b6338313e57
Pointer size: 132 Bytes
Size of remote file: 4.42 MB

figures/efficiency_gqa_vs_msa.png ADDED

figures/logo.svg ADDED

generation_config.json ADDED

	@@ -0,0 +1,8 @@

+{
+  "bos_token_id": 200019,
+  "do_sample": true,
+  "eos_token_id": 200020,
+  "temperature": 1.0,
+  "top_p": 0.95,
+  "transformers_version": "4.46.1"
+}
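The generation_config.json above enables sampling with `temperature` 1.0 and `top_p` 0.95. As a rough illustration of what those two settings mean, here is a minimal nucleus-sampling sketch in plain Python. It follows the textbook definition (keep the smallest set of most-probable tokens whose cumulative probability reaches `top_p`) and is not the `transformers` implementation, which may treat ties and boundaries differently. The function names `top_p_filter` and `sample` are invented for this example.

```python
import math
import random


def top_p_filter(
    logits: list[float], temperature: float = 1.0, top_p: float = 0.95
) -> dict[int, float]:
    """Return {token_id: renormalized probability} for the nucleus."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Walk tokens from most to least probable until top_p mass is covered.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for idx in order:
        kept.append(idx)
        cumulative += probs[idx]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


def sample(logits, temperature=1.0, top_p=0.95, rng=None) -> int:
    """Draw one token id from the nucleus distribution."""
    rng = rng or random.Random()
    nucleus = top_p_filter(logits, temperature, top_p)
    ids = list(nucleus)
    return rng.choices(ids, weights=[nucleus[i] for i in ids], k=1)[0]


if __name__ == "__main__":
    toy_logits = [math.log(p) for p in (0.6, 0.3, 0.08, 0.02)]
    print(sorted(top_p_filter(toy_logits)))
    print(sample(toy_logits, rng=random.Random(0)))
```

With a temperature of 1.0 the logits are left unscaled, so only the `top_p` cutoff changes the distribution.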

image_processor.py ADDED

	@@ -0,0 +1,223 @@

+# Copyright 2023-2024 SGLang Team
+# Licensed under the Apache License, Version 2.0 (the "License");
+"""
+MiniMax VL family HuggingFace-compatible Processor, ImageProcessor, VideoProcessor.
+"""
+import math
+from typing import List, Tuple
+import torch
+from torchvision.transforms import InterpolationMode
+from transformers import BatchFeature
+from transformers.image_processing_utils_fast import (
+    BaseImageProcessorFast,
+    group_images_by_shape,
+    reorder_images,
+)
+from transformers.image_utils import PILImageResampling, SizeDict
+from transformers.processing_utils import (
+    ImagesKwargs,
+    Unpack,
+)
+from transformers.utils import TensorType
+MAX_RATIO = 200
+def round_by_factor(number: int, factor: int) -> int:
+    return round(number / factor) * factor
+def ceil_by_factor(number: int, factor: int) -> int:
+    return math.ceil(number / factor) * factor
+def floor_by_factor(number: int, factor: int) -> int:
+    return math.floor(number / factor) * factor
+def smart_resize(
+    height: int,
+    width: int,
+    factor: int = 28,
+    min_pixels: int = 4 * 28 * 28,
+    max_pixels: int = 451584,
+) -> tuple[int, int]:
+    if max(height, width) / min(height, width) > MAX_RATIO:
+        raise ValueError(
+            f"absolute aspect ratio must be smaller than {MAX_RATIO}, "
+            f"got {max(height, width) / min(height, width)}"
+        )
+    h_bar = max(factor, round_by_factor(height, factor))
+    w_bar = max(factor, round_by_factor(width, factor))
+    if h_bar * w_bar > max_pixels:
+        beta = math.sqrt((height * width) / max_pixels)
+        h_bar = floor_by_factor(height / beta, factor)
+        w_bar = floor_by_factor(width / beta, factor)
+    elif h_bar * w_bar < min_pixels:
+        beta = math.sqrt(min_pixels / (height * width))
+        h_bar = ceil_by_factor(height * beta, factor)
+        w_bar = ceil_by_factor(width * beta, factor)
+    return h_bar, w_bar
+# ==============================================================================
+# MiniMax M3 VL Image Processor Fast (Fast Mode - Torch based)
+# ==============================================================================
+class MiniMaxM3VLImageProcessorKwargs(ImagesKwargs, total=False):
+    patch_size: int
+    temporal_patch_size: int
+    merge_size: int
+    max_pixels: int
+class MiniMaxM3VLImageProcessor(BaseImageProcessorFast):
+    do_resize = True
+    resample = PILImageResampling.BICUBIC
+    size = {"height": 672, "width": 672}  # required by base class validation, not used as resize bound
+    default_to_square = False
+    do_rescale = True
+    rescale_factor = 1 / 255
+    do_normalize = True
+    image_mean = [0.48145466, 0.4578275, 0.40821073]
+    image_std = [0.26862954, 0.26130258, 0.27577711]
+    do_convert_rgb = True
+    patch_size = 14
+    temporal_patch_size = 2
+    merge_size = 2
+    max_pixels = 451584             # 672*672
+    valid_kwargs = MiniMaxM3VLImageProcessorKwargs
+    model_input_names = ["pixel_values", "image_grid_thw"]
+    def __init__(self, **kwargs: Unpack[MiniMaxM3VLImageProcessorKwargs]):
+        super().__init__(**kwargs)
+    def preprocess(
+        self, images, **kwargs: Unpack[MiniMaxM3VLImageProcessorKwargs]
+    ) -> BatchFeature:
+        return super().preprocess(images, **kwargs)
+    def _preprocess(
+        self,
+        images: List[torch.Tensor],
+        do_resize: bool,
+        size: SizeDict,
+        resample: PILImageResampling | InterpolationMode | int | None,
+        do_rescale: bool,
+        rescale_factor: float,
+        do_normalize: bool,
+        image_mean: float | List[float] | None,
+        image_std: float | List[float] | None,
+        patch_size: int,
+        temporal_patch_size: int,
+        merge_size: int,
+        max_pixels: int,
+        disable_grouping: bool | None,
+        return_tensors: str | TensorType | None,
+        **kwargs,
+    ) -> BatchFeature:
+        grouped_images, grouped_images_index = group_images_by_shape(
+            images, disable_grouping=disable_grouping
+        )
+        resized_images_grouped = {}
+        factor = patch_size * merge_size
+        for shape, stacked_images in grouped_images.items():
+            height, width = stacked_images.shape[-2:]
+            if do_resize:
+                resized_height, resized_width = smart_resize(
+                    height, width, factor=factor,
+                    max_pixels=max_pixels,
+                )
+                stacked_images = self.resize(
+                    stacked_images,
+                    size=SizeDict(height=resized_height, width=resized_width),
+                    resample=resample,
+                )
+            resized_images_grouped[shape] = stacked_images
+        resized_images = reorder_images(resized_images_grouped, grouped_images_index)
+        grouped_images, grouped_images_index = group_images_by_shape(
+            resized_images, disable_grouping=disable_grouping
+        )
+        processed_images_grouped = {}
+        processed_grids = {}
+        for shape, stacked_images in grouped_images.items():
+            resized_height, resized_width = stacked_images.shape[-2:]
+            patches = self.rescale_and_normalize(
+                stacked_images,
+                do_rescale,
+                rescale_factor,
+                do_normalize,
+                image_mean,
+                image_std,
+            )
+            if patches.ndim == 4:
+                patches = patches.unsqueeze(1)
+            if patches.shape[1] % temporal_patch_size != 0:
+                repeats = patches[:, -1:].repeat(
+                    1,
+                    temporal_patch_size - (patches.shape[1] % temporal_patch_size),
+                    1,
+                    1,
+                    1,
+                )
+                patches = torch.cat([patches, repeats], dim=1)
+            batch_size, grid_t, channel = patches.shape[:3]
+            grid_t = grid_t // temporal_patch_size
+            grid_h, grid_w = resized_height // patch_size, resized_width // patch_size
+            patches = patches.view(
+                batch_size,
+                grid_t,
+                temporal_patch_size,
+                channel,
+                grid_h // merge_size,
+                merge_size,
+                patch_size,
+                grid_w // merge_size,
+                merge_size,
+                patch_size,
+            )
+            patches = patches.permute(0, 1, 4, 7, 5, 8, 3, 2, 6, 9)
+            flatten_patches = patches.reshape(
+                batch_size,
+                grid_t * grid_h * grid_w,
+                channel * temporal_patch_size * patch_size * patch_size,
+            )
+            processed_images_grouped[shape] = flatten_patches
+            processed_grids[shape] = [[grid_t, grid_h, grid_w]] * batch_size
+        processed_images = reorder_images(
+            processed_images_grouped, grouped_images_index
+        )
+        processed_grids = reorder_images(processed_grids, grouped_images_index)
+        pixel_values = torch.cat(processed_images, dim=0)
+        image_grid_thw = torch.tensor(processed_grids, dtype=torch.long)
+        return BatchFeature(
+            data={"pixel_values": pixel_values, "image_grid_thw": image_grid_thw},
+            tensor_type=return_tensors,
+        )
+    def get_number_of_image_patches(self, height: int, width: int, images_kwargs=None):
+        images_kwargs = images_kwargs or {}
+        patch_size = images_kwargs.get("patch_size", self.patch_size)
+        merge_size = images_kwargs.get("merge_size", self.merge_size)
+        max_pixels = images_kwargs.get("max_pixels", self.max_pixels)
+        resized_height, resized_width = smart_resize(
+            height, width, factor=patch_size * merge_size,
+            max_pixels=max_pixels,
+        )
+        grid_h, grid_w = resized_height // patch_size, resized_width // patch_size
+        return grid_h * grid_w
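To make the resize and patch arithmetic in image_processor.py concrete, the sketch below copies the `smart_resize` logic into a standalone function and computes patch counts for two input sizes. The defaults (`patch_size` 14, `merge_size` 2, `max_pixels` 451584) come from the class attributes above. The `patch_counts` helper and the "merged tokens" figure are assumptions for illustration: the figure assumes each `merge_size` x `merge_size` group of patches ends up as one token, which the file itself does not state.

```python
import math

MAX_RATIO = 200


def smart_resize(height, width, factor=28, min_pixels=4 * 28 * 28, max_pixels=451584):
    """Standalone copy of the smart_resize arithmetic shown above."""
    if max(height, width) / min(height, width) > MAX_RATIO:
        raise ValueError(f"absolute aspect ratio must be smaller than {MAX_RATIO}")
    # Snap both sides to the nearest multiple of `factor` (at least one factor).
    h_bar = max(factor, round(height / factor) * factor)
    w_bar = max(factor, round(width / factor) * factor)
    if h_bar * w_bar > max_pixels:
        # Too large: shrink uniformly, rounding down to a multiple of factor.
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = math.floor(height / beta / factor) * factor
        w_bar = math.floor(width / beta / factor) * factor
    elif h_bar * w_bar < min_pixels:
        # Too small: grow uniformly, rounding up to a multiple of factor.
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = math.ceil(height * beta / factor) * factor
        w_bar = math.ceil(width * beta / factor) * factor
    return h_bar, w_bar


def patch_counts(height, width, patch_size=14, merge_size=2, max_pixels=451584):
    """Hypothetical helper: resized size, raw patch count, merged-token count."""
    resized_h, resized_w = smart_resize(
        height, width, factor=patch_size * merge_size, max_pixels=max_pixels
    )
    grid_h, grid_w = resized_h // patch_size, resized_w // patch_size
    patches = grid_h * grid_w
    # Assumption: one token per merge_size x merge_size group of patches.
    merged = patches // (merge_size * merge_size)
    return (resized_h, resized_w), patches, merged


if __name__ == "__main__":
    print(patch_counts(1000, 1500))
    print(patch_counts(448, 448))
```

A 1000 x 1500 input exceeds `max_pixels`, so it is scaled down to 532 x 812, giving a 38 x 58 grid of 2204 patches. A 448 x 448 input is already a multiple of 28 and within bounds, so it keeps its size and yields 1024 patches.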

merges.txt ADDED

The diff for this file is too large to render.

model-00001-of-00031.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:989c678e6df4e9a8587e1ac2b7eea6705d439864e9466952766dc074ffc3a852
+size 8303674712

model-00002-of-00031.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:d322e81689aac0c78844b8b02563fa8e5caf390712564309d2394de7a1a5e66c
+size 16098618792

model-00003-of-00031.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:f95a025468296d895c7bf140dec15e6b1ed19123bcd2b200783f1411cde032a7
+size 16098618872

model-00004-of-00031.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:6e314deac313a4982ad550495d39d3e5c69cee5801f147e9bf8e598f186fddb4
+size 16098619208

model-00005-of-00031.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:dcf591bdbd30aad81b68d805cacc27da765b56da9d120e22c9db4143e3883bae
+size 16098619672

model-00006-of-00031.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:0f78bcc3b7a01df84d5937373255078d47ecad71a35139ed83db69266e0d1503
+size 16098619128

model-00007-of-00031.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:2565c3f341c0500871075b9aff5120148aeef975a213356467982d20782372c4
+size 16098619848

model-00008-of-00031.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:ec933f1c62df1e58d01caf0799c3ea95890ff92be49b416196d6f1da22fee73a
+size 16098619768

model-00009-of-00031.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:7b9c09bec8c9b5a5b04a275bfe7611e5a5ddaf1840ef5f1afac3bf03b53d0d00
+size 16098619848

model-00010-of-00031.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:a11d745ba9b720208903ff702b7e2242270768b2e142ea5d6836461c5ff2fb2e
+size 16098619800

model-00011-of-00031.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:a29f8db47eab1c696944a52e19baabf889671ab6d65c0984c000804fd87cc762
+size 16098619800

model-00012-of-00031.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:86ba996a026126a4511f15d97b388b7e165b0a5fb62e298661d90bc44b49bc81
+size 16098619768

model-00013-of-00031.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:59b7686d653c0cfd71422893b9257e0e0adc3086c908a5e4494bb9ef84f4a15b
+size 16098619800

model-00014-of-00031.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:72043c0985a67fdb7c616c45229d89a50565ec709125736a7d2934b659af471c
+size 16098619768

model-00015-of-00031.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:b7079291acf884db350919f076c79a803fbebd00d5bedf6f1109d5eb24636cab
+size 16098619768

model-00016-of-00031.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:d9621408065fe642f8a66463fb8b377270b941388974fb8686604b34ac42f732
+size 16098619768

model-00017-of-00031.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:810cbc1f23df7cf70951c1825b0d38278f6d0d1b2baff3d749251da2a9fb7245
+size 16098619768

model-00018-of-00031.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:9259e7fb22e50e247d46e2be42bb21c3d217c008d5971a9e5ec69b65463a113a
+size 16098619768

model-00019-of-00031.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:fb981287715b905ea1549fe486800f00e93cc297419dd1b318452d80b8e966e3
+size 16098619768

model-00020-of-00031.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:8b8476f1c8960c48ddb331f245907c1a156b0e028010d1c72554ec7112c7636b
+size 16098619760

model-00021-of-00031.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:f3fe2b51ce92a857d228c82622a25a692b373452d59015ad2cf88802513809ea
+size 16098619768

model-00022-of-00031.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:13d4bfc0bb381e5eb7a45e301086d5f30d958ece5f4eeb0a8aca8557fb73c827
+size 16098619768

model-00023-of-00031.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:866c0f29b5fc5736e93055fc3fc3b2b3038cc778ed4d468bc60e2a4150b686b6
+size 16098619768

model-00024-of-00031.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:139eb26aafd48036f975817cdf91589b03e40c4a55d95b3f1bc7bcb596a21ad3
+size 16098619768

model-00025-of-00031.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:4ae53cd8750d06cd6dd740ad12f5500e13038ba33eeb80c1b5c1986f7cb1b4c6
+size 16098619768

model-00026-of-00031.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:22acb94253850e683d9f913f283526efa41c2c12a78d829396fa0e5d6412e453
+size 16098619768

model-00027-of-00031.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:a5e05dbb0a5983471adca1e6c6a93ada4cb262c16f2ada0527015597868f211f
+size 16098619768

model-00028-of-00031.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:e1f73b7b51d07c1ada5ac918cce5efe00113917c01c9244cd97d43618e28a24c
+size 506815888

model-00029-of-00031.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:176732fbd63cf7b3ae4308b58b52e02c04d1be16c5e0d090687ba6d114ee3b1b
+size 12246524552

model-00030-of-00031.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:eb938a5398f07f22e94eb0493aa6b0553af0ed7988885fd7579469ebd4d86e83
+size 2063975528

model-00031-of-00031.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:61bda86ec2d769a254586184f36e4eff64f17730bbe2c72c7da0c0682db28f35
+size 2063975528

preprocessor_config.json ADDED

	@@ -0,0 +1,32 @@

+{
+  "processor_class": "MiniMaxVLProcessor",
+  "auto_map": {
+    "AutoImageProcessor": "image_processor.MiniMaxM3VLImageProcessor",
+    "AutoProcessor": "processing_minimax.MiniMaxVLProcessor",
+    "AutoVideoProcessor": "video_processor.MiniMaxM3VLVideoProcessor"
+  },
+  "process_image_mode": "dynamic_res",
+  "image_mean": [
+    0.48145466,
+    0.4578275,
+    0.40821073
+  ],
+  "image_std": [
+    0.26862954,
+    0.26130258,
+    0.27577711
+  ],
+  "size": [
+    672,
+    672
+  ],
+  "patch_size": 14,
+  "img_token_compression_config": {
+    "image_token_compression_threshold": 1.1,
+    "image_token_compression_method": "patch_merge",
+    "max_image_resolution": 1008,
+    "spatial_merge_size": 2,
+    "temporal_patch_size": 2
+  },
+  "add_start_end_special_tokens": true
+}
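The `image_mean`, `image_std`, `size`, and `patch_size` values in preprocessor_config.json above can be read as follows. The sketch applies the usual rescale-then-normalize step to a single RGB pixel and counts tokens for a 672 x 672 input. The rescale factor of 1/255 is taken from the image processor class rather than from this JSON file. Both helper names are invented for the example, and the token count assumes one token per 2 x 2 group of patches.

```python
# Values copied from preprocessor_config.json above.
IMAGE_MEAN = (0.48145466, 0.4578275, 0.40821073)
IMAGE_STD = (0.26862954, 0.26130258, 0.27577711)
PATCH_SIZE = 14
SPATIAL_MERGE_SIZE = 2

# Taken from the image processor class attributes, not from this JSON file.
RESCALE_FACTOR = 1 / 255


def normalize_pixel(rgb: tuple[int, int, int]) -> tuple[float, ...]:
    """Rescale 0..255 channel values to 0..1, then standardize per channel."""
    return tuple(
        (channel * RESCALE_FACTOR - mean) / std
        for channel, mean, std in zip(rgb, IMAGE_MEAN, IMAGE_STD)
    )


def tokens_at_size(height: int, width: int) -> int:
    """Hypothetical helper: token count, assuming one token per merged patch group."""
    grid_h = height // PATCH_SIZE // SPATIAL_MERGE_SIZE
    grid_w = width // PATCH_SIZE // SPATIAL_MERGE_SIZE
    return grid_h * grid_w


if __name__ == "__main__":
    print(normalize_pixel((0, 0, 0)))
    print(normalize_pixel((255, 255, 255)))
    print(tokens_at_size(672, 672))
```

Under that assumption a 672 x 672 image gives 24 x 24 = 576 tokens, the same number as `image_seq_length` in config.json. Whether that relationship is intended is not stated in the files shown here.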

processing_minimax.py ADDED

	@@ -0,0 +1,254 @@

+# Copyright 2023-2024 SGLang Team
+# Licensed under the Apache License, Version 2.0 (the "License");
+"""
+MiniMax VL family HuggingFace-compatible Processor, ImageProcessor, VideoProcessor.
+"""
+import math
+import re
+from typing import List, Optional, Tuple, Union
+import torch
+import torchvision
+from torchvision.transforms import InterpolationMode
+from transformers import BatchFeature
+from transformers.image_processing_utils_fast import (
+    BaseImageProcessorFast,
+    group_images_by_shape,
+    reorder_images,
+)
+from transformers.image_utils import PILImageResampling, SizeDict
+from transformers.processing_utils import (
+    ImagesKwargs,
+    ProcessingKwargs,
+    ProcessorMixin,
+    Unpack,
+    VideosKwargs,
+)
+from transformers.utils import TensorType
+from transformers.video_processing_utils import BaseVideoProcessor
+from transformers.video_utils import group_videos_by_shape, reorder_videos
+class MiniMaxVLProcessorKwargs(ProcessingKwargs, total=False):
+    _defaults = {
+        "videos_kwargs": {
+            "do_resize": False,
+            "return_metadata": True,
+        },
+    }
+class MiniMaxVLProcessor(ProcessorMixin):
+    IMAGE_TOKEN = "]<]image[>["
+    VIDEO_TOKEN = "]<]video[>["
+    VISION_START_TOKEN = "]<]start of image[>["
+    VISION_END_TOKEN = "]<]end of image[>["
+    def __init__(
+        self, image_processor=None, tokenizer=None, video_processor=None, **kwargs
+    ):
+        self.image_token_id = tokenizer.convert_tokens_to_ids(self.IMAGE_TOKEN)
+        self.video_token_id = tokenizer.convert_tokens_to_ids(self.VIDEO_TOKEN)
+        super().__init__(image_processor, tokenizer, video_processor)
+        # Video expansion also uses image start/end tokens. Separate video
+        # start/end tokens exist in the tokenizer, but the original MiniMax
+        # serving path did not use them; keep that behavior for compatibility.
+        self.vision_start_token_id = tokenizer.convert_tokens_to_ids(
+            self.VISION_START_TOKEN
+        )
+        self.vision_end_token_id = tokenizer.convert_tokens_to_ids(
+            self.VISION_END_TOKEN
+        )
+    def _prune_video_tokens(
+        self,
+        input_text: str,
+        video_segments: List[int],
+        video_token: str,
+    ) -> str:
+        """
+        Prune video tokens by temporal_patch_size (e.g., 2:1).
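+        Expects the prompt to carry exactly sum(video_segments) video
+        tokens, i.e. one token per *sampled* frame. Within each segment, the
+        first token of every group of temporal_patch_size frames is kept and
+        the others are dropped, together with the timestamp text directly
+        before each dropped token.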
+        Args:
+            input_text: prompt with N video_tokens per segment
+            video_segments: actual sampled frame count per video segment
+            video_token: the video token string, e.g. ']<]video[>['
+        Returns:
+            Pruned input_text with ~N/temporal_patch_size tokens per segment.
+        """
+        # If no videos or temporal_patch_size <= 1, no pruning needed
+        if not video_segments or self.video_processor.temporal_patch_size <= 1:
+            return input_text
+        # Split while keeping delimiters
+        special_tokens = [video_token]  # , image_token]
+        pattern = "|".join(map(re.escape, special_tokens))
+        parts = re.split(f"({pattern})", input_text)
+        def is_timestamp(text: str) -> bool:
+            """Check if text ends with timestamp format like ']<]0.0 seconds[>['"""
+            return (
+                text.endswith("seconds[>[")
+                or text.endswith("seconds[>[ ")
+                or text.endswith("seconds [>[")
+                or text.endswith("seconds [>[ ")
+            )
+        def extract_timestamp(text: str) -> str:
+            """Extract timestamp text from the end, starting from ']<]'"""
+            start_index = text.rfind("]<]")
+            if start_index == -1:
+                raise ValueError(f"Failed to extract timestamp: {text}")
+            return text[start_index:]
+        # Build new text with pruned video tokens
+        final_parts = []
+        current_seg_idx = 0  # Which video segment we're in
+        frame_in_seg = 0  # Frame index within current segment
+        last_timestamp_len = 0  # Length of timestamp to potentially remove
+        for part in parts:
+            if part == video_token:
+                if current_seg_idx < len(video_segments):
+                    if frame_in_seg % self.video_processor.temporal_patch_size == 0:
+                        # Keep the first video token of each temporal patch
+                        final_parts.append(part)
+                    elif last_timestamp_len > 0:
+                        # Skip this video token and strip the timestamp
+                        # that was already appended just before it
+                        assert len(final_parts) > 0
+                        final_parts[-1] = final_parts[-1][:-last_timestamp_len]
+                    last_timestamp_len = 0
+                    # Advance the frame counter; roll over to the next segment
+                    frame_in_seg += 1
+                    if frame_in_seg >= video_segments[current_seg_idx]:
+                        current_seg_idx += 1
+                        frame_in_seg = 0
+                else:
+                    # No more video segments, keep as is
+                    final_parts.append(part)
+                    last_timestamp_len = 0
+            else:
+                # Text part
+                final_parts.append(part)
+                # Check if this text ends with a timestamp
+                if is_timestamp(part):
+                    last_timestamp_len = len(extract_timestamp(part))
+                else:
+                    last_timestamp_len = 0
+        return "".join(final_parts)
+    def __call__(
+        self,
+        images=None,
+        text=None,
+        videos=None,
+        **kwargs: Unpack[MiniMaxVLProcessorKwargs],
+    ) -> BatchFeature:
+        output_kwargs = self._merge_kwargs(
+            MiniMaxVLProcessorKwargs,
+            tokenizer_init_kwargs=self.tokenizer.init_kwargs,
+            **kwargs,
+        )
+        if images is not None:
+            images_kwargs = output_kwargs["images_kwargs"]
+            image_inputs = self.image_processor(images=images, **images_kwargs)
+            image_grid_thw = image_inputs["image_grid_thw"]
+        else:
+            image_inputs = {}
+            image_grid_thw = None
+        if videos is not None:
+            videos_kwargs = output_kwargs["videos_kwargs"]
+            video_inputs = self.video_processor(videos=videos, **videos_kwargs)
+            video_grid_thw = video_inputs["video_grid_thw"]
+            if not kwargs.get("return_metadata"):
+                video_metadata = video_inputs.pop("video_metadata")
+            else:
+                video_metadata = video_inputs["video_metadata"]
+        else:
+            video_inputs = {}
+            video_grid_thw = None
+        if not isinstance(text, list):
+            text = [text]
+        text = text.copy()
+        # Expand image tokens
+        if image_grid_thw is not None:
+            merge_length = self.image_processor.merge_size**2
+            placeholder = "]<]placeholder[>["
+            index = 0
+            for i in range(len(text)):
+                while self.IMAGE_TOKEN in text[i]:
+                    num_tokens = image_grid_thw[index].prod() // merge_length
+                    text[i] = text[i].replace(
+                        self.IMAGE_TOKEN,
+                        self.VISION_START_TOKEN
+                        + placeholder * num_tokens
+                        + self.VISION_END_TOKEN,
+                        1,
+                    )
+                    index += 1
+                text[i] = text[i].replace(placeholder, self.IMAGE_TOKEN)
+        # Expand video tokens
+        if video_grid_thw is not None:
+            merge_length = self.image_processor.merge_size**2
+            placeholder = "]<]placeholder[>["
+            index = 0
+            for i in range(len(text)):
+                while self.VIDEO_TOKEN in text[i]:
+                    metadata = video_metadata[index]
+                    grid_t = video_grid_thw[index][0]
+                    frame_seqlen = video_grid_thw[index][1:].prod() // merge_length
+                    video_placeholder = ""
+                    for frame_idx in range(grid_t):
+                        if (
+                            metadata.fps is not None
+                            and metadata.frames_indices is not None
+                        ):
+                            ts = (
+                                metadata.frames_indices[
+                                    min(
+                                        frame_idx
+                                        * self.video_processor.temporal_patch_size,
+                                        len(metadata.frames_indices) - 1,
+                                    )
+                                ]
+                                / metadata.fps
+                            )
+                            video_placeholder += f"]<]{ts:.1f} seconds[>["
+                        video_placeholder += (
+                            self.VISION_START_TOKEN
+                            + placeholder * frame_seqlen
+                            + self.VISION_END_TOKEN
+                        )
+                    text[i] = text[i].replace(self.VIDEO_TOKEN, video_placeholder, 1)
+                    index += 1
+                text[i] = text[i].replace(placeholder, self.VIDEO_TOKEN)
+        # Tokenize
+        return_tensors = output_kwargs["text_kwargs"].pop("return_tensors", None)
+        text_inputs = self.tokenizer(text, **output_kwargs["text_kwargs"])
+        return BatchFeature(
+            data={**text_inputs, **image_inputs, **video_inputs},
+            tensor_type=return_tensors,
+        )
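
The placeholder expansion in `__call__` above comes down to simple arithmetic on the `(t, h, w)` grids. The following is a minimal sketch of that arithmetic in plain Python, not the processor itself. The helper names `image_token_count` and `frame_timestamps` are hypothetical, and the defaults `merge_size=2` and `temporal_patch_size=2` are assumptions taken from the video processor defaults in this diff.

```python
from math import prod


def image_token_count(grid_thw, merge_size=2):
    """Placeholders emitted for one image: t * h * w // merge_size**2."""
    return prod(grid_thw) // (merge_size ** 2)


def frame_timestamps(grid_t, frames_indices, fps, temporal_patch_size=2):
    """Timestamp in seconds written before each temporal patch of a video.

    Mirrors the index clamp in __call__: the patch at position i reads the
    sampled frame index i * temporal_patch_size, capped at the last entry.
    """
    last = len(frames_indices) - 1
    return [
        frames_indices[min(i * temporal_patch_size, last)] / fps
        for i in range(grid_t)
    ]


# A 48 x 48 patch grid with 2 x 2 merging.
print(image_token_count((1, 48, 48)))

# Five sampled frames at 30 fps are padded to three temporal patches.
for ts in frame_timestamps(3, [0, 30, 60, 90, 120], fps=30.0):
    print(f"]<]{ts:.1f} seconds[>[")
```

The clamp matters when the frame count is odd: the last temporal patch is built from a repeated frame, so its timestamp falls back to the last real sampled frame.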

special_tokens_map.json ADDED

	@@ -0,0 +1,16 @@

+{
+  "bos_token": {
+    "content": "]~b]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "eos_token": {
+    "content": "[e~[",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}

tokenizer.json ADDED

The diff for this file is too large to render. See raw diff

tokenizer_config.json ADDED

	@@ -0,0 +1,501 @@

+{
+  "add_prefix_space": false,
+  "added_tokens_decoder": {
+    "200000": {
+      "content": "]!p~[",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "200001": {
+      "content": "<fim_prefix>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "200002": {
+      "content": "<fim_middle>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "200003": {
+      "content": "<fim_suffix>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "200004": {
+      "content": "<fim_pad>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "200005": {
+      "content": "<reponame>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "200006": {
+      "content": "<filename>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "200007": {
+      "content": "<gh_stars>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "200008": {
+      "content": "<issue_start>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "200009": {
+      "content": "<issue_comment>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "200010": {
+      "content": "<issue_closed>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "200011": {
+      "content": "<jupyter_start>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "200012": {
+      "content": "<jupyter_text>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "200013": {
+      "content": "<jupyter_code>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "200014": {
+      "content": "<jupyter_output>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "200015": {
+      "content": "<empty_output>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "200016": {
+      "content": "<commit_before>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "200017": {
+      "content": "<commit_msg>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "200018": {
+      "content": "<commit_after>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "200019": {
+      "content": "]~b]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "200020": {
+      "content": "[e~[",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "200021": {
+      "content": "]!d~[",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "200022": {
+      "content": "<function_call>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "200023": {
+      "content": "<code_interpreter>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "200024": {
+      "content": "]<]speech[>[",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "200025": {
+      "content": "]<]image[>[",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "200026": {
+      "content": "]<]video[>[",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "200027": {
+      "content": "]<]start of speech[>[",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "200028": {
+      "content": "]<]end of speech[>[",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "200029": {
+      "content": "]<]start of image[>[",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "200030": {
+      "content": "]<]end of image[>[",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "200031": {
+      "content": "]<]start of video[>[",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "200032": {
+      "content": "]<]end of video[>[",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "200033": {
+      "content": "]<]vision pad[>[",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "200034": {
+      "content": "]~!b[",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "200035": {
+      "content": "<jupyter_error>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "200036": {
+      "content": "<add_file>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "200037": {
+      "content": "<delete_file>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "200038": {
+      "content": "<rename_file>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "200039": {
+      "content": "<edit_file>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "200040": {
+      "content": "<commit_message>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "200041": {
+      "content": "<empty_source_file>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "200042": {
+      "content": "<repo_struct>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "200043": {
+      "content": "<code_context>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "200044": {
+      "content": "<file_content>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "200045": {
+      "content": "<source_files>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "200046": {
+      "content": "<pr_start>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "200047": {
+      "content": "<review_comment>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "200048": {
+      "content": "<filepath>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "200049": {
+      "content": "<file_sep>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "200050": {
+      "content": "<think>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "200051": {
+      "content": "</think>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "200052": {
+      "content": "<tool_call>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "200053": {
+      "content": "</tool_call>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "200054": {
+      "content": "]<]frame[>[",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "200055": {
+      "content": "]<]start of frame[>[",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "200056": {
+      "content": "]<]end of frame[>[",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "200057": {
+      "content": "<|content_altered_placeholder|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "200058": {
+      "content": "]<]minimax[>[",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "200059": {
+      "content": "<mm:think>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "200060": {
+      "content": "</mm:think>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    }
+  },
+  "bos_token": "]~b]",
+  "clean_up_tokenization_spaces": false,
+  "eos_token": "[e~[",
+  "pad_token": "]!p~[",
+  "model_max_length": 40960000,
+  "tokenizer_class": "PreTrainedTokenizerFast",
+  "unk_token": "[e~["
+}
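
Each entry in `added_tokens_decoder` carries a `special` flag, and a few of them (`<think>`, `</think>`, `<tool_call>`, `</tool_call>`, among others) are marked `false`. In Hugging Face tokenizers, decoding with `skip_special_tokens=True` drops only the tokens flagged as special, so the unflagged ones should remain in decoded text. The sketch below groups a small excerpt of the config by that flag using only the standard library. The excerpt is trimmed to two fields per entry, and `split_by_special` is a hypothetical helper name.

```python
import json

# Trimmed excerpt of tokenizer_config.json (only "content" and "special").
excerpt = json.loads("""
{
  "added_tokens_decoder": {
    "200020": {"content": "[e~[", "special": true},
    "200026": {"content": "]<]video[>[", "special": true},
    "200050": {"content": "<think>", "special": false},
    "200052": {"content": "<tool_call>", "special": false}
  }
}
""")


def split_by_special(config):
    """Return ({id: token} flagged special, {id: token} not flagged)."""
    special, regular = {}, {}
    for token_id, entry in config["added_tokens_decoder"].items():
        target = special if entry["special"] else regular
        target[int(token_id)] = entry["content"]
    return special, regular


special, regular = split_by_special(excerpt)
print("special:", special)
print("regular:", regular)
```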

video_processor.py ADDED

	@@ -0,0 +1,208 @@

+# Copyright 2023-2024 SGLang Team
+# Licensed under the Apache License, Version 2.0 (the "License");
+"""
+MiniMax VL family HuggingFace-compatible VideoProcessor.
+"""
+import math
+from typing import List, Optional, Tuple, Union
+import torch
+import torchvision
+from torchvision.transforms import InterpolationMode
+from transformers import BatchFeature
+from transformers.image_utils import PILImageResampling, SizeDict
+from transformers.processing_utils import (
+    Unpack,
+    VideosKwargs,
+)
+from transformers.utils import TensorType
+from transformers.video_processing_utils import BaseVideoProcessor
+from transformers.video_utils import group_videos_by_shape, reorder_videos
+MAX_RATIO = 200
+def round_by_factor(number: int, factor: int) -> int:
+    return round(number / factor) * factor
+def ceil_by_factor(number: int, factor: int) -> int:
+    return math.ceil(number / factor) * factor
+def floor_by_factor(number: int, factor: int) -> int:
+    return math.floor(number / factor) * factor
+def smart_resize(
+    height: int,
+    width: int,
+    factor: int = 28,
+    min_pixels: int = 4 * 28 * 28,
+    max_pixels: int = 451584,
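+) -> tuple[int, int]:
+    """Round (height, width) to multiples of `factor`, rescaling both sides
+    (aspect ratio roughly preserved) when the area falls outside
+    [min_pixels, max_pixels]."""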
+    if max(height, width) / min(height, width) > MAX_RATIO:
+        raise ValueError(
+            f"absolute aspect ratio must be smaller than {MAX_RATIO}, "
+            f"got {max(height, width) / min(height, width)}"
+        )
+    h_bar = max(factor, round_by_factor(height, factor))
+    w_bar = max(factor, round_by_factor(width, factor))
+    if h_bar * w_bar > max_pixels:
+        beta = math.sqrt((height * width) / max_pixels)
+        h_bar = floor_by_factor(height / beta, factor)
+        w_bar = floor_by_factor(width / beta, factor)
+    elif h_bar * w_bar < min_pixels:
+        beta = math.sqrt(min_pixels / (height * width))
+        h_bar = ceil_by_factor(height * beta, factor)
+        w_bar = ceil_by_factor(width * beta, factor)
+    return h_bar, w_bar
+class MiniMaxM3VLVideoProcessorKwargs(VideosKwargs, total=False):
+    patch_size: int
+    temporal_patch_size: int
+    merge_size: int
+    min_pixels: int
+    max_pixels: int
+    total_pixels: int
+    min_frames: int
+    max_frames: int
+    fps: float | int
+class MiniMaxM3VLVideoProcessor(BaseVideoProcessor):
+    do_resize = True
+    resample = PILImageResampling.BICUBIC
+    size = {"height": 672, "width": 672}
+    default_to_square = False
+    do_rescale = True
+    rescale_factor = 1 / 255
+    do_normalize = True
+    image_mean = [0.48145466, 0.4578275, 0.40821073]
+    image_std = [0.26862954, 0.26130258, 0.27577711]
+    do_convert_rgb = True
+    do_sample_frames = False
+    patch_size = 14
+    temporal_patch_size = 2
+    merge_size = 2
+    min_pixels = 4 * 28 * 28
+    max_pixels = 768 * 28 * 28                  # 602,112
+    total_pixels = int(64000 * 28 * 28 * 0.9)   # ~45M, ~64k tokens budget
+    fps = 1.0
+    min_frames = 4
+    max_frames = 768
+    valid_kwargs = MiniMaxM3VLVideoProcessorKwargs
+    model_input_names = ["pixel_values_videos", "video_grid_thw"]
+    def __init__(self, **kwargs: Unpack[MiniMaxM3VLVideoProcessorKwargs]):
+        super().__init__(**kwargs)
+    def _preprocess(
+        self,
+        videos: List[torch.Tensor],
+        do_convert_rgb: bool,
+        do_resize: bool,
+        size: SizeDict,
+        resample: PILImageResampling | InterpolationMode | int | None,
+        do_rescale: bool,
+        rescale_factor: float,
+        do_normalize: bool,
+        image_mean: float | List[float] | None,
+        image_std: float | List[float] | None,
+        patch_size: int,
+        temporal_patch_size: int,
+        merge_size: int,
+        min_pixels: int,
+        max_pixels: int,
+        return_tensors: str | TensorType | None = None,
+        **kwargs,
+    ) -> BatchFeature:
+        grouped_videos, grouped_videos_index = group_videos_by_shape(videos)
+        resized_videos_grouped = {}
+        factor = patch_size * merge_size
+        for shape, stacked_videos in grouped_videos.items():
+            batch_size, num_frames, channels, height, width = stacked_videos.shape
+            resized_height, resized_width = height, width
+            if do_resize:
+                resized_height, resized_width = smart_resize(
+                    height, width, factor=factor,
+                    min_pixels=min_pixels, max_pixels=max_pixels,
+                )
+                stacked_videos = stacked_videos.view(
+                    batch_size * num_frames, channels, height, width
+                )
+                stacked_videos = self.resize(
+                    stacked_videos,
+                    size=SizeDict(height=resized_height, width=resized_width),
+                    resample=resample,
+                )
+                stacked_videos = stacked_videos.view(
+                    batch_size,
+                    num_frames,
+                    channels,
+                    resized_height,
+                    resized_width,
+                )
+            resized_videos_grouped[shape] = stacked_videos
+        resized_videos = reorder_videos(resized_videos_grouped, grouped_videos_index)
+        grouped_videos, grouped_videos_index = group_videos_by_shape(resized_videos)
+        processed_videos_grouped = {}
+        processed_grids = {}
+        for shape, stacked_videos in grouped_videos.items():
+            resized_height, resized_width = stacked_videos.shape[-2:]
+            patches = self.rescale_and_normalize(
+                stacked_videos,
+                do_rescale,
+                rescale_factor,
+                do_normalize,
+                image_mean,
+                image_std,
+            )
+            # Pad the frame axis up to a multiple of temporal_patch_size
+            # by repeating the last frame
+            if pad := -patches.shape[1] % temporal_patch_size:
+                repeats = patches[:, -1:].expand(-1, pad, -1, -1, -1)
+                patches = torch.cat([patches, repeats], dim=1)
+            batch_size, grid_t, channels = patches.shape[:3]
+            grid_t = grid_t // temporal_patch_size
+            grid_h, grid_w = resized_height // patch_size, resized_width // patch_size
+            patches = patches.view(
+                batch_size,
+                grid_t,
+                temporal_patch_size,
+                channels,
+                grid_h // merge_size,
+                merge_size,
+                patch_size,
+                grid_w // merge_size,
+                merge_size,
+                patch_size,
+            )
+            patches = patches.permute(0, 1, 4, 7, 5, 8, 3, 2, 6, 9)
+            flatten_patches = patches.reshape(
+                batch_size,
+                grid_t * grid_h * grid_w,
+                channels * temporal_patch_size * patch_size * patch_size,
+            )
+            processed_videos_grouped[shape] = flatten_patches
+            processed_grids[shape] = [[grid_t, grid_h, grid_w]] * batch_size
+        processed_videos = reorder_videos(
+            processed_videos_grouped, grouped_videos_index
+        )
+        processed_grids = reorder_videos(processed_grids, grouped_videos_index)
+        pixel_values_videos = torch.cat(processed_videos, dim=0)
+        video_grid_thw = torch.tensor(processed_grids, dtype=torch.long)
+        return BatchFeature(
+            data={
+                "pixel_values_videos": pixel_values_videos,
+                "video_grid_thw": video_grid_thw,
+            },
+            tensor_type=return_tensors,
+        )
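
To make the sizes in `_preprocess` concrete, here is a minimal sketch that traces one video through the resize, temporal padding, and grid computation using plain integers instead of tensors. It restates `smart_resize` without the aspect-ratio check and uses the class defaults from `MiniMaxM3VLVideoProcessor` (`patch_size=14`, `temporal_patch_size=2`, `merge_size=2`, `max_pixels=768 * 28 * 28`). The helper `video_grid` is a hypothetical name introduced for illustration.

```python
import math


def smart_resize(height, width, factor=28,
                 min_pixels=4 * 28 * 28, max_pixels=768 * 28 * 28):
    """Same rounding rules as the diff's smart_resize, minus the ratio check."""
    h_bar = max(factor, round(height / factor) * factor)
    w_bar = max(factor, round(width / factor) * factor)
    if h_bar * w_bar > max_pixels:
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = math.floor(height / beta / factor) * factor
        w_bar = math.floor(width / beta / factor) * factor
    elif h_bar * w_bar < min_pixels:
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = math.ceil(height * beta / factor) * factor
        w_bar = math.ceil(width * beta / factor) * factor
    return h_bar, w_bar


def video_grid(num_frames, height, width,
               patch_size=14, temporal_patch_size=2, merge_size=2):
    """Return ((grid_t, grid_h, grid_w), placeholder tokens) for one video."""
    h, w = smart_resize(height, width, factor=patch_size * merge_size)
    # Frames are padded up to a multiple of temporal_patch_size.
    pad = -num_frames % temporal_patch_size
    grid_t = (num_frames + pad) // temporal_patch_size
    grid_h, grid_w = h // patch_size, w // patch_size
    tokens = grid_t * grid_h * grid_w // merge_size ** 2
    return (grid_t, grid_h, grid_w), tokens


print(smart_resize(1080, 1920))   # (560, 1008)
print(video_grid(5, 1080, 1920))  # ((3, 40, 72), 2160)
```

With these defaults, a 1080 x 1920 frame is scaled down to 560 x 1008, and five frames become three temporal patches of 720 merged tokens each.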