Instructions for using openai/gpt-oss-120b with libraries, inference providers, notebooks, and local apps.

Libraries

How to use openai/gpt-oss-120b with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="openai/gpt-oss-120b")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("openai/gpt-oss-120b")
model = AutoModelForCausalLM.from_pretrained("openai/gpt-oss-120b")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use openai/gpt-oss-120b with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "openai/gpt-oss-120b"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "openai/gpt-oss-120b",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
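The same request can be sent from Python. This is a minimal sketch using only the standard library, assuming the server started by `vllm serve` is reachable at `http://localhost:8000` as in the curl call above; the helper names `build_request` and `ask` are illustrative and not part of vLLM.

```python
import json
import urllib.request

# Assumption: a server started with `vllm serve` is listening on this address.
BASE_URL = "http://localhost:8000/v1/chat/completions"


def build_request(prompt, model="openai/gpt-oss-120b", url=BASE_URL):
    """Build (but do not send) the POST request for a single-turn chat."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_request(prompt)) as resp:
        body = json.load(resp)
    # The OpenAI-compatible response keeps the reply under choices[0].message.
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(ask("What is the capital of France?"))` sends the same question as the curl example. For SGLang the only change would be the port (30000 in the commands on this page).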

Use Docker

docker model run hf.co/openai/gpt-oss-120b

SGLang

How to use openai/gpt-oss-120b with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "openai/gpt-oss-120b" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "openai/gpt-oss-120b",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "openai/gpt-oss-120b" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "openai/gpt-oss-120b",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use openai/gpt-oss-120b with Docker Model Runner:
```
docker model run hf.co/openai/gpt-oss-120b
```

running mxfp4 on H100 using transformers with triton_kernel: make_default_matmul_mxfp4_w_layout not found

#64

by uillliu - opened Aug 6, 2025

Discussion

uillliu

Aug 6, 2025

Has anyone gotten mxfp4 to run on H100 using transformers and triton kernel?

System Info

transformers version: 4.55.0
Platform: Linux-5.15.0-144-generic-x86_64-with-glibc2.35
Python version: 3.10.12
Huggingface_hub version: 0.34.3
Safetensors version: 0.5.3
Accelerate version: 1.9.0
Accelerate config: not found
DeepSpeed version: not installed
PyTorch version (accelerator?): 2.7.1+cu126 (CUDA)
Tensorflow version (GPU?): not installed (NA)
Flax version (CPU?/GPU?/TPU?): not installed (NA)
Jax version: not installed
JaxLib version: not installed
Using distributed or parallel set-up in script?:
Using GPU in script?:
GPU type: NVIDIA H100 80GB HBM3

Reproduction

I tried to run the openai gpt-oss-120B model in mxfp4 on an H100, using the setup commands from the linked instructions:
pip install -U transformers accelerate torch triton kernelspip install git+https://github.com/triton-lang/triton.git@main#subdirectory=python/triton_kernels

I then ran the linked example script. (I also had to manually upgrade triton to 3.4.0.)

The error message states:

Traceback (most recent call last):
  File "/workspace/projects/gpt_oss/generate.py", line 6, in <module>
    model = AutoModelForCausalLM.from_pretrained(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/projects/trainnew/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 600, in from_pretrained
    return model_class.from_pretrained(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/projects/trainnew/lib/python3.11/site-packages/transformers/modeling_utils.py", line 316, in _wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/projects/trainnew/lib/python3.11/site-packages/transformers/modeling_utils.py", line 5061, in from_pretrained
    ) = cls._load_pretrained_model(
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/projects/trainnew/lib/python3.11/site-packages/transformers/modeling_utils.py", line 5524, in _load_pretrained_model
    _error_msgs, disk_offload_index, cpu_offload_index = load_shard_file(args)
                                                         ^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/projects/trainnew/lib/python3.11/site-packages/transformers/modeling_utils.py", line 974, in load_shard_file
    disk_offload_index, cpu_offload_index = _load_state_dict_into_meta_model(
                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/projects/trainnew/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/projects/trainnew/lib/python3.11/site-packages/transformers/modeling_utils.py", line 882, in _load_state_dict_into_meta_model
    hf_quantizer.create_quantized_param(
  File "/workspace/projects/trainnew/lib/python3.11/site-packages/transformers/quantizers/quantizer_mxfp4.py", line 223, in create_quantized_param
    load_and_swizzle_mxfp4(
  File "/workspace/projects/trainnew/lib/python3.11/site-packages/transformers/integrations/mxfp4.py", line 375, in load_and_swizzle_mxfp4
    triton_weight_tensor, weight_scale = swizzle_mxfp4(
                                         ^^^^^^^^^^^^^^
  File "/workspace/projects/trainnew/lib/python3.11/site-packages/transformers/integrations/mxfp4.py", line 64, in swizzle_mxfp4
    value_layout, value_layout_opts = layout.make_default_matmul_mxfp4_w_layout(mx_axis=1)
                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: module 'triton_kernels.tensor_details.layout' has no attribute 'make_default_matmul_mxfp4_w_layout'
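
The AttributeError means the installed `triton_kernels` does not expose the function that transformers calls, which usually points to a version mismatch between the two packages. A minimal sketch for checking this directly, without loading the model, is below. The module and attribute names come from the traceback; the helper `module_has_attr` is illustrative.

```python
import importlib


def module_has_attr(module_name, attr_name):
    """Return True/False for the attribute, or None if the module cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # Covers a missing package as well as a failing import inside it.
        return None
    return hasattr(module, attr_name)


result = module_has_attr(
    "triton_kernels.tensor_details.layout",
    "make_default_matmul_mxfp4_w_layout",
)
messages = {
    None: "triton_kernels could not be imported",
    False: "installed triton_kernels lacks make_default_matmul_mxfp4_w_layout",
    True: "function found",
}
print(messages[result])
```

If the function is reported as missing, the installed `triton_kernels` and transformers versions do not match each other, and reinstalling one of them is the likely next step.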

emaadmanzoor

Aug 6, 2025

Can you run this on a separate line by itself?

pip install git+https://github.com/triton-lang/triton.git@main#subdirectory=python/triton_kernels

uillliu

Aug 6, 2025

> Can you run this on a separate line by itself?
>
> pip install git+https://github.com/triton-lang/triton.git@main#subdirectory=python/triton_kernels

Hi, yes, I did run this on a separate line by itself. The fused command seems to be a typo in the original post, but I copied it over verbatim for consistency.

emaadmanzoor

Aug 6, 2025

I think I got it! You need torch 2.8:

pip install torch==2.8.0 --index-url https://download.pytorch.org/whl/test/cu128

And I'm reasonably sure you need Python 3.12.

I actually installed the torch nightly: torch==2.9.0.dev20250804+cu128

I checked your other packages and the versions match mine. I have an H100 96GB and it works with vLLM. Below is my vLLM install command:

uv pip install --pre vllm==0.10.1+gptoss \
    --extra-index-url https://wheels.vllm.ai/gpt-oss/ \
    --extra-index-url https://download.pytorch.org/whl/nightly/cu128 \
    --index-strategy unsafe-best-match
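
To compare an environment against the versions suggested in this reply, a small check like the following could help. It is only a sketch: the minimums (torch 2.8, Python 3.12) are one user's working setup as reported above, not documented requirements, and `release_tuple` and `meets_suggestion` are illustrative helpers.

```python
import re

# Minimums taken from the reply above; treat them as a hint, not a requirement.
SUGGESTED_MINIMUMS = {"torch": (2, 8), "python": (3, 12)}


def release_tuple(version):
    """Leading numeric components of a version string, e.g. '2.7.1+cu126' -> (2, 7, 1)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"no numeric release in {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_suggestion(name, version):
    """True if `version` is at least the suggested minimum for `name`."""
    return release_tuple(version) >= SUGGESTED_MINIMUMS[name]


# Versions quoted in this thread.
for name, version in [
    ("torch", "2.7.1+cu126"),
    ("torch", "2.9.0.dev20250804+cu128"),
    ("python", "3.10.12"),
]:
    status = "ok" if meets_suggestion(name, version) else "below suggested minimum"
    print(name, version, status)
```

Applied to the versions in the original report, both torch 2.7.1 and Python 3.10.12 fall below the suggested minimums, while the torch nightly mentioned above passes.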

marcsun13

Aug 7, 2025

With transformers main, it should even work on a T4! Please try the following Google Colab: https://colab.research.google.com/drive/15DJv6QWgc49MuC7dlNS9ifveXBDjCWO5?usp=sharing

stargazerx0

Aug 10, 2025

I got the error "No module named 'triton.tools.ragged_tma'", and for some reason I can't build triton from source. Has anyone solved this issue? Thanks a lot.
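
Whether the installed triton ships that submodule can be checked without running the model. A minimal sketch, assuming only that the module path in the error message is the one being looked up; `submodule_available` is an illustrative helper:

```python
import importlib.util


def submodule_available(name):
    """True if the module `name` can be located in the current environment."""
    try:
        # find_spec imports the parent packages, but not the submodule itself.
        return importlib.util.find_spec(name) is not None
    except ModuleNotFoundError:
        # Raised when a parent package (here: triton or triton.tools) is missing.
        return False


print("triton.tools.ragged_tma available:", submodule_available("triton.tools.ragged_tma"))
```

A `False` result with triton installed would suggest that the installed triton release predates or otherwise lacks this submodule, so the installed version is worth comparing with what the calling package expects.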
