Running mxfp4 on H100 using transformers with triton_kernels: make_default_matmul_mxfp4_w_layout not found
Has anyone gotten mxfp4 to run on H100 using transformers and triton kernel?
System Info
- transformers version: 4.55.0
- Platform: Linux-5.15.0-144-generic-x86_64-with-glibc2.35
- Python version: 3.10.12
- Huggingface_hub version: 0.34.3
- Safetensors version: 0.5.3
- Accelerate version: 1.9.0
- Accelerate config: not found
- DeepSpeed version: not installed
- PyTorch version (accelerator?): 2.7.1+cu126 (CUDA)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?:
- Using GPU in script?:
- GPU type: NVIDIA H100 80GB HBM3
Reproduction
I tried to run the openai/gpt-oss-120b model in mxfp4 on an H100, following the setup command given at this link:
pip install -U transformers accelerate torch triton kernelspip install git+https://github.com/triton-lang/triton.git@main#subdirectory=python/triton_kernels
I ran the script provided here.
(And I had to manually upgrade triton to 3.4.0)
The error message states:
Traceback (most recent call last):
  File "/workspace/projects/gpt_oss/generate.py", line 6, in <module>
    model = AutoModelForCausalLM.from_pretrained(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/projects/trainnew/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 600, in from_pretrained
    return model_class.from_pretrained(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/projects/trainnew/lib/python3.11/site-packages/transformers/modeling_utils.py", line 316, in _wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/projects/trainnew/lib/python3.11/site-packages/transformers/modeling_utils.py", line 5061, in from_pretrained
    ) = cls._load_pretrained_model(
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/projects/trainnew/lib/python3.11/site-packages/transformers/modeling_utils.py", line 5524, in _load_pretrained_model
    _error_msgs, disk_offload_index, cpu_offload_index = load_shard_file(args)
                                                         ^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/projects/trainnew/lib/python3.11/site-packages/transformers/modeling_utils.py", line 974, in load_shard_file
    disk_offload_index, cpu_offload_index = _load_state_dict_into_meta_model(
                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/projects/trainnew/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/projects/trainnew/lib/python3.11/site-packages/transformers/modeling_utils.py", line 882, in _load_state_dict_into_meta_model
    hf_quantizer.create_quantized_param(
  File "/workspace/projects/trainnew/lib/python3.11/site-packages/transformers/quantizers/quantizer_mxfp4.py", line 223, in create_quantized_param
    load_and_swizzle_mxfp4(
  File "/workspace/projects/trainnew/lib/python3.11/site-packages/transformers/integrations/mxfp4.py", line 375, in load_and_swizzle_mxfp4
    triton_weight_tensor, weight_scale = swizzle_mxfp4(
                                         ^^^^^^^^^^^^^^
  File "/workspace/projects/trainnew/lib/python3.11/site-packages/transformers/integrations/mxfp4.py", line 64, in swizzle_mxfp4
    value_layout, value_layout_opts = layout.make_default_matmul_mxfp4_w_layout(mx_axis=1)
                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: module 'triton_kernels.tensor_details.layout' has no attribute 'make_default_matmul_mxfp4_w_layout'
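One way to confirm whether the installed triton_kernels build exposes the function that transformers is calling is to probe for it directly. This is only a minimal sketch: the module and attribute names are taken from the AttributeError above, and the helper name `has_attribute` is hypothetical, not part of any library.

```python
import importlib


def has_attribute(module_name: str, attr_name: str) -> bool:
    """Return True if module_name imports cleanly and defines attr_name."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # The module (or one of its parent packages) is missing or failed to import.
        return False
    return hasattr(module, attr_name)


if __name__ == "__main__":
    # Names copied from the AttributeError in the traceback.
    found = has_attribute(
        "triton_kernels.tensor_details.layout",
        "make_default_matmul_mxfp4_w_layout",
    )
    print("make_default_matmul_mxfp4_w_layout available:", found)
```

If this prints False even though triton_kernels is installed, the installed copy probably does not match the layout API that this transformers version calls.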
Can you run this on a separate line by itself?
pip install git+https://github.com/triton-lang/triton.git@main#subdirectory=python/triton_kernels
Hi, yes, I did run this on a separate line by itself. This seems to be a typo in the original post, but I copied it over verbatim for consistency.
I think I got it! You need torch 2.8:
pip install torch==2.8.0 --index-url https://download.pytorch.org/whl/test/cu128
And I'm reasonably sure you need Python 3.12.
I actually installed the torch nightly: torch==2.9.0.dev20250804+cu128
I checked your other packages and the versions match mine. I have an H100 96GB and it works with vLLM. Below is my vLLM install command:
uv pip install --pre vllm==0.10.1+gptoss --extra-index-url https://wheels.vllm.ai/gpt-oss/ --extra-index-url https://download.pytorch.org/whl/nightly/cu128 --index-strategy unsafe-best-match
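Since the suggestions in this thread hinge on exact versions (torch 2.8 or a nightly, Python 3.12, triton 3.4.0), it can help to print what is actually installed in the active environment before comparing notes. A minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical, and the distribution names in the list (in particular `triton_kernels`) are assumptions about how the packages are registered with pip.

```python
import sys
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    print("python:", sys.version.split()[0])
    # Distribution names are assumptions; adjust them to match `pip list`.
    names = ["torch", "triton", "triton_kernels", "transformers", "kernels", "vllm"]
    for name, version in installed_versions(names).items():
        print(f"{name}: {version or 'not installed'}")
```

Running this with the same interpreter that launches the generation script avoids comparing against a different virtual environment by mistake.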
With transformers main, it should even work on a T4! Please try the following Google Colab: https://colab.research.google.com/drive/15DJv6QWgc49MuC7dlNS9ifveXBDjCWO5?usp=sharing
I got the error "No module named 'triton.tools.ragged_tma'", and for some reason I can't build triton from source. Has anyone solved this issue? Thanks a lot.
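A quick way to tell whether the active triton installation ships that submodule at all, before attempting a source build, is to ask the import system to locate it. A minimal sketch; the helper name `module_available` is hypothetical, and the only name taken from the thread is `triton.tools.ragged_tma`.

```python
import importlib.util


def module_available(dotted_name: str) -> bool:
    """Return True if the import system can locate dotted_name.

    Parent packages are imported during the lookup, but the submodule
    itself is not executed.
    """
    try:
        return importlib.util.find_spec(dotted_name) is not None
    except ModuleNotFoundError:
        # Raised when a parent package (for example "triton") is itself missing.
        return False


if __name__ == "__main__":
    print("triton.tools.ragged_tma found:", module_available("triton.tools.ragged_tma"))
```

If this reports False while `triton` itself is importable, the installed triton release simply does not include that module.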