K-EXAONE-236B-A23B-NVFP4A16

This repository contains an NVFP4-quantized build of LGAI-EXAONE/K-EXAONE-236B-A23B, together with a Furiosa Executable Bundle (FXB) for running it on FuriosaAI RNGD with Furiosa-LLM. The base model also runs on other frameworks (such as vLLM, SGLang, and Transformers); for usage with those, see the upstream LGAI-EXAONE/K-EXAONE-236B-A23B model card.

Overview

K-EXAONE is a large-scale multilingual language model developed by LG AI Research. It is an auto-regressive Mixture-of-Experts (MoE) transformer with 236B total parameters and 23B active per token (128 experts, 8 activated plus 1 shared), using a hybrid attention scheme that interleaves sliding-window and global attention layers. The model covers six languages — Korean, English, Spanish, German, Japanese, and Vietnamese — and supports both reasoning and non-reasoning chat. Its intended use is the same as the upstream LGAI-EXAONE/K-EXAONE-236B-A23B, and it is released under the K-EXAONE AI Model License.

  • Architecture: ExaoneMoE (Mixture-of-Experts)
  • Input / Output: Text / Text
  • Supported Inference Engine: Furiosa-LLM
  • Supported Hardware: FuriosaAI RNGD

Quantization

The weights are quantized to NVFP4 (4-bit floating point), while activations and the KV cache remain in 16-bit precision (NVFP4A16).
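
For a rough sense of scale, the weight footprint can be estimated from the bit width. The sketch below assumes the standard NVFP4 layout (4-bit values plus one 8-bit scale per block of 16 weights) and, as a simplification, treats all 236B parameters as quantized. Layers kept in higher precision and per-tensor scales are ignored, so the actual checkpoint size will differ.

```python
# Back-of-envelope estimate of weight storage; not a measurement of this checkpoint.
TOTAL_PARAMS = 236e9   # total parameter count from the model name
VALUE_BITS = 4         # NVFP4 stores each weight as a 4-bit float
SCALE_BITS = 8         # assumed: one 8-bit scale shared by each block
BLOCK_SIZE = 16        # assumed: 16 weights per scaling block


def nvfp4_weight_gb(num_params: float) -> float:
    """Approximate NVFP4 weight size in GB, including per-block scales."""
    bits_per_weight = VALUE_BITS + SCALE_BITS / BLOCK_SIZE
    return num_params * bits_per_weight / 8 / 1e9


def bf16_weight_gb(num_params: float) -> float:
    """Weight size in GB if every parameter were stored in 16 bits."""
    return num_params * 16 / 8 / 1e9


print(f"NVFP4 estimate: {nvfp4_weight_gb(TOTAL_PARAMS):.2f} GB")
print(f"BF16 baseline:  {bf16_weight_gb(TOTAL_PARAMS):.2f} GB")
```

Under these assumptions the quantized weights take a little over a quarter of the 16-bit footprint; activations and the KV cache are unaffected because they stay in 16-bit precision.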

Features

  • Reasoning. K-EXAONE is a reasoning model and thinking is enabled by default. Launch the server with --reasoning-parser deepseek_v3 and --default-chat-template-kwargs '{"enable_thinking": true}' to have the chain of thought returned in a separate field. To use non-reasoning mode, pass "chat_template_kwargs": {"enable_thinking": false} per request.
  • Tool calling. The model supports tool (function) calling through the hermes tool-call parser.

Parallelism Strategy

On RNGD, K-EXAONE-236B-A23B-NVFP4A16 runs with a tensor-parallel size of 32 PEs, which maps to four RNGD cards (8 PEs per card).
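
The card count follows directly from the PE count. As a small illustration, the helper below (a hypothetical name, not part of Furiosa-LLM) rounds up the number of cards for a given tensor-parallel size, using the 8 PEs per card stated above.

```python
import math

PES_PER_CARD = 8  # from the text: each RNGD card provides 8 PEs


def cards_needed(tensor_parallel_size: int, pes_per_card: int = PES_PER_CARD) -> int:
    """Return how many cards are required to supply the requested number of PEs."""
    if tensor_parallel_size <= 0 or pes_per_card <= 0:
        raise ValueError("sizes must be positive")
    return math.ceil(tensor_parallel_size / pes_per_card)


print(cards_needed(32))  # tensor-parallel size used by this model
```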

Usage

To run this model with Furiosa-LLM, follow the example commands below after installing Furiosa-LLM and its prerequisites.

Launch the server

The simplest way to serve the model is:

# Launch the server, listening on port 8000 by default
furiosa-llm serve furiosa-ai/K-EXAONE-236B-A23B-NVFP4A16 \
  --reasoning-parser deepseek_v3 \
  --default-chat-template-kwargs '{"enable_thinking": true}'

The --reasoning-parser deepseek_v3 flag separates the model's chain of thought from the final answer (see Reasoning below). The --default-chat-template-kwargs '{"enable_thinking": true}' flag keeps the chat template and the reasoning parser aligned: K-EXAONE's chat template enables thinking by default, but deepseek_v3 treats reasoning as disabled unless enable_thinking is set, so without this flag a request that omits enable_thinking would leak the raw <think>...</think> text into the response.

When the server is ready, you will see:

INFO:     Started server process [27507]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
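
In scripts, it can be more convenient to poll the server than to watch the log. The following is a minimal sketch that assumes the server implements the OpenAI-style GET /v1/models endpoint; the function name, timeout, and polling interval are illustrative choices, not Furiosa-LLM settings.

```python
import http.client
import time
import urllib.request


def wait_until_ready(base_url: str = "http://localhost:8000",
                     timeout_s: float = 600.0,
                     interval_s: float = 5.0) -> bool:
    """Poll {base_url}/v1/models until it answers HTTP 200 or the timeout expires."""
    url = base_url.rstrip("/") + "/v1/models"  # assumed endpoint
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (OSError, http.client.HTTPException):
            pass  # connection refused or incomplete reply: server still starting
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        time.sleep(min(interval_s, remaining))


# Example (requires a running server):
#   if wait_until_ready():
#       print("server is ready")
```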

Launch the server with tool calling

To enable tool (function) calling, start the server with the hermes tool-call parser:

furiosa-llm serve furiosa-ai/K-EXAONE-236B-A23B-NVFP4A16 \
  --reasoning-parser deepseek_v3 \
  --default-chat-template-kwargs '{"enable_thinking": true}' \
  --enable-auto-tool-choice \
  --tool-call-parser hermes

Query the server

The server exposes an OpenAI-compatible API. You can send a request with curl:

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
    "model": "furiosa-ai/K-EXAONE-236B-A23B-NVFP4A16",
    "messages": [{"role": "user", "content": "What is the capital of France?"}]
    }' \
    | python -m json.tool

Reasoning

With --reasoning-parser deepseek_v3, K-EXAONE returns its reasoning separately from the final answer:

  • response.choices[].message.reasoning (non-streaming)
  • response.choices[].delta.reasoning (streaming)
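
For the streaming case, the two fields arrive incrementally and have to be accumulated by the client. The sketch below shows one way to do that; collect_stream is a hypothetical helper, and it reads delta.reasoning defensively because the field may be missing on chunks that carry no reasoning text.

```python
def collect_stream(chunks):
    """Accumulate reasoning and answer text from Chat Completions stream chunks."""
    reasoning_parts, answer_parts = [], []
    for chunk in chunks:
        if not chunk.choices:
            continue  # some chunks carry no choices
        delta = chunk.choices[0].delta
        reasoning = getattr(delta, "reasoning", None)  # may be absent
        if reasoning:
            reasoning_parts.append(reasoning)
        content = getattr(delta, "content", None)
        if content:
            answer_parts.append(content)
    return "".join(reasoning_parts), "".join(answer_parts)


# Usage against a running server (not executed here):
#   from openai import OpenAI
#   client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
#   stream = client.chat.completions.create(
#       model="furiosa-ai/K-EXAONE-236B-A23B-NVFP4A16",
#       messages=[{"role": "user", "content": "What is the capital of France?"}],
#       stream=True,
#   )
#   reasoning, answer = collect_stream(stream)
```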

K-EXAONE thinks by default. For latency-sensitive tasks you can switch to non-reasoning mode per request:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="furiosa-ai/K-EXAONE-236B-A23B-NVFP4A16",
    messages=[{"role": "user", "content": "How many r's are in 'strawberry'?"}],
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)

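# In non-reasoning mode the response has no reasoning content, so the
# reasoning attribute may be absent; read it defensively.
print("Reasoning:", getattr(response.choices[0].message, "reasoning", None))
print("Answer:", response.choices[0].message.content)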

Note: The reasoning field is not part of the OpenAI API specification, but it is the convention OpenAI recommends for returning the chain-of-thought (CoT) in Chat Completions-compatible APIs. The OpenAI Agents SDK uses reasoning as its primary property for the CoT, and many LLM serving frameworks (such as vLLM) follow the same convention. The field appears only in responses that contain reasoning content; accessing it as an attribute on a response without reasoning content raises an AttributeError.

Tool calling

With the server launched using --enable-auto-tool-choice --tool-call-parser hermes, you can pass tools and let the model decide when to call them. See the Tool Calling guide for a complete client example and details on tool-choice options.
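
As a rough illustration of the client side, the sketch below defines one tool in the OpenAI Chat Completions tool schema and turns the model's tool calls into tool-role messages. The get_weather function, its schema, and the run_tool_calls helper are hypothetical and exist only for this example; the Tool Calling guide remains the reference for the supported options.

```python
import json


def get_weather(city):
    """Hypothetical tool used only for illustration; returns canned data."""
    return {"city": city, "forecast": "sunny"}


# Tool definitions in the OpenAI Chat Completions "tools" format.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the weather forecast for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

REGISTRY = {"get_weather": get_weather}


def run_tool_calls(message, registry=REGISTRY):
    """Execute each tool call on an assistant message and build tool-role replies."""
    results = []
    for call in getattr(message, "tool_calls", None) or []:
        args = json.loads(call.function.arguments or "{}")  # arguments arrive as a JSON string
        output = registry[call.function.name](**args)
        results.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": json.dumps(output),
        })
    return results


# Usage against a running server (not executed here):
#   response = client.chat.completions.create(model=..., messages=messages, tools=TOOLS)
#   message = response.choices[0].message
#   messages += [message, *run_tool_calls(message)]
#   # then call client.chat.completions.create again to get the final answer
```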
