Instructions to use LGAI-EXAONE/EXAONE-4.0-1.2B-GGUF with libraries and local apps.

Libraries

How to use LGAI-EXAONE/EXAONE-4.0-1.2B-GGUF with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="LGAI-EXAONE/EXAONE-4.0-1.2B-GGUF")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("LGAI-EXAONE/EXAONE-4.0-1.2B-GGUF", dtype="auto")

llama-cpp-python

How to use LGAI-EXAONE/EXAONE-4.0-1.2B-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="LGAI-EXAONE/EXAONE-4.0-1.2B-GGUF",
	filename="EXAONE-4.0-1.2B-BF16.gguf",
)

llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)

llama.cpp

How to use LGAI-EXAONE/EXAONE-4.0-1.2B-GGUF with llama.cpp:

Install (macOS, Linux)

curl -LsSf https://llama.app/install.sh | sh
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf LGAI-EXAONE/EXAONE-4.0-1.2B-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama cli -hf LGAI-EXAONE/EXAONE-4.0-1.2B-GGUF:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf LGAI-EXAONE/EXAONE-4.0-1.2B-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama cli -hf LGAI-EXAONE/EXAONE-4.0-1.2B-GGUF:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf LGAI-EXAONE/EXAONE-4.0-1.2B-GGUF:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf LGAI-EXAONE/EXAONE-4.0-1.2B-GGUF:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf LGAI-EXAONE/EXAONE-4.0-1.2B-GGUF:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf LGAI-EXAONE/EXAONE-4.0-1.2B-GGUF:Q4_K_M

Use Docker

docker model run hf.co/LGAI-EXAONE/EXAONE-4.0-1.2B-GGUF:Q4_K_M

vLLM

How to use LGAI-EXAONE/EXAONE-4.0-1.2B-GGUF with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "LGAI-EXAONE/EXAONE-4.0-1.2B-GGUF"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "LGAI-EXAONE/EXAONE-4.0-1.2B-GGUF",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/LGAI-EXAONE/EXAONE-4.0-1.2B-GGUF:Q4_K_M

SGLang

How to use LGAI-EXAONE/EXAONE-4.0-1.2B-GGUF with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "LGAI-EXAONE/EXAONE-4.0-1.2B-GGUF" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "LGAI-EXAONE/EXAONE-4.0-1.2B-GGUF",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "LGAI-EXAONE/EXAONE-4.0-1.2B-GGUF" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "LGAI-EXAONE/EXAONE-4.0-1.2B-GGUF",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Ollama
How to use LGAI-EXAONE/EXAONE-4.0-1.2B-GGUF with Ollama:
```
ollama run hf.co/LGAI-EXAONE/EXAONE-4.0-1.2B-GGUF:Q4_K_M
```

Unsloth Studio

How to use LGAI-EXAONE/EXAONE-4.0-1.2B-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for LGAI-EXAONE/EXAONE-4.0-1.2B-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for LGAI-EXAONE/EXAONE-4.0-1.2B-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for LGAI-EXAONE/EXAONE-4.0-1.2B-GGUF to start chatting

Pi

How to use LGAI-EXAONE/EXAONE-4.0-1.2B-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf LGAI-EXAONE/EXAONE-4.0-1.2B-GGUF:Q4_K_M

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "LGAI-EXAONE/EXAONE-4.0-1.2B-GGUF:Q4_K_M"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use LGAI-EXAONE/EXAONE-4.0-1.2B-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf LGAI-EXAONE/EXAONE-4.0-1.2B-GGUF:Q4_K_M

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default LGAI-EXAONE/EXAONE-4.0-1.2B-GGUF:Q4_K_M

Run Hermes

hermes

Docker Model Runner
How to use LGAI-EXAONE/EXAONE-4.0-1.2B-GGUF with Docker Model Runner:
```
docker model run hf.co/LGAI-EXAONE/EXAONE-4.0-1.2B-GGUF:Q4_K_M
```

Lemonade

How to use LGAI-EXAONE/EXAONE-4.0-1.2B-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull LGAI-EXAONE/EXAONE-4.0-1.2B-GGUF:Q4_K_M

Run and chat with the model

lemonade run user.EXAONE-4.0-1.2B-GGUF-Q4_K_M

List all available models

lemonade list

🎉 License Updated! We are pleased to announce our more flexible licensing terms 🤗
✈️ Try on FriendliAI (licensed for commercial purposes)

📢 EXAONE 4.0 is officially supported by llama.cpp! Please check the guide below.

EXAONE-4.0-1.2B-GGUF

Introduction

We introduce EXAONE 4.0, which integrates a Non-reasoning mode and Reasoning mode to achieve both the excellent usability of EXAONE 3.5 and the advanced reasoning abilities of EXAONE Deep. To pave the way for the agentic AI era, EXAONE 4.0 incorporates essential features such as agentic tool use, and its multilingual capabilities are extended to support Spanish in addition to English and Korean.

The EXAONE 4.0 model series consists of two sizes: a mid-size 32B model optimized for high performance, and a small-size 1.2B model designed for on-device applications.

Compared with previous EXAONE models, the EXAONE 4.0 architecture introduces the following changes:

Hybrid Attention: For the 32B model, we adopt a hybrid attention scheme that combines local attention (sliding window attention) with global attention (full attention) in a 3:1 ratio. We do not use RoPE (Rotary Positional Embedding) for global attention, for better global context understanding.
QK-Reorder-Norm: We move LayerNorm from the traditional Pre-LN position and apply it directly to the attention and MLP outputs, and we add RMS normalization right after the Q and K projections. This yields better performance on downstream tasks at the cost of more computation.
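
As a rough illustration of the 3:1 local-to-global ratio described above, the sketch below labels each layer with an attention type. The names `attention_layout` and `uses_rope`, the choice to place the global layer last in each group of four, and the 8-layer depth in the example are assumptions made for illustration; they are not taken from the EXAONE 4.0 implementation.

```python
from typing import List


def attention_layout(num_layers: int, local_per_global: int = 3) -> List[str]:
    """Label each layer "local" (sliding window) or "global" (full attention).

    Assumption: layers repeat in groups of `local_per_global` local layers
    followed by one global layer. The real layer ordering may differ.
    """
    group = local_per_global + 1
    return [
        "global" if (i + 1) % group == 0 else "local"
        for i in range(num_layers)
    ]


def uses_rope(layer_type: str) -> bool:
    # Per the description above, RoPE is not applied on global-attention layers.
    return layer_type == "local"


layout = attention_layout(8)  # 8 layers is a hypothetical depth
print(layout)
print([uses_rope(t) for t in layout])
```

Under this assumed ordering, every fourth layer sees the full context without rotary position information, while the other three work on a sliding window.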

For more details, please refer to our technical report, HuggingFace paper, blog, and GitHub.

Model Configuration

Number of Parameters (without embeddings): 1.07B
Number of Layers: 30
Number of Attention Heads: GQA with 32 heads and 8 KV heads
Vocab Size: 102,400
Context Length: 65,536 tokens
Quantization: Q8_0, Q6_K, Q5_K_M, Q4_K_M, IQ4_XS in GGUF format (also includes BF16 weights)
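
To make the GQA figures above concrete, the sketch below maps each of the 32 query heads to one of the 8 KV heads. The helper name `kv_head_for` and the grouping of consecutive query heads onto one KV head are assumptions (the latter is the common GQA convention), not details confirmed by this model card.

```python
NUM_Q_HEADS = 32   # from the configuration above
NUM_KV_HEADS = 8   # from the configuration above


def kv_head_for(q_head: int) -> int:
    """Return the KV head shared by a given query head.

    Assumption: consecutive query heads are grouped onto one KV head,
    which is the usual GQA convention.
    """
    if not 0 <= q_head < NUM_Q_HEADS:
        raise ValueError(f"q_head must be in [0, {NUM_Q_HEADS})")
    group_size = NUM_Q_HEADS // NUM_KV_HEADS  # 4 query heads per KV head
    return q_head // group_size


print([kv_head_for(h) for h in range(NUM_Q_HEADS)])
```

With these numbers, four query heads share each KV head, so the KV cache holds a quarter of the key/value heads that full multi-head attention would need.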

Quickstart

llama.cpp

You can run EXAONE models locally using llama.cpp by following these steps:

Install the latest version of llama.cpp (version >= b5932). Please check the official installation guide from llama.cpp.

Download the EXAONE 4.0 model weights in GGUF format.

huggingface-cli download LGAI-EXAONE/EXAONE-4.0-1.2B-GGUF \
    --include "EXAONE-4.0-1.2B-Q4_K_M.gguf" \
    --local-dir .
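
If you prefer to download from Python rather than the CLI, the following sketch should be roughly equivalent to the command above. It assumes the `huggingface_hub` package is installed; the wrapper name `download_gguf` is hypothetical.

```python
REPO_ID = "LGAI-EXAONE/EXAONE-4.0-1.2B-GGUF"
FILENAME = "EXAONE-4.0-1.2B-Q4_K_M.gguf"


def download_gguf(local_dir: str = ".") -> str:
    """Download the Q4_K_M file into `local_dir` and return its local path."""
    # Imported lazily so this module loads even without huggingface_hub.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(
        repo_id=REPO_ID,
        filename=FILENAME,
        local_dir=local_dir,
    )


# Usage (requires network access):
# path = download_gguf()
```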

Generation with `llama-cli`

Apply the chat template using transformers.

This step is needed to work around issues with the current EXAONE modeling code in llama.cpp. A fix is in progress in our PR, and we will update this guide once the issues are resolved.

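from transformers import AutoTokenizer

# Load the tokenizer (and its chat template) from the original model repository.
model_name = "LGAI-EXAONE/EXAONE-4.0-1.2B"
tokenizer = AutoTokenizer.from_pretrained(model_name)

messages = [
    {"role": "user", "content": "Let's work together on local system!"}
]
# Render the prompt as plain text; llama-cli tokenizes it later.
input_text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

print(repr(input_text))
# Save the rendered prompt so it can be passed to llama-cli with -f.
with open("inputs.txt", "w", encoding="utf-8") as f:
    f.write(input_text)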

Generate the result with greedy decoding.

llama-cli -m EXAONE-4.0-1.2B-Q4_K_M.gguf \
    -fa -ngl 31 \
    --temp 0.0 --top-k 1 \
    -f inputs.txt -no-cnv
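
With `--temp 0.0 --top-k 1`, sampling reduces to picking the highest-scoring token at each step. The sketch below illustrates that selection rule on made-up logits. The function name `greedy_pick` and the toy values are hypothetical, and this is not how llama.cpp implements its sampler internally.

```python
from typing import Sequence


def greedy_pick(logits: Sequence[float]) -> int:
    """Return the index of the largest logit (ties go to the lowest index)."""
    if not logits:
        raise ValueError("logits must not be empty")
    return max(range(len(logits)), key=lambda i: logits[i])


toy_logits = [0.1, 2.5, -1.0, 2.4]  # made-up scores for a 4-token vocabulary
print(greedy_pick(toy_logits))  # prints 1
```

Because no randomness is involved, repeated runs on the same prompt should produce the same output, which makes this setting convenient for checking that a quantized file loads and generates correctly.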

OpenAI compatible server with `llama-server`

Run llama-server with the EXAONE 4.0 Jinja template. You can find the chat template file in this repository. The context size is set to 65,536 tokens to match the context length listed in the model configuration.

llama-server -m EXAONE-4.0-1.2B-Q4_K_M.gguf \
    -c 65536 -fa -ngl 31 \
    --temp 0.6 --top-p 0.95 \
    --jinja --chat-template-file chat_template.jinja \
    --host 0.0.0.0 --port 8820 \
    -a EXAONE-4.0-1.2B-Q4_K_M

Use the OpenAI-compatible chat completions endpoint to test the GGUF model.

curl -X POST http://localhost:8820/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "EXAONE-4.0-1.2B-Q4_K_M",
        "messages": [
            {"role": "user", "content": "Let'\''s work together on server!"}
        ],
        "max_tokens": 1024,
        "temperature": 0.6,
        "top_p": 0.95,
        "chat_template_kwargs": {"enable_thinking": false}
    }'
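
The same request can be sent from Python. The sketch below uses only the standard library and mirrors the curl body above. The helper names `build_payload` and `chat` are hypothetical, and the response parsing assumes the standard OpenAI chat completion response shape.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8820/v1"    # host and port from the command above
MODEL_ALIAS = "EXAONE-4.0-1.2B-Q4_K_M"   # value passed to llama-server via -a


def build_payload(prompt: str, enable_thinking: bool = False) -> dict:
    """Build the same JSON body as the curl example."""
    return {
        "model": MODEL_ALIAS,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 1024,
        "temperature": 0.6,
        "top_p": 0.95,
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }


def chat(prompt: str, enable_thinking: bool = False, timeout: float = 120.0) -> str:
    """POST to /chat/completions and return the assistant message text."""
    request = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_payload(prompt, enable_thinking)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # Assumes the standard OpenAI chat completion response shape.
    return body["choices"][0]["message"]["content"]


# Usage (requires the llama-server instance above to be running):
# print(chat("Let's work together on server!"))
```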

Performance

The following tables show the evaluation results of each model in reasoning and non-reasoning modes. The evaluation details can be found in the technical report.

✅ denotes a model with hybrid reasoning capability, evaluated in reasoning or non-reasoning mode depending on the purpose.
To assess Korean practical and professional knowledge, we adopt both the KMMLU-Redux and KMMLU-Pro benchmarks. Both datasets are publicly released!
The evaluation results are based on the original models, not the quantized models.

32B Reasoning Mode

	EXAONE 4.0 32B	Phi 4 reasoning-plus	Magistral Small-2506	Qwen 3 32B	Qwen 3 235B	DeepSeek R1-0528
Model Size	32.0B	14.7B	23.6B	32.8B	235B	671B
Hybrid Reasoning	✅			✅	✅
World Knowledge
MMLU-Redux	92.3	90.8	86.8	90.9	92.7	93.4
MMLU-Pro	81.8	76.0	73.4	80.0	83.0	85.0
GPQA-Diamond	75.4	68.9	68.2	68.4	71.1	81.0
Math/Coding
AIME 2025	85.3	78.0	62.8	72.9	81.5	87.5
HMMT Feb 2025	72.9	53.6	43.5	50.4	62.5	79.4
LiveCodeBench v5	72.6	51.7	55.8	65.7	70.7	75.2
LiveCodeBench v6	66.7	47.1	47.4	60.1	58.9	70.3
Instruction Following
IFEval	83.7	84.9	37.9	85.0	83.4	80.8
Multi-IF (EN)	73.5	56.1	27.4	73.4	73.4	72.0
Agentic Tool Use
BFCL-v3	63.9	N/A	40.4	70.3	70.8	64.7
Tau-Bench (Airline)	51.5	N/A	38.5	34.5	37.5	53.5
Tau-Bench (Retail)	62.8	N/A	10.2	55.2	58.3	63.9
Multilinguality
KMMLU-Pro	67.7	55.8	51.5	61.4	68.1	71.7
KMMLU-Redux	72.7	62.7	54.6	67.5	74.5	77.0
KSM	87.6	79.8	71.9	82.8	86.2	86.7
MMMLU (ES)	85.6	84.3	68.9	82.8	86.7	88.2
MATH500 (ES)	95.8	94.2	83.5	94.3	95.1	96.0

32B Non-Reasoning Mode

	EXAONE 4.0 32B	Phi 4	Mistral-Small-2506	Gemma3 27B	Qwen3 32B	Qwen3 235B	Llama-4-Maverick	DeepSeek V3-0324
Model Size	32.0B	14.7B	24.0B	27.4B	32.8B	235B	402B	671B
Hybrid Reasoning	✅				✅	✅
World Knowledge
MMLU-Redux	89.8	88.3	85.9	85.0	85.7	89.2	92.3	92.3
MMLU-Pro	77.6	70.4	69.1	67.5	74.4	77.4	80.5	81.2
GPQA-Diamond	63.7	56.1	46.1	42.4	54.6	62.9	69.8	68.4
Math/Coding
AIME 2025	35.9	17.8	30.2	23.8	20.2	24.7	18.0	50.0
HMMT Feb 2025	21.8	4.0	16.9	10.3	9.8	11.9	7.3	29.2
LiveCodeBench v5	43.3	24.6	25.8	27.5	31.3	35.3	43.4	46.7
LiveCodeBench v6	43.1	27.4	26.9	29.7	28.0	31.4	32.7	44.0
Instruction Following
IFEval	84.8	63.0	77.8	82.6	83.2	83.2	85.4	81.2
Multi-IF (EN)	71.6	47.7	63.2	72.1	71.9	72.5	77.9	68.3
Long Context
HELMET	58.3	N/A	61.9	58.3	54.5	63.3	13.7	N/A
RULER	88.2	N/A	71.8	66.0	85.6	90.6	2.9	N/A
LongBench v1	48.1	N/A	51.5	51.5	44.2	45.3	34.7	N/A
Agentic Tool Use
BFCL-v3	65.2	N/A	57.7	N/A	63.0	68.0	52.9	63.8
Tau-Bench (Airline)	25.5	N/A	36.1	N/A	16.0	27.0	38.0	40.5
Tau-Bench (Retail)	55.9	N/A	35.5	N/A	47.6	56.5	6.5	68.5
Multilinguality
KMMLU-Pro	60.0	44.8	51.0	50.7	58.3	64.4	68.8	67.3
KMMLU-Redux	64.8	50.1	53.6	53.3	64.4	71.7	76.9	72.2
KSM	59.8	29.1	35.5	36.1	41.3	46.6	40.6	63.5
Ko-LongBench	76.9	N/A	55.4	72.0	73.9	74.6	65.6	N/A
MMMLU (ES)	80.6	81.2	78.4	78.7	82.1	83.7	86.9	86.7
MATH500 (ES)	87.3	78.2	83.4	86.8	84.7	87.2	78.7	89.2
WMT24++ (ES)	90.7	89.3	92.2	93.1	91.4	92.9	92.7	94.3

1.2B Reasoning Mode

	EXAONE 4.0 1.2B	EXAONE Deep 2.4B	Qwen 3 0.6B	Qwen 3 1.7B	SmolLM 3 3B
Model Size	1.28B	2.41B	596M	1.72B	3.08B
Hybrid Reasoning	✅		✅	✅	✅
World Knowledge
MMLU-Redux	71.5	68.9	55.6	73.9	74.8
MMLU-Pro	59.3	56.4	38.3	57.7	57.8
GPQA-Diamond	52.0	54.3	27.9	40.1	41.7
Math/Coding
AIME 2025	45.2	47.9	15.1	36.8	36.7
HMMT Feb 2025	34.0	27.3	7.0	21.8	26.0
LiveCodeBench v5	44.6	47.2	12.3	33.2	27.6
LiveCodeBench v6	45.3	43.1	16.4	29.9	29.1
Instruction Following
IFEval	67.8	71.0	59.2	72.5	71.2
Multi-IF (EN)	53.9	54.5	37.5	53.5	47.5
Agentic Tool Use
BFCL-v3	52.9	N/A	46.4	56.6	37.1
Tau-Bench (Airline)	20.5	N/A	22.0	31.0	37.0
Tau-Bench (Retail)	28.1	N/A	3.3	6.5	5.4
Multilinguality
KMMLU-Pro	42.7	24.6	21.6	38.3	30.5
KMMLU-Redux	46.9	25.0	24.5	38.0	33.7
KSM	60.6	60.9	22.8	52.9	49.7
MMMLU (ES)	62.4	51.4	48.8	64.5	64.7
MATH500 (ES)	88.8	84.5	70.6	87.9	87.5

1.2B Non-Reasoning Mode

	EXAONE 4.0 1.2B	Qwen 3 0.6B	Gemma 3 1B	Qwen 3 1.7B	SmolLM 3 3B
Model Size	1.28B	596M	1.00B	1.72B	3.08B
Hybrid Reasoning	✅	✅		✅	✅
World Knowledge
MMLU-Redux	66.9	44.6	40.9	63.4	65.0
MMLU-Pro	52.0	26.6	14.7	43.7	43.6
GPQA-Diamond	40.1	22.9	19.2	28.6	35.7
Math/Coding
AIME 2025	23.5	2.6	2.1	9.8	9.3
HMMT Feb 2025	13.0	1.0	1.5	5.1	4.7
LiveCodeBench v5	26.4	3.6	1.8	11.6	11.4
LiveCodeBench v6	30.1	6.9	2.3	16.6	20.6
Instruction Following
IFEval	74.7	54.5	80.2	68.2	76.7
Multi-IF (EN)	62.1	37.5	32.5	51.0	51.9
Long Context
HELMET	41.2	21.1	N/A	33.8	38.6
RULER	77.4	55.1	N/A	65.9	66.3
LongBench v1	36.9	32.4	N/A	41.9	39.9
Agentic Tool Use
BFCL-v3	55.7	44.1	N/A	52.2	47.3
Tau-Bench (Airline)	10.0	31.5	N/A	13.5	38.0
Tau-Bench (Retail)	21.7	5.7	N/A	4.6	6.7
Multilinguality
KMMLU-Pro	37.5	24.6	9.7	29.5	27.6
KMMLU-Redux	40.4	22.8	19.4	29.8	26.4
KSM	26.3	0.1	22.8	16.3	16.1
Ko-LongBench	69.8	16.4	N/A	57.1	15.7
MMMLU (ES)	54.6	39.5	35.9	54.3	55.1
MATH500 (ES)	71.2	38.5	41.2	66.0	62.4
WMT24++ (ES)	65.9	58.2	76.9	76.7	84.0

Usage Guideline

To achieve the expected performance, we recommend using the following configurations:

For non-reasoning mode, we recommend a lower temperature, such as temperature<0.6, for better performance.

For reasoning mode (using a <think> block), we recommend temperature=0.6 and top_p=0.95.

If you encounter model degeneration, we recommend presence_penalty=1.5.

For general conversation in Korean with the 1.2B model, we suggest temperature=0.1 to avoid code switching.
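
The guidelines above can be collected into a small helper, sketched below. The function names are hypothetical. The non-reasoning default of 0.3 is an arbitrary value below 0.6, since the guideline gives only an upper bound, and applying the Korean setting only in non-reasoning mode is likewise an assumption.

```python
def recommended_sampling(reasoning: bool, korean_chat_1_2b: bool = False) -> dict:
    """Return sampling parameters following the guidelines above.

    Assumptions: 0.3 is an arbitrary non-reasoning default below 0.6, and the
    Korean conversation setting is applied only in non-reasoning mode.
    """
    if reasoning:
        return {"temperature": 0.6, "top_p": 0.95}
    if korean_chat_1_2b:
        return {"temperature": 0.1}
    return {"temperature": 0.3}


def with_degeneration_guard(params: dict) -> dict:
    """Add the presence penalty suggested when outputs degenerate."""
    return {**params, "presence_penalty": 1.5}


print(recommended_sampling(reasoning=True))
print(with_degeneration_guard(recommended_sampling(reasoning=False)))
```

The returned dictionaries use OpenAI-style parameter names, so they can be merged into a chat completion request body.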

Limitation

The EXAONE language model has certain limitations and may occasionally generate inappropriate responses. The language model generates responses based on token output probabilities, which are determined during training on the training data. While we have made every effort to exclude personal, harmful, and biased information from the training data, some problematic content may still be included, potentially leading to undesirable responses. Please note that text generated by the EXAONE language model does not reflect the views of LG AI Research.

Inappropriate answers may be generated, which contain personal, harmful or other inappropriate information.
Biased responses may be generated, which are associated with age, gender, race, and so on.
The generated responses rely heavily on statistics from the training data, which can result in the generation of semantically or syntactically incorrect sentences.
Since the model does not reflect the latest information, the responses may be false or contradictory.

LG AI Research strives to reduce potential risks that may arise from EXAONE language models. Users are not allowed to engage in any malicious activities (e.g., keying in illegal information) that may induce the creation of inappropriate outputs violating LG AI's ethical principles when using EXAONE language models.

License

The model is licensed under the EXAONE AI Model License Agreement 1.2 - NC.

The main differences from the previous version are as follows:

We removed the claim of model output ownership from the license.

We restrict use of the model for developing models that compete with EXAONE.

We allow the model to be used for educational purposes, not just research.

Citation

@article{exaone-4.0,
  title={EXAONE 4.0: Unified Large Language Models Integrating Non-reasoning and Reasoning Modes},
  author={{LG AI Research}},
  journal={arXiv preprint arXiv:2507.11407},
  year={2025}
}

Contact

LG AI Research Technical Support: contact_us@lgresearch.ai
