Instructions for using tiiuae/Falcon-H1-34B-Instruct-GGUF with libraries and local apps.

Libraries

How to use tiiuae/Falcon-H1-34B-Instruct-GGUF with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="tiiuae/Falcon-H1-34B-Instruct-GGUF")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("tiiuae/Falcon-H1-34B-Instruct-GGUF", dtype="auto")

llama-cpp-python

How to use tiiuae/Falcon-H1-34B-Instruct-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="tiiuae/Falcon-H1-34B-Instruct-GGUF",
	filename="BF16/Falcon-H1-34B-Instruct-BF16-00001-of-00002.gguf",
)

llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)
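
create_chat_completion returns an OpenAI-style response dictionary. The sketch below shows one way to pull the assistant text out of it, assuming the usual choices[0]["message"]["content"] layout. The helper name extract_reply is hypothetical, and the sample dictionary is illustrative rather than real model output.

```python
from typing import Any


def extract_reply(response: dict[str, Any]) -> str:
    """Return the assistant text from an OpenAI-style chat completion dict."""
    choices = response.get("choices") or []
    if not choices:
        raise ValueError("response contains no choices")
    # Each choice carries a message with a role and the generated content.
    return choices[0]["message"]["content"]


# Illustrative response shaped like a chat completion (not real model output).
sample = {
    "choices": [
        {"index": 0, "message": {"role": "assistant", "content": "Paris."}}
    ]
}
print(extract_reply(sample))
```

With a loaded model, the same helper could be applied directly to the return value of llm.create_chat_completion(...).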

Local Apps

llama.cpp

How to use tiiuae/Falcon-H1-34B-Instruct-GGUF with llama.cpp:

Install (macOS, Linux)

curl -LsSf https://llama.app/install.sh | sh
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf tiiuae/Falcon-H1-34B-Instruct-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama cli -hf tiiuae/Falcon-H1-34B-Instruct-GGUF:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf tiiuae/Falcon-H1-34B-Instruct-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama cli -hf tiiuae/Falcon-H1-34B-Instruct-GGUF:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf tiiuae/Falcon-H1-34B-Instruct-GGUF:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf tiiuae/Falcon-H1-34B-Instruct-GGUF:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf tiiuae/Falcon-H1-34B-Instruct-GGUF:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf tiiuae/Falcon-H1-34B-Instruct-GGUF:Q4_K_M

Use Docker

docker model run hf.co/tiiuae/Falcon-H1-34B-Instruct-GGUF:Q4_K_M

vLLM

How to use tiiuae/Falcon-H1-34B-Instruct-GGUF with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "tiiuae/Falcon-H1-34B-Instruct-GGUF"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "tiiuae/Falcon-H1-34B-Instruct-GGUF",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
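
The same request can be issued from Python. This is a minimal sketch using only the standard library, assuming the server exposes the OpenAI-compatible /v1/chat/completions route on the port used in the curl call above. The function names build_chat_request and chat are hypothetical helpers, not part of vLLM.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for an OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url: str, model: str, prompt: str) -> str:
    """Send the request to a running server and return the assistant's reply."""
    request = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Example call (requires the server started above to be running):
# chat("http://localhost:8000", "tiiuae/Falcon-H1-34B-Instruct-GGUF",
#      "What is the capital of France?")
```

Because only the base URL changes, the same helper should also work against the other OpenAI-compatible servers on this page, for example port 30000 for SGLang.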

Use Docker

docker model run hf.co/tiiuae/Falcon-H1-34B-Instruct-GGUF:Q4_K_M

SGLang

How to use tiiuae/Falcon-H1-34B-Instruct-GGUF with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "tiiuae/Falcon-H1-34B-Instruct-GGUF" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "tiiuae/Falcon-H1-34B-Instruct-GGUF",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "tiiuae/Falcon-H1-34B-Instruct-GGUF" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "tiiuae/Falcon-H1-34B-Instruct-GGUF",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Ollama
How to use tiiuae/Falcon-H1-34B-Instruct-GGUF with Ollama:
```
ollama run hf.co/tiiuae/Falcon-H1-34B-Instruct-GGUF:Q4_K_M
```

Unsloth Studio

How to use tiiuae/Falcon-H1-34B-Instruct-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for tiiuae/Falcon-H1-34B-Instruct-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for tiiuae/Falcon-H1-34B-Instruct-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for tiiuae/Falcon-H1-34B-Instruct-GGUF to start chatting

Pi

How to use tiiuae/Falcon-H1-34B-Instruct-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf tiiuae/Falcon-H1-34B-Instruct-GGUF:Q4_K_M

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "tiiuae/Falcon-H1-34B-Instruct-GGUF:Q4_K_M"
        }
      ]
    }
  }
}
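
If ~/.pi/agent/models.json already lists other providers, pasting the block above over the file would drop them. The sketch below merges the llama-cpp entry into an existing file instead. It assumes the JSON layout shown above is the full schema Pi expects; merge_provider and update_models_file are hypothetical helper names.

```python
import json
from pathlib import Path


def merge_provider(config: dict, base_url: str, model_id: str) -> dict:
    """Return a copy of config with a llama-cpp provider entry added or replaced."""
    merged = dict(config)
    providers = dict(merged.get("providers", {}))
    # Field names mirror the models.json example above.
    providers["llama-cpp"] = {
        "baseUrl": base_url,
        "api": "openai-completions",
        "apiKey": "none",
        "models": [{"id": model_id}],
    }
    merged["providers"] = providers
    return merged


def update_models_file(path: Path, base_url: str, model_id: str) -> None:
    """Read models.json if it exists, merge the provider, and write it back."""
    existing = json.loads(path.read_text()) if path.exists() else {}
    path.parent.mkdir(parents=True, exist_ok=True)
    merged = merge_provider(existing, base_url, model_id)
    path.write_text(json.dumps(merged, indent=2) + "\n")


# Example call (writes to the path named above):
# update_models_file(Path.home() / ".pi" / "agent" / "models.json",
#                    "http://localhost:8080/v1",
#                    "tiiuae/Falcon-H1-34B-Instruct-GGUF:Q4_K_M")
```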

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use tiiuae/Falcon-H1-34B-Instruct-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf tiiuae/Falcon-H1-34B-Instruct-GGUF:Q4_K_M

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default tiiuae/Falcon-H1-34B-Instruct-GGUF:Q4_K_M

Run Hermes

hermes

OpenClaw

How to use tiiuae/Falcon-H1-34B-Instruct-GGUF with OpenClaw:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf tiiuae/Falcon-H1-34B-Instruct-GGUF:Q4_K_M

Configure OpenClaw

# Install OpenClaw:
npm install -g openclaw@latest
# Register the local server and set it as the default model:
openclaw onboard --non-interactive --mode local \
  --auth-choice custom-api-key \
  --custom-base-url http://127.0.0.1:8080/v1 \
  --custom-model-id "tiiuae/Falcon-H1-34B-Instruct-GGUF:Q4_K_M" \
  --custom-provider-id llama-cpp \
  --custom-compatibility openai \
  --custom-text-input \
  --accept-risk \
  --skip-health

Run OpenClaw

openclaw agent --local --agent main --message "Hello from Hugging Face"

Docker Model Runner
How to use tiiuae/Falcon-H1-34B-Instruct-GGUF with Docker Model Runner:
```
docker model run hf.co/tiiuae/Falcon-H1-34B-Instruct-GGUF:Q4_K_M
```

Lemonade

How to use tiiuae/Falcon-H1-34B-Instruct-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull tiiuae/Falcon-H1-34B-Instruct-GGUF:Q4_K_M

Run and chat with the model

lemonade run user.Falcon-H1-34B-Instruct-GGUF-Q4_K_M

List all available models

lemonade list

Falcon-H1-34B-Instruct-GGUF / README.md

ibrahim-khadraoui-TII

Update README.md

4ea1d41 verified about 1 year ago

	---
	library_name: transformers
	tags:
	- falcon-h1
	license: other
	license_name: falcon-llm-license
	license_link: https://falconllm.tii.ae/falcon-terms-and-conditions.html
	base_model: tiiuae/Falcon-H1-34B-Instruct
	inference: true
	---

	<img src="https://huggingface.co/datasets/tiiuae/documentation-images/resolve/main/falcon_mamba/falcon-h1-logo.png" alt="drawing" width="800"/>

	# Table of Contents

	0. [TL;DR](#TL;DR)
	1. [Model Details](#model-details)
	2. [Training Details](#training-details)
	3. [Usage](#usage)
	4. [Evaluation](#evaluation)
	5. [Citation](#citation)

	# TL;DR

	# Model Details

	## Model Description

	- Developed by: [https://www.tii.ae](https://www.tii.ae)
	- Model type: Causal decoder-only
	- Architecture: Hybrid Transformers + Mamba architecture
	- Language(s) (NLP): English, Multilingual
	- License: Falcon-LLM License

	# Training Details

	For more details about the training protocol of this model, please refer to the [Falcon-H1 technical blogpost](https://falcon-lm.github.io/blog/falcon-h1/).

	# Usage

	Currently, you can run this model with Hugging Face `transformers`, `vLLM`, or our custom fork of the `llama.cpp` library.

	## Inference

	Make sure to install the latest version of `transformers` or `vllm`; if necessary, install these packages from source:

	```bash
	pip install git+https://github.com/huggingface/transformers.git
	```

	Refer to [the official vLLM documentation for more details on building vLLM from source](https://docs.vllm.ai/en/latest/getting_started/installation/gpu.html#build-wheel-from-source).

	### 🤗 transformers

	Refer to the snippet below to run H1 models using 🤗 transformers:

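	```python
	import torch
	from transformers import AutoModelForCausalLM, AutoTokenizer

	model_id = "tiiuae/Falcon-H1-1B-Base"

	# Load the tokenizer and the model in bfloat16, placing weights automatically.
	tokenizer = AutoTokenizer.from_pretrained(model_id)
	model = AutoModelForCausalLM.from_pretrained(
	    model_id,
	    torch_dtype=torch.bfloat16,
	    device_map="auto",
	)

	# Perform text generation (the prompt and token budget are illustrative).
	inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
	outputs = model.generate(**inputs, max_new_tokens=32)
	print(tokenizer.decode(outputs[0], skip_special_tokens=True))
	```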
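
The snippet above targets a base checkpoint. For an instruction-tuned checkpoint such as `tiiuae/Falcon-H1-34B-Instruct` (the `base_model` listed in the metadata), chat messages are normally formatted with the tokenizer's chat template. The following is a minimal sketch, assuming the checkpoint's tokenizer ships a chat template; `build_messages` and `generate_reply` are hypothetical helper names.

```python
def build_messages(user_prompt, system_prompt=None):
    """Build a chat message list in the role/content format used by chat templates."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_prompt})
    return messages


def generate_reply(model_id, user_prompt, max_new_tokens=128):
    """Load the model, apply the chat template, and return only the new text."""
    # Imported here so build_messages stays usable without these packages.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    )
    inputs = tokenizer.apply_chat_template(
        build_messages(user_prompt),
        add_generation_prompt=True,
        return_tensors="pt",
        return_dict=True,
    ).to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Drop the prompt tokens so only the assistant's reply is decoded.
    new_tokens = output_ids[0][inputs["input_ids"].shape[-1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)


print(build_messages("Who are you?"))
```

Calling `generate_reply("tiiuae/Falcon-H1-34B-Instruct", "Who are you?")` downloads the full model, so it is left out of the snippet itself.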

	### vLLM

	For vLLM, simply start a server by executing the command below:

	```
	# pip install vllm
	vllm serve tiiuae/Falcon-H1-1B-Instruct --tensor-parallel-size 2 --data-parallel-size 1
	```

	### 🦙 llama.cpp

	While we work on integrating our architecture directly into the `llama.cpp` library, you can install our fork and use it directly: https://github.com/tiiuae/llama.cpp-Falcon-H1
	Follow the same installation instructions as for `llama.cpp`.

	# Evaluation

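	The Falcon-H1 series performs well across a range of tasks, including reasoning. The table below compares Falcon-H1-34B with other open models on general, math, science, code, and instruction-following benchmarks.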

	| Tasks | Falcon-H1-34B | Qwen3-32B | Qwen2.5-72B | Qwen2.5-32B | Gemma3-27B | Llama3.3-70B | Llama4-scout |
	| --- | --- | --- | --- | --- | --- | --- | --- |
	| General | | | | | | | |
	| BBH | 70.68 | 62.47 | 72.52 | 68.72 | 67.28 | 69.15 | 64.9 |
	| ARC-C | 61.01 | 48.98 | 46.59 | 44.54 | 54.52 | 63.65 | 56.14 |
	| TruthfulQA | 65.27 | 58.58 | 69.8 | 70.28 | 64.26 | 66.15 | 62.74 |
	| HellaSwag | 81.94 | 68.89 | 68.79 | 73.95 | 57.25 | 70.24 | 65.03 |
	| MMLU | 84.05 | 80.89 | 84.42 | 82.8 | 78.01 | 82.08 | 80.4 |
	| Math | | | | | | | |
	| GSM8k | 83.62 | 88.78 | 82.26 | 78.47 | 90.37 | 93.71 | 90.37 |
	| MATH-500 | 83.8 | 82.0 | 83.6 | 82.2 | 90.0 | 70.6 | 83.2 |
	| AMC-23 | 69.38 | 67.34 | 67.34 | 68.75 | 77.81 | 39.38 | 69.06 |
	| AIME-24 | 23.75 | 27.71 | 17.29 | 17.92 | 27.5 | 12.92 | 27.92 |
	| AIME-25 | 16.67 | 19.79 | 15.21 | 11.46 | 22.71 | 1.25 | 8.96 |
	| Science | | | | | | | |
	| GPQA | 41.53 | 30.2 | 37.67 | 34.31 | 36.49 | 31.99 | 31.8 |
	| GPQA_Diamond | 49.66 | 49.49 | 44.95 | 40.74 | 47.47 | 42.09 | 51.18 |
	| MMLU-Pro | 58.73 | 54.68 | 56.35 | 56.63 | 47.81 | 53.29 | 55.58 |
	| MMLU-stem | 83.57 | 81.64 | 82.59 | 82.37 | 73.55 | 74.88 | 75.2 |
	| Code | | | | | | | |
	| HumanEval | 87.2 | 90.85 | 87.2 | 90.24 | 86.59 | 83.53 | 85.4 |
	| HumanEval+ | 81.71 | 85.37 | 80.49 | 82.32 | 78.05 | 79.87 | 78.7 |
	| MBPP | 83.86 | 86.24 | 89.68 | 87.83 | 88.36 | 88.09 | 81.5 |
	| MBPP+ | 71.43 | 71.96 | 75.4 | 74.07 | 74.07 | 73.81 | 64.8 |
	| LiveCodeBench | 49.71 | 45.01 | 54.6 | 49.12 | 39.53 | 40.31 | 40.12 |
	| CRUXEval | 73.07 | 78.45 | 75.63 | 73.5 | 74.82 | 69.53 | 68.32 |
	| Instruction Following | | | | | | | |
	| IFEval | 89.37 | 86.97 | 86.35 | 81.79 | 83.19 | 89.94 | 86.32 |
	| Alpaca-Eval | 48.32 | 64.21 | 49.29 | 39.26 | 56.16 | 38.27 | 36.26 |
	| MTBench | 9.2 | 9.05 | 9.16 | 9.09 | 8.75 | 8.98 | 8.98 |
	| LiveBench | 46.26 | 63.05 | 54.03 | 52.92 | 55.41 | 53.11 | 54.21 |

	See [our release blogpost](https://falcon-lm.github.io/blog/falcon-h1/) for more detailed benchmarks.
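
To make the comparison concrete, the sketch below takes the four Science rows from the table above and reports which model has the top score on each. The scores are copied from the table; the helper name `leader` is illustrative.

```python
MODELS = [
    "Falcon-H1-34B", "Qwen3-32B", "Qwen2.5-72B", "Qwen2.5-32B",
    "Gemma3-27B", "Llama3.3-70B", "Llama4-scout",
]

# Science rows from the evaluation table, in the same column order as MODELS.
SCIENCE = {
    "GPQA": [41.53, 30.2, 37.67, 34.31, 36.49, 31.99, 31.8],
    "GPQA_Diamond": [49.66, 49.49, 44.95, 40.74, 47.47, 42.09, 51.18],
    "MMLU-Pro": [58.73, 54.68, 56.35, 56.63, 47.81, 53.29, 55.58],
    "MMLU-stem": [83.57, 81.64, 82.59, 82.37, 73.55, 74.88, 75.2],
}


def leader(scores):
    """Return the name of the model with the highest score in a row."""
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return MODELS[best_index]


leaders = {task: leader(scores) for task, scores in SCIENCE.items()}
for task, name in leaders.items():
    print(f"{task}: {name}")
```

On these rows Falcon-H1-34B has the highest score on GPQA, MMLU-Pro, and MMLU-stem, while Llama4-scout leads on GPQA_Diamond.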

	# Useful links

	- View [our release blogpost](https://falcon-lm.github.io/blog/falcon-h1/).
	- Feel free to join [our discord server](https://discord.gg/trwMYP9PYm) if you have any questions or to interact with our researchers and developers.

	# Citation

	If the Falcon-H1 family of models was helpful to your work, please cite us.

	```
	@misc{tiifalconh1,
	    title = {Falcon-H1: A Family of Hybrid-Head Language Models Redefining Efficiency and Performance},
	    url = {https://falcon-lm.github.io/blog/falcon-h1},
	    author = {Falcon-LLM Team},
	    month = {May},
	    year = {2025}
	}
	```