Instructions to use meta-llama/Llama-4-Scout-17B-16E-Instruct with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use meta-llama/Llama-4-Scout-17B-16E-Instruct with Transformers:

```
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="meta-llama/Llama-4-Scout-17B-16E-Instruct")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)
```

```
# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("meta-llama/Llama-4-Scout-17B-16E-Instruct")
model = AutoModelForMultimodalLM.from_pretrained("meta-llama/Llama-4-Scout-17B-16E-Instruct")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```

Local Apps

vLLM

How to use meta-llama/Llama-4-Scout-17B-16E-Instruct with vLLM:

Install from pip and serve model

```
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "meta-llama/Llama-4-Scout-17B-16E-Instruct"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "Describe this image in one sentence."
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
                        }
                    }
                ]
            }
        ]
    }'
```
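The same request can also be sent from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_payload` and `chat` are hypothetical, and it assumes a server is already listening at the base URL you pass in and returns the usual OpenAI-style `choices[0].message.content` field.

```python
import json
import urllib.request


def build_chat_payload(model: str, prompt: str, image_url: str) -> dict:
    """Build the same JSON body that the curl example sends."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


def chat(base_url: str, payload: dict, timeout: float = 120.0) -> str:
    """POST the payload to /v1/chat/completions and return the reply text."""
    request = urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # Assumption: the server follows the OpenAI response layout.
    return body["choices"][0]["message"]["content"]
```

With the server from the commands above running, `chat("http://localhost:8000", build_chat_payload("meta-llama/Llama-4-Scout-17B-16E-Instruct", "Describe this image in one sentence.", image_url))` would issue the equivalent call, where `image_url` is any publicly reachable image.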

Use Docker

```
docker model run hf.co/meta-llama/Llama-4-Scout-17B-16E-Instruct
```

SGLang

How to use meta-llama/Llama-4-Scout-17B-16E-Instruct with SGLang:

Install from pip and serve model

```
# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "meta-llama/Llama-4-Scout-17B-16E-Instruct" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "Describe this image in one sentence."
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
                        }
                    }
                ]
            }
        ]
    }'
```

Use Docker images

```
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "meta-llama/Llama-4-Scout-17B-16E-Instruct" \
        --host 0.0.0.0 \
        --port 30000
```

Docker Model Runner

How to use meta-llama/Llama-4-Scout-17B-16E-Instruct with Docker Model Runner:

```
docker model run hf.co/meta-llama/Llama-4-Scout-17B-16E-Instruct
```

13B and 34B Pleeease!!! Most people cannot even run this.

#52

by UniversalLove333 - opened Apr 9, 2025

Discussion

UniversalLove333

Apr 9, 2025

This was sooo disappointing. 😢

MrDevolver

Apr 9, 2025

They don't really care about us...

Doctor-Chad-PhD

Apr 9, 2025

13B Llama 4 would be amazing. Even an 8B upgrade would be so nice.

phil111

Apr 9, 2025

@Doctor-Chad-PhD While the speed of 8B LLMs is great, 8B is just not enough parameters to make a usable general-purpose AI model.

By far the most broadly capable and knowledgeable ~8b English LLMs are Llama 3.1 8b and Gemma 2 9b, yet they score below 10 on SimpleQA and only 69/100 on my easy popular English knowledge test. Plus, the stories they write when prompted are reliably filled with blatant contradictions of both the prompt and what they have already written, even at lower temperatures.

So what Meta did here makes a lot of sense. 70b+ parameters are absolutely essential for a general-purpose multimodal AI model, and 17b active parameters can run at a reasonable 4+ tokens/second on entry-level 8-core AMD and Intel CPUs, and eventually GPUs will have more RAM. It's much cheaper and more power-efficient to increase RAM than compute.

What's the alternative? Other model families tanked their general knowledge and abilities while trying to boost their coding, math, and other STEM scores. A perfect example is Qwen2.5 72b. Its predecessor, Qwen2 72b, scored 85.9/100 on my easy broad knowledge test and did nearly as well as Llama 3.1 70b on SimpleQA, yet Qwen2.5 72b lost a full 3.5 generations of broad English knowledge (9-10 on SimpleQA and 68.4/100 on my test) in order to make small gains on coding, math, and other STEM tests that were barely discernible in real-world use cases. This may have fooled coding-obsessed early adopters into thinking Qwen2.5 was an improvement, but as general-purpose English AI models the Qwen2.5 family is astonishingly bad.

In short, going much smaller really isn't an option unless you're willing to trade a notable amount of broad knowledge and abilities for relatively small domain-specific gains (e.g. coding and math).
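As a rough illustration of the "17b active parameters at 4+ tokens/second" point above, here is a back-of-envelope sketch. It assumes decoding is limited purely by memory bandwidth and that every active weight is read once per token; the 4-bit weight size and the 50 GB/s bandwidth figure are hypothetical inputs, not measurements from the thread.

```python
def tokens_per_second(active_params: float, bytes_per_param: float,
                      bandwidth_bytes_per_s: float) -> float:
    """Upper-bound decode speed if each active weight is read once per token."""
    bytes_per_token = active_params * bytes_per_param
    return bandwidth_bytes_per_s / bytes_per_token


# Hypothetical inputs, not measurements:
BYTES_PER_PARAM = 0.5   # assumes roughly 4-bit quantized weights
BANDWIDTH = 50e9        # assumes about 50 GB/s of usable CPU memory bandwidth

moe_rate = tokens_per_second(17e9, BYTES_PER_PARAM, BANDWIDTH)    # 17b active (MoE)
dense_rate = tokens_per_second(70e9, BYTES_PER_PARAM, BANDWIDTH)  # 70b dense

print(f"17b active: {moe_rate:.1f} tokens/s")   # prints 5.9
print(f"70b dense:  {dense_rate:.1f} tokens/s")  # prints 1.4
```

Real throughput will be lower once compute, cache behaviour, and expert routing are taken into account, so treat the numbers as an order-of-magnitude guide only.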

LLaMA-lover

Apr 9, 2025

@UniversalLove333 You can easily run a GGUF version of this, and even when it spills to swap it will still be faster than dense models less than half its file size that don't spill to swap.

Stop claiming that it doesn't run on most systems.
It's a MoE, don't forget that. Have you ever run a MoE and compared its speed to a dense model?
Also, you are wrong about "most people cannot even run this". I could run a 2-trillion-parameter model on an Intel Celeron laptop if I wanted to, by running the LLM from disk/swap (an extreme example, I know, but I mention it to prove a point). So it's not the "run-ability" that matters; it's the speed you get when you run it, and it is faster than dense LLMs less than half its size (Gemma 3 27B Q4 QAT [16GB] runs slower than LLaMA 4 Scout Q2_K_XL [42.6GB] on a system with 8GB VRAM and 32GB system RAM [40GB total memory]).
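The arithmetic behind that comparison can be sketched as follows. The file sizes and memory figures are the ones quoted above; the 17/109 active fraction assumes Scout has about 109B total parameters with 17B active per token, and the sketch further assumes weights are read in proportion to that fraction, which ignores shared layers and the cost of swap.

```python
def overflow_gb(file_size_gb: float, memory_gb: float) -> float:
    """How much of the model file does not fit in the given memory."""
    return max(0.0, file_size_gb - memory_gb)


def read_per_token_gb(file_size_gb: float, active_fraction: float) -> float:
    """Approximate weight data touched to produce one token."""
    return file_size_gb * active_fraction


GEMMA_FILE_GB = 16.0     # Gemma 3 27B Q4 QAT, from the post
SCOUT_FILE_GB = 42.6     # LLaMA 4 Scout Q2_K_XL, from the post
VRAM_GB = 8.0
TOTAL_MEMORY_GB = 40.0   # 8GB VRAM + 32GB system RAM

SCOUT_ACTIVE_FRACTION = 17 / 109  # assumption: 17B active of ~109B total

gemma_offload = overflow_gb(GEMMA_FILE_GB, VRAM_GB)        # must live in system RAM
scout_spill = overflow_gb(SCOUT_FILE_GB, TOTAL_MEMORY_GB)  # spills to swap
gemma_read = read_per_token_gb(GEMMA_FILE_GB, 1.0)         # dense: all weights
scout_read = read_per_token_gb(SCOUT_FILE_GB, SCOUT_ACTIVE_FRACTION)

print(f"Gemma outside VRAM: {gemma_offload:.1f} GB, read per token: {gemma_read:.1f} GB")
print(f"Scout in swap:      {scout_spill:.1f} GB, read per token: {scout_read:.1f} GB")
```

Under these assumptions Scout touches roughly 6.6 GB per token against Gemma's 16 GB, which is consistent with the claim that the larger MoE file can still decode faster even though 2.6 GB of it spills to swap.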
