Instructions to use meta-llama/Llama-4-Scout-17B-16E-Instruct with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use meta-llama/Llama-4-Scout-17B-16E-Instruct with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="meta-llama/Llama-4-Scout-17B-16E-Instruct")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("meta-llama/Llama-4-Scout-17B-16E-Instruct")
model = AutoModelForMultimodalLM.from_pretrained("meta-llama/Llama-4-Scout-17B-16E-Instruct")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))


vLLM

How to use meta-llama/Llama-4-Scout-17B-16E-Instruct with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "meta-llama/Llama-4-Scout-17B-16E-Instruct"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
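
The same request can also be sent from Python. The following is a minimal sketch using only the standard library, assuming the vLLM server started above is listening on localhost:8000. The helper names `build_chat_request` and `send_chat_request` are hypothetical and not part of vLLM.

```python
import json
import urllib.request

MODEL_ID = "meta-llama/Llama-4-Scout-17B-16E-Instruct"


def build_chat_request(prompt, image_url, model=MODEL_ID):
    """Build the same JSON body as the curl example (hypothetical helper)."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


def send_chat_request(payload, base_url="http://localhost:8000"):
    """POST the payload to the OpenAI-compatible chat completions endpoint.

    ASSUMPTION: a server is already running at `base_url`.
    """
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Usage (requires the running server, so it is left commented out):
# payload = build_chat_request(
#     "Describe this image in one sentence.",
#     "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg",
# )
# reply = send_chat_request(payload)
# print(reply["choices"][0]["message"]["content"])
```

For the SGLang server described further down, the same sketch should work with `base_url="http://localhost:30000"`, since both servers expose an OpenAI-compatible route.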

Use Docker

docker model run hf.co/meta-llama/Llama-4-Scout-17B-16E-Instruct

SGLang

How to use meta-llama/Llama-4-Scout-17B-16E-Instruct with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "meta-llama/Llama-4-Scout-17B-16E-Instruct" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "meta-llama/Llama-4-Scout-17B-16E-Instruct" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Docker Model Runner
How to use meta-llama/Llama-4-Scout-17B-16E-Instruct with Docker Model Runner:
```
docker model run hf.co/meta-llama/Llama-4-Scout-17B-16E-Instruct
```

Does Llama 4 use chunked attention in the generation phase?

#64

by vanshils - opened Apr 15, 2025

Discussion

vanshils

Apr 15, 2025

Same as the title.
I know the chunked attention mask is applied in the context (prefill) phase. But does Llama 4 also apply the chunked attention mask in the generation phase?

ArthurZ

May 20, 2025

yes

dbl0207

Aug 31, 2025

Yes, Llama 4 applies chunked attention during the generation phase, but only on specific layers. This is part of the "iRoPE" architecture, which is designed to support a very large context length (10 million tokens) while keeping memory use manageable.

dbl0207

Aug 31, 2025

Here's how it works:

Interleaved architecture: Llama 4 interleaves two types of attention layers:

RoPE layers: These layers apply a chunked attention mask, so each token attends only to earlier tokens in its own fixed-size chunk (e.g., 8K tokens). During the generation phase, the KV cache for these layers can also be kept at a fixed size, holding keys and values for the current chunk only, because earlier chunks are never attended to again.

NoPE layers: These layers have no positional encoding and use a full causal mask, so they can attend to the entire context history. This is critical for long-range reasoning.

Memory efficiency: By applying chunked attention on most layers, Llama 4 avoids most of the KV-cache growth that long context windows normally cause during generation; only the NoPE layers' caches keep growing with the context length. This lowers the memory needed to run very long contexts on commercially available GPUs.
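
To make the two mask types concrete, here is a small pure-Python sketch. It assumes that "chunked" means a causal mask restricted to the query's own chunk (a block-diagonal pattern); the function names and the toy sizes are illustrative and not taken from the Llama 4 code.

```python
def full_causal_mask(seq_len):
    """NoPE-style mask: query q may attend to every key k <= q."""
    return [[k <= q for k in range(seq_len)] for q in range(seq_len)]


def chunked_causal_mask(seq_len, chunk_size):
    """RoPE-style mask: causal, and limited to keys in the query's own chunk."""
    return [
        [k <= q and k // chunk_size == q // chunk_size for k in range(seq_len)]
        for q in range(seq_len)
    ]


def render(mask):
    """Draw a mask as text: 'x' = may attend, '.' = masked out."""
    return "\n".join("".join("x" if v else "." for v in row) for row in mask)


# Toy sizes for readability; the thread mentions chunks of roughly 8K tokens.
print(render(chunked_causal_mask(seq_len=6, chunk_size=3)))
print()
print(render(full_causal_mask(seq_len=6)))
```

In the chunked mask, position 3 sees only itself because it starts a new chunk, whereas under the full causal mask it sees positions 0 through 3.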

Balancing efficiency and performance: The interleaved design is a compromise. The NoPE layers handle the long-range context, while the chunked RoPE layers provide local, high-fidelity attention at a lower cost. This lets the model handle extremely long sequences without a matching increase in hardware requirements.
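
A rough back-of-the-envelope sketch of that trade-off follows. It counts cached token positions summed over layers, assuming every fourth layer is a NoPE layer and the rest are chunked. The layer count, chunk size, and interval below are illustrative assumptions, not values stated in this thread.

```python
def kv_cache_slots(context_len, num_layers, chunk_size, nope_interval):
    """Upper bound on cached token positions, summed over all layers.

    ASSUMPTION: every `nope_interval`-th layer is a NoPE (full attention)
    layer; all other layers are chunked RoPE layers.
    """
    total = 0
    for layer_idx in range(num_layers):
        is_nope = (layer_idx + 1) % nope_interval == 0
        if is_nope:
            total += context_len  # grows with the whole context
        else:
            total += min(context_len, chunk_size)  # capped at one chunk
    return total


# Illustrative values only (assumptions, not taken from the thread).
NUM_LAYERS = 48
CHUNK_SIZE = 8192
NOPE_INTERVAL = 4

for context_len in (8_192, 131_072, 1_000_000):
    hybrid = kv_cache_slots(context_len, NUM_LAYERS, CHUNK_SIZE, NOPE_INTERVAL)
    full = context_len * NUM_LAYERS
    print(f"{context_len:>9} tokens: hybrid/full cache ratio = {hybrid / full:.3f}")
```

As the context grows, the ratio approaches the fraction of NoPE layers (1/4 under these assumptions), because the chunked layers' contribution stays constant.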

In summary, attention during generation is not uniformly chunked in Llama 4: the RoPE layers use chunked attention and the NoPE layers use full causal attention. This design is a key reason the model can handle long-context tasks with relatively modest resource requirements.
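
For the original question, a single decoding step can be sketched as follows: for the token being generated at a given position, each layer type exposes a different range of earlier keys. The interleaving rule (every `nope_interval`-th layer is full attention) and the helper names are assumptions made for illustration.

```python
def layer_kinds(num_layers, nope_interval):
    """Label each layer 'full' (NoPE) or 'chunked' (RoPE).

    ASSUMPTION: every `nope_interval`-th layer uses full attention.
    """
    return [
        "full" if (i + 1) % nope_interval == 0 else "chunked"
        for i in range(num_layers)
    ]


def visible_keys(position, kind, chunk_size):
    """Key positions the token generated at `position` may attend to."""
    start = 0 if kind == "full" else (position // chunk_size) * chunk_size
    return range(start, position + 1)


# Toy sizes: 4 layers, chunks of 4 tokens, generating the token at position 10.
position = 10
for idx, kind in enumerate(layer_kinds(num_layers=4, nope_interval=4)):
    keys = visible_keys(position, kind, chunk_size=4)
    print(f"layer {idx} ({kind}): keys {keys.start}..{keys.stop - 1}")
```

Under these toy settings the chunked layers see only keys 8 through 10 (the current chunk), while the full-attention layer sees keys 0 through 10.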
