Does Llama 4 use chunked attention in the generation phase?
Same as title.
I know the chunked attention mask is applied in the context (prefill) phase. But does Llama 4 also apply a chunked attention mask in the generation phase?
yes
Yes, Llama 4 applies chunked attention during the generation phase, but only on specific layers. This is part of the "iRoPE" architecture, which supports an extremely large context length of 10 million tokens while keeping memory use manageable.
Here's how it works:
Interleaved architecture: The Llama 4 model interleaves two different types of attention layers:
RoPE layers: These layers apply a chunked attention mask, meaning each token can attend only to earlier tokens within its own fixed-size chunk (e.g., 8K tokens), not to the whole history. During the generation phase, the KV cache for these layers is likewise bounded in size and stores keys and values only for the current chunk.
NoPE layers: These layers have no positional encoding and use a full causal mask, allowing them to access the entire context history. This is critical for long-range reasoning.
Memory efficiency: By applying chunked attention on most layers, Llama 4 avoids the massive memory growth that typically occurs with long context windows during the generation phase. This makes it possible to run models with enormous context lengths on commercially available GPUs.
Balancing efficiency and performance: The interleaved design is a compromise. The NoPE layers handle the long-range context, while the chunked RoPE layers provide local, high-fidelity attention more efficiently. This gives the model the capability to handle extremely long sequences without a massive increase in hardware requirements.
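To make the difference between the two layer types concrete, here is a minimal sketch of which key positions a newly generated token can see. It is an illustration of the masking rule described above, not the actual Llama 4 code: the function names, the toy chunk size of 4, and the assumption that every 4th layer is a NoPE layer are all hypothetical choices for the example.

```python
def visible_keys(query_pos: int, chunk_size: int, chunked: bool) -> range:
    """Key positions a query at `query_pos` may attend to under a causal mask."""
    if chunked:
        # Chunked causal mask: only keys in the same chunk, up to the query itself.
        chunk_start = (query_pos // chunk_size) * chunk_size
        return range(chunk_start, query_pos + 1)
    # Full causal mask: every key from the start up to the query itself.
    return range(0, query_pos + 1)


def is_nope_layer(layer_idx: int, nope_interval: int) -> bool:
    """Assumed interleaving: every `nope_interval`-th layer is a NoPE layer."""
    return (layer_idx + 1) % nope_interval == 0


if __name__ == "__main__":
    chunk_size = 4      # toy value; the answer above mentions chunks of about 8K tokens
    nope_interval = 4   # assumption for illustration
    new_token_pos = 9   # position of the token being generated

    for layer_idx in range(4):
        nope = is_nope_layer(layer_idx, nope_interval)
        keys = visible_keys(new_token_pos, chunk_size, chunked=not nope)
        kind = "NoPE (full causal)" if nope else "RoPE (chunked)"
        print(f"layer {layer_idx}: {kind:20s} attends to keys {list(keys)}")
```

With these toy values, the chunked layers see only positions 8 and 9 when generating the token at position 9, while the NoPE layer sees positions 0 through 9.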
In summary, Llama 4's approach to attention during generation is not uniformly chunked. Instead, it strategically uses chunked attention on some layers and full causal attention on others. This innovative design is a key reason for its high performance on long-context tasks with relatively modest resource requirements.
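The memory argument can be checked with some rough arithmetic. The sketch below counts how many token positions each layer would need to keep in its KV cache during generation, taking the description above at face value (chunked layers keep only the current chunk, NoPE layers keep everything). The layer count, chunk size, and NoPE interval are assumed values for illustration, and the count ignores head dimensions and data types, which scale both cases equally.

```python
def cached_tokens(seq_len: int, chunk_size: int, chunked: bool) -> int:
    """Number of token positions one layer keeps in its KV cache."""
    if seq_len == 0:
        return 0
    if chunked:
        # Only the current (possibly partial) chunk is kept.
        return (seq_len - 1) % chunk_size + 1
    # Full causal attention needs keys and values for the whole history.
    return seq_len


def total_cached_tokens(seq_len: int, num_layers: int,
                        chunk_size: int, nope_interval: int) -> int:
    """Sum of cached positions over all layers for the assumed interleaving."""
    total = 0
    for layer_idx in range(num_layers):
        # Assumption: every `nope_interval`-th layer is NoPE (full causal).
        chunked = (layer_idx + 1) % nope_interval != 0
        total += cached_tokens(seq_len, chunk_size, chunked)
    return total


if __name__ == "__main__":
    seq_len = 1_000_000   # tokens generated or read so far
    num_layers = 48       # assumed layer count
    chunk_size = 8192     # assumed chunk size ("8K tokens")
    nope_interval = 4     # assumed: 1 in 4 layers is NoPE

    interleaved = total_cached_tokens(seq_len, num_layers, chunk_size, nope_interval)
    all_full = seq_len * num_layers
    print(f"interleaved: {interleaved:,} cached positions")
    print(f"all full   : {all_full:,} cached positions")
    print(f"ratio      : {interleaved / all_full:.3f}")
```

Under these assumptions the cache for long sequences is dominated by the NoPE layers, so the saving approaches the fraction of layers that are chunked, rather than making the cache constant in size.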