13B and 34B Pleeease!!! Most people cannot even run this.
This was sooo disappointing. 😢
They don't really care about us...
13B Llama 4 would be amazing. Even an 8B upgrade would be so nice.
@Doctor-Chad-PhD While the speed of 8B LLMs is great, that's just not enough parameters to make a usable general-purpose AI model.
By far the most broadly capable and knowledgeable ~8b English LLMs are Llama 3.1 8b and Gemma 2 9b, yet they score below 10 on SimpleQA and only 69/100 on my easy popular English knowledge test. Plus, their prompted stories are reliably filled with blatant contradictions of both the prompt and what they already wrote, even at lower temperatures.
So what Meta did here makes a lot of sense. 70b+ parameters is absolutely essential for a general-purpose multimodal AI model, and 17b active parameters can run at a reasonable 4+ tokens/second on entry-level 8-core AMD and Intel CPUs, and eventually GPUs will have more RAM. It's much cheaper and more power-efficient to increase RAM than compute.
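The 4+ tokens/second figure can be sanity-checked with a back-of-the-envelope estimate. This is only a rough sketch: it assumes decoding is limited purely by memory bandwidth, that every active weight is read once per token, and it ignores the KV cache and compute. The 50 GB/s bandwidth and 4.5 bits per weight are illustrative assumptions, not numbers from this thread, and the function names are made up for the example.

```python
def bytes_per_token(active_params: float, bits_per_weight: float) -> float:
    """Bytes of weights that must be read from memory to generate one token."""
    return active_params * bits_per_weight / 8


def estimate_tokens_per_second(
    active_params: float, bits_per_weight: float, bandwidth_bytes_per_s: float
) -> float:
    """Rough upper bound on decode speed if memory bandwidth is the only limit."""
    return bandwidth_bytes_per_s / bytes_per_token(active_params, bits_per_weight)


# Assumed, illustrative hardware and quantization numbers (not from the thread).
BANDWIDTH = 50e9        # 50 GB/s of system memory bandwidth
BITS_PER_WEIGHT = 4.5   # roughly a 4-bit quantization with overhead

# 17B active parameters (MoE) versus a dense 70B model at the same settings.
moe_speed = estimate_tokens_per_second(17e9, BITS_PER_WEIGHT, BANDWIDTH)
dense_speed = estimate_tokens_per_second(70e9, BITS_PER_WEIGHT, BANDWIDTH)

print(f"17B active: ~{moe_speed:.1f} tokens/s")
print(f"70B dense:  ~{dense_speed:.1f} tokens/s")
```

With these assumed numbers the estimate lands a little above 5 tokens/second for 17B active parameters and a little above 1 token/second for a dense 70B model, which is in line with the claim above. Real speeds depend on the actual hardware and runtime.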
What's the alternative? Other model families tanked their general knowledge and abilities while trying to boost their coding, math and other STEM scores. A perfect example is Qwen2.5 72b. Its predecessor Qwen2 72b scored 85.9/100 on my easy broad knowledge test, and nearly as well as Llama 3.1 70b on SimpleQA, yet Qwen2.5 72b lost a full 3.5 generations of broad English knowledge (9-10 on SimpleQA and 68.4/100 on my test) in order to make small gains on coding, math, and other STEM tests, gains which were barely discernible in real-world use cases. This may have fooled coding-obsessed early adopters into thinking Qwen2.5 was an improvement, but as general-purpose English AI models the Qwen2.5 family is astonishingly bad.
In short, going much smaller really isn't an option unless you're willing to trade a notable amount of broad knowledge and abilities for relatively small domain-specific gains (e.g. coding and math).
@UniversalLove333 You can easily run a GGUF version of this, and even when it spills to swap it will still be faster than dense models that are less than half its file size and don't spill to swap.
Stop the claims of it not running on most systems.
It's an MoE, don't forget that. Have you ever run an MoE and compared the speed to a dense model?
Also, you are wrong about "most people cannot even run this". I can run a 2 trillion parameter model on an Intel Celeron laptop if I want, by running the LLM from disk/swap (an extreme example, I know, but it proves the point). So it's not the "run-ability" that matters, it's the speed when you run it, and it is faster than dense LLMs that are less than half its size (Gemma 3 27B Q4 QAT [16GB] runs slower than Llama 4 Scout Q2_K_XL [42.6GB] on a system with 8GB VRAM and 32GB system RAM [40GB total memory]).
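To put rough numbers on that comparison, here is a small sketch of how much weight data each model has to touch per generated token. The file sizes and the 40GB memory total are the ones quoted above. The 109B total parameter count for Scout is an assumption taken from Meta's published figures, and the sketch assumes every tensor is quantized to the same width, which is not exactly true for mixed quantizations like Q2_K_XL. The helper name is made up for the example.

```python
def weights_read_per_token_gb(
    file_size_gb: float, active_params: float, total_params: float
) -> float:
    """Approximate GB of weights touched per token, assuming uniform quantization."""
    return file_size_gb * active_params / total_params


# Figures quoted in the post above.
GEMMA_FILE_GB = 16.0     # Gemma 3 27B Q4 QAT
SCOUT_FILE_GB = 42.6     # Llama 4 Scout Q2_K_XL
TOTAL_MEMORY_GB = 40.0   # 8GB VRAM + 32GB system RAM

# Assumption: Scout has 17B active out of 109B total parameters.
SCOUT_ACTIVE, SCOUT_TOTAL = 17e9, 109e9

# A dense model reads all of its weights for every token.
gemma_read_gb = weights_read_per_token_gb(GEMMA_FILE_GB, 27e9, 27e9)
scout_read_gb = weights_read_per_token_gb(SCOUT_FILE_GB, SCOUT_ACTIVE, SCOUT_TOTAL)

# How much of the Scout file cannot fit in memory and has to live on disk/swap.
scout_overflow_gb = max(0.0, SCOUT_FILE_GB - TOTAL_MEMORY_GB)

print(f"Gemma 3 27B reads ~{gemma_read_gb:.1f} GB per token")
print(f"Scout reads ~{scout_read_gb:.1f} GB per token")
print(f"Scout overflow beyond memory: ~{scout_overflow_gb:.1f} GB")
```

Under these assumptions Scout touches well under half as much weight data per token as the dense 27B model, even though its file is more than twice as large and a couple of GB of it overflows memory. Which experts get hit varies per token, so how often the overflow is actually read from disk depends on the workload.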