Instructions to use LoneStriker/Meta-Llama-3-8B-Instruct-8.0bpw-h8-exl2 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use LoneStriker/Meta-Llama-3-8B-Instruct-8.0bpw-h8-exl2 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="LoneStriker/Meta-Llama-3-8B-Instruct-8.0bpw-h8-exl2")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("LoneStriker/Meta-Llama-3-8B-Instruct-8.0bpw-h8-exl2")
# Llama 3 8B Instruct is a text-only causal language model, so load it with AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("LoneStriker/Meta-Llama-3-8B-Instruct-8.0bpw-h8-exl2")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use LoneStriker/Meta-Llama-3-8B-Instruct-8.0bpw-h8-exl2 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "LoneStriker/Meta-Llama-3-8B-Instruct-8.0bpw-h8-exl2"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "LoneStriker/Meta-Llama-3-8B-Instruct-8.0bpw-h8-exl2",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/LoneStriker/Meta-Llama-3-8B-Instruct-8.0bpw-h8-exl2

SGLang

How to use LoneStriker/Meta-Llama-3-8B-Instruct-8.0bpw-h8-exl2 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "LoneStriker/Meta-Llama-3-8B-Instruct-8.0bpw-h8-exl2" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "LoneStriker/Meta-Llama-3-8B-Instruct-8.0bpw-h8-exl2",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "LoneStriker/Meta-Llama-3-8B-Instruct-8.0bpw-h8-exl2" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "LoneStriker/Meta-Llama-3-8B-Instruct-8.0bpw-h8-exl2",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use LoneStriker/Meta-Llama-3-8B-Instruct-8.0bpw-h8-exl2 with Docker Model Runner:
```
docker model run hf.co/LoneStriker/Meta-Llama-3-8B-Instruct-8.0bpw-h8-exl2
```

eos_token should be <|eot_id|>

by AUTOMATIC - opened Apr 19, 2024

Discussion

AUTOMATIC

Apr 19, 2024

tokenizer_config.json should list "eos_token" as "<|eot_id|>"; otherwise the chat is spammed with ".assistant" text and never ends.

gtkunit

Apr 19, 2024

I had to change it in both tokenizer_config.json and special_tokens_map.json.
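
For anyone applying the same change to a local copy, here is a minimal sketch of the edit described above. It assumes the quant has already been downloaded to a local folder, and that "eos_token" is stored either as a plain string or as an object with a "content" field (both layouts are handled). The function name `set_eos_token` and the `model_dir` argument are hypothetical, chosen only for illustration.

```python
import json
from pathlib import Path

EOT = "<|eot_id|>"
TOKENIZER_FILES = ("tokenizer_config.json", "special_tokens_map.json")


def set_eos_token(model_dir, eos_token=EOT):
    """Set "eos_token" in both tokenizer files; return the names of files changed."""
    changed = []
    for name in TOKENIZER_FILES:
        path = Path(model_dir) / name
        if not path.is_file():
            continue  # skip files that are not present in this folder
        config = json.loads(path.read_text(encoding="utf-8"))
        current = config.get("eos_token")
        if isinstance(current, dict):
            # Object form (assumed layout): {"content": "...", ...}
            if current.get("content") == eos_token:
                continue
            current["content"] = eos_token
        else:
            # Plain string form, or key missing entirely
            if current == eos_token:
                continue
            config["eos_token"] = eos_token
        path.write_text(
            json.dumps(config, indent=2, ensure_ascii=False) + "\n",
            encoding="utf-8",
        )
        changed.append(name)
    return changed
```

Calling `set_eos_token("path/to/local/model")` (the path is a placeholder) rewrites the two files in place, so keeping a backup of the originals first is a good idea. Running it a second time changes nothing and returns an empty list.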

LoneStriker

Owner Apr 19, 2024

Is that the accepted fix? The files were just copied from the original Meta L3 files.

Knightcodin

Apr 19, 2024

I don't believe so, as changing this in the exl2 quant affected the way the model behaved and followed instructions.

gtkunit

Apr 20, 2024

This MR is merged: https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/discussions/4
It looks like the most official fix to me.

chameleon-lizard

Apr 20, 2024

I've opened a pull request for this fix in #2, hope that it will be merged. Amazing model, shame that it has this tokenizer problem from the start.
