Instructions to use HaileyStorm/llama3-5.4b-instruct with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use HaileyStorm/llama3-5.4b-instruct with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="HaileyStorm/llama3-5.4b-instruct")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("HaileyStorm/llama3-5.4b-instruct")
# This is a text-only Llama 3 checkpoint, so the causal-LM auto class applies
model = AutoModelForCausalLM.from_pretrained("HaileyStorm/llama3-5.4b-instruct")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use HaileyStorm/llama3-5.4b-instruct with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "HaileyStorm/llama3-5.4b-instruct"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "HaileyStorm/llama3-5.4b-instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
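
The same request can be sent from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_request` and `chat` are hypothetical, and it assumes the server started above is listening on `localhost:8000` (pass `base_url="http://localhost:30000"` for the SGLang server described further down).

```python
import json
import urllib.request

MODEL_ID = "HaileyStorm/llama3-5.4b-instruct"


def build_chat_request(prompt, base_url="http://localhost:8000"):
    """Build (but do not send) a POST request for the chat completions endpoint."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, base_url="http://localhost:8000"):
    """Send the request and return the assistant's reply text.

    Assumes the response follows the OpenAI chat completions shape.
    """
    with urllib.request.urlopen(build_chat_request(prompt, base_url)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With a server running, `print(chat("What is the capital of France?"))` prints the model's reply.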

Use Docker

docker model run hf.co/HaileyStorm/llama3-5.4b-instruct

SGLang

How to use HaileyStorm/llama3-5.4b-instruct with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "HaileyStorm/llama3-5.4b-instruct" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "HaileyStorm/llama3-5.4b-instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "HaileyStorm/llama3-5.4b-instruct" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "HaileyStorm/llama3-5.4b-instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use HaileyStorm/llama3-5.4b-instruct with Docker Model Runner:
```
docker model run hf.co/HaileyStorm/llama3-5.4b-instruct
```

llama3-5.4b-instruct / README.md

HaileyStorm

Update README.md

9f7c22e verified about 2 years ago


	---
	base_model:
	- meta-llama/Meta-Llama-3-8B-Instruct
	library_name: transformers
	tags:
	- mergekit
	- prune
	- dpo
	- instruct
	datasets:
	- mlabonne/orpo-dpo-mix-40k
	license: llama3
	pipeline_tag: text-generation

	model-index:
	- name: llama3-5.4b-instruct
	  results:
	  - task:
	      type: text-generation
	    dataset:
	      name: truthfulqa_mc2
	      type: truthfulqa_mc2
	    metrics:
	    - name: TruthfulQA (0-Shot)
	      type: TruthfulQA (0-Shot)
	      value: 0.517686926475562
	  - task:
	      type: text-generation
	    dataset:
	      name: ai2_arc
	      type: ai2_arc
	    metrics:
	    - name: AI2 Reasoning Challenge (25-Shot)
	      type: AI2 Reasoning Challenge (25-Shot)
	      value: 0.360068259385666
	  - task:
	      type: text-generation
	    dataset:
	      name: hellaswag
	      type: hellaswag
	    metrics:
	    - name: HellaSwag (10-Shot)
	      type: HellaSwag (10-Shot)
	      value: 0.503485361481777
	  - task:
	      type: text-generation
	    dataset:
	      name: winogrande
	      type: winogrande
	    metrics:
	    - name: Winogrande (5-Shot)
	      type: Winogrande (5-Shot)
	      value: 0.633780584056827
	  - task:
	      type: text-generation
	    dataset:
	      name: mmlu
	      type: mmlu
	    metrics:
	    - name: MMLU (5-Shot)
	      type: MMLU (5-Shot)
	      value: 0.290912975359635
	---
	# GGUFs

	Quantized versions of this model are available:
	- https://huggingface.co/HaileyStorm/llama3-5.4b-instruct-Q8_0-GGUF
	- https://huggingface.co/HaileyStorm/llama3-5.4b-instruct-Q6_K-GGUF
	- https://huggingface.co/HaileyStorm/llama3-5.4b-instruct-Q5_K_M-GGUF
	- https://huggingface.co/HaileyStorm/llama3-5.4b-instruct-Q4_0-GGUF

	# Pruned & Tuned

	This is a "merge" of pre-trained language models created using [mergekit](https://github.com/cg123/mergekit).
	It is a prune of Meta-Llama-3-8B-Instruct from 32 layers down to 20, or about 5.4B parameters, which is about 67% the size of the original.
	Mostly, this is a test of (significant) pruning & healing an instruct-tuned model.
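
The 5.4B figure can be reproduced with a quick back-of-the-envelope count. This is a sketch only: the architecture constants below are assumptions taken from the published Meta-Llama-3-8B-Instruct configuration (hidden size 4096, MLP size 14336, vocabulary 128256, 32 attention heads, 8 key/value heads, untied output head), not something stated in this README.

```python
# Assumed Llama-3-8B architecture constants (not stated in this README).
HIDDEN = 4096
INTERMEDIATE = 14336
VOCAB = 128256
NUM_HEADS = 32
NUM_KV_HEADS = 8
HEAD_DIM = HIDDEN // NUM_HEADS  # 128


def layer_params() -> int:
    """Parameters in one decoder layer (attention + MLP + two RMSNorms)."""
    q_proj = HIDDEN * NUM_HEADS * HEAD_DIM
    kv_proj = 2 * HIDDEN * NUM_KV_HEADS * HEAD_DIM  # grouped-query K and V
    o_proj = NUM_HEADS * HEAD_DIM * HIDDEN
    mlp = 3 * HIDDEN * INTERMEDIATE  # gate, up, and down projections
    norms = 2 * HIDDEN
    return q_proj + kv_proj + o_proj + mlp + norms


def total_params(num_layers: int) -> int:
    """Whole-model parameter count for a given number of decoder layers."""
    embeddings = VOCAB * HIDDEN
    lm_head = VOCAB * HIDDEN  # assumed untied from the input embeddings
    final_norm = HIDDEN
    return embeddings + lm_head + final_norm + num_layers * layer_params()


original = total_params(32)
pruned = total_params(20)
print(f"original: {original / 1e9:.2f}B, pruned: {pruned / 1e9:.2f}B, "
      f"ratio: {pruned / original:.1%}")
```

The ratio is higher than 20/32 because the embedding and output matrices are not pruned along with the layers.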

	## Healing / Finetune
	I healed the model with a full-weight DPO fine-tune for 139k samples (3.15 epochs), followed by a LoRA with r=128 and alpha=256 for 73k samples (1.67 epochs). Both used an 8k sequence length.
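
As a rough sanity check on those figures, the sample and epoch counts imply a dataset of roughly 44k examples per epoch, and the LoRA settings imply a scaling factor of 2. This is a sketch of the arithmetic only; it assumes the 256 value is the LoRA alpha and that the update is scaled by alpha / r, as in the standard LoRA formulation.

```python
# Figures quoted in the healing description above.
full_ft_samples, full_ft_epochs = 139_000, 3.15
lora_samples, lora_epochs = 73_000, 1.67
lora_r, lora_alpha = 128, 256  # assumes the second value is the LoRA alpha

# Implied number of examples seen per epoch in each stage.
full_ft_per_epoch = full_ft_samples / full_ft_epochs
lora_per_epoch = lora_samples / lora_epochs

# Standard LoRA scales the low-rank update by alpha / r.
lora_scaling = lora_alpha / lora_r

print(f"full fine-tune: ~{full_ft_per_epoch:,.0f} samples per epoch")
print(f"LoRA: ~{lora_per_epoch:,.0f} samples per epoch")
print(f"LoRA scaling factor: {lora_scaling}")
```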

	Prior to healing, the model returned absolute gibberish to any prompt, rarely producing two real words together. For example, given "2+2=" it might return "Mahmisan Pannpyout Na RMITa CMI TTi GP BP GP RSi TBi DD PS..."

	The results are pretty good! The model has issues, but could have legitimate uses. It can carry on a conversation. It's certainly usable, if not useful.

	Truthfulness and commonsense reasoning suffered the least from the prune / were healed the best. Knowledge and complex reasoning suffered the most.
	This model has 67% the parameters of the original, and has:
	- ~100% the TruthfulQA score of the original
	- ~60% the ARC Challenge score
	- ~65% the HellaSwag score
	- ~85% the Winogrande score
	- ~45% the MMLU score

	An average of 69% the benchmark scores for 67% the parameters, not bad! (Note, I had issues running the GSM8K and BBH benchmarks.)
	I do believe it could be much better if the pruning were done in stages (say, 4 layers at a time) with some healing in between, followed by longer healing at the end with a more diverse dataset.
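
The comparison above can be sketched in a few lines. The percentages are the rounded values from the list, so their plain average (about 71%) differs slightly from the quoted 69%, which was presumably computed from the unrounded scores.

```python
# Relative scores (percent of the original 8B model), rounded as listed above.
relative_scores = {
    "TruthfulQA": 100,
    "ARC Challenge": 60,
    "HellaSwag": 65,
    "Winogrande": 85,
    "MMLU": 45,
}
PARAM_RATIO = 67  # percent of the original parameter count

mean_relative = sum(relative_scores.values()) / len(relative_scores)
print(f"mean relative score: {mean_relative:.0f}% at {PARAM_RATIO}% of the parameters")

# Benchmarks ordered from best preserved to worst preserved.
for name, pct in sorted(relative_scores.items(), key=lambda kv: -kv[1]):
    print(f"{name:>14}: {pct}%")
```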

	### Benchmarks
	![Comparative Benchmarks](benchmarks.png)
	Figure 1: Benchmark results for the pruned model, the original 8B model, and other models of similar size. Truthfulness and commonsense reasoning suffered the least from the prune / were healed the best. Knowledge and complex reasoning suffered the most.

	![Model Size vs Performance](relative.png)
	Figure 2: Model size vs average benchmark performance. Llama3-5.4b-instruct may not be fully healed, but its performance scales linearly with its size.

	## Why 5.4B?

	This size should allow for:
	- bf16 inference on 24GB VRAM
	- Q8 or Q6 inference on 6GB VRAM
	- Q5 inference on 4GB VRAM
	- Fine-tuning on ... well, with less VRAM than an 8B model
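
For a sense of scale, the weights-only memory footprint can be estimated from the parameter count and the bits per weight of each format. This is a rough sketch: the bits-per-weight values for the GGUF formats are approximate assumptions (Q5_K_M in particular mixes tensor types), and the KV cache, activations, and runtime overhead come on top of these numbers.

```python
PARAMS = 5.4e9  # approximate parameter count of the pruned model

# Approximate bits per weight; the GGUF values are assumptions, not exact.
BITS_PER_WEIGHT = {
    "bf16": 16.0,
    "Q8_0": 8.5,
    "Q6_K": 6.5625,
    "Q5_K_M": 5.7,
    "Q4_0": 4.5,
}


def weight_gb(fmt: str) -> float:
    """Weights-only size in GB (10^9 bytes) for the given format."""
    return PARAMS * BITS_PER_WEIGHT[fmt] / 8 / 1e9


for fmt in BITS_PER_WEIGHT:
    print(f"{fmt:>7}: ~{weight_gb(fmt):.1f} GB of weights")
```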

	And of course, as stated, it was a test of significant pruning, and of pruning & healing an instruct-tuned model. As a test, I think it's definitely successful.

	## Mergekit Details
	### Merge Method

	This model was merged using the passthrough merge method.

	### Models Merged

	The following models were included in the merge:
	* [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct)

	### Configuration

	The following YAML configuration was used to produce this model:

	```yaml
	dtype: bfloat16
	merge_method: passthrough
	slices:
	- sources:
	  - layer_range: [0, 16]
	    model: meta-llama/Meta-Llama-3-8B-Instruct
	- sources:
	  - layer_range: [20, 21]
	    model: meta-llama/Meta-Llama-3-8B-Instruct
	- sources:
	  - layer_range: [29, 32]
	    model: meta-llama/Meta-Llama-3-8B-Instruct
	```
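
To see which layers of the original model survive, the slices in the configuration above can be expanded into explicit layer indices. This sketch assumes each `layer_range` is half-open (start inclusive, end exclusive), which is consistent with the 20-layer total stated earlier.

```python
ORIGINAL_LAYERS = 32

# layer_range values from the configuration, read as half-open [start, end).
slices = [(0, 16), (20, 21), (29, 32)]

kept = [i for start, end in slices for i in range(start, end)]
dropped = [i for i in range(ORIGINAL_LAYERS) if i not in kept]

print(f"kept {len(kept)} layers: {kept}")
print(f"dropped {len(dropped)} layers: {dropped}")
```

Under that reading, the first 16 layers, layer 20, and the last 3 layers are kept, and 12 layers from the middle and upper part of the stack are dropped.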

	## Weights & Biases Logs
	Here are the logs for the full weight fine tune:
	- https://wandb.ai/haileycollet/llama3-5b/runs/ryyqhc97
	- https://wandb.ai/haileycollet/llama3-5b/runs/fpj2sct3
	- https://wandb.ai/haileycollet/llama3-5b/runs/k9z6n9em
	- https://wandb.ai/haileycollet/llama3-5b/runs/r3xqyhm2

	And the LoRA logs:
	- https://wandb.ai/haileycollet/llama3-5b/runs/rseithn1
	- https://wandb.ai/haileycollet/llama3-5b/runs/g26232ei