Unsloth x Qwen2
Unsloth can speed up LLM training and reduce memory usage, but it currently supports only Llama3, Mistral, Gemma, ORPO, Phi-3, and TinyLlama. Qwen2 cannot be trained with the official Unsloth, even though it is popular in the community.
We are excited to have made Unsloth support Qwen2: it speeds up training and substantially reduces memory usage. If you want to train Qwen2 with Unsloth, you can use our repo instead of the official one. We will also contribute our code to the official repo.
Install our Unsloth:
pip install git+https://github.com/yangjianxin1/unsloth.git
Firefly already supports training Qwen2 with Unsloth, and the following models were trained with Firefly. You can try it.
Model Card for Firefly-Qwen1.5-Unsloth
firefly-qwen1.5-en-7b-unsloth and firefly-qwen1.5-en-7b-dpo-v0.1-unsloth are trained from Qwen1.5-7B to act as helpful and harmless AI assistants. We use Firefly to train our models on a single V100 GPU with QLoRA and Unsloth. firefly-qwen1.5-en-7b-unsloth is fine-tuned from Qwen1.5-7B on English instruction data, and firefly-qwen1.5-en-7b-dpo-v0.1-unsloth is trained with Direct Preference Optimization (DPO) starting from firefly-qwen1.5-en-7b-unsloth.
On the Open LLM Leaderboard, the DPO model outperforms the official Qwen1.5-7B-Chat, Gemma-7B-it, and Zephyr-7B-Beta on average score. The SFT model outperforms Qwen1.5-7B-Chat and Gemma-7B-it and is within 0.14 points of Zephyr-7B-Beta.
Although our models are trained on English data, you can also try chatting with them in Chinese, since Qwen1.5 is also good at Chinese. We have not yet evaluated their performance in Chinese.
We advise you to install transformers>=4.37.0.
Performance
We measured the training gain from Unsloth on Qwen1.5-7B by training with QLoRA for 20 steps on a single V100, with and without Unsloth. The results are listed below. In the largest configuration (max_seq_length 2048, per_device_train_batch_size 4, rank 64), Unsloth reduces GPU memory by 39.13% and training time by 32.12%, which corresponds to a 47.32% increase in training speed.
| max_seq_length | per_device_train_batch_size | gradient_accumulation_steps | use_unsloth | rank | GPU memory | Training time |
|---|---|---|---|---|---|---|
| 1024 | 1 | 16 | false | 8 | 13.72GB | 448s |
| 1024 | 1 | 16 | true | 8 | 8.43GB(-38.56%) | 308s(-31.25%) |
| 1024 | 1 | 16 | false | 64 | 16.01GB | 452s |
| 1024 | 1 | 16 | true | 64 | 11.07GB(-30.86%) | 311s(-31.19%) |
| 2048 | 1 | 16 | false | 64 | 18.55GB | 840s |
| 2048 | 1 | 16 | true | 64 | 12.99GB(-29.97%) | 596s(-29.05%) |
| 1024 | 4 | 4 | false | 64 | 24.70GB | 357s |
| 1024 | 4 | 4 | true | 64 | 14.36GB(-41.86%) | 253s(-29.13%) |
| 2048 | 4 | 4 | false | 64 | 32.51GB | 741s |
| 2048 | 4 | 4 | true | 64 | 19.79GB(-39.13%) | 503s(-32.12%) |
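The percentages quoted for the last pair of rows can be reproduced from the raw numbers in the table. The snippet below is a minimal sketch of that arithmetic; the helper names `percent_reduction` and `percent_speedup` are illustrative and do not come from Firefly or Unsloth.

```python
# GPU memory (GB) and training time (s) taken from the two table rows with
# max_seq_length=2048, per_device_train_batch_size=4, rank=64.
baseline = {"memory_gb": 32.51, "time_s": 741}   # use_unsloth = false
unsloth = {"memory_gb": 19.79, "time_s": 503}    # use_unsloth = true


def percent_reduction(before, after):
    """Relative saving of `after` compared with `before`, in percent."""
    return (before - after) / before * 100


def percent_speedup(before_time, after_time):
    """Throughput gain implied by a shorter run time, in percent."""
    return (before_time / after_time - 1) * 100


memory_saved = percent_reduction(baseline["memory_gb"], unsloth["memory_gb"])
time_saved = percent_reduction(baseline["time_s"], unsloth["time_s"])
speedup = percent_speedup(baseline["time_s"], unsloth["time_s"])

print(f"memory: -{memory_saved:.2f}%  time: -{time_saved:.2f}%  speed: +{speedup:.2f}%")
```

Note that a 32.12% cut in training time is the same measurement as a 47.32% gain in speed, since speed is the inverse of time (741 / 503 ≈ 1.4732).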
We evaluated our SFT and DPO models on the Open LLM Leaderboard, where they reach average scores of 61.81 and 62.65, respectively.
| Model | Average | ARC | HellaSwag | MMLU | TruthfulQA | Winogrande | GSM8K |
|---|---|---|---|---|---|---|---|
| firefly-gemma-7b | 62.93 | 62.12 | 79.77 | 61.57 | 49.41 | 75.45 | 49.28 |
| firefly-qwen1.5-en-7b-dpo-v0.1-unsloth | 62.65 | 56.14 | 75.5 | 60.87 | 58.09 | 70.72 | 54.59 |
| zephyr-7b-beta | 61.95 | 62.03 | 84.36 | 61.07 | 57.45 | 77.74 | 29.04 |
| firefly-qwen1.5-en-7b-unsloth | 61.81 | 54.27 | 76.22 | 61.55 | 50.62 | 70.48 | 57.7 |
| vicuna-13b-v1.5 | 55.41 | 57.08 | 81.24 | 56.67 | 51.51 | 74.66 | 11.3 |
| Xwin-LM-13B-V0.1 | 55.29 | 62.54 | 82.8 | 56.53 | 45.96 | 74.27 | 9.63 |
| Qwen1.5-7B-Chat | 55.15 | 55.89 | 78.56 | 61.65 | 53.54 | 67.72 | 13.57 |
| gemma-7b-it | 53.56 | 51.45 | 71.96 | 53.52 | 47.29 | 67.96 | 29.19 |
Usage
The chat template of our chat models is the same as that of the official Qwen1.5-7B-Chat:
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
hello, who are you?<|im_end|>
<|im_start|>assistant
I am a AI program developed by Firefly<|im_end|>
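To make the layout of this template concrete, the sketch below builds the same prompt string by hand. The function `build_chatml_prompt` is a hypothetical helper written for illustration only. In practice `tokenizer.apply_chat_template` should be preferred, since it reads the template shipped with the model.

```python
def build_chatml_prompt(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts in the format shown above.

    Hand-written for illustration; it is not part of Firefly or Transformers.
    """
    parts = []
    for message in messages:
        # Each turn is wrapped in <|im_start|>role ... <|im_end|>
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Leave the assistant turn open so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "hello, who are you?"},
]
prompt_text = build_chatml_prompt(messages)
print(prompt_text)
```

Generation should stop when the model emits `<|im_end|>`, which closes the assistant turn.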
You can run inference with the script in Firefly.
You can also use the following code:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name_or_path = "YeungNLP/firefly-qwen1.5-en-7b-unsloth"
model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    trust_remote_code=True,
    low_cpu_mem_usage=True,
    torch_dtype=torch.float16,
    device_map='auto',
)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

prompt = "Compose an engaging travel blog post about a recent trip to Hawaii, highlighting cultural experiences and must-see attractions. "
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
# Render the conversation with the model's chat template
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
# Place the inputs on the same device as the model
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    model_inputs.input_ids,
    attention_mask=model_inputs.attention_mask,
    max_new_tokens=1500,
    do_sample=True,  # needed for top_p and temperature to take effect
    top_p=0.9,
    temperature=0.35,
    repetition_penalty=1.0,
    eos_token_id=tokenizer.encode('<|im_end|>', add_special_tokens=False)
)
# Drop the prompt tokens so that only the newly generated reply is decoded
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
Training Details
In both the SFT and DPO stages, we use only a single V100 GPU with QLoRA and Unsloth, and we train our models with Firefly.
Training Settings
The following hyperparameters were used during SFT:
- num_epochs: 1
- learning_rate: 2e-4
- total_train_batch_size: 32
- max_seq_length: 2048
- optimizer: paged_adamw_32bit
- lr_scheduler_type: constant_with_warmup
- warmup_steps: 600
- lora_rank: 64
- lora_alpha: 16
- lora_dropout: 0.05
- gradient_checkpointing: true
- fp16: true
The following hyperparameters were used during DPO:
- num_epochs: 1
- learning_rate: 2e-4
- total_train_batch_size: 32
- max_seq_length: 2048
- max_prompt_length: 500
- optimizer: paged_adamw_32bit
- lr_scheduler_type: constant_with_warmup
- warmup_steps: 100
- lora_rank: 64
- lora_alpha: 16
- lora_dropout: 0.05
- gradient_checkpointing: true
- fp16: true
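Two quantities follow from the settings above by simple arithmetic. The sketch below derives them under stated assumptions: the per-device batch size of 4 is a hypothetical value (the lists give only the total), and the scaling factor uses the standard LoRA convention of `lora_alpha / lora_rank`.

```python
# Values taken from the hyperparameter lists above.
total_train_batch_size = 32
lora_rank = 64
lora_alpha = 16
num_gpus = 1  # the models were trained on a single V100

# Hypothetical value: the per-device batch size is not stated in the lists.
per_device_train_batch_size = 4

# total = per_device * num_gpus * accumulation steps, so solve for the steps.
gradient_accumulation_steps = total_train_batch_size // (
    per_device_train_batch_size * num_gpus
)

# Standard LoRA scales the low-rank update by alpha / rank.
lora_scaling = lora_alpha / lora_rank

print(gradient_accumulation_steps, lora_scaling)
```

With these assumptions, 8 accumulation steps reach the total batch size of 32, and the LoRA update is scaled by 0.25.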
Training metrics
The table below shows the full set of DPO training metrics:
| Epoch | Step | Loss | Rewards/accuracies | Rewards/margins | Rewards/chosen | Rewards/rejected | Logits/chosen | Logits/rejected | Logps/chosen | Logps/rejected |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.05 | 100 | 0.6128 | 0.6572 | 0.3914 | -0.0622 | -0.4537 | 1.107 | 1.1104 | -283.7632 | -264.5925 |
| 0.1 | 200 | 0.6066 | 0.6913 | 0.662 | -0.3589 | -1.0209 | 0.9433 | 0.9431 | -279.0002 | -268.6432 |
| 0.16 | 300 | 0.5803 | 0.7069 | 0.876 | -0.3849 | -1.2609 | 0.8411 | 0.8537 | -289.9482 | -274.3425 |
| 0.21 | 400 | 0.5624 | 0.7169 | 0.9575 | -0.2447 | -1.2022 | 0.7615 | 0.7497 | -293.8072 | -274.4167 |
| 0.26 | 500 | 0.5863 | 0.7 | 0.8908 | -0.5283 | -1.4191 | 0.537 | 0.5085 | -284.3388 | -267.9294 |
| 0.31 | 600 | 0.5612 | 0.7166 | 1.0791 | -0.592 | -1.6711 | 0.7121 | 0.7219 | -293.2425 | -278.5992 |
| 0.37 | 700 | 0.5741 | 0.7234 | 1.0742 | -0.8469 | -1.9211 | 0.6002 | 0.5769 | -300.8099 | -285.9137 |
| 0.42 | 800 | 0.582 | 0.7141 | 1.0414 | -1.1658 | -2.2072 | 0.7191 | 0.5934 | -300.458 | -286.1 |
| 0.47 | 900 | 0.5694 | 0.7178 | 1.2055 | -1.7372 | -2.9426 | 0.4226 | 0.316 | -305.5303 | -290.7548 |
| 0.52 | 1000 | 0.5827 | 0.7134 | 1.1063 | -1.354 | -2.4603 | 0.535 | 0.4022 | -302.7598 | -286.636 |
| 0.58 | 1100 | 0.5553 | 0.7306 | 1.3631 | -1.5861 | -2.9492 | 0.7636 | 0.6559 | -312.9375 | -290.3474 |
| 0.63 | 1200 | 0.5633 | 0.7341 | 1.2689 | -1.7187 | -2.9876 | 0.6555 | 0.5894 | -315.0179 | -298.2406 |
| 0.68 | 1300 | 0.5705 | 0.7284 | 1.3501 | -1.7762 | -3.1263 | 0.7419 | 0.6874 | -310.9056 | -294.2934 |
| 0.73 | 1400 | 0.5458 | 0.7347 | 1.4555 | -2.2377 | -3.6932 | 0.7279 | 0.6564 | -309.141 | -299.1613 |
| 0.79 | 1500 | 0.5797 | 0.7222 | 1.2937 | -2.4483 | -3.742 | 0.8444 | 0.771 | -321.578 | -298.111 |
| 0.84 | 1600 | 0.5572 | 0.7319 | 1.4824 | -2.9344 | -4.4168 | 0.9202 | 0.8605 | -323.4034 | -307.0114 |
| 0.89 | 1700 | 0.5518 | 0.7281 | 1.4263 | -2.7301 | -4.1564 | 0.9257 | 0.8785 | -313.694 | -298.1267 |
| 0.94 | 1800 | 0.5572 | 0.7272 | 1.5121 | -2.9505 | -4.4627 | 0.7899 | 0.7503 | -314.1552 | -305.9873 |
| 0.99 | 1900 | 0.5763 | 0.7241 | 1.4982 | -2.7064 | -4.2047 | 0.7841 | 0.7023 | -310.6677 | -299.5064 |
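The logged values are consistent with the reward margin being the chosen reward minus the rejected reward. The snippet below is a small check of that relation on three rows copied from the table; the tolerance only absorbs rounding in the logged numbers, and the variable names are illustrative.

```python
# (step, rewards/chosen, rewards/rejected, rewards/margins) copied from the table.
rows = [
    (100, -0.0622, -0.4537, 0.3914),
    (200, -0.3589, -1.0209, 0.6620),
    (1900, -2.7064, -4.2047, 1.4982),
]

max_error = 0.0
for step, chosen, rejected, margin in rows:
    # The margin is how much higher the chosen response scores than the rejected one.
    recomputed = chosen - rejected
    max_error = max(max_error, abs(recomputed - margin))
    print(f"step {step}: logged margin {margin:.4f}, chosen - rejected {recomputed:.4f}")
```

Both rewards drift negative during training while the margin keeps growing, so the accuracy gain comes from the rejected reward falling faster than the chosen one.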