For optimal performance, we refrain from fine-tuning the model's identity. Thus, inquiries such as "Who are you" or "Who developed you" may yield random responses that are not necessarily accurate.
If you enjoy our model, please give it a star on our Hugging Face repo and kindly cite our model. Your support means a lot to us. Thank you!
Updates
- 🚀🚀🚀 [July 25, 2024] We now introduce shenzhi-wang/Llama3.1-70B-Chinese-Chat! The training dataset contains >100K preference pairs, and it exhibits significant enhancements, especially in roleplay, function calling, and math capabilities!
- 🔥 We provide the official q3_k_m, q4_k_m, q8_0, and f16 GGUF versions of Llama3.1-70B-Chinese-Chat at https://huggingface.co/shenzhi-wang/Llama3.1-70B-Chinese-Chat/tree/main/gguf!
- 🔥 We provide the official Ollama version of Llama3.1-70B-Chinese-Chat at https://ollama.com/wangshenzhi/llama3.1_70b_chinese_chat! Quick use:
ollama run wangshenzhi/llama3.1_70b_chinese_chat
Model Summary
Llama3.1-70B-Chinese-Chat is an instruction-tuned language model for Chinese and English users, built upon the Meta-Llama-3.1-70B-Instruct model, with abilities such as roleplaying and tool use.
- Developers: Shenzhi Wang*, Yaowei Zheng*, Guoyin Wang (in.ai), Shiji Song, Gao Huang. (*: Equal Contribution)
- License: Llama-3.1 License
- Base Model: Meta-Llama-3.1-70B-Instruct
- Model Size: 70B
- Context length: 128K (as reported for the Meta-Llama-3.1-70B-Instruct model; untested for our Chinese model)
1. Introduction
This is the first model specifically fine-tuned for Chinese & English users based on the Meta-Llama-3.1-70B-Instruct model. The fine-tuning algorithm used is ORPO [1].
[1] Hong, Jiwoo, Noah Lee, and James Thorne. "Reference-free Monolithic Preference Optimization with Odds Ratio." arXiv preprint arXiv:2403.07691 (2024).
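As a rough illustration of the objective in [1], the sketch below computes the ORPO loss for a single preference pair: the usual SFT negative log-likelihood on the chosen response, plus a weighted odds-ratio penalty. It is a minimal scalar version for intuition only, not the LLaMA-Factory implementation; the function names are hypothetical, and the inputs are assumed to be length-averaged token log-probabilities of the chosen and rejected responses. The default `beta` mirrors the orpo beta value listed in the training details.

```python
import math


def log_odds(avg_logp: float) -> float:
    """log(p / (1 - p)) for p = exp(avg_logp), where avg_logp < 0."""
    return avg_logp - math.log1p(-math.exp(avg_logp))


def orpo_loss(chosen_avg_logp: float, rejected_avg_logp: float, beta: float = 0.05) -> float:
    """ORPO loss for one preference pair: L_SFT + beta * L_OR."""
    # SFT term: negative log-likelihood of the chosen response.
    sft_term = -chosen_avg_logp
    # Log odds ratio between the chosen and the rejected response.
    log_odds_ratio = log_odds(chosen_avg_logp) - log_odds(rejected_avg_logp)
    # -log(sigmoid(x)) written as log(1 + exp(-x)).
    or_term = math.log1p(math.exp(-log_odds_ratio))
    return sft_term + beta * or_term


if __name__ == "__main__":
    good = orpo_loss(math.log(0.8), math.log(0.2))  # chosen is more likely
    bad = orpo_loss(math.log(0.2), math.log(0.8))   # rejected is more likely
    print(f"chosen preferred: {good:.4f}, rejected preferred: {bad:.4f}")
```

The loss is lower when the model already assigns a higher probability to the chosen response, which is the behaviour the odds-ratio term is meant to encourage.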
Training framework: LLaMA-Factory.
Training details:
- epochs: 3
- learning rate: 1.5e-6
- learning rate scheduler type: cosine
- warmup ratio: 0.1
- cutoff len (i.e. context length): 8192
- orpo beta (i.e. $\lambda$ in the ORPO paper): 0.05
- global batch size: 128
- fine-tuning type: full parameters
- optimizer: paged_adamw_32bit
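To make the learning-rate settings above concrete, here is a small sketch of a cosine schedule with linear warmup using the listed peak learning rate (1.5e-6) and warmup ratio (0.1). The total step count is not given in this card, so `total_steps=1000` is purely a hypothetical value, and the exact schedule shape (linear warmup, then half-cosine decay to zero) is an assumption about how the scheduler behaves rather than a statement of the training code.

```python
import math


def lr_at(step: int, total_steps: int, peak_lr: float = 1.5e-6, warmup_ratio: float = 0.1) -> float:
    """Learning rate at `step` for linear warmup followed by cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate.
        return peak_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup phase that has elapsed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    total_steps = 1000  # hypothetical; the real number of optimizer steps is not stated
    for step in (0, 50, 100, 550, 1000):
        print(step, f"{lr_at(step, total_steps):.3e}")
```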
2. Usage
2.1 Usage of Our BF16 Model
- Please upgrade the transformers package to ensure it supports Llama3.1 models. The version we currently use is 4.43.0.
- Use the following Python script to download our BF16 model:
from huggingface_hub import snapshot_download
snapshot_download(repo_id="shenzhi-wang/Llama3.1-70B-Chinese-Chat", ignore_patterns=["*.gguf"]) # Download our BF16 model without downloading GGUF models.
- Inference with the BF16 model
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "/Your/Local/Path/to/Llama3.1-70B-Chinese-Chat"
dtype = torch.bfloat16

tokenizer = AutoTokenizer.from_pretrained(model_id)
# device_map="auto" spreads the weights across all visible GPUs;
# a 70B model in BF16 is unlikely to fit on a single device.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=dtype,
)

chat = [
    {"role": "user", "content": "写一首关于机器学习的诗。"},  # "Write a poem about machine learning."
]
# Apply the chat template and append the assistant header so the model starts its reply.
input_ids = tokenizer.apply_chat_template(
    chat, tokenize=True, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(
    input_ids,
    max_new_tokens=8192,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)
# Keep only the newly generated tokens (drop the prompt).
response = outputs[0][input_ids.shape[-1] :]
print(tokenizer.decode(response, skip_special_tokens=True))
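As a rough sizing aid for the script above, the following sketch estimates how much memory the weights alone occupy at a given precision. It assumes a nominal 70e9 parameters (the exact count is not stated here) and ignores activations, the KV cache, and framework overhead, so treat the result as a lower bound rather than a hardware requirement. The function name is hypothetical.

```python
def weight_memory_gib(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / 2**30


if __name__ == "__main__":
    nominal_params = 70e9  # assumption: nominal size taken from the model name
    # BF16 stores each weight in 16 bits.
    print(f"BF16 weights: ~{weight_memory_gib(nominal_params, 16):.0f} GiB")
```

Under these assumptions the BF16 weights come to roughly 130 GiB, which is why the weights usually need to be split across several GPUs.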
2.2 Usage of Our GGUF Models
- Download our GGUF models from the gguf folder of the repository (https://huggingface.co/shenzhi-wang/Llama3.1-70B-Chinese-Chat/tree/main/gguf);
- Use the GGUF models with LM Studio;
- You can also follow the instructions at https://github.com/ggerganov/llama.cpp/tree/master#usage to use the GGUF models.
Citation
If our Llama3.1-70B-Chinese-Chat is helpful, please kindly cite as:
@misc {shenzhi_wang_2024,
author = { Wang, Shenzhi and Zheng, Yaowei and Wang, Guoyin and Song, Shiji and Huang, Gao },
title = { Llama3.1-70B-Chinese-Chat },
year = 2024,
url = { https://huggingface.co/shenzhi-wang/Llama3.1-70B-Chinese-Chat },
doi = { 10.57967/hf/2780 },
publisher = { Hugging Face }
}