
This is the Mistral-format version of the Qwen1.5-14B-Chat model by Alibaba Cloud. The original codebase can be found at https://github.com/hiyouga/LLaMA-Factory/blob/main/tests/llamafy_qwen.py. I modified it to make it compatible with Qwen1.5. This model was converted with https://github.com/Minami-su/character_AI_open/blob/main/mistral_qwen2.py

Special note

1. Before using this model, you need to modify modeling_mistral.py in the transformers library.

2. Open the file in your environment, for example: vim /root/anaconda3/envs/train/lib/python3.9/site-packages/transformers/models/mistral/modeling_mistral.py

3. Find the MistralAttention class.

4. In the q, k, v, and o projections, change bias=False to bias=config.attention_bias.

Before: image/png After: image/png
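The manual edit in the steps above can also be sketched as a text substitution. This is only an illustration: `patch_attention_bias` is a hypothetical helper, not part of transformers, and the sample lines are an assumed excerpt. The layout of the projection lines in modeling_mistral.py varies between transformers versions, so check the result by hand before using it.

```python
import re

# Hypothetical helper, not part of transformers. It matches lines that build
# the q/k/v/o projections with nn.Linear and end with bias=False.
_PROJ_BIAS = re.compile(
    r"^(?P<head>[ \t]*self\.(?:q|k|v|o)_proj\s*=\s*nn\.Linear\(.*?)bias=False",
    flags=re.MULTILINE,
)


def patch_attention_bias(source: str) -> str:
    """Return source with bias=False replaced on the q/k/v/o projection lines."""
    return _PROJ_BIAS.sub(r"\g<head>bias=config.attention_bias", source)


# Assumed excerpt for illustration; the real file may be laid out differently.
sample = (
    "class MistralAttention(nn.Module):\n"
    "    def __init__(self, config):\n"
    "        self.q_proj = nn.Linear(hidden_size, num_heads * head_dim, bias=False)\n"
    "        self.o_proj = nn.Linear(num_heads * head_dim, hidden_size, bias=False)\n"
    "        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)\n"
)

patched = patch_attention_bias(sample)
print(patched)
```

Only the attention projections are rewritten; other layers such as `gate_proj` in the sample keep `bias=False`.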

Differences between Qwen2 Mistral and Qwen2 llamafy

Compared to Qwen2 llamafy, Qwen2 Mistral can use sliding window attention, is faster, and handles context length better.
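To illustrate what sliding window attention means, here is a minimal sketch of the attention mask it implies: each position attends only to itself and a fixed number of preceding positions. The function name and the `window` parameter are illustrative assumptions, and the exact boundary convention differs between implementations.

```python
def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """mask[i][j] is True when query position i may attend to key position j.

    Position i sees itself and up to window - 1 earlier positions (causal).
    """
    return [
        [0 <= i - j < window for j in range(seq_len)]
        for i in range(seq_len)
    ]


mask = sliding_window_mask(seq_len=5, window=3)
for row in mask:
    # Print 1 for visible positions and 0 for masked ones.
    print("".join("1" if visible else "0" for visible in row))
```

With full causal attention every row would be visible back to position 0; with the window, the number of visible keys per query is capped at `window`.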

Usage:


from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

# Load the tokenizer and model; device_map="auto" places the weights on the available device(s).
tokenizer = AutoTokenizer.from_pretrained("Minami-su/Qwen1.5-14B-Chat_mistral")
model = AutoModelForCausalLM.from_pretrained("Minami-su/Qwen1.5-14B-Chat_mistral", torch_dtype="auto", device_map="auto")
# Stream generated text to stdout as it is produced, without echoing the prompt.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

messages = [
    {"role": "user", "content": "Who are you?"}
]
# Build the chat prompt and move it to the same device as the model.
inputs = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")
inputs = inputs.to(model.device)
generate_ids = model.generate(inputs, max_length=32768, streamer=streamer)

Test

Loaded in 4-bit:

hf-causal (pretrained=Qwen1.5-14B-Chat), limit: None, provide_description: False, num_fewshot: 0, batch_size: 16
|    Task     |Version| Metric |Value |   |Stderr|
|-------------|------:|--------|-----:|---|-----:|
|arc_challenge|      0|acc     |0.4437|±  |0.0145|
|             |       |acc_norm|0.4718|±  |0.0146|
|truthfulqa_mc|      1|mc1     |0.4468|±  |0.0174|
|             |       |mc2     |0.6310|±  |0.0157|
|winogrande   |      0|acc     |0.6788|±  |0.0131|

Loaded in 4-bit:

hf-causal (pretrained=Qwen1.5-14B-Chat_mistral), limit: None, provide_description: False, num_fewshot: 0, batch_size: 16
|    Task     |Version| Metric |Value |   |Stderr|
|-------------|------:|--------|-----:|---|-----:|
|arc_challenge|      0|acc     |0.4445|±  |0.0145|
|             |       |acc_norm|0.4718|±  |0.0146|
|truthfulqa_mc|      1|mc1     |0.4468|±  |0.0174|
|             |       |mc2     |0.6310|±  |0.0157|
|winogrande   |      0|acc     |0.6788|±  |0.0131|
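The two result tables can be compared metric by metric. The sketch below copies the values from the tables above and checks whether each difference falls within the reported standard error; the `compare` helper and the dictionary layout are illustrative assumptions, not part of the evaluation harness.

```python
# (task, metric) -> (value, stderr), copied from the two tables above.
original = {
    ("arc_challenge", "acc"): (0.4437, 0.0145),
    ("arc_challenge", "acc_norm"): (0.4718, 0.0146),
    ("truthfulqa_mc", "mc1"): (0.4468, 0.0174),
    ("truthfulqa_mc", "mc2"): (0.6310, 0.0157),
    ("winogrande", "acc"): (0.6788, 0.0131),
}
converted = {
    ("arc_challenge", "acc"): (0.4445, 0.0145),
    ("arc_challenge", "acc_norm"): (0.4718, 0.0146),
    ("truthfulqa_mc", "mc1"): (0.4468, 0.0174),
    ("truthfulqa_mc", "mc2"): (0.6310, 0.0157),
    ("winogrande", "acc"): (0.6788, 0.0131),
}


def compare(a: dict, b: dict) -> dict:
    """Map each (task, metric) to (difference b - a, whether it is within a's stderr)."""
    rows = {}
    for key, (value_a, stderr_a) in a.items():
        value_b, _ = b[key]
        diff = round(value_b - value_a, 4)
        rows[key] = (diff, abs(diff) <= stderr_a)
    return rows


report = compare(original, converted)
for (task, metric), (diff, within) in report.items():
    print(f"{task:14s} {metric:9s} diff={diff:+.4f} within_stderr={within}")
```

The only metric that differs between the two tables is arc_challenge acc, and that difference is well inside the reported standard error.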