How to use from
SGLang
Install from pip and serve model
# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "Minami-su/Qwen1.5-0.5B-Chat_mistral" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Minami-su/Qwen1.5-0.5B-Chat_mistral",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
Use Docker images
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "Minami-su/Qwen1.5-0.5B-Chat_mistral" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Minami-su/Qwen1.5-0.5B-Chat_mistral",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
Quick Links

This is the Mistral version of Qwen1.5-0.5B-Chat model by Alibaba Cloud. The original codebase can be found at: (https://github.com/hiyouga/LLaMA-Factory/blob/main/tests/llamafy_qwen.py). I have made modifications to make it compatible with qwen1.5. This model is converted with https://github.com/Minami-su/character_AI_open/blob/main/mistral_qwen2.py

special

1.Before using this model, you need to modify modeling_mistral.py in transformers library

2.vim /root/anaconda3/envs/train/lib/python3.9/site-packages/transformers/models/mistral/modeling_mistral.py

3.find MistralAttention,

4.modify q,k,v,o bias=False ----->, bias=config.attention_bias

Before: image/png After: image/png

Differences between qwen2 mistral and qwen2 llamafy

Compared to qwen2 llamafy,qwen2 mistral can use sliding window attention,qwen2 mistral is faster than qwen2 llamafy, and the context length is better

Usage:


from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer
tokenizer = AutoTokenizer.from_pretrained("Minami-su/Qwen1.5-0.5B-Chat_mistral")
model = AutoModelForCausalLM.from_pretrained("Minami-su/Qwen1.5-0.5B-Chat_mistral", torch_dtype="auto", device_map="auto")
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

messages = [
    {"role": "user", "content": "Who are you?"}
]
inputs = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")
inputs = inputs.to("cuda")
generate_ids = model.generate(inputs,max_length=2048, streamer=streamer)

Test

load in 4bit

hf-causal (pretrained=Qwen1.5-0.5B-Chat), limit: None, provide_description: False, num_fewshot: 0, batch_size: 32
|    Task     |Version| Metric |Value |   |Stderr|
|-------------|------:|--------|-----:|---|-----:|
|arc_challenge|      0|acc     |0.2389|±  |0.0125|
|             |       |acc_norm|0.2688|±  |0.0130|
|truthfulqa_mc|      1|mc1     |0.2534|±  |0.0152|
|             |       |mc2     |0.4322|±  |0.0151|
|winogrande   |      0|acc     |0.5564|±  |0.0140|

load in 4bit

hf-causal (pretrained=Qwen1.5-0.5B-Chat_mistral), limit: None, provide_description: False, num_fewshot: 0, batch_size: 32
|    Task     |Version| Metric |Value |   |Stderr|
|-------------|------:|--------|-----:|---|-----:|
|arc_challenge|      0|acc     |0.2398|±  |0.0125|
|             |       |acc_norm|0.2705|±  |0.0130|
|truthfulqa_mc|      1|mc1     |0.2534|±  |0.0152|
|             |       |mc2     |0.4322|±  |0.0151|
|winogrande   |      0|acc     |0.5549|±  |0.0140|

Downloads last month
758
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support