Feedback and questions: Model underfitting and poor format adaptation?
First of all, I just want to say that I'm very grateful for the much-needed work you've been doing on instruction-tuned LMs for Norwegian, and for releasing it under a truly open-source license.
I'm making this post to give my feedback on the model as encouraged in the model description, along with a couple questions :)
I have an internal dataset/task that I wanted to try Norwegian instruction-tuned models on, but when I tried this one, its responses and behavior more closely resembled a foundation model that has not been tuned for instructions. I had to double-check that the model posted here was not the same as norallm/normistral-7b-warm, since they both respond similarly to the example prompt in the model card.
My experience so far is that it generally performs better than the base model, but in my limited, manual testing the difference was marginal.
Example prompt:
<|im_start|> user
Hva er hovedstaden i Norge?<|im_end|>
<|im_start|> assistant
Hovedstaden i Norge er Oslo. Denne byen ligger i den sørøstlige delen av landet, ved Oslofjorden. Oslo er en av de raskest voksende byene i Europa, og den er kjent for sin rike historie, kultur og moderne arkitektur. Noen populære turistattraksjoner i Oslo inkluderer Vigelandsparken, som viser mer enn 200 skulpturer laget av den berømte norske skulptøren Gustav Vigeland, og det kongelige slott, som er den offisielle residensen til Norges kongefamilie. Oslo er også hjemsted for mange museer, gallerier og teatre, samt mange restauranter og barer som tilbyr et bredt utvalg av kulinariske og kulturelle opplevelser.<|im_end|>
<|im_start|> user
Gi meg en liste over de beste stedene å besøke i hovedstaden<|im_end|>
<|im_start|> assistant
normistral-7b-warm-instruct
1. Vigelandsparken: Denne parken viser mer enn 200 skulpturer laget av den berømte norske skulptøren Gustav Vigeland. Parken er et flott sted å tilbringe en solrik dag, og den tilbyr en unik innsikt i Norges kunst- og kulturhistorie.
2. Det kongelige slott: Dette er den offisielle residensen til Norges kongefamilie. Slottet er et vakkert eksempel på norsk arkitektur og er åpent for publikum om sommeren. Du kan ta en guidet omvisning i slottet og lære om historien og den kongelige familien.
3. Holmenkollen: Dette er et av de mest populære turistmålene i Oslo. Det tilbyr fantastisk utsikt over byen og er hjemsted for det berømte Holmenkollbakken, som er vertskap for det årlige FIS Nordic World Ski Championships.
4. Frammuseet: Dette museet viser fram skipet som ble brukt av norske oppdagelsesreisende Fridtjof Nansen og Roald Amundsen. Museet tilbyr en fascinerende innsikt i Norges maritime historie og polarutforskning.
normistral-7b-warm
De beste stedene å besøke i hovedstaden er:
1. Vigelandsparken
2. Det kongelige slott
3. Munch-museet
4. Vikingskipshuset
Furthermore, the model does not seem well adapted to the prompt format used, which raises the question of whether it was trained with the same format for every dataset. If the model were properly adapted to the format, I would expect it to complete parts of the format under greedy decoding. For instance, if I send <|im_start|> as input, I would expect either _user or _assistant to follow, but with this model I get (using greedy decoding):
<|im_start|>”,” sa han. ”Jeg har ikke sett noen av dem på en stund.”
”Jeg har ikke sett noen av dem på en stund,” sa jeg.
...
Here, _user is only the 29247th most probable token, which indicates that the model has not yet learned the prompt format.
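For anyone who wants to reproduce this kind of check, here is a minimal sketch of how a token's rank can be read off the next-token logits. The function name, the toy logits, and the token index are all made up for illustration; in practice the logits would come from a forward pass of the model on the <|im_start|> input.

```python
def token_rank(logits, token_id):
    """Return the 1-based rank of token_id when tokens are sorted by descending logit."""
    target = logits[token_id]
    # Every token that scores strictly higher pushes the target one place down.
    return 1 + sum(1 for value in logits if value > target)


# Toy vocabulary of five tokens; index 3 is a hypothetical stand-in for "_user".
toy_logits = [2.5, 0.1, 4.0, -1.2, 3.3]
print(token_rank(toy_logits, 3))
```

A rank of 1 means greedy decoding would pick that token; a rank in the tens of thousands means the model gives it essentially no weight.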
Similarly, the model doesn't seem to generate <|im_end|> either unless it is given some "few-shot" chat turns as preliminary context. But then again, if it relies on that to use the prompt format correctly, it is not yet much different from a base/foundation model.
So I'm left with a couple questions:
- Is the model trained on the given prompt format?
- Are there any other formats it has been trained on?
- The model is a work in progress, but is the training complete or is this an early checkpoint of a model that is still currently undergoing training?
- The current model, how many instruction documents is it trained on, and has it been trained for multiple passes? If yes, has it been trained until convergence of some validation loss?
I do acknowledge that the model is a work in progress; I'm just not sure whether this is a mid-training checkpoint or the first iteration of a complete model.
Hi, thanks for the interesting questions!
I had to double check that the model posted here was not the same norallm/normistral-7b-warm as they both, respond similarly to the example prompt in the model card.
Well, it is actually the same model; the instruct model has just been briefly fine-tuned on instruction data, so it's expected that it "reasons" similarly. I'm just a bit surprised that the base model follows the instruction format at all :)
If the model was appropriately adapted to the format, I would expect it to complete parts of the format upon greedy decoding. For instance, if i sent <|im_start|> as input, I would expect either _user or _assistant.
The model is not trained to do that; it's only trained to give a response (because, well, that's what it's supposed to do). So, for example, if the training sample looks like this:
<|im_start|> user
Hva er hovedstaden i Norge?<|im_end|>
<|im_start|> assistant
Hovedstaden i Norge er Oslo. Denne byen ligger i den sørøstlige delen av landet, ved Oslofjorden. Oslo er en av de raskest voksende byene i Europa, og den er kjent for sin rike historie, kultur og moderne arkitektur. Noen populære turistattraksjoner i Oslo inkluderer Vigelandsparken, som viser mer enn 200 skulpturer laget av den berømte norske skulptøren Gustav Vigeland, og det kongelige slott, som er den offisielle residensen til Norges kongefamilie. Oslo er også hjemsted for mange museer, gallerier og teatre, samt mange restauranter og barer som tilbyr et bredt utvalg av kulinariske og kulturelle opplevelser.<|im_end|>
The loss will only be calculated on these tokens: Hovedstaden i Norge er... opplevelser.<|im_end|>. Thus, the model will have no idea what should follow after <|im_start|>, for example.
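As a rough illustration of the masking described above, the sketch below builds training labels in which only the response tokens contribute to the loss. The thread does not say how the masking is actually implemented; the token ids and the `build_labels` helper are hypothetical, and -100 is used because it is the default `ignore_index` of PyTorch's `CrossEntropyLoss`.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_labels(token_ids, is_response):
    """Keep response tokens as labels and mask every other position."""
    return [tok if keep else IGNORE_INDEX for tok, keep in zip(token_ids, is_response)]


# Hypothetical ids for:
# <|im_start|> user Hva ... <|im_end|> <|im_start|> assistant Hovedstaden ... <|im_end|>
token_ids = [1, 10, 11, 2, 1, 20, 30, 31, 2]
is_response = [False] * 6 + [True] * 3  # only the last three tokens are the response

labels = build_labels(token_ids, is_response)
print(labels)  # [-100, -100, -100, -100, -100, -100, 30, 31, 2]
```

In this toy example the role tokens that follow <|im_start|> (ids 10 and 20) are always masked, so no gradient ever teaches the model to predict them.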
Similarly, the model doesn't seem to generate <|im_end|> either
I haven't observed this behavior in hundreds of conversations with this model. The responses are definitely not always perfect, but they always end with <|im_end|>. Please make sure you let the model generate enough tokens (it tends to be quite verbose) and that you don't use a repetition penalty.
Is the model trained on the given prompt format? Are there any other formats it has been trained on?
Yes and no, respectively.
The model is a work in progress, but is the training complete or is this an early checkpoint of a model that is still currently undergoing training?
This checkpoint is fully trained. The next versions are going to be trained in the same way but we will update the training data to hopefully mitigate some of the current issues.
The current model, how many instruction documents is it trained on, and has it been trained for multiple passes? If yes, has it been trained until convergence of some validation loss?
It's trained for 2 epochs on 60K conversations. No: in our experience, training/validation loss is not a great indicator of convergence in this case, because the goal is not to fit the data but to teach the model how to respond to user prompts. See, for example, https://arxiv.org/abs/2403.04652 for more details.
Piggybacking off this: you say
(...) and that you don't use a repetition penalty.
Why is that an issue? We're currently testing, and results so far look promising. The parameters we use are:
PARAMETER num_ctx 4096
PARAMETER temperature 0.3
PARAMETER repeat_penalty 1.0
PARAMETER stop <|im_end|>
Like I said, we're still testing, but the model behaves well with these settings. It's even less verbose: although the initial response goes on a bit (and echoes some parts of the system prompt we don't want it to spit out), the following responses tend to be single sentences unless we prompt it for more information.
For reference, we're using the Q5_K_M quantized version of the model and running it through Ollama (although I don't expect that to matter much here).
That's a great question, I wish I had a good answer :) NorMistral seems to be more sensitive to repetition penalty than other LLMs and I don't really know what's causing it.
Repetition penalty lowers the probability of any token that has already appeared in the previous context (in the prompts as well as in the responses) -- mathematically, it multiplies the temperature by the repetition penalty value for all such tokens. The bad thing is that it also penalizes tokens that should repeat -- like stopwords, punctuation or the special <|im_end|> token, which can cause the outputs to be very random/chaotic.
NorMistral probably outputs a more even distribution across tokens than other LLMs, and that's why it suffers more from the drawbacks of repetition penalty (as you can see, it's a rather dumb hack to prevent repetitions) -- but I have no idea why the output distribution would be different.
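To make the mechanism above concrete, here is a small sketch of one common formulation of repetition penalty, the one used for example by the RepetitionPenaltyLogitsProcessor in Hugging Face Transformers: positive logits of already-seen tokens are divided by the penalty and negative ones are multiplied by it. For positive logits this acts like a higher temperature applied to those tokens only. Whether a given runtime uses exactly this rule is an assumption here, and the toy logits and token indices are made up.

```python
import math


def apply_repetition_penalty(logits, seen_token_ids, penalty):
    """Push down the logit of every token that already appeared in the context."""
    out = list(logits)
    for tid in set(seen_token_ids):
        # Divide positive logits and multiply negative ones, so both become less likely.
        out[tid] = out[tid] / penalty if out[tid] > 0 else out[tid] * penalty
    return out


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy vocabulary: index 0 is a stand-in for <|im_end|>, already present in the context.
logits = [3.0, 1.0, 0.5]
seen = [0]

p_off = softmax(apply_repetition_penalty(logits, seen, 1.0))[0]
p_on = softmax(apply_repetition_penalty(logits, seen, 1.5))[0]
print(p_off, p_on)
```

With a penalty of 1.0 the logits are unchanged, while 1.5 makes the stand-in end token noticeably less likely even though the conversation format requires it at the end of every turn.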
Hmm, alright. I’ll keep testing and keep you updated, but like I said so far it’s acting quite well under the parameters I’ve set. It is quite brief in its answers though, which may be due to the repetition penalty. I’ll try tuning this a bit, since the original non-penalized version tended to go on for too long.
Just to be clear, setting repetition penalty to 1.0 is equivalent to turning the repetition penalty off (as in the parameters you sent). Only a higher value than 1.0 starts penalizing repeated tokens.
Do you use a special system prompt to get short replies? The responses I get tend to be a bit too long and verbose :)
Oh, god. You’re right, my mistake😅
Yeah we have a few paragraphs of instructions, it’s a RAG application that should be fact-finding and not make things up. I think that might motivate it to keep it short whenever it “ventures off” outside the knowledge base. Might try to tweak that around a bit to see what we get, but it’s been quite to the point without being restrictive. Just a little no-nonsense, but that’s what we want I guess.