Go/No Go
Thank you for creating and publishing this model. The 13b version is brilliant.
When announcing your models, I hope that you will consider accompanying them with a couple of brief statements:
- The minimum consumer-grade hardware required to run the model, with any suggested settings for that minimum (e.g. which quantization) and the inference speed to expect.
- The relative strengths of the model, e.g. whether it is stronger at programming than storytelling, and how compliant or aligned it is.
In the case of this model, for example, could you run it with 16 GB of VRAM and 64 GB of RAM?
It was too large for me to benchmark, so I can't say anything beyond what the Hugging Face leaderboard shows, but it did have roleplaying data, so it is possibly better than most at that.
As for the minimum requirement, two 3090s or 4090s, or a single 48 GB A6000, is required to run inference in 4-bit.
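For anyone wanting to sanity-check those numbers, here is a rough back-of-the-envelope sketch in Python. It assumes a nominal 70 billion parameters (taken from the model name) and counts only the weights, so the KV cache, activations, and framework overhead come on top. The helper names and the hardware table are made up for illustration.

```python
# Rough estimate of weight memory for a quantized model.
# Assumption: 70e9 parameters (nominal, from the model name), weights only.

def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def fits_in_vram(vram_gb: float, needed_gb: float) -> bool:
    """True if the weights alone fit; ignores KV cache and other overhead."""
    return vram_gb >= needed_gb


N_PARAMS = 70e9  # assumed nominal parameter count

weights_fp16 = weight_memory_gb(N_PARAMS, 16)
weights_4bit = weight_memory_gb(N_PARAMS, 4)

# Hypothetical hardware table: total VRAM in GB per configuration.
configs = {
    "1x 16 GB card": 16,
    "2x 24 GB cards (3090/4090)": 48,
    "1x 48 GB card (A6000)": 48,
}

results = {name: fits_in_vram(vram, weights_4bit) for name, vram in configs.items()}

if __name__ == "__main__":
    print(f"fp16 weights:  ~{weights_fp16:.0f} GB")
    print(f"4-bit weights: ~{weights_4bit:.0f} GB")
    for name, ok in results.items():
        print(f"{name}: {'fits' if ok else 'does not fit'} (weights only)")
```

By this estimate the 4-bit weights alone come to roughly 35 GB, which is more than a 16 GB card holds but within 48 GB of total VRAM, consistent with the requirement quoted above. With 16 GB of VRAM, most of the weights would have to sit in system RAM instead.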