Instructions to use abhinavkulkarni/VMware-open-llama-7b-open-instruct-w4-g128-awq with libraries and local apps.

Libraries

How to use abhinavkulkarni/VMware-open-llama-7b-open-instruct-w4-g128-awq with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="abhinavkulkarni/VMware-open-llama-7b-open-instruct-w4-g128-awq")

# Load model directly (a LLaMA-style text model uses the causal-LM auto class)
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("abhinavkulkarni/VMware-open-llama-7b-open-instruct-w4-g128-awq")
model = AutoModelForCausalLM.from_pretrained("abhinavkulkarni/VMware-open-llama-7b-open-instruct-w4-g128-awq")
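Once a tokenizer and model are loaded, generation follows the usual tokenize, generate, decode sequence. The sketch below reuses the `repetition_penalty=1.1` and `eos_token_id` arguments that appear in the README diff further down this page. The helper name `generate_text` and the `max_new_tokens` default are assumptions made for illustration and are not part of the model card.

```python
def generate_text(model, tokenizer, prompt, max_new_tokens=128):
    """Tokenize `prompt`, run `model.generate`, and decode the first sequence.

    `generate_text` is a hypothetical helper; `model` and `tokenizer` are
    assumed to be objects loaded with the Transformers auto classes.
    """
    # Tokenize to PyTorch tensors and move them to the model's device.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,        # assumed default, tune as needed
        repetition_penalty=1.1,               # value taken from the README diff
        eos_token_id=tokenizer.eos_token_id,  # stop at the end-of-sequence token
    )
    # Return the decoded first sequence with special tokens stripped.
    return tokenizer.decode(output[0], skip_special_tokens=True)


# Usage, assuming `model` and `tokenizer` were loaded as shown earlier:
# print(generate_text(model, tokenizer, "Once upon a time,"))
```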

Local Apps

vLLM

How to use abhinavkulkarni/VMware-open-llama-7b-open-instruct-w4-g128-awq with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "abhinavkulkarni/VMware-open-llama-7b-open-instruct-w4-g128-awq"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "abhinavkulkarni/VMware-open-llama-7b-open-instruct-w4-g128-awq",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'
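
The same request can be sent from Python. This is a minimal sketch using only the standard library. It assumes a server is already listening at `base_url` and that the response follows the OpenAI completions shape (`choices[0].text`). The names `build_completion_request` and `complete` are illustrative and do not belong to vLLM.

```python
import json
import urllib.request

MODEL_ID = "abhinavkulkarni/VMware-open-llama-7b-open-instruct-w4-g128-awq"


def build_completion_request(prompt, base_url="http://localhost:8000",
                             max_tokens=512, temperature=0.5):
    """Build (but do not send) a POST request mirroring the curl call above."""
    payload = {
        "model": MODEL_ID,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the first completion's text."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["text"]


# Usage, with a server running locally:
# print(complete("Once upon a time,"))
```

Passing `base_url="http://localhost:30000"` would target a server on port 30000, the port used in the SGLang instructions.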

Use Docker

docker model run hf.co/abhinavkulkarni/VMware-open-llama-7b-open-instruct-w4-g128-awq

SGLang

How to use abhinavkulkarni/VMware-open-llama-7b-open-instruct-w4-g128-awq with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "abhinavkulkarni/VMware-open-llama-7b-open-instruct-w4-g128-awq" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "abhinavkulkarni/VMware-open-llama-7b-open-instruct-w4-g128-awq",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "abhinavkulkarni/VMware-open-llama-7b-open-instruct-w4-g128-awq" \
        --host 0.0.0.0 \
        --port 30000
# Call the server with the same curl request as in the pip instructions above (OpenAI-compatible API).

Docker Model Runner
How to use abhinavkulkarni/VMware-open-llama-7b-open-instruct-w4-g128-awq with Docker Model Runner:
```
docker model run hf.co/abhinavkulkarni/VMware-open-llama-7b-open-instruct-w4-g128-awq
```

Abhinav Kulkarni committed on Jul 14, 2023

Commit e3673d3

1 Parent(s): b3f236e

Updated README

Files changed (1)

README.md +2 -2

@@ -24,7 +24,7 @@ Please refer to the AWQ quantization license ([link](https://github.com/llm-awq/
 ## CUDA Version
-This model was successfully tested on CUDA driver v530.30.02 and runtime v11.7 with Python v3.10.11. Please note that AWQ requires NVIDIA GPUs with compute capability of 80 or higher.
+This model was successfully tested on CUDA driver v530.30.02 and runtime v11.7 with Python v3.10.11. Please note that AWQ requires NVIDIA GPUs with compute capability of `8.0` or higher.
 For Docker users, the `nvcr.io/nvidia/pytorch:23.06-py3` image is runtime v12.1 but otherwise the same as the configuration above and has also been verified to work.
@@ -85,7 +85,7 @@ output = model.generate(
     repetition_penalty=1.1,
     eos_token_id=tokenizer.eos_token_id
 )
-print(tokenizer.decode(output[0], skip_special_tokens=True))
+# print(tokenizer.decode(output[0], skip_special_tokens=True))
 ```
 ## Evaluation
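
The README change states the GPU requirement as compute capability `8.0` or higher. A minimal sketch of how that could be checked before loading the model follows. The helper names are hypothetical, and `torch` is imported lazily so the comparison logic can be used without a GPU stack installed.

```python
def meets_awq_requirement(capability, minimum=(8, 0)):
    """Return True if a (major, minor) compute capability is at least `minimum`.

    Tuples compare element by element, so (8, 6) >= (8, 0) but (7, 5) < (8, 0).
    """
    return tuple(capability) >= tuple(minimum)


def current_gpu_supported():
    """Check the default CUDA device; assumes PyTorch is installed."""
    import torch  # imported lazily so the helper above works without PyTorch

    if not torch.cuda.is_available():
        return False
    # get_device_capability() returns a (major, minor) tuple for the device.
    return meets_awq_requirement(torch.cuda.get_device_capability())


# Usage on a machine with PyTorch and an NVIDIA GPU:
# if not current_gpu_supported():
#     raise RuntimeError("AWQ needs compute capability 8.0 or higher")
```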