Instructions to use OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5")

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5")
model = AutoModelForCausalLM.from_pretrained("OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5")

Notebooks
Google Colab
Kaggle
Local Apps Settings

vLLM

How to use OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker

docker model run hf.co/OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5

SGLang

How to use OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Docker Model Runner
How to use OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5 with Docker Model Runner:
```
docker model run hf.co/OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5
```

How to reduce batch size in order to solve CUDA out of memory error?

#19

by samyar03 - opened May 30, 2023

Discussion

samyar03

May 30, 2023

Hello. I'm running this model on a cloud GPU on Google Cloud. I'm currently using a NVIDIA T4 GPU. I thought I had enough memory in the GPU to run this model (16 GB), but whenever I run the server.py script to run the text-generation-webui, I get this message "torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 200.00 MiB (GPU 0; 14.62 GiB total capacity; 13.85 GiB already allocated; 169.38 MiB free; 13.86 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF". I assume I don't need to free much memory, so maybe reducing the batch size could work. Does anyone know how I can do this?

AlienHD

May 30, 2023

Yes, changing the Batch size could help to reduce Vram usage. You can try reducing the batch size by locating the line of code in your script that sets the batch size and decreasing its value. If you’re not sure where to find this line of code, you can try searching for “batch_size” in your script. I do not know the structure of the code that you are using, so I can't give you any precise Instructions on changing that parameter.

samyar03

May 30, 2023

Yes, changing the Batch size could help to reduce Vram usage. You can try reducing the batch size by locating the line of code in your script that sets the batch size and decreasing its value. If you’re not sure where to find this line of code, you can try searching for “batch_size” in your script. I do not know the structure of the code that you are using, so I can't give you any precise Instructions on changing that parameter.

Would I find the script for changing batch size in the model itself, or is it just the server.py script

AlienHD

May 30, 2023

You would find this Parameter if it exists in the script you use to run the model.

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment