Reproducing the fine-tuning gets stuck with 100% CPU on one process
Hi, I'm trying to reproduce your results, but early in the run one process seems to get stuck. This is how I launch the fine-tuning:
echo '
{
"fp16": {
"enabled": true,
... (identical to yours)
"train_micro_batch_size_per_gpu": "auto",
"wall_clock_breakdown": false
}
' > ./ds_config.json
deepspeed \
./trainer_sft.py \
--configs defaults reference-data reference-pythia-12b \
--cache_dir /root/.cache/huggingface \
--output_dir .saved/oasst-sft-3-pythia-12b-reference_2kpre \
--num_train_epochs 8 \
--use_flash_attention false \
--verbose true \
--logging_steps 1 \
--dtype fp16 \
--residual_dropout 0.2 \
--model_name andreaskoepf/pythia-12b-pre-2000
I get the following logs (abbreviated):
Evaluation set sizes:
oasst_export: 2026 (16.55%)
alpaca: 10212 (83.45%)
Total eval: 12238
--------------------------------------------------------------------------------
...
Number of trainable parameters: 11841M
Loading checkpoint shards: 100%|██████████| 3/3 [00:17<00:00, 5.83s/it]
Resizing embeddings to 50288
...
warnings.warn(
/usr/local/lib/python3.10/site-packages/transformers/optimization.py:407: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
warnings.warn(
About 3 minutes after starting the process I get a burst of GPU activity. It lasts for about 10 seconds, then everything halts and I'm left with a single process using 100% of a CPU:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
███████ ████ ██ █ █████ █████ ██████ █ ████ ███ █████████ /usr/local/bin/python3 -u ./trainer_sft.py --local_rank=6
Do you have any idea what that might be?
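One way to find out where the busy rank is spinning would be to have each process dump its Python stack traces on demand. The snippet below is a minimal sketch using the standard-library `faulthandler` module. The helper name `enable_stack_dump` is hypothetical, and it assumes the call can be added near the top of the training script; it is not part of `trainer_sft.py` as far as I know.

```python
import faulthandler
import os
import signal
import sys


def enable_stack_dump(signum=signal.SIGUSR1, stream=sys.stderr):
    """Dump the Python traceback of every thread to `stream` when `signum` arrives.

    Hypothetical helper for debugging a hung rank. Returns the current PID so
    it can be logged next to the rank number.
    """
    # faulthandler installs a C-level signal handler that writes directly to
    # the stream's file descriptor, so no Python-level code has to run.
    faulthandler.register(signum, file=stream, all_threads=True)
    return os.getpid()
```

With this in place, sending `SIGUSR1` to the PID shown by `top` (for example with `kill -USR1 <pid>`) should print the current call stack of that rank to its stderr, which would show whether it is waiting in data loading, a collective operation, or somewhere else.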
Let me know if more logs/info would help. I'm using 8 GPUs, which should fit this model comfortably in memory.
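For reference, here is the rough arithmetic behind the memory claim. It is only a back-of-envelope sketch that takes the 11841M trainable parameters from the log and assumes mixed-precision Adam at about 16 bytes per parameter (fp16 weights and gradients plus fp32 master weights, momentum, and variance), with all of that state sharded evenly across the GPUs. Whether the config actually shards everything depends on the ZeRO stage, and activations and temporary buffers are not counted.

```python
def training_state_gib_per_gpu(num_params, num_gpus, bytes_per_param=16):
    """Estimate per-GPU memory (GiB) for weights, gradients, and Adam state.

    Assumes the state is sharded evenly across all GPUs. Activations and
    communication buffers are NOT included.
    """
    total_bytes = num_params * bytes_per_param
    return total_bytes / num_gpus / 2**30


# 11841M parameters (from the training log) on 8 GPUs.
per_gpu = training_state_gib_per_gpu(11_841_000_000, 8)
print(f"approx. {per_gpu:.1f} GiB of model and optimizer state per GPU")
```

Under these assumptions the model and optimizer state come to roughly 22 GiB per GPU, so the hang does not look like a simple out-of-memory situation.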