stablelm-tuned-alpha-3b-gptq-4bit-128g
This is a quantized model saved with auto-gptq. At the time of writing, auto-gptq cannot load models directly from the Hub, so you need to clone this repo and load the model from the local directory.
git lfs install
git clone https://huggingface.co/ethzanalytics/stablelm-tuned-alpha-3b-gptq-4bit-128g
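If git lfs is not set up when the repo is cloned, large files can be left as small text pointers instead of the real weights. The sketch below is one way to check a clone for that before loading; `find_lfs_pointers` is a hypothetical helper written for illustration, and the directory name is assumed to be the default one created by the clone command above.

```python
from pathlib import Path

# First line of a Git LFS pointer file, per the Git LFS pointer format.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def find_lfs_pointers(repo_dir):
    """Return names of top-level files in repo_dir that are still LFS pointers.

    Hypothetical helper for illustration; not part of auto-gptq or git.
    """
    pointers = []
    for path in sorted(Path(repo_dir).iterdir()):
        if not path.is_file():
            continue  # skip .git and any other subdirectories
        with path.open("rb") as handle:
            head = handle.read(len(LFS_POINTER_PREFIX))
        if head == LFS_POINTER_PREFIX:
            pointers.append(path.name)
    return pointers


if __name__ == "__main__":
    # Assumed clone location: the directory created by the git clone command.
    repo = Path("stablelm-tuned-alpha-3b-gptq-4bit-128g")
    if repo.is_dir():
        leftover = find_lfs_pointers(repo)
        print("Unfetched LFS files:", leftover or "none")
```

If any file names are reported, running `git lfs pull` inside the clone should fetch the real contents.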
See the excerpt from the AutoGPTQ quick-start tutorial below for instructions.
Auto-GPTQ Quick Start
Quick Installation
Starting from v0.0.4, you can install auto-gptq directly from PyPI using pip:
pip install auto-gptq
AutoGPTQ supports using triton to speed up inference, but triton currently only supports Linux. To integrate triton, use:
pip install auto-gptq[triton]
If you want to try the newly supported llama-type models in 🤗 Transformers without updating it to the latest version, use:
pip install auto-gptq[llama]
By default, the CUDA extension is built at installation time if CUDA and PyTorch are already installed.
To disable building the CUDA extension, use the following commands:
For Linux
BUILD_CUDA_EXT=0 pip install auto-gptq
For Windows
set BUILD_CUDA_EXT=0 && pip install auto-gptq
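The Linux and Windows commands above differ only in how the shell sets the environment variable. A minimal sketch of doing the same thing from Python, so one script works on both platforms, is shown here; `BUILD_CUDA_EXT` comes from the tutorial, while the function names are hypothetical and chosen for illustration.

```python
import os
import subprocess
import sys


def build_install_command(build_cuda_ext=False):
    """Return (command, env) for installing auto-gptq via pip.

    BUILD_CUDA_EXT is the variable named in the tutorial; the function
    itself is an illustrative helper, not part of auto-gptq.
    """
    env = os.environ.copy()
    if build_cuda_ext:
        # Fall back to the default behaviour by leaving the variable unset.
        env.pop("BUILD_CUDA_EXT", None)
    else:
        env["BUILD_CUDA_EXT"] = "0"
    # Use the current interpreter's pip so the package lands in this environment.
    command = [sys.executable, "-m", "pip", "install", "auto-gptq"]
    return command, env


def install_auto_gptq(build_cuda_ext=False):
    """Run the install; raises CalledProcessError if pip fails."""
    command, env = build_install_command(build_cuda_ext)
    subprocess.run(command, env=env, check=True)
```

Calling `install_auto_gptq()` would run the install with the CUDA extension build disabled; passing `build_cuda_ext=True` leaves the variable unset so the default applies.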
Basic Usage
The full basic-usage script demonstrated here is examples/quantization/basic_usage.py in the AutoGPTQ repository.
The two main classes currently used in AutoGPTQ are AutoGPTQForCausalLM and BaseQuantizeConfig.
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
Load quantized model and do inference
Instead of .from_pretrained, you should use .from_quantized to load a quantized model.
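# Local path to the cloned repo (the directory created by the git clone command above)
quantized_model_dir = "stablelm-tuned-alpha-3b-gptq-4bit-128g"
device = "cuda:0"
model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device=device, use_triton=False, use_safetensors=True)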
This first reads quantize_config.json from the model directory (opt-125m-4bit-128g in the original tutorial). Then, based on the bits and group_size values in that file, it loads the matching model file (gptq_model-4bit-128g.bin in the tutorial's example) onto the first GPU.
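To make the lookup described above concrete, here is a rough sketch that reads quantize_config.json and derives the weights base name from bits and group_size. The `gptq_model-{bits}bit-{group_size}g` pattern is inferred from the tutorial's example file name and should be treated as an assumption; both function names are hypothetical, and the actual file name in a given repo may differ.

```python
import json
from pathlib import Path


def expected_model_basename(model_dir):
    """Derive the weights base name from quantize_config.json.

    The naming pattern is inferred from the tutorial's example file name
    (gptq_model-4bit-128g) and is an assumption, not a documented API.
    """
    config_path = Path(model_dir) / "quantize_config.json"
    with config_path.open(encoding="utf-8") as handle:
        config = json.load(handle)
    return f"gptq_model-{config['bits']}bit-{config['group_size']}g"


def find_weights(model_dir):
    """Return files in model_dir named <expected base name>.<extension>."""
    base = expected_model_basename(model_dir)
    return sorted(p.name for p in Path(model_dir).glob(base + ".*"))
```

An empty list from `find_weights` on the cloned directory would suggest the weights use a different name or were not downloaded.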
Then you can initialize 🤗 Transformers' TextGenerationPipeline and do inference.
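from transformers import AutoTokenizer, TextGenerationPipeline
# Load the tokenizer from the same local directory (assumes the cloned repo includes the tokenizer files)
tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir)
pipeline = TextGenerationPipeline(model=model, tokenizer=tokenizer, device=device)
print(pipeline("auto-gptq is")[0]["generated_text"])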
Conclusion
Congrats! You have learned how to quickly install auto-gptq and integrate with it. In the next chapter, you will learn advanced loading strategies for pretrained and quantized models, along with some best practices for different situations.