Instructions for using ethzanalytics/stablelm-tuned-alpha-3b-gptq-4bit-128g with libraries and local apps.

Libraries

How to use ethzanalytics/stablelm-tuned-alpha-3b-gptq-4bit-128g with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="ethzanalytics/stablelm-tuned-alpha-3b-gptq-4bit-128g")

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("ethzanalytics/stablelm-tuned-alpha-3b-gptq-4bit-128g")
model = AutoModelForCausalLM.from_pretrained("ethzanalytics/stablelm-tuned-alpha-3b-gptq-4bit-128g")

Local Apps

vLLM

How to use ethzanalytics/stablelm-tuned-alpha-3b-gptq-4bit-128g with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "ethzanalytics/stablelm-tuned-alpha-3b-gptq-4bit-128g"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ethzanalytics/stablelm-tuned-alpha-3b-gptq-4bit-128g",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'
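
The same request can also be sent from Python. The following is a minimal sketch using only the standard library. It assumes the server started above is listening on localhost:8000 and that the response follows the OpenAI completions schema; the helper names `build_completion_request` and `complete` are illustrative and not part of vLLM.

```python
import json
import urllib.request

MODEL_ID = "ethzanalytics/stablelm-tuned-alpha-3b-gptq-4bit-128g"


def build_completion_request(prompt, base_url="http://localhost:8000",
                             max_tokens=512, temperature=0.5):
    """Build a POST request for the OpenAI-compatible /v1/completions route."""
    payload = {
        "model": MODEL_ID,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the text of the first completion.

    Assumes the response body has a "choices" list whose items
    carry a "text" field, as in the OpenAI completions schema.
    """
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["text"]


# Requires a running server:
# print(complete("Once upon a time,"))
```

Passing a different `base_url` (for example one ending in port 30000) points the same helper at another OpenAI-compatible server.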

Use Docker

docker model run hf.co/ethzanalytics/stablelm-tuned-alpha-3b-gptq-4bit-128g

SGLang

How to use ethzanalytics/stablelm-tuned-alpha-3b-gptq-4bit-128g with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "ethzanalytics/stablelm-tuned-alpha-3b-gptq-4bit-128g" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ethzanalytics/stablelm-tuned-alpha-3b-gptq-4bit-128g",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "ethzanalytics/stablelm-tuned-alpha-3b-gptq-4bit-128g" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ethzanalytics/stablelm-tuned-alpha-3b-gptq-4bit-128g",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Docker Model Runner
How to use ethzanalytics/stablelm-tuned-alpha-3b-gptq-4bit-128g with Docker Model Runner:
```
docker model run hf.co/ethzanalytics/stablelm-tuned-alpha-3b-gptq-4bit-128g
```

stablelm-tuned-alpha-3b-gptq-4bit-128g

This is a quantized model saved with auto-gptq. At the time of writing, auto-gptq cannot load quantized models directly from the Hub, so you need to clone this repo and load the model from the local directory.

git lfs install
git clone https://huggingface.co/ethzanalytics/stablelm-tuned-alpha-3b-gptq-4bit-128g
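
If Git LFS is not active when cloning, large files are checked out as small text pointer files rather than the real weights. The sketch below scans a local clone for such pointers. `find_lfs_pointers` is a hypothetical helper name, and the directory in the usage comment assumes the default target of the `git clone` command above.

```python
from pathlib import Path

# Every Git LFS pointer file starts with this line.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def find_lfs_pointers(repo_dir):
    """Return the files under repo_dir that are still Git LFS pointers."""
    root = Path(repo_dir)
    pointers = []
    for path in sorted(root.rglob("*")):
        # Skip directories and Git's own metadata folder.
        if not path.is_file() or ".git" in path.relative_to(root).parts:
            continue
        with path.open("rb") as handle:
            head = handle.read(len(LFS_POINTER_PREFIX))
        if head == LFS_POINTER_PREFIX:
            pointers.append(path)
    return pointers


# Example (assumes the clone sits in the current working directory):
# missing = find_lfs_pointers("stablelm-tuned-alpha-3b-gptq-4bit-128g")
# if missing:
#     print("Run `git lfs pull` to fetch:", [p.name for p in missing])
```

An empty result means no pointer files were found, so the weights should be present locally.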

See the excerpt below from the AutoGPTQ tutorial for instructions.

Auto-GPTQ Quick Start

Quick Installation

Starting from v0.0.4, you can install auto-gptq directly from PyPI using pip:

pip install auto-gptq

AutoGPTQ can use triton to speed up inference, but triton is currently supported only on Linux. To install with triton support, use:

pip install auto-gptq[triton]

If you want to try the newly supported llama-type models in 🤗 Transformers but have not updated it to the latest version, use:

pip install auto-gptq[llama]

By default, the CUDA extension is built at installation time if CUDA and PyTorch are already installed.

To disable building the CUDA extension, use the following commands:

For Linux

BUILD_CUDA_EXT=0 pip install auto-gptq

For Windows

set BUILD_CUDA_EXT=0 && pip install auto-gptq

Basic Usage

The full script for the basic usage shown here is examples/quantization/basic_usage.py in the AutoGPTQ repository.

The two main classes currently used in AutoGPTQ are AutoGPTQForCausalLM and BaseQuantizeConfig.

from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

Load quantized model and do inference

Instead of .from_pretrained, you should use .from_quantized to load a quantized model.

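# Path to the local clone of this repo
quantized_model_dir = "stablelm-tuned-alpha-3b-gptq-4bit-128g"
device = "cuda:0"
model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device=device, use_triton=False, use_safetensors=True)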

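This first reads quantize_config.json from the quantized model directory (opt-125m-4bit-128g in the tutorial). Then, based on the values of bits and group_size in it, it loads the matching weights file onto the first GPU. In the tutorial that file is gptq_model-4bit-128g.bin; with use_safetensors=True, a .safetensors file is expected instead.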
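
To make the lookup described above concrete, here is a rough sketch that reads quantize_config.json and derives the weights filename from bits and group_size. It is an illustration, not AutoGPTQ's actual loading code. `expected_weights_file` is a hypothetical helper, and the `gptq_model-{bits}bit-{group_size}g` naming pattern is an assumption inferred from the filename quoted in the tutorial.

```python
import json
from pathlib import Path


def expected_weights_file(quantized_model_dir, use_safetensors=True):
    """Derive the weights file path implied by quantize_config.json."""
    model_dir = Path(quantized_model_dir)
    with (model_dir / "quantize_config.json").open(encoding="utf-8") as handle:
        config = json.load(handle)
    # Naming pattern assumed from the tutorial's example file name.
    basename = f"gptq_model-{config['bits']}bit-{config['group_size']}g"
    extension = ".safetensors" if use_safetensors else ".bin"
    return model_dir / (basename + extension)
```

For a config with bits set to 4 and group_size set to 128, calling the helper with use_safetensors=False yields a path ending in gptq_model-4bit-128g.bin, which matches the filename in the tutorial.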

Then you can initialize 🤗 Transformers' TextGenerationPipeline and do inference.

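from transformers import AutoTokenizer, TextGenerationPipeline

# Assumes the tokenizer files sit alongside the quantized weights in the cloned repo
tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir)
pipeline = TextGenerationPipeline(model=model, tokenizer=tokenizer, device=device)
print(pipeline("auto-gptq is")[0]["generated_text"])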

Conclusion

Congrats! You have learned how to quickly install auto-gptq and integrate it into your code. The next chapter of the tutorial covers advanced loading strategies for pretrained and quantized models, along with best practices for different situations.
