A strict copy of https://huggingface.co/tiiuae/falcon-40b, quantized with GPTQ (calibrated on wikitext-2, 4 bits, group size 128).
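To make the "4 bits, group size 128" setting concrete, here is a minimal sketch of group-wise 4-bit quantization. It uses plain round-to-nearest, not the GPTQ algorithm itself, which additionally compensates for quantization error using calibration data. The function names are hypothetical and chosen only for illustration.

```python
import math


def quantize_group(weights, bits=4):
    """Round-to-nearest quantization of one group (illustrative, not GPTQ)."""
    levels = (1 << bits) - 1                 # 15 representable steps for 4 bits
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels or 1.0        # fall back to 1.0 for a constant group
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantize_group(codes, scale, lo):
    """Map integer codes back to approximate float weights."""
    return [lo + c * scale for c in codes]


def quantize_row(row, group_size=128, bits=4):
    """Split a row into groups; each group gets its own scale and offset."""
    return [
        quantize_group(row[i:i + group_size], bits)
        for i in range(0, len(row), group_size)
    ]


# Toy weights standing in for one row of a weight matrix (assumption).
row = [math.sin(i) for i in range(256)]
groups = quantize_row(row)
```

Each group of 128 weights shares one scale and one offset, so a smaller group size tracks the local weight range more closely at the cost of storing more per-group metadata.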

Intended for use with text-generation-inference (https://github.com/huggingface/text-generation-inference):

model=huggingface/falcon-40b-gptq
num_shard=2
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:0.8 --model-id $model --num-shard $num_shard --quantize gptq
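Once the container is up, the server can be queried over HTTP. The following is a minimal sketch that assumes the server is reachable on host port 8080 (from the `-p 8080:80` mapping above) and uses text-generation-inference's `/generate` route. The helper names, the default token budget, and the timeout are illustrative assumptions.

```python
import json
import urllib.request

# Assumption: the container from the command above is running locally
# and its port 80 is published on host port 8080.
TGI_URL = "http://127.0.0.1:8080/generate"


def build_request(prompt, max_new_tokens=20, url=TGI_URL):
    """Build a POST request for the /generate route without sending it."""
    payload = {
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens},
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, max_new_tokens=20, timeout=60.0):
    """Send the prompt to the server and return the generated text."""
    request = build_request(prompt, max_new_tokens)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["generated_text"]
```

With the server running, `print(generate("What is deep learning?"))` should print the model's continuation. Loading a 40B-parameter model can take a while, so the first request may fail until the shards are ready.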

For full configuration options and usage outside Docker, please refer to https://github.com/huggingface/text-generation-inference
