How to use from vLLM
Install from pip and serve model
```shell
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "ISTA-DASLab/Llama-2-13b-AQLM-2Bit-1x16-hf"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ISTA-DASLab/Llama-2-13b-AQLM-2Bit-1x16-hf",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'
```
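The same request can be issued from Python. The following is a minimal sketch using only the standard library, assuming the server from the previous step is listening on `localhost:8000`; the helper names `build_completion_request` and `complete` are hypothetical and not part of vLLM.

```python
import json
import urllib.request

MODEL_ID = "ISTA-DASLab/Llama-2-13b-AQLM-2Bit-1x16-hf"
# Assumption: the server runs at the address used in the curl call above.
BASE_URL = "http://localhost:8000/v1"


def build_completion_request(prompt, max_tokens=512, temperature=0.5):
    """Build a POST request for the OpenAI-compatible /v1/completions route."""
    payload = {
        "model": MODEL_ID,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{BASE_URL}/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the text of the first completion choice."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

With the server running, `complete("Once upon a time,")` should return the generated continuation as a string. Building the request is kept separate from sending it so the payload can be inspected without a live server.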
Use Docker
```shell
docker model run hf.co/ISTA-DASLab/Llama-2-13b-AQLM-2Bit-1x16-hf
```

Official AQLM quantization of meta-llama/Llama-2-13b-hf.

This quantization uses the 1x16 scheme: 1 codebook with 16-bit codes.
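The "2Bit" in the model name can be related to the 1x16 scheme with a little arithmetic. The sketch below assumes each group of 8 weights is encoded by one code per codebook; the group size of 8 is an assumption not stated on this card, and `bits_per_weight` is a hypothetical helper. Codebook and scale storage are ignored.

```python
def bits_per_weight(num_codebooks, bits_per_code, group_size=8):
    """Code bits stored per weight, ignoring codebook and scale overhead.

    group_size=8 is an assumed value, not taken from this model card.
    """
    return num_codebooks * bits_per_code / group_size


# 1x16 scheme: one 16-bit code per assumed group of 8 weights.
print(bits_per_weight(1, 16))  # prints 2.0
```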

Selected evaluation results for this and other models:

| Model | AQLM scheme | WikiText 2 PPL | Model size (GB) | Hub link |
|---|---|---|---|---|
| Llama-2-7b | 1x16 | 5.92 | 2.4 | Link |
| Llama-2-7b | 2x8 | 6.69 | 2.2 | Link |
| Llama-2-7b | 8x8 | 6.61 | 2.2 | Link |
| Llama-2-13b (THIS) | 1x16 | 5.22 | 4.1 | Link |
| Llama-2-70b | 1x16 | 3.83 | 18.8 | Link |
| Llama-2-70b | 2x8 | 4.21 | 18.2 | Link |
| Mixtral-8x7b | 1x16 | 3.35 | 12.6 | Link |
| Mixtral-8x7b-Instruct | 1x16 | - | 12.6 | Link |

For details on inference, and for instructions on quantizing models yourself, see the official GitHub repo.
