How to use from SGLang
Install from pip and serve the model
# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "ISTA-DASLab/Llama-2-13b-AQLM-2Bit-1x16-hf" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ISTA-DASLab/Llama-2-13b-AQLM-2Bit-1x16-hf",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'
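
The same request can be sent from Python. The following is a minimal sketch using only the standard library, assuming the server started above is listening on localhost:30000 and returns an OpenAI-style completions response. The helper names `build_completion_request` and `complete` are hypothetical and not part of SGLang.

```python
import json
import urllib.request

# Values copied from the curl example above.
BASE_URL = "http://localhost:30000"
MODEL = "ISTA-DASLab/Llama-2-13b-AQLM-2Bit-1x16-hf"


def build_completion_request(prompt, max_tokens=512, temperature=0.5, base_url=BASE_URL):
    """Build (but do not send) a POST request for the /v1/completions endpoint."""
    payload = {
        "model": MODEL,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the generated text.

    ASSUMPTION: the response follows the OpenAI completions layout,
    with the generated text at choices[0].text.
    """
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

With the server running, `complete("Once upon a time,")` would return the continuation as a string. Building the request is kept separate from sending it so the payload can be inspected without a live server.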
Use Docker images
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "ISTA-DASLab/Llama-2-13b-AQLM-2Bit-1x16-hf" \
        --host 0.0.0.0 \
        --port 30000

Official AQLM quantization of meta-llama/Llama-2-13b-hf.

This quantization uses 1 codebook of 16 bits (the 1x16 scheme).
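
To make the scheme concrete, the sketch below works out how many entries a 16-bit codebook can index and how many code bits that costs per weight. The group size of 8 weights per code is an assumption, not stated in this card; it is chosen because it is consistent with the "2Bit" in the model name. The function names are illustrative only.

```python
def codebook_entries(bits_per_code: int) -> int:
    """Number of distinct codebook vectors a code of this width can index."""
    return 2 ** bits_per_code


def code_bits_per_weight(num_codebooks: int, bits_per_code: int, group_size: int) -> float:
    """Bits spent on codes per weight, ignoring codebook and scale storage."""
    return num_codebooks * bits_per_code / group_size


# ASSUMPTION: each group of 8 weights shares one set of codes.
ASSUMED_GROUP_SIZE = 8

entries = codebook_entries(16)
bits = code_bits_per_weight(num_codebooks=1, bits_per_code=16, group_size=ASSUMED_GROUP_SIZE)
print(f"{entries} codebook entries, {bits} code bits per weight")
```

Under that assumption, the single 16-bit code per group works out to 2 bits per weight. The figure excludes the codebooks themselves and any layers kept in higher precision.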

Selected evaluation results for this and other models:

| Model | AQLM scheme | WikiText-2 PPL | Model size, GB | Hub link |
|---|---|---|---|---|
| Llama-2-7b | 1x16 | 5.92 | 2.4 | Link |
| Llama-2-7b | 2x8 | 6.69 | 2.2 | Link |
| Llama-2-7b | 8x8 | 6.61 | 2.2 | Link |
| Llama-2-13b (THIS) | 1x16 | 5.22 | 4.1 | Link |
| Llama-2-70b | 1x16 | 3.83 | 18.8 | Link |
| Llama-2-70b | 2x8 | 4.21 | 18.2 | Link |
| Mixtral-8x7b | 1x16 | 3.35 | 12.6 | Link |
| Mixtral-8x7b-Instruct | 1x16 | - | 12.6 | Link |

To learn more about inference and how to quantize models yourself, refer to the official GitHub repo.
