How to use from
vLLM
Install from pip and serve model
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "fbaldassarri/meta-llama_Llama-3.1-8B-Instruct-auto_awq-int4-gs128-sym"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "fbaldassarri/meta-llama_Llama-3.1-8B-Instruct-auto_awq-int4-gs128-sym",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
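The same request can be sent from Python. The following is a minimal sketch using only the standard library. It assumes the server started above is listening on `localhost:8000` (as in the curl call) and that the reply follows the usual OpenAI-compatible layout. The helper names `build_chat_request` and `ask` are hypothetical and are not part of vLLM.

```python
import json
import urllib.request

MODEL_ID = "fbaldassarri/meta-llama_Llama-3.1-8B-Instruct-auto_awq-int4-gs128-sym"
BASE_URL = "http://localhost:8000"  # assumption: same host and port as the curl example


def build_chat_request(prompt, model=MODEL_ID, base_url=BASE_URL):
    """Build (without sending) a POST request for /v1/chat/completions."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, timeout=60.0):
    """Send the prompt to a running server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        body = json.load(resp)
    # Assumption: OpenAI-compatible responses carry the reply in choices[0].message.content.
    return body["choices"][0]["message"]["content"]


# Usage (requires the server to be running):
# print(ask("What is the capital of France?"))
```

Building the request separately from sending it makes the payload easy to inspect before any network call is made.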
Use Docker
docker model run hf.co/fbaldassarri/meta-llama_Llama-3.1-8B-Instruct-auto_awq-int4-gs128-sym

Model Information

Quantized version of meta-llama/Llama-3.1-8B-Instruct using torch.float32 for quantization tuning.

  • 4 bits (INT4)
  • group size = 128
  • Symmetric quantization
  • Method: AutoAWQ

Quantization framework: Intel AutoRound

Note: this INT4 version of Llama-3.1-8B-Instruct has been quantized to run inference on CPU.
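To make the settings listed above concrete (4 bits, group size 128, symmetric), here is a small pure-Python sketch of group-wise symmetric INT4 quantization. It uses plain round-to-nearest and is for illustration only: it is not the AutoRound algorithm, which tunes the rounding during the quantization run, and it does not reproduce the packed layout of the saved checkpoint. The function names and the choice of `max_abs / 7` as the scale are assumptions.

```python
def quantize_group_sym_int4(weights):
    """Round-to-nearest symmetric INT4 quantization of one weight group."""
    qmin, qmax = -8, 7  # signed 4-bit integer range
    max_abs = max(abs(w) for w in weights)
    # One scale per group; symmetric means no zero-point (0.0 maps to code 0).
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [min(qmax, max(qmin, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_group(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


def quantize_row(weights, group_size=128):
    """Split a row into consecutive groups and quantize each one independently."""
    return [
        quantize_group_sym_int4(weights[start:start + group_size])
        for start in range(0, len(weights), group_size)
    ]
```

With this scheme each group of 128 weights shares a single scale, so the reconstruction error of any weight is at most half of its group's scale.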

Replication Recipe

Step 1 Install Requirements

I suggest installing the requirements into a dedicated Python virtualenv or a conda environment.

python -m pip install <package> --upgrade
  • accelerate==1.0.1
  • auto_gptq==0.7.1
  • neural_compressor==3.1
  • torch==2.3.0+cpu
  • torchaudio==2.5.0+cpu
  • torchvision==0.18.0+cpu
  • transformers==4.45.2
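After installing, it can help to confirm that the environment matches the pins above. The sketch below compares installed versions against the list using `importlib.metadata` from the standard library. The helper names are hypothetical, the comparison is an exact string match (so `3.1` and `3.1.0` would be reported as different), and it assumes the distribution names resolve as written in the list.

```python
from importlib import metadata

# Pins copied from the requirements list above.
PINS = {
    "accelerate": "1.0.1",
    "auto_gptq": "0.7.1",
    "neural_compressor": "3.1",
    "torch": "2.3.0+cpu",
    "torchaudio": "2.5.0+cpu",
    "torchvision": "0.18.0+cpu",
    "transformers": "4.45.2",
}


def installed_version(name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pins, lookup=installed_version):
    """Return {package: (wanted, found)} for every package that differs from its pin."""
    mismatches = {}
    for name, wanted in pins.items():
        found = lookup(name)
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


# Usage:
# for name, (wanted, found) in find_mismatches(PINS).items():
#     print(f"{name}: wanted {wanted}, found {found}")
```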

Step 2 Build the Intel AutoRound wheel from source

python -m pip install git+https://github.com/intel/auto-round.git

Step 3 Script for Quantization

from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

# Load the full-precision base model and its tokenizer.
model_name = "meta-llama/Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# INT4, group size 128, symmetric; tune on CPU with mixed precision (amp) disabled.
bits, group_size, sym, device, amp = 4, 128, True, 'cpu', False
autoround = AutoRound(model, tokenizer, nsamples=128, iters=200, seqlen=512, batch_size=4,
                      bits=bits, group_size=group_size, sym=sym, device=device, amp=amp)
autoround.quantize()

# Export the quantized weights in AutoAWQ format.
output_dir = "./AutoRound/meta-llama_Llama-3.1-8B-Instruct-auto_awq-int4-gs128-sym"
autoround.save_quantized(output_dir, format='auto_awq', inplace=True)

License

Llama 3.1 Community License

Disclaimer

This quantized model comes with no warranty. It has been developed for research purposes only.
