How to use from vLLM
Install from pip and serve model
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "STiFLeR7/Phi2-GPTQ"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "STiFLeR7/Phi2-GPTQ",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'
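The same request can be sent from Python. The sketch below uses only the standard library and assumes the vLLM server started above is listening on localhost:8000. The helper names `build_payload` and `complete` are illustrative and not part of vLLM.

```python
import json
import urllib.request

# Endpoint and model name copied from the curl example above.
BASE_URL = "http://localhost:8000/v1/completions"
MODEL_ID = "STiFLeR7/Phi2-GPTQ"


def build_payload(prompt, max_tokens=512, temperature=0.5):
    """Build the JSON body for an OpenAI-compatible completions request."""
    return {
        "model": MODEL_ID,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


def complete(prompt, url=BASE_URL, timeout=60.0):
    """POST the prompt to the server and return the first generated text."""
    body = json.dumps(build_payload(prompt)).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        result = json.load(response)
    # OpenAI-compatible completions responses carry the text in choices[i].text.
    return result["choices"][0]["text"]


if __name__ == "__main__":
    # Requires a running server; see the `vllm serve` command above.
    print(complete("Once upon a time,"))
```

Keeping the payload construction separate from the network call makes the request body easy to inspect or adjust before anything is sent.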
Use Docker
docker model run hf.co/STiFLeR7/Phi2-GPTQ

🧠 Phi-2 GPTQ (Quantized)

This repository provides a 4-bit GPTQ-quantized version of Microsoft's Phi-2 model, optimized for efficient inference with gptqmodel.

📌 Model Details

  • Base Model: Microsoft Phi-2
  • Quantization: GPTQ (4-bit)
  • Quantizer: GPTQModel
  • Framework: PyTorch + HuggingFace Transformers
  • Device Support: CUDA (GPU)
  • License: Apache 2.0

🚀 Features

  • ✅ Lightweight: 4-bit quantization significantly reduces memory usage
  • ✅ Fast Inference: Ideal for deployment on consumer GPUs
  • ✅ Compatible: Works with transformers, optimum, and gptqmodel
  • ✅ CUDA-accelerated: Automatically uses GPU for speed
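To put a rough number on the memory claim, the back-of-the-envelope estimate below compares weight storage at 16 bits and 4 bits per parameter. It assumes Phi-2's published size of about 2.7 billion parameters and counts weights only. A real GPTQ checkpoint is somewhat larger, since it also stores scales and zero points and typically leaves some layers unquantized. The function name `weight_memory_gib` is illustrative.

```python
def weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB for a given bit width per parameter."""
    total_bytes = n_params * bits_per_weight / 8
    return total_bytes / 2**30


# Assumption: Phi-2 has roughly 2.7 billion parameters.
PHI2_PARAMS = 2.7e9

fp16_gib = weight_memory_gib(PHI2_PARAMS, 16)
int4_gib = weight_memory_gib(PHI2_PARAMS, 4)

print(f"FP16 weights:  {fp16_gib:.2f} GiB")
print(f"4-bit weights: {int4_gib:.2f} GiB")
print(f"Reduction:     {fp16_gib / int4_gib:.0f}x (weights only)")
```

Under these assumptions the weights shrink from about 5 GiB to about 1.3 GiB. Activations and the KV cache are not included, so actual GPU usage during inference will be higher.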

📚 Usage

This model is ready to use with the Hugging Face transformers library.
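A minimal loading sketch follows. It assumes a CUDA GPU and that `transformers`, `accelerate` (needed for `device_map="auto"`), and a GPTQ backend such as `gptqmodel` are installed. The helper names and the default sampling values (borrowed from the curl example) are illustrative, not settings recommended by the model author.

```python
MODEL_ID = "STiFLeR7/Phi2-GPTQ"


def generation_kwargs(max_new_tokens=512, temperature=0.5):
    """Build keyword arguments for model.generate().

    Sampling is enabled only for a positive temperature; otherwise
    decoding falls back to greedy search.
    """
    kwargs = {"max_new_tokens": max_new_tokens, "do_sample": temperature > 0}
    if temperature > 0:
        kwargs["temperature"] = temperature
    return kwargs


def generate(prompt, **overrides):
    """Load the quantized model and return the decoded completion."""
    # Imported here so the helper above stays usable without transformers.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    # device_map="auto" places the weights on the available GPU.
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, **generation_kwargs(**overrides))
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


if __name__ == "__main__":
    print(generate("Once upon a time,", max_new_tokens=64))
```

The model is reloaded on every call to `generate`, which keeps the sketch short. For repeated use, load the tokenizer and model once and reuse them.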

🧪 Intended Use

  • Research and development
  • Prototyping generative applications
  • Fast inference environments with limited GPU memory

βš–οΈ License

This model is distributed under the Apache License 2.0.
