Instructions for using KavinduHansaka/phi4-mini-bnb-4bit with libraries and local apps.

Libraries

How to use KavinduHansaka/phi4-mini-bnb-4bit with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="KavinduHansaka/phi4-mini-bnb-4bit")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("KavinduHansaka/phi4-mini-bnb-4bit")
# This is a text-only causal language model, so AutoModelForCausalLM is the matching class
model = AutoModelForCausalLM.from_pretrained("KavinduHansaka/phi4-mini-bnb-4bit")
messages = [
    {"role": "user", "content": "Who are you?"},
]
# Render the chat template and tokenize in one step
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use KavinduHansaka/phi4-mini-bnb-4bit with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "KavinduHansaka/phi4-mini-bnb-4bit"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "KavinduHansaka/phi4-mini-bnb-4bit",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
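
The same request can be sent from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_request` and `ask` are illustrative and not part of vLLM, and the default URL assumes the server started above is listening on port 8000.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:8000",
                       model="KavinduHansaka/phi4-mini-bnb-4bit"):
    """Build a POST request for an OpenAI-compatible /v1/chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, **kwargs):
    """Send the request and return the reply text (requires a running server)."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    # OpenAI-compatible servers return the reply under choices[0].message.content
    return body["choices"][0]["message"]["content"]
```

With a server running, `ask("What is the capital of France?")` would return the model's answer as a string. Passing a different `base_url` points the same helper at any other OpenAI-compatible endpoint.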

Use Docker

docker model run hf.co/KavinduHansaka/phi4-mini-bnb-4bit

SGLang

How to use KavinduHansaka/phi4-mini-bnb-4bit with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "KavinduHansaka/phi4-mini-bnb-4bit" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "KavinduHansaka/phi4-mini-bnb-4bit",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "KavinduHansaka/phi4-mini-bnb-4bit" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "KavinduHansaka/phi4-mini-bnb-4bit",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use KavinduHansaka/phi4-mini-bnb-4bit with Docker Model Runner:
```
docker model run hf.co/KavinduHansaka/phi4-mini-bnb-4bit
```

Phi-4-mini-reasoning (BitsAndBytes 4-bit NF4 Quantized)

This repository contains a 4-bit quantized version of microsoft/Phi-4-mini-reasoning, produced with BitsAndBytes via Hugging Face Transformers.
Quantization reduces VRAM usage and is intended to preserve most of the model’s reasoning capabilities; no benchmark comparison with the base model is reported here.
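
As a rough illustration of the VRAM saving, the sketch below estimates weight storage at different bit widths. The parameter count of 3.8e9 is an assumption (this page lists the size only as "4B params"), and the per-weight overheads for the quantization constants follow the block sizes described in the QLoRA paper (64 weights per block, 256 blocks per second-level block). It treats every weight as quantized, which is optimistic: layers that stay in higher precision, such as embeddings, raise the real footprint.

```python
def weight_memory_gib(n_params, bits_per_param):
    """Memory needed to store n_params weights at the given bit width, in GiB."""
    return n_params * bits_per_param / 8 / 2**30


# Assumption: roughly 3.8e9 parameters; substitute the real count if known.
N_PARAMS = 3.8e9

# NF4 stores 4 bits per weight plus quantization constants.
# Plain NF4: one fp32 absmax per 64-weight block -> 32/64 = 0.5 extra bits per weight.
NF4_PLAIN = 4 + 32 / 64
# Double quantization: 8-bit constants plus one fp32 constant per 256 blocks.
NF4_DOUBLE = 4 + 8 / 64 + 32 / (64 * 256)

for label, bits in [("bf16", 16), ("nf4", NF4_PLAIN), ("nf4 + double quant", NF4_DOUBLE)]:
    print(f"{label:>20}: {bits:6.3f} bits/weight, {weight_memory_gib(N_PARAMS, bits):.2f} GiB")
```

This covers weights only; activations and the KV cache are stored separately and are not reduced by weight quantization.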

Model Details

Model Description

Developed by (base model): Microsoft
Shared by (quantized version): KavinduHansaka
Model type: Causal Language Model (decoder-only transformer)
Context length: 128K
Language(s): English
License: MIT (inherited from base model)
Quantized from: microsoft/Phi-4-mini-reasoning (no additional fine-tuning)

Model Sources

Repository (quantized): KavinduHansaka/phi4-mini-bnb-4bit
Repository (base model): microsoft/Phi-4-mini-reasoning

Uses

Direct Use

Text and reasoning generation
Educational and research experiments
Running inference on lower-VRAM GPUs

Downstream Use

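Can be adapted further for domain-specific reasoning tasks, typically with parameter-efficient methods such as QLoRA, because the 4-bit weights themselves are not trained directly
Can be integrated into chatbots, assistants, and research pipelines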

Out-of-Scope Use

Do not use for generating harmful, biased, or unsafe content
Not recommended for high-accuracy production systems without further testing

Bias, Risks, and Limitations

As with the base model, this model may produce biased or incorrect content.
Quantization reduces numerical precision, which can slightly affect reasoning quality.
Long-context reasoning (up to 128K tokens) remains resource-intensive: quantization shrinks the weights, not the KV cache.
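
To illustrate why long contexts stay expensive even with 4-bit weights, the sketch below estimates the size of the key/value cache. The architecture values (32 layers, 8 key/value heads, head dimension 128) are assumptions for illustration and should be checked against the model's config.json; the cache is assumed to be held in bf16 (2 bytes per value).

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   bytes_per_value=2, batch_size=1):
    """Size of the key/value cache: one K and one V tensor per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value * batch_size


# Assumed architecture values, not taken from this card.
ASSUMED = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for tokens in (4_096, 32_768, 131_072):
    gib = kv_cache_bytes(tokens, **ASSUMED) / 2**30
    print(f"{tokens:>7} tokens -> {gib:.1f} GiB of KV cache")
```

Under these assumed values the cache grows by 128 KiB per token and reaches 16 GiB at 131,072 tokens, several times the size of the quantized weights.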

Recommendations

Apply appropriate safety filters before deploying in production.
Be aware that outputs are not guaranteed to be factually correct.

How to Get Started with the Model

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

model_id = "KavinduHansaka/phi4-mini-bnb-4bit"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)

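tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",  # place the weights on the available device(s) automatically
    dtype=torch.bfloat16,  # older Transformers releases name this argument torch_dtype
    quantization_config=bnb_config
)

# Tokenize the prompt and move it to the model's device
inputs = tokenizer("Explain why the sky is blue in simple terms.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
# Decode the full sequence (prompt plus completion) without special tokens
print(tokenizer.decode(outputs[0], skip_special_tokens=True))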

Training Details

This model inherits training data from microsoft/Phi-4-mini-reasoning. No additional fine-tuning was done.

Quantization method: BitsAndBytes 4-bit (NF4, double quantization)
Compute precision: bfloat16
Original precision: fp16
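
The general idea behind blockwise 4-bit quantization can be sketched in a few lines of plain Python. This is an illustration of the technique, not the BitsAndBytes implementation: the function names are hypothetical, and the codebook is a stand-in with 16 evenly spaced levels, whereas NF4 uses 16 levels derived from the quantiles of a normal distribution.

```python
def quantize_blockwise(weights, codebook, block_size=64):
    """Map floats to codebook indices, storing one absmax scale per block."""
    indices, scales = [], []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        absmax = max(abs(w) for w in block) or 1.0  # all-zero block: avoid dividing by zero
        scales.append(absmax)
        for w in block:
            normalized = w / absmax  # now in [-1, 1]
            # Pick the index of the nearest codebook level
            idx = min(range(len(codebook)), key=lambda i: abs(codebook[i] - normalized))
            indices.append(idx)
    return indices, scales


def dequantize_blockwise(indices, scales, codebook, block_size=64):
    """Recover approximate floats from indices and per-block scales."""
    return [codebook[idx] * scales[pos // block_size] for pos, idx in enumerate(indices)]


# Stand-in codebook (NOT the NF4 levels): 16 evenly spaced values in [-1, 1].
UNIFORM_16 = [-1 + 2 * i / 15 for i in range(16)]

# Deterministic toy weights in [-1, 1]
weights = [0.02 * ((i * 37) % 101 - 50) for i in range(128)]
idx, scales = quantize_blockwise(weights, UNIFORM_16)
restored = dequantize_blockwise(idx, scales, UNIFORM_16)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"{len(scales)} blocks, max abs error {max_err:.4f}")
```

Each index fits in 4 bits because the codebook has 16 entries. Double quantization applies the same idea a second time to the per-block scales to shrink their storage.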

Technical Specifications

Architecture: Decoder-only transformer
Parameters: Same as Phi-4-mini-reasoning
Quantization: 4-bit NF4

Citation

If you use this quantized model, please also cite the original Microsoft release:

@misc{microsoft2025phi4mini,
  title={Phi-4-mini-reasoning},
  author={Microsoft},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/microsoft/Phi-4-mini-reasoning}}
}

Model Card Authors

Quantized version shared by KavinduHansaka
Base model by Microsoft

Model Card Contact

For issues/questions with this quantized release: open a discussion on KavinduHansaka/phi4-mini-bnb-4bit.
For base model details: see microsoft/Phi-4-mini-reasoning.

Safetensors

Model size: 4B params
Tensor type: F32, BF16

Model tree for KavinduHansaka/phi4-mini-bnb-4bit

Base model: microsoft/Phi-4-mini-reasoning
Quantized: this model (one of 36 quantizations of the base model)