Phi-4-mini-reasoning (BitsAndBytes 4-bit NF4 Quantized)
This repository contains a 4-bit quantized version of microsoft/Phi-4-mini-reasoning, produced with BitsAndBytes via Hugging Face Transformers.
Quantization reduces VRAM usage while preserving most of the model's reasoning capabilities.
Model Details
Model Description
- Developed by (base model): Microsoft
- Shared by (quantized version): KavinduHansaka
- Model type: Causal Language Model (decoder-only transformer)
- Context length: 128K
- Language(s): English
- License: MIT (inherited from base model)
- Quantized from: microsoft/Phi-4-mini-reasoning (no additional fine-tuning)
Model Sources
- Repository (quantized): KavinduHansaka/phi4-mini-bnb-4bit
- Repository (base model): microsoft/Phi-4-mini-reasoning
Uses
Direct Use
- Text and reasoning generation
- Educational and research experiments
- Running inference on lower-VRAM GPUs
Downstream Use
- Can be fine-tuned further for domain-specific reasoning tasks
- Integrated into chatbots, assistants, and research pipelines
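Further fine-tuning of a 4-bit checkpoint is usually done by training small adapter matrices on top of the frozen quantized weights (the QLoRA approach), since the 4-bit weights themselves are not updated. The following is a minimal sketch using the `peft` library, not a recipe from the author of this card. The helper names `lora_hyperparameters` and `attach_lora_adapters`, and the rank, alpha, and dropout values, are illustrative assumptions.

```python
def lora_hyperparameters(rank: int = 16) -> dict:
    """Illustrative LoRA settings (assumed values, not tuned for this model)."""
    return {
        "r": rank,
        "lora_alpha": 2 * rank,
        "lora_dropout": 0.05,
        # Ask peft to adapt every linear layer except the output head.
        "target_modules": "all-linear",
        "task_type": "CAUSAL_LM",
    }


def attach_lora_adapters(model_id: str = "KavinduHansaka/phi4-mini-bnb-4bit",
                         rank: int = 16):
    """Load the quantized checkpoint and wrap it with trainable LoRA adapters."""
    # Imported lazily: these need transformers, peft, bitsandbytes, and a GPU.
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
    # Prepares a k-bit model for training (for example, enables input gradients).
    model = prepare_model_for_kbit_training(model)

    peft_model = get_peft_model(model, LoraConfig(**lora_hyperparameters(rank)))
    peft_model.print_trainable_parameters()
    return peft_model
```

The returned model can then be passed to an ordinary training loop or trainer; only the adapter parameters receive gradients.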
Out-of-Scope Use
- Do not use for generating harmful, biased, or unsafe content
- Not recommended for high-accuracy production systems without further testing
Bias, Risks, and Limitations
- As with the base model, it may produce biased or incorrect content.
- Quantization may reduce numerical precision, which can slightly affect reasoning quality.
- Long-context inference (up to 128K tokens) remains resource-intensive, because quantizing the weights does not shrink the KV cache or activations.
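To make the long-context limitation concrete, the key/value cache can be estimated from a few architecture numbers. This is a rough sketch: the layer count, KV-head count, and head dimension below are assumptions about the base model and are not stated in this card, so check them against the repository's `config.json` before relying on the result. The function name `kv_cache_bytes` is hypothetical.

```python
def kv_cache_bytes(seq_len: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2,
                   batch_size: int = 1) -> int:
    """Bytes needed to cache keys and values for `seq_len` tokens."""
    # Factor 2: one key tensor and one value tensor per layer.
    return (2 * batch_size * num_layers * num_kv_heads
            * head_dim * seq_len * bytes_per_value)


# Assumed architecture values (verify against config.json).
ASSUMED_ARCH = dict(num_layers=32, num_kv_heads=8, head_dim=128)

for tokens in (4_096, 32_768, 131_072):
    gib = kv_cache_bytes(tokens, **ASSUMED_ARCH) / 1024**3
    print(f"{tokens:>7} tokens -> {gib:.2f} GiB of KV cache (bfloat16)")
```

Under these assumptions the cache grows linearly with context length and, at the full 128K window, is much larger than the 4-bit weights themselves.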
Recommendations
- Apply appropriate safety filters before deploying in production.
- Be aware that outputs are not guaranteed to be factually correct.
How to Get Started with the Model
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

model_id = "KavinduHansaka/phi4-mini-bnb-4bit"

# NF4 4-bit weights with double quantization; matrix multiplications run in bfloat16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",  # place the weights on the available device(s)
    dtype=torch.bfloat16,  # older transformers releases name this argument torch_dtype
    quantization_config=bnb_config,
)

# Tokenize the prompt, generate up to 200 new tokens, and decode the result.
inputs = tokenizer("Explain why the sky is blue in simple terms.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
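The snippet above passes a raw string to the tokenizer. Because the model is tagged as conversational, formatting the prompt with the tokenizer's chat template may give better results. The sketch below shows one way to do that; the helper names `build_messages` and `generate_reply` are hypothetical, and it assumes the repository's `config.json` already carries the quantization settings (as is usual for saved bitsandbytes checkpoints), so no `BitsAndBytesConfig` is passed.

```python
def build_messages(question: str) -> list[dict]:
    """Wrap a single user question in the chat-message format."""
    return [{"role": "user", "content": question}]


def generate_reply(question: str,
                   model_id: str = "KavinduHansaka/phi4-mini-bnb-4bit",
                   max_new_tokens: int = 512) -> str:
    """Generate a reply to `question` using the tokenizer's chat template."""
    # Imported lazily: loading needs transformers, bitsandbytes, and a GPU.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

    # Render the messages with the model's template and tokenize in one step.
    inputs = tokenizer.apply_chat_template(
        build_messages(question),
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)

    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, not the echoed prompt.
    prompt_len = inputs["input_ids"].shape[-1]
    return tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True)


# Example call (downloads the checkpoint on first use):
# print(generate_reply("Explain why the sky is blue in simple terms."))
```

Reasoning models tend to produce long intermediate steps, so a larger `max_new_tokens` than in the basic example is assumed here.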
Training Details
This model inherits training data from microsoft/Phi-4-mini-reasoning. No additional fine-tuning was done.
- Quantization method: BitsAndBytes 4-bit (NF4, double quantization)
- Compute dtype: bfloat16 (4-bit weights are dequantized to bfloat16 for matrix multiplications)
- Original precision: fp16
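As a rough illustration of what NF4 with double quantization saves, the storage cost per quantized weight can be worked out by hand. This sketch assumes the block sizes described in the QLoRA paper (64 weights per block, 256 blocks per second-level group) and a parameter count of about 3.8 billion, which is an assumption since this card does not state it. It also treats every parameter as quantized, which overstates the saving.

```python
def nf4_bits_per_param(block_size: int = 64, double_quant: bool = True,
                       dq_block_size: int = 256) -> float:
    """Effective bits per quantized weight, including scaling constants."""
    if double_quant:
        # One 8-bit quantized scale per block, plus one fp32 constant
        # shared by each group of `dq_block_size` blocks.
        overhead = 8 / block_size + 32 / (block_size * dq_block_size)
    else:
        # One fp32 scale per block.
        overhead = 32 / block_size
    return 4 + overhead


def weight_gib(num_params: float, bits_per_param: float) -> float:
    """Weight storage in GiB for a given parameter count and bit width."""
    return num_params * bits_per_param / 8 / 1024**3


ASSUMED_PARAMS = 3.8e9  # assumption: parameter count is not given in this card

print(f"16-bit weights:        {weight_gib(ASSUMED_PARAMS, 16):.2f} GiB")
print(f"NF4 + double quant:    {weight_gib(ASSUMED_PARAMS, nf4_bits_per_param()):.2f} GiB")
print(f"NF4 without double q.: {weight_gib(ASSUMED_PARAMS, nf4_bits_per_param(double_quant=False)):.2f} GiB")
```

In practice bitsandbytes typically leaves embeddings and some other modules in higher precision, so the real footprint is larger than this lower bound.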
Technical Specifications
- Architecture: Decoder-only transformer
- Parameters: Same as Phi-4-mini-reasoning
- Quantization: 4-bit NF4
Citation
If you use this quantized model, please also cite the original Microsoft release:
@misc{microsoft2025phi4mini,
  title        = {Phi-4-mini-reasoning},
  author       = {Microsoft},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/microsoft/Phi-4-mini-reasoning}}
}
Model Card Authors
- Quantized version shared by KavinduHansaka
- Base model by Microsoft
Model Card Contact
- For issues/questions with this quantized release: open a discussion on KavinduHansaka/phi4-mini-bnb-4bit.
- For base model details: see microsoft/Phi-4-mini-reasoning.