Instructions for using CraneAILabs/medgemma-1.5-4b-it-nf4-blackwell with libraries and local apps.

Libraries

How to use CraneAILabs/medgemma-1.5-4b-it-nf4-blackwell with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="CraneAILabs/medgemma-1.5-4b-it-nf4-blackwell")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("CraneAILabs/medgemma-1.5-4b-it-nf4-blackwell")
model = AutoModelForMultimodalLM.from_pretrained("CraneAILabs/medgemma-1.5-4b-it-nf4-blackwell")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))


vLLM

How to use CraneAILabs/medgemma-1.5-4b-it-nf4-blackwell with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "CraneAILabs/medgemma-1.5-4b-it-nf4-blackwell"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "CraneAILabs/medgemma-1.5-4b-it-nf4-blackwell",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
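
The same request can also be issued from Python. The sketch below uses only the standard library and assumes the vLLM server started above is listening on localhost:8000. `build_chat_request` and `send` are illustrative helper names, not part of vLLM.

```python
import json
import urllib.request

MODEL_ID = "CraneAILabs/medgemma-1.5-4b-it-nf4-blackwell"


def build_chat_request(base_url: str, prompt: str, image_url: str) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request with one text part and one image part."""
    payload = {
        "model": MODEL_ID,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> str:
    """Send the request and return the assistant's reply text (needs a running server)."""
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `send(build_chat_request("http://localhost:8000", "Describe this image in one sentence.", image_url))` should return the reply text. For the SGLang server, only the base URL changes to port 30000.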

Use Docker

docker model run hf.co/CraneAILabs/medgemma-1.5-4b-it-nf4-blackwell

SGLang

How to use CraneAILabs/medgemma-1.5-4b-it-nf4-blackwell with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "CraneAILabs/medgemma-1.5-4b-it-nf4-blackwell" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "CraneAILabs/medgemma-1.5-4b-it-nf4-blackwell",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "CraneAILabs/medgemma-1.5-4b-it-nf4-blackwell" \
        --host 0.0.0.0 \
        --port 30000


MedGemma 1.5 4B IT — NF4 Quantized for NVIDIA Blackwell

NF4 (4-bit NormalFloat) quantization of google/medgemma-1.5-4b-it optimized for NVIDIA Blackwell GPUs.

Why NF4 on Blackwell?

Generation speed. NF4 on Blackwell's GB10 generates at 39.8 tokens/sec, about 1.94x the speed of bf16 on the same hardware and 3.24x the speed of Q4_K_M GGUF on a 4-core CPU.

The speed advantage comes from Blackwell's native low-precision compute units combined with bitsandbytes' NF4 kernel fusion. Quantization reduces memory bandwidth pressure, which is the primary bottleneck during autoregressive decoding.
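
To make the bandwidth argument concrete: if decoding is purely bandwidth-bound, every generated token has to stream roughly all model weights from memory once, so tokens/sec is at most bandwidth divided by weight size. The sketch below applies that rule of thumb to the memory footprints reported in the Benchmark section (3.5 GB for NF4, 8.6 GB for bf16). The 273 GB/s bandwidth is an assumption (a nominal DGX Spark figure, not taken from this card), and the model ignores KV-cache traffic, dequantization cost, and kernel overhead.

```python
def decode_ceiling_tok_per_s(weight_gb: float, bandwidth_gb_per_s: float) -> float:
    """Upper bound on decode speed if each token streams all weights from memory once."""
    return bandwidth_gb_per_s / weight_gb


# Assumption: nominal DGX Spark memory bandwidth, not a value stated on this card.
ASSUMED_BANDWIDTH_GB_S = 273.0

nf4_ceiling = decode_ceiling_tok_per_s(3.5, ASSUMED_BANDWIDTH_GB_S)
bf16_ceiling = decode_ceiling_tok_per_s(8.6, ASSUMED_BANDWIDTH_GB_S)

print(f"NF4 ceiling:  {nf4_ceiling:.1f} tok/s")
print(f"bf16 ceiling: {bf16_ceiling:.1f} tok/s")
# The ratio of ceilings equals the ratio of footprints, whatever bandwidth is assumed.
print(f"Ceiling ratio: {nf4_ceiling / bf16_ceiling:.2f}x")
```

The ceiling ratio comes out at about 2.46x regardless of the assumed bandwidth. The measured speedup is somewhat lower, which is what one would expect if part of the saved bandwidth is spent on dequantization and other overheads.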

Benchmark

All configurations were tested on the same 5 medical prompts, generating 200 tokens each at temperature 0.3.

Configuration	Hardware	Generation Speed	VRAM / RAM
NF4 (this model)	DGX Spark GB10	39.8 tok/s	3.5 GB
bf16 (full precision)	DGX Spark GB10	20.5 tok/s	8.6 GB
Q4_K_M GGUF	Azure 4-core EPYC	12.3 tok/s	~4 GB

NF4 vs bf16 (same GPU): 1.94x faster generation
NF4 vs GGUF (GPU vs CPU): 3.24x faster generation
No degradation in medical response quality was observed across the three configurations on these prompts.
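
The speedup figures can be reproduced with a few lines of arithmetic. This sketch uses the 39.8 tok/s NF4 figure quoted in the prose above together with the bf16 and GGUF throughput and the memory footprints from the table. The dictionary keys are illustrative labels only.

```python
# Throughput in tokens/sec, as reported on this card
throughput = {"nf4": 39.8, "bf16": 20.5, "gguf_q4_k_m": 12.3}
# Memory footprint in GB for the two GPU configurations
memory_gb = {"nf4": 3.5, "bf16": 8.6}

speedup_vs_bf16 = throughput["nf4"] / throughput["bf16"]
speedup_vs_gguf = throughput["nf4"] / throughput["gguf_q4_k_m"]
memory_saving = 1 - memory_gb["nf4"] / memory_gb["bf16"]

print(f"NF4 vs bf16: {speedup_vs_bf16:.2f}x")        # 1.94x
print(f"NF4 vs GGUF: {speedup_vs_gguf:.2f}x")        # 3.24x
print(f"Memory saved vs bf16: {memory_saving:.0%}")  # 59%
```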

Quick Start

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "CraneAILabs/medgemma-1.5-4b-it-nf4-blackwell"

# The checkpoint is already NF4-quantized, so no BitsAndBytesConfig is passed here.
# Loading it is expected to require the bitsandbytes package.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "What are the symptoms of malaria?"}]
# return_dict=True yields input_ids and attention_mask together
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_dict=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=200, temperature=0.3, do_sample=True)

# Decode only the newly generated tokens, skipping the prompt
prompt_len = inputs["input_ids"].shape[-1]
print(tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True))

Example Output

Prompt: "What are the symptoms of malaria?"

The symptoms of malaria can vary depending on the type of malaria parasite, the severity of the infection, and the individual's immune status. Common symptoms include:

Early Symptoms: Fever, chills, headache, muscle aches, fatigue, nausea, vomiting...

Response quality is indistinguishable from the full-precision bf16 model.

Methodology

Warmup: 1 short generation discarded before timing
Prompts: 3–5 medical questions (malaria, diarrhea treatment, diabetes, preeclampsia, ORT)
Generation config: max_new_tokens=200, temperature=0.3, do_sample=True
Timing: torch.cuda.synchronize() before and after generation; wall-clock for CPU
Throughput: tokens generated ÷ wall-clock seconds
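
The steps above can be sketched as a small timing harness. This is not the script used for the reported numbers; `measure_throughput` is a hypothetical helper, and summing tokens and seconds across prompts before dividing is an assumption about how the per-prompt results were aggregated. The model call is abstracted behind a `generate` callable so the sketch runs without a GPU.

```python
import time
from typing import Callable, Sequence


def measure_throughput(
    generate: Callable[[str], int],
    prompts: Sequence[str],
    sync: Callable[[], None] = lambda: None,
    warmup_prompt: str = "Hello",
) -> float:
    """Return generated tokens per wall-clock second, summed over all prompts.

    generate(prompt) must run one generation and return the number of new tokens.
    sync() should block until queued device work is done (e.g. torch.cuda.synchronize).
    """
    generate(warmup_prompt)  # warmup run, excluded from timing
    total_tokens = 0
    total_seconds = 0.0
    for prompt in prompts:
        sync()  # keep earlier device work out of the timed region
        start = time.perf_counter()
        new_tokens = generate(prompt)
        sync()  # wait for generation to finish before stopping the clock
        total_seconds += time.perf_counter() - start
        total_tokens += new_tokens
    return total_tokens / total_seconds


if __name__ == "__main__":
    # Stand-in generator so the sketch runs anywhere: pretends to emit 200 tokens.
    def fake_generate(prompt: str) -> int:
        time.sleep(0.01)
        return 200

    print(f"{measure_throughput(fake_generate, ['q1', 'q2', 'q3']):.0f} tok/s")
```

For a GPU run, `sync` would be `torch.cuda.synchronize` and `generate` would wrap `model.generate`, returning the output length minus the prompt length. For the CPU baseline, the default no-op `sync` leaves plain wall-clock timing.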

Hardware Details

DGX Spark (GPU benchmarks)

NVIDIA GB10, compute capability 12.1
128 GB unified memory
CUDA 13.0, PyTorch 2.11, Transformers 5.6

Azure D4as_v5 (CPU baseline)

AMD EPYC 7763, 4 vCPUs
16 GB RAM
llama.cpp (Q4_K_M GGUF) via llama-server API

About

Built by Crane AI Labs.

Downloads last month: 42

Safetensors
Model size: 4B params
Tensor types: F32, BF16

Model tree for CraneAILabs/medgemma-1.5-4b-it-nf4-blackwell
Base model: google/medgemma-1.5-4b-it
Quantized (36): this model