Docker error: The size of tensor a (1280) must match the size of tensor b (2560) at non-singleton dimension 1
Hi, I am having trouble running ibm-granite/granite-4.0-3b-vision with Docker.
vllm version: 0.19.0
command:
docker run --gpus all -p 9999:8000 --ipc=host vllm-granite-vision --model ibm-granite/granite-4.0-3b-vision --trust-remote-code --max-model-len 16384 --tensor-parallel-size 2 --host 0.0.0.0 --port 8000 --hf-overrides '{"adapter_path": "ibm-granite/granite-4.0-3b-vision"}' --gpu-memory-utilization 0.7
error:
(Worker_TP1 pid=83) ERROR 04-07 07:26:04 [multiproc_executor.py:857] loaded_weights = model.load_weights(self.get_all_weights(model_config, model))
(Worker_TP1 pid=83) ERROR 04-07 07:26:04 [multiproc_executor.py:857] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=83) ERROR 04-07 07:26:04 [multiproc_executor.py:857] File "/app/granite4_vision.py", line 948, in load_weights
(Worker_TP1 pid=83) ERROR 04-07 07:26:04 [multiproc_executor.py:857] self._apply_adapter()
(Worker_TP1 pid=83) ERROR 04-07 07:26:04 [multiproc_executor.py:857] File "/app/granite4_vision.py", line 940, in _apply_adapter
(Worker_TP1 pid=83) ERROR 04-07 07:26:04 [multiproc_executor.py:857] n = self._merge_lora_deltas(adapter_config, adapter_weights)
(Worker_TP1 pid=83) ERROR 04-07 07:26:04 [multiproc_executor.py:857] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=83) ERROR 04-07 07:26:04 [multiproc_executor.py:857] File "/app/granite4_vision.py", line 919, in _merge_lora_deltas
(Worker_TP1 pid=83) ERROR 04-07 07:26:04 [multiproc_executor.py:857] if _add_delta(module_key + ".weight", delta):
(Worker_TP1 pid=83) ERROR 04-07 07:26:04 [multiproc_executor.py:857] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=83) ERROR 04-07 07:26:04 [multiproc_executor.py:857] File "/app/granite4_vision.py", line 906, in _add_delta
(Worker_TP1 pid=83) ERROR 04-07 07:26:04 [multiproc_executor.py:857] param.data = (param.data.float() + delta.to(param.device)).to(param.dtype)
(Worker_TP1 pid=83) ERROR 04-07 07:26:04 [multiproc_executor.py:857] ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~
(Worker_TP1 pid=83) ERROR 04-07 07:26:04 [multiproc_executor.py:857] RuntimeError: The size of tensor a (1280) must match the size of tensor b (2560) at non-singleton dimension 1
Hi, there might be an issue with how tensor parallelism is handled in the full-merge flow. I'm going to check it.
In the meantime, you can run inference on a single GPU by removing the --tensor-parallel-size 2 argument from the command.
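For anyone reading the traceback: one plausible reading (an assumption, not something confirmed in this thread) is that with --tensor-parallel-size 2 each worker holds only half of the weight's columns (1280 of 2560), while the LoRA delta being merged is still full-size. The sketch below illustrates that idea with plain nested lists so it runs without PyTorch. The names shard_columns and add_delta are hypothetical and are not the functions in granite4_vision.py, and whether a real layer is sharded along columns or rows depends on the layer type.

```python
# Illustrative sketch only; not the code from granite4_vision.py.
# Matrices are plain nested lists so the example runs without PyTorch.

def shard_columns(matrix, rank, world_size):
    """Return the column slice of `matrix` owned by tensor-parallel `rank`."""
    n_cols = len(matrix[0])
    if n_cols % world_size != 0:
        raise ValueError("column count must divide evenly across ranks")
    width = n_cols // world_size
    start = rank * width
    return [row[start:start + width] for row in matrix]


def add_delta(param, delta):
    """Element-wise param + delta; both shapes must match exactly."""
    if len(param) != len(delta) or len(param[0]) != len(delta[0]):
        raise ValueError(
            f"shape mismatch: param has {len(param[0])} columns, "
            f"delta has {len(delta[0])}"
        )
    return [[p + d for p, d in zip(p_row, d_row)]
            for p_row, d_row in zip(param, delta)]


# Toy sizes standing in for 2560 (full) and 1280 (per-rank) columns.
world_size = 2
full_delta = [[1.0, 2.0, 3.0, 4.0],
              [5.0, 6.0, 7.0, 8.0]]

for rank in range(world_size):
    # This rank's shard of the weight: half of the columns.
    local_param = [[0.0, 0.0], [0.0, 0.0]]
    # Slice the full-size delta down to this rank's columns before merging.
    local_delta = shard_columns(full_delta, rank, world_size)
    merged = add_delta(local_param, local_delta)
    print(rank, merged)
```

Passing full_delta straight to add_delta with a per-rank shard raises the shape mismatch, which mirrors the 1280 vs 2560 error in the log. With a single GPU there is no sharding, so the full-size delta matches the full-size weight.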
I pushed a fixed version; it should now work with multiple GPUs.