Molmo-7B-D-0924 4Bit Quantization
Model size (disk): 30GB original → 6.2GB
VRAM usage: Loaded Model ~7GB, inference up to ~10GB (4K image input)
This checkpoint quantizes most weights to NF4 while keeping key modules in FP16 to avoid degrading output quality.
Keeping those modules in FP16 costs relatively little VRAM compared to full 4-bit quantization and aims for a good balance between quality and memory use.
The model loads significantly faster than the original, making it suitable for serverless hosting.
It fits into a 12GB GPU for serving and allows for batching on a T4 (16GB).
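As a quick sanity check on the numbers above, here is a small sketch that computes the disk savings and the VRAM headroom left on the two GPU sizes mentioned. The helper names are made up for illustration, and the inputs are just the rounded figures from this card, so treat the results as rough estimates.

```python
from typing import Tuple

# Figures quoted in this card (rounded, in GB).
ORIGINAL_DISK_GB = 30.0
QUANTIZED_DISK_GB = 6.2
PEAK_INFERENCE_VRAM_GB = 10.0


def compression_stats(original_gb: float, quantized_gb: float) -> Tuple[float, float]:
    """Return (compression ratio, percent of disk space saved)."""
    ratio = original_gb / quantized_gb
    saved_pct = (1.0 - quantized_gb / original_gb) * 100.0
    return ratio, saved_pct


def headroom_gb(gpu_gb: float, peak_gb: float) -> float:
    """Return the VRAM left over after the quoted peak inference usage."""
    return gpu_gb - peak_gb


ratio, saved_pct = compression_stats(ORIGINAL_DISK_GB, QUANTIZED_DISK_GB)
print(f"Disk: {ratio:.1f}x smaller, {saved_pct:.0f}% saved")
for gpu_gb in (12.0, 16.0):
    free = headroom_gb(gpu_gb, PEAK_INFERENCE_VRAM_GB)
    print(f"{gpu_gb:.0f}GB GPU: about {free:.0f}GB headroom at peak")
```

The extra headroom on a 16GB card is what leaves room for batching.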
How to run
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig
from PIL import Image
import requests
import torch
# Can also be a local path if you have already cloned the Hugging Face repo
MODEL_PATH = "Scoolar/Molmo-7B-D-0924-NF4"
# load the processor
processor = AutoProcessor.from_pretrained(
    MODEL_PATH,
    trust_remote_code=True,
    device_map='auto'
)
# load the model
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    trust_remote_code=True,
    device_map='auto',
)
# process the image and text
inputs = processor.process(
    images=[Image.open(requests.get("https://picsum.photos/id/237/536/354", stream=True).raw)],
    text="Describe this image."
)
# move inputs to the correct device and make a batch of size 1
inputs = {k: v.to(model.device).unsqueeze(0) for k, v in inputs.items()}
# Compute is done in float16, while most weights are NF4
with torch.autocast(device_type="cuda", enabled=True, dtype=torch.float16):
    output = model.generate_from_batch(
        inputs,
        GenerationConfig(max_new_tokens=200, stop_strings="<|endoftext|>"),
        tokenizer=processor.tokenizer
    )
# only get generated tokens; decode them to text
generated_tokens = output[0, inputs['input_ids'].size(1):]
generated_text = processor.tokenizer.decode(generated_tokens, skip_special_tokens=True)
# print the generated text
print(generated_text)
How was the model converted to NF4?
I decided to write this down since I would have been happy to have something like this, so enjoy :)
To convert the model, load the weights with the desired data types and quantization settings,
then save them again. This produces safetensors files along with some configuration files.
All missing files can be copied from the original model repository; you only need to remove the file path in config.json.
The applied quantization strategy is also recorded in config.json (quantization_config).
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch
# Can also be a local path if you have already cloned the Hugging Face repo
MODEL_PATH = "allenai/Molmo-7B-D-0924"
YOUR_OUTPUT_PATH = "enter_local_model_output_path"
DEFAULT_DTYPE = torch.float16
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=DEFAULT_DTYPE,
    llm_int8_skip_modules=[
        # Module names can also be relative like "ff_norm" which would apply to all such layers
        "model.vision_backbone", "model.transformer.ff_out", "model.transformer.ln_f"
    ]
)
# load the model
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    trust_remote_code=True,
    device_map='auto',
    torch_dtype=DEFAULT_DTYPE,
    quantization_config=nf4_config,
)
# Save model
model.save_pretrained(
    save_directory=YOUR_OUTPUT_PATH,
    safe_serialization=True,
    # Set a maximum shard size if you don't like the default
    max_shard_size="4GB"
)
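To check which quantization settings ended up in a saved checkpoint, you can read the quantization_config section of config.json. The following is a minimal sketch using only the standard library. The helper name read_quantization_config is made up for illustration, and the demo writes a small stand-in config.json that mirrors the settings from the conversion script, so it runs without downloading anything. A real config.json contains many more keys, and the exact key set may vary between transformers versions.

```python
import json
import tempfile
from pathlib import Path


def read_quantization_config(model_dir):
    """Return the quantization_config section of config.json, or {} if absent."""
    config_path = Path(model_dir) / "config.json"
    with config_path.open(encoding="utf-8") as f:
        config = json.load(f)
    return config.get("quantization_config", {})


# Stand-in config.json for demonstration only; values copied from the
# BitsAndBytesConfig used in the conversion script.
sample_config = {
    "quantization_config": {
        "load_in_4bit": True,
        "bnb_4bit_quant_type": "nf4",
        "llm_int8_skip_modules": [
            "model.vision_backbone",
            "model.transformer.ff_out",
            "model.transformer.ln_f",
        ],
    }
}

with tempfile.TemporaryDirectory() as tmp_dir:
    (Path(tmp_dir) / "config.json").write_text(json.dumps(sample_config), encoding="utf-8")
    quant = read_quantization_config(tmp_dir)

print("quant type:", quant["bnb_4bit_quant_type"])
for module_name in quant["llm_int8_skip_modules"]:
    print("kept unquantized:", module_name)
```

For a real checkpoint, pass the directory you gave to save_pretrained instead of the temporary directory.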
Details
Inspired by observations from SeanScripts/Molmo-72B-0924-nf4, I experimented with keeping certain modules in FP16, particularly the vision_backbone. The vision backbone has relatively few parameters but deteriorates significantly in NF4. Additionally, I found that the transformer output layers are crucial, whereas other layer normalization layers within the transformer stack had no significant impact.
Layers can be easily inspected in model.safetensors.index.json or analyzed in more detail in modeling_molmo.py.
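If you want a quick overview instead of scrolling through the index file, a sketch like the one below groups tensor names by their dotted prefix. It relies on the "weight_map" key that sharded safetensors index files use to map tensor names to shard files. The helper name count_tensors_by_prefix and the tensor names in the demo are made up for illustration and are not the real Molmo tensor names.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def count_tensors_by_prefix(index_path, depth=2):
    """Count tensors per dotted-name prefix in a safetensors index file."""
    with open(index_path, encoding="utf-8") as f:
        weight_map = json.load(f)["weight_map"]
    counts = Counter()
    for tensor_name in weight_map:
        prefix = ".".join(tensor_name.split(".")[:depth])
        counts[prefix] += 1
    return counts


# Stand-in index for demonstration only; the tensor names are illustrative.
sample_index = {
    "metadata": {},
    "weight_map": {
        "model.vision_backbone.example_layer.weight": "model-00001-of-00002.safetensors",
        "model.transformer.blocks.0.ff_norm.weight": "model-00001-of-00002.safetensors",
        "model.transformer.ff_out.weight": "model-00002-of-00002.safetensors",
        "model.transformer.ln_f.weight": "model-00002-of-00002.safetensors",
    },
}

with tempfile.TemporaryDirectory() as tmp_dir:
    index_path = Path(tmp_dir) / "model.safetensors.index.json"
    index_path.write_text(json.dumps(sample_index), encoding="utf-8")
    counts = count_tensors_by_prefix(index_path, depth=2)

for prefix, n in sorted(counts.items()):
    print(f"{prefix}: {n} tensors")
```

Increasing depth gives a finer breakdown, which helps when deciding which module names to pass to llm_int8_skip_modules.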