---
language: en
license: apache-2.0
tags:
- vision-language
- multimodal
- vlm
- nanbeige
- siglip
- image-text-to-text
datasets:
- liuhaotian/LLaVA-CC3M-Pretrain-595K
- liuhaotian/LLaVA-Instruct-150K
base_model:
- Nanbeige/Nanbeige4.1-3B
- google/siglip-so400m-patch14-384
pipeline_tag: image-text-to-text
library_name: transformers
---

# Nanbeige4.1-VLM

Full vision-language model after Stage 2 instruction fine-tuning on LLaVA-Instruct-150K. LoRA weights have been merged into the base model for easy inference.

## Architecture

```
Image → SigLIP so400m → AvgPool(729→196) → MLP Projector → Nanbeige4.1-3B → Text
```

## Usage

```python
from transformers import AutoModel, AutoTokenizer
from PIL import Image

model = AutoModel.from_pretrained(
    "SkyAsl/Nanbeige4.1-VLM",
    trust_remote_code=True,
)
model.to("cuda")

tokenizer = AutoTokenizer.from_pretrained(
    "SkyAsl/Nanbeige4.1-VLM",
    trust_remote_code=True,
)
model.set_tokenizer(tokenizer)

image = Image.open("photo.jpg")
result = model.describe(image, prompt="What do you see in this image?")
print(result)
```

## Training Details

| | Stage 1 | Stage 2 |
|---|---|---|
| Dataset | LLaVA-CC3M-595K | LLaVA-Instruct-150K |
| Trainable | Projector only | Projector + LoRA (r=64) |
| LR | 2e-3 | 2e-5 |
| Hardware | A100 80GB | A100 80GB |
| Duration | ~6 hours | ~5 hours |

## Related Repos

- Stage 1 base: [SkyAsl/Nanbeige4.1-VLM-Base](https://huggingface.co/SkyAsl/Nanbeige4.1-VLM-Base)
- LoRA only: [SkyAsl/Nanbeige4.1-VLM-LoRA](https://huggingface.co/SkyAsl/Nanbeige4.1-VLM-LoRA)
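
The `AvgPool(729→196)` step in the architecture diagram reduces the 27×27 grid of SigLIP patch tokens (729 = 27²) to a 14×14 grid (196 = 14²) before projection. The exact pooling layer is not stated in this card, so the following is only a minimal, dependency-free sketch that assumes adaptive-style average pooling with overlapping windows. The helper names `adaptive_windows` and `pool_tokens`, and the 4-feature toy tokens, are hypothetical and chosen for illustration.

```python
def adaptive_windows(in_size, out_size):
    """Return (start, end) index pairs covering in_size positions with out_size windows.

    Assumption: start = floor(i * in / out), end = ceil((i + 1) * in / out),
    computed with integer arithmetic to avoid floating-point rounding.
    """
    return [
        ((i * in_size) // out_size, ((i + 1) * in_size + out_size - 1) // out_size)
        for i in range(out_size)
    ]


def pool_tokens(tokens, in_grid=27, out_grid=14):
    """Average-pool a row-major list of in_grid*in_grid token vectors to out_grid*out_grid."""
    assert len(tokens) == in_grid * in_grid
    dim = len(tokens[0])
    windows = adaptive_windows(in_grid, out_grid)
    pooled = []
    for r0, r1 in windows:          # row window
        for c0, c1 in windows:      # column window
            cell = [tokens[r * in_grid + c] for r in range(r0, r1) for c in range(c0, c1)]
            pooled.append([sum(v[d] for v in cell) / len(cell) for d in range(dim)])
    return pooled


# Toy input: 729 tokens with 4 features each (a real encoder's feature width is much larger).
tokens = [[float(i)] * 4 for i in range(729)]
pooled = pool_tokens(tokens)
print(len(tokens), "->", len(pooled))  # prints "729 -> 196"
```

Because 27 is not a multiple of 14, each window in this sketch spans 2 or 3 patches per side and neighbouring windows overlap by one patch. The merged model may implement the reduction differently.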