---
language: en
license: apache-2.0
tags:
- vision-language
- multimodal
- vlm
- nanbeige
- siglip
- image-text-to-text
datasets:
- liuhaotian/LLaVA-CC3M-Pretrain-595K
- liuhaotian/LLaVA-Instruct-150K
base_model:
- Nanbeige/Nanbeige4.1-3B
- google/siglip-so400m-patch14-384
pipeline_tag: image-text-to-text
library_name: transformers
---

# Nanbeige4.1-VLM

This is the full vision-language model produced by Stage 2 instruction fine-tuning on LLaVA-Instruct-150K.
The LoRA weights are already merged into the base model, so it can be loaded for inference without a separate adapter.
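
For readers unfamiliar with merging, the standard LoRA formulation folds the low-rank update into each adapted weight matrix. The rank r = 64 comes from the training table; the scaling factor α and the set of adapted layers are not stated in this card, so the expression below is the generic form rather than a description of this checkpoint's exact settings:

$$
W_{\text{merged}} = W_0 + \frac{\alpha}{r}\, B A,
\qquad B \in \mathbb{R}^{d_{\text{out}} \times r},\quad A \in \mathbb{R}^{r \times d_{\text{in}}}
$$

After this step the adapter matrices A and B are no longer needed at inference time.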

## Architecture

```
Image → SigLIP so400m → AvgPool(729→196) → MLP Projector → Nanbeige4.1-3B → Text
```
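
The pooling and projection step in the diagram can be sketched in PyTorch as follows. This is an illustrative reconstruction, not the repository's code: the class name `PooledProjector`, the use of adaptive average pooling, the two-layer GELU MLP, and the `llm_dim` value are all assumptions. What is fixed by the diagram is the token count: a 384-pixel image with 14-pixel patches gives a 27 × 27 grid (729 tokens), which is pooled to 14 × 14 (196 tokens).

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class PooledProjector(nn.Module):
    """Average-pool vision patch tokens, then map them to the LLM width."""

    def __init__(self, vision_dim: int, llm_dim: int, out_grid: int = 14):
        super().__init__()
        self.out_grid = out_grid
        # Two-layer MLP with GELU: an assumed design, not confirmed by the card.
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (batch, num_patches, vision_dim), a flattened square grid
        b, n, c = patches.shape
        side = math.isqrt(n)
        if side * side != n:
            raise ValueError("expected a square patch grid")
        grid = patches.transpose(1, 2).reshape(b, c, side, side)
        # 27 is not a multiple of 14, so adaptive pooling is assumed here.
        pooled = F.adaptive_avg_pool2d(grid, self.out_grid)
        tokens = pooled.flatten(2).transpose(1, 2)  # (b, out_grid**2, c)
        return self.mlp(tokens)


# 1152 is the SigLIP so400m hidden size; 2048 is a placeholder for the
# LLM hidden size and should be read from the real model config.
projector = PooledProjector(vision_dim=1152, llm_dim=2048)
out = projector(torch.randn(2, 729, 1152))
print(out.shape)  # torch.Size([2, 196, 2048])
```

Pooling before projection cuts the number of image tokens the language model has to attend over from 729 to 196.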

## Usage

```python
from transformers import AutoModel, AutoTokenizer
from PIL import Image

# trust_remote_code=True loads the custom model code bundled with the repo.
model = AutoModel.from_pretrained(
    "SkyAsl/Nanbeige4.1-VLM",
    trust_remote_code=True,
)
model.to("cuda")

tokenizer = AutoTokenizer.from_pretrained(
    "SkyAsl/Nanbeige4.1-VLM",
    trust_remote_code=True,
)
model.set_tokenizer(tokenizer)

# Convert to RGB so grayscale or RGBA files become 3-channel input.
image = Image.open("photo.jpg").convert("RGB")
result = model.describe(image, prompt="What do you see in this image?")
print(result)
```

## Training Details

| Setting | Stage 1 | Stage 2 |
|---|---|---|
| Dataset | LLaVA-CC3M-Pretrain-595K | LLaVA-Instruct-150K |
| Trainable parameters | Projector only | Projector + LoRA (r=64) |
| Learning rate | 2e-3 | 2e-5 |
| Hardware | A100 80GB | A100 80GB |
| Duration | ~6 hours | ~5 hours |
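
The "Trainable" row amounts to freezing everything except a named subset of parameters. A minimal sketch of that idea is shown below, using a toy module in place of the real model. The helper `set_trainable`, the `ToyVLM` class, its attribute names, and the choice of AdamW are hypothetical and only illustrate the pattern; the learning rate is the Stage 1 value from the table.

```python
from typing import Sequence

import torch
import torch.nn as nn


def set_trainable(model: nn.Module, keywords: Sequence[str]) -> int:
    """Freeze every parameter except those whose name contains a keyword.

    Returns the number of trainable scalar parameters.
    """
    trainable = 0
    for name, param in model.named_parameters():
        param.requires_grad = any(k in name for k in keywords)
        if param.requires_grad:
            trainable += param.numel()
    return trainable


class ToyVLM(nn.Module):
    """Toy stand-in for the real model; attribute names are hypothetical."""

    def __init__(self):
        super().__init__()
        self.vision_tower = nn.Linear(8, 8)
        self.projector = nn.Linear(8, 4)
        self.language_model = nn.Linear(4, 4)


toy = ToyVLM()

# Stage 1: only the projector is updated.
stage1_count = set_trainable(toy, ["projector"])

# Pass only the unfrozen parameters to the optimizer (AdamW is an assumption).
optimizer = torch.optim.AdamW(
    [p for p in toy.parameters() if p.requires_grad], lr=2e-3
)
print(stage1_count)  # 36: a 4x8 weight plus 4 biases
```

For Stage 2, the same helper could be called with `["projector", "lora_"]` and a learning rate of 2e-5, assuming the adapter parameters carry a `lora_` prefix as they do in the PEFT library.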

## Related Repos

- Stage 1 base: [SkyAsl/Nanbeige4.1-VLM-Base](https://huggingface.co/SkyAsl/Nanbeige4.1-VLM-Base)
- LoRA only: [SkyAsl/Nanbeige4.1-VLM-LoRA](https://huggingface.co/SkyAsl/Nanbeige4.1-VLM-LoRA)