# Gemma3-270m-VLM (Pi0.6)

A Vision-Language Model combining:

- **Vision Tower**: SigLIP from google/gemma-3-4b-pt (417M params)
- **Multi-modal Projector**: Randomly initialized (739K params)
- **Language Model**: google/gemma-3-270m (268M params)

**Total**: 686M parameters

## Architecture

- Vision hidden size: 1152
- LLM hidden size: 640
- Vocab size: 262,208 (includes 64 image tokens)
- Image token index: 262,144

## Usage

### With LLaMAFactory

```bash
llamafactory-cli train \
  --stage sft \
  --model_name_or_path models/gemma3-270m-vlm-with-weights \
  --template gemma3 \
  --dataset mllm_demo \
  --freeze_vision_tower True \
  --freeze_multi_modal_projector True \
  --bf16 True \
  ...
```

### With Transformers

```python
from transformers import AutoModelForImageTextToText, AutoProcessor

model = AutoModelForImageTextToText.from_pretrained(
    "models/gemma3-270m-vlm-with-weights",
    torch_dtype="bfloat16"
)
processor = AutoProcessor.from_pretrained("models/gemma3-270m-vlm-with-weights")
```

## Training Recommendations

1. **Freeze vision tower and projector initially** to train only the LLM
2. **Use small learning rate** (e.g., 5e-5 or 1e-4)
3. **Gradually unfreeze** projector after LLM converges
4. Vision tower can remain frozen if using pretrained vision encoder

## Notes

- Multi-modal projector is randomly initialized and needs training
- The model uses Gemma3 tokenizer with 262,144 base tokens + 64 image tokens
- Compatible with all Gemma3 features (sliding window attention, etc.)
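
The staged freezing described under Training Recommendations could be sketched in plain PyTorch roughly as follows. This is a minimal sketch, not the method LLaMAFactory uses internally: the stage names `llm_only` and `llm_and_projector` are hypothetical, and it assumes the parameter names of the loaded model contain the substrings `language_model`, `multi_modal_projector`, and `vision_tower`, which should be checked against `model.named_parameters()` before relying on it.

```python
# Hypothetical stage names; the README only describes the order informally.
# Assumption: component names appear as substrings of the parameter names.
STAGE_TRAINABLE = {
    "llm_only": ("language_model",),
    "llm_and_projector": ("language_model", "multi_modal_projector"),
}


def is_trainable(param_name: str, stage: str) -> bool:
    """Return True if the parameter should receive gradients in this stage."""
    return any(part in param_name for part in STAGE_TRAINABLE[stage])


def apply_stage(model, stage: str) -> None:
    """Set requires_grad on every parameter of a PyTorch module for a stage.

    The vision tower is never listed in STAGE_TRAINABLE, so it stays frozen.
    """
    for name, param in model.named_parameters():
        param.requires_grad_(is_trainable(name, stage))
```

Calling `apply_stage(model, "llm_only")` before the first training run and `apply_stage(model, "llm_and_projector")` once the LLM has converged would mirror steps 1 and 3 of the recommendations, with the vision tower left frozen throughout as in step 4.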