visual-narrator-llm / create_model_card.py
Ytgetahun's picture
feat: Visual Narrator 3B - Clean repository with professional benchmarks
d6e97b5
Raw
History Blame Contribute Delete
2.46 kB
#!/usr/bin/env python3
"""
Create and push model card to Hugging Face
"""
from huggingface_hub import ModelCard
def create_card():
print("πŸ“ CREATING MODEL CARD")
print("=" * 50)
repo_name = "Ytgetahun/visual-narrator-vlm"
model_card_content = """---
license: apache-2.0
tags:
- vision
- image-captioning
- blip
- adjectives
- descriptive
- visual-narrator
- multimodal
- audio-description
- accessibility
pipeline_tag: image-to-text
---
# 🎭 Visual Narrator VLM
## World's First Adjective-Dominant Visual Language Model
Transform **visual streaming** into **immersive audio theater** through adjective-dominant AI narration.
## πŸ† Performance Highlights
- **Average Adjectives**: 5.40 per description
- **Peak Performance**: 7 adjectives in single captions
- **Consistency**: 100% of captions β‰₯3 adjectives
- **Inference Speed**: ~400ms per image
## πŸš€ Quick Start
```python
from transformers import BlipProcessor, BlipForConditionalGeneration
import torch
from PIL import Image
processor = BlipProcessor.from_pretrained("Ytgetahun/visual-narrator-vlm")
model = BlipForConditionalGeneration.from_pretrained("Ytgetahun/visual-narrator-vlm").to("cuda")
image = Image.open("your_image.jpg")
inputs = processor(images=image, return_tensors="pt").to("cuda")
with torch.amp.autocast("cuda", enabled=True):
outputs = model.generate(**inputs, max_length=60)
caption = processor.decode(outputs[0], skip_special_tokens=True)
print(f"Caption: {caption}")
πŸ“Š Benchmark Results
Model Avg Adjectives
Visual Narrator VLM 5.40
Baseline BLIP 0.00
🎨 Quality Examples
"a luminous, vibrant, majestic, expressive, velvety, cinematic action shot"
"a vivid, atmospheric, serene, rugged, tranquil, gleaming indoor space"
"a vivid, atmospheric, serene, rugged, tranquil, textured portrait"
🌍 Applications
Audio description for visually impaired
Enhanced streaming content
Creative storytelling
Accessibility tools
πŸ”§ Technical Details
Base Model: BLIP Vision-Language
Training: 10,000 steps, 50 epochs
Dataset: 3,138 adjective-augmented samples
Precision: FP16 optimized
"From pixels to poetry, creating worlds with words" 🎭
"""
text
try:
card = ModelCard(model_card_content)
card.push_to_hub(repo_name)
print("βœ… Model card created successfully!")
return True
except Exception as e:
print(f"❌ Model card creation failed: {e}")
return False
if name == "main":
create_card()