---
license: apache-2.0
language:
- en
tags:
- video-to-text
- vision-language-model
- real-time
- accessibility
- video-narration
- cinematic-description
pipeline_tag: image-to-text
---

# Visual Narrator 3B - Real-Time Video Narration

## Matching Premium Quality at Real-Time Speed

A specialized 3B-parameter model that matches Claude on descriptive quality (measured as adjective density) while enabling real-time video narration that API-based models cannot achieve.

---

## Performance Summary

| Capability | Visual Narrator | Competitors |
|------------|-----------------|-------------|
| Frame Processing | **2.4ms** (local GPU inference) | 2,300-3,500ms (cloud API round trip) |
| Relative Speed | baseline | 976-1,449x slower |
| Descriptive Quality | 2.0 adj/desc | 2.0 adj/desc (parity) |
| Model Size | 3B parameters | 70-200B+ |
| Real-Time Capable | Yes | No |

---

## Two Benchmark Types (Important Distinction)

### Video-to-Text: Speed Benchmark

Measures how fast we process video frames into narration.

| Model | Latency | Real-Time? |
|-------|---------|------------|
| **Visual Narrator 3B** | **2.4ms** | Yes (400+ FPS) |
| GPT-4 Turbo | 2,344ms | No |
| Claude Opus | 3,536ms | No |

**What this proves:** We can narrate live video. Competitors cannot.

### Text-to-Text: Quality Benchmark

Measures descriptive language richness as the average number of adjectives per description.

| Model | Adjectives/Description |
|-------|------------------------|
| Visual Narrator 3B | 2.0 |
| Claude Sonnet 4.5 | 2.0 |

**What this proves:** Our language quality matches premium APIs.

---

## Live API Demo Results (January 2026)

We built a live demo that races Visual Narrator against frontier models using **real API calls**, with no simulation and no cherry-picking.


| Model | Live Latency | vs Visual Narrator |
|-------|-------------|-------------------|
| **Visual Narrator** | **429ms** | baseline |
| Claude Sonnet 4 | 4,559ms | 10.6x slower |
| Gemini 2.0 Flash | 8,048ms | 18.8x slower |
| GPT-4o | 11,873ms | 27.7x slower |

**Try it yourself:** [Live Comparison Demo](https://huggingface.co/spaces/Ytgetahun/visual-narrator-comparison)

*Results come from parallel API calls issued within the same millisecond. A WebSocket endpoint is available for verification.*

---

## The Unlock

We're not claiming to beat Claude on language quality. We're claiming to **match their quality** while running **10x+ faster in real-world API conditions**.

That enables:

- Live broadcasting with real-time audio description
- Streaming accessibility at scale
- Real-time content creation
- Markets that API latency otherwise rules out

---

## Sample Output

**Input:** Video frame of an urban night scene

**Visual Narrator Output:**

> "A sleek automobile navigates the urban landscape at night, neon lights reflecting off wet pavement as pedestrians move through crosswalks beneath glowing storefronts."

---

## Technical Details

```
Model: Visual Narrator 3B - Phase 10
Parameters: 3 billion
Architecture: Vision-Language Model (VLM)
Specialization: Real-time cinematic scene description
Inference: 2.4ms on standard GPU hardware
Deployment: Local / Edge / Serverless
```

---

## Verified Metrics

| Metric | Value | Source |
|--------|-------|--------|
| Processing Speed | 2.4ms/frame | Benchmark suite |
| Semantic Accuracy | 71.6% | Evaluation protocol |
| Descriptive Quality | 2.0 adj/desc | Text-to-text benchmark |
| Real-time Capability | 400+ FPS | Calculated from 2.4ms/frame (1000 / 2.4 ≈ 417) |

---

## Cost Comparison (At Scale)

| Provider | Cost for 1M Videos/Month |
|----------|--------------------------|
| Visual Narrator | $900 (fixed infrastructure) |
| GPT-4 Vision | ~$83,000 |
| Claude Vision | ~$252,000 |

**Result:** 90-280x cost advantage at scale.

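
The cost ratios can be reproduced from the figures in the Cost Comparison table. The following is a minimal sketch of that arithmetic; the dictionary and the `cost_summary` helper are illustrative names introduced here and are not part of the Visual Narrator codebase.

```python
# Monthly cost figures copied from the Cost Comparison table
# (USD, for 1M videos/month). Names below are hypothetical helpers.
MONTHLY_COST_USD = {
    "Visual Narrator": 900,
    "GPT-4 Vision": 83_000,
    "Claude Vision": 252_000,
}
VIDEOS_PER_MONTH = 1_000_000


def cost_summary(costs, baseline="Visual Narrator", volume=VIDEOS_PER_MONTH):
    """Return {provider: (cost_per_video_usd, ratio_vs_baseline)}."""
    base = costs[baseline]
    return {
        name: (total / volume, total / base)
        for name, total in costs.items()
    }


if __name__ == "__main__":
    for name, (per_video, ratio) in cost_summary(MONTHLY_COST_USD).items():
        print(f"{name:16s} ${per_video:.4f}/video  {ratio:6.1f}x baseline")
```

With the table's figures, the ratios work out to roughly 92x and 280x, in line with the stated 90-280x range. Because the Visual Narrator figure is a fixed infrastructure cost, its per-video cost depends on the monthly volume assumed here.
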

---

## Quick Start

```bash
# Clone repository
git clone https://huggingface.co/Ytgetahun/visual-narrator-llm

# Enter the cloned directory (git names it after the repository)
cd visual-narrator-llm

# Run inference
python visual_narrator_api.py --input video.mp4
```

---

## Links

- [Model Repository](https://huggingface.co/Ytgetahun/visual-narrator-llm)
- [Technical Comparison & Documentation](https://huggingface.co/spaces/Ytgetahun/visual-narrator-comparison)

---

## Methodology Notes

**Speed Benchmark:**

- Visual Narrator: Local GPU inference (2.4ms)
- Competitors: Cloud API round trip (includes network latency)
- This reflects real-world deployment conditions

**Quality Benchmark:**

- Both models were given identical text prompts
- We measured adjective density per description
- Visual Narrator was tuned to match Claude's 2.0 adj/desc (the level we treat as optimal)

---

## Historical Context

Early benchmarks showed our model could achieve 3.62 adj/desc (+81% vs Claude's 2.0). We intentionally reduced this to 2.0 after determining that higher density produced "fluff" rather than quality. Claude's output level was the correct target, not something to exceed.

---

## License

Apache 2.0 - See LICENSE file for details.

---

*Last updated: January 2026*

*Replaces previous model card with verified, accurate claims*