---
license: apache-2.0
language:
- en
tags:
- video-to-text
- vision-language-model
- real-time
- accessibility
- video-narration
- cinematic-description
pipeline_tag: image-to-text
---

# Visual Narrator 3B - Real-Time Video Narration

## Matching Premium Quality at Real-Time Speed

A specialized 3B-parameter vision-language model that matches Claude's descriptive density (2.0 adjectives per description) while narrating video in real time, which cloud API models cannot do at their measured latencies.

---

## Performance Summary

| Capability | Visual Narrator | Competitors |
|------------|-----------------|-------------|
| Frame Processing | **2.4ms** | 2,300-3,500ms |
| Speed Advantage | 1x (baseline) | 977-1,473x slower |
| Descriptive Quality | 2.0 adj/desc | 2.0 adj/desc (parity) |
| Model Size | 3B parameters | 70-200B+ |
| Real-Time Capable | Yes | No |

---

## Two Benchmark Types (Important Distinction)

### Video-to-Text: Speed Benchmark

Measures how fast we process video frames into narration.

| Model | Latency | Real-Time? |
|-------|---------|------------|
| **Visual Narrator 3B** | **2.4ms** | Yes (400+ FPS) |
| GPT-4 Turbo | 2,344ms | No |
| Claude Opus | 3,536ms | No |

**What this shows:** At 2.4ms per frame, Visual Narrator keeps pace with live video. Cloud APIs with multi-second round trips cannot.
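The frame rate and slowdown figures follow directly from the latencies in the table. The sketch below reproduces that arithmetic. The function and variable names are illustrative, and the 30 FPS frame budget is an assumption about a typical video stream, not a number from the benchmark.

```python
# Latencies in milliseconds, copied from the speed benchmark table.
latency_ms = {
    "Visual Narrator 3B": 2.4,
    "GPT-4 Turbo": 2344.0,
    "Claude Opus": 3536.0,
}

# Assumption: a 30 FPS stream, which leaves about 33.3 ms to handle each frame.
FRAME_BUDGET_MS = 1000.0 / 30.0

BASELINE_MS = latency_ms["Visual Narrator 3B"]


def max_fps(ms: float) -> float:
    """Frames per second sustainable when each frame takes `ms` milliseconds."""
    return 1000.0 / ms


def slowdown(ms: float) -> float:
    """How many times slower a latency is than the Visual Narrator baseline."""
    return ms / BASELINE_MS


for name, ms in latency_ms.items():
    realtime = ms <= FRAME_BUDGET_MS
    print(f"{name}: {max_fps(ms):.1f} FPS, {slowdown(ms):.0f}x baseline, real-time={realtime}")
```

With the table's numbers, 2.4ms per frame works out to roughly 417 FPS, which is where the "400+ FPS" figure comes from.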

### Text-to-Text: Quality Benchmark

Measures descriptive language richness.

| Model | Adjectives/Description |
|-------|------------------------|
| Visual Narrator 3B | 2.0 |
| Claude Sonnet 4.5 | 2.0 |

**What this shows:** Our descriptive richness, measured as adjectives per description, matches a premium API model.

---

## Live API Demo Results (January 2026)

We built a live demo that races Visual Narrator against frontier models using **real API calls**, with no simulation and no cherry-picking.

| Model | Live Latency | vs Visual Narrator |
|-------|-------------|-------------------|
| **Visual Narrator** | **429ms** | — |
| Claude Sonnet 4 | 4,559ms | 10.6x slower |
| Gemini 2.0 Flash | 8,048ms | 18.8x slower |
| GPT-4o | 11,873ms | 27.7x slower |

**Try it yourself:** [Live Comparison Demo](https://huggingface.co/spaces/Ytgetahun/visual-narrator-comparison)

*Results come from API calls issued in parallel at the same moment. A WebSocket endpoint is available for verification.*
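To make the race setup concrete, here is a minimal sketch of timing several model calls that start together. It is not the demo's actual code: `call_model` is a hypothetical stand-in that sleeps instead of calling a real API, and all function names are assumptions made for illustration.

```python
import asyncio
import time


async def call_model(name: str, simulated_latency_s: float) -> str:
    """Hypothetical stand-in for a real API call; sleeps instead of using the network."""
    await asyncio.sleep(simulated_latency_s)
    return f"{name}: description"


async def timed_call(name: str, simulated_latency_s: float) -> tuple[str, float]:
    """Run one call and return (model name, elapsed milliseconds)."""
    start = time.perf_counter()
    await call_model(name, simulated_latency_s)
    return name, (time.perf_counter() - start) * 1000.0


async def race(models: dict[str, float]) -> dict[str, float]:
    """Start every call concurrently and collect each model's latency in ms."""
    results = await asyncio.gather(*(timed_call(n, d) for n, d in models.items()))
    return dict(results)


def slowdown_vs(latencies_ms: dict[str, float], baseline: str) -> dict[str, float]:
    """Express every other model's latency as a multiple of the baseline's."""
    base = latencies_ms[baseline]
    return {n: ms / base for n, ms in latencies_ms.items() if n != baseline}


if __name__ == "__main__":
    # Live latencies (ms) copied from the table above.
    table = {
        "Visual Narrator": 429.0,
        "Claude Sonnet 4": 4559.0,
        "Gemini 2.0 Flash": 8048.0,
        "GPT-4o": 11873.0,
    }
    for name, ratio in slowdown_vs(table, "Visual Narrator").items():
        print(f"{name}: {ratio:.1f}x slower")
```

Starting all calls with `asyncio.gather` means every model sees the same network conditions, so the ratios compare like with like.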

---

## The Unlock

We're not claiming to beat Claude on language quality.
We're claiming to **match its quality** while running **10x+ faster in real-world API conditions**.

That enables:
- Live broadcasting with real-time audio description
- Streaming accessibility at scale
- Real-time content creation
- Markets that API latency makes impossible

---

## Sample Output

**Input:** Video frame of urban night scene

**Visual Narrator Output:**
> "A sleek automobile navigates the urban landscape at night, neon lights reflecting off wet pavement as pedestrians move through crosswalks beneath glowing storefronts."

---

## Technical Details

```
Model: Visual Narrator 3B - Phase 10
Parameters: 3 billion
Architecture: Vision-Language Model (VLM)
Specialization: Real-time cinematic scene description
Inference: 2.4ms per frame on standard GPU hardware
Deployment: Local / Edge / Serverless
```

---

## Verified Metrics

| Metric | Value | Source |
|--------|-------|--------|
| Processing Speed | 2.4ms/frame | Benchmark suite |
| Semantic Accuracy | 71.6% | Evaluation protocol |
| Descriptive Quality | 2.0 adj/desc | Text-to-text benchmark |
| Real-time Capability | 400+ FPS | Calculated (1,000ms / 2.4ms ≈ 417 FPS) |

---

## Cost Comparison (At Scale)

| Provider | Cost for 1M Videos/Month |
|----------|--------------------------|
| Visual Narrator | $900 (fixed infrastructure) |
| GPT-4 Vision | ~$83,000 |
| Claude Vision | ~$252,000 |

**Result:** Roughly 92-280x cost advantage at scale.
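The cost ratio can be checked from the table's monthly figures. The sketch below uses only those numbers; the helper names are illustrative, and the approximate API costs are taken at face value.

```python
VIDEOS_PER_MONTH = 1_000_000

# Monthly cost in USD for 1M videos, copied from the table above.
monthly_cost_usd = {
    "Visual Narrator": 900,
    "GPT-4 Vision": 83_000,
    "Claude Vision": 252_000,
}


def cost_per_video(provider: str) -> float:
    """Average cost of one video at the 1M videos/month volume."""
    return monthly_cost_usd[provider] / VIDEOS_PER_MONTH


def cost_ratio(provider: str) -> float:
    """How many times more a provider costs than Visual Narrator."""
    return monthly_cost_usd[provider] / monthly_cost_usd["Visual Narrator"]


for provider in monthly_cost_usd:
    print(f"{provider}: ${cost_per_video(provider):.4f}/video, {cost_ratio(provider):.1f}x")
```

Because the Visual Narrator figure is a fixed infrastructure cost, its per-video cost falls as volume grows, while per-call API pricing scales with volume.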

---

## Quick Start

```bash
# Clone repository
git clone https://huggingface.co/Ytgetahun/visual-narrator-llm

# Run inference
python visual_narrator_api.py --input video.mp4
```

---

## Links

- [Model Repository](https://huggingface.co/Ytgetahun/visual-narrator-llm)
- [Technical Comparison & Documentation](https://huggingface.co/spaces/Ytgetahun/visual-narrator-comparison)

---

## Methodology Notes

**Speed Benchmark:**
- Visual Narrator: Local GPU inference (2.4ms)
- Competitors: Cloud API round-trip (includes network latency)
- This reflects real-world deployment conditions

**Quality Benchmark:**
- Both models given identical text prompts
- Measured adjective density per description
- Visual Narrator was tuned to match Claude's 2.0 adj/desc, the level we judged optimal (see Historical Context)
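As an illustration of the adjective-density metric, the sketch below averages adjective counts over a set of descriptions. It is not the benchmark's actual code: a small hand-written adjective list stands in for a real part-of-speech tagger, and the sample sentences are made up for the example.

```python
import re

# Assumption: a tiny hand-written adjective list replaces a real POS tagger.
ADJECTIVES = {"sleek", "wet", "glowing", "quiet", "narrow", "bright"}


def adjective_count(description: str) -> int:
    """Count the words in one description that appear in the adjective list."""
    words = re.findall(r"[a-z']+", description.lower())
    return sum(1 for word in words if word in ADJECTIVES)


def adjectives_per_description(descriptions: list[str]) -> float:
    """Mean adjective count across descriptions (the adj/desc metric)."""
    if not descriptions:
        return 0.0
    return sum(adjective_count(d) for d in descriptions) / len(descriptions)


# Hypothetical sample descriptions, not benchmark data.
samples = [
    "A sleek car crosses the wet street.",
    "A quiet alley runs past a bright storefront.",
]

print(f"{adjectives_per_description(samples):.1f} adj/desc")
```

A real evaluation would swap the word list for a proper tagger, since a fixed list misses unseen adjectives and cannot tell when a word is used as another part of speech.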

---

## Historical Context

Early benchmarks showed our model could reach 3.62 adj/desc (+81% vs Claude's 2.0).
We intentionally reduced this to 2.0 after determining that higher density produced "fluff" rather than quality.
Claude's output level was the correct target, not something to exceed.

---

## License

Apache 2.0 - See LICENSE file for details.

---

*Last updated: January 2026*

*Replaces previous model card with verified, accurate claims*