videoprism-base-f16r288-pt

PyTorch port of google/videoprism-base-f16r288 (Google DeepMind's VideoPrism). The original release ships JAX/Flax weights only; this repo hosts a self-contained PyTorch implementation whose outputs are numerically equivalent to the JAX reference (cosine similarity 1.000000). Source: https://github.com/Skovorp/torch_videoprism.

Usage

from transformers import AutoModel, AutoProcessor

model = AutoModel.from_pretrained("sposiboh/videoprism-base-f16r288-pt", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("sposiboh/videoprism-base-f16r288-pt", trust_remote_code=True)

# Process a video file (or a list of frames / numpy / torch tensor):
inputs = processor(videos="path/to/video.mp4", return_tensors="pt")
outputs = model(**inputs)
embedding = outputs.last_hidden_state    # token sequence of shape (B, T*N, model_dim): T frames, N patches per frame
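
To make the T*N token dimension concrete, here is a minimal sketch of the arithmetic. The frame count (16) and resolution (288) are read off the "f16r288" suffix in the model name; the patch size of 18 is an assumption based on the upstream VideoPrism configuration and should be checked against the loaded model's config. The helper name token_layout is hypothetical and not part of this repo.

```python
def token_layout(num_frames: int = 16, image_size: int = 288, patch_size: int = 18):
    """Return (T, N, T*N) for a video split into square, non-overlapping patches.

    Defaults are assumptions: 16 frames and 288 px come from the "f16r288"
    model name, patch_size=18 from the upstream VideoPrism configuration.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    patches_per_frame = patches_per_side ** 2  # N: spatial tokens per frame
    return num_frames, patches_per_frame, num_frames * patches_per_frame


t, n, seq_len = token_layout()
print(t, n, seq_len)  # 16 256 4096
```

Under these assumptions, last_hidden_state for a single video would contain 4096 tokens; averaging over that axis is one common way to obtain a single vector per video.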

Frames are sampled uniformly to the model's native frame count and resized bilinearly to image_size × image_size. Pixels are scaled to [0, 1].
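
The sampling and scaling steps described above can be sketched in plain Python. This is an illustration of one common convention (evenly spaced indices from the first to the last frame, inclusive), not the processor's actual code: the exact rounding rule it uses is not stated here, and the helper names uniform_frame_indices and scale_pixels are hypothetical. Bilinear resizing is left out to keep the example dependency-free.

```python
def uniform_frame_indices(num_available: int, num_target: int) -> list[int]:
    """Pick num_target frame indices spread evenly over num_available frames."""
    if num_available <= 0 or num_target <= 0:
        raise ValueError("both counts must be positive")
    if num_target == 1:
        return [num_available // 2]  # single frame: take the middle one
    # Evenly spaced positions from the first to the last frame, inclusive.
    # Clips shorter than num_target end up with repeated indices.
    step = (num_available - 1) / (num_target - 1)
    return [round(i * step) for i in range(num_target)]


def scale_pixels(values: list[int]) -> list[float]:
    """Map 8-bit pixel values (0..255) to floats in [0, 1]."""
    return [value / 255.0 for value in values]


# Example: reduce a 100-frame clip to 16 frames (assumed native frame count).
indices = uniform_frame_indices(num_available=100, num_target=16)
print(indices)
print(scale_pixels([0, 128, 255]))
```

The processor performs these steps internally, so this sketch is only useful for reasoning about which frames reach the model or for reproducing the preprocessing outside the transformers pipeline.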

Citation

If you use this model, please cite the original VideoPrism paper:

@inproceedings{zhao2024videoprism,
  title = {VideoPrism: A Foundational Visual Encoder for Video Understanding},
  author = {Zhao, Long and Gundavarapu, Nitesh B. and Yuan, Liangzhe and Zhou, Hao and Yan, Shen and Sun, Jennifer J. and Friedman, Luke and Qian, Rui and Weyand, Tobias and Zhao, Yue and Hornung, Rachel and Schroff, Florian and Yang, Ming-Hsuan and Ross, David A. and Wang, Huisheng and Adam, Hartwig and Sirotenko, Mikhail and Liu, Ting and Gong, Boqing},
  booktitle = {ICML},
  year = {2024},
}

License: Apache-2.0 (matches the upstream license).

Weights: Safetensors format, 0.1B parameters, F32 tensors.
