VideoPrism (PyTorch port)
All four publicly released VideoPrism variants, ported to PyTorch. Source: github.com/Skovorp/torch_videoprism
How to use sposiboh/videoprism-base-f16r288-pt with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("feature-extraction", model="sposiboh/videoprism-base-f16r288-pt", trust_remote_code=True)

# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("sposiboh/videoprism-base-f16r288-pt", trust_remote_code=True, dtype="auto")

How to use sposiboh/videoprism-base-f16r288-pt with VideoPrism:
# Install from https://github.com/google-deepmind/videoprism
import jax
from videoprism import models as vp
flax_model = vp.get_model("sposiboh/videoprism-base-f16r288-pt")
loaded_state = vp.load_pretrained_weights("sposiboh/videoprism-base-f16r288-pt")
@jax.jit
def forward_fn(inputs, train=False):
    return flax_model.apply(loaded_state, inputs, train=train)

PyTorch port of google/videoprism-base-f16r288 (Google DeepMind's VideoPrism). The original release ships JAX/Flax weights only; this repo hosts a self-contained PyTorch implementation that produces numerically equivalent outputs (cosine similarity of 1.000000 against the JAX reference). Source: https://github.com/Skovorp/torch_videoprism.
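One way to check a parity claim like the one above is to compare the two outputs directly. The sketch below is not the port's own test code; `parity_report` is a hypothetical helper, and it assumes both outputs have already been converted to NumPy arrays of the same shape.

```python
import numpy as np


def parity_report(candidate, reference):
    """Compare two embedding arrays of identical shape.

    Returns the cosine similarity of the flattened arrays and the largest
    element-wise absolute difference. Hypothetical helper for illustration.
    """
    a = np.asarray(candidate, dtype=np.float64).ravel()
    b = np.asarray(reference, dtype=np.float64).ravel()
    if a.shape != b.shape:
        raise ValueError("candidate and reference must have the same number of elements")
    cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    max_abs_diff = float(np.max(np.abs(a - b)))
    return {"cosine": cosine, "max_abs_diff": max_abs_diff}


# Usage with stand-in data: a reference array and a slightly perturbed copy.
rng = np.random.default_rng(0)
reference = rng.standard_normal((1, 8, 4))
candidate = reference + 1e-7 * rng.standard_normal(reference.shape)
report = parity_report(candidate, reference)
```

Cosine similarity alone ignores differences in scale, so reporting the maximum absolute difference alongside it gives a stricter picture.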
from transformers import AutoModel, AutoProcessor
model = AutoModel.from_pretrained("sposiboh/videoprism-base-f16r288-pt", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("sposiboh/videoprism-base-f16r288-pt", trust_remote_code=True)
# Process a video file (or a list of frames / numpy / torch tensor):
inputs = processor(videos="path/to/video.mp4", return_tensors="pt")
outputs = model(**inputs)
embedding = outputs.last_hidden_state # shape: (B, T*N, model_dim) — token sequence
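Many downstream tasks need a single vector per video or per frame rather than the full token sequence. The sketch below shows one common reduction, mean pooling. It assumes the `T*N` axis is ordered frame by frame (all `N` tokens of frame 0, then frame 1, and so on), which the shape comment suggests but the source does not state. `pool_tokens` is a hypothetical helper, and the array sizes are small stand-ins, not the model's real dimensions. A PyTorch output can be converted first with `outputs.last_hidden_state.detach().cpu().numpy()`.

```python
import numpy as np


def pool_tokens(tokens, num_frames):
    """Reduce a (B, T*N, D) token sequence by mean pooling.

    Assumes frame-major token order. Returns a (B, T, D) per-frame array
    and a (B, D) per-video array.
    """
    batch, num_tokens, dim = tokens.shape
    if num_tokens % num_frames != 0:
        raise ValueError("token count must be divisible by num_frames")
    tokens_per_frame = num_tokens // num_frames
    per_token = tokens.reshape(batch, num_frames, tokens_per_frame, dim)
    per_frame = per_token.mean(axis=2)  # average the N spatial tokens of each frame
    per_video = tokens.mean(axis=1)     # average every token in the clip
    return per_frame, per_video


# Stand-in tensor: batch of 2, 3 frames x 4 tokens per frame, 5 channels.
dummy = np.arange(2 * 12 * 5, dtype=np.float64).reshape(2, 12, 5)
per_frame, per_video = pool_tokens(dummy, num_frames=3)
```

Mean pooling is only one option; attention pooling or a learned head may work better for a specific task.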
Frames are sampled uniformly to the model's native frame count and resized
bilinearly to image_size × image_size. Pixels are scaled to [0, 1].
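The preprocessing described above can be sketched in plain NumPy. This is an illustration, not the processor's actual code: the defaults of 16 frames and 288 pixels are inferred from the `f16r288` suffix in the model name, the half-pixel sampling convention is an assumption, and all function names are hypothetical. The real processor may differ in rounding, antialiasing, or channel layout.

```python
import numpy as np


def sample_frame_indices(num_video_frames, num_model_frames):
    """Pick num_model_frames indices spread evenly over the clip."""
    return np.linspace(0, num_video_frames - 1, num_model_frames).round().astype(int)


def resize_bilinear(frames, size):
    """Bilinearly resize float (T, H, W, C) frames to (T, size, size, C)."""
    _, h, w, _ = frames.shape
    # Source coordinate of each output pixel centre (half-pixel convention).
    ys = np.clip((np.arange(size) + 0.5) * h / size - 0.5, 0, h - 1)
    xs = np.clip((np.arange(size) + 0.5) * w / size - 0.5, 0, w - 1)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[None, :, None, None]  # vertical blend weights
    wx = (xs - x0)[None, None, :, None]  # horizontal blend weights
    top = frames[:, y0][:, :, x0] * (1 - wx) + frames[:, y0][:, :, x1] * wx
    bottom = frames[:, y1][:, :, x0] * (1 - wx) + frames[:, y1][:, :, x1] * wx
    return top * (1 - wy) + bottom * wy


def preprocess(video, num_frames=16, image_size=288):
    """uint8 (T, H, W, C) video -> float32 (num_frames, size, size, C) in [0, 1]."""
    idx = sample_frame_indices(video.shape[0], num_frames)
    frames = video[idx].astype(np.float32) / 255.0  # scale pixels to [0, 1]
    return resize_bilinear(frames, image_size).astype(np.float32)


# Usage with a random stand-in clip: 40 frames of 36x48 RGB.
rng = np.random.default_rng(0)
clip = rng.integers(0, 256, size=(40, 36, 48, 3), dtype=np.uint8)
pixels = preprocess(clip)
```

In practice the bundled processor should be preferred; a sketch like this is mainly useful for checking what the model expects as input.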
If you use this model, please cite the original VideoPrism paper:
@inproceedings{zhao2024videoprism,
  title = {VideoPrism: A Foundational Visual Encoder for Video Understanding},
  author = {Zhao, Long and Gundavarapu, Nitesh B. and Yuan, Liangzhe and Zhou, Hao and Yan, Shen and Sun, Jennifer J. and Friedman, Luke and Qian, Rui and Weyand, Tobias and Zhao, Yue and Hornung, Rachel and Schroff, Florian and Yang, Ming-Hsuan and Ross, David A. and Wang, Huisheng and Adam, Hartwig and Sirotenko, Mikhail and Liu, Ting and Gong, Boqing},
  booktitle = {ICML},
  year = {2024},
}
Apache-2.0 (matches upstream license).