videoprism-base-f16r288-pt

PyTorch port of google/videoprism-base-f16r288 (Google DeepMind's VideoPrism). The original release ships JAX/Flax weights only; this repo hosts a self-contained PyTorch implementation whose outputs are numerically equivalent to the JAX reference (cosine similarity 1.000000). Source: https://github.com/Skovorp/torch_videoprism.

Usage

from transformers import AutoModel, AutoProcessor

model = AutoModel.from_pretrained("sposiboh/videoprism-base-f16r288-pt", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("sposiboh/videoprism-base-f16r288-pt", trust_remote_code=True)

# Process a video file (a list of frames, a NumPy array, or a torch tensor also works):
inputs = processor(videos="path/to/video.mp4", return_tensors="pt")
outputs = model(**inputs)
embedding = outputs.last_hidden_state    # shape: (B, T*N, model_dim) — token sequence

Frames are sampled uniformly to the model's native frame count and resized bilinearly to image_size × image_size. Pixels are scaled to [0, 1].
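
The sampling and scaling steps above can be illustrated with a minimal pure-Python sketch. It is not the processor's actual code: the function names `uniform_frame_indices` and `scale_pixel` are hypothetical, and the choices to include both endpoints and to round to the nearest frame are assumptions. The target of 16 frames in the usage note is likewise an assumption, read off the "f16" in the model name. Bilinear resizing is left out.

```python
from typing import List


def uniform_frame_indices(total_frames: int, num_samples: int) -> List[int]:
    """Pick num_samples frame indices evenly spaced over [0, total_frames - 1].

    Assumption: both endpoints are included and positions are rounded to the
    nearest frame (half rounds up). The real processor may differ.
    """
    if total_frames <= 0 or num_samples <= 0:
        raise ValueError("total_frames and num_samples must be positive")
    if num_samples == 1:
        # With a single sample, take the middle frame.
        return [(total_frames - 1) // 2]
    span = total_frames - 1
    denom = num_samples - 1
    # Integer arithmetic for round(i * span / denom), avoiding float error.
    return [(2 * i * span + denom) // (2 * denom) for i in range(num_samples)]


def scale_pixel(value: int) -> float:
    """Map an 8-bit pixel value (0..255) to the range [0, 1]."""
    return value / 255.0
```

For a 31-frame clip, `uniform_frame_indices(31, 16)` returns every second frame, `[0, 2, 4, ..., 30]`. When the clip is shorter than the target count, some indices repeat, so frames are duplicated rather than dropped.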

Citation

If you use this model, please cite the original VideoPrism paper:

@inproceedings{zhao2024videoprism,
  title = {VideoPrism: A Foundational Visual Encoder for Video Understanding},
  author = {Zhao, Long and Gundavarapu, Nitesh B. and Yuan, Liangzhe and Zhou, Hao and Yan, Shen and Sun, Jennifer J. and Friedman, Luke and Qian, Rui and Weyand, Tobias and Zhao, Yue and Hornung, Rachel and Schroff, Florian and Yang, Ming-Hsuan and Ross, David A. and Wang, Huisheng and Adam, Hartwig and Sirotenko, Mikhail and Liu, Ting and Gong, Boqing},
  booktitle = {ICML},
  year = {2024},
}

License: Apache-2.0 (matches the upstream license).

Weights: Safetensors, 0.1B params, F32.