Models
Datasets
Spaces
Buckets new
Docs
Enterprise
Pricing
- Website
- Community
- Solutions
Log In
Sign Up

Collections

Discover the best community collections!

Collections including paper arxiv:2605.00809

Vision Language Action models

A Survey on Vision-Language-Action Models: An Action Tokenization Perspective

Paper • 2507.01925 • Published Jul 2, 2025 • 39
Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning

Paper • 2507.16746 • Published Jul 22, 2025 • 35
MolmoAct: Action Reasoning Models that can Reason in Space

Paper • 2508.07917 • Published Aug 11, 2025 • 45
Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies

Paper • 2508.20072 • Published Aug 27, 2025 • 32

Interesting work but not directly related

VOID: Video Object and Interaction Deletion

Paper • 2604.02296 • Published Apr 2 • 55
OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation

Paper • 2604.18486 • Published Apr 20 • 95
WildDet3D: Scaling Promptable 3D Detection in the Wild

Paper • 2604.08626 • Published Apr 9 • 246
UniT: Toward a Unified Physical Language for Human-to-Humanoid Policy Learning and World Modeling

Paper • 2604.19734 • Published Apr 21 • 32

Star Attention: Efficient LLM Inference over Long Sequences

Paper • 2411.17116 • Published Nov 26, 2024 • 53
MSA: Memory Sparse Attention for Efficient End-to-End Memory Model Scaling to 100M Tokens

Paper • 2603.23516 • Published Mar 6 • 50
TriAttention: Efficient Long Reasoning with Trigonometric KV Compression

Paper • 2604.04921 • Published Apr 6 • 114
Let ViT Speak: Generative Language-Image Pre-training

Paper • 2605.00809 • Published 24 days ago • 32

EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters

Paper • 2402.04252 • Published Feb 6, 2024 • 30
Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models

Paper • 2402.03749 • Published Feb 6, 2024 • 15
ScreenAI: A Vision-Language Model for UI and Infographics Understanding

Paper • 2402.04615 • Published Feb 7, 2024 • 44
EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss

Paper • 2402.05008 • Published Feb 7, 2024 • 23

about 7 hours ago

ShotStream: Streaming Multi-Shot Video Generation for Interactive Storytelling

Paper • 2603.25746 • Published Mar 26 • 155
TAPS: Task Aware Proposal Distributions for Speculative Sampling

Paper • 2603.27027 • Published Mar 27 • 144
Out of Sight but Not Out of Mind: Hybrid Memory for Dynamic Video World Models

Paper • 2603.25716 • Published Mar 26 • 156
LongCat-Next: Lexicalizing Modalities as Discrete Tokens

Paper • 2603.27538 • Published Mar 29 • 147

Vision Language Action models

A Survey on Vision-Language-Action Models: An Action Tokenization Perspective

Paper • 2507.01925 • Published Jul 2, 2025 • 39
Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning

Paper • 2507.16746 • Published Jul 22, 2025 • 35
MolmoAct: Action Reasoning Models that can Reason in Space

Paper • 2508.07917 • Published Aug 11, 2025 • 45
Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies

Paper • 2508.20072 • Published Aug 27, 2025 • 32

EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters

Paper • 2402.04252 • Published Feb 6, 2024 • 30
Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models

Paper • 2402.03749 • Published Feb 6, 2024 • 15
ScreenAI: A Vision-Language Model for UI and Infographics Understanding

Paper • 2402.04615 • Published Feb 7, 2024 • 44
EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss

Paper • 2402.05008 • Published Feb 7, 2024 • 23

Interesting work but not directly related

VOID: Video Object and Interaction Deletion

Paper • 2604.02296 • Published Apr 2 • 55
OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation

Paper • 2604.18486 • Published Apr 20 • 95
WildDet3D: Scaling Promptable 3D Detection in the Wild

Paper • 2604.08626 • Published Apr 9 • 246
UniT: Toward a Unified Physical Language for Human-to-Humanoid Policy Learning and World Modeling

Paper • 2604.19734 • Published Apr 21 • 32

about 7 hours ago

ShotStream: Streaming Multi-Shot Video Generation for Interactive Storytelling

Paper • 2603.25746 • Published Mar 26 • 155
TAPS: Task Aware Proposal Distributions for Speculative Sampling

Paper • 2603.27027 • Published Mar 27 • 144
Out of Sight but Not Out of Mind: Hybrid Memory for Dynamic Video World Models

Paper • 2603.25716 • Published Mar 26 • 156
LongCat-Next: Lexicalizing Modalities as Discrete Tokens

Paper • 2603.27538 • Published Mar 29 • 147

Star Attention: Efficient LLM Inference over Long Sequences

Paper • 2411.17116 • Published Nov 26, 2024 • 53
MSA: Memory Sparse Attention for Efficient End-to-End Memory Model Scaling to 100M Tokens

Paper • 2603.23516 • Published Mar 6 • 50
TriAttention: Efficient Long Reasoning with Trigonometric KV Compression

Paper • 2604.04921 • Published Apr 6 • 114
Let ViT Speak: Generative Language-Image Pre-training

Paper • 2605.00809 • Published 24 days ago • 32

Company

TOS Privacy About Careers

Website

Models Datasets Spaces Pricing Docs