- A Survey on Vision-Language-Action Models: An Action Tokenization Perspective
  Paper • 2507.01925 • Published • 39
- Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning
  Paper • 2507.16746 • Published • 35
- MolmoAct: Action Reasoning Models that can Reason in Space
  Paper • 2508.07917 • Published • 45
- Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies
  Paper • 2508.20072 • Published • 32
Collections
Collections including paper arxiv:2605.00809
- VOID: Video Object and Interaction Deletion
  Paper • 2604.02296 • Published • 55
- OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation
  Paper • 2604.18486 • Published • 95
- WildDet3D: Scaling Promptable 3D Detection in the Wild
  Paper • 2604.08626 • Published • 246
- UniT: Toward a Unified Physical Language for Human-to-Humanoid Policy Learning and World Modeling
  Paper • 2604.19734 • Published • 32

- Star Attention: Efficient LLM Inference over Long Sequences
  Paper • 2411.17116 • Published • 53
- MSA: Memory Sparse Attention for Efficient End-to-End Memory Model Scaling to 100M Tokens
  Paper • 2603.23516 • Published • 50
- TriAttention: Efficient Long Reasoning with Trigonometric KV Compression
  Paper • 2604.04921 • Published • 114
- Let ViT Speak: Generative Language-Image Pre-training
  Paper • 2605.00809 • Published • 32

- EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
  Paper • 2402.04252 • Published • 30
- Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
  Paper • 2402.03749 • Published • 15
- ScreenAI: A Vision-Language Model for UI and Infographics Understanding
  Paper • 2402.04615 • Published • 44
- EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss
  Paper • 2402.05008 • Published • 23

- ShotStream: Streaming Multi-Shot Video Generation for Interactive Storytelling
  Paper • 2603.25746 • Published • 155
- TAPS: Task Aware Proposal Distributions for Speculative Sampling
  Paper • 2603.27027 • Published • 144
- Out of Sight but Not Out of Mind: Hybrid Memory for Dynamic Video World Models
  Paper • 2603.25716 • Published • 156
- LongCat-Next: Lexicalizing Modalities as Discrete Tokens
  Paper • 2603.27538 • Published • 147