- iVideoGPT: Interactive VideoGPTs are Scalable World Models
  Paper • 2405.15223 • Published • 17
- Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models
  Paper • 2405.15574 • Published • 55
- An Introduction to Vision-Language Modeling
  Paper • 2405.17247 • Published • 91
- Matryoshka Multimodal Models
  Paper • 2405.17430 • Published • 35
Collections
Collections including paper arxiv:2511.05491
- V-Thinker: Interactive Thinking with Images
  Paper • 2511.04460 • Published • 98
- Visual Spatial Tuning
  Paper • 2511.05491 • Published • 53
- BabyVision: Visual Reasoning Beyond Language
  Paper • 2601.06521 • Published • 201
- Golden Goose: A Simple Trick to Synthesize Unlimited RLVR Tasks from Unverifiable Internet Text
  Paper • 2601.22975 • Published • 113

- DLER: Doing Length pEnalty Right - Incentivizing More Intelligence per Token via Reinforcement Learning
  Paper • 2510.15110 • Published • 18
- PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model
  Paper • 2510.14528 • Published • 129
- Bee: A High-Quality Corpus and Full-Stack Suite to Unlock Advanced Fully Open MLLMs
  Paper • 2510.13795 • Published • 60
- UniME-V2: MLLM-as-a-Judge for Universal Multimodal Embedding Learning
  Paper • 2510.13515 • Published • 12

- Describe What You See with Multimodal Large Language Models to Enhance Video Recommendations
  Paper • 2508.09789 • Published • 5
- MM-BrowseComp: A Comprehensive Benchmark for Multimodal Browsing Agents
  Paper • 2508.13186 • Published • 20
- ZARA: Zero-shot Motion Time-Series Analysis via Knowledge and Retrieval Driven LLM Agents
  Paper • 2508.04038 • Published • 1
- Prompt Orchestration Markup Language
  Paper • 2508.13948 • Published • 48

- Visual Spatial Tuning
  Paper • 2511.05491 • Published • 53
- Adam's Law: Textual Frequency Law on Large Language Models
  Paper • 2604.02176 • Published • 506
- Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation
  Paper • 2604.10098 • Published • 82
- Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe
  Paper • 2604.13016 • Published • 110

- A Survey of Reinforcement Learning for Large Reasoning Models
  Paper • 2509.08827 • Published • 193
- Seed Diffusion: A Large-Scale Diffusion Language Model with High-Speed Inference
  Paper • 2508.02193 • Published • 139
- Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations
  Paper • 2510.23607 • Published • 181
- Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation
  Paper • 2510.08673 • Published • 128

- Contrastive Learning for Many-to-many Multilingual Neural Machine Translation
  Paper • 2105.09501 • Published • 1
- Cross-modal Contrastive Learning for Speech Translation
  Paper • 2205.02444 • Published
- ByteTransformer: A High-Performance Transformer Boosted for Variable-Length Inputs
  Paper • 2210.03052 • Published
- Diffusion Glancing Transformer for Parallel Sequence to Sequence Learning
  Paper • 2212.10240 • Published • 1