π-Bench: Evaluating Proactive Personal Assistant Agents in Long-Horizon Workflows Paper • 2605.14678 • Published 7 days ago • 94
Achieving Gold-Medal-Level Olympiad Reasoning via Simple and Unified Scaling Paper • 2605.13301 • Published 13 days ago • 156
TrustJudge: Inconsistencies of LLM-as-a-Judge and How to Alleviate Them Paper • 2509.21117 • Published Sep 25, 2025 • 30
Reasoning over Boundaries: Enhancing Specification Alignment via Test-time Deliberation Paper • 2509.14760 • Published Sep 18, 2025 • 53
The Gold Medals in an Empty Room: Diagnosing Metalinguistic Reasoning in LLMs with Camlang Paper • 2509.00425 • Published Aug 30, 2025 • 12
Scaling Reasoning, Losing Control: Evaluating Instruction Following in Large Reasoning Models Paper • 2505.14810 • Published May 20, 2025 • 62
Uni-Encoder: A Fast and Accurate Response Selection Paradigm for Generation-Based Dialogue Systems Paper • 2106.01263 • Published Jun 2, 2021
Non-autoregressive Text Editing with Copy-aware Latent Alignments Paper • 2310.07821 • Published Oct 11, 2023
Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models Paper • 2309.01219 • Published Sep 3, 2023 • 2
LogiCoT: Logical Chain-of-Thought Instruction-Tuning Data Collection with GPT-4 Paper • 2305.12147 • Published May 20, 2023 • 2
Gated Slot Attention for Efficient Linear-Time Sequence Modeling Paper • 2409.07146 • Published Sep 11, 2024 • 20
On the Transformations across Reward Model, Parameter Update, and In-Context Prompt Paper • 2406.16377 • Published Jun 24, 2024 • 13