arxiv:2602.00986

Sparse Reward Subsystem in Large Language Models

Published on May 11

Submitted by Guowei Xu on Feb 3

Stanford AI


Authors:

Guowei Xu ,

Abstract

The hidden states of large language models contain a sparse reward subsystem: value neurons whose activations predict state value, and dopamine neurons whose activations encode temporal difference errors. Value neurons can be used to predict model confidence, and dopamine neurons can guide inference-time search.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Recent studies show that LLM hidden states encode reward-related information, such as answer correctness and model confidence. However, existing approaches typically fit black-box probes on the full hidden states, offering little insight into how this information is structured across neurons. In this paper, we show that reward-related information is concentrated in a sparse subset of neurons. Using simple probing, we identify two types of neurons: value neurons, whose activations predict state value, and dopamine neurons, whose activations encode step-level temporal difference (TD) errors. Together, these neurons form a sparse reward subsystem within LLM hidden states. These names are drawn by analogy with neuroscience, where value neurons and dopamine neurons in the biological reward subsystem also encode value and reward prediction errors, respectively. We demonstrate that value neurons are robust and transferable across diverse datasets and models, and provide causal evidence that they encode reward-related information. Finally, we show applications of the reward subsystem: value neurons serve as effective predictors of model confidence, and dopamine neurons can function as a process reward model (PRM) to guide inference-time search.
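To make the notion of a step-level TD error concrete, here is a minimal sketch using the textbook TD(0) definition, delta_t = r_t + gamma * V(s_{t+1}) - V(s_t). It is an illustration under stated assumptions, not the paper's procedure: the function name `td_errors`, the example value estimates, the single terminal reward, and the default `gamma = 1.0` are all hypothetical choices made for this example.

```python
from typing import List


def td_errors(values: List[float], final_reward: float, gamma: float = 1.0) -> List[float]:
    """Textbook TD(0) errors for one reasoning trace with a single terminal reward.

    values[t] is a hypothetical value estimate at step t (for instance, something
    a value-neuron probe might output). The reward is zero at every step except
    the last, and the value after the last step is taken to be 0. These are
    assumptions for this sketch, not details confirmed by the paper.
    """
    errors = []
    for t, value in enumerate(values):
        is_last = t == len(values) - 1
        reward = final_reward if is_last else 0.0
        next_value = 0.0 if is_last else values[t + 1]
        # delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        errors.append(reward + gamma * next_value - value)
    return errors


# Made-up value estimates for a four-step trace that ends in a correct answer.
step_values = [0.5, 0.75, 0.25, 1.0]
print(td_errors(step_values, final_reward=1.0))  # [0.25, -0.5, 0.75, 0.0]
```

In this toy trace the second step has a negative TD error, meaning the estimated value dropped after that step. A signal of this kind is what would let a step-level scorer rank candidate steps during search, though how the paper actually reads it out of dopamine neurons is not specified here.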


Community

Xkev (paper author, paper submitter), Feb 3

In this paper, we identify a sparse reward subsystem within the hidden states of Large Language Models (LLMs), drawing an analogy to the biological reward subsystem in the human brain.