Abstract: Offline reinforcement learning (ORL) learns decision-making policies from fixed datasets without further environment interaction, but is prone to extrapolation errors caused by out-of-distribution actions. Existing approaches typically mitigate this issue by constraining the policy distribution or applying conservative Q-value estimation, but the induced pessimism often leads to suboptimal policies. To address this issue, this paper introduces a Q-value correction (QVC) Bellman operator that improves the accuracy of value-function estimation. The QVC Bellman operator uses the difference between the learned and behavior Q-functions as a directional signal, applying bounded adjustments to the Q-value update target. Building on this, we combine the QVC Bellman operator with the in-distribution Bellman operator to form a balanced Bellman operator, enabling more effective use of both in-distribution and out-of-distribution data. Theoretical analysis confirms that iterative application of the balanced Bellman operator yields a convergent Q-function whose deviation from the true Q-function is bounded. Furthermore, we integrate the balanced Bellman operator into implicit Q-learning and incorporate an adaptive correction mechanism into the V-function update to jointly address Q-value overestimation and underestimation, yielding a novel ORL method based on dual value correction (ORL-DVC). Experimental results on the D4RL benchmark show that ORL-DVC achieves average normalized scores of 80.9 on Gym-MuJoCo locomotion tasks and 62.7 on AntMaze navigation tasks, outperforming state-of-the-art ORL methods while exhibiting superior generalization.
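As a rough illustration of the correction described above, the target computation might look like the following sketch. The abstract does not give the exact operator, so everything here is an assumption: the function names, the clipping bound `eps`, and the mixing weight `lam` are all hypothetical placeholders, not the paper's actual formulation.

```python
import numpy as np

def qvc_target(r, gamma, q_next, q_learned, q_behavior, eps=0.1):
    """Hypothetical sketch of a Q-value-corrected Bellman target.

    The correction direction comes from the gap between the learned and
    behavior Q-values (positive gap suggests overestimation); its magnitude
    is clipped to a bound `eps` (assumed hyperparameter, not given in the
    abstract), so the adjustment to the standard target stays bounded.
    """
    correction = np.clip(q_learned - q_behavior, -eps, eps)
    return r + gamma * q_next - correction

def balanced_target(r, gamma, q_next_in_dist, q_next,
                    q_learned, q_behavior, lam=0.5, eps=0.1):
    """Hypothetical balanced operator: mixes an in-distribution Bellman
    target with the QVC-corrected target via a weight `lam` (assumed)."""
    in_dist = r + gamma * q_next_in_dist
    corrected = qvc_target(r, gamma, q_next, q_learned, q_behavior, eps)
    return lam * in_dist + (1.0 - lam) * corrected
```

With `lam = 1.0` the sketch reduces to a purely in-distribution target; with `lam = 0.0` it uses only the bounded-correction target, mirroring the trade-off between in-distribution and out-of-distribution data that the balanced operator is said to manage.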