Abstract: Offline reinforcement learning datasets are often collected by mixtures of different behaviour policies, resulting in complex multimodal action distributions. Diffusion-based policy methods can effectively model such offline action distributions; however, their action generation relies on multi-step reverse diffusion, which incurs high inference latency and becomes a significant bottleneck for practical deployment. Additionally, even when generated actions fall within the distribution's support, the inherent uncertainty can lead to overestimation of Q-values, thereby compromising policy stability. To address these challenges, we propose an Efficient Q-Ensemble Diffusion Policy (E2DP). E2DP significantly reduces the computational cost of action generation through a two-step reverse diffusion mechanism. Meanwhile, ensemble Q-networks with variance regularization are employed to improve uncertainty estimation under a small ensemble size, and a lower-confidence-bound constraint is incorporated during policy improvement to balance in-distribution actions and high-risk candidates near the data support. Experimental results on bandit tasks and the D4RL benchmark demonstrate that E2DP achieves inference speedups of approximately 2.5× while maintaining distribution-modeling capability comparable to existing diffusion-based policies, and obtains improved normalized performance across multiple tasks. The code of E2DP is available on GitHub at https://github.com/smartredcat/E2DP.
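
The lower-confidence-bound constraint mentioned above can be illustrated with a minimal sketch: given Q-value estimates from a small ensemble, the policy target is penalized by the ensemble's disagreement. The function name, the coefficient `beta`, and the plain mean-minus-std form are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def lcb_q_value(q_values: np.ndarray, beta: float = 1.0) -> float:
    """Lower-confidence-bound Q estimate from an ensemble.

    q_values: shape (ensemble_size,), one estimate per ensemble head
    beta: confidence coefficient (hypothetical name, not from the paper)

    Actions on which the ensemble disagrees (high std) receive a lower
    bound, discouraging high-risk candidates near the edge of the data
    support; actions with consistent estimates are penalized less.
    """
    mean = float(np.mean(q_values))
    std = float(np.std(q_values))
    return mean - beta * std

# Example: a 4-head ensemble evaluating one state-action pair
qs = np.array([1.0, 1.2, 0.9, 1.1])
lcb = lcb_q_value(qs, beta=1.0)  # below the ensemble mean of 1.05
```

Larger `beta` makes the policy more conservative; with a perfectly agreeing ensemble the bound reduces to the mean itself.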