Safe Online Reinforcement Learning with Diffusion World Model and Langevin Dynamics
Liu, Jingyu ; Wang, Yuanda ; Sun, Changyin ; Duan, Anqing
Department
Robotics
Type
Journal article
Language
English
Abstract
Safe reinforcement learning (RL) faces a fundamental challenge: how can agents maximize rewards while guaranteeing safety during both training and deployment? Traditional approaches either sacrifice performance for safety or risk constraint violations during learning. We propose the Safe Diffusion Langevin Policy (SDLP), a framework that enables agents to "think before they act" by explicitly modeling long-term consequences. SDLP uses a Diffusion World Model (DWM) to generate diverse possible future trajectories for candidate actions, evaluates these trajectories with a principled evaluation function that balances rewards against safety violations, and iteratively refines actions using gradient-guided Langevin dynamics. This shifts safe RL from reactive constraint handling to proactive safety reasoning: rather than relying on agents to learn to avoid violations through trial and error, SDLP explicitly models future consequences and optimizes actions accordingly. Experiments on challenging continuous control and navigation benchmarks confirm this advantage: SDLP achieves superior task performance while maintaining consistently low constraint violations, significantly outperforming Soft Actor-Critic (SAC), SafeLayer, and Control Barrier Functions (CBF) across all tested environments.
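The action-refinement loop the abstract describes can be illustrated with a minimal sketch: a world-model rollout scores each candidate action by cumulative reward minus a weighted safety cost, and Langevin dynamics refines the action with noisy gradient ascent on that score. All names here (`rollout_fn`, `evaluate_trajectory`, `langevin_refine`, the step sizes and weights) are hypothetical illustrations, not the authors' implementation; the gradient is approximated by finite differences rather than backpropagation through a learned diffusion model.

```python
import numpy as np

# Hypothetical sketch of Langevin-dynamics action refinement; not the
# authors' code. `rollout_fn(actions)` stands in for the Diffusion World
# Model: it returns (rewards, costs) arrays for an imagined trajectory.

def evaluate_trajectory(action, rollout_fn, safety_weight=10.0):
    """Score a rollout: cumulative reward minus weighted constraint cost."""
    rewards, costs = rollout_fn(action)
    return rewards.sum() - safety_weight * costs.sum()

def langevin_refine(action, rollout_fn, steps=20, step_size=0.05,
                    noise_scale=0.01, eps=1e-3):
    """Refine a candidate action by noisy gradient ascent on the score."""
    a = action.astype(float).copy()
    for _ in range(steps):
        # Finite-difference gradient of the trajectory score w.r.t. the action
        base = evaluate_trajectory(a, rollout_fn)
        grad = np.zeros_like(a)
        for i in range(a.size):
            a_pert = a.copy()
            a_pert[i] += eps
            grad[i] = (evaluate_trajectory(a_pert, rollout_fn) - base) / eps
        # Langevin update: gradient step plus Gaussian exploration noise
        a += step_size * grad + noise_scale * np.random.randn(*a.shape)
    return a

# Toy world model: reward pulls the action toward a target, the safety
# cost penalizes any positive component (an illustrative "unsafe" region).
target = np.array([-0.5, -0.5])

def toy_rollout(a):
    rewards = np.array([-np.sum((a - target) ** 2)])
    costs = np.maximum(a, 0.0)
    return rewards, costs
```

With this toy model, `langevin_refine(np.array([1.0, 1.0]), toy_rollout)` drives the action out of the penalized region and toward the target, improving its trajectory score relative to the initial candidate.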
Source
Expert Systems with Applications
Keywords
46 Information and Computing Sciences, 4602 Artificial Intelligence, 4611 Machine Learning
Publisher
Elsevier
