Search | VHL Regional Portal

1.

Optimization Landscape of Policy Gradient Methods for Discrete-Time Static Output Feedback.

Duan, Jingliang; Li, Jie; Chen, Xuyang; Zhao, Kai; Li, Shengbo Eben; Zhao, Lin.

IEEE Trans Cybern ; PP2023 Oct 26.

Article in English | MEDLINE | ID: mdl-37883283

ABSTRACT

In recent times, significant advancements have been made in delving into the optimization landscape of policy gradient methods for achieving optimal control in linear time-invariant (LTI) systems. Compared with state-feedback control, output-feedback control is more prevalent since the underlying state of the system may not be fully observed in many practical settings. This article analyzes the optimization landscape inherent to policy gradient methods when applied to static output feedback (SOF) control in discrete-time LTI systems subject to quadratic cost. We begin by establishing crucial properties of the SOF cost, encompassing coercivity, L -smoothness, and M -Lipschitz continuous Hessian. Despite the absence of convexity, we leverage these properties to derive novel findings regarding convergence (and nearly dimension-free rate) to stationary points for three policy gradient methods, including the vanilla policy gradient method, the natural policy gradient method, and the Gauss-Newton method. Moreover, we provide proof that the vanilla policy gradient method exhibits linear convergence toward local minima when initialized near such minima. This article concludes by presenting numerical examples that validate our theoretical findings. These results not only characterize the performance of gradient descent for optimizing the SOF problem but also provide insights into the effectiveness of general policy gradient methods within the realm of reinforcement learning.

2.

Policy-Iteration-Based Finite-Horizon Approximate Dynamic Programming for Continuous-Time Nonlinear Optimal Control.

Lin, Ziyu; Duan, Jingliang; Li, Shengbo Eben; Ma, Haitong; Li, Jie; Chen, Jianyu; Cheng, Bo; Ma, Jun.

IEEE Trans Neural Netw Learn Syst ; 34(9): 5255-5267, 2023 Sep.

Article in English | MEDLINE | ID: mdl-37015565

ABSTRACT

The Hamilton-Jacobi-Bellman (HJB) equation serves as the necessary and sufficient condition for the optimal solution to the continuous-time (CT) optimal control problem (OCP). Compared with the infinite-horizon HJB equation, the solving of the finite-horizon (FH) HJB equation has been a long-standing challenge, because the partial time derivative of the value function is involved as an additional unknown term. To address this problem, this study first-time bridges the link between the partial time derivative and the terminal-time utility function, and thus it facilitates the use of the policy iteration (PI) technique to solve the CT FH OCPs. Based on this key finding, the FH approximate dynamic programming (ADP) algorithm is proposed leveraging an actor-critic framework. It is shown that the algorithm exhibits important properties in terms of convergence and optimality. Rather importantly, with the use of multilayer neural networks (NNs) in the actor-critic architecture, the algorithm is suitable for CT FH OCPs toward more general nonlinear and complex systems. Finally, the effectiveness of the proposed algorithm is demonstrated by conducting a series of simulations on both a linear quadratic regulator (LQR) problem and a nonlinear vehicle tracking problem.

3.

Integrated Decision and Control: Toward Interpretable and Computationally Efficient Driving Intelligence.

Guan, Yang; Ren, Yangang; Sun, Qi; Li, Shengbo Eben; Ma, Haitong; Duan, Jingliang; Dai, Yifan; Cheng, Bo.

IEEE Trans Cybern ; 53(2): 859-873, 2023 Feb.

Article in English | MEDLINE | ID: mdl-35439160

ABSTRACT

Decision and control are core functionalities of high-level automated vehicles. Current mainstream methods, such as functional decomposition and end-to-end reinforcement learning (RL), suffer high time complexity or poor interpretability and adaptability on real-world autonomous driving tasks. In this article, we present an interpretable and computationally efficient framework called integrated decision and control (IDC) for automated vehicles, which decomposes the driving task into static path planning and dynamic optimal tracking that are structured hierarchically. First, the static path planning generates several candidate paths only considering static traffic elements. Then, the dynamic optimal tracking is designed to track the optimal path while considering the dynamic obstacles. To that end, we formulate a constrained optimal control problem (OCP) for each candidate path, optimize them separately, and follow the one with the best tracking performance. To unload the heavy online computation, we propose a model-based RL algorithm that can be served as an approximate-constrained OCP solver. Specifically, the OCPs for all paths are considered together to construct a single complete RL problem and then solved offline in the form of value and policy networks for real-time online path selecting and tracking, respectively. We verify our framework in both simulations and the real world. Results show that compared with baseline methods, IDC has an order of magnitude higher online computing efficiency, as well as better driving performance, including traffic efficiency and safety. In addition, it yields great interpretability and adaptability among different driving scenarios and tasks.

4.

Model-Based Chance-Constrained Reinforcement Learning via Separated Proportional-Integral Lagrangian.

Peng, Baiyu; Duan, Jingliang; Chen, Jianyu; Li, Shengbo Eben; Xie, Genjin; Zhang, Congsheng; Guan, Yang; Mu, Yao; Sun, Enxin.

IEEE Trans Neural Netw Learn Syst ; PP2022 May 30.

Article in English | MEDLINE | ID: mdl-35635820

ABSTRACT

Safety is essential for reinforcement learning (RL) applied in the real world. Adding chance constraints (or probabilistic constraints) is a suitable way to enhance RL safety under uncertainty. Existing chance-constrained RL methods, such as the penalty methods and the Lagrangian methods, either exhibit periodic oscillations or learn an overconservative or unsafe policy. In this article, we address these shortcomings by proposing a separated proportional-integral Lagrangian (SPIL) algorithm. We first review the constrained policy optimization process from a feedback control perspective, which regards the penalty weight as the control input and the safe probability as the control output. Based on this, the penalty method is formulated as a proportional controller, and the Lagrangian method is formulated as an integral controller. We then unify them and present a proportional-integral Lagrangian method to get both their merits with an integral separation technique to limit the integral value to a reasonable range. To accelerate training, the gradient of safe probability is computed in a model-based manner. The convergence of the overall algorithm is analyzed. We demonstrate that our method can reduce the oscillations and conservatism of RL policy in a car-following simulation. To prove its practicality, we also apply our method to a real-world mobile robot navigation task, where our robot successfully avoids a moving obstacle with highly uncertain or even aggressive behaviors.

5.

Distributional Soft Actor-Critic: Off-Policy Reinforcement Learning for Addressing Value Estimation Errors.

Duan, Jingliang; Guan, Yang; Li, Shengbo Eben; Ren, Yangang; Sun, Qi; Cheng, Bo.

IEEE Trans Neural Netw Learn Syst ; 33(11): 6584-6598, 2022 Nov.

Article in English | MEDLINE | ID: mdl-34101599

ABSTRACT

In reinforcement learning (RL), function approximation errors are known to easily lead to the Q -value overestimations, thus greatly reducing policy performance. This article presents a distributional soft actor-critic (DSAC) algorithm, which is an off-policy RL method for continuous control setting, to improve the policy performance by mitigating Q -value overestimations. We first discover in theory that learning a distribution function of state-action returns can effectively mitigate Q -value overestimations because it is capable of adaptively adjusting the update step size of the Q -value function. Then, a distributional soft policy iteration (DSPI) framework is developed by embedding the return distribution function into maximum entropy RL. Finally, we present a deep off-policy actor-critic variant of DSPI, called DSAC, which directly learns a continuous return distribution by keeping the variance of the state-action returns within a reasonable range to address exploding and vanishing gradient problems. We evaluate DSAC on the suite of MuJoCo continuous control tasks, achieving the state-of-the-art performance.

6.

Driver braking behavior analysis to improve autonomous emergency braking systems in typical Chinese vehicle-bicycle conflicts.

Duan, Jingliang; Li, Renjie; Hou, Lian; Wang, Wenjun; Li, Guofa; Li, Shengbo Eben; Cheng, Bo; Gao, Hongbo.

Accid Anal Prev ; 108: 74-82, 2017 Nov.

Article in English | MEDLINE | ID: mdl-28858775

ABSTRACT

Bicycling is one of the fundamental modes of transportation especially in developing countries. Because of the lack of effective protection for bicyclists, vehicle-bicycle (V-B) accident has become a primary contributor to traffic fatalities. Although AEB (Autonomous Emergency Braking) systems have been developed to avoid or mitigate collisions, they need to be further adapted in various conflict situations. This paper analyzes the driver's braking behavior in typical V-B conflicts of China to improve the performance of Bicyclist-AEB systems. Naturalistic driving data were collected, from which the top three scenarios of V-B accidents in China were extracted, including SCR (a bicycle crossing the road from right while a car is driving straight), SCL (a bicycle crossing the road from left while a car is driving straight) and SSR (a bicycle swerving in front of the car from right while a car is driving straight). For safety and data reliability, a driving simulator was employed to reconstruct these three scenarios and some 25 licensed drivers were recruited for braking behavior analysis. Results revealed that driver's braking behavior was significantly influenced by V-B conflict types. Pre-decelerating behaviors were found in SCL and SSR conflicts, whereas in SCR the subjects were less vigilant. The brake reaction time and brake severity in lateral V-B conflicts (SCR and SCL) was shorter and higher than that in longitudinal conflicts (SSR). The findings improve their applications in the Bicyclist-AEB and test protocol enactment to enhance the performance of Bicyclist-AEB systems in mixed traffic situations especially for developing countries.

Subject(s)

Accidents, Traffic/prevention & control , Automobile Driving/psychology , Bicycling , Adult , China , Deceleration , Emergencies , Female , Humans , Male , Middle Aged , Reaction Time , Reproducibility of Results , Young Adult

ABSTRACT

ABSTRACT

ABSTRACT

ABSTRACT

ABSTRACT

ABSTRACT

Subject(s)

SEND TO:

SELECTION OF CITATIONS

SEARCH DETAIL