Results 1 - 4 of 4

1.
IEEE Trans Cybern ; 51(8): 4251-4264, 2021 Aug.
Article in English | MEDLINE | ID: mdl-30908269

ABSTRACT

Since the late 1980s, temporal difference (TD) learning has dominated the research area of policy evaluation algorithms. However, the need to avoid the defects of TD, such as low data efficiency and divergence in off-policy learning, has inspired a large number of novel TD-based approaches. Gradient-based and least-squares-based algorithms comprise the major part of these new approaches. This paper aims to combine the advantages of these two categories to derive an efficient policy evaluation algorithm with O(n^2) per-time-step runtime complexity. The least-squares-based framework is adopted, and a gradient correction is used to improve convergence performance. The paper begins by revising a previous O(n^3) batch algorithm, least-squares TD with gradient correction (LS-TDC), to regularize the parameter vector. Based on the recursive least-squares technique, an O(n^2) counterpart of LS-TDC, called RC, is proposed. To increase data efficiency, RC is generalized with eligibility traces. An off-policy extension based on importance sampling is also proposed. In addition, convergence analyses for both RC and LS-TDC are given. Empirical results on both on-policy and off-policy benchmarks show that RC achieves higher estimation accuracy than RLS-TD and significantly lower runtime complexity than LS-TDC.
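For orientation, the recursive least-squares idea behind an O(n^2) per-step critic can be illustrated with the classical RLS-TD(0) recursion in Sherman-Morrison form under linear value-function approximation. This is a minimal sketch of that standard recursion, not the RC algorithm itself; the feature dimension, initialization, and discount factor below are illustrative assumptions.

```python
import numpy as np

def rls_td_step(theta, P, phi, reward, phi_next, gamma=0.99):
    """One recursive least-squares TD(0) update in Sherman-Morrison form.

    theta    : (n,) value-function weights, V(s) ~= theta @ phi(s)
    P        : (n, n) running inverse of the empirical A matrix
    phi      : (n,) features of the current state
    phi_next : (n,) features of the successor state
    Each call costs O(n^2); no explicit matrix inversion is performed.
    """
    u = phi - gamma * phi_next           # TD feature difference
    P_phi = P @ phi
    k = P_phi / (1.0 + u @ P_phi)        # gain vector
    td_error = reward - u @ theta        # equals r + gamma*V(s') - V(s)
    theta = theta + k * td_error
    P = P - np.outer(k, u @ P)           # Sherman-Morrison rank-1 update
    return theta, P

# illustrative usage with a hypothetical 8-dimensional feature vector
n = 8
theta, P = np.zeros(n), np.eye(n) * 1e3  # large initial P = weak prior
rng = np.random.default_rng(0)
phi, phi_next = rng.random(n), rng.random(n)
theta, P = rls_td_step(theta, P, phi, reward=1.0, phi_next=phi_next)
```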

2.
IEEE Trans Neural Netw Learn Syst ; 32(3): 1217-1227, 2021 Mar.
Article in English | MEDLINE | ID: mdl-32324571

ABSTRACT

The actor-critic (AC) learning control architecture has been regarded as an important framework for reinforcement learning (RL) with continuous states and actions. To improve learning efficiency and convergence, previous work has mainly been devoted to solving regularization and feature learning problems in policy evaluation. In this article, we propose a novel AC learning control method with regularization and feature selection for policy gradient estimation in the actor network. The main contribution is that l1-regularization is used on the actor network to achieve feature selection. In each iteration, the policy parameters are updated by the regularized dual-averaging (RDA) technique, which solves a minimization problem involving two terms: the running average of past policy gradients and the l1-regularization term on the policy parameters. Our algorithm efficiently computes the solution of this minimization problem, and we call the new adaptation of policy gradient RDA-policy gradient (RDA-PG). The proposed RDA-PG can learn stochastic and deterministic near-optimal policies. The convergence of the proposed algorithm is established based on the theory of two-timescale stochastic approximation. Simulation and experimental results show that RDA-PG performs feature selection successfully in the actor and learns sparse actor representations in both the stochastic and deterministic cases. RDA-PG outperforms existing AC algorithms on standard RL benchmark problems with irrelevant or redundant features.
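For illustration, the closed-form l1-regularized dual-averaging step that such an actor update builds on (following Xiao's RDA method) can be sketched as below. The gradient-averaging loop, step-size scale gamma_rda, and regularization strength lam are assumptions for the sketch, not the authors' exact RDA-PG update.

```python
import numpy as np

def rda_l1_step(g_avg, t, lam, gamma_rda):
    """Closed-form l1-regularized dual-averaging (RDA) update.

    g_avg     : (n,) running average of past (policy) gradient estimates
    t         : iteration counter, t >= 1
    lam       : l1 strength; coordinates with |g_avg| <= lam become exactly 0
    gamma_rda : scale of the sqrt(t) proximal term
    """
    shrunk = np.sign(g_avg) * np.maximum(np.abs(g_avg) - lam, 0.0)
    return -(np.sqrt(t) / gamma_rda) * shrunk

# illustrative loop: average stand-in gradients, then apply the RDA step
n = 6
g_avg, theta = np.zeros(n), np.zeros(n)
rng = np.random.default_rng(1)
for t in range(1, 101):
    g_t = rng.normal(size=n)              # stand-in for a policy-gradient estimate
    g_avg = ((t - 1) * g_avg + g_t) / t   # running average of past gradients
    theta = rda_l1_step(g_avg, t, lam=0.3, gamma_rda=5.0)
```

The truncation to exact zeros in rda_l1_step is what produces the feature-selection behavior described in the abstract.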

3.
IEEE Trans Neural Netw Learn Syst ; 29(12): 5899-5909, 2018 Dec.
Article in English | MEDLINE | ID: mdl-29993664

ABSTRACT

Policy-gradient-based actor-critic (PG-based AC) methods have been widely studied for solving learning control problems. To increase the data efficiency of learning prediction in the critic of PG-based AC, recent studies have examined how to use recursive least-squares temporal difference (RLS-TD) algorithms for policy evaluation. In such settings, the RLS-TD critic evaluates an unknown mixed policy generated by a series of different actors, rather than the one fixed policy generated by the current actor. Therefore, this AC framework with an RLS-TD critic cannot be proved to converge to the optimal fixed point of the learning problem. To address this problem, this paper proposes a new AC framework named critic-iteration PG (CIPG), which learns the state-value function of the current policy in an on-policy way and performs gradient ascent in the direction that improves the discounted total reward. During each iteration, CIPG keeps the policy parameters fixed and evaluates the resulting fixed policy with a regularized RLS-TD critic. Our convergence analysis extends previous convergence analyses of PG with function approximation to the case of an RLS-TD critic. Simulation results demonstrate that the regularization term in the critic of CIPG remains undamped during the learning process, and that CIPG has better learning efficiency and a faster convergence rate than conventional AC learning control methods. A schematic sketch of the critic-iteration structure is given below.
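The sketch below only illustrates the overall critic-iteration structure: hold the policy fixed, evaluate it with a regularized least-squares critic, then take one policy-gradient step. The toy environment, Gaussian policy, and the ridge-regularized LSTD solve standing in for the paper's regularized RLS-TD critic are all assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(2)
n_feat, gamma, actor_lr = 4, 0.95, 0.05
actor_w = np.zeros(n_feat)                        # policy parameters

def features(s):                                  # hypothetical state features
    return np.array([1.0, s, s ** 2, np.sin(s)])

def policy_action(s, w):                          # Gaussian policy, unit variance
    return rng.normal(loc=w @ features(s), scale=1.0)

def step_env(s, a):                               # toy dynamics and reward
    s_next = float(np.clip(0.8 * s + 0.2 * a + 0.1 * rng.normal(), -3.0, 3.0))
    return s_next, -(s_next ** 2)

for iteration in range(20):                       # outer critic-iteration loop
    # 1) collect a batch under the current, fixed policy
    batch, s = [], 0.0
    for _ in range(200):
        a = policy_action(s, actor_w)
        s_next, r = step_env(s, a)
        batch.append((s, a, r, s_next))
        s = s_next
    # 2) evaluate the fixed policy with a regularized least-squares critic
    A = 1e-3 * np.eye(n_feat)                     # small ridge term for stability
    b = np.zeros(n_feat)
    for (s0, a, r, s1) in batch:
        phi, phi_n = features(s0), features(s1)
        A += np.outer(phi, phi - gamma * phi_n)
        b += r * phi
    critic_w = np.linalg.solve(A, b)
    # 3) one policy-gradient ascent step using the converged critic
    grad = np.zeros(n_feat)
    for (s0, a, r, s1) in batch:
        phi = features(s0)
        advantage = r + gamma * (critic_w @ features(s1)) - critic_w @ phi
        grad += (a - actor_w @ phi) * phi * advantage   # Gaussian score x advantage
    actor_w = actor_w + actor_lr * grad / len(batch)
```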

4.
IEEE Trans Neural Netw Learn Syst ; 27(4): 771-82, 2016 Apr.
Article in English | MEDLINE | ID: mdl-25955853

ABSTRACT

A least-squares temporal difference with gradient correction (LS-TDC) algorithm and its kernel-based version (KLS-TDC) are proposed as policy evaluation algorithms for reinforcement learning (RL). LS-TDC is derived from the TDC algorithm; because TDC is obtained by minimizing the mean-square projected Bellman error, LS-TDC inherits its good convergence performance. The least-squares technique is used to avoid the step-size tuning of the original TDC and to enhance robustness. For KLS-TDC, the kernel method allows feature vectors to be selected automatically, and approximate linear dependence analysis is performed to realize kernel sparsification. In addition, a policy iteration strategy motivated by KLS-TDC is constructed to solve learning control problems. The convergence and parameter sensitivities of both LS-TDC and KLS-TDC are tested on on-policy learning, off-policy learning, and learning control problems. Experimental results, compared with a series of corresponding RL algorithms, demonstrate that both LS-TDC and KLS-TDC achieve better approximation and convergence performance, higher sample efficiency, a smaller parameter-tuning burden, and less sensitivity to parameters.
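As a side illustration of the kernel sparsification mentioned above, a minimal sketch of the standard approximate-linear-dependence (ALD) test is given below; the RBF kernel, the tolerance nu, and the jitter term are assumptions, and the code is not the authors' KLS-TDC implementation.

```python
import numpy as np

def rbf(x, y, sigma=1.0):
    """Gaussian (RBF) kernel between two state vectors."""
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

def ald_sparsify(samples, nu=0.1, sigma=1.0):
    """Approximate-linear-dependence (ALD) dictionary construction.

    A sample is added to the dictionary only if its feature-space image
    cannot be approximated (within tolerance nu) by a linear combination
    of the images of the samples already in the dictionary.
    """
    dictionary = []
    for x in samples:
        if not dictionary:
            dictionary.append(x)
            continue
        K = np.array([[rbf(d1, d2, sigma) for d2 in dictionary]
                      for d1 in dictionary])            # Gram matrix of dictionary
        k_vec = np.array([rbf(d, x, sigma) for d in dictionary])
        c = np.linalg.solve(K + 1e-8 * np.eye(len(dictionary)), k_vec)
        delta = rbf(x, x, sigma) - k_vec @ c            # ALD residual
        if delta > nu:
            dictionary.append(x)
    return dictionary

# illustrative use on random 2-D states
rng = np.random.default_rng(3)
states = rng.random((200, 2))
dictionary = ald_sparsify(states, nu=0.05)
print(len(dictionary), "dictionary elements retained")
```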
