Finite-Time Error Bounds for Greedy-GQ

Kavli Affiliate: Yi Zhou

| First 5 Authors: Yue Wang, Yi Zhou, Shaofeng Zou

| Summary:

Greedy-GQ with linear function approximation, originally proposed in
\cite{maei2010toward}, is a value-based off-policy algorithm for optimal
control in reinforcement learning, and it has a nonlinear two-timescale
structure with a non-convex objective function. This paper develops its
finite-time error bounds. We show that the Greedy-GQ algorithm converges as
fast as $\mathcal{O}(1/\sqrt{T})$ under the i.i.d. setting and
$\mathcal{O}(\log T/\sqrt{T})$ under the Markovian setting. We further
design a variant of the vanilla Greedy-GQ algorithm using the nested-loop
approach, and show that its sample complexity is
$\mathcal{O}(\log(1/\epsilon)\epsilon^{-2})$, which matches that of
the vanilla Greedy-GQ. Our finite-time error bounds also match those of
stochastic gradient descent algorithms for general smooth non-convex
optimization problems. Our finite-sample analysis provides theoretical guidance
on choosing step-sizes for faster convergence in practice and reveals the
trade-off between the convergence rate and the quality of the obtained policy.
The techniques in this paper provide a general approach for the finite-sample
analysis of non-convex two-timescale value-based reinforcement learning
algorithms.
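
To make the two-timescale structure concrete, below is a minimal sketch of a single Greedy-GQ update with linear function approximation, in the spirit of Maei et al. (2010). The function name, argument names, and the assumption that features $\phi(s,a)$ and sampled transitions are available are illustrative choices, not part of the paper.

```python
import numpy as np

def greedy_gq_step(theta, w, phi_sa, phi_next_all, reward, gamma, alpha, beta):
    """One Greedy-GQ update with linear function approximation (sketch).

    theta        -- main weight vector, updated on the slow timescale
    w            -- auxiliary weight vector, updated on the fast timescale
    phi_sa       -- feature vector phi(s, a) of the current state-action pair
    phi_next_all -- matrix whose rows are phi(s', a') for each action a'
    alpha, beta  -- slow and fast step-sizes (alpha / beta -> 0 in the analysis)
    """
    # Greedy action at the next state under the current theta.
    q_next = phi_next_all @ theta
    phi_next = phi_next_all[np.argmax(q_next)]

    # TD error with respect to the greedy target.
    delta = reward + gamma * phi_next @ theta - phi_sa @ theta

    # Two-timescale updates: theta moves slowly, w tracks the correction term.
    theta = theta + alpha * (delta * phi_sa - gamma * (w @ phi_sa) * phi_next)
    w = w + beta * (delta - w @ phi_sa) * phi_sa
    return theta, w
```

The finite-time bounds discussed in the paper concern precisely how the two step-sizes $\alpha$ and $\beta$ should be chosen; in this sketch they are simply passed in as constants.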

| Search Query: ArXiv Query: search_query=au:"Yi Zhou"&id_list=&start=0&max_results=10