Finite-Time Error Bounds for Greedy-GQ

Kavli Affiliate: Yi Zhou

| First 5 Authors: Yue Wang, Yi Zhou, Shaofeng Zou

| Summary:

Greedy-GQ with linear function approximation, originally proposed in
\cite{maei2010toward}, is a value-based off-policy algorithm for optimal
control in reinforcement learning, and it has a non-linear two-timescale
structure with a non-convex objective function. This paper develops its
tightest finite-time error bounds. We show that the Greedy-GQ algorithm
converges as fast as $\mathcal{O}(1/\sqrt{T})$ under the i.i.d. setting
and $\mathcal{O}(\log T/\sqrt{T})$ under the Markovian setting. We further
design a variant of the vanilla Greedy-GQ algorithm using a nested-loop
approach, and show that its sample complexity is
$\mathcal{O}(\log(1/\epsilon)\epsilon^{-2})$, which matches that of
the vanilla Greedy-GQ. Our finite-time error bounds match those of
stochastic gradient descent algorithms for general smooth non-convex
optimization problems, despite the additional challenge of the two-timescale
updates. Our finite-sample analysis provides theoretical guidance on choosing
step sizes for faster convergence in practice, and suggests a trade-off
between the convergence rate and the quality of the obtained policy. Our
techniques provide a general approach for the finite-sample analysis of non-convex
two-timescale value-based reinforcement learning algorithms.
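
For concreteness, below is a minimal sketch of one two-timescale Greedy-GQ update with linear function approximation, following the form of Maei et al. (2010) summarized above. The feature representation, step-size values, and function names are illustrative assumptions made for this sketch, not details taken from the paper.

```python
import numpy as np

# Minimal sketch of one vanilla Greedy-GQ update with linear function
# approximation. The interface (feature vectors passed in directly) and the
# step sizes below are illustrative assumptions; only the update rule itself
# follows the Greedy-GQ form described in the summary.

def greedy_gq_step(theta, w, phi_sa, phi_next_all, reward, gamma, alpha, beta):
    """One two-timescale Greedy-GQ update.

    theta        : main parameter vector (slow timescale, step size alpha)
    w            : auxiliary parameter vector (fast timescale, step size beta)
    phi_sa       : features of the current state-action pair, shape (d,)
    phi_next_all : features of all actions at the next state, shape (num_actions, d)
    """
    # Greedy action at the next state under the current value estimate.
    q_next = phi_next_all @ theta
    phi_next_greedy = phi_next_all[np.argmax(q_next)]

    # TD error with respect to the greedy next action.
    delta = reward + gamma * q_next.max() - phi_sa @ theta

    # Slow (main) update: gradient-corrected TD step for theta.
    theta = theta + alpha * (delta * phi_sa
                             - gamma * (w @ phi_sa) * phi_next_greedy)

    # Fast (auxiliary) update: w tracks the projected TD error.
    w = w + beta * (delta - w @ phi_sa) * phi_sa

    return theta, w


# Example usage with illustrative dimensions and step sizes.
d, num_actions = 8, 4
rng = np.random.default_rng(0)
theta, w = np.zeros(d), np.zeros(d)
theta, w = greedy_gq_step(theta, w, rng.standard_normal(d),
                          rng.standard_normal((num_actions, d)),
                          reward=1.0, gamma=0.99, alpha=0.01, beta=0.1)
```

In this sketch, `w` is updated with the larger step size `beta` (the fast timescale) while `theta` uses the smaller step size `alpha` (the slow timescale); choosing these two step sizes is the practical question that the finite-sample analysis above is meant to guide.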

| Search Query: ArXiv Query: search_query=au:"Yi Zhou"&id_list=&start=0&max_results=3
