Provably efficient neural GTD algorithm for off-policy learning

Hoi To Wai, Zhuoran Yang, Zhaoran Wang, Mingyi Hong

Research output: Contribution to journal › Conference article › peer-review

Abstract

This paper studies a gradient temporal difference (GTD) algorithm using neural network (NN) function approximators to minimize the mean squared Bellman error (MSBE). For off-policy learning, we show that the minimum MSBE problem can be recast into a min-max optimization involving a pair of over-parameterized primal-dual NNs. The resultant formulation can then be tackled using a neural GTD algorithm. We analyze the convergence of the proposed algorithm with a 2-layer ReLU NN architecture using m neurons and prove that it computes an approximate optimal solution to the minimum MSBE problem as m → ∞.
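
As a rough illustration of the min-max recasting mentioned above, the sketch below writes out one standard way to turn an MSBE objective into a saddle-point problem, using the identity x^2/2 = max_y (xy - y^2/2). The notation is assumed for illustration and is not taken from the paper: V_θ is the primal (value) network, f_w the dual network, μ the state distribution, b the behavior policy, π the target policy, ρ the importance ratio and γ the discount factor. The paper's exact formulation, weighting and network scaling may differ.

```latex
% Off-policy Bellman error at state s (assumed notation), with a ~ b(.|s):
\delta_\theta(s) = \mathbb{E}\big[\rho(s,a)\,\big(r(s,a) + \gamma V_\theta(s') - V_\theta(s)\big) \,\big|\, s\big],
\qquad \rho(s,a) = \frac{\pi(a \mid s)}{b(a \mid s)}

% Mean squared Bellman error:
\mathrm{MSBE}(\theta) = \tfrac{1}{2}\, \mathbb{E}_{s \sim \mu}\big[\delta_\theta(s)^2\big]

% Applying x^2/2 = \max_y (xy - y^2/2) pointwise in s gives a primal-dual form:
\min_\theta \max_w \; \mathbb{E}_{s \sim \mu,\, a \sim b,\, s'}\Big[\rho(s,a)\,\big(r(s,a) + \gamma V_\theta(s') - V_\theta(s)\big)\, f_w(s) - \tfrac{1}{2} f_w(s)^2\Big]

% A common 2-layer ReLU parametrization with m neurons (assumed scaling):
f_w(s) = \frac{1}{\sqrt{m}} \sum_{i=1}^{m} b_i \max\{0,\, w_i^\top s\}
```

The inner maximum matches the MSBE exactly only when the dual class can represent δ_θ itself, which is one reason an over-parameterized dual network is a natural choice here.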

Original language: English (US)
Journal: Advances in Neural Information Processing Systems
Volume: 2020-December
State: Published - 2020
Event: 34th Conference on Neural Information Processing Systems, NeurIPS 2020 - Virtual, Online
Duration: Dec 6 2020 → Dec 12 2020

Bibliographical note

Funding Information:
Acknowledgement & Funding Disclosure: The authors would like to thank Mr. Alan Lun (CUHK) for conducting the preliminary numerical experiments in this paper. H.-T. Wai is supported by the CUHK Direct Grant #4055113. M. Hong is supported in part by NSF under Grants CCF-1651825, CMMI-172775, and CIF-1910385, and by AFOSR under Grant 19RT0424.

Publisher Copyright:
© 2020 Neural information processing systems foundation. All rights reserved.
