Convergence of simulation-based policy iteration

William L. Cooper; Shane G. Henderson; Mark E. Lewis

doi:10.1017/S0269964803172051

Industrial and Systems Engineering

Research output: Contribution to journal › Article › peer-review

29 Scopus citations

Abstract

Simulation-based policy iteration (SBPI) is a modification of the policy iteration algorithm for computing optimal policies for Markov decision processes. At each iteration, rather than solving the average evaluation equations, SBPI employs simulation to estimate a solution to these equations. For recurrent average-reward Markov decision processes with finite state and action spaces, we provide easily verifiable conditions that ensure that simulation-based policy iteration almost surely eventually never leaves the set of optimal decision rules. We analyze three simulation estimators for solutions to the average evaluation equations. Using our general results, we derive simple conditions on the simulation run lengths that guarantee the almost-sure convergence of the algorithm.
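The loop described in the abstract can be illustrated with a minimal Python sketch. It is not taken from the article: the estimator below (a single simulated path, with the bias of each state averaged over the centered rewards collected until the next visit to a reference state) is only one plausible choice and is not claimed to be any of the three estimators the authors analyze. The function names, the reference state `ref`, the assumption that the transition probabilities `P` and rewards `R` are known in the improvement step, and the run-length schedule are all assumptions made for illustration.

```python
import random


def estimate_gain_and_bias(P, R, policy, run_length, rng, ref=0):
    """Estimate the gain g and a relative bias vector h (with h[ref] = 0)
    for a fixed policy from one simulated trajectory (illustrative estimator)."""
    n = len(P)
    states, rewards = [], []
    s = ref
    for _ in range(run_length):
        a = policy[s]
        states.append(s)
        rewards.append(R[s][a])
        s = rng.choices(range(n), weights=P[s][a])[0]
    g = sum(rewards) / run_length  # long-run average reward estimate

    # Backward pass: for every visit to a state other than ref, accumulate the
    # centered rewards (r - g) from that visit until the next visit to ref.
    # Visits after the last recorded visit to ref are discarded.
    totals, counts = [0.0] * n, [0] * n
    tail, reaches_ref = 0.0, False
    for t in range(run_length - 1, -1, -1):
        if states[t] == ref:
            tail, reaches_ref = 0.0, True
        else:
            tail += rewards[t] - g
            if reaches_ref:
                totals[states[t]] += tail
                counts[states[t]] += 1
    # States with no usable visit default to 0.0 (an assumption of this sketch).
    h = [totals[i] / counts[i] if counts[i] else 0.0 for i in range(n)]
    return g, h


def improve(P, R, h, old_policy):
    """Policy improvement step using the estimated bias h."""
    n = len(P)
    new_policy = []
    for s in range(n):
        scores = [R[s][a] + sum(P[s][a][j] * h[j] for j in range(n))
                  for a in range(len(R[s]))]
        best = max(scores)
        # Keep the current action when it is already maximizing.
        a_old = old_policy[s]
        new_policy.append(a_old if scores[a_old] >= best - 1e-12
                          else scores.index(best))
    return new_policy


def sbpi(P, R, run_lengths, seed=0):
    """Run one simulate-then-improve step per entry of run_lengths."""
    rng = random.Random(seed)
    policy = [0] * len(P)
    for m in run_lengths:
        _, h = estimate_gain_and_bias(P, R, policy, m, rng)
        policy = improve(P, R, h, policy)
    return policy


if __name__ == "__main__":
    # Hypothetical 2-state, 2-action MDP; P[s][a][j] is the probability of
    # moving from s to j under action a, R[s][a] is the one-step reward.
    P = [[[0.9, 0.1], [0.2, 0.8]], [[0.5, 0.5], [0.1, 0.9]]]
    R = [[1.0, 0.0], [2.0, 3.0]]
    print(sbpi(P, R, run_lengths=[200, 400, 800, 1600]))
```

The growing run lengths in the usage example are an arbitrary schedule chosen for the sketch; the conditions on run lengths that actually guarantee almost-sure convergence are those derived in the article.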

Original language	English (US)
Pages (from-to)	213-234
Number of pages	22
Journal	Probability in the Engineering and Informational Sciences
Volume	17
Issue number	2
DOIs	https://doi.org/10.1017/S0269964803172051
State	Published - 2003

Cite this

@article{eea60e42ff514e6ebc4ef670e75f35b9,

title = "Convergence of simulation-based policy iteration",

abstract = "Simulation-based policy iteration (SBPI) is a modification of the policy iteration algorithm for computing optimal policies for Markov decision processes. At each iteration, rather than solving the average evaluation equations, SBPI employs simulation to estimate a solution to these equations. For recurrent average-reward Markov decision processes with finite state and action spaces, we provide easily verifiable conditions that ensure that simulation-based policy iteration almost surely eventually never leaves the set of optimal decision rules. We analyze three simulation estimators for solutions to the average evaluation equations. Using our general results, we derive simple conditions on the simulation run lengths that guarantee the almost-sure convergence of the algorithm.",

author = "Cooper, {William L.} and Henderson, {Shane G.} and Lewis, {Mark E.}",

year = "2003",

doi = "10.1017/S0269964803172051",

language = "English (US)",

volume = "17",

pages = "213--234",

journal = "Probability in the Engineering and Informational Sciences",

issn = "0269-9648",

publisher = "Cambridge University Press",

number = "2",

}

TY - JOUR

T1 - Convergence of simulation-based policy iteration

AU - Cooper, William L.

AU - Henderson, Shane G.

AU - Lewis, Mark E.

PY - 2003

Y1 - 2003

N2 - Simulation-based policy iteration (SBPI) is a modification of the policy iteration algorithm for computing optimal policies for Markov decision processes. At each iteration, rather than solving the average evaluation equations, SBPI employs simulation to estimate a solution to these equations. For recurrent average-reward Markov decision processes with finite state and action spaces, we provide easily verifiable conditions that ensure that simulation-based policy iteration almost surely eventually never leaves the set of optimal decision rules. We analyze three simulation estimators for solutions to the average evaluation equations. Using our general results, we derive simple conditions on the simulation run lengths that guarantee the almost-sure convergence of the algorithm.

AB - Simulation-based policy iteration (SBPI) is a modification of the policy iteration algorithm for computing optimal policies for Markov decision processes. At each iteration, rather than solving the average evaluation equations, SBPI employs simulation to estimate a solution to these equations. For recurrent average-reward Markov decision processes with finite state and action spaces, we provide easily verifiable conditions that ensure that simulation-based policy iteration almost surely eventually never leaves the set of optimal decision rules. We analyze three simulation estimators for solutions to the average evaluation equations. Using our general results, we derive simple conditions on the simulation run lengths that guarantee the almost-sure convergence of the algorithm.

UR - http://www.scopus.com/inward/record.url?scp=0038380746&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=0038380746&partnerID=8YFLogxK

U2 - 10.1017/S0269964803172051

DO - 10.1017/S0269964803172051

M3 - Article

AN - SCOPUS:0038380746

SN - 0269-9648

VL - 17

SP - 213

EP - 234

JO - Probability in the Engineering and Informational Sciences

JF - Probability in the Engineering and Informational Sciences

IS - 2

ER -
