A smoothed analysis of the greedy algorithm for the linear contextual bandit problem

Sampath Kannan, Jamie Morgenstern, Aaron Roth, Bo Waggoner, Zhiwei Steven Wu

Research output: Contribution to journal › Conference article › peer-review

38 Scopus citations

Abstract

Bandit learning is characterized by the tension between long-term exploration and short-term exploitation. However, as has recently been noted, in settings in which the choices of the learning algorithm correspond to important decisions about individual people (such as criminal recidivism prediction, lending, and sequential drug trials), exploration corresponds to explicitly sacrificing the well-being of one individual for the potential future benefit of others. In such settings, one might like to run a “greedy” algorithm, which always makes the optimal decision for the individuals at hand - but doing this can result in a catastrophic failure to learn. In this paper, we consider the linear contextual bandit problem and revisit the performance of the greedy algorithm. We give a smoothed analysis, showing that even when contexts may be chosen by an adversary, small perturbations of the adversary's choices suffice for the algorithm to achieve “no regret”, perhaps (depending on the specifics of the setting) with a constant amount of initial training data. This suggests that in slightly perturbed environments, exploration and exploitation need not be in conflict in the linear setting.
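To make the setup concrete, below is a minimal simulation sketch of the greedy algorithm under smoothed contexts: an adversary picks base contexts, Gaussian noise perturbs them, and the learner acts purely greedily on per-arm least-squares estimates. All specifics here (dimensions, noise scales, the per-arm linear model, the ridge term standing in for initial training data) are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 5        # context dimension (assumption)
K = 3        # number of arms (assumption)
T = 5000     # rounds
sigma = 0.5  # smoothing: std of the Gaussian perturbation of adversarial contexts
noise = 0.1  # std of reward noise

# Hidden linear reward parameters, one vector per arm (illustrative model).
theta = rng.normal(size=(K, d))

# Per-arm least-squares state: a small ridge term keeps A invertible and
# plays the role of a constant amount of initial training data.
A = np.stack([np.eye(d) for _ in range(K)])
b = np.zeros((K, d))

regret = 0.0
for t in range(T):
    # Adversarial base contexts (a uniform draw stands in for the adversary),
    # then the smoothing perturbation on top.
    base = rng.uniform(-1, 1, size=(K, d))
    x = base + sigma * rng.normal(size=(K, d))

    # Greedy: act on the current estimates, with no added exploration.
    theta_hat = np.array([np.linalg.solve(A[k], b[k]) for k in range(K)])
    k = int(np.argmax(np.einsum('kd,kd->k', theta_hat, x)))

    # Observe a noisy reward and update the chosen arm's statistics.
    r = x[k] @ theta[k] + noise * rng.normal()
    A[k] += np.outer(x[k], x[k])
    b[k] += r * x[k]

    regret += np.max(np.einsum('kd,kd->k', theta, x)) - x[k] @ theta[k]

print(f"average per-round regret after {T} rounds: {regret / T:.4f}")
```

In this sketch, setting sigma to 0 recovers the unperturbed adversarial case where greedy can fail to learn; with sigma greater than 0, the perturbed contexts supply the diversity that lets the least-squares estimates converge without deliberate exploration.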

Original language: English (US)
Pages (from-to): 2227-2236
Number of pages: 10
Journal: Advances in Neural Information Processing Systems
Volume: 2018-December
State: Published - 2018
Event: 32nd Conference on Neural Information Processing Systems, NeurIPS 2018 - Montreal, Canada
Duration: Dec 2 2018 - Dec 8 2018

Bibliographical note

Publisher Copyright:
© 2018 Curran Associates Inc. All rights reserved.

