Adaptive Discretization in Online Reinforcement Learning

Sean R. Sinclair
Corresponding Author
Sean R. Sinclair
[email protected]
https://orcid.org/0000-0002-7011-8253
Operations Research and Information Engineering, Cornell University, Ithaca, New York 14853
Search for more papers by this author
,
Siddhartha Banerjee
Siddhartha Banerjee
[email protected]
https://orcid.org/0000-0002-8954-4578
Operations Research and Information Engineering, Cornell University, Ithaca, New York 14853
Search for more papers by this author
,
Christina Lee Yu
Christina Lee Yu
[email protected]
https://orcid.org/0000-0002-2165-5220
Operations Research and Information Engineering, Cornell University, Ithaca, New York 14853
Search for more papers by this author

Sean R. Sinclair

Corresponding Author

Sean R. Sinclair

[email protected]

https://orcid.org/0000-0002-7011-8253

Operations Research and Information Engineering, Cornell University, Ithaca, New York 14853

Search for more papers by this author

Siddhartha Banerjee

[email protected]

https://orcid.org/0000-0002-8954-4578

Operations Research and Information Engineering, Cornell University, Ithaca, New York 14853

Search for more papers by this author

Christina Lee Yu

[email protected]

https://orcid.org/0000-0002-2165-5220

Operations Research and Information Engineering, Cornell University, Ithaca, New York 14853

Search for more papers by this author

Published Online:11 Nov 2022https://doi.org/10.1287/opre.2022.2396

References

Agarwal A, Jiang N, Kakade SM (2019) Reinforcement learning: Theory and algorithms. Technical report, University of Washington, Seattle.Google Scholar
Alizadeh M, Yang S, Sharif M, Katti S, McKeown N, Prabhakar B, Shenker S (2013) pFabric: Minimal near-optimal datacenter transport. Comput. Comm. Rev. 43(4):435–446.Crossref, Google Scholar
Alizadeh M, Greenberg A, Maltz D, Padhye J, Patel P, Prabhakar B, Sengupta S, Sridharan M (2010) DCTCP: Efficient packet transport for the commoditized data center (ACM SIGCOMM). https://www.microsoft.com/en-us/research/publication/dctcp-efficient-packet-transport-for-the-commoditized-data-center/.Google Scholar
Araújo JP, Figueiredo MA, Botto MA (2020a) Control with adaptive Q-learning. Preprint, submitted November 3, https://arxiv.org/abs/2011.02141.Google Scholar
Araújo JP, Figueiredo M, Botto MA (2020b) Single-partition adaptive Q-learning. Preprint, submitted July 14, https://arxiv.org/abs/2007.06741.Google Scholar
Azar MG, Munos R, Kappen HJ (2013) Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model. Machine Learn. 91(3):325–349.Crossref, Google Scholar
Azar MG, Osband I, Munos R (2017) Minimax regret bounds for reinforcement learning. Proc. 34th Internat. Conf. Machine Learn., vol. 70, 263–272.Google Scholar
Bhandari J, Russo D (2021) On the linear convergence of policy gradient methods for finite MDPs. Banerjee A, Fukumizu K, eds. Proc. 24th Internat. Conf. Artificial Intelligence Statist., vol. 130 (PMLR), 2386–2394.Google Scholar
Bubeck S, Cesa-Bianchi N (2012) Regret Analysis of Stochastic and Nonstochastic Multi-Armed Bandit Problems, Foundations and Trends in Machine Learning, vol. 5 (now Publishers Inc., Norwell, MA).Google Scholar
Cao T, Krishnamurthy A (2020) Provably adaptive reinforcement learning in metric spaces. Adv. Neural Inform. Processing Systems 33:9736–9744.Google Scholar
Chen S, Zhang B (2021) Estimating and improving dynamic treatment regimes with a time-varying instrumental variable. Preprint, submitted April 15, https://arxiv.org/abs/2104.07822.Google Scholar
Chinchali S, Hu P, Chu T, Sharma M, Bansal M, Misra R, Pavone M, Katti S (2018) Cellular network traffic scheduling with deep reinforcement learning. 32nd AAAI Conf. Artificial Intelligence.Google Scholar
Domingues OD, Ménard P, Pirotta M, Kaufmann E, Valko M (2020) Regret bounds for kernel-based reinforcement learning. Preprint, submitted April 12, https://arxiv.org/abs/2004.05599.Google Scholar
Du SS, Kakade SM, Wang R, Yang LF (2020) Is a good representation sufficient for sample efficient reinforcement learning? Internat. Conf. Learn. Representations.Google Scholar
Du SS, Luo Y, Wang R, Zhang H (2019) Provably efficient Q-Learning with function approximation via distribution shift error checking oracle. Adv. Neural Inform. Processing Systems 33:8058–8068.Google Scholar
Efroni Y, Merlis N, Ghavamzadeh M, Mannor S (2019) Tight regret bounds for model-based reinforcement learning with greedy policies. Adv. Neural Inform. Processing Systems 33:12203–12213.Google Scholar
Henaff M (2019) Explicit explore-exploit algorithms in continuous state spaces. Adv. Neural Inform. Processing Systems 32:9372–9382.Google Scholar
Hubbs CD, Perez HD, Sarwar O, Sahinidis NV, Grossmann IE, Wassick JM (2020) OR-gym: A reinforcement learning library for operations research problems. Preprint, submitted August 14, https://arxiv.org/abs/2008.06319.Google Scholar
Ipek E, Mutlu O, Martínez JF, Caruana R (2008) Self-optimizing memory controllers: A reinforcement learning approach. SIGARCH Comput. Architecture. News 36(3):39–50.Crossref, Google Scholar
Jaksch T, Ortner R, Auer P (2010) Near-optimal regret bounds for reinforcement learning. J. Machine Learn. Res. 11:1563–1600.Google Scholar
Jiang N, Krishnamurthy A, Agarwal A, Langford J, Schapire RE (2017) Contextual decision processes with low Bellman rank are PAC-learnable. Internat. Conf. Machine Learn. (PMLR), 1704–1713.Google Scholar
Jin C, Allen-Zhu Z, Bubeck S, Jordan MI (2018) Is Q-learning provably efficient? Bengio S, Wallach H, Larochelle H, Grauman K, Cesa-Bianchi N, Garnett R, ed. Advances in Neural Information Processing Systems, vol. 31 (Curran Associates, Inc., Red Hook, NY), 4863–4873.Google Scholar
Jin C, Yang Z, Wang Z, Jordan MI (2020) Provably efficient reinforcement learning with linear function approximation. Conf. Learn. Theory (PMLR), 2137–2143.Google Scholar
Kakade S, Kearns MJ, Langford J (2003) Exploration in metric state spaces. Proc. 20th Internat. Conf. Machine Learn., 306–312.Google Scholar
Keller PW, Mannor S, Precup D (2006) Automatic basis function construction for approximate dynamic programming and reinforcement learning. Proc. 23rd Internat. Conf. Machine Learn., 449–456.Google Scholar
Kleinberg R, Slivkins A, Upfal E (2019) Bandits and experts in metric spaces. J. ACM 66(4):1–77.Google Scholar
Lakshmanan K, Ortner R, Ryabko D (2015) Improved regret bounds for undiscounted continuous reinforcement learning. Internat. Conf. Machine Learn., 524–532.Google Scholar
Le Lan C, Bellemare MG, Castro PS (2021) Metrics and continuity in reinforcement learning. Proc. Conf. AAAI Artificial Intelligence, vol. 35, 8261–8269.Google Scholar
Lolos K, Konstantinou I, Kantere V, Koziris N (2017) Adaptive state space partitioning of Markov decision processes for elastic resource management. IEEE 33rd Internat. Conf. Data Engrg., 191–194.Google Scholar
Lykouris T, Vassilvitskii S (2018) Competitive caching with machine learned advice. Preprint, submitted February 15, https://arxiv.org/abs/1802.05399.Google Scholar
Menache I, Mannor S, Shimkin N (2005) Basis function adaptation in temporal difference reinforcement learning. Ann. Oper. Res. 134(1):215–238.Crossref, Google Scholar
Mnih V, Kavukcuoglu K, Silver D, Graves A, Antonoglou I, Wierstra D, Riedmiller M (2013) Playing Atari with deep reinforcement learning. Preprint, submitted December 19, https://arxiv.org/abs/1312.5602.Google Scholar
Mnih V, Badia AP, Mirza M, Graves A, Lillicrap T, Harley T, Silver D, Kavukcuoglu K (2016) Asynchronous methods for deep reinforcement learning. Internat. Conf. Machine Learn., 1928–1937.Google Scholar
Mozur P (2017) Google’s Alphago defeats Chinese Go master in win for A.I. The New York Times Online (May 23), https://www.nytimes.com/2017/05/23/business/google-deepmind-alphago-go-champion-defeat.html.Google Scholar
Munos R (2014) From bandits to Monte-Carlo tree search: The optimistic principle applied to optimization and planning. Foundations and Trends in Machine Learning, vol. 7 (now Publishers Inc., Norwell, MA), 1–129.Google Scholar
Nishtala R, Fugal H, Grimm S, Kwiatkowski M, Lee H, Li HC, McElroy R, et al. (2013) Scaling memcache at Facebook. Proc. 10th Sympos. Networked Systems Design Implementation, vol. 13, 385–398.Google Scholar
Osband I, Van Roy B (2014) Model-based reinforcement learning and the eluder dimension. Adv. Neural Inform. Processing Systems 27:1466–1474.Google Scholar
Powell WB (2022) Reinforcement Learning and Stochastic Optimization: A Unified Framework for Sequential Decisions (Wiley-Blackwell, Hoboken, NJ).Google Scholar
Puterman ML (1994) Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1st ed. (John Wiley & Sons, Inc., New York).Crossref, Google Scholar
Pyeatt LD, Howe AE (2001) Decision tree function approximation in reinforcement learning. Proc. Third Internat. Sympos. Adaptive Systems: Evolutionary Comput. Probabilistic Graphical Models, vol. 2, 70–77.Google Scholar
Royden HL, Fitzpatrick P (1988) Real Analysis, vol. 32 (Macmillan, New York).Google Scholar
Russo D, Van Roy B (2013) Eluder dimension and the sample complexity of optimistic exploration. Adv. Neural Inform. Processing Systems 26:2256–2264.Google Scholar
Shah D, Xie Q (2018) Q-Learning with nearest neighbors. Adv. Neural Inform. Processing Systems 31:3111–3121.Google Scholar
Shah D, Song D, Xu Z, Yang Y (2020) Sample efficient reinforcement learning via low-rank matrix estimation. Adv. Neural Inform. Processing Systems 33:12092–12103.Google Scholar
Silver D, Huang A, Maddison CJ, Guez A, Sifre L, Van Den Driessche G, Schrittwieser J, et al. (2016) Mastering the game of Go with deep neural networks and tree search. Nature 529(7587):484–489.Google Scholar
Silver D, Hubert T, Schrittwieser J, Antonoglou I, Lai M, Guez A, Lanctot M, et al. (2017) Mastering chess and shogi by self-play with a general reinforcement learning algorithm. Preprint, submitted December 5, https://arxiv.org/abs/1712.01815.Google Scholar
Simchowitz M, Jamieson KG (2019) Non-asymptotic gap-dependent regret bounds for tabular MDPs. Adv. Neural Inform. Processing Systems 33:1151–1160.Google Scholar
Sinclair SR, Banerjee S, Yu CL (2019) Adaptive discretization for episodic reinforcement learning in metric spaces. Proc. ACM Measurement Anal. Comput. Systems, vol. 3, 1–44.Google Scholar
Sinclair SR, Wang T, Jain G, Banerjee S, Yu C (2020) Adaptive discretization for model-based reinforcement learning. Adv. Neural Inform. Processing Systems 34:33.Google Scholar
Singh SP, Jaakkola T, Jordan MI (1995) Reinforcement learning with soft state aggregation. Adv. Neural Inform. Processing Systems 7:361–368.Google Scholar
Slivkins A (2014) Contextual bandits with similarity information. J. Machine Learn. Res. 15(1):2533–2568.Google Scholar
Slivkins A (2019) Introduction to multi-armed bandits. Foundations and Trends in Machine Learning, vol. 12 (now Publishers Inc., Norwell, MA).Google Scholar
Song Z, Sun W (2019) Efficient model-free reinforcement learning in metric spaces. Preprint, submitted May 1, https://arxiv.org/abs/1905.00475.Google Scholar
Sutton RS, Barto AG (2018) Reinforcement Learning: An Introduction (MIT Press, Cambridge, MA).Google Scholar
Tessler C, Shpigelman Y, Dalal G, Mandelbaum A, Haritan Kazakov D, Fuhrer B, Chechik G, Mannor S (2022) Reinforcement learning for datacenter congestion control. Performance Evaluation Rev. 49(2):43–46.Crossref, Google Scholar
Uther WT, Veloso MM (1998) Tree based discretization for continuous state space reinforcement learning. AAAI/IAAI, 769–774.Google Scholar
Wang R, Salakhutdinov R, Yang LF (2020) Provably efficient reinforcement learning with general value function approximation. Preprint, submitted May 21, https://arxiv.org/abs/2005.10804.Google Scholar
Wang Y, Wang R, Du SS, Krishnamurthy A (2019) Optimism in reinforcement learning with generalized linear function approximation. Preprint, submitted December 9, https://arxiv.org/abs/1912.04136.Google Scholar
Wanigasekara N, Yu C (2019) Nonparametric contextual bandits in metric spaces with unknown metric. Adv. Neural Inform. Processing Systems 32:14657–14667.Google Scholar
Watkins CJCH (1989) Learning from delayed rewards. PhD thesis, Cambridge University, Cambridge, UK.Google Scholar
Whiteson S, Stone P (2006) Evolutionary function approximation for reinforcement learning. J. Machine Learn. Res. 7:877–917.Google Scholar
Yang LF, Ni C, Wang M (2019) Learning to control in metric space with optimal regret. Preprint, submitted May 5, https://arxiv.org/abs/1905.01576.Google Scholar
Zanette A, Brunskill E (2019) Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. Preprint, submitted January 1, https://arxiv.org/abs/1901.00210.Google Scholar
Zanette A, Lazaric A, Kochenderfer MJ, Brunskill E (2019) Limiting extrapolation in linear approximate value iteration. Adv. Neural Inform. Processing Systems 32:5616–5625.Google Scholar

Volume 71, Issue 5

September-October 2023

Pages iii-vii, 1441-1924, C2-C3

Article Information

Supplemental Material

Metrics

Information

Received:May 31, 2021
Accepted:September 29, 2022
Published Online:November 11, 2022

Cite as

Sean R. Sinclair, Siddhartha Banerjee, Christina Lee Yu (2022) Adaptive Discretization in Online Reinforcement Learning. Operations Research 71(5):1636-1652.

https://doi.org/10.1287/opre.2022.2396

Keywords

Acknowledgments

Part of this work was done when Sean Sinclair and Christina Yu were visiting the Simons Institute for the Theory of Computing for the semester on the Theory of Reinforcement Learning.

PDF download

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Adaptive Discretization in Online Reinforcement Learning

References

Volume 71, Issue 5

Article Information

Supplemental Material

Metrics

Information

Cite as

Keywords

Sign Up for INFORMS Publications Updates and News