Adaptive Discretization in Online Reinforcement Learning

Published Online:https://doi.org/10.1287/opre.2022.2396

References

  • Agarwal A, Jiang N, Kakade SM (2019) Reinforcement learning: Theory and algorithms. Technical report, University of Washington, Seattle.Google Scholar
  • Alizadeh M, Yang S, Sharif M, Katti S, McKeown N, Prabhakar B, Shenker S (2013) pFabric: Minimal near-optimal datacenter transport. Comput. Comm. Rev. 43(4):435–446.CrossrefGoogle Scholar
  • Alizadeh M, Greenberg A, Maltz D, Padhye J, Patel P, Prabhakar B, Sengupta S, Sridharan M (2010) DCTCP: Efficient packet transport for the commoditized data center (ACM SIGCOMM). https://www.microsoft.com/en-us/research/publication/dctcp-efficient-packet-transport-for-the-commoditized-data-center/.Google Scholar
  • Araújo JP, Figueiredo MA, Botto MA (2020a) Control with adaptive Q-learning. Preprint, submitted November 3, https://arxiv.org/abs/2011.02141.Google Scholar
  • Araújo JP, Figueiredo M, Botto MA (2020b) Single-partition adaptive Q-learning. Preprint, submitted July 14, https://arxiv.org/abs/2007.06741.Google Scholar
  • Azar MG, Munos R, Kappen HJ (2013) Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model. Machine Learn. 91(3):325–349.CrossrefGoogle Scholar
  • Azar MG, Osband I, Munos R (2017) Minimax regret bounds for reinforcement learning. Proc. 34th Internat. Conf. Machine Learn., vol. 70, 263–272.Google Scholar
  • Bhandari J, Russo D (2021) On the linear convergence of policy gradient methods for finite MDPs. Banerjee A, Fukumizu K, eds. Proc. 24th Internat. Conf. Artificial Intelligence Statist., vol. 130 (PMLR), 2386–2394.Google Scholar
  • Bubeck S, Cesa-Bianchi N (2012) Regret Analysis of Stochastic and Nonstochastic Multi-Armed Bandit Problems, Foundations and Trends in Machine Learning, vol. 5 (now Publishers Inc., Norwell, MA).Google Scholar
  • Cao T, Krishnamurthy A (2020) Provably adaptive reinforcement learning in metric spaces. Adv. Neural Inform. Processing Systems 33:9736–9744.Google Scholar
  • Chen S, Zhang B (2021) Estimating and improving dynamic treatment regimes with a time-varying instrumental variable. Preprint, submitted April 15, https://arxiv.org/abs/2104.07822.Google Scholar
  • Chinchali S, Hu P, Chu T, Sharma M, Bansal M, Misra R, Pavone M, Katti S (2018) Cellular network traffic scheduling with deep reinforcement learning. 32nd AAAI Conf. Artificial Intelligence.Google Scholar
  • Domingues OD, Ménard P, Pirotta M, Kaufmann E, Valko M (2020) Regret bounds for kernel-based reinforcement learning. Preprint, submitted April 12, https://arxiv.org/abs/2004.05599.Google Scholar
  • Du SS, Kakade SM, Wang R, Yang LF (2020) Is a good representation sufficient for sample efficient reinforcement learning? Internat. Conf. Learn. Representations.Google Scholar
  • Du SS, Luo Y, Wang R, Zhang H (2019) Provably efficient Q-Learning with function approximation via distribution shift error checking oracle. Adv. Neural Inform. Processing Systems 33:8058–8068.Google Scholar
  • Efroni Y, Merlis N, Ghavamzadeh M, Mannor S (2019) Tight regret bounds for model-based reinforcement learning with greedy policies. Adv. Neural Inform. Processing Systems 33:12203–12213.Google Scholar
  • Henaff M (2019) Explicit explore-exploit algorithms in continuous state spaces. Adv. Neural Inform. Processing Systems 32:9372–9382.Google Scholar
  • Hubbs CD, Perez HD, Sarwar O, Sahinidis NV, Grossmann IE, Wassick JM (2020) OR-gym: A reinforcement learning library for operations research problems. Preprint, submitted August 14, https://arxiv.org/abs/2008.06319.Google Scholar
  • Ipek E, Mutlu O, Martínez JF, Caruana R (2008) Self-optimizing memory controllers: A reinforcement learning approach. SIGARCH Comput. Architecture. News 36(3):39–50.CrossrefGoogle Scholar
  • Jaksch T, Ortner R, Auer P (2010) Near-optimal regret bounds for reinforcement learning. J. Machine Learn. Res. 11:1563–1600.Google Scholar
  • Jiang N, Krishnamurthy A, Agarwal A, Langford J, Schapire RE (2017) Contextual decision processes with low Bellman rank are PAC-learnable. Internat. Conf. Machine Learn. (PMLR), 1704–1713.Google Scholar
  • Jin C, Allen-Zhu Z, Bubeck S, Jordan MI (2018) Is Q-learning provably efficient? Bengio S, Wallach H, Larochelle H, Grauman K, Cesa-Bianchi N, Garnett R, ed. Advances in Neural Information Processing Systems, vol. 31 (Curran Associates, Inc., Red Hook, NY), 4863–4873.Google Scholar
  • Jin C, Yang Z, Wang Z, Jordan MI (2020) Provably efficient reinforcement learning with linear function approximation. Conf. Learn. Theory (PMLR), 2137–2143.Google Scholar
  • Kakade S, Kearns MJ, Langford J (2003) Exploration in metric state spaces. Proc. 20th Internat. Conf. Machine Learn., 306–312.Google Scholar
  • Keller PW, Mannor S, Precup D (2006) Automatic basis function construction for approximate dynamic programming and reinforcement learning. Proc. 23rd Internat. Conf. Machine Learn., 449–456.Google Scholar
  • Kleinberg R, Slivkins A, Upfal E (2019) Bandits and experts in metric spaces. J. ACM 66(4):1–77.Google Scholar
  • Lakshmanan K, Ortner R, Ryabko D (2015) Improved regret bounds for undiscounted continuous reinforcement learning. Internat. Conf. Machine Learn., 524–532.Google Scholar
  • Le Lan C, Bellemare MG, Castro PS (2021) Metrics and continuity in reinforcement learning. Proc. Conf. AAAI Artificial Intelligence, vol. 35, 8261–8269.Google Scholar
  • Lolos K, Konstantinou I, Kantere V, Koziris N (2017) Adaptive state space partitioning of Markov decision processes for elastic resource management. IEEE 33rd Internat. Conf. Data Engrg., 191–194.Google Scholar
  • Lykouris T, Vassilvitskii S (2018) Competitive caching with machine learned advice. Preprint, submitted February 15, https://arxiv.org/abs/1802.05399.Google Scholar
  • Menache I, Mannor S, Shimkin N (2005) Basis function adaptation in temporal difference reinforcement learning. Ann. Oper. Res. 134(1):215–238.CrossrefGoogle Scholar
  • Mnih V, Kavukcuoglu K, Silver D, Graves A, Antonoglou I, Wierstra D, Riedmiller M (2013) Playing Atari with deep reinforcement learning. Preprint, submitted December 19, https://arxiv.org/abs/1312.5602.Google Scholar
  • Mnih V, Badia AP, Mirza M, Graves A, Lillicrap T, Harley T, Silver D, Kavukcuoglu K (2016) Asynchronous methods for deep reinforcement learning. Internat. Conf. Machine Learn., 1928–1937.Google Scholar
  • Mozur P (2017) Google’s Alphago defeats Chinese Go master in win for A.I. The New York Times Online (May 23), https://www.nytimes.com/2017/05/23/business/google-deepmind-alphago-go-champion-defeat.html.Google Scholar
  • Munos R (2014) From bandits to Monte-Carlo tree search: The optimistic principle applied to optimization and planning. Foundations and Trends in Machine Learning, vol. 7 (now Publishers Inc., Norwell, MA), 1–129.Google Scholar
  • Nishtala R, Fugal H, Grimm S, Kwiatkowski M, Lee H, Li HC, McElroy R, et al. (2013) Scaling memcache at Facebook. Proc. 10th Sympos. Networked Systems Design Implementation, vol. 13, 385–398.Google Scholar
  • Osband I, Van Roy B (2014) Model-based reinforcement learning and the eluder dimension. Adv. Neural Inform. Processing Systems 27:1466–1474.Google Scholar
  • Powell WB (2022) Reinforcement Learning and Stochastic Optimization: A Unified Framework for Sequential Decisions (Wiley-Blackwell, Hoboken, NJ).Google Scholar
  • Puterman ML (1994) Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1st ed. (John Wiley & Sons, Inc., New York).CrossrefGoogle Scholar
  • Pyeatt LD, Howe AE (2001) Decision tree function approximation in reinforcement learning. Proc. Third Internat. Sympos. Adaptive Systems: Evolutionary Comput. Probabilistic Graphical Models, vol. 2, 70–77.Google Scholar
  • Royden HL, Fitzpatrick P (1988) Real Analysis, vol. 32 (Macmillan, New York).Google Scholar
  • Russo D, Van Roy B (2013) Eluder dimension and the sample complexity of optimistic exploration. Adv. Neural Inform. Processing Systems 26:2256–2264.Google Scholar
  • Shah D, Xie Q (2018) Q-Learning with nearest neighbors. Adv. Neural Inform. Processing Systems 31:3111–3121.Google Scholar
  • Shah D, Song D, Xu Z, Yang Y (2020) Sample efficient reinforcement learning via low-rank matrix estimation. Adv. Neural Inform. Processing Systems 33:12092–12103.Google Scholar
  • Silver D, Huang A, Maddison CJ, Guez A, Sifre L, Van Den Driessche G, Schrittwieser J, et al. (2016) Mastering the game of Go with deep neural networks and tree search. Nature 529(7587):484–489.Google Scholar
  • Silver D, Hubert T, Schrittwieser J, Antonoglou I, Lai M, Guez A, Lanctot M, et al. (2017) Mastering chess and shogi by self-play with a general reinforcement learning algorithm. Preprint, submitted December 5, https://arxiv.org/abs/1712.01815.Google Scholar
  • Simchowitz M, Jamieson KG (2019) Non-asymptotic gap-dependent regret bounds for tabular MDPs. Adv. Neural Inform. Processing Systems 33:1151–1160.Google Scholar
  • Sinclair SR, Banerjee S, Yu CL (2019) Adaptive discretization for episodic reinforcement learning in metric spaces. Proc. ACM Measurement Anal. Comput. Systems, vol. 3, 1–44.Google Scholar
  • Sinclair SR, Wang T, Jain G, Banerjee S, Yu C (2020) Adaptive discretization for model-based reinforcement learning. Adv. Neural Inform. Processing Systems 34:33.Google Scholar
  • Singh SP, Jaakkola T, Jordan MI (1995) Reinforcement learning with soft state aggregation. Adv. Neural Inform. Processing Systems 7:361–368.Google Scholar
  • Slivkins A (2014) Contextual bandits with similarity information. J. Machine Learn. Res. 15(1):2533–2568.Google Scholar
  • Slivkins A (2019) Introduction to multi-armed bandits. Foundations and Trends in Machine Learning, vol. 12 (now Publishers Inc., Norwell, MA).Google Scholar
  • Song Z, Sun W (2019) Efficient model-free reinforcement learning in metric spaces. Preprint, submitted May 1, https://arxiv.org/abs/1905.00475.Google Scholar
  • Sutton RS, Barto AG (2018) Reinforcement Learning: An Introduction (MIT Press, Cambridge, MA).Google Scholar
  • Tessler C, Shpigelman Y, Dalal G, Mandelbaum A, Haritan Kazakov D, Fuhrer B, Chechik G, Mannor S (2022) Reinforcement learning for datacenter congestion control. Performance Evaluation Rev. 49(2):43–46.CrossrefGoogle Scholar
  • Uther WT, Veloso MM (1998) Tree based discretization for continuous state space reinforcement learning. AAAI/IAAI, 769–774.Google Scholar
  • Wang R, Salakhutdinov R, Yang LF (2020) Provably efficient reinforcement learning with general value function approximation. Preprint, submitted May 21, https://arxiv.org/abs/2005.10804.Google Scholar
  • Wang Y, Wang R, Du SS, Krishnamurthy A (2019) Optimism in reinforcement learning with generalized linear function approximation. Preprint, submitted December 9, https://arxiv.org/abs/1912.04136.Google Scholar
  • Wanigasekara N, Yu C (2019) Nonparametric contextual bandits in metric spaces with unknown metric. Adv. Neural Inform. Processing Systems 32:14657–14667.Google Scholar
  • Watkins CJCH (1989) Learning from delayed rewards. PhD thesis, Cambridge University, Cambridge, UK.Google Scholar
  • Whiteson S, Stone P (2006) Evolutionary function approximation for reinforcement learning. J. Machine Learn. Res. 7:877–917.Google Scholar
  • Yang LF, Ni C, Wang M (2019) Learning to control in metric space with optimal regret. Preprint, submitted May 5, https://arxiv.org/abs/1905.01576.Google Scholar
  • Zanette A, Brunskill E (2019) Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. Preprint, submitted January 1, https://arxiv.org/abs/1901.00210.Google Scholar
  • Zanette A, Lazaric A, Kochenderfer MJ, Brunskill E (2019) Limiting extrapolation in linear approximate value iteration. Adv. Neural Inform. Processing Systems 32:5616–5625.Google Scholar
INFORMS site uses cookies to store information on your computer. Some are essential to make our site work; Others help us improve the user experience. By using this site, you consent to the placement of these cookies. Please read our Privacy Statement to learn more.