Frozen-State Value Iteration: Faster Reinforcement Learning by Freezing Slow States

Published Online:https://doi.org/10.1287/mnsc.2023.00012

References

  • Agarwal A, Kakade S, Yang LF (2020) Model-based reinforcement learning with a generative model is minimax optimal. Abernethy J, Agarwal S, eds. Proc. 33rd Conf. Learning Theory, vol. 125 (PMLR, New York), 67–83.Google Scholar
  • Asadi K, Misra D, Littman M (2018) Lipschitz continuity in model-based reinforcement learning. Dy J, Krause A, eds. Proc. 35th Internat. Conf. Machine Learn., vol. 80 (PMLR, New York), 264–273.Google Scholar
  • Ashkrof P, Homem de Almeida Correia G, Cats O, Van Arem B (2024) On the relocation behavior of ride-sourcing drivers. Transportation Lett. 16(4):330–337.CrossrefGoogle Scholar
  • Barto AG, Mahadevan S (2003) Recent advances in hierarchical reinforcement learning. Discrete Event Dynam. Systems 13(1–2):41–77.CrossrefGoogle Scholar
  • Bellemare M, Veness J, Bowling M (2012) Investigating contingency awareness using Atari 2600 games. Hoffmann J, Selman B, eds. Proc. 26th AAAI Conf. Artificial Intelligence, vol. 26 (AAAI Press, Palo Alto, CA), 864–871.Google Scholar
  • Benjaafar S, Jiang D, Li X, Li X (2022) Dynamic inventory repositioning in on-demand rental networks. Management Sci. 68(11):7861–7878.LinkGoogle Scholar
  • Bertsekas DP, Tsitsiklis JN (1996) Neuro-Dynamic Programming (Athena Scientific, Belmont, MA).Google Scholar
  • Bhatnagar S, Panigrahi JR (2006) Actor-critic algorithms for hierarchical Markov decision processes. Automatica 42(4):637–644.CrossrefGoogle Scholar
  • Brockman G, Cheung V, Pettersson L, Schneider J, Schulman J, Tang J, Zaremba W (2016) OpenAI Gym. Preprint, submitted June 5, https://doi.org/10.48550/arXiv.1606.01540.Google Scholar
  • Brown DB, Smith JE (2025) Unit commitment without commitment: A dynamic programming approach for managing an integrated energy system under uncertainty. Oper. Res. 73(4):1744–1766.LinkGoogle Scholar
  • Carmona R, Ludkovski M (2010) Valuation of energy storage: An optimal switching approach. Quant. Finance 10(4):359–374.CrossrefGoogle Scholar
  • Chang HS, Fard PJ, Marcus SI, Shayman M (2003) Multitime scale Markov decision processes. IEEE Trans. Automatic Control 48(6):976–987.CrossrefGoogle Scholar
  • Chen X, Hu P, Hu Z (2017) Efficient algorithms for the dynamic pricing problem with reference price effect. Management Sci. 63(12):4389–4408.LinkGoogle Scholar
  • Chung H, Freund D, Shmoys DB (2018) Bike angels: An analysis of Citi Bike’s incentive program. Proc. First ACM SIGCAS Conf. Comput. Sustainable Soc., 1–9.Google Scholar
  • Ciosek K, Silver D (2015) Value iteration with options and state aggregation. Preprint, submitted January 16, http://arxiv.org/abs/1501.03959.Google Scholar
  • Dietterich TG (2000) Hierarchical reinforcement learning with the MAXQ value function decomposition. J. Artificial Intelligence Res. 13:227–303.CrossrefGoogle Scholar
  • Digney BL (1998) Learning hierarchical control structures for multiple tasks and changing environments. Pfeifer R, Blumberg B, Meyer J-A, Wilson SW, eds. Proc. Fifth Internat. Conf. Simulation Adaptive Behav. Animals Animats, vol. 5 (MIT PRESS, Cambridge, MA), 321–330.Google Scholar
  • Domingues OD, Ménard P, Pirotta M, Kaufmann E, Valko M (2021) Kernel-based reinforcement learning: A finite-time analysis. Meila M, Zhang T, eds. Proc. 38th Internat. Conf. Machine Learn., vol. 139 (PMLR, New York), 2783–2792.Google Scholar
  • El Shar IE, Jiang D (2023) Weakly coupled deep q-networks. Oh A, Naumann T, Globerson A, Saenko K. Hardt M, Levine S, eds. Adv. Neural Inform. Processing Systems, vol. 36 (Curran Associates, Inc., Red Hook, NY), 43931–43950.CrossrefGoogle Scholar
  • Ernst D, Geurts P, Wehenkel L (2005) Tree-based batch mode reinforcement learning. J. Machine Learn. Res. 6:503–556.Google Scholar
  • Gelada C, Kumar S, Buckman J, Nachum O, Bellemare MG (2019) DeepMDP: Learning continuous latent space models for representation learning. Proc. 36th Internat. Conf. Machine Learn. (ICML 2019), vol. 97 (PMLR, New York), 2170–2179.Google Scholar
  • Haskell WB, Jain R, Kalathil D (2016) Empirical dynamic programming. Math. Oper. Res. 41(2):402–429.LinkGoogle Scholar
  • He L, Hu Z, Zhang M (2020) Robust repositioning for vehicle sharing. Manufacturing Service Oper. Management 22(2):241–256.LinkGoogle Scholar
  • Jacobson M, Shimkin N, Shwartz (1999) A piecewise stationary Markov decision processes, II: Constant gain, https://adam.net.technion.ac.il/files/2017/01/2TimesScale.pdf.Google Scholar
  • Jiang DR, Powell WB (2015a) An approximate dynamic programming algorithm for monotone value functions. Oper. Res. 63(6):1489–1511.LinkGoogle Scholar
  • Jiang DR, Powell WB (2015b) Optimal hour-ahead bidding in the real-time electricity market with battery storage using approximate dynamic programming. INFORMS J. Comput. 27(3):525–543.LinkGoogle Scholar
  • Jiang DR, Al-Kanj L, Powell WB (2020) Optimistic Monte Carlo tree search with sampled information relaxation dual bounds. Oper. Res. 68(6):1678–1697.LinkGoogle Scholar
  • Jiang N, Kulesza A, Singh S, Lewis R (2015) The dependence of effective planning horizon on model accuracy. Bordini RH, Elkind E, Weiss G, Yolum PK, eds. Proc.14th Internat. Conf. Autonomous Agents Multiagent Systems ((IFAAMAS, Richland, SC), 1181–1189.Google Scholar
  • Jonsson A, Barto AG (2005) A causal approach to hierarchical decomposition of factored MDPs. De Raedt L, Wrobel S, eds. Proc. 22nd Internat. Conf. Machine Learn. (ICML 2005) (ACM, New York), 401–408.Google Scholar
  • Kakade S, Langford J (2002) Approximately optimal approximate reinforcement learning. Sammut C, Hoffmann A, eds. Proc. 19th Internat. Conf. Machine Learn. (Morgan Kaufmann, San Francisco, CA), 267–274.Google Scholar
  • Kearns M, Mansour Y, Ng AY (2002) A sparse sampling algorithm for near-optimal planning in large Markov decision processes. Machine Learn. 49:193–208.CrossrefGoogle Scholar
  • Killian JA, Biswas A, Shah S, Tambe M (2021) Q-learning Lagrange policies for multi-action restless bandits. Zhu F, Ooi BC, Miao C, eds. Proc. 27th ACM SIGKDD Conf. Knowledge Discovery Data Mining (ACM, New York), 871–881.Google Scholar
  • Kim JH, Powell WB (2011) Optimal energy commitments with storage and intermittent supply. Oper. Res. 59(6):1347–1360.LinkGoogle Scholar
  • Kingma DP, Ba J (2014) Adam: A method for stochastic optimization. Preprint, submitted December 22, https://arxiv.org/abs/1412.6980.Google Scholar
  • Kunnumkal S, Topaloglu H (2008) Using stochastic approximation methods to compute optimal base-stock levels in inventory control problems. Oper. Res. 56(3):646–664.LinkGoogle Scholar
  • Li G, Wei Y, Chi Y, Gu Y, Chen Y (2020) Breaking the sample size barrier in model-based reinforcement learning with a generative model. Larochelle H, Ranzato MA, Hadsell R, Balcan M-F, Lin H-T, eds. Proc. 34th Conf. Neural Inform. Processing Systems, vol. 33 (Curran Associates Inc., Red Hook, NY), 12861–12872.Google Scholar
  • Littman ML, Dean TL, Kaelbling LP (1995) On the complexity of solving Markov decision problems. Besnard P, Hanks S, eds. Proc. 11th Conf. Uncertainty Artificial Intelligence (Morgan Kaufmann, San Francisco, CA) 394–402.Google Scholar
  • Löhndorf N, Minner S (2010) Optimal day-ahead trading and storage of renewable energies—An approximate dynamic programming approach. Energy Systems 1:61–77.CrossrefGoogle Scholar
  • Longstaff FA, Schwartz ES (2001) Valuing American options by simulation: A simple least-squares approach. Rev. Financial Stud. 14(1):113–147.CrossrefGoogle Scholar
  • Mannor S, Menache I, Hoze A, Klein U (2004) Dynamic abstraction in reinforcement learning via clustering. Greiner R, Schuurmans D, eds. Proc. 21st Internat. Conf. Machine Learn. (ACM, New York).Google Scholar
  • Maran D, Metelli AM, Papini M, Restell M (2024) No-regret reinforcement learning in smooth MDPs. Preprint, submitted February 6, https://arxiv.org/abs/2402.03792.Google Scholar
  • McGovern A, Barto AG (2001) Automatic discovery of subgoals in reinforcement learning using diverse density. Brodley CE, Danyluk AP, eds. Proc. Eighth Internat. Conf. Machine Learn. (Morgan Kaufmann, San Francisco, CA), 361–368.Google Scholar
  • Menache I, Mannor S, Shimkin N (2002) Q-cut-dynamic discovery of sub-goals in reinforcement learning. Proc. 13th Eur. Conf. Machine Learn., Lecture Notes in Computer Science, vol. 2430 (Springer, Berlin), 295–306.Google Scholar
  • Munos R, Szepesvári C (2008) Finite-time bounds for fitted value iteration. J. Machine Learn. Res. 9(27):815–857.Google Scholar
  • Nadarajah S, Cire AA (2025) Self-adapting network relaxations for weakly coupled Markov decision processes. Management Sci. 71(2):1779–1802.LinkGoogle Scholar
  • Nadarajah S, Margot F, Secomandi N (2017) Comparison of least squares Monte Carlo methods with applications to energy real options. Eur. J. Oper. Res. 256(1):196–204.CrossrefGoogle Scholar
  • Nascimento JM, Powell WB (2009) An optimal approximate dynamic programming algorithm for the lagged asset acquisition problem. Math. Oper. Res. 34(1):210–237.LinkGoogle Scholar
  • Ok J, Proutiere A, Tranos D (2018) Exploration in structured reinforcement learning. Bengio S, et al. eds. Adv. Neural Inform. Processing Systems, vol. 31 (Curran Associates, Inc. City: Red Hook, NY), 5057–5067.Google Scholar
  • Ong HY, Freund D, Crapis D (2021) Driver positioning and incentive budgeting with an escrow mechanism for ride-sharing platforms. INFORMS J. Appl. Anal. 51(5):373–390.LinkGoogle Scholar
  • Osband I, Van Roy B (2014) Near-optimal reinforcement learning in factored MDPs. Ghahramani Z, Welling M, Cortes C, Lawrence ND, Weinberger KQ, eds. Proc. 28th Conf. Adv. Neural Inform. Processing Systems, vol. 27 (Curran Associates Inc., Red Hook, NY), 604–612.Google Scholar
  • Panigrahi JR, Bhatnagar S (2004) Hierarchical decision making in semiconductor fabs using multi-time scale Markov decision processes. 43rd IEEE Conf. Decision Control, vol. 4 (IEEE), 4387–4392.Google Scholar
  • Papadaki KP, Powell WB (2002) Exploiting structure in adaptive dynamic programming algorithms for a stochastic batch service problem. Eur. J. Oper. Res. 142(1):108–127.CrossrefGoogle Scholar
  • Parr RE, Russell S (1998) Hierarchical Control and Learning for Markov Decision Processes (University of California, Berkeley, CA).Google Scholar
  • Pereira MV, Pinto LM (1991) Multi-stage stochastic optimization applied to energy planning. Math. Programming 52:359–375.CrossrefGoogle Scholar
  • Philpott AB, Guan Z (2008) On the convergence of stochastic dual dynamic programming and related methods. Oper. Res. Lett. 36(4):450–455.CrossrefGoogle Scholar
  • Pirotta M, Restelli M, Bascetta L (2015) Policy gradient in Lipschitz Markov decision processes. Machine Learn. 100(2–3):255–283.CrossrefGoogle Scholar
  • Porteus EL (1990) Stochastic inventory theory. Heyman DP, Sobel MJ, eds. Handbooks in Operations Research and Management Science, vol. 2 (Elsevier, Amsterdam), 605–652.Google Scholar
  • Precup D (2000) Temporal abstraction in reinforcement learning. Unpublished PhD thesis, University of Massachusetts, Boston.Google Scholar
  • Puterman ML (2014) Markov Decision Processes: Discrete Stochastic Dynamic Programming (John Wiley & Sons, Hoboken, NJ).Google Scholar
  • Şimşek Ö, Barto AG (2004) Using relative novelty to identify useful temporal abstractions in reinforcement learning. Proc. 21st Internat. Conf. Machine Learn., 95.Google Scholar
  • Şimşek Ö, Wolfe AP, Barto AG (2005) Identifying useful subgoals in reinforcement learning by local graph partitioning. Proc. 22nd Internat. Conf. Machine Learn., 816–823.Google Scholar
  • Sinclair S, Banerjee S, Yu CL (2022) Adaptive discretization in online reinforcement learning. Oper. Res. 71(5):1636–1652.Google Scholar
  • Sinclair S, Wang T, Jain G, Banerjee S, Yu C (2020) Adaptive discretization for model-based reinforcement learning. Adv. Neural Inform. Processing Systems: Proc. 34th Conf. Neural Inform. Processing Systems, vol. 33 (Curran Associates, Inc., Red Hook, NY), 3858–3871.Google Scholar
  • Song R, Xu K (2022) Temporal concatenation for Markov decision processes. Probab. Engrg. Inform. Sci. 36(4):999–1026.CrossrefGoogle Scholar
  • Stolle M, Precup D (2002) Learning options in reinforcement learning. Proc. Fifth Internat. Sympos. Abstraction Reformulation Approximation (Springer, Berlin), 212–223.CrossrefGoogle Scholar
  • Sutton RS, Precup D, Singh S (1999) Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence 112(1–2):181–211.CrossrefGoogle Scholar
  • Tunyasuvunakool S, Muldal A, Doron Y, Liu S, BohezS, Merel J, Erez T, Lillicrap T, Heess N, Tassa Y (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022.Google Scholar
  • Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. IEEE/RSJ Internat. Conf. Intelligent Robots Systems (IEEE, Piscataway, NJ), 5026–5033.Google Scholar
  • Tsitsiklis JN, Van Roy B (1996) Feature-based methods for large scale dynamic programming. Machine Learn. 22(1–3):59–94.CrossrefGoogle Scholar
  • Wang Z (2016) Intertemporal price discrimination via reference price effects. Oper. Res. 64(2):290–296.LinkGoogle Scholar
  • Wang S, Bi S, Zhang Y-JA (2018) Demand response management for profit maximizing energy loads in real-time electricity market. IEEE Trans. Power Systems 33(6):6387–6396.CrossrefGoogle Scholar
  • Wang Y, Poloczek M, Jiang DR (2023) Dynamic subgoal-based exploration via Bayesian optimization. Trans. Machine Learn. Res. (September 28), https://openreview.net/forum?id=ThJl4d5JRg.Google Scholar
  • Wongthatsanekorn W, Realff MJ, Ammons JC (2010) Multi-time scale Markov decision process approach to strategic network growth of reverse supply chains. Omega 38(1–2):20–32.CrossrefGoogle Scholar
  • Yang B, Li W, Wang J, Yang J, Wang T, Liu X (2020) A novel path planning algorithm for warehouse robots based on a two-dimensional grid model. IEEE Access 8:80347–80357.CrossrefGoogle Scholar
  • Yu JY, Mannor S (2009) Arbitrarily modulated Markov decision processes. Proc. 48h IEEE Conf. Decision Control (IEEE, Piscataway, NJ), 2946–2953.Google Scholar
  • Zhao X, Fan F, Liu X, Xie J (2007) Storage-space capacitated inventory system with (r, q) policies. Oper. Res. 55(5):854–865.LinkGoogle Scholar
  • Zhu C, Zhou J, Wu W, Mo L (2006) Hydropower portfolios management via Markov decision process. 32nd Annual Conf. IEEE Indust. Electronics (IEEE), 2883–2888.Google Scholar
INFORMS site uses cookies to store information on your computer. Some are essential to make our site work; Others help us improve the user experience. By using this site, you consent to the placement of these cookies. Please read our Privacy Statement to learn more.