Frozen-State Value Iteration: Faster Reinforcement Learning by Freezing Slow States

Yijia Wang
Yijia Wang
[email protected]
https://orcid.org/0000-0001-8606-4846
University of Pittsburgh, Pittsburgh, Pennsylvania 15260
Search for more papers by this author
,
Daniel R. Jiang
Corresponding Author
Daniel R. Jiang
[email protected]
https://orcid.org/0000-0002-5388-8061
University of Pittsburgh, Pittsburgh, Pennsylvania 15260; and Meta Platforms, Menlo Park, California 94025
Search for more papers by this author

University of Pittsburgh, Pittsburgh, Pennsylvania 15260

Search for more papers by this author

Daniel R. Jiang

Corresponding Author

Daniel R. Jiang

[email protected]

https://orcid.org/0000-0002-5388-8061

University of Pittsburgh, Pittsburgh, Pennsylvania 15260; and Meta Platforms, Menlo Park, California 94025

Search for more papers by this author

Published Online:18 Mar 2026https://doi.org/10.1287/mnsc.2023.00012

References

Agarwal A, Kakade S, Yang LF (2020) Model-based reinforcement learning with a generative model is minimax optimal. Abernethy J, Agarwal S, eds. Proc. 33rd Conf. Learning Theory, vol. 125 (PMLR, New York), 67–83.Google Scholar
Asadi K, Misra D, Littman M (2018) Lipschitz continuity in model-based reinforcement learning. Dy J, Krause A, eds. Proc. 35th Internat. Conf. Machine Learn., vol. 80 (PMLR, New York), 264–273.Google Scholar
Ashkrof P, Homem de Almeida Correia G, Cats O, Van Arem B (2024) On the relocation behavior of ride-sourcing drivers. Transportation Lett. 16(4):330–337.Crossref, Google Scholar
Barto AG, Mahadevan S (2003) Recent advances in hierarchical reinforcement learning. Discrete Event Dynam. Systems 13(1–2):41–77.Crossref, Google Scholar
Bellemare M, Veness J, Bowling M (2012) Investigating contingency awareness using Atari 2600 games. Hoffmann J, Selman B, eds. Proc. 26th AAAI Conf. Artificial Intelligence, vol. 26 (AAAI Press, Palo Alto, CA), 864–871.Google Scholar
Benjaafar S, Jiang D, Li X, Li X (2022) Dynamic inventory repositioning in on-demand rental networks. Management Sci. 68(11):7861–7878.Link, Google Scholar
Bertsekas DP, Tsitsiklis JN (1996) Neuro-Dynamic Programming (Athena Scientific, Belmont, MA).Google Scholar
Bhatnagar S, Panigrahi JR (2006) Actor-critic algorithms for hierarchical Markov decision processes. Automatica 42(4):637–644.Crossref, Google Scholar
Brockman G, Cheung V, Pettersson L, Schneider J, Schulman J, Tang J, Zaremba W (2016) OpenAI Gym. Preprint, submitted June 5, https://doi.org/10.48550/arXiv.1606.01540.Google Scholar
Brown DB, Smith JE (2025) Unit commitment without commitment: A dynamic programming approach for managing an integrated energy system under uncertainty. Oper. Res. 73(4):1744–1766.Link, Google Scholar
Carmona R, Ludkovski M (2010) Valuation of energy storage: An optimal switching approach. Quant. Finance 10(4):359–374.Crossref, Google Scholar
Chang HS, Fard PJ, Marcus SI, Shayman M (2003) Multitime scale Markov decision processes. IEEE Trans. Automatic Control 48(6):976–987.Crossref, Google Scholar
Chen X, Hu P, Hu Z (2017) Efficient algorithms for the dynamic pricing problem with reference price effect. Management Sci. 63(12):4389–4408.Link, Google Scholar
Chung H, Freund D, Shmoys DB (2018) Bike angels: An analysis of Citi Bike’s incentive program. Proc. First ACM SIGCAS Conf. Comput. Sustainable Soc., 1–9.Google Scholar
Ciosek K, Silver D (2015) Value iteration with options and state aggregation. Preprint, submitted January 16, http://arxiv.org/abs/1501.03959.Google Scholar
Dietterich TG (2000) Hierarchical reinforcement learning with the MAXQ value function decomposition. J. Artificial Intelligence Res. 13:227–303.Crossref, Google Scholar
Digney BL (1998) Learning hierarchical control structures for multiple tasks and changing environments. Pfeifer R, Blumberg B, Meyer J-A, Wilson SW, eds. Proc. Fifth Internat. Conf. Simulation Adaptive Behav. Animals Animats, vol. 5 (MIT PRESS, Cambridge, MA), 321–330.Google Scholar
Domingues OD, Ménard P, Pirotta M, Kaufmann E, Valko M (2021) Kernel-based reinforcement learning: A finite-time analysis. Meila M, Zhang T, eds. Proc. 38th Internat. Conf. Machine Learn., vol. 139 (PMLR, New York), 2783–2792.Google Scholar
El Shar IE, Jiang D (2023) Weakly coupled deep q-networks. Oh A, Naumann T, Globerson A, Saenko K. Hardt M, Levine S, eds. Adv. Neural Inform. Processing Systems, vol. 36 (Curran Associates, Inc., Red Hook, NY), 43931–43950.Crossref, Google Scholar
Ernst D, Geurts P, Wehenkel L (2005) Tree-based batch mode reinforcement learning. J. Machine Learn. Res. 6:503–556.Google Scholar
Gelada C, Kumar S, Buckman J, Nachum O, Bellemare MG (2019) DeepMDP: Learning continuous latent space models for representation learning. Proc. 36th Internat. Conf. Machine Learn. (ICML 2019), vol. 97 (PMLR, New York), 2170–2179.Google Scholar
Haskell WB, Jain R, Kalathil D (2016) Empirical dynamic programming. Math. Oper. Res. 41(2):402–429.Link, Google Scholar
He L, Hu Z, Zhang M (2020) Robust repositioning for vehicle sharing. Manufacturing Service Oper. Management 22(2):241–256.Link, Google Scholar
Jacobson M, Shimkin N, Shwartz (1999) A piecewise stationary Markov decision processes, II: Constant gain, https://adam.net.technion.ac.il/files/2017/01/2TimesScale.pdf.Google Scholar
Jiang DR, Powell WB (2015a) An approximate dynamic programming algorithm for monotone value functions. Oper. Res. 63(6):1489–1511.Link, Google Scholar
Jiang DR, Powell WB (2015b) Optimal hour-ahead bidding in the real-time electricity market with battery storage using approximate dynamic programming. INFORMS J. Comput. 27(3):525–543.Link, Google Scholar
Jiang DR, Al-Kanj L, Powell WB (2020) Optimistic Monte Carlo tree search with sampled information relaxation dual bounds. Oper. Res. 68(6):1678–1697.Link, Google Scholar
Jiang N, Kulesza A, Singh S, Lewis R (2015) The dependence of effective planning horizon on model accuracy. Bordini RH, Elkind E, Weiss G, Yolum PK, eds. Proc.14th Internat. Conf. Autonomous Agents Multiagent Systems ((IFAAMAS, Richland, SC), 1181–1189.Google Scholar
Jonsson A, Barto AG (2005) A causal approach to hierarchical decomposition of factored MDPs. De Raedt L, Wrobel S, eds. Proc. 22nd Internat. Conf. Machine Learn. (ICML 2005) (ACM, New York), 401–408.Google Scholar
Kakade S, Langford J (2002) Approximately optimal approximate reinforcement learning. Sammut C, Hoffmann A, eds. Proc. 19th Internat. Conf. Machine Learn. (Morgan Kaufmann, San Francisco, CA), 267–274.Google Scholar
Kearns M, Mansour Y, Ng AY (2002) A sparse sampling algorithm for near-optimal planning in large Markov decision processes. Machine Learn. 49:193–208.Crossref, Google Scholar
Killian JA, Biswas A, Shah S, Tambe M (2021) Q-learning Lagrange policies for multi-action restless bandits. Zhu F, Ooi BC, Miao C, eds. Proc. 27th ACM SIGKDD Conf. Knowledge Discovery Data Mining (ACM, New York), 871–881.Google Scholar
Kim JH, Powell WB (2011) Optimal energy commitments with storage and intermittent supply. Oper. Res. 59(6):1347–1360.Link, Google Scholar
Kingma DP, Ba J (2014) Adam: A method for stochastic optimization. Preprint, submitted December 22, https://arxiv.org/abs/1412.6980.Google Scholar
Kunnumkal S, Topaloglu H (2008) Using stochastic approximation methods to compute optimal base-stock levels in inventory control problems. Oper. Res. 56(3):646–664.Link, Google Scholar
Li G, Wei Y, Chi Y, Gu Y, Chen Y (2020) Breaking the sample size barrier in model-based reinforcement learning with a generative model. Larochelle H, Ranzato MA, Hadsell R, Balcan M-F, Lin H-T, eds. Proc. 34th Conf. Neural Inform. Processing Systems, vol. 33 (Curran Associates Inc., Red Hook, NY), 12861–12872.Google Scholar
Littman ML, Dean TL, Kaelbling LP (1995) On the complexity of solving Markov decision problems. Besnard P, Hanks S, eds. Proc. 11th Conf. Uncertainty Artificial Intelligence (Morgan Kaufmann, San Francisco, CA) 394–402.Google Scholar
Löhndorf N, Minner S (2010) Optimal day-ahead trading and storage of renewable energies—An approximate dynamic programming approach. Energy Systems 1:61–77.Crossref, Google Scholar
Longstaff FA, Schwartz ES (2001) Valuing American options by simulation: A simple least-squares approach. Rev. Financial Stud. 14(1):113–147.Crossref, Google Scholar
Mannor S, Menache I, Hoze A, Klein U (2004) Dynamic abstraction in reinforcement learning via clustering. Greiner R, Schuurmans D, eds. Proc. 21st Internat. Conf. Machine Learn. (ACM, New York).Google Scholar
Maran D, Metelli AM, Papini M, Restell M (2024) No-regret reinforcement learning in smooth MDPs. Preprint, submitted February 6, https://arxiv.org/abs/2402.03792.Google Scholar
McGovern A, Barto AG (2001) Automatic discovery of subgoals in reinforcement learning using diverse density. Brodley CE, Danyluk AP, eds. Proc. Eighth Internat. Conf. Machine Learn. (Morgan Kaufmann, San Francisco, CA), 361–368.Google Scholar
Menache I, Mannor S, Shimkin N (2002) Q-cut-dynamic discovery of sub-goals in reinforcement learning. Proc. 13th Eur. Conf. Machine Learn., Lecture Notes in Computer Science, vol. 2430 (Springer, Berlin), 295–306.Google Scholar
Munos R, Szepesvári C (2008) Finite-time bounds for fitted value iteration. J. Machine Learn. Res. 9(27):815–857.Google Scholar
Nadarajah S, Cire AA (2025) Self-adapting network relaxations for weakly coupled Markov decision processes. Management Sci. 71(2):1779–1802.Link, Google Scholar
Nadarajah S, Margot F, Secomandi N (2017) Comparison of least squares Monte Carlo methods with applications to energy real options. Eur. J. Oper. Res. 256(1):196–204.Crossref, Google Scholar
Nascimento JM, Powell WB (2009) An optimal approximate dynamic programming algorithm for the lagged asset acquisition problem. Math. Oper. Res. 34(1):210–237.Link, Google Scholar
Ok J, Proutiere A, Tranos D (2018) Exploration in structured reinforcement learning. Bengio S, et al. eds. Adv. Neural Inform. Processing Systems, vol. 31 (Curran Associates, Inc. City: Red Hook, NY), 5057–5067.Google Scholar
Ong HY, Freund D, Crapis D (2021) Driver positioning and incentive budgeting with an escrow mechanism for ride-sharing platforms. INFORMS J. Appl. Anal. 51(5):373–390.Link, Google Scholar
Osband I, Van Roy B (2014) Near-optimal reinforcement learning in factored MDPs. Ghahramani Z, Welling M, Cortes C, Lawrence ND, Weinberger KQ, eds. Proc. 28th Conf. Adv. Neural Inform. Processing Systems, vol. 27 (Curran Associates Inc., Red Hook, NY), 604–612.Google Scholar
Panigrahi JR, Bhatnagar S (2004) Hierarchical decision making in semiconductor fabs using multi-time scale Markov decision processes. 43rd IEEE Conf. Decision Control, vol. 4 (IEEE), 4387–4392.Google Scholar
Papadaki KP, Powell WB (2002) Exploiting structure in adaptive dynamic programming algorithms for a stochastic batch service problem. Eur. J. Oper. Res. 142(1):108–127.Crossref, Google Scholar
Parr RE, Russell S (1998) Hierarchical Control and Learning for Markov Decision Processes (University of California, Berkeley, CA).Google Scholar
Pereira MV, Pinto LM (1991) Multi-stage stochastic optimization applied to energy planning. Math. Programming 52:359–375.Crossref, Google Scholar
Philpott AB, Guan Z (2008) On the convergence of stochastic dual dynamic programming and related methods. Oper. Res. Lett. 36(4):450–455.Crossref, Google Scholar
Pirotta M, Restelli M, Bascetta L (2015) Policy gradient in Lipschitz Markov decision processes. Machine Learn. 100(2–3):255–283.Crossref, Google Scholar
Porteus EL (1990) Stochastic inventory theory. Heyman DP, Sobel MJ, eds. Handbooks in Operations Research and Management Science, vol. 2 (Elsevier, Amsterdam), 605–652.Google Scholar
Precup D (2000) Temporal abstraction in reinforcement learning. Unpublished PhD thesis, University of Massachusetts, Boston.Google Scholar
Puterman ML (2014) Markov Decision Processes: Discrete Stochastic Dynamic Programming (John Wiley & Sons, Hoboken, NJ).Google Scholar
Şimşek Ö, Barto AG (2004) Using relative novelty to identify useful temporal abstractions in reinforcement learning. Proc. 21st Internat. Conf. Machine Learn., 95.Google Scholar
Şimşek Ö, Wolfe AP, Barto AG (2005) Identifying useful subgoals in reinforcement learning by local graph partitioning. Proc. 22nd Internat. Conf. Machine Learn., 816–823.Google Scholar
Sinclair S, Banerjee S, Yu CL (2022) Adaptive discretization in online reinforcement learning. Oper. Res. 71(5):1636–1652.Google Scholar
Sinclair S, Wang T, Jain G, Banerjee S, Yu C (2020) Adaptive discretization for model-based reinforcement learning. Adv. Neural Inform. Processing Systems: Proc. 34th Conf. Neural Inform. Processing Systems, vol. 33 (Curran Associates, Inc., Red Hook, NY), 3858–3871.Google Scholar
Song R, Xu K (2022) Temporal concatenation for Markov decision processes. Probab. Engrg. Inform. Sci. 36(4):999–1026.Crossref, Google Scholar
Stolle M, Precup D (2002) Learning options in reinforcement learning. Proc. Fifth Internat. Sympos. Abstraction Reformulation Approximation (Springer, Berlin), 212–223.Crossref, Google Scholar
Sutton RS, Precup D, Singh S (1999) Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence 112(1–2):181–211.Crossref, Google Scholar
Tunyasuvunakool S, Muldal A, Doron Y, Liu S, BohezS, Merel J, Erez T, Lillicrap T, Heess N, Tassa Y (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022.Google Scholar
Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. IEEE/RSJ Internat. Conf. Intelligent Robots Systems (IEEE, Piscataway, NJ), 5026–5033.Google Scholar
Tsitsiklis JN, Van Roy B (1996) Feature-based methods for large scale dynamic programming. Machine Learn. 22(1–3):59–94.Crossref, Google Scholar
Wang Z (2016) Intertemporal price discrimination via reference price effects. Oper. Res. 64(2):290–296.Link, Google Scholar
Wang S, Bi S, Zhang Y-JA (2018) Demand response management for profit maximizing energy loads in real-time electricity market. IEEE Trans. Power Systems 33(6):6387–6396.Crossref, Google Scholar
Wang Y, Poloczek M, Jiang DR (2023) Dynamic subgoal-based exploration via Bayesian optimization. Trans. Machine Learn. Res. (September 28), https://openreview.net/forum?id=ThJl4d5JRg.Google Scholar
Wongthatsanekorn W, Realff MJ, Ammons JC (2010) Multi-time scale Markov decision process approach to strategic network growth of reverse supply chains. Omega 38(1–2):20–32.Crossref, Google Scholar
Yang B, Li W, Wang J, Yang J, Wang T, Liu X (2020) A novel path planning algorithm for warehouse robots based on a two-dimensional grid model. IEEE Access 8:80347–80357.Crossref, Google Scholar
Yu JY, Mannor S (2009) Arbitrarily modulated Markov decision processes. Proc. 48h IEEE Conf. Decision Control (IEEE, Piscataway, NJ), 2946–2953.Google Scholar
Zhao X, Fan F, Liu X, Xie J (2007) Storage-space capacitated inventory system with (r, q) policies. Oper. Res. 55(5):854–865.Link, Google Scholar
Zhu C, Zhou J, Wu W, Mo L (2006) Hydropower portfolios management via Markov decision process. 32nd Annual Conf. IEEE Indust. Electronics (IEEE), 2883–2888.Google Scholar

Articles In Advance

Article Information

Supplemental Material

Metrics

Information

Received:January 02, 2023
Accepted:September 04, 2025
Published Online:March 18, 2026

Cite as

Yijia Wang, Daniel R. Jiang (2026) Frozen-State Value Iteration: Faster Reinforcement Learning by Freezing Slow States. Management Science 0(0).

https://doi.org/10.1287/mnsc.2023.00012

Keywords

Acknowledgments

The authors are grateful to the department editor, associate editor, and three anonymous reviewers for their multiple rounds of careful reading and constructive feedback. Their comments helped us catch important issues and substantially improve the paper, including strengthening the organization, sharpening the intuition and exposition, and clarifying key points throughout.

PDF download

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Frozen-State Value Iteration: Faster Reinforcement Learning by Freezing Slow States

References

Articles In Advance

Article Information

Supplemental Material

Metrics

Information

Cite as

Keywords

Sign Up for INFORMS Publications Updates and News