Frozen-State Value Iteration: Faster Reinforcement Learning by Freezing Slow States
References
- (2020) Model-based reinforcement learning with a generative model is minimax optimal. Abernethy J, Agarwal S, eds. Proc. 33rd Conf. Learning Theory, vol. 125 (PMLR, New York), 67–83.Google Scholar
- (2018) Lipschitz continuity in model-based reinforcement learning. Dy J, Krause A, eds. Proc. 35th Internat. Conf. Machine Learn., vol. 80 (PMLR, New York), 264–273.Google Scholar
- (2024) On the relocation behavior of ride-sourcing drivers. Transportation Lett. 16(4):330–337.Crossref, Google Scholar
- (2003) Recent advances in hierarchical reinforcement learning. Discrete Event Dynam. Systems 13(1–2):41–77.Crossref, Google Scholar
- (2012) Investigating contingency awareness using Atari 2600 games. Hoffmann J, Selman B, eds. Proc. 26th AAAI Conf. Artificial Intelligence, vol. 26 (AAAI Press, Palo Alto, CA), 864–871.Google Scholar
- (2022) Dynamic inventory repositioning in on-demand rental networks. Management Sci. 68(11):7861–7878.Link, Google Scholar
- (1996) Neuro-Dynamic Programming (Athena Scientific, Belmont, MA).Google Scholar
- (2006) Actor-critic algorithms for hierarchical Markov decision processes. Automatica 42(4):637–644.Crossref, Google Scholar
- (2016) OpenAI Gym. Preprint, submitted June 5, https://doi.org/10.48550/arXiv.1606.01540.Google Scholar
- (2025) Unit commitment without commitment: A dynamic programming approach for managing an integrated energy system under uncertainty. Oper. Res. 73(4):1744–1766.Link, Google Scholar
- (2010) Valuation of energy storage: An optimal switching approach. Quant. Finance 10(4):359–374.Crossref, Google Scholar
- (2003) Multitime scale Markov decision processes. IEEE Trans. Automatic Control 48(6):976–987.Crossref, Google Scholar
- (2017) Efficient algorithms for the dynamic pricing problem with reference price effect. Management Sci. 63(12):4389–4408.Link, Google Scholar
- (2018) Bike angels: An analysis of Citi Bike’s incentive program. Proc. First ACM SIGCAS Conf. Comput. Sustainable Soc., 1–9.Google Scholar
- (2015) Value iteration with options and state aggregation. Preprint, submitted January 16, http://arxiv.org/abs/1501.03959.Google Scholar
- (2000) Hierarchical reinforcement learning with the MAXQ value function decomposition. J. Artificial Intelligence Res. 13:227–303.Crossref, Google Scholar
- (1998) Learning hierarchical control structures for multiple tasks and changing environments. Pfeifer R, Blumberg B, Meyer J-A, Wilson SW, eds. Proc. Fifth Internat. Conf. Simulation Adaptive Behav. Animals Animats, vol. 5 (MIT PRESS, Cambridge, MA), 321–330.Google Scholar
- (2021) Kernel-based reinforcement learning: A finite-time analysis. Meila M, Zhang T, eds. Proc. 38th Internat. Conf. Machine Learn., vol. 139 (PMLR, New York), 2783–2792.Google Scholar
- (2023) Weakly coupled deep q-networks. Oh A, Naumann T, Globerson A, Saenko K. Hardt M, Levine S, eds. Adv. Neural Inform. Processing Systems, vol. 36 (Curran Associates, Inc., Red Hook, NY), 43931–43950.Crossref, Google Scholar
- (2005) Tree-based batch mode reinforcement learning. J. Machine Learn. Res. 6:503–556.Google Scholar
- (2019) DeepMDP: Learning continuous latent space models for representation learning. Proc. 36th Internat. Conf. Machine Learn. (ICML 2019), vol. 97 (PMLR, New York), 2170–2179.Google Scholar
- (2016) Empirical dynamic programming. Math. Oper. Res. 41(2):402–429.Link, Google Scholar
- (2020) Robust repositioning for vehicle sharing. Manufacturing Service Oper. Management 22(2):241–256.Link, Google Scholar
- (1999) A piecewise stationary Markov decision processes, II: Constant gain, https://adam.net.technion.ac.il/files/2017/01/2TimesScale.pdf.Google Scholar
- (2015a) An approximate dynamic programming algorithm for monotone value functions. Oper. Res. 63(6):1489–1511.Link, Google Scholar
- (2015b) Optimal hour-ahead bidding in the real-time electricity market with battery storage using approximate dynamic programming. INFORMS J. Comput. 27(3):525–543.Link, Google Scholar
- (2020) Optimistic Monte Carlo tree search with sampled information relaxation dual bounds. Oper. Res. 68(6):1678–1697.Link, Google Scholar
- (2015) The dependence of effective planning horizon on model accuracy. Bordini RH, Elkind E, Weiss G, Yolum PK, eds. Proc.14th Internat. Conf. Autonomous Agents Multiagent Systems ((IFAAMAS, Richland, SC), 1181–1189.Google Scholar
- (2005) A causal approach to hierarchical decomposition of factored MDPs. De Raedt L, Wrobel S, eds. Proc. 22nd Internat. Conf. Machine Learn. (ICML 2005) (ACM, New York), 401–408.Google Scholar
- (2002) Approximately optimal approximate reinforcement learning. Sammut C, Hoffmann A, eds. Proc. 19th Internat. Conf. Machine Learn. (Morgan Kaufmann, San Francisco, CA), 267–274.Google Scholar
- (2002) A sparse sampling algorithm for near-optimal planning in large Markov decision processes. Machine Learn. 49:193–208.Crossref, Google Scholar
- (2021) Q-learning Lagrange policies for multi-action restless bandits. Zhu F, Ooi BC, Miao C, eds. Proc. 27th ACM SIGKDD Conf. Knowledge Discovery Data Mining (ACM, New York), 871–881.Google Scholar
- (2011) Optimal energy commitments with storage and intermittent supply. Oper. Res. 59(6):1347–1360.Link, Google Scholar
- (2014) Adam: A method for stochastic optimization. Preprint, submitted December 22, https://arxiv.org/abs/1412.6980.Google Scholar
- (2008) Using stochastic approximation methods to compute optimal base-stock levels in inventory control problems. Oper. Res. 56(3):646–664.Link, Google Scholar
- (2020) Breaking the sample size barrier in model-based reinforcement learning with a generative model. Larochelle H, Ranzato MA, Hadsell R, Balcan M-F, Lin H-T, eds. Proc. 34th Conf. Neural Inform. Processing Systems, vol. 33 (Curran Associates Inc., Red Hook, NY), 12861–12872.Google Scholar
- (1995) On the complexity of solving Markov decision problems. Besnard P, Hanks S, eds. Proc. 11th Conf. Uncertainty Artificial Intelligence (Morgan Kaufmann, San Francisco, CA) 394–402.Google Scholar
- (2010) Optimal day-ahead trading and storage of renewable energies—An approximate dynamic programming approach. Energy Systems 1:61–77.Crossref, Google Scholar
- (2001) Valuing American options by simulation: A simple least-squares approach. Rev. Financial Stud. 14(1):113–147.Crossref, Google Scholar
- (2004) Dynamic abstraction in reinforcement learning via clustering. Greiner R, Schuurmans D, eds. Proc. 21st Internat. Conf. Machine Learn. (ACM, New York).Google Scholar
- (2024) No-regret reinforcement learning in smooth MDPs. Preprint, submitted February 6, https://arxiv.org/abs/2402.03792.Google Scholar
- (2001) Automatic discovery of subgoals in reinforcement learning using diverse density. Brodley CE, Danyluk AP, eds. Proc. Eighth Internat. Conf. Machine Learn. (Morgan Kaufmann, San Francisco, CA), 361–368.Google Scholar
- (2002) Q-cut-dynamic discovery of sub-goals in reinforcement learning. Proc. 13th Eur. Conf. Machine Learn., Lecture Notes in Computer Science, vol. 2430 (Springer, Berlin), 295–306.Google Scholar
- (2008) Finite-time bounds for fitted value iteration. J. Machine Learn. Res. 9(27):815–857.Google Scholar
- (2025) Self-adapting network relaxations for weakly coupled Markov decision processes. Management Sci. 71(2):1779–1802.Link, Google Scholar
- (2017) Comparison of least squares Monte Carlo methods with applications to energy real options. Eur. J. Oper. Res. 256(1):196–204.Crossref, Google Scholar
- (2009) An optimal approximate dynamic programming algorithm for the lagged asset acquisition problem. Math. Oper. Res. 34(1):210–237.Link, Google Scholar
- (2018) Exploration in structured reinforcement learning. Bengio S, et al. eds. Adv. Neural Inform. Processing Systems, vol. 31 (Curran Associates, Inc. City: Red Hook, NY), 5057–5067.Google Scholar
- (2021) Driver positioning and incentive budgeting with an escrow mechanism for ride-sharing platforms. INFORMS J. Appl. Anal. 51(5):373–390.Link, Google Scholar
- (2014) Near-optimal reinforcement learning in factored MDPs. Ghahramani Z, Welling M, Cortes C, Lawrence ND, Weinberger KQ, eds. Proc. 28th Conf. Adv. Neural Inform. Processing Systems, vol. 27 (Curran Associates Inc., Red Hook, NY), 604–612.Google Scholar
- (2004) Hierarchical decision making in semiconductor fabs using multi-time scale Markov decision processes. 43rd IEEE Conf. Decision Control, vol. 4 (IEEE), 4387–4392.Google Scholar
- (2002) Exploiting structure in adaptive dynamic programming algorithms for a stochastic batch service problem. Eur. J. Oper. Res. 142(1):108–127.Crossref, Google Scholar
- (1998) Hierarchical Control and Learning for Markov Decision Processes (University of California, Berkeley, CA).Google Scholar
- (1991) Multi-stage stochastic optimization applied to energy planning. Math. Programming 52:359–375.Crossref, Google Scholar
- (2008) On the convergence of stochastic dual dynamic programming and related methods. Oper. Res. Lett. 36(4):450–455.Crossref, Google Scholar
- (2015) Policy gradient in Lipschitz Markov decision processes. Machine Learn. 100(2–3):255–283.Crossref, Google Scholar
- (1990) Stochastic inventory theory. Heyman DP, Sobel MJ, eds. Handbooks in Operations Research and Management Science, vol. 2 (Elsevier, Amsterdam), 605–652.Google Scholar
- (2000) Temporal abstraction in reinforcement learning. Unpublished PhD thesis, University of Massachusetts, Boston.Google Scholar
- (2014) Markov Decision Processes: Discrete Stochastic Dynamic Programming (John Wiley & Sons, Hoboken, NJ).Google Scholar
- (2004) Using relative novelty to identify useful temporal abstractions in reinforcement learning. Proc. 21st Internat. Conf. Machine Learn., 95.Google Scholar
- (2005) Identifying useful subgoals in reinforcement learning by local graph partitioning. Proc. 22nd Internat. Conf. Machine Learn., 816–823.Google Scholar
- (2022) Adaptive discretization in online reinforcement learning. Oper. Res. 71(5):1636–1652.Google Scholar
- (2020) Adaptive discretization for model-based reinforcement learning. Adv. Neural Inform. Processing Systems: Proc. 34th Conf. Neural Inform. Processing Systems, vol. 33 (Curran Associates, Inc., Red Hook, NY), 3858–3871.Google Scholar
- (2022) Temporal concatenation for Markov decision processes. Probab. Engrg. Inform. Sci. 36(4):999–1026.Crossref, Google Scholar
- (2002) Learning options in reinforcement learning. Proc. Fifth Internat. Sympos. Abstraction Reformulation Approximation (Springer, Berlin), 212–223.Crossref, Google Scholar
- (1999) Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence 112(1–2):181–211.Crossref, Google Scholar
- (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022.Google Scholar
- (2012) Mujoco: A physics engine for model-based control. IEEE/RSJ Internat. Conf. Intelligent Robots Systems (IEEE, Piscataway, NJ), 5026–5033.Google Scholar
- (1996) Feature-based methods for large scale dynamic programming. Machine Learn. 22(1–3):59–94.Crossref, Google Scholar
- (2016) Intertemporal price discrimination via reference price effects. Oper. Res. 64(2):290–296.Link, Google Scholar
- (2018) Demand response management for profit maximizing energy loads in real-time electricity market. IEEE Trans. Power Systems 33(6):6387–6396.Crossref, Google Scholar
- (2023) Dynamic subgoal-based exploration via Bayesian optimization. Trans. Machine Learn. Res. (September 28), https://openreview.net/forum?id=ThJl4d5JRg.Google Scholar
- (2010) Multi-time scale Markov decision process approach to strategic network growth of reverse supply chains. Omega 38(1–2):20–32.Crossref, Google Scholar
- (2020) A novel path planning algorithm for warehouse robots based on a two-dimensional grid model. IEEE Access 8:80347–80357.Crossref, Google Scholar
- (2009) Arbitrarily modulated Markov decision processes. Proc. 48h IEEE Conf. Decision Control (IEEE, Piscataway, NJ), 2946–2953.Google Scholar
- (2007) Storage-space capacitated inventory system with (r, q) policies. Oper. Res. 55(5):854–865.Link, Google Scholar
- (2006) Hydropower portfolios management via Markov decision process. 32nd Annual Conf. IEEE Indust. Electronics (IEEE), 2883–2888.Google Scholar

