February 6, 2023 in Stochastic Optimization

A Universal Framework for Sequential Decision Problems

Welcome to the “Jungle of Stochastic Optimization”

SHARE: PRINT ARTICLE:print this page https://doi.org/10.1287/orms.2023.01.02

The operations research community has long adopted the standard format for describing a deterministic optimization problem, which might be stated as:

Thanks to George Dantzig, this language is spoken around the world and is the basis for any deterministic optimization model.

A similar language is also known to the field of machine learning, in which it is standard to search for a function (called a statistical model) f (x|θ) that is chosen to minimize deviations between a known input data x and a response y. Functions can be lookup tables, parametric (linear or nonlinear) or nonparametric models.

Now consider the vast range of problems that can be described as “sequential decision problems.” Examples of sequential decision problems arise in virtually every domain of human processes: business problems, economics, engineering and the sciences. Application domains include energy, health, freight or personal transportation, supply chain management, finance, e-commerce and the laboratory sciences. Decisions can act on a wide array of physical resources (inventories, equipment, people), financial resources (cash flows, pricing, investments, insurance contracts, loans), and the flow of information (running tests or experiments, marketing).

All of these problems can be described as:

 decision, information, decision, information . . . 

Each decision is made with information that is represented by a state variable, which evolves according to a set of equations called the transition function. After making a decision, we incur a cost or reward. Decisions are made with a method we call a “policy” that we will design later (this is a major change from the literature). Whereas deterministic optimization searches for the best vector “x,” with sequential decisions, we are searching for a function (that we call a policy) for making decisions. This parallels machine learning but is much richer.

Sequential decision problems arise in a vast range of settings; therefore, it is not surprising that it has attracted considerable attention from the research literature. In fact, I can make a case that these problems have been addressed by 15 distinct fields, each covered by at least one book and thousands of papers (illustrated by Figure 1). These fields are best known under names such as dynamic programming (or Markov decision processes), stochastic programming, stochastic search, optimal control, simulation-optimization, approximate dynamic programming, reinforcement learning, decision analysis and multiarmed bandit problems. I call this the “jungle of stochastic optimization.”

15 book covers
Figure 1. The "jungle of stochastic optimization" as seen in books covering 15 distinct fields.

These books use eight different notational systems with a wide range of modeling styles. Each features at least one method for making decisions, most frequently (but not always) based on Bellman’s equation. Many (but not all) use the concept of a state variable (without defining it). Decisions might be binary, discrete, or continuous or discrete vectors. It is possible to get a Ph.D. based on any one of these books and spend years (even a career) working in the field covered by any one of them.

Blind men and the elephantThe situation is reminiscent of the parable of the blind men and the elephant [1]. Experts view sequential decision problems in a way that reflects their background (and motivating applications) and prior training. The differences in languages and perspectives are deep and fundamentally affect how people approach these problems. They are not just cosmetic differences: the modeling frameworks often limit both the quality of a model and its applicability.

Every sequential decision problem can be modeled using five elements: state variables, decision variables, exogenous information variables, a transition function and an objective function. These basic elements can model any of the applications previously listed and cover every problem solved in the books discussed. I call this the “universal framework” for sequential decision problems.

A detailed presentation of this model is contained in my book “Reinforcement Learning and Stochastic Optimization: A Unified Framework for Sequential Decisions” (Chapter 9) [2], which builds on the thinking in some of my other chapters and articles [3-5]. I have used this modeling framework in large-scale applications in trucking and rail, energy applications from modeling energy storage to the entire power grid, public and private health, e-commerce, finance, and even the sequential design of experiments in materials science. The framework has been used in more than 100 undergraduate senior theses, and is the foundation for a rapidly growing startup, Optimal Dynamics [6].

An important feature of the modeling framework is the close relationship between the model and its implementation in software. This may seem surprising because this is precisely what happens in deterministic optimization and machine learning, but this is not standard in the communities that deal with sequential decision problems. This is most apparent in the field of Markov decision processes, which uses notation that has been popularized by the field of reinforcement learning. The canonical framework for a Markov decision process is to define a state space, an action space, a one-step transition matrix (which can never be computed) and a reward function (but without an objective function). The first three elements are not computable and have no counterpart in a software implementation.

It is surprisingly common for a field to define a policy without the objective function that is needed for evaluating and improving policies. For example, a stochastic program with scenario trees is a form of lookahead policy that still needs to be evaluated. A common mistake is to assume that if you optimize a lookahead model (such as a stochastic program), then the resulting solution is “optimal,” without realizing that it is not an optimal policy.

The Universal Framework

It is natural to ask: Is this framework really new? Out of the 15 fields, surely one of them uses this approach. The answer is: optimal control (in particular, stochastic control) comes close. My notation is different because the controls community likes to use “x” for state and “u” for decisions (or controls). Textbooks on optimal control, similar to those on dynamic programming, tend to pivot from optimizing over policies to assuming that policies are found using Hamilton-Jacobi equations (their name for Bellman’s equation). This is a major point of departure with my approach.

Easily the most confusing dimension of sequential decision problems is the need to search over policies. I suspect that the same confusion arose in the 1930s when Kantorovich first proposed a linear program – and no one knew how to search over vector spaces – a problem famously solved by Dantzig. I found the answer to the problem of searching over policies from the collection of contributions across all of the books (and thousands of papers) in stochastic optimization. However, I also found guidance from observing what people actually do in the real world.  

This search, supported by my own experience running dozens of industrial projects through CASTLE Lab at Princeton University, allowed me to identify four (meta)classes of policies, divided into two groups, as follows.

Group I: Policy Search. This strategy searches over parameterized functions to identify the ones that work best over time. These come in two classes:

  • Policy function approximations (PFAs). Analytical functions that map the information in the state variable direction to a decision. Some examples are:
    1. Order-up-to policies for inventories: If the inventory is below θmin, order up to θmax, and we then have to tune θ=(θmin, θmax).
    2. Buy low, sell high policies in finance.
    3. Linear decision rules are special problems where we can write a policy as:
      formula 2
    4. A PFA can be any lookup table, parametric function (linear or nonlinear, including neural networks) or nonparametric function.
  • Cost function approximations (CFAs). A parameterized optimization problem that is typically a deterministic approximation, in which parameters have been introduced to make it work well under uncertainty. CFAs are widely used in industry in an ad hoc way, but I have not been able to find this strategy formally studied in the research literature [7]. Some examples include:
    1. Scheduling aircraft using an integer program, while inserting slack in the schedule to account for weather delays.
    2. Scheduling nurses but limiting their time to 32 hours per week to provide slack in case emergencies arise.
    3. Solving the shortest path over a stochastic graph and then adding 10 minutes to the estimated travel time to make sure we arrive on time.

Group II: Lookahead approximations. These policies identify good decisions by optimizing across the current cost or contribution plus an approximation of the effect of a decision now on the future. Again, these come in two classes:

  • Value function approximations (VFAs). Policies based on VFAs cover all methods based on Bellman’s equation, which approximates the downstream value of landing in a state. This approach has attracted tremendous attention under names such as approximate dynamic programming, adaptive dynamic programming, neurodynamic programming and, most commonly today, reinforcement learning.
  • Direct lookahead approximations (DLAs). This is where we explicitly plan into the future to help make a decision now. DLAs can be split into two subclasses:
    1. Deterministic DLAs are when we ignore uncertainty to create a deterministic lookahead model, a strategy that is often called a rolling (or receding) horizon procedure, or model predictive control. It is possible to parameterize the lookahead to help make it more robust to uncertainty, producing a hybrid CFA/DLA.
    2. Stochastic DLAs create an approximate stochastic lookahead model, typically using sampled approximations of random outcomes. This covers stochastic programming (with scenario trees), robust optimization and approximate dynamic programming, which can be used to solve a simplified stochastic lookahead.

I am now going to make a very strong claim: These four (meta)classes of policies are universal – they include any method proposed in the research literature or used in practice. A good starting point is to ask how decisions are made now in the field.

I am convinced that none of these methods is a panacea – depending on the specific characteristics of a problem, any one of these may work best. Keep in mind that “best” involves more than simply which produces the highest objective function. Computational complexity, consistency, robustness and transparency are also important issues.

Given this broader toolbox, which policies are most popular? Without question, the academic community favors policies based on Bellman’s equation (primarily) and stochastic lookahead models. In practice, my experience is that PFAs, CFAs and deterministic DLAs (which may be parameterized) are the most useful. Just think of these as a rich class of parameterized policies that have to be tuned just as parametric models have to be fitted in machine learning. The policies are much easier to compute than those based on Bellman’s equation, but tuning can be hard. 

Today, a student can take a course in mathematical programming and emerge with a practical understanding of a broad set of methods. The same is true of machine learning and simulation. There are multiple books in each of these fields that communicate a broad set of methods that allow students to solve real problems with commercial software.

This is not true of sequential decision problems. Available books lack a common modeling framework, and our courses tend to teach very specialized methods, often based on concepts such as Bellman’s equation, that have limited usefulness.

I think we need a field that I would call sequential decision analytics. You can find an introduction to this field in conjunction with CASTLE Labs, as well as a wealth of resource materials including books, videos, tutorials and suggested courses [8, 9]. A recent book, which uses a teach-by-example style, was written specifically for an undergraduate course [10], and a helpful video to get started can be found on YouTube [11].

References & Notes

  1. https://allpoetry.com/The-Blind-Man-And-The-Elephant
  2. Warren B. Powell, 2022, “Reinforcement Learning and Stochastic Optimization: A Unified Framework for Sequential Decisions,” Hoboken, N.J.: John Wiley and Sons, http://tinyurl.com/RLandSO/.
  3. Warren B. Powell, 2014, “Clearing the Jungle of Stochastic Optimization,” INFORMS Tutorials in Operations Research: Bridging Data and Decisions, pp. 109-137, http://dx.doi.org/10.1287/educ.2014.0128.
  4. Warren B. Powell, 2016, “A Unified Framework for Optimization under Uncertainty,” INFORMS TutORials in Operations Research, pp. 54-83.
  5. Warren B. Powell, 2019, “A unified framework for stochastic optimization,” European Journal of Operational Research, Vol. 275, No. 3, pp. 795-821, https://doi.org/10.1016/j.ejor.2018.07.014.
  6. https://optimaldynamics.com/
  7. Policies using cost function approximations: https://tinyurl.com/cfapolicy/
  8. https://castlelab.princeton.edu/sda/
  9. https://castlelab.princeton.edu/sdalinks/
  10. Warren B. Powell, 2022, “Sequential Decision Analytics and Modeling: Foundations and Trends in Technology, Information and Operations Management,” NOW Publishing, https://tinyurl.com/sdamodeling.
  11. Warren B. Powell, 2022, “A Unified Framework for Sequential Decision Analytics,” YouTube, https://tinyurl.com/sdafieldyoutube.

Warren B. Powell

SHARE:

INFORMS site uses cookies to store information on your computer. Some are essential to make our site work; Others help us improve the user experience. By using this site, you consent to the placement of these cookies. Please read our Privacy Statement to learn more.