The Exploration-Exploitation Trade-off in the Newsvendor Problem

Published Online:https://doi.org/10.1287/stsy.2022.0093

Abstract

When an inventory manager attempts to construct probabilistic models of demand based on past data, demand samples are almost never available: only sales data can be used. This demand censoring introduces an exploration-exploitation trade-off as the ordering decisions impact the information collected. Much of the literature has sought to understand how operational decisions should be modified to incorporate this trade-off. We ask an even more basic question: When does the exploration-exploitation trade-off matter? To what extent should one deviate from a myopic policy that takes the optimal decision for the current period without consideration for future periods? We analyze these questions in the context of a well-studied stationary multiperiod newsvendor problem in which the decision maker starts with a prior on parameters characterizing the demand distribution. We show that, under very general conditions in both perishable and nonperishable settings, the myopic policy will almost surely learn the optimal decision one would have taken with knowledge of the unknown parameters. Furthermore, in the perishable setting, we analyze finite time performance for a broad family of tractable cases. Through a combination of analytical parametric bounds and exhaustive exact analysis, we show that the myopic optimality gap is negligible for many practical instances.

Funding: The third author was partially supported by the National Science Foundation [Grant CMMI-1235023].

Supplemental Material: The online supplement is available at https://doi.org/10.1287/stsy.2022.0093.

1. Introduction

In many business settings, inventory management can be challenging because of uncertainty about underlying demand. A decision maker must build probabilistic models for future demand that are estimated based on past data. In practical operational situations, however, demand information is rarely accessible. Often, only access to some form of distorted demand is possible, and this distortion is usually impacted by prior operational decisions. An important example is demand censoring: in many retail environments, demand is only observable up to the level at which it can be fulfilled. As a result, firms most often only have access to sales data as opposed to the true, underlying demand. In such settings, inventory decisions affect not only immediate profits but also the amount of information collected for use in future periods.

Such a problem is a special instance of a more general class of dynamic learning problems, in which a decision maker faces the following question: Should he optimize his decision using the current information, or should he make a suboptimal choice with regard to instantaneous rewards that might improve his future knowledge of the system and lead to better performance in the future? This is known as the exploration-exploitation trade-off. The key concept is that decisions not only affect rewards but also information that is collected to inform future decisions.

Accounting for the informational impact of actions can considerably complicate decision making, and much of the literature addressing the exploration-exploitation trade-off seeks to obtain insights into understanding how this trade-off should impact operational decisions. In the present paper, we focus on a particular setting that has received significant attention in the literature—a multiperiod newsvendor problem with demand censoring—and ask an even more basic question: Does the exploration-exploitation trade-off matter in the first place—that is, to what extent does one need to account for future rewards and deviate from myopic decisions? Quite remarkably, we find that myopic decision making is “sufficient” for all practical purposes in a wide range of settings: there is almost no value of accounting for the impact of ordering decisions on information collection. We also articulate circumstances where this conclusion does not apply (i.e., the conjunction of conditions that are required for the value of accounting for the exploration-exploitation trade-off to be nontrivial).

We consider a multiperiod newsvendor problem, where a retailer sells perishable items1 and has to decide how many units to order for each time period. Unmet demand is censored. We assume that the retailer does not have complete knowledge of a vector of parameters characterizing the underlying distribution but rather learns about those as sales realizations are observed. We consider a Bayesian formulation where one may summarize the state of the system through the current belief over the parameters characterizing the demand distribution.

As we discuss in the literature review in Section 1.2, it is clear from existing studies that ignoring the exploration-exploitation trade-off is suboptimal and that one should order more than the myopic solution under fairly general conditions. This feature of optimal policies is known as “information stalking” (Lariviere and Porteus 1999). However, although the literature finds that the policy of information stalking is optimal in general, little focus has been given to quantitatively analyzing the value of information stalking. This is despite the fact that numerical experiments in earlier studies suggest that it may not be significant.2

Motivated by this, in the present paper, we analyze a fundamental question that arises in the presence of demand censoring: What is the value of optimally accounting for the exploration-exploitation trade-off?

1.1. Main Contributions

The main contribution of the present paper is to articulate a case for the viability and appeal of the myopic policy in Bayesian inventory problems. To do so, we proceed in two main steps.

In a first step, we analyze the problem under very general conditions with regard to the unknown parameters, the prior, and the demand distribution. In particular, the prior-demand pair need not admit any conjugate structure. In this general setting, we establish that the myopic policy learns almost surely in the long run the optimal decision one would have taken with full knowledge of the unknown parameters. In particular, this result shows that the myopic policy is always a viable alternative if the time horizon is large. The approach taken to prove the result is novel and different from typical approaches in parametric learning problems, where a basic building block is to establish that one can learn the true underlying parameters. Our approach does not rely on learning the true parameters directly but on the structure of the myopic policy in the specific context of the newsvendor problem and the analysis of martingales in Banach spaces. We further establish that the result continues to hold for the case of nonperishable inventory.

In a second step, we focus on the finite performance of the myopic policy. To do so, we narrow down the scope to one of the few known tractable versions of the censored newsvendor problem, Weibull demand with one unknown parameter, and a prior with a gamma distribution. We evaluate the relative performance of the myopic policy compared with the optimal policy, a quantity we refer to as the myopic optimality gap.

We develop a parametric upper bound on the myopic optimality gap through a recursive argument. The upper bound captures the dependence on the problem parameters such as the time horizon, the service rate, the uncertainty in the prior, as well as the shape of the demand distribution. We show that the upper bound becomes negligible for any extreme cases of the input parameters and characterize the exact rate of convergence to 0. In particular, we study the bound as the time horizon increases to infinity, as the service level increases to 1 or decreases to 0, and as the parameter uncertainty decreases. Quite notably, even when the inventory manager selects a low service level and demand is very often censored, the myopic optimality gap still vanishes to 0.

We then evaluate the myopic optimality gap in an exact manner for a grid of all input parameters: time horizon, parameter uncertainty, service level, and Weibull parameter. We show that it is negligible for most parameter combinations, often on the order of a fraction of a percentage point. The gap may only be significant when the problem primitives satisfy at the same time three very specific conditions: (i) high demand distribution uncertainty, (ii) low noise in demand realizations, and (iii) high myopic censoring probability. If any of these conditions is violated, there is almost no value in accounting for the exploration-exploitation trade-off.

In addition to the above-mentioned points, we note that implementing myopic policies is always relatively straightforward, whereas the computation of optimal policies may be intractable. In other words, the findings of this paper make a case that the myopic policy should not only be a computationally attractive but may also be a viable candidate heuristic in cases where the optimal policy is intractable.

By analyzing the evolution of ordering decisions and costs on simulated trajectories when the optimal policy is tractable, we develop intuition for the near optimality of the myopic policy. We find that it does not arise from “flatness” in the cost-to-go function of the problem. Indeed, we observe single-period costs vary substantially for both the optimal and myopic policies as they learn the demand. Instead, the similarity in performance stems from the fact that in most cases, both policies learn at approximately the same rate. The exception to this arises in the fragile confluence of conditions described in the preceding, where it is both possible to quickly learn demand and necessary (from a cost perspective) to do so; in these cases, the optimality gap is nontrivial.

The framework we use can also be leveraged to analyze other questions. For example, we show how one may leverage the bounds we develop to bound the cost of censoring, the difference when operating in the presence of demand censoring versus when operating while observing all demand samples. Furthermore, in an exact analysis of the cost of censoring akin to the one conducted for the myopic optimality gap, we find that it is always significantly higher than the myopic optimality gap. This suggests that effort to improve information collection about lost sales may be more valuable than analytic or computational effort to pin down the optimal dynamic policy in the presence of censoring.

1.2. Related Literature

A Bayesian approach to inventory management with observable demand was first proposed by Scarf (1959), leading to a formulation as a Markov decision process (MDP), with the state given by the current knowledge about the demand distribution. In such a case, decisions do not affect collection of information, and there is no exploration-exploitation trade-off; the main issue is to deal with the dimensionality of the state space. In its most general formulation, the problem has an infinite-dimensional state; it can, however, be written as a finite-dimensional MDP if the demand distribution is chosen from a parametric conjugate family. See also the work of Azoury (1985) for another early reference on the topic. Lovejoy (1990), in the absence of demand censoring, argued for the near optimality of a myopic policy when inventory is nonperishable. In that case, the policy is myopic with regard to the anticipation of stock and information evolution in future periods. In particular, there is no notion of an exploration-exploitation trade-off in the class of problems considered, as decisions do not affect the information that is collected. In our work, the main focus is on demand censoring and the exploration-exploitation trade-off it induces. In particular, a myopic policy in our setting is one that does not anticipate the implications of demand censoring for information collection and its impact on future performance.

When demand observations are censored, decisions and information collection are coupled, and the problem becomes more challenging. Most of the literature on inventory control with Bayesian learning has focused on understanding the censored case. The main difficulty is that Bayesian updates lose their conjugate property in the presence of censoring for most distribution families, and hence the problem becomes intractable. Many studies have therefore focused on showing structural properties of the optimal policies. Ding et al. (2002)3 study the censored newsvendor problem for general distributions and show that, under certain conditions, the optimal order quantity is greater than the quantity one would order to minimize the current single-period cost, also referred to as the myopic quantity. This result was then generalized by Bensoussan et al. (2009a) for arbitrary parametric distributions. The main point is that the exploration-exploitation trade-off leads to collecting more information for future periods—this is, broadly speaking, as far as the general case has been characterized.

There exists, however one well-known family of distributions that preserves the conjugate property even under censoring. First introduced by Braden and Freimer (1991), the newsvendor distributions have been frequently used as a benchmark to study the problem in a more tractable fashion. Lariviere and Porteus (1999) study the nonperishable inventory problem with newsvendor demands, and they find that, if demands follow a Weibull distribution—a particular case of a newsvendor distribution—a state space reduction technique can be applied, and the problem can be solved by backward induction. Bisi et al. (2011) analyze the same case for perishable and nonperishable inventory, and they develop a specific recursive formula for the optimal solution in the perishable case. They also show that the Weibull is the only case in which such space reduction can be applied.

The effects of demand censoring have been studied in a wide range of environments, focusing on different aspects of the problem. In particular, many extensions of the base newsvendor case have been analyzed recently. Dai and Jerath (2013) study the impact of inventory restrictions and demand censoring in sales force compensation contracts, and Heese and Swaminathan (2010) analyze a similar problem when the firm has complete control over the sales effort. Chen and Han (2013) analyze the effects of learning and censoring on the supply side. Jain et al. (2014) study the newsvendor problem with censored demands when sales transaction timing is also available; see also Caro and Gallien (2010) for the use of such timing data. Mersereau (2013) focuses on the conjunction of demand censoring with inventory record inaccuracies. Most of these papers rely on specific model assumptions in order to circumvent the complexity of the analysis. From the computational point of view, heuristics have been developed in the literature to overcome the intractability of the problem for general prior/demand combinations. Chen (2010) and Lu et al. (2005a) propose heuristics and provide bounds on the optimal quantity.

All of the above-mentioned papers focus on modeling uncertainty in a Bayesian framework. There is also a stream of literature that analyzes the design of policies under different informational assumptions, when there is no parameterization of the demand distribution. In such cases, performance is measured through regret (e.g., the gap between the performance of a policy and that of an oracle with knowledge of the demand distribution). Kunnumkal and Topaloglu (2008) and Huh and Rusmevichientong (2009) are examples of such approaches. In that line of work, the study by Besbes and Muharremoglu (2013) is related in spirit to the present study as it measures the impact of demand censoring on performance. However, it does so under different informational assumptions (nonparametric) and through the lens of minimax regret, and it only measures performance of policies through the growth rate of regret as opposed to finite time analysis. Under such assumptions, there is no universal notion of an exploitation-only policy (as such a policy would need to be defined with respect to an estimation method) and hence no exact metric to quantify universally and precisely the exploration-exploitation trade-off. In the present Bayesian setting, the decision maker has a prior on the demand distribution, and the exploitation-only policy is uniquely defined, allowing one to exactly quantify the exploitation-exploration trade-off. In turn, we are able to quantify the myopic optimality gap for arbitrary time horizons through fine parametric bounds that account not only for the time horizon but also for the specific informational structure operating regime (across different probabilities of being censored).

At a high level, the paper is also related to Lee et al. (2016), who show that stockouts have a limited impact on the inference process of a choice model in textbook retailing. Finally, we refer to the recent review of Chen and Mersereau (2015) for a broader overview of the literature on demand censoring.

Our paper relates to studies that have highlighted classes of learning problems where myopic policies can perform well. In the context of dynamic pricing with unknown demand parameters, Besbes and Zeevi (2011) and Broder and Rusmevichientong (2012) highlight the “well-separated” case in which a myopic policy is near optimal, as no incomplete learning may occur. Indeed, at any price, the possible demand curves do not intersect, and as a result, all prices are informative regarding the underlying true parameters. In related work, den Boer and Zwart (2015) show that even when the demand curves are not well separated, if there are capacity constraints, then variation in prices will be induced naturally through those, and a myopic policy will be near optimal. These three studies operate in a frequentist framework and analyze the regret of a myopic policy (compared with an optimal policy with knowledge of the true demand curve) through its growth rate with the time horizon. All studies rely on showing that the parameter estimates are suitably close to the true parameters, and in turn, the regret of a myopic policy is small. In the present paper, we operate in a Bayesian framework, so the decision maker has more information at hand. Furthermore, our general result regarding the almost sure convergence of myopic decisions does not rely on a well-separated assumption, as our approach is different. We actually do not even track the parameter estimates and rely on a martingale argument to establish that convergence to the correct decisions needs to take place. In particular, it could be that the demand distributions are not well separated, but in a newsvendor problem, this is not critical, as one, roughly speaking, only needs to learn a particular quantile. Finally, given a tractable family of problems (Weibull-Gamma), we are also able to evaluate numerically the worst-case myopic optimality gap for the full parameter space (which is not possible in the aforementioned frequentist studies, as there is no handle on the optimal policy). This highlights the quality of the myopic heuristic not only for long time horizons but for any time horizon.

The rest of this paper is organized as follows. In Section 2, we define the model and problem primitives. In Section 3, we conduct an asymptotic analysis of the decisions induced by the myopic policy. In Section 4, we turn to a finite time analysis, restricting attention to a tractable family of prior-demand distributions. In Section 5, we develop parametric upper bounds on the myopic optimality gap, and Section 6 is then devoted to an exact analysis. Section 7 presents an analysis of the cost of censoring. Section 8 concludes. All proofs are collected in the online supplement to this paper.

2. Problem Formulation

We consider a multiperiod newsvendor problem defined over a probability space (Ω,F,P), consisting of a discrete-time sequence of ordering periods indexed by t=1,2,. We assume zero lead time. At the beginning of each period, the retailer may order units; demand is then realized and is fulfilled to the extent possible. If the decision maker runs out of stock, demand beyond the initial inventory level is lost. On the other hand, we assume that inventory is perishable; that is, if items remain, these are discarded.4

The probability measure P is defined as follows. We assume that the demands are independent and identically distributed (i.i.d.), belonging to a family of distributions parameterized by a vector of unknown parameters θ, which takes values in the parameter set Θ. Given vector θ, the demand distribution has a probability density function denoted by f(·|θ) and a cumulative distribution function denoted by F(·|θ). Following a Bayesian approach, we assume that the decision maker has a prior distribution over θ.

The retailer incurs a cost h > 0 for each unit of unsold product and a penalty cost p > 0 for each unit of unmet demand. In other words, the single-period cost given stocking decision y0 and realized demand D0 is

L(y,D)h(yD)++p(Dy)+.

Let rp/(h+p) denote the critical ratio, also sometimes referred to as the service level. Without loss of generality, we assume that h = 1 and parameterize the cost structure through r(0,1):

L(y,D)(yD)++r1r(Dy)+.

The retailer can only observe sales, rather than full demand realizations. In period t, the observed sales are given by Dtytmin{Dt,yt}, where Dt is the demand realized and yt the ordered quantity. We denote by Ft the filtration generated by the censored demand process; that is, for t1,

Ftσ(D1y1,y1,,Dtyt,yt).

Note that the information collected about the demand up to period t is impacted by the past stocking decisions, y1,y2,,yt1.

We assume that E[D1]<, which implies that E[L(y,Dt)]< for all t=1, and any y0. We denote by Pc the set of nonanticipatory policies with respect to the censored information system—that is, policies for which the decision yt prescribed in period t is Ft1-measurable for all t1.

The objective is to minimize the cumulative expected cost over a finite time horizon of length T, and the optimal value is given by

VT*=infμPct=1TEμ[L(yt,Dt)],(1)
where the expectation is taken assuming that decisions are made according to the policy μ.

Because the items are perishable, decisions across periods are only coupled through the information collected about the unknown parameter θ of the demand distribution. Indeed, given knowledge of θ, the optimal decision is simply to minimize the single-period cost:

yi(θ)arg miny0E[L(y,Dt)|θ].(2)

We refer will to yi as the informed order quantity. On the other hand, in the absence of full knowledge of the demand parameter θ, when deciding on an order quantity in period t, the decision maker has to balance the impact of this decision on the current single-period cost with the impact on future costs that stem from the information collection process. This leads to the exploration-exploitation trade-off.

Computing an optimal policy that solves the optimization problem (1) is intractable in general, leading to the need for heuristics. One of the simplest possible such heuristics that is also computationally appealing is the myopic policy that seeks to minimize the single-period cost at each time period, given the past information at hand. In other words, the myopic policy focuses exclusively on exploitation, ignoring any exploration considerations. Formally, for each time period t0, the myopic policy prescribes the order size

ytmarg miny0E[L(y,Dt)|Ft1].(3)

As we see, both the informed and myopic policies admit similar structure, minimizing single-period costs; however, the latter does so only with past information observations rather than full knowledge of the demand distribution. In order to guarantee the existence and uniqueness of the informed and myopic policies, we will make the following technical assumption.

Assumption 1.

For P-almost every θΘ, the cumulative demand distribution F(·|θ) satisfies the following conditions:

  1. F(·|θ) is continuous over R+, and

  2. F(·|θ) is strictly increasing over R+. That is, for all 0x<y, F(x|θ)<F(y|θ).

We will assume throughout that Assumption 1 holds, and we formalize the characterization and uniqueness of the informed and myopic decisions in the following.

Proposition 1.

For P-almost every θ, the informed order quantity yi(θ) satisfying (2) exists and is unique, and further this order quantity uniquely satisfies

P(Dtyi(θ)|θ)=r.(4)

Furthermore, for each t1, P-almost surely, the myopic order quantity ytm satisfying (3) exists and is unique, and further, this order quantity uniquely satisfies

P(Dtytm|Ft1)=r.(5)

We note that the requirements of Assumption 1 are not necessary, however, and can be weakened. For example, Assumption 1 requires that the cumulative demand distribution be strictly increasing over all of R+. If, instead, the demand has support over a fixed subset of R+ for all θ and is strictly increasing on the subset, this is sufficient.

3. Asymptotic Analysis

Our goal is to establish the asymptotic, long-run optimality (as T) of the myopic policy. We will do so by comparison with the informed policy, which would be optimal given full knowledge of the demand distribution. Indeed, our ultimate goal will be to establish that, under general conditions, P-almost surely, ytmyi(θ) as t. In other words, we will establish that, asymptotically, the myopic policy learns and converges to the optimal order quantity given knowledge of the demand distribution.5 Note that this does not necessarily imply that the myopic policy asymptotically learns the true underlying demand distribution (i.e., learns θ) but rather that it learns a statistic of the underlying demand distribution (the r-fractile) that is sufficient for optimal decision making.

We proceed in two steps. First, we will establish that in the absence of any additional assumptions, the myopic policy orders in a way that it is censored asymptotically at the same rate as the informed policy. Then, under mild regularity assumptions on the demand distribution, we will show that this implies that, asymptotically, the myopic order quantity converges to the informed order quantity. In this way, the myopic policy is asymptotically optimal in a very general setting.

3.1. Asymptotic Rate of Censoring

To start, observe that (4) implies that the informed policy is asymptotically censored on a fraction 1r of the periods. Formally,

limT1Tt=1TI{Dtyi(θ)}=r,(6)

P-almost surely. This is an immediate consequence of the strong law of large numbers. The following result guarantees that this also holds for the myopic policy.

Theorem 1.

Under the myopic policy,

limT1Tt=1TI{Dtytm}=r,(7)
P-almost surely.

Theorem 1 is a consequence of the martingale strong law of large numbers. In particular, because of (5), the process

Wtτ=1t(I{Dτyτm}r),t0,(8)
is an Ft-adapted zero mean martingale. This, in combination with the martingale strong law of large numbers, implies that, P-almost surely, WT/T0 as T. This, in turn, yields (7).

3.2. Asymptotic Order Quantity

Theorem 1 establishes that in the absence of any additional assumptions, the myopic policy is censored at the same rate as the optimal informed policy over the long run. Although this is encouraging, it does not quite imply that the ordered quantity converges. In order to make further progress, we will need an additional technical assumption on the demand distribution.

We begin with some definitions. Define Σ to be the set of Lebesgue measurable subsets of the positive interval R+, and define B to be the Banach space of signed finite measures on (R+,Σ), equipped with the total variation metric,

μTVsupAΣ|μ(A)|+|μ(Ac)|,
for μB. Note that if μ,νB are also probability measures, then
μTV=1,μνTV=2supAΣ|μ(A)ν(A)|.

Now, we can define the B-valued random variable μθ:ΩB corresponding to the (unknown) demand distribution by

μθ(ω)(A)AdF(x|θ(ω))=P(DtA|θ(ω))
for each sample path ωΩ and set AΣ (we will sometimes suppress the explicit dependence on ω). In other words, μθ is the random measure corresponding to the distribution of demand given θ.

We will make the following assumption for the rest of this section.

Assumption 2

(Strong P-Measurability). Assume that the random measure μθ:ΩB is strongly P-measurable. That is, there exists a sequence of functions (fn), for n1, with each fn:ΩB taking the form

fn(ω)=i=1NnIAin(ω)bin,
where each binB is a measure, and each AinΩ is a subset of the probability space for each 1iNn, and for almost every ωΩ,
limnfn(ω)μθ(ω)TV=0.

In other words, μθ is the pointwise limit (in B) of a set of measure-valued functions each taking finitely many values.

Strong P-measurability (see, e.g., Hytönen et al. (2016)) is a technical condition that guarantees a mild degree of regularity of the demand distribution across realizations of θ. This assumption is very mild; for example, it is satisfied for classical newsvendor distributions that will be defined in Section 4.1. We will defer further discussion of this assumption to Section B.2 of the online supplement.

We have the following result.

Theorem 2.

Under Assumption 2, P-almost surely,

limtytm=yi(θ).

Theorem 2 establishes that the myopic policy is asymptotically optimal in the sense that the myopic order quantities converge to that of the informed policy (i.e., the optimal order quantity with full knowledge of the demand distribution). This holds in a remarkably general setting. For example, under the present hypotheses, it is not clear that θ, and thus the entire underlying demand distribution can be determined asymptotically. However, under the myopic policy, the r-fractile of this distribution can be determined, and this is a sufficient statistic for optimal decision making.

Although the proof of Theorem 2 is provided in Section B.3 of the online supplement, we outline here the main steps and key ideas of this proof.

  1. Consider the random measure μt corresponding to the posterior distribution of demand after observing sales in period t; that is,

    μt(A)P(DtA|Ft)
    for all AΣ. Under Assumption 2, this is a (B-valued) bounded martingale, and a Banach space version of the martingale convergence theorem establishes that μt converges P-almost surely to a limiting distribution μ (which may be different than the true, underlying distribution μθ). This pointwise convergence occurs under the total variation metric for B.

  2. Because μtμTV0 P-almost surely, it must be the case that the r-fractile of each μt converges to that of μ. Therefore, under the characterization of the myopic order quantity from Proposition 1, it must be the case that ytm converges also.

Theorem 1 demonstrates that the myopic policy is censored a fraction 1r of the time. However, an application of the Glivenko–Cantelli theorem implies that the only fixed order quantity with this property is yi(θ). Because ytm converges to a fixed order quantity, it must be then that ytmyi(θ).

3.3. Nonperishable Case

The asymptotic optimality result developed in the preceding also holds in the nonperishable case, where excess inventory carries over from one period to the next.

Here, we denote by xt0 the inventory at the beginning of period t1. We denote by yt0 the order-up-to6 quantity for period t, so that ztxtytmax(xt,yt) is the postreplenishment inventory. Given demand realization Dt, sales of Dtzt are observed, and the cost L(zt,Dt) is incurred. The initial inventory for the next period is given by

xt+1=zt(Dtzt)=(ztDt)+max(ztDt,0).

We denote by

Ftσ(x1,D1z1,y1,,Dtzt,yt)
the information filtration generated by the censored demand process. The objective we consider is to minimize the long-run average expected cost
limT1Tt=1TE[L(zt,Dt)].

Note that relative to the perishable problem formulation of Section 2, we have abused notation to redefine yt as an order-up-to quantity (rather than an order quantity) and Ft as a filtration in the presence of carryover inventory.

In this setting, we define a myopic decision maker to be unconcerned with the impact of decisions on future information or future inventory, and as a result, it is clear that a myopic order-up-to quantity can be defined according to (3), exactly as before. It is less obvious but also well known that when the demand parameter θ is known, ordering up to the quantity yi(θ) given by (2) is optimal, just as in the perishable case. This is because when the demand distribution is known, the demand realizations are i.i.d., and the constraint that ztxt1 is not binding.7 In other words, the multiperiod perishable newsvendor under known demand is completely equivalent to the nonperishable case (see, e.g., the discussion in Huh and Rusmevichientong (2009)). Moreover, under Assumption 1, Proposition 1 continues to hold.

In order to analyze the myopic policy in the nonperishable setting, first observe that the following analog of Theorem 1 continues to hold in the nonperishable case.

Theorem 1NP.

In the nonperishable case, under the myopic policy, P-almost surely,

limT1Tt=1TI{Dtytm}=r.

Theorem 1NP holds because the process Wt defined in (8) remains an Ft-adapted martingale in the nonperishable case—the key here is that the information filtration is strictly larger than for the perishable case because ztytm under the myopic policy. Hence, the event that Dtytm is Ft-measurable. Given this observation, Theorem 1NP follows from the same arguments as in the proof of Theorem 1 and is omitted.

Note that Theorem 1NP guarantees that even in the nonperishable case, the myopic policy will be censored asymptotically exactly on a fraction 1r of the periods. Then, the analysis of Section 3.2 can be repeated to obtain the following analog of Theorem 2 in the nonperishable case.

Theorem 2NP.

In the nonperishable case, under Assumption 2, P-almost surely,

limtytm=yi(θ).

In other words, in the nonperishable case also, the myopic order quantity asymptotically converges to the optimal informed order quantity. The proof of Theorem 2NP is omitted because it is exactly the same as that of Theorem 2.

4. Dynamic Programming Analysis

Our next goal is to analyze more finely the finite time performance of the myopic policy. As mentioned in the introduction, it has been shown under general conditions that it is optimal to “order more.” That is, in a multiperiod problem with censored observations, a decision maker will want to stock at a level higher than would be optimal to minimize the current single-period cost. Intuitively, it is desirable to explore by ordering more so as to reduce the likelihood of censoring and collect more information about the demand distribution for use in future periods. However, precisely pinning down the optimal order quantity is, in general, difficult. As a result, the exact form of what type of exploration ought to be conducted, and the value associated with it, are not known. To quantify the exploration-exploitation trade-off, we will isolate the value of exploring (i.e., ordering more) by comparing the performance of an optimal policy to that of the myopic policy which sequentially minimizes current single-period expected costs, fully ignoring the impact of decisions on information collection.

The cumulative cost of the myopic policy, formally defined in (3), is given by

VTm=t=1TE[L(ytm,Dt)].

By definition, VT*VTm, and we are interested in qualifying the relative gap between VTm and VT*, that is, the relative additional cost incurred by ignoring the exploration-exploitation trade-off and applying a full exploitation policy. We refer to this as the myopic optimality gap (MOG), formally defined as

MOGVTmVT*VT*.

An Upper Bound on the MOG.

The direct analysis of the MOG appears intractable in general. In the present section, we introduce an upper bound on the MOG that is more tractable. If demand were observable, the collection of information would be unaffected by the policy used by the decision maker. We denote by

Ftoσ(D1,y1,,Dt,yt)
the filtration associated with the full observation process. Let Po be the set of admissible policies for the full observation problem—that is, policies for which prescribed decisions at period t are Ft1o-measurable. The optimal cumulative cost for the uncensored problem is given by
VTo=infμPot=1TEμ[L(yt,Dt)].

In the uncensored case, the information collected about the demand distribution is independent of past decisions, and there is no exploration-exploitation trade-off. Hence, in this case, the problem is easy to solve: the optimal decision at time t is to minimize the current single-period cost. Moreover, clearly, PcPo, and it follows that VTo is a lower bound on VT*; that is,

VToVT*.(9)

We define the myopic cost of censoring (MCC) as

MCCVTmVToVTo.

It can be interpreted as the relative increase in cost stemming from censoring, when a myopic policy is applied, and we have

MOG=VTmVT*VT*VTmVToVTo=MCC.

That is, the MCC provides an upper bound on the MOG.

Next, we restrict our analysis to a subset of prior-demand distributions pairs that are conjugate to evaluate the MOG and MCC. In Section 5, we focus on characterizing the upper bound, MCC, through upper and lower bounds. The upper bounds apply directly to the MOG. From an analytical perspective, this approach circumvents the necessity to determine the optimal solution to the dynamic program associated with the censored demand problem. Instead, our analysis in this section relies on the performance of myopic policies (in the censored and uncensored cases), which are more amenable to analysis. In Section 6, we conduct exact analysis on the MOG.

4.1. Gamma-Weibull Demand and Dynamic Programming Formulation

4.1.1. Newsvendor Distributions.

We now assume that a scalar parameter θ>0 is unknown. Before ordering in period t, the knowledge about θ can be summarized by the current prior distribution density8 πt1(θ)π(θ|Ft1). The current prior is updated at every period, following the Bayes rule. One general class of distributions that preserves the conjugate property even under censoring are the so-called newsvendor distributions. A random variable is said to belong to the newsvendor family if its cumulative distribution function is given by

F(z|θ)1eθd(z),for allz>0,(10)
where d:R++R++ is a differentiable, nondecreasing, and unbounded function with d(0)=0. The parameter space is defined to be ΘR++, and the prior distribution of θ is assumed to be a Gamma distribution with shape parameter SR++ and rate parameter aR++. Formally, the prior distribution on θ has a density given by
π(θ|a,S)Saθa1eSθΓ(a),for all  θ>0.

It has been shown (Braden and Freimer 1991) that this family of distributions preserves its structure even under censored observations; that is, if πt(·) is a Gamma distribution, then so is the posterior πt+1(·|y,πt) with or without censoring. In particular, given a demand level z and order quantity y, the update rules in the censored case for Gamma distribution hyperparameters (a, S) are given by

a=a+I{z<y},S=S+d(zy),(11)
where I{·} denotes the indicator function and ·· denotes the minimum. Equation (11) offers a natural interpretation of the hyperparameters (a, S): the scale hyperparameter S accounts for the sum of (a function of) the sales observed in the system, whereas the shape hyperparameter a counts the number of fully observed demand realizations. The shape hyperparameter plays a central role in our analysis, as it measures the level of information about the unknown demand parameter θ: the coefficient of variation of the demand parameter θ given hyperparameters (a, S) is
CV(θ|a,S)=1/a.(12)

Hence, higher values of a indicate less uncertainty regarding the value of θ.

4.1.2. Assumptions.

For tractability, our analysis will focus on Weibull distributions—that is, newsvendor distributions with

d(z)z,for>0.

Note here that the parameter is assumed to be known to the decision maker, and the only uncertainty pertains to the parameter θ. For the Weibull family, the Bayesian update (11) is written as at+1=at+I{z<y} and St+1=St+(zy), and the predictive distribution (the distribution of demand conditional on the current belief (a, S) for the parameter θ) has density

m(z|a,S)=aSaz1(S+z)a+1(13)
and cumulative distribution
M(z|a,S)=1Sa(S+z)a(14)
for z > 0. The predictive distribution is integrable (i.e., the demand has finite expectation) if and only if a>1, and therefore we will assume that this condition is satisfied throughout the rest of the paper. Note that it suffices to verify that the initial prior distribution of θ satisfies a>1, because the update rule (11) then guarantees that this will continue to hold at all future times.

4.1.3. Finite-Dimensional Dynamic Programs.

Given current hyperparameters (a, S), we denote by VTo(a,S),VTm(a,S), and VT*(a,S) the optimal cost function under uncensored demand, the myopic cost function under censored demand, and the optimal cost function under censored demand, respectively. Similarly, we define C(a, S) to be the optimal single-period cost given current hyperparameters (a, S); that is,

C(a,S)miny0E[L(y,D)].(15)

In Equation (15), the expectation is taken with respect to the pair (D,θ), which follows a newsvendor distribution with parameters (a, S). Unless otherwise stated, this will be the convention for expectation terms in the remainder of the paper.

With this notation in mind, we can decompose the cost functions of interest using standard dynamic programming backward induction.9 To begin, the optimal cost function under censored demand must satisfy the Bellman equation:

VT*(a,S)=miny0 {E[L(y,D)]+E[I{D<y}VT1*(a+1,S+D)]+P{Dy}VT1*(a,S+y)},
for all (a, S) and all T1, with the terminal condition V0*(a,S)=0 for all (a, S). Similarly, the myopic cost function under censored demand must satisfy
VTm(a,S)=C(a,S)+E[I{D<ym}VT1m(a+1,S+D)]+P{Dym}VT1m(a,S+(ym)),(16)
for all (a, S) and all T1, with the terminal condition V0m(a,S)=0 for all (a, S). In (16), the myopic decision ym is defined to be a solution to (15) (i.e., ymargminy0E[L(y,D)]). Note that ym depends on (a, S), although we suppress this dependence in our notation. Finally, the optimal cost function when demand samples are observable must satisfy
VTo(a,S)=C(a,S)+E[VT1o(a+1,S+D)],(17)
for all (a, S) and all T1, with the terminal condition V0o(a,S)=0 for all (a,S). Here, we have used the fact that the optimal policy in this case is myopic.

5. Parametric Upper Bounds and Structural Analysis of the MCC

We aim to develop an upper bound on VTm(a,S)VTo(a,S). Rewriting the recursion for the myopic policy in the censored case (16) in a way that parallels the recursion for the observable demand case (17), one obtains, for all (a, S) and T1,

VTm(a,S)=C(a,S)+E[VT1m(a+1,S+D)]+ΓT1(a,S),(18)
where
ΓT1(a,S)(1r)VT1m(a,S+(ym))E[VT1m(a+1,S+D)I{Dym}].

Combining (17) and (18), one has that

VTm(a,S)VTo(a,S)=ED[VT1m(a+1,S+D)VT1o(a+1,S+D)]+ΓT1(a,S).(19)

The correction term ΓT1(a,S) may be interpreted as the additional cost incurred over the next period as a result of the presence of censoring.

As a first step toward bounding the performance difference VTm(a,S)VTo(a,S), we bound ΓT1(a,S) as follows.

Lemma 1.

Suppose that the demand distribution is a newsvendor distribution. For all (a, S), and all T1,

ΓT(a,S)k=1T(1r)k{C(a,S(1r)k/a)CTk+1o(a,S(1r)k/a)},
where Cto(a,S) is the future expected single-period cost, t periods from now, when demands are uncensored:
Cto(a,S)E[C(a+t,S+D1++Dt)].(20)

Notably, Lemma 1 offers a bound on ΓT(a,S) in which all terms depend only on future costs in the uncensored setting. We next use Lemma 1 recursively to provide an explicit bound on VTm(a,S)VTo(a,S) by exploiting the Weibull structure.10

Theorem 3.

Suppose that the demand distribution is Weibull with parameter . Then, for any >0,T1,r(0,1), a > 0, S > 0, with a>1,

VTm(a,S)VTo(a,S)S1/[Q(a,r,l)]1/λλT+11λ[log(1+Ta1/)+1a1/],(21)
where λ(1r)11a, and Q(a,r,) is a function depending only on a, r, and , such that Q(a,r,l)=O(1/a) as a, when r and  are fixed.

The theorem provides a bound on the (absolute) myopic cost of censoring as a function of the problem parameters: the time horizon T, the “uncertainty” parameter a, and the service level r. We first provide a high-level overview of the proof of the result and then analyze in detail the parametric dependence.

The proof of the result is based on two steps. We first develop an intermediate bound that connects the (absolute) myopic cost of censoring to the difference between the observable demand cost and the expected total cost when the demand parameter θ is known: VT1o(a,S)(T1)Co(a,S), where

Co(a,S)E[miny0E[L(y,D)|θ]]
is the expected optimal single-period cost assuming θ is known.11 The latter difference can be interpreted as the increase in cost stemming from the fact that θ is unknown, in an uncensored environment. In a second step, we bound the latter difference based on the problem parameters by analyzing information accumulation and its implications on performance in the observable demand case. A key feature of the bound is that it captures the dependence on the problem parameters: the time horizon T, the economic trade-offs captured by r, the prior information captured by a, and the demand distribution shape, captured by . We analyze some of these dependencies in detail next.

5.1. Parametric Dependence

We first start with the time horizon dependence. Note that if T = 1, information collection does not play any role, the myopic policy is optimal, and we have

VTm(a,S)VTo(a,S)VTo(a,S)=0whenT=1.

We now analyze the performance for general time horizons.

Proposition 2

(Time Horizon Dependence). Suppose that the demand distribution is Weibull with parameter . Then, for any >0,r(0,1), a > 0, S > 0, with a>1,

VTm(a,S)VTo(a,S)VTo(a,S)=O(T1logT)asT.

The result captures the dependence of the MCC on the time horizon T. The fact that the MCC is equal to 0 when T = 1 simply stems from the fact that, in a one-period problem, the censored and uncensored problems are identical, and the optimal decision is myopic. For large time horizons, the MCC vanishes at rate O(T1logT). This reflects two things: (i) Over time, in the censored environment, the optimal order will become closer to the myopic order (see Theorem 2), as in the uncensored environment. (ii) The cumulative effects of deviations from a myopic solution in the censored case can only affect performance minimally, not by more than O(T1logT) in relative terms. In particular, the MOG cannot be large in both the extreme cases when T = 1 and T and converges to 0 rapidly.

Proposition 3

(Prior Information Dependence). Suppose that the demand distribution is Weibull with parameter . Then for any >0,r(0,1), S > 0, T1,

VTm(a,S)VTo(a,S)VTo(a,S)=O(1/a)asa.

In other words, the MCC converges to 0 at a fast rate as a. This captures the behavior of the MCC in a regime in which there is very little prior uncertainty about the demand distribution parameter θ. In this case, the presence of censoring does not affect performance significantly, because there is little additional information to be captured.

Proposition 4

(Service-Level Dependence). Suppose that the demand distribution is Weibull. Then for any >0,a>0,S>0,T1 with a>1,

VTm(a,S)VTo(a,S)VTo(a,S)=O((1r)11/a)asr1,VTm(a,S)VTo(a,S)VTo(a,S)=O(r1/)asr0+.

The result gives a characterization of the asymptotic properties of the MCC as the service level r becomes close to 0 or 1—that is, as the holding cost becomes arbitrarily large with respect to the penalty cost, and vice versa.

If the penalty cost is large (i.e., r is close to 1), the myopic order quantity will be high (corresponding the r-percentile of the predictive distribution) as the decision maker attempts to mitigate the large penalties associated with stockouts. This implies that the myopic policy will often observe full demand realizations, and one expects that the presence of censoring should not impact performance significantly.

Note here that when r is close to 0, a newsvendor with knowledge of θ would incur optimal expected cost close to 0. Without the knowledge of θ, the myopic newsvendor will be very often censored, and it is not possible a priori to characterize the magnitude of the MCC, as this quantifies relative performance. Remarkably, however, when r is close to 0, and hence, when the myopic quantity leads the decision maker to be censored very often, the performance implications of being myopic are still very limited; the MCC shrinks to 0 at a rate O(r1/). This stems from the fact that the holding cost is very high, and every unit of unconsumed inventory becomes very expensive, making exploration (or “ordering more” than myopic) very expensive.

As a corollary of the above-mentioned results, both the myopic optimality gap and the cost of censoring become negligible at the boundaries of the input parameters of the problem—namely, when T = 1, T,a,r0, and r1. Intuitively, as r0 and r1, the newsvendor trade-off becomes weaker and weaker, and the one-period optimal ordering quantity converges to 0 and infinity, respectively. As T or a, there is less and less uncertainty about the unknown parameter. In Section 6, we quantify more finely the MOG by using an exact analysis to compute it for various ranges of the parameters of the problem.

Remark.

In Section D of the online supplement, we provide a lower bound on MCC that highlights that the upper bound presented in Theorem 3 captures the appropriate parametric dependence.

6. Exact Analysis of MOG

6.1. Scalability

A notable feature of the problem when demand has a Weibull distribution is that the single-period optimal cost possesses the so-called scalability property—namely, that

C(a,S)=S1/C(a,1),for all a>1/,S>0.

As observed by Azoury (1985) and Lariviere and Porteus (1999), this property can be extended to the optimal and full observation cost functions: for any a>1/, S > 0, T1,

VT*(a,S)=S1/VT*(a,1),VTo(a,S)=S1/VTo(a,1).

This separability allows the exact cost functions to be determined in the optimal and full observation cases given only knowledge of VT*(a,1) and VTo(a,1), respectively, for all values of a>1/. Furthermore, exact recursions have been developed for these two quantities in the existing literature (Azoury 1985, Lariviere and Porteus 1999, Bisi et al. 2011), and we will use those to evaluate these costs.

A similar reasoning yields that the myopic cost function VTm(a,S) also possesses the scalability property, and an exact recursion can be used to compute it, as summarized in the next result.

Proposition 5.

Suppose that demand distribution is Weibull. Then, for all a>1/, S > 0, T1,

VTm(a,S)=S1/VTm(a,1).

In addition, for all a>1/,

VTm(a,1)=C(a,1)+aa1(1(1r)11a)VT1m(a+1,1)+(1r)11aVT1m(a,1).

In the recursive equation provided by Proposition 5, VTm(a,1) can be computed exactly given VT1m(a+1,1) and VT1m(a,1). This implies that VTm(a,1) can be evaluated using backward induction, starting with the boundary condition V1m(a,1)=C(a,1). The value of C(a,1) itself has no closed-form expression but can be approximated to arbitrary accuracy by numerical integration for any a. This is analogous to the situation for the value function of an optimal policy in the censored and full observation cases.

Hence, using the scalability property, the cost functions VTm(a,S),VT*(a,S), and VTo(a,S) can be computed numerically using simplified dynamic programming recursions that are exact up to errors from numerical integration. In particular, no discretization of the state space is necessary.

6.2. Parametric Setup

We are interested in analyzing the behavior of the MOG across a broad range of input parameters. In the case of Weibull demand, the input parameters are given by

  • the Weibull parameter ,

  • the shape parameter a of the Gamma prior distribution,

  • the scale parameter S of the Gamma prior distribution,

  • the time horizon T, and

  • the service level r.

Given the scalability property discussed in Section 6.1, the value of the MOG does not depend on the scale parameter S, because it is a cost difference normalized relative to a baseline cost. We are therefore left with four parameters that summarize the three relevant dimensions of the problem: time (T), cost structure/service level (r), and the demand characteristics (a, l). We consider ranges for time and service level given by

1T100,r{0.1,,0.9,0.99}.

The demand parameters (a,) are more difficult to directly interpret. In order to clarify their role, it is convenient to consider two measures of demand uncertainty.

  • Conditional on the prior distribution hyperparameters (a, S), we measure the aggregate uncertainty in the next period demand (with predictive distribution (13) and (14)) through the coefficient of variation of demand (i.e., CV(D|a,S)). Note that CV(D|a,S) is a function of (a,) but does not depend on the shape parameter S.12 In particular, one may establish that

    CV(D|a,S)=Ea,S[D2](Ea,S[D])21=(2/)Beta(2/,a2/)((1/)Beta(1/,a1/))21,
    where Beta(·,·) refers to the Beta function. In the exponential case (=1), for example, one has that CV(D|a,S)=a/(a2) for a > 2.

  • The aggregate uncertainty is in part driven by the fact that even absent uncertainty regarding θ, the demand realizations themselves are random. This can be quantified by the coefficient of variation of demand given the parameter θ (i.e., CV(D|θ)). Here, demand is assumed to be distributed according to the distribution (10). Note that CV(D|θ) depends on but does not vary with θ. Indeed, if the random variable D follows the demand distribution (10) with parameter θ, then θ1/D follows (10) with parameter θ = 1. Thus, changing the parameter θ corresponds to a rescaling and leaves the coefficient of variation unchanged. One may establish that

    CV(D|θ)=E[D2|θ](E[D|θ])21=Γ(1+2/)(Γ(1+1/))21,
    where Γ(·) is the Gamma function. In particular, in the exponential case (=1), one has that CV(D|θ)=1 for all θ>0.

We define the uncertainty ratio (UR) as

UR(a,)CV(D|a,S)CV(D|θ).

Note that UR(a,) is always greater or equal than 1 and is the ratio of the overall aggregate uncertainty in the next-period demand to the uncertainty that would remain if θ was perfectly known. From this discussion, UR is a function of only (a,). For example, in the exponential case (=1), we have that UR=a/(a2) for a > 2.

We will parameterize demand uncertainty through input parameters (UR,). Note that we use UR rather than a because it is directly interpretable, as the relative value of uncertainty arising from the fact that θ is unknown. For example, if UR=3, the demand uncertainty would be reduced by a factor of 3 if θ were known.

On the other hand, we interpret the Weibull parameter as measuring the uncertainty of demand realizations given knowledge of θ (note that the parameter also affects the shape of the distribution). From (10), it is clear that larger values of lead to faster decaying tails for the demand distribution. Moreover, from the preceding discussion, CV(D|θ) depends only on . If, for example, increases, the coefficient of variation of the Weibull distribution (independent of θ) decreases, and hence there is less variation in demand realizations. The uncertainty about θ, however, remains constant; see (12). Because θ can be learned via demand observations, the potential value of exploration can be significantly affected by changing .

6.3. Analysis of the Myopic Optimality Gap

First, we consider the behavior of the MOG (i.e., the relative suboptimality of the myopic policy compared with an optimal policy).

6.3.1. General Weibull Case.

We are interested in analyzing the MOG. To obtain a broader understanding of the behavior and magnitude of the MOG, we next evaluate it for different values of the uncertainty ratio UR and service level r. In Figure 1, for varying choices of (r,UR) and varying values of the shape parameter , we depict the worst-case myopic optimality gap MOGwc over time horizons 1T100:

MOGwcmax1T100VTm(a,1)VT*(a,1)VT*(a,1).(22)

Figure 1. The Worst-Case Myopic Optimality Gap MOGwc as a Function of , r, and UR

Observe that the MOGwc tends to 0 for values of the service level r close to 0 or 1, consistent with Proposition 4. It is also clear that the MOGwc decreases as the uncertainty ratio UR decreases, consistent13 with Proposition 3. However, the most remarkable fact is the magnitude of the MOGwc: the maximum value of the MOG over all parameters tested is below 0.6% for =0.5, 0.14% for =0.7, and 3% for =3. It is notable that for these cases, this holds for any service level r. In particular, even when r is low and censoring occurs often, there is almost no value of deviating from the myopic policy. In summary, when  is not too high, the value of exploration beyond the myopic ordering quantity (or of “ordering more”) is negligible independent of the problem parameters. The MOG increases with the value of , with the worst cases in the instances tested given by =7 and UR=7. Except for these latter cases, the MOG is minimal. We explain the intuition behind these results in the next section.

6.3.2. Analysis of MOG Results.

We now explore in detail the intuition behind the results presented in the last sections. We start by analyzing, on a sample path basis, the evolution over time of the myopic and optimal order quantities. We measure the level of learning achieved by either policy, as the distance between the prescribed order quantity and the optimal order when θ is known. Specifically, if we let {ytm(D1t1)}t=1,,T denote the sequence of orders for a particular sample path D1T(D1,,DT), and y(θ)F1(r|θ) the optimal order when θ is known. We quantify errors in ordering quantities according to

MyopicErrortE[|ytm(D1t1)y(θ)y(θ)|],OptErrortE[|yt*(D1t1)y(θ)y(θ)|].

To assess the downstream impact that these errors have on costs, we define an alternative error measure. Let C(y|θ)E[L(y,D)|θ] be the expected cost of ordering y when the demand distribution parameter θ is known. We then define

CostErrortE[C(ytm(D1t1)|θ)C(yt*(D1t1)|θ)C(yt*(D1t1)|θ)].

In other words, CostErrort represents the relative distance in single-period cost between myopic and the optimal policies in terms of the true expected cost given θ.

For brevity, we fix T = 50 and UR=7, and we consider the =1 (exponential) and =7 cases for two values of r: r = 0.2 and r = 0.8. For each case we estimate the values of MyopicErrort,OptErrort, and CostErrort through Monte Carlo simulation. The results are depicted in Figure 2.

Figure 2. OptErrort,MyopicErrort, and CostErrort as a Function of t, for T = 50, UR=7, {1,7}, and r{0.1,0.8}

Consider first the =1 case. From Figure 2(a) two main conclusions can be drawn: first, there is nontrivial learning taking place, as there is at least a twofold reduction in MyopicErrort and OptErrort for all cases; because both policies prescribe higher orders for r = 0.8, the learning rate is faster in that case. Second, the curves of the optimal and myopic policies quickly become indistinguishable. In other words, both policies learn at very similar rates. This fact is reflected in the cost picture, Figure 2(c). The cost difference actually favors the myopic policy in the first periods (i.e., CostErrort<0), and then the optimal policy slowly takes over (using better information gleaned from the initial higher orders) through the rest of the time horizon.14

Consider now the case of =7. Recall that, in these examples, UR=7, and hence this case corresponds to the highest curve in the bottom right plots of Figure 1. These are the only cases where the MOG becomes significant. In these cases, the problem primitives satisfy two conditions: (i) there is little noise associated with demand (i.e., is large), and (ii) there is high uncertainty about the unknown demand parameter θ (i.e., UR is large). Suppose that we keep UR fixed, and hence the ratio between the two sources of uncertainty in the system is constant. When is large, the coefficient of variation of the demand distribution is small, and hence most of the uncertainty in the system comes from the prior distribution—that is, from the fact that θ is unknown. We are then in a situation where, if θ were known, demand itself would be highly predictable (roughly speaking, almost deterministic), and the subsequent demand-supply mismatch costs would be very low. This is evidenced in the steepness of the learning curves in Figure 2(b): particularly in the r = 0.8 case, most of the uncertainty around the true optimal order y(θ) vanishes after a few periods. Note that, in contrast to the =1 case, for r = 0.2, the learning rate of the myopic policy is markedly slower than that of the optimal quantity. This stems from the fact that even though very few demand observations suffice to nearly learn θ, the myopic censoring probability is high as a result of the low r, and hence the optimal policy can make a significant difference by ordering more at the beginning. This effect is clearly observed in the evolution of CostErrort in Figure 2(d): the optimal policy incurs a substantially higher cost in the first periods (CostErrort<10%) in order to quickly learn θ, and then this produces a substantial advantage in the following periods, where CostErrort>0. Overall, the total cost difference for the r = 0.2 case is no longer negligible—close to 10%, as shown in the bottom right picture of Figure 1.

To summarize, the prior discussion shows that the MOG can indeed be significant but only if the problem satisfies simultaneously very specific conditions: (i) there is little noise associated with demand, (ii) there is high uncertainty about the unknown demand parameter θ, and (iii) the holding cost h is high relative to the penalty cost p. In this situation, there is a high potential gain from exploring, and hence one may not ignore the exploration-exploitation trade-off. From a practical point of view, however, when the three aforementioned conditions are satisfied, one faces a problem of a qualitatively different nature than the typical newsvendor problem we started with: in these cases, demand is, roughly speaking, deterministic and can be essentially learned exactly with very few uncensored observations. Outside this specific case, the results in this section show that the MOG is almost uniformly negligible.

6.3.3. The Myopic Policy vs. Other Heuristics.

The objective of the paper is mainly to establish that the myopic policy, although possibly the simplest heuristic, is a viable alternative that performs well over all time horizons and for a very broad range of parameters. However, one may ask if alternative heuristics could be used. Some heuristics have been proposed in the literature. For example, in the nonparametric setting, Huh and Rusmevichientong (2009) propose a stochastic-gradient-based heuristic that one could also apply in our present setting. However, a comparison of performance to the myopic policy would be highly unfair as the myopic policy can make use of much more information through the knowledge of the demand distribution and the prior. A more fair comparison is to heuristics proposed in the Bayesian setting. Chen (2010) proposes heuristics for the nonperishable inventory case, one of which (proposed in Section 5.1) applies to the perishable case. In particular, it suggests to solve an equation based on the observable case to obtain a prescription for the censored case. The ordering quantity prescribed should solve

Ea,S[L(y,D)]+E[VT1o(a+1,S+D)]=(1+ρ)VTo(a,S)s.t.yytm(a,S),(23)
where ρ0 is a tuning parameter. We will call it the ρ-inflation heuristic. By definition, the prescribed ordering quantity is greater than the myopic ordering quantity.

In Figure 3, the uncertainty ratio UR is set to 3, and for different values of the Weibull parameter and varying choices of the service level r, we depict the worst-case optimality gap for the myopic policy and the ρ-inflation heuristic for two representative values of the tuning parameter ρ.

Figure 3. The Worst-Case Optimality Gap for the Myopic Policy and the ρ-Inflation Heuristic as a Function of and r for UR=3

We observe that the ρ-inflation heuristic appears to perform better than the myopic policy for high values of (=3 and =7) but that for smaller values of there is no uniform dominance. In particular, for ρ=10−3, the myopic policy performs uniformly better, and for ρ=10−4, the myopic policy performs better for low values of r and worse for high values of r. More important than the relative performance is the fact that both policies achieve very small optimality gaps uniformly over the parameter space. This further reinforces the viability and appeal of the myopic policy given the fact that it does not require the computation of any value function to implement it and that it possesses strong convergence properties as established in the earlier sections.

7. Analysis of the Cost of Censoring

The current framework and paper analysis can also be used to quantify the impact of censoring on the performance compared with what would be possible if demand samples were fully observable. This would be of interest if, for example, a firm is considering investments in technology to track lost sales.

In order measure the impact of censoring on performance, we introduce the cost of censoring (COC), defined as the relative difference between the optimal costs in the censored and uncensored systems:

COCVT*VToVTo.

One may note that

COC=VT*VToVToVTmVToVTo,
and hence the MCC bounds the COC, and all the upper bounds developed on the MCC in Section 5 apply to the COC.

We next assess the cost of censoring—that is, the relative cost of going from an uncensored to a censored environment. As in Section 6.3, we will consider the worst-case cost of censoring COCwc over values of the time horizon 1T100:

COCwcmax1T100VT*(a,1)VTo(a,1)VTo(a,1).(24)

In Figure 4, we plot COCwc as a function of the uncertainty ratio UR and the service level r, for different values of the Weibull parameter .

Figure 4. The Worst-Case Cost of Censoring COCwc as a Function of , r, and UR

Figure 4 shows that the overall shape and the behavior in extreme cases of the cost of censoring is similar to that of the myopic optimality gap. The main difference between the results in Figure 4 and those shown in Figure 1 is given by the magnitude of the gaps. The COCwc appears to be much larger than the MOGwc.15 To make a clearer comparison, Figure 5 combines the results of Figures 1 and 4 in single plot: the red lines represent the worst-case cost of censoring COCwc, and the increment of the blue areas represents the worst-case myopic optimality gap MOGwc, starting from COCwc (note that the sum of the two has no physical meaning and the plot is intended only for visual purposes to be able to compare the relative order of magnitude of COC and MOG). As we can see from the results, the MOG is not only low in absolute terms but also relative to the COC.

Figure 5. Comparison of the Worst-Case Cost of Censoring COCwc and the Worst-Case Myopic Optimality Gap MOGwc as a Function of , r, and UR
Note. The red lines represent the COCwc, and the blue areas represent the MOGwc, displayed with COCwc as a baseline.

7.1. Implications

At a high level, most practitioners are well aware of censoring but rarely fully recognize the exploration-exploitation trade-off, focusing more on attempting to record lost sales. The comparison in the preceding is informative in the following sense. It shows that the exploration-exploitation trade-off and need for forward-looking policies introduced by demand censoring (with the computational complexity that might be associated with it) is, for all practical purposes, a second-order problem compared with the value that might be generated by investing in processes and technology to uncensor (even partially) demand. An interesting future research direction would be to quantify analytically the differences between the MOG and the COC.

8. Concluding Remarks

In the present paper, we study the implications of demand censoring in inventory problems on optimal or near-optimal decision making, focusing on the perishable, or newsvendor, case. In particular, we study how censoring affects decisions and, in particular, how the exploration-exploitation trade-off introduced by censoring affects an (otherwise optimal) myopic policy. Through a combination of long-run asymptotic analysis of decisions and finite time analysis of performance in a more restricted family, we find that, for practical purposes, there is virtually no trade-off between instantaneous performance and information collection in this case: being myopic is “essentially” as good as optimal. From an operational standpoint, this surprising fact implies that there is no need to develop and apply optimal policies. The myopic policy is one of the easiest policies to apply in practice, and the present results suggest that it is a viable heuristic to apply in general, yielding near-optimal performance. Furthermore, it is worth noting that in general cases with nonconjugate families of distributions, the dynamic optimization problem becomes infinite dimensional, and obtaining an optimal policy is highly intractable, so it is not clear what other alternative one could use. An interesting avenue of future research would be to characterize finite time performance of the myopic policy for such general cases.

Acknowledgments

The authors are grateful to Shane Henderson as well as an anonymous associate editor and two anonymous reviewers for their feedback that helped improve the paper.

Endnotes

1 We anchor our analysis around this case but also extend some results to the nonperishable case.

2 See, for example, the experiments of Ding et al. (2002) and Chen (2010); the latter considers a case with nonperishable inventory, and the performance of the myopic policy is very rarely below 99% of the optimal. The numerical results in these papers show that the myopic policy can, in the specific instances considered, be near optimal. However, the generality of this fact is not established exhaustively—it is not the focus of those papers. The broad near optimality of myopic policies appears to be fundamentally unacknowledged in the literature.

3 See also the complementary note of Lu et al. (2005b).

4 In Section 3.3, we explore the nonperishable case.

5 Related is the work of Bensoussan et al. (2009b), who show the same convergence in the case of exponentially distributed demands with a Gamma prior. This special case is subsumed by our result.

6 Note that describing the inventory decision by an order quantity is equivalent to describing it by an order-up-to quantity; in this section, we will take the latter point of view.

7 If the initial inventory is high, this constraint may bind. However, once xtyi(θ) for some t, this will continue for all future periods, so in any event, yi(θ) is the optimal decision in the long run.

8 In order to keep our notation intuitive, we index prior beliefs using a forward-numbering scheme (i.e., we index by the current time), whereas we index cost-to-go functions using a backward-numbering scheme (i.e., we index by the remaining time horizon). The choice of indexing should be obvious by the context.

9 The existence of an optimal policy in this setting can be shown by applying proposition 3.4 of Bertsekas and Shreve (1978).

10 In what follows, given functions f(·) and g(·)>0, we say that f(x)=O(g(x)) as xa if limsupxa|f(x)|/g(x)<.

11 Compared with the definition of CTo(a,S) in (20), note that Co(a,S)=limTCTo(a,S).

12 Indeed, if the random variable D follows the predictive distribution (14) with parameters (a, S), then S1/D follows (14) with parameters (a,1). Thus, changing the parameter S corresponds to rescaling the random variable, leaving the coefficient of variation unchanged.

13 Note that for the exponential case, UR=a/(a2), and hence a as UR1. A similar conclusion holds for the general Weibull case.

14 Although it is not apparent from Figure 2(c), given the scale, CostErrort is positive in both cases for t20.

15 As in Section 6.3, one can observe that the most extreme cases are given in the lower right plot of Figure 4, particularly with higher values of UR. One can apply the same reasoning as before to explain the high gap values in this context: if there is high θ uncertainty and very low demand noise, an uncensored demand observation essentially reveals future demand values, leading to a problem of a different nature.

References

  • Azoury KS (1985) Bayes solution to dynamic inventory models under unknown demand distribution. Management Sci. 31(9):1150–1160.LinkGoogle Scholar
  • Bensoussan A, Çakanyildirim M, Sethi SP (2009a) Censored newsvendor model revisited with unnormalized probabilities. J. Indust. Management Optim. 5(2):391–402.Google Scholar
  • Bensoussan A, Royal A, Sethi SP (2009b) Bayesian and adaptive controls for a newsvendor facing exponential demand. Risk Decision Anal. 1(4):197–210.Google Scholar
  • Bertsekas DP, Shreve SE (1978) Stochastic Optimal Control: The Discrete Time Case (Academic Press, New York).Google Scholar
  • Besbes O, Muharremoglu A (2013) On implications of demand censoring in the newsvendor problem. Management Sci. 59(6):1407–1424.LinkGoogle Scholar
  • Besbes O, Zeevi A (2011) On the minimax complexity of pricing in a changing environment. Oper. Res. 59(1):66–79.LinkGoogle Scholar
  • Bisi A, Dada M, Tokdar S (2011) A censored-data multiperiod inventory problem with newsvendor demand distributions. Manufacturing Service Oper. Management 13(4):525–533.LinkGoogle Scholar
  • Braden DJ, Freimer M (1991) Informational dynamics of censored observations. Management Sci. 37(11):1390–1404.LinkGoogle Scholar
  • Broder J, Rusmevichientong P (2012) Dynamic pricing under a general parametric choice model. Oper. Res. 60(4):965–980.LinkGoogle Scholar
  • Caro F, Gallien J (2010) Inventory management of a fast-fashion retail network. Oper. Res. 58(2):257–273.LinkGoogle Scholar
  • Chen L (2010) Bounds and heuristics for optimal Bayesian inventory control with unobserved lost sales. Oper. Res. 58(2):396–413.LinkGoogle Scholar
  • Chen F, Han Y (2013) Rethinking the value of demand learning in a supply chain with censored demand. Presentation, 2013 INFORMS Annual Meeting, October 9, INFORMS, Catonsville, MD.Google Scholar
  • Chen L, Mersereau AJ (2015) Analytics for operational visibility in the retail store: The cases of censored demand and inventory record inaccuracy. Agrawal N, Smith S, eds. Retail Supply Chain Management (Springer, Boston), 79–112.Google Scholar
  • Dai T, Jerath K (2013) Salesforce compensation with inventory considerations. Management Sci. 59(11):2490–2501.LinkGoogle Scholar
  • den Boer AV, Zwart B (2015) Dynamic pricing and learning with finite inventories. Oper. Res. 63(4):965–978.LinkGoogle Scholar
  • Ding X, Puterman ML, Bisi A (2002) The censored newsvendor and the optimal acquisition of information. Oper. Res. 50(3):517–527.LinkGoogle Scholar
  • Heese S, Swaminathan JM (2010) Inventory and sales effort management with unobservable lost sales. Eur. J. Oper. Res. 207(3):1263–1268.Google Scholar
  • Huh WT, Rusmevichientong P (2009) A non-parametric asymptotic analysis of inventory planning with censored demand. Math. Oper. Res. 34(1):103–123.LinkGoogle Scholar
  • Hytönen T, van Neerven J, Veraar M, Weis L (2016) Analysis in Banach Spaces, Volume I: Martingales and Littlewood-Paley Theory (Springer International, Cham, Switzerland).Google Scholar
  • Jain A, Rudi N, Wang T (2014) Demand estimation and ordering under censoring: Stock-out timing is (almost) all you need. Oper Res. 63(1):134–150.LinkGoogle Scholar
  • Kunnumkal S, Topaloglu H (2008) Using stochastic approximation methods to compute optimal base-stock levels in inventory control problems. Oper. Res. 56(3):46–664.Google Scholar
  • Lariviere M, Porteus EL (1999) Stalking information: Bayesian inventory management with unobserved lost sales. Management Sci. 45(3):346–363.LinkGoogle Scholar
  • Lee J, Gaur V, Muthulingam S, Swisher GF (2016) Stockout-based substitution and inventory planning in textbook retailing. Manufacturing Service Oper. Management 18(1):104–121.LinkGoogle Scholar
  • Lovejoy WS (1990) Myopic policies for some inventory models with uncertain demand distributions. Management Sci. 36(6):724–738.LinkGoogle Scholar
  • Lu X, Song J-S, Zhu K (2005a) Inventory control with unobservable lost sales and Bayesian updates. Working paper, Duke University, Durham, NC.Google Scholar
  • Lu X, Song J-S, Zhu K (2005b) On “The censored newsvendor and the optimal acquisition of information.” Oper. Res. 53(6):1024–1026.LinkGoogle Scholar
  • Mersereau A (2013) Demand estimation from censored observations with inventory record inaccuracy. Working paper.Google Scholar
  • Scarf H (1959) Bayes solutions of the statistical inventory problem. Ann. Math. Statist. 30(2):490–508.Google Scholar
INFORMS site uses cookies to store information on your computer. Some are essential to make our site work; Others help us improve the user experience. By using this site, you consent to the placement of these cookies. Please read our Privacy Statement to learn more.