Optimal Local Storage Policy Based on Stochastic Intensities and Its Large-Scale Behavior

Published Online:https://doi.org/10.1287/stsy.2024.0093

Abstract

In this paper, we analyze the optimal management of local memory systems using the tools of stationary point processes. We provide a rigorous setting of the problem, and we characterize the optimal causal policy that maximizes the stationary hit probability. We then analyze a special case where the request processes come from a scale family and derive a suitable large-scale limit as the catalog size N when a fixed fraction c of items can be stored. In this limiting regime, the optimal policy amounts to comparing the stochastic intensity of the process with a fixed threshold, which is defined by a quantile of an appropriate limit distribution. We derive asymptotic performance metrics as well as sharp estimates for the prelimit case. Moreover, we establish a connection with optimal timer-based policies for renewal traffic and monotone hazard rates. We also present detailed validation examples of our results, including some closed-form expressions for the miss probability that are compared with simulations. We also use these examples to exhibit the significant superiority of the optimal policy for the case of regular traffic patterns.

Funding: This research was partially supported by the U.S. Air Force Office of Scientific Research [Grant FA9550-23-1-0350].

1. Introduction

In modern computing systems, a crucial component for improving performance at several layers is the use of a local memory, typically referred to as a cache. The rationale is that storing frequently used items at a readily available location can improve latency, reducing the cost of retrieval from a more remote location. This strategy is pervasive in computing systems: from local caching of instructions at the processor level to texture caching in graphics processing units, disk caching for quickly retrieving data from hard disks, content caching in web applications, geographically distributed caching in content delivery networks, and cloud storage gateways keeping readily available items stored in the cloud data centers.

A local memory consists of a certain amount of memory space that may store a subset of items locally and temporarily; the goal is to appropriately choose from a large catalog the subset of items more likely to be requested next. All of the aforementioned applications can be subsumed into this basic structure. In the case of homogeneous item sizes, the space constraint is reflected in the number of objects C that may be locally stored from the catalog of size NC.

Classical analyses of local memory systems have focused on modeling the (discrete-time) sequence of item requests. Based on this empirical sequence, the system must determine which items are stored and which must be evicted from memory to make room for others. In a stationary regime, one natural strategy would be to store at all times the C items with the highest popularity measured by their mean intensity of requests; this simple static policy requires, however, popularities to be known. A practical approximation is the least frequently used (LFU) eviction policy; here, item request intensities are dynamically estimated from the arrival sequence, and the least popular ones are evicted. Under mild assumptions on stationary arrivals, LFU will eventually converge to the static policy.

A popular alternative is the least recently used (LRU) policy, which keeps in memory the C most recently requested items. Upon receiving a request, it will store the item if not already present, and in that case, it will evict the oldest item in the sequence of requests. This method is better suited to handle highly correlated demands, where a requested item is more likely to be fetched again (i.e., bursty request patterns). Smoother variants have been analyzed, and also, combinations of both approaches, such as Least Frequent Recently Used (cf. Bilal and Kang 2017), have been proposed. We review the relevant literature below.

A main drawback of the classical analysis is that relevant time (continuous-time) information is neglected; by focusing only on the sequence of requests, the models ignore interrequest times that are important characteristics of the request processes. An alternative approach that gives time the center stage is timer-based (time-to-live (TTL)) caching policies; here, items are stored upon request and evicted only after a certain amount of time has elapsed because their last appearance in the request stream. This is a common approach in internet-based systems, such as domain name system queries and web applications.

A crucial step toward incorporating time information is the seminal paper by Fofack et al. (2014), where the incoming request stream is assumed to be a stationary point process on the real line. The mean intensities of the underlying request processes for each item capture their relative popularity, whereas by modeling the interrequest times, we can express different types of behavior; for instance, heavy-tailed interrequest times are well adapted to bursty arrival patterns. This approach has led to new insights in the analysis of both replacement and timer-based caching policies, which we also review below.

Within this modeling framework, we ask a natural question. What is the optimal memory management policy? By this, we mean the one that maximizes the hit rate, which is the frequency of successful retrievals from local memory. In Ferragut et al. (2016, 2018), the optimality question for TTL caching policies was investigated; in particular, it was shown that if requests form a general stationary simple point process, optimality may be characterized by the hazard rate function of the interarrival distribution. Later, Panigrahy et al. (2022) analyzed optimal replacement policies (i.e., fixed memory systems), and they identified the optimal local storage policy in terms of the stochastic intensity, a generalization of the hazard rate function.

In this paper, our goal is to develop this connection more extensively, bringing to bear the machinery of stationary point processes as in Brémaud (2020) to formally characterize causal memory management policies and the condition for optimality, in particular for general stationary request processes and correlation structures. Subsequently, our aim is to understand the large-scale behavior of the optimal replacement policy for large N under some appropriate assumptions: namely, that request processes are independent and that interarrival times for different items are drawn from a scale family parametrized by a distribution of intensities that has an asymptotic limit. We characterize the optimal memory management as a threshold policy on the hazard rates observed by the system, and we prove that the threshold has a deterministic limit. We also obtain a closed expression for the optimal performance in the large-scale limit as a function of the fundamental parameters of the scale family and the popularity distribution. As a further contribution, we investigate the properties of threshold policies before the asymptotic limit, showing that under monotonicity of the hazard rate function, they become equivalent to timer-based policies. As a consequence, we obtain the convergence of the two types of policies (replacement or TTL) to a common optimum in the asymptotic limit, a striking connection between both approaches. Our results are illustrated by a series of parametric examples, and their properties are further exhibited through stochastic simulations.

1.1. Related Work

Being essential to modern computing systems, the performance analysis of caching and local memory management has a long history. The seminal work on replacement-based algorithms with a fixed memory size started in King (1971) and Gelenbe (1973). Despite its apparent simplicity, the analysis of replacement policies is complex, even for a single cache; King (1971) gives an explicit expression of the hit probability of the LRU policy with exponential complexity. In Gelenbe (1973), it is shown that under the so-called independent reference model (IRM) assumption, where requests are independent and identically distributed, replacement policies, such as first in, first out, achieve the same hit probability. Exact computation is, however, intractable. Dan and Towsley (1990) provide an approximate computational procedure. Their method has been further extended in Rosensweig et al. (2010) to the case of networks of cache systems. Also, in Gast and Houdt (2015), another approach is given for list-based replacement policies.

A related line of work that includes replacement policies is analyses based on the move-to-front (MTF) rule, which is equivalent to LRU. A first step in this direction is the work by Fill (1996), which addresses the limit cost of the move-to-front rule. By exploiting the connection between MTF and LRU, Jelenković (1999) and Jelenković and Radovanović (2008) provide asymptotic expressions for the hit probability under the IRM assumption and Zipf popularities with parameter β. With a fundamentally different approach, based on Laplace transforms, the same limit result is obtained in Barrera and Fontbona (2010). Our asymptotic limit assumptions are related to this latter approach. Few results on replacement policies move beyond the IRM assumptions, such as the works by Jelenković and Radovanović (2003, 2004) and Jelenković et al. (2006), which show some insensitivity properties of LRU in a large-scale regime and dependent requests. However, their approach is only valid for light-tailed popularities, and it is related to our result below on optimal performance for the nonuniformly integrable popularity limit. Extensions to a network of caches working cooperatively to optimize performance are proposed in Borst et al. (2010), Ioannidis et al. (2010), and Ioannidis and Yeh (2016).

Because exact analysis of LRU is difficult, the most popular technique for approximating its performance is the so-called “Che approximation” introduced in Che et al. (2002) for web caches. The crucial point is to define a characteristic time representing the average permanence time common to all files. This assumption is valid in a large-scale regime, and Fricker et al. (2012) perform a second-order analysis to explain why this is a good approximation.

On the other hand, the analysis of policies based on timers is more recent and associated with the growth of internet caches. A first contribution in this line is Jung et al. (2003), with expressions for the steady-state hit probabilities for web pages; this line of work was later extended in Bahat and Makowski (2005) to include update delay. However, a crucial contribution is the introduction of point process theory to capture request correlations and in particular, move beyond the sequence of requests to a continuous-time model with general arrival processes. The foundations for this approach were laid down in Fofack et al. (2012, 2014). In Berger et al. (2014), the analysis is extended to a family of TTL policies with the focus of approximating LRU performance in linear and tree networks, and a different policy is proposed in Berger et al. (2015) with the aim of maximizing hit ratios by variance reduction. In Bianchi et al. (2013), the Che approximation is analyzed in a more general setting, and in Martina et al. (2014), it is extended using TTL cache tools to renewal arrivals, showing good accuracy in the case of small caches and Zipf-law popularities and making the connection between replacement and timer-based policies. The fact that timer-based policies decouple the analysis over independent request streams has led to more amenable generalizations to the case of cache networks (Panigrahy et al. 2017, 2020; Dehghan et al. 2019).

The search for optimal policies with respect to the hit rate has received recent attention. In Ferragut et al. (2016, 2018), the problem of finding the optimal timer-based policy is formulated and related to the hazard rate function of the interrequest times. Using tools from convex optimization, Ferragut et al. (2016, 2018) show that decreasing hazard rates lead to a nontrivial timer-based policy that achieves optimality and characterize its large-scale behavior. However, more regular request traffic with increasing hazard rates does not benefit from caching at all, a striking result that highlights the role of traffic regularity in local memory management systems. More recently in Ferragut et al. (2024), a generalization of timer-based policies to consider prefetching was introduced, leading to a nontrivial optimal policy for increasing hazard rates. A parallel research line developed in Panigrahy et al. (2022) considers the optimality question for replacement-based policies; the structure of the optimal causal policy is identified, incorporating for the first time the notion of stochastic intensity of point processes, a generalization of the hazard rate function. The current paper builds on these immediate precursors.

1.2. Main Contributions

The first contribution of this paper is a general foundation for the problem of optimal causal policies for the local memory or caching problem. The setup is the same as the one considered in Panigrahy et al. (2022), but we provide here a more complete mathematical treatment within the framework of point processes in the real line following Brémaud (2020). In particular, this approach enables us to properly define predictable policies in terms of the underlying natural filtration of the involved processes, which is a key technical issue to be addressed in the proofs. We also aim at greater generality, determining the optimal causal policy for a general superposition of request processes, including the correlated case, as a function of stochastic intensities of the underlying request processes and their correlation structure. We then specialize this result to independent sources.

The second most significant part of the paper concerns the asymptotic behavior of the optimal policy under a large-scale regime (i.e., when the item catalog size goes to infinity). Assuming that the request processes for different items are independent with stochastic intensities from a common scale family and that their popularities have a limiting distribution, we prove that the optimal policy converges to a threshold policy in the stochastic intensities, with a deterministic threshold that follows closed-form expressions. Armed with this result, we characterize the limit of the optimal miss rate in large-scale systems through an explicit formula. Therefore, we provide a universal asymptotic bound on performance for any practical policy.

A third contribution concerns the behavior of deterministic threshold policies before the asymptotic limit; under renewal traffic and additional monotonicity assumptions of the hazard rate function of the interrequest times, it is shown that previously analyzed timer-based policies are, in fact, of the threshold type and in particular, that under renewal assumptions, the optimal timer policy for decreasing hazard rates identified in Ferragut et al. (2018) achieves the universal bound. For the dual case of increasing hazard rates, we further analyze the timer-based prefetching policy introduced in Ferragut et al. (2024) and prove that it also achieves the universal asymptotic bound.

Although the preceding theory is developed for the case of homogeneous item sizes, we also explore the generalization to the heterogeneous case.

In our simulations section, we extensively analyze certain concrete parametric examples that illustrate the theory and serve as a benchmark for classical caching policies in comparison with the optimal. Finally, we provide some remarks on the practical challenges in implementing the optimal policy.

1.3. Organization of the Paper

The paper is organized as follows. In Section 2, we lay out the main tools of point processes and stochastic intensities required to analyze our system. We then define our local memory model, establish the framework for causal policies, and characterize the optimal policy in Section 3. Our main theorem describing the large-scale limit is presented in Section 4, and the ensuing universal performance bound is presented in Section 5. In Section 6, we describe the connection between the optimal policy and timer-based policies. In Section 7, we discuss the extension of the results to heterogeneous item sizes. Simulations and examples are presented in Section 8, and conclusions are given in Section 9.

2. Preliminaries and Notation

Throughout this paper, we will consider stationary point processes defined on a common probability space (Ω,F,P). We recall now some basic concepts that will be useful in the following and introduce our notation; we refer the reader to Brémaud (2020) for a thorough treatment.

A simple and locally finite point process Φ on the real line is a random and strictly increasing sequence of points Φ={τk}kZ satisfying limk±τk=±. More formally, Φ can be cast as a random counting measure (i.e., Φ=kδτk), a measurable map from (Ω,F)(M#(R),M#(R)). Here, M#(R) is the space of locally finite measures on R taking values in N{}, and M#(R) is the smallest σ-algebra such that for all Borel sets BB(R), Φ(B)=k1{τkB} is measurable. Moreover, Φ(B) is assumed to be finite for bounded B, and thus, it is a nonnegative integer-valued random variable. By definition, all points τk are different, and thus, Φ({x})1 P almost surely (a.s.) for all xR. In order to label the points, we follow the usual convention (Brémaud 2020), where τ0(Φ)0 and τ1(Φ)>0. With this convention, τ0=τ0(Φ) represents the first point before the time origin of the process Φ.1

Let St(Φ) denote the shift operator for measures in R (i.e., St(Φ)(B)Φ(B+t) for all Borel sets BR, where B+t{x+t:xB}). The point process Φ is stationary if St(Φ) has the same distribution as Φ for all tR. The mean measure of the point process is λ(B)E[Φ(B)]. If the process is stationary, then this measure is translation invariant and thus, a multiple of the Lebesgue measure on R (i.e., λ(B)=λm(B)). The constant λ is called the (average) intensity of the stationary point process. In what follows, we assume λ>0 to avoid the trivial case where the process has no points.

For a simple stationary point process Φ, an important measure is the Palm probability PΦ0. This is a probability measure defined in (Ω,F) that captures the stochastic behavior of the point process when the observer is synchronized with it. In particular, PΦ0(τ0=0)=1 (i.e., there is PΦ0-a.s. a point at the origin). We refer the reader to Brémaud (2020) for a formal definition.

The key relationship between the Palm probability and the stationary probability is the following inversion formula valid for any nonnegative real-valued measurable function f:M#(R)R+:

E[f(Φ)]=λEΦ0[0τ1f(St(Φ))dt].(1)

That is, in order to know the average value of a property in the stationary measure, we can integrate over one cycle of the process using the Palm measure and scale by λ.

The interarrival distribution of the point process Φ is defined as F0(t)PΦ0(τ1τ0t)=PΦ0(τ1t). Because the process is simple, F0 has support in R+. Moreover, EΦ0[τ1]=1/λ, which follows from taking f1 in (1).

A second important distribution is the age distribution, which is the age of the current interval when the process is observed at a point t not synchronized with it. Because the process is stationary, we can take without loss of generality t=0, and thus, the age distribution is just

F(t)P(τ0t)=λ0t(1F0(s))ds,(2)
a result that also follows from (1) with f(Φ)=1{τ0(Φ)t}. Note that these distributions are different in general because of the bias toward larger intervals when sampling in the steady state. A depiction of this sampling effect is shown in Figure 1.

Figure 1. Interarrival and Age Distribution of a Stationary Point Process

2.1. Stochastic Intensity

We now introduce the concept of stochastic intensity, which is crucial for the analysis in this paper. First, we need the following definition.

Definition 1.

A filtration {Ft}tR in (Ω,F) (i.e., an increasing family of σ-algebras contained in F) is a history of the simple locally finite point process Φ if Φ((a,b]) is Ft measurable for all a<bt. The natural history of Φ is Ft=σ(Φ((a,b]):a<bt), the smallest filtration satisfying this property.

We define the stochastic intensity of a process for a given history (Brémaud 2020, definition 5.1.1).

Definition 2.

Let Φ be a simple locally finite point process, and let Ft be a history of Φ. If there exists a locally integrable Ft-adapted process φ(t)0 satisfying

E[Φ((s,t])|Fs]=E[stφ(u)du|Fs]for all s<t,(3)
then φ(t) is called an Ft stochastic intensity of Φ.

The process φ(t) acts as the local likelihood of a point appearing at time t given past information. This notion will play a key role in our particular application. Also, directly from the definition, one can prove that E[φ(t)]=E[φ(0)]=λ, the average intensity of the process.

As an example, the stationary Poisson process of intensity λ with its natural history satisfies (3) with φ(t)λ, a deterministic constant, and thus, the likelihood of a point appearing in time is independent of the past, reflecting the total randomness property of the Poisson process.

We now highlight some key properties of the stochastic intensity that will be useful for our later analysis. The first one is related to predictability; for this, we need the following definition.

Definition 3.

Let {Ft}tR be a filtration on (Ω,F). The predictable σ-algebra P(F.) associated with Ft is the σ-algebra on R×Ω generated by the sets of the form

(a,b]×A,a<b,AFa.

A stochastic process X(t,ω) taking values on a measurable space (E,E) is Ft predictable if the mapping (t,ω)X(t,ω) is P(F.) measurable.

The following properties can be found in Brémaud (2020); any real-valued and left-continuous stochastic process adapted to Ft is Ft predictable. Moreover, the stochastic intensity of a point process Φ if it exists can always be chosen to be a.s. predictable up to a set of Lebesgue measure 0, and thus, it is essentially unique. Also, the following result provides a smoothing formula for predictable processes based on the stochastic intensity.

Theorem 1.

Let Φ be a simple point process in R with history Ft, and let φ(t) be an Ft-adapted and a.s.-integrable process. Then, φ(t) is an Ft-stochastic intensity of Φ if and only if

E[Z(t)Φ(dt)]=E[Z(t)φ(t)dt](4)
holds for all Ft-predictable processes Z(t).

Finally, we quote the following result, known as Papangelou’s formula (Brémaud 2020, theorem 7.7.5).

Theorem 2.

Let Φ be a simple point process in R with history Ft, stochastic intensity φ(t), and mean intensity λ. If Z(t) is an Ft-predictable process and f(z) is a real-valued function, then

E[f(Z(0))φ(0)]=λEΦ0[f(Z(0))].(5)

For the purposes of our analysis, an important random variable is the stochastic intensity observed when sampling the renewal process at a fixed time; for convenience, it is chosen to be t=0. We call it the observed stochastic intensity (OSI) and denote it by Xφ(0).

To find the distribution of the OSI, there are two underlying probability measures to consider.

Definition 4.

Let Xφ(0) be the observed stochastic intensity at time 0. Define

G(x)P(Xx)=E[1{φ(0)x}];G0(x)PΦ0(Xx)=EΦ0[1{φ(0)x}].(6)

Here, G(x) is the distribution of the stochastic intensity at a sampling point not synchronized with the process. G0 is the distribution of the OSI at the time of a request; because the stochastic intensity can be chosen to be left continuous, G0 identifies the stochastic intensity distribution prior to this arrival. A useful inequality that will be needed later is the following lemma.

Lemma 1.

The distribution of the OSI prior to arrival events satisfies

G0(x)xλ.(7)

Proof.

The result follows from the formula of Papangelou; taking Z(t)=φ(t) and f(z)=1{zx} in (5), we get

λG0(x)=λEΦ00[1{φ(0)x}]=E[φ(0)1{φ(0)x}]E[x1{φ(0)x}]x. □

2.2. Renewal Point Processes

As an important special case, consider now that Φ is a stationary renewal process, meaning that the interrequest time sequence {τk+1τk}kZ includes PΦ0 independent and identically distributed random variables with common distribution F0. We assume that F0 has a density f0 and recall the definition of the failure rate or hazard rate function associated with F0:

η(t)=f0(t)1F0(t).(8)

Then, if Ft is the natural history of the process, the stochastic intensity of Φ is given by Daley and Vere-Jones (2003, chapter 7):

φ(t)=η(tτ(t)),(9)
where τ(t)=sup{τk:τk<t} is the last point before t. Note that τ(t) is a left-continuous process. In particular, because of the renewal property, the local intensity of the process depends only on the age of the current interval and the hazard rate function of the interarrival distribution.

When F0 is the exponential distribution as in the Poisson process, η(t)λ, so the stochastic intensity is constant as previously stated. Another interesting parametric example that we will use for illustration is the case of Pareto interarrival times; it is presented below in Section 2.3.

In the renewal case, we can provide more explicit formulas for the distribution of the observed stochastic intensity Xφ(0)=η(τ(0)). In particular, at a time not synchronized with arrivals, we will have

G(x)P(Xx)=P(η(τ0)x)=P(τ0η1([0,x]))=η1([0,x])F(dt)(10)
because τ0F, the age distribution. We note that the set η1([0,x]) will be an interval if the hazard rates are monotone.

If instead, we are evaluating the OSI at an arrival time, we must use the Palm probability PΦ0, for which τ00 a.s. and τ(τ0)=τ(0)=τ1. Therefore, X=η(τ1), and its distribution is given by

G0(x)PΦ0(Xx)=PΦ0(η(τ1)x)=η1([0,x])F0(dt)(11)
because τ1F0 under the Palm probability.

2.3. Pareto Interarrival Times

An interesting parametric example of stationary renewal processes, which is considered in Ferragut et al. (2016) for the same application, is when interarrival times follow a heavy-tailed Pareto distribution. In this case, all of the above magnitudes have explicit expressions, which we now compute.

Example 1

(Renewal Pareto Process). In the above setting, choose

F0(t)=1(11+γt)α,f0(t)=αγ(1+γt)α+1(t0).(12)

Thus, F0 is a Pareto distribution (starting at zero) with tail parameter α>1. The number γ acts as a scale parameter, and by direct computation, in order to have EΦ0[τ1]=1/λ, it should be chosen such that γ=λα1.

From Equations (2) and (8), we can compute

F(t)=1(11+γt)α1,η(t)=αγ1+γt,(t0).(13)

For this example, note that the hazard rate function is decreasing for any choice of the parameters. Therefore, at an arrival time τk, the stochastic intensity increases (the hazard rate resets to η(0) following (9)), and a subsequent arrival becomes more likely. This gives rise to bursty traffic as depicted in Figure 2.

Figure 2. Stochastic Intensity of a Renewal Pareto Process Showing Decreasing Hazard Rates

For this process, we can also compute the distributions of the observed stochastic intensity at nonsynchronized or synchronized times:

G(x)=η1(x)F(dt)=αx1γF(dt)={(xαγ)α1,0xαγ,1x>αγ.(14)
G0(x)=η1(x)F0(dt)=αx1γF0(dt)={(xαγ)α,0xαγ,1x>αγ.(15)

In Figure 3, we depict the distributions F0, F, G, and G0 as well as the hazard rate function η for the case α=2 and γ=1 (with average intensity λ=1). In this particular case, the nonsynchronized OSI distribution is uniform in [0, 2].

Figure 3. Cumulative Distribution Function (cdf) Shapes for the Pareto Renewal Example with α=2,γ=1, and Therefore, λ=1
Notes. (a) Interarrival cdf. (b) Age cdf. (c) Hazard rate. (d) Stationary observed hazard rate cdf. (e) Observed hazard rate upon arrival cdf.

3. Local Memory Systems and Optimal Storage Policy

We start by rigorously defining our model for a local memory system. Requests from a catalog of N equally sized items are received (we defer the discussion of heterogeneous sizes to Section 7). We model item requests by stationary stochastic point processes Φi, i=1,,N, which are defined on a common probability space and with finite intensity λi>0, a measure of the popularity of item i. By appropriate labeling, we can choose λ1λ2 (i.e., the objects are ordered by decreasing popularities). Thus, the complete request process is the superposition Φi=1NΦi, and its total intensity is λN=i=1Nλi.

The local memory is limited, and thus, it can only keep readily available a subset of size C<N of the items. These items can upon request be served from the local memory with lower cost than retrieving them from a central repository. This formulation is quite general, and it subsumes the typical notion of caching, which is useful in many applications.2 Mathematically, the local memory can keep at any point in time a subset

C(t)={i1,,ik}{1,,N} with 0kC.

We call the process C:Ω×RPC(N) the storage process of the system, where PC(N) denotes the subsets of {1,,N} with size less than C.

For a given storage process, we can define the hit-counting process of the local memory system as

ΨC(B)=i=1NB1{iC(t)}Φi(dt)(16)

(i.e., the thinned process that counts the requests of objects only if they are stored in memory at the request time). The mean intensity hC of the process ΨC is called the hit rate, and accordingly, it satisfies hCλN.

The natural objective in this setting is to maximize hC (i.e., the rate at which requests can be served directly from the local memory) by choosing an appropriate policy that defines the storage process C(t). However, this policy should be causal (i.e., it cannot make use of future information). Otherwise, the policy that only stores the next arriving item achieves a maximum rate using only one unit of memory for any N, and the problem is trivial. This is where the predictability notion introduced in Section 2.1 becomes important.

Let us denote by Ft(i) the natural history of the ith request process Φi, and φi(t) is its stochastic intensity with respect to Ft(i). Define Ft=σ(Ft(i);i=1,,N) as the aggregated history (i.e., the information about all arrivals up to time t). Note that we are not assuming independence between the request processes. We have the following definition.

Definition 5.

Consider a local memory system with request processes {Φi:i=1N}, aggregated natural history Ft, and capacity C. A causal memory management policy is a stationary Ft-predictable stochastic process C(t), with values in the space PC(N) of all subsets of {1,,N} of size at most C (equipped with the discrete σ-algebra).

We would like to characterize the causal policy that maximizes the hit rate hC. We need the following assumption.

Assumption 1.

The aggregated process Φ is simple, and each process Φi admits a stochastic intensity φ˜i(t) with respect to the aggregated natural history Ft for all i=1,,N.

Note that the stochastic intensity φ˜i(t) need not be equal to the individual stochastic intensity φi(t) because of crossinformation across arrivals. However, the average intensity should still satisfy E[φ˜i(0)]=λi. The following example illustrates this fact.

Example 2.

Consider the following process; objects 1,N are requested alternatively one after another. When a request for item i arrives, an exponential distribution with parameter λi is started, and after expiration, the request for object i+1 arrives (with the cyclic convention N+11). Let Φi denote the requests for item i and Φ denote the total request process. Then, it is clear that Φi is a renewal process with interval length following a hypoexponential (phase-type) distribution (i.e., the sum of independent exponentials with parameters λi). In this case, by Equation (9), φi(t)=η(tτi), where η is the hazard rate function of the resulting hypoexponential, and it is the same for all i.

However, when looking at the aggregated history, it is clear that arrivals for item i can only come at rate λi after a request from process i1; therefore,

φ˜i(t)=λi1{τ(t)=τi1(t)},i=1,,N,
where τ(t) is the last point before t of the aggregated process.

Assumption 1 is also needed to guarantee the existence of φ˜; as an extreme example, consider in particular two processes Φ1 and Φ2=St0(Φ1) (i.e., a delayed version of the first with t0>0 fixed). Then, their stochastic intensities with respect to their own natural histories are just delayed versions of one another: φ2(t)=φ1(tt0). However, in the combined history of both processes, the stochastic intensity of process 2 is zero everywhere except that it degenerates into Dirac δ functions at the t0-shifted points of process 1. Hence, these processes do not have a stochastic intensity in the sense of Definition 2 (which requires a proper stochastic function φi(t)) with respect to the combined history of both processes. Assumption 1 ensures that these pathological examples with complete correlation are excluded.

We are now in position to compute the stochastic intensity of the hit process.

Lemma 2.

Let C(t) be a causal policy. The stochastic intensity of its associated hit process ΨC(t) defined by (16) with respect to the aggregated history Ft is

φC(t)=i=1Nφ˜i(t)1{iC(t)}.(17)

Proof.

We apply the smoothing Equation (4) in Section 2.1. For an Ft-predictable process Z(t), we write

E[RZ(t)ΨC(dt)]=i=1NE[RZ(t)1{iC(t)}Φi(dt)]=i=1NE[RZ(t)1{iC(t)}φ˜i(t)dt]=E[RZ(t)(i=1N1{iC(t)}φ˜i(t))dt]=E[RZ(t)φC(t)dt].

The second equality follows from the smoothing Equation (4) because by Assumption 1, the stochastic intensity of Φi with respect Ft is φ˜i(t). The last equality is because of the process Z(t)1{iC(t)} being Ft predictable. This is the case because C(t) is causal and thus, P(F.) measurable; therefore, Z(t)1{iC(t)} is also P(F.) measurable for all i.

Because the identity holds for any Ft-predictable process Z(t), the converse implication in the smoothing formula establishes (17). □

Define now the policy C*(t) as follows. At any point in time, rank the items by decreasing stochastic intensities φ˜i(t), and store the C highest; ties may be broken arbitrarily. Note that this policy maximizes the sum of the stochastic intensities of the items in memory

C*(t)argmaxCPC(N)iCφ˜i(t).(18)

Theorem 3

(Optimal Causal Memory Management Policy). Given a local memory system with capacity C fed by a family of stationary point processes, {Φi:i=1,,N} with aggregated natural history Ft, satisfying Assumption 1. Then, for any causal policy C(t), its stationary hit rate hC satisfies

hC=E[φC(0)]E[φC*(0)]=hC*,
with φC(t) as in (17) and C* as defined in (18). In addition, the hit process (16) is also simple, and the hit probability
HCPΦ0(ΨC({0})=1)
is maximized by the policy C*.

Proof.

Note from Equation (17) and the construction of C* that for any realization,

φC(0)=i=1Nφ˜i(0)1{iC}max{i1,,iC}i{i1.,iC}φ˜i(0)=φC*(0).(19)

The first inequality follows by taking expectations on both sides with respect to the joint probability measure P of the arrival processes (nonsynchronized with arrivals).

To derive the second statement, we begin by observing that by definition (Equation (16)), ΨC(B)Φ(B) for any BB(R), so ΨC must also be simple. Therefore, it suffices to show that for any causal policy, hC=λNHC (i.e., the hit rate is the total rate “thinned” by the hit probability). This is a natural property but nontrivial because the thinning is not independent of the arrival process.

From the superposition properties of stationary point processes (Brémaud 2020, example 7.2.11), under Assumption 1, the aggregated process is simple, and for any event A, we must have

PΦ0(A)=i=1NλiλNPΦi0(A).(20)

To interpret the above, note that given a point in Φ occurring at time 0, λi/λN is the probability that this point comes from process i. Now, apply (20) to the event A={ω:ΨC({0})=1}:

HC=PΦ0(ΨC({0})=1)=i=1NλiλNPΦi0(ΨC({0})=1).

We now compute PΦi0(ΨC({0})=1) as

PΦi0(ΨC({0})=1)=PΦi0(i{Φi({0})=1,iC(0)})=PΦi0(iC(0))
because under PΦi0, PΦi0({Φi({0})=1})=1 and PΦi0({Φj({0})=1})=0 for ji because of Φ being simple. We conclude that
HC=i=1NλiλNPΦi0(iC(0)).(21)

Now, return to (17) and its expectation at t=0:

hC=E[i=1Nφ˜i(0)1{iC(0)}]=i=1NE[φ˜i(0)1{iC(0)}].

Because 1{iC(t)} is Ft predictable for any causal policy, it follows from Papangelou’s formula (Equation (5)) that

E[φ˜i(0)1{iC(0)}]=λiEΦi0[1{iC(0)}]=λiPΦi0(iC(0)).

Summation over i gives together with (21), hC=λNHC as claimed, and the result follows. □

Thus, Theorem 3 gives a clean formulation of what the optimal policy is: just keep track of the most likely arrivals in the system given the complete past history of the arrival stream. In particular, this generalizes Panigrahy et al. (2022) to general (possibly correlated) superpositions that provided that they satisfy Assumption 1.

3.1. The Case of Independent Request Streams

In the preceding analysis, computing the optimal policy requires knowledge of the stochastic intensities {φ˜i(t):i=1,,N} with respect to the aggregated history. This may require the knowledge of the underlying correlations. This is greatly simplified if the request processes are independent.

Consider now a local memory system with capacity C fed by N-independent stationary request processes Φi with average intensities λi>0 as before (in decreasing order). Again, let Ft(i) be the natural history of the ith process, and let Ft be the aggregated history. Again, let PΦ0 denote the Palm probability of the superposition process, and let PΦi0 be the Palm probability of the ith process.

The superposition of independent simple processes is simple (Baccelli and Brémaud 2013, property 1.1.1), and besides Equation (20), we also have (Brémaud 2020, example 7.2.8)

PΦi0(Φ1Γ1,,ΦNΓN)=PΦi0(ΦiΓi)jiP(ΦjΓj)(22)
for any ΓiM#(R) and i=1,,N.

The interpretation of Equation (22) is the following; in order to compute Palm probabilities given that the point comes from process i, we must use the Palm probability for process i and the stationary probability for any other ji, and the processes remain independent.

This, in turn, allows us to prove the following property.

Lemma 3.

If Φ=iΦi is the superposition process, then Ft is a history of Φ. Moreover, φi(t) is a stochastic intensity for process i with respect to the enlarged history Ft, and the total stochastic intensity of Φ is simply φ(t)=i=1Nφi(t).

Proof.

First, Φ((a,b]) is Ft measurable for all a<bt because the sum function is measurable and Ft(i)Ft. Second, φi(t) is the stochastic intensity of Φi with respect to the shared history Ft because the conditional expectation given Ft(i) coincides with the conditional expectation given Ft for any random variable independent of Ft(j) with ji. This fact is, in turn, a consequence of the independence of the filtrations {Ft(i)}i=1N and that Ft is generated by events of the form i=1NAi with AiFt(i), i=1,,N. Finally, that φ(t) is the stochastic intensity of Φ(t) with respect to Ft follows immediately by linearity of conditional expectation. □

The above lemma guarantees that the stochastic intensity of process i is not altered by superposition with other independent processes; there is no crossinformation between the processes. In particular, Assumption 1 is satisfied, and therefore, we can replace φ˜i(t) by φi(t) in Theorem 3 to obtain Theorem 4.

Theorem 4

(Optimal Causal Memory Management Policy, Independent Case). Given a local memory system with capacity C fed by independent stationary point processes, {Φi:i=1,,N} with individual stochastic intensities φi(t). Then, the optimal causal policy C* corresponds to the following.

  • At any point in time, rank the items in decreasing order of their individual stochastic intensities φi(t).

  • Store in memory the first C objects in the ranking.

Any other causal policy C will satisfy hChC* and HCHC*.

The above theorem characterizes the structure of the optimal causal policy for any superposition of independent request processes (i.e., where no additional information is available about the future other than the natural history of the requests and where there is no crossinformation between them). If the processes are correlated among them, this statement will not be true in general, and we will have to keep track of the shared stochastic intensity with respect to the common history φ˜i(t).

We now analyze some examples where the above policy can be further characterized. The simplest is the Poisson case.

Example 3

(Poisson Arrivals). In case all processes Φi are Poisson with λ1λN, φi(t)λi. By Theorem 3, the optimal policy C* is the static policy that stores the C objects with higher (average) intensities at all times. This is, of course, connected to the total independence property of Poisson processes, and it is also well known in the caching literature (see, e.g., Garetto et al. 2016) under the name independent reference model.

Example 4

(Renewal Arrivals). In the more general class of renewal arrival processes, by Equation (9), φi(t)=ηi(tτi(t)). In this case, the optimal policy can be recast in the following way.

  • Rank all contents in decreasing order of the current interval hazard rates, ηi(tτi(t)).

  • Store in memory the first C objects in the ranking.

As we can see, the role of the hazard rates is crucial as already identified in Ferragut et al. (2016) for timer-based caching. Different monotonicity assumptions on these hazard rates lead to completely different optimal policies as we shall see in Section 6.

An issue with the optimal policy is that its main performance metric, the hit rate, cannot be computed exactly except in some special cases. We now derive an asymptotic result that characterizes the optimal policy for large-scale systems and allows us to calculate asymptotic performance limits.

4. Large-Scale Analysis of the Optimal Causal Policy

In this section, we will present results concerning the asymptotic behavior of the optimal policy in a large-scale regime, where both the system load and the memory size scale appropriately to infinity. For this purpose, we need to introduce more structure into the problem; in particular, we will assume that requests for different items come from independent processes from a common scale family in the following sense.

Consider a base process with unit intensity Φ0, natural history Ft(0), and stochastic intensity φ0(t) with respect to Ft(0). In particular, E[φ0(t)]=1. For our asymptotic analysis, we specify our scale family as satisfying the following.

Assumption 2.

The request processes Φi for items i{1,,N} are independent with stochastic intensities

φi(t)=dλiφ0(λit),
where λi is the process average request rate. Without loss of generality, we will assume that these rates are decreasing (i.e., λ1λN>0); equivalently, items are ordered in decreasing popularity.

A construction that generates processes Φi satisfying this assumption is to take independent copies of Φ0 and set Φi(B)=dΦ0(λiB) (i.e., rescale the time axis).

It is direct from Definition 6 and Assumption 2 that if Xi=φi(0) is the observed stochastic intensity for process i, then the following scaling relationships hold:

G(i)(x)P(Xix)=P(φ0(0)x/λi)=G(x/λi);(23a)
G0(i)(x)PΦi0(Xix)=PΦ00(φ0(0)x/λi)=G0(x/λi).(23b)

Example 5

(Scaled Renewal Process). A simple example where all of the above assumptions hold is when the base process is renewal with some base distribution F0(t) with a mean of one. In this case, the interarrival distribution of the ith process is

F0(i)(t)=F0(λit).

We can also express the age distribution F(i)(t) and hazard rate function η(i)(t) for the ith request process in terms of the base case F(t), η(t):

F(i)(t)=λi0t(1F0(λis))ds=0λit(1F0(u))du=F(λit);η(i)(t)=f0(i)(t)1F0(i)(t)=λif0(λit)1F0(λit)=λiη(λit).

4.1. Threshold Characterization of the Optimal Policy

Under Assumption 2, the optimal policy of Theorem 4 can be recast as follows; at any given time, we have a sample of random variables Xi,i=1,,N, each representing the observed stochastic intensity (for the renewal case, hazard rate since the last request) of the ith request process. Because the processes are independent, the Xi’s are independent but nonidentically distributed; in fact, XiG(i).

According to Theorem 4, an item will be stored in memory if and only if its OSI is one of the C highest. An alternative way to cast this optimal policy is to consider the empirical distribution of the observed stochastic intensities defined by

G^N(x)=1Ni=1N1{Xix};(24)
then, object i is locally stored if and only if G^N(Xi)NCN=1CN.

Introducing the quantile function (inverse of the empirical distribution)

Q^N(p)inf{x:G^N(x)p},p[0,1],(25)
the condition for storing item i in memory at a certain time may be expressed as
Xiθ^NQ^N(1CN).(26)

The random variable θ^N acts as a threshold in the OSI that determines the items to optimally store in local memory. Equivalently, θ^N=YNC, where {Yi:i=1,,N} are the order statistics of the sample {Xi,i=1,,N}.

We will pursue asymptotic results for a system with a very large catalog size N. If we can find a suitable limit for the empirical distribution G^N(x) as N and if the memory size scales linearly with N such that CNNc, then the quantile θ^N should approach a limit (i.e., the large-scale behavior should resemble a policy with a fixed threshold in OSI).

Because the Xi’s are not identically distributed, we cannot use classical Glivenko–Cantelli arguments for empirical distribution convergence. Nevertheless, we will show that a suitable limit arises under Assumption 2 if the system load as defined by the request intensities also scales appropriately with N.

4.2. Scaling of the Request Intensities

We will construct a sequence of systems indexed by N, the number of arrival streams or in other words, items in its catalog. Denote by {λi(N)}i=1N the arrival rates of the system of size N, with the above convention that λ1(N)λN(N)>0. We may think of these points as a discrete measure on the axis λR+ of possible process intensities. Normalizing this measure to total unit mass, we may write its distribution function

LN(λ)1Ni=1N1{λi(N)λ}.(27)

The total arrival rate for system of size N can then be expressed as

λN=iλi(N)=N0λLN(dλ).

Our limit theorems will assume that this family of discrete distributions has a limit with N.

Assumption 3.

There exists a fixed distribution L with no atoms at λ=0 such that LNL when N. Here,  denotes usual weak convergence of probability distributions.

To gain some intuition on the above condition, we turn to the following important example.

Example 6

(Scaling for Zipf Popularities). A widely used model of popularity among different items is the so-called Zipf distribution with the request rates λiiβ, where β0 is the tail parameter of the Zipf law. Note that β=0 corresponds to a uniform distribution, whereas as for large β, relative popularities decay very fast.

In order to model this in our setting, we can use the following arrival rates for the Nth system:

λi(N)=(Ni)β
with β0. Under this scaling, the least popular object has intensity 1 for all N. As N grows, larger intensities are included in the mix. Now, for any λ>1, we have
1LN(λ)=1Ni=1N1{(Ni)β>λ}=1Ni=1N1{i<Nλ1/β}=1NNλ1/βNλ1/β.

Because the above convergence is point wise and the limit is continuous, we have LN(λ)L(λ) given by

L(λ)=1λ1/βfor λ1.(28)

In the limit, the popularities follow a continuous distribution over [1,): namely, a standard Pareto distribution with tail parameter 1/β.

If β1 (i.e., the popularities are light tailed), some objects are extremely more popular than others; in this case, L does not have a finite mean. If instead, 0<β<1, where popularities are heavy tailed and thus, more homogeneous, L has finite mean 1/(1β). For β=0, the system degenerates into every object having the same popularity, and thus, L is the step function at λ=1.

It is also worth observing that the total arrival rate of the Nth system satisfies

λN=i=1Nλi(N)=Nβi=1N1iβNβSN(β),
where SN(β) is the generalized harmonic series partial sum. Using the well-known equivalents for this series, we have that
λN={O(Nβ)if β>1,O(N log N)if β=1,O(N)if β<1.

In particular, with our scaling, the total arrival rate λN as N, albeit at different rates depending on the tail parameter β.

4.3. Asymptotic Behavior of the Optimal Causal Policy

We now return to our family of systems indexed by N with independent request streams coming from a common scale family with base distribution F0 (Assumption 2) and incorporate the scaling in Assumption 3 on the arrival rates {λi(N)}i=1N. Namely, their empirical distribution LN(λ) in (27) has a weak limit L(λ).

If we look at each system at a fixed time t=0, we obtain a sample of the current observed stochastic intensities {Xi(N):i=1,,N}. Considered collectively that for all N1, these random variables constitute a triangular array; without loss of generality, we may assume that they are all defined in a common probability space (Ω,F,P).

For each N, we can define the random function G^N by Equation (24). The main result below concerns the asymptotic behavior of these empirical stochastic intensity distributions.

Theorem 5.

Consider a family of local memory systems indexed by N with request processes satisfying Assumption 2 and with intensities {λi(N)}i=1N satisfying Assumption 3. Then,

G^NNGPa.s.,
where the function G is given by
G(x)0G(x/λ)L(dλ).(29)

Moreover, assume that the memory size of the Nth system satisfies CNNNc with 0<c1.

Then, if 1c is a continuity point of the quantile function Q=G1, the random threshold θ^N defined by Equation (26) converges P almost surely to θ*=Q(1c).

Proof.

The proof begins by computing the expected value of the random function G^N:

G¯N(x)E[G^N(x)]=E[1Ni=1N1{Xi(N)x}]=1Ni=1NG(i)(x)=1Ni=1NG(xλi(N)),(30)
where we have applied the scaling in (23a). G¯N is a (deterministic) distribution function in x0. Using the definition of LN, we can rewrite the above as
G¯N(x)=0G(x/λ)LN(dλ).

We first show that as distribution functions, G¯NG as N. To do so, it is convenient to interpret (30) as follows; consider a pair (X,ΛN) of independent random variables in R+ with respective distribution functions G(x)=P(Xx) and LN(λ)=P(ΛNλ). Then, G¯N(x) is the distribution of the product ZN=XΛN. Indeed,

P(ZNx)=n=1NP(XΛNx|ΛN=λi(N))P(ΛN=λi(N))=1Nn=1NP(Xx/λi(N))=G¯N(x).

Now, consider the limit in the distribution of the pair (X,ΛN). Because of their independence and because LNL, the limit corresponds to the distribution of (X,Λ), a vector of independent random variables with marginal distributions G(x) and L(λ). By continuity of the map (x,λ)xλ, we conclude that ZN=XΛNdZ=XΛ. It remains to compute the distribution function of the latter product:

P(Zx)=R+21{ξλx}L(dλ)G(dξ)=0L(dλ)01{ξx/λ}G(dξ)=0G(x/λ)L(dλ).

We conclude that G¯NG as N. Equivalently, we have point-wise convergence G¯N(x)G(x) at any continuity point x0 of G.

To relate the mean function G¯N(x) to the stochastic one G^N(x), we now resort to Shorack (1979, theorem 2.1), a generalization of the classical Glivenko–Cantelli theorem for empirical distributions for random variables that are not identically distributed. The theorem states that in the probability space (Ω,F,P) where the triangular array {Xi(N):i=1,,N,N1} is defined, we have

G^NG¯NN0P a.s.(31)

In particular, with P probability 1, |G^N(x)G¯N(x)|0 as N for all x0. Combining this with the previously obtained point-wise convergence of G¯N(x), we conclude that P almost surely G^N(x)G(x) at any continuity point x0 of G, and thus, G^NG.

Finally, convergence of θ^N now follows from the fact that convergence in distribution implies convergence of quantiles; specifically (van der Vaart 1998, lemma 21.2), G^NG is equivalent to Q^N(p)Q(p) for all continuity points p[0,1] of Q=G1. See also Proposition A.2 in Appendix A.1. □

We have thus established that the optimal causal local memory policy converges in the large-scale limit to a deterministic threshold policy in the observed stochastic intensities. Specifically, when memory scales as a fraction c of the catalog, a large-scale system should store at any given time the items whose current OSI exceeds a threshold θ* chosen as the 1c quantile of the limit distribution G in (29).

5. Asymptotic Optimal Performance

In this section, we analyze the performance achieved by the optimal policy in the large-scale limit. As discussed in Section 3, performance in local memory systems is measured by the hit probability HC or equivalently, the hit rate hC=λNHC.

We will find it more convenient to derive expressions for the complementary miss probability MC=1HC and the miss rate mC=λNMC=λNhC. Referring back to Equation (21), we can write the following expressions for these quantities in a system of size N.3

MC(N)=i=1Nλi(N)λNPΦi0(iC(0)).(32a)
mC(N)=i=1Nλi(N)PΦi0(iC(0)).(32b)

The key challenge in the evaluation of the above formulas is to compute PΦi0(iC(0)) (i.e., the probability that an incoming arrival does not find its file in local memory). In the optimal policy C* for fixed N, the determining condition is whether the observed stochastic intensity Xi(N) of the requested item finds itself among the C largest. To correctly evaluate this probability, however, we must recognize an asymmetry. The competing Xj(N),ji are being evaluated at a time not synchronized with their arrivals and thus, follow independent distributions G(·/λj(N)); for the item in question, we must use the distribution G0(·/λi(N)) that applies to synchronized sampling of the stochastic intensity.

For this reason, to analyze the comparison, it is convenient to define the empirical distribution

G^N(i)(x)=1N1ji1{Xj(N)x}(33)
of the stochastic intensities of nonrequested items and express the miss condition as follows:
iC(0)ji1{Xj(N)>Xi(N)}Cji1{Xj(N)Xi(N)}N1C.G^N(i)(Xi(N))1CN1pN.

The above equivalence assumes that there are no ties in the comparison of the OSIs; to simplify the analysis to follow, we will make this a standing assumption in this section.

Assumption 4.

For each N1 and 1ijN, PΦ0(Xi(N)=Xj(N))=0.

Returning to (32b), we may now express the performance criterion of the optimal policy as

mC*(N)=i=1Nλi(N)PΦi0(G^N(i)(Xi(N))pN)=i=1Nλi(N)PΦi0(Xi(N)Q^N(i)(pN)),(34)
where Q^N(i) is the inverse (quantile) function of G^N(i).

We now outline informally the essence of the analysis that follows. Because the empirical distribution G^N(i) corresponds to N1 intensities, which are sampled at a nonsynchronized point, its asymptotic behavior under the assumed scaling should follow the conclusions of Theorem 5 (i.e., converge to the distribution G). Under C/Nc, we have pN1c, so we should have Q^N(i)(pN)Q(1c)=θ* as N.

This leads us to consider the approximate formula

mC*(N)i=1Nλi(N)PΦi0(Xi(N)θ*)=i=1Nλi(N)G0(θ*/λi(N));
in the last equality, we have invoked the distribution of Xi(N) synchronized with the arrival under the Palm probability PΦi0. Of course, the approximation is not a rigorous step. We have taken the limit in the quantile but not in the rest of Equation (34); nevertheless, it will help us arrive at the right conjecture.

Invoking the distribution of intensities in (27), we may express the approximation as

mC*(N)N1Ni=1Nλi(N)G0(θ*/λi(N))=0λG0(θ*/λ)LN(dλ).(35)

Under Assumption 3, LNL. The integrand function λG0(θ*/λ) is bounded by θ*, invoking (7); if it is continuous, the right-hand side will converge to the corresponding integral with the limit distribution L(λ). This formula for the asymptotic optimal miss rate is the main conjecture that we will prove in Theorem 6 below, making rigorous the preceding approximate reasoning.

We will require some additional technical conditions.

  • For continuity of λG0(θ*/λ), it will suffice to guarantee it at the atoms of the measure L(dλ) (i.e., the discontinuities of L(λ)); denote this set by DL, and recall by Assumption 3 that 0DL. Analogously, denote by DG0 the set of discontinuities of G0, and denote by DL·DG0{λx:λDL,xDG0}; we will require for our limit theorem that θ*DL·DG0.

  • To carry out the limit result for mC* in (34), we will treat separately two cases in regard to the uniform integrability of the family of popularity distributions {LN}N1. The family is uniformly integrable if

    ϵ>0,K0 such that supN11Nλi(N)Kλi(N)ϵ.

Uniform integrability is a standard assumption (see, e.g., Billingsley 1999, section 1.3) that connects convergence in distribution with convergence of the first moment. In particular, if LNL and the {LN}’s are uniformly integrable, then

0λLN(dλ)N0λL(dλ)<.(36)

A partial converse also holds; if LNL, with first moments satisfying (36), the family must be uniformly integrable.

Example 7.

The Zipf family considered in Example 6 has continuous limit L and thus, satisfies all necessary continuity assumptions. Furthermore, the family LN is uniformly integrable for β<1; for β1, the limit distribution has infinite mean.

We are now ready to state our main performance result.

Theorem 6

(Asymptotic Miss Rate—Uniformly Integrable Case). Suppose {LN}N1 is uniformly integrable. Let c(0,1] be such that 1c is a continuity point of Q and that θ*Q(1c)DL·DG0. If C/Nc as N, then

limNmC*(N)N=0λG0(θ*λ)L(dλ).(37)

The proof is given in Appendix A.

The result states that for a uniformly integrable scaling of popularity distributions, under some regularity assumptions, the miss rate of the optimal policy scales linearly with N with a proportionality constant given by the integral in (37), where the distribution function G0 of the OSI under synchronous sampling appears explicitly. On the other hand, the nonsynchronous distribution G of the OSI also influences the formula because it determines the distribution G in (29), whose quantile is the asymptotic threshold θ*.

Corollary 1

(Asymptotic Miss Probability—Uniformly Integrable Case). Under the same hypothesis as Theorem 6, the optimal miss probability satisfies

limNMC*(N)=0λG0(θ*λ)L(dλ)0λL(dλ).(38)

Proof.

Note that

MC*(N)=mC*(N)λN=mC*(N)/NλN/N;
the limit of the numerator is given in Theorem 6. For the denominator, we use (36) because λNN=0λLN(dλ), and we have uniform integrability. □

Note regarding Equation (38) that the numerator is bounded by θ*; indeed, we have λG0(θ*/λ)θ* from Lemma 1, and L has unit mass. As the first moment of this distribution (in the denominator) becomes larger, the miss probability is smaller. This suggests that for L with infinite first moment, the miss probability will be zero. This is indeed true, but we must provide a separate argument because uniform integrability will not hold in this case. Our strategy will be to show that a suboptimal policy (the static one with Cs={1,,C}) already achieves vanishing asymptotic miss probability.

Theorem 7

(Asymptotic Miss Probability—Nonintegrable Case). Suppose that 0λL(dλ)=. Assume that C/Nc(0,1] as N. Then,

MC*(N)MCs(N)N0.(39)

Proof.

The first inequality is immediate by optimality of C*. So, we focus on the static policy, where misses are certain for i>C; therefore,

MCs(N)=i=C+1Nλi(N)λN=1Ni=C+1Nλi(N)λN/N=0αNλLN(dλ)0λLN(dλ),(40)
where αN=LN1(pN) is the pNth quantile of LN and pN=1CN.

Because LNL and pN1c<1, we have that AsupNαN<. Then,

limsupN0αNλLN(dλ)limN0AλLN(dλ)=0AλL(dλ)<.(41)

For the denominator, we have (Billingsley 1999, theorem 3.4)

liminfN0λLN(dλ)0λL(dλ)=.(42)

Together, (41) and (42) imply (39). □

As a comment on the above results, we note the following. When the distribution of item intensities is not uniformly integrable as, for example, in the Zipf case with β1, items with the largest intensities dominate the rest; thus, it is natural that the static policy would achieve asymptotic optimality and perfect performance regardless of the interarrival distribution. Note, however, that such convergence may be slow; we refer to Section 8 for examples.

In the more interesting case of uniform integrability with less disparate popularities, any causal memory management policy will incur some performance penalty; the minimum miss probability achieved by the optimal policy can be computed explicitly from the interarrival distribution characteristics and the storage fraction c.

6. Threshold Policies and Their Timer-Based Counterparts

In Section 4, we found that the optimal causal policy could be expressed in terms of a threshold on the observed stochastic intensity. This threshold is stochastically varying but converges in our large-scale regime to a deterministic constant.

This motivates us to look at policies that are defined by a deterministic threshold (i.e., where one keeps in memory items with stochastic intensity larger than θ). An immediate observation is that in such policies, the memory constraint C would not be satisfied in a strong way at all times. Rather, we must replace it with the soft constraint where the average memory occupation is C. We now make this formal.

6.1. Threshold Policies and Memory Occupation

Consider a local memory system with independent request processes {Φi:i=1N}, aggregated natural history Ft, and stochastic intensities φi(t). Consider the following threshold policy:

Cθ(t)={i:φi(t)>θ}P(N)(43)

(i.e., the store in memory of all of the items that have current stochastic intensity above the threshold θ).

Because the Φi’s are stationary, the intensities φi(t) are also stationary, and thus, we have a stationary and causal policy. The memory occupation of this policy is now the stationary random process:

Uθ(t)=#Cθ(t)=i=1N1{φi(t)>θ}.(44)

Its average can be computed as (using t=0 as a sampling point)

E[Uθ(0)]=E[i=1N1{φi(0)>θ}]=i=1NP(φi(0)>θ).(45)

Note that E[Uθ(0)] is decreasing from N to zero as θ goes from zero to . In order to have a fair comparison against a fixed memory constraint, define the threshold θC as

θCinf{θ:i=1NP(φi(0)>θ)C}.(46)

Thus, θC is the 1C/N quantile of the average of the distributions of the stochastic intensities observed at time 0. From the right continuity of the distribution functions, it is easy to check that E[UθC(0)]C with equality if 1C/N is a continuity point of the quantile function. We shall assume this henceforth to simplify exposition.

Because the arrival processes are independent, the random variables φi(0) are independent, and thus, we have the following proposition.

Proposition 1.

Assume that the above quantile is exact in the sense that E[UθC(0)]=C. Take C=cN with 0<c1. Then, for any ε>0,

P(|UθC(0)Nc|>ε)2e2Nε2,
and therefore, UθC(0)/NPc as N.

Proof.

The proof follows from the Hoeffding inequality applied to the independent Bernoulli random variables {1{φi(0)>θ}}. In fact, for any N and ε>0,

P(|UθC(0)Nc|>ε)=P(|i=1N1{φi(0)>θC}E[i=1N1{φi(0)>θC}]|>Nε)2e2Nε2. □

From Proposition 1, we conclude that under independent request processes, the memory occupation of the threshold policy will not deviate much from a fixed memory policy for large N. We will use this to our advantage in Section 8.

6.2. Asymptotic Optimality of Threshold Policies

We focus now on the situation studied in Sections 4 and 5, where by Assumption 2, the request processes come from a common scale family. Furthermore, the traffic intensities satisfy Assumption 3.

In that case, φi(0)=Xi, which is the observed stochastic intensity with distribution G(i)(x). Therefore, θC is the solution to the equation

i=1N[1G(i)(θC)]=C.

Invoking Equation (23a) for the scale family and rearranging terms, we have

G¯N(θC)=1Ni=1NG(θCλi)=1CN
with the notation introduced in (30); invoking the distribution LN of traffic intensities λi, we may also write the above equation as
0G(θCλ)LN(dλ)=1CN.(47)

We are now in position to prove the following.

Proposition 2.

Consider a family of local memory systems indexed by N with request processes satisfying Assumptions 2 and 3. Choose the memory size of the Nth system as CN=cN with 0<c1. If there exists a unique solution to θ* satisfying

G(θ*)=1c,
then the sequence of thresholds θNθCN defined by (47) satisfies θNθ*.

The essential step of the proof is to show that G¯NG; this part is identical to the proof of Theorem 5. The result then follows from the convergence of quantiles.

Let us now analyze the asymptotic performance; considering an object i, its miss probability is given by

PΦi0(iCθN(0))=PΦi0(XiθN)=G0(i)(θN).

Therefore, the total miss probability for system N is given by

MθN=i=1Nλi(N)λ(N)PΦi0(XiθN)=i=1Nλi(N)λ(N)G0(θN/λi(N))=0λG0(θN/λ)LN(dλ)0λLN(dλ).

We can then state a result analogous to Theorem 6 and Corollary 1 for the asymptotic performance.

Proposition 3.

Under the conditions of Theorem 6, the asymptotic miss probability satisfies

limNMθN=0λG0(θ*λ)L(dλ)0λL(dλ).

The proof is substantially easier than in the case of Theorem 6 because the threshold θN is deterministic instead of depending on the state of the remaining processes.

Indeed, by Lemma 1 and the fact that θNθ*, we see that the function λG0(θN/λ) is uniformly bounded for all N. Taking proper account of continuity as in Theorem 6, the numerator converges as stated. For the denominator, we invoke uniform integrability.

The main conclusion of this analysis is that under the above assumptions, a deterministic threshold policy with a soft memory constraint is asymptotically equivalent to the optimal causal policy derived in Section 3 both in terms of memory usage and in terms of performance.

We shall show now that there is a strong connection between threshold policies and timer-based ones.

6.3. Connection with Timer-Based Policies: Monotone Hazard Rates

To end this section, we would like to highlight a strong connection between threshold policies and timer-based ones in the case where the requests follow renewal processes. Timer-based or time-to-live caching has been a long-standing idea in local memory systems; here, each time a request for item i occurs, a timer is started, and the item is kept in memory up to the timer expiration. When a new request arrives, the timer is reset. TTL policies were first analyzed in terms of point process requests in Fofack et al. (2014), and the optimal timers under stationary requests were obtained in Ferragut et al. (2016).

Consider the policy defined in (43) with deterministic threshold θN=θCN from Equation (46). Assume that the incoming processes are renewal and that the interarrival hazard rates are strictly monotonically decreasing over the distribution domain (as in the Pareto parametric example). Assume for simplicity that the threshold θN lies in the range of the function ηi(·), and define

Ti(N)=ηi1(θN).

Then, because ηi is decreasing, we have

φi(t)>θNt<Ti(N);
in panel (b) of Figure 4, we illustrate this condition. Consequently, in the case of monotonically decreasing hazard rates, the threshold policy is exactly equivalent to TTL caching.4 This is consistent with the characterization of optimal timers in Ferragut et al. (2016). In addition, from the asymptotic optimality of threshold policies established above, we further conclude here that TTL caching is asymptotically optimal.

Figure 4. Threshold Policies in Terms of Timers
Notes. (a) Timer-based caching for decreasing hazard rates. (b) Timer-based prefetching for increasing hazard rates.

An analogous discussion is valid for monotonically increasing hazard rates; here, the likelihood of a subsequent request is smallest immediately after receiving one, and thus, caching is not useful. Instead, it was proposed in Ferragut et al. (2024) to remove the item and prefetch it at an appropriate later time. In fact, this strategy can also be cast as a threshold policy by observing (see Figure 4) that

φi(t)>θNt>Ti(N),
so the threshold policy removes the object from memory and recalls it after a time Ti(N); these timers coincide with the optimal timer prefetching policy from Ferragut et al. (2024). Once again, the asymptotical optimality of threshold policies allows us to further conclude that timer-based prefetching is asymptotically optimal for request processes with increasing hazard rates.

7. Extension to Heterogeneous Object Sizes

The preceding model assumes that objects have equal sizes; they all have the same impact on memory, and the natural performance criterion is the hit probability determined by whether the object can be (fully) recovered from local memory upon request. In practice, items can have different sizes and therefore, unequal memory cost. Also, one can consider in the performance the value of a partial recovery of the object. We now formalize this approach, which was first proposed in Panigrahy et al. (2022), and we show how the previous results can be extended to incorporate this case.

Consider a local memory system fed by a family of request processes {Φi:i=1,,N} with intensities λi, natural histories Ft(i), and stochastic intensities φi(t). Again, let Φ be their superposition, and let Ft be the aggregated history. Assume that each object i has a size zi>0 and that the system has total memory C.

A causal policy C˜(t) will now be an Ft-predictable stochastic process C˜:Ω×R[0,1]N, where C˜(t) represents the fractions of each file stored in the local memory. For feasibility, the policy should satisfy the memory constraint

C˜(t)=(x1,,xn)izixiC.(48)

Thus, we are allowing fractional caching (i.e., only part of the item to be stored in memory).

Assume that whenever a request for item i arrives at time t, it can recover from the local memory zixi(t) and derives a utility from this amount of content Ui(zixi), where Ui is a concave function. In this scenario, the analogue of the hit process is the following utility process:

ϒC˜(B)=i=1NBUi(zixi(t))Φi(dt).(49)

The process ϒC˜ is a measure-valued process that stores the utility derived from the local memory retrievals. For the special case where Ui(z)=z, we recover the byte-counting process in Panigrahy et al. (2022) because it keeps track of the amount of data recovered from the local memory in any time interval.

Now, let this local memory system be fed by a superposition of request processes {Φi:i=1,,N} satisfying Assumption 1. We have the analogue of Lemma 2.

Lemma 4.

Let C˜(t) be a causal policy. The stochastic intensity of the utility process (49) with respect to the aggregated history Ft is

uC˜(t)=i=1Nφ˜i(t)Ui(zixi(t)).(50)

The proof is similar to the proof of Lemma 2, changing the predictable process 1{iC(t)} for the fractional caching version Ui(zixi(t)), where xi(t) is predictable by assumption.

Consider now the policy C˜*(t) defined by solving the following optimization problem:

maxxi[0,1]iφ˜i(t)Ui(zixi)(51a)
subject to  izixiC.(51b)

We then have the analogue of Theorem 3.

Theorem 8.

Given a local memory system with total memory C fed by a family of stationary point processes, {Φi:i=1,,N} with aggregated natural history Ft, satisfying Assumption 1. Assume that the item sizes are {zi:i=1,,N} and that the system may store fractional items. Then, for any causal policy C˜(t), the stationary utility rate uC˜ satisfies

uC˜=E[uC˜(0)]E[uC˜*(0)]=uC˜*
with uC(t) as in (50) and C˜* as defined in (51).

The proof follows along the same lines of Theorem 3 by observing that the policy C˜* satisfies Equations (51) at all times.

Example 8

(Maximizing the Byte Count). If we take Ui(z)=z for all i, then we recover the byte-counting process of Panigrahy et al. (2022). In that case, the optimization problem defining C˜* becomes a linear program, and its solution can be given explicitly in terms of the stochastic intensities φ˜i(t) as follows.

  • Rank the contents in decreasing order of φ˜i(t).

  • Choose the first K objects satisfying

    k=1K1zik<C and k=1Kzik>C,
    where ik denotes the ordering permutation of the first step.

  • Take xik(t)=1 for k=1,,K1 and xiK(t)=Ci=1K1zikziK.

In other words, at all times, store the most likely upcoming requests as dictated by the stochastic intensity up to filling up your memory.

Note that the policy in Example 8 ranks the objects again in terms of the stochastic intensity independently of the object size. In particular, the optimal policy can be cast as a dynamic threshold policy, where the threshold is the stochastic intensity of the Kth object (i.e., the one that completes the memory allocation). Because of this fact, we think that the asymptotic results of Sections 5 and 6 can also be extended to this case under the appropriate scaling assumptions. This topic will be pursued in future work.

8. Parametric Examples and Simulations

In this section, we provide some interesting parametric examples to highlight the power of the results, in particular enabling us to compute sharp estimates for the performance of a local memory system. We must specify the interarrival distribution of our scale family of request processes and the distribution of popularities. We begin with the Pareto–Zipf combination, which was already introduced in examples above. We then analyze a further example with increasing hazard rates.

8.1. Pareto Interarrival Times and Zipf Popularities

Consider first the Pareto interrequest times introduced in Section 2.3 with tail parameter α. As we already mentioned, this family represents bursty traffic with decreasing hazard rates. For the base (unit intensity) process, we have the following distribution for (nonsynchronized) observed stochastic intensity:

G(x)=(α1αx)α1 for 0xαα1;G(x)=1 for x>αα1.

Combine it with the Zipf popularities with tail parameter β introduced in Example 6 with limit distribution

L(λ)=1λ1/β for λ1.

By appropriately integrating Equation (29), we can obtain the asymptotic distribution:

G(x)={(1cα,β)(α1αx)α1xαα1;1cα,β(α1αx)1/βx>αα1; where cα,β(α1)β(α1)β+1,(52)
which is a relevant parameter in the following discussion. Note that G(x) is continuous and takes the value 1cα,β at x=αα1. A depiction of G(x) is given in Figure 5.

Figure 5. Pareto Example with Zipf Popularities, α=2, and β=1
Notes. (a) Limit distribution G. (b) Threshold fixed point.

Because G is strictly increasing, we can now solve for the unique asymptotic threshold θ* as a function of the memory size c. Imposing G(θ*)=1c, we get the following result:

θ*={(αα1)(cα,βc)βccα,β,(αα1)(1c1cα,β)1α1ccα,β.

In the case β<1, it is easy to see that the measures {LN} are uniformly integrable, and thus, we can compute the miss rate estimate from (37):

0λG0(x/λ)L(dλ)={(1cα,β)(α1αx)αxαα1;11β[1αβ(1cα,β)(α1αx)11/β]xαα1.

Substituting the appropriate threshold and noting that 0λL(dλ)=11β, we reach the following result for the asymptotic miss probability:

M={1αβ(1cα,β)(ccα,β)1βccα,β;(1β)(1cα,β)1α1(1c)αα1ccα,β.(53)

In Figure 6, we plot the asymptotic miss probability of Equation (53) for fixed α and c as a function of the popularity tail parameter β. We also compare it with the asymptotic miss probability of the static policy, which is derived in Ferragut et al. (2016) and is given by 1c1β for 0β1. The optimal policy has the most advantage when popularities are more homogeneous; in particular, as β0, we have M(1c)αα1<1c for any α>1. The gain is larger as α decreases; the optimal policy capitalizes on the burstiness of the incoming traffic captured by the hazard rates.

Figure 6. Asymptotic Miss Probability for Zipf Popularities with Varying Parameter β
Notes. Pareto interrequest times with α=2 and c=0.1. The static policy was added for comparison.

A second observation is that convergence of the empirical distribution of the OSIs is very fast. In Figure 7, we plot a simulated sample in the steady state of G^N(x) for N=100 items for α=2 and β=0.5 as an example. We see the convergence to G(x) as established in Theorem 5.

Figure 7. The Empirical Distribution of the Observed Hazard Rates for N=100 and Its Limit G(x)
Note. Pareto requests with α=2 and Zipf popularities with β=0.5.

Moreover, because convergence to G is fast, convergence of the random threshold θ^NQ^N(1c)θ* is also fast. We plot panels (a) and (b) of Figure 8 two examples with the same parameters as above for N = 1,000 and 10,000, respectively. The threshold process θN(t) is approximately constant for large N around the value θ*.

Figure 8. Threshold Evolution and Large-Scale Limit for Pareto Requests α=2, β=0.5, and c=0.1
Notes. (a) Catalog size N = 1,000. (b) Catalog size N = 10,000.

This last observation is crucial for developing finite N approximations for the performance. Because of the fact that when β1, the family of distributions {LN} approaches nonuniform integrability, convergence of the miss rate estimate mC*(N), the total rate λ(N), and the miss probability MC*(N) is slow around this value as depicted in the simulations shown in Figure 9.

Figure 9. Miss Probability of the Optimal Policy for Pareto Requests, α=2, c=0.1, and Varying β
Notes. The solid curve represents the asymptotic result (53). Dots represent simulation results for different values of N. The dotted lines represent the finite N approximation (54).

In order to compute finite N approximations, we use the intuition developed in Equation (35); that is, we compute the asymptotic threshold θ* by solving G(θ*)=1c and plug in this estimate in place of the random threshold θ^N. Then, we estimate MC*(N) as

MC*(N)0λG0(θ*/λ)LN(dλ)0λLN(dλ)=1λ(N)i=1Nλi(N)G0(θ*/λi(N)).(54)

This turns out to be equivalent to approximating the optimal performance for that of the fixed threshold policy discussed in Section 6, and it is numerically easy to compute, even for distributions that do not have closed-form expressions as in this case. Therefore, this procedure provides sharp estimates of the maximum achievable performance for any policy and finite N. An example of this approximation is depicted in the dashed lines in Figure 9.

8.2. Erlang Interrequest Times and Zipf Popularities

We now turn our attention to a different example, where the hazard rate of the interrequest distribution is increasing. This leads to a totally different behavior; for instance, caching is not a good idea in this setting because upon receiving a request, a subsequent request becomes less likely. It is actually preferable to remove the content from memory and prefetch it again closer to request time (Ferragut et al. 2024).

Of course, the optimal memory management policy designed in Section 5 does this automatically by keeping track of the hazard rates, and all of the asymptotic results derived for it are still valid in this case.

To illustrate this behavior, we choose the interrequest times to be distributed as an Erlang distribution with k stages and appropriate means. For k=1, this corresponds to the Poisson process because interrequest times become exponential. As k grows, the process approaches deterministic interrequest times, and thus, the traffic pattern becomes more regular.

To be precise, let the base interrequest distribution be Erlang with k stages and mean of one (so λ=k). This corresponds to the following:

f0(t)=1(k1)!kktk1ekt,F0(t)=1j=0k11j!(kt)jektt0.

This yields the following nice formula for the hazard rate function:

η(t)=k(kt)k1(k1)!j=0k11j!(kt)j=kB(kt,k1),
where B(A,C) is the classical Erlang-B formula for the blocking probability of telephone systems. For k>1, this becomes strictly increasing in t, η(0)=0, and η(t)k when t.

We do not have analytical expressions for the age distribution or for the observed hazard rates, so G must be estimated numerically. We do so for the Zipf popularity distribution limit L(λ) introduced above. We plot the resulting distribution in Figure 10 together with a simulation of the empirical version G^N(x) for N = 1,000, showing good convergence. Also, in Figure 10, we show the threshold θ^N evolution over time for the optimal policy, showing convergence to the numerically computed limit θ*.

Figure 10. Observed Hazard Rates and Threshold Evolution for an Example with Erlang Requests with k=4 Stages and Zipf Popularities with β=0.5 and c=0.1
Notes. (a) Empirical distribution of the observed hazard rates for N = 1,000 and its limit G(x). (b) Threshold process evolution and predicted limit θ* numerically computed.

Finally, in Figure 11, we plot the asymptotic miss probability of the optimal policy numerically computed from Equation (38) as well as a simulation of the optimal policy for N = 1,000 as a function of parameter β. As discussed above, convergence to the performance limit is slow when β approaches one, so we compute the performance estimate from (54), which also corresponds to the optimal prefetching policy from Ferragut et al. (2024), showing good fit. As a comparison, the classical LRU policy is also simulated for the traffic pattern. As we discussed above, classical caching performs badly because of the regularity of the request process.

Figure 11. Miss Probability of the Optimal Policy for Erlang Requests with k=4 Stages, c=0.1, and Varying β
Notes. The solid curve represents the asymptotic result. Squares represent simulation results (N = 1,000), and the numerical approximation from (54) is shown by the dashed line. For comparison, LRU simulations are also shown.

8.3. Practical Implementation

Although our paper is focused on theoretical results for the optimal management of local memory systems, it is important to discuss possible practical implementations of the optimal policy. From Theorem 3, we know that the key point to implement the optimal policy is correctly estimating the stochastic intensities of the underlying request processes, a well-known hard problem in statistics (Daley and Vere-Jones 2003, chapter 7.5) with few results.

In the case where the incoming processes are independent and renewal processes, the problem boils down to estimating the hazard rate function of each stream based on previously recorded requests. Hazard rate estimation from life tables and for discrete random variables is a well-studied issue, with estimators such as the Kaplan–Meier estimator for the hazard rate and the Nelson–Aalen estimator for the cumulative hazard rate, respectively (Colosimo et al. 2002). In the continuous setting, some sort of kernel smoothing must be done (Tanner and Wong 1983) because by definition, the hazard rate depends on the underlying density of the distribution, so the problem is completely related to kernel density estimation.

Nevertheless, in our case, we can propose a practical algorithm along the lines of the Nelson–Aalen estimator. We first discuss this for a given item i, so we drop the dependence on i to ease the notation. Begin by observing that the hazard rate is given by

η(t)=tlog(1F0(t)),
where F0 is the interarrival distribution. In particular, log(1F0(t)) is the cumulative hazard rate function. The algorithm will proceed as follows.
  • Keep track of the observed interrequest times {τ˜k:k=1,,M} up to time t.

  • Construct the ordered sample {τ˜k*:k=1,,M}.

  • Measure the current elapsed time because the last arrival t0tτ(t).

  • Find k such that

    τ˜k1*t0<τ˜k*

    (i.e., the interval in the sample that contains the current elapsed time).

  • Estimate the hazard rate at elapsed time t0 as the slope of the empirical cumulative hazard rate between times τ˜k1* and τ˜k*, which can be directly computed as

    η^(t0)=log(1+1Nk)τ˜k*τ˜k1*.

Note that the computational complexity of this algorithm is high because we must keep track of a large-enough sample of interarrival times (something that is difficult to achieve for less popular items), and we must do so for each item in the catalog. We must also keep these interarrival times sorted and find the correct interval to evaluate the slope dynamically as time elapses.

This is why more computationally efficient algorithms are needed. In particular, the classical and simple LRU algorithm seems well suited but only in the case of decreasing hazard rates or bursty traffic. For more regular traffic, like in the increasing hazard rate case, the problem is still open.

9. Conclusions

This paper analyzes the optimal management of local memory systems with tools of stationary point processes. As a first contribution, we provided a rigorous setting for this problem, and under suitable assumptions, we characterized the optimal policy (in the sense of maximizing hit rates) to store in memory at every moment the items corresponding to the highest stochastic intensities.

From this starting point, for the case of independent request streams, we analyzed the limiting behavior of the optimal policy as the catalog size N when a fixed fraction c of items can be stored. Assuming that item request processes come from a common scale family with intensities that have a limit in distribution, we proved that the optimal policy amounts to comparing the stochastic intensity (observed hazard rate) of the process with a fixed threshold defined by the 1c quantile of a certain limiting distribution function. We further characterized the asymptotic performance (miss probability) of this optimal policy.

We also analyzed optimal threshold policies for the finite N case that satisfy the memory constraint in the mean. We showed that they have the same limiting behavior as the optimal policy. Moreover, we found that for renewal traffic and monotonic hazard rates in the interrequest distribution, these are equivalent to timer-based policies, where caching or prefetching times are determined by a hazard rate threshold. We also showed that some of the optimality results presented extend quite nicely to the case of heterogeneous object sizes and fractional caching.

Finally, we presented two detailed examples of our results for the standard Zipf model of item popularities and two cases for the interrequest process. For the bursty, decreasing hazard rate case of Pareto interarrival times, we obtain closed-form expressions for the optimal asymptotic threshold and the corresponding performance. We also provide sharp estimates of the optimal performance for finite N validated by detailed stochastic simulations. For the regular, increasing hazard rate case of Erlang interarrival times, closed-form expressions are not available, but we still carry out a numerical validation of the asymptotic threshold and performance. We also use this example to exhibit the significant superiority of the optimal policy in comparison with the popular LRU policy in contrast with the bursty case where LRU has good behavior.

For future research, we may explore the application of our machinery to characterize optimality for other kinds of request traffic: Markov-modulated Poisson processes to account for time variations, more general mixtures of traffic from different sources, and possibly, heterogeneity in item sizes as well as different popularity distributions.

Also, note that as presented, the optimal policy serves as a fundamental limit on the achievable performance but not necessarily a practical algorithm because it requires knowledge of the relevant item intensities and hazard rate distributions. This poses the question of possibly learning this information from the actual request data and alternatively, designing an eviction policy that does not make explicit use of these quantities but approximates optimal performance in practice; to some extent, LRU achieves this in the bursty case, but a counterpart for the increasing hazard rate is currently open.

Appendix A. Proof of Asymptotic Performance

In this section, we will provide a proof for our main result on asymptotic performance of the optimal policy for the uniformly integrable case. Referring back to Section 4, we begin by restating Equation (34):

mC*(N)=i=1Nλi(N)PΦi0(G^N(i)(Xi(N))pN)=i=1Nλi(N)PΦi0(Xi(N)Q^N(i)(pN)),(A.1)
where pN=1CN1, G^N(i) was defined in (33) to be the empirical distribution of the OSIs of items different from i, and Q^N(i) is its corresponding quantile function.

To compute each term on the right, we must invoke the distribution of Xi(N) under the Palm probability; it would be more convenient if the right-hand side of the inequality does not depend on i. For this purpose, we will bound the quantiles obtained from G^N(i) with those of G^N defined in (24), which include all items; as usual, Q^N is the corresponding quantile function. The key observation is that the contribution of the ith item to the empirical distribution G^N of all OSIs is at most 1N and therefore, negligible in the limit as N tends to infinity. More precisely, we have Lemma A.1.

Lemma A.1.

max1iNG^NG^N(i)1N.

Proof.

Recalling the definitions (24) and (33), we can write

NG^N(x)(N1)G^N(i)(x)=j=1N1{Xj(N)x}ji1{Xj(N)x}=1{Xi(N)x};N(G^N(x)G^N(i)(x))=1{Xi(N)x}G^N(i)(x).

In the preceding identity, the right-hand side is the difference of two numbers, both in [0, 1]; therefore, |G^N(x)G^N(i)(x)|1N. This holds for every x and for each i. □

We will now make use of this lemma to bound the distribution of G^N(i)(Xi(N)) as required for (A.1) with that of G^N(Xi(N)). However, this change brings a new difficulty because G^N(·) and Xi(N) would not be independent. For this reason, to calculate the distribution, we will use a coupling argument and consider independent versions of the corresponding random variables.

Proposition A.1

(Miss Rate Bounds for Fixed N). Let pN±=pN±1N, where pN=1CN1. Let ΘN± be the probability distribution function of the random variable θ^N±Q^N(pN±) under the probability P. Then, the miss rate is bounded from above and below by

00λG0(θλ)ΘN(dθ)LN(dλ)mC*(N)N00λG0(θλ)ΘN+(dθ)LN(dλ).(A.2)

Proof.

The proof is based on a coupling argument where we consider independent copies of the OSIs. The distributions of these copies mimic the distributions of the OSIs at an arbitrary time t=0 (that is, under the probability measure P) and upon an arrival at time t=0 (that is, under the Palm distribution PΦ0). More precisely, consider two independent, row-independent triangular arrays

{YiN:N1,1iN} and {ZiN:N1,1iN}
defined on a common probability space (Ω˜,F˜,P˜) with distributions
YiNG(·/λiN) and ZiNG0(·/λiN) for all 1iN and N1.

Let

H^N(x)=1Nj=1N1{YjNx} and H^N(i)(x)=1N1ji1{YjNx}.

Analogously to Lemma A.1, we have

max1iNH^N(i)H^N1N.(A.3)

For each N1 and 1iN, the random variables

Y1N,,Yi1N,ZiN,Yi+1N,,YNN
are independent, so if we let UiN=H^N(i)(ZiN), its distribution under P˜ will coincide with that of G^N(i)(Xi(N)) under the Palm probability PΦi0: that is,
PΦi0(G^N(i)(Xi(N))pN)=P˜(UiNpN).(A.4)

To bound the probability on the right-hand side, observe that from (A.3), we have

{H^N(ZiN)pN}{UiNpN}{H^N(ZiN)pN+};(A.5)
the proof proceeds by computing the probabilities of the extreme sets in (A.5), noting that under the constructed P˜, ZiN is independent of H^N (a function of Y1N,YNN).

Focusing momentarily on the upper bound, we can write

P˜(H^N(ZiN)pN+)=P˜(ZiNQ^NH(pN+)),(A.6)
where Q^NH denotes the quantile function of H^N; by construction, the distribution of Q^NH(pN+) under P˜ coincides with that of θ^N+=Q^N(pN+) under P, which by hypothesis, is denoted by ΘN+(θ). Because ZiN is independent and has distribution G0(·/λiN), we can compute the right-hand side of (A.6) to be
0P˜(ZiNθ)ΘN+(dθ)=0G0(θλiN)ΘN+(dθ).

Multiplying by λiN and averaging over I, we obtain from (A.5)

1Ni=1nλiNP˜(UiNpN)1Ni=1nP˜(H^N(ZiN)pN+)=1Ni=1n0λiNG0(θλiN)ΘN+(dθ)=00λG0(θλ)ΘN+(dθ)LN(dλ),
where the last step invokes the definition of LN. An analogous calculation provides the lower bound
00λG0(θλ)ΘN(dθ)LN(dλ)1Ni=1nλiNP˜(UiNpN).

Invoking (A.4) and Expression (A.1) for the miss rate, we conclude the proof. □

A.1. Convergence of Quantiles

Our next step is to compute the limit as N under suitable conditions of the lower and upper bounds in (A.2). We will use the convergence LNL of Assumption 3, but we also need to address the convergence of ΘN± and the distribution of quantiles for the random function G^N.

Recall that by Theorem 5, we have G^NG with P probability 1, where G is defined by (29). A standard result that we have already invoked (e.g., van der Vaart 1998, lemma 21.2) is that convergence in distribution implies the convergence of quantiles at a fixed point p provided that the limit quantile function is continuous at p. We will need a slight generalization of this property for the case where quantiles of the sequence are evaluated at a variable point pN convergent to p. This is stated and proven next.

Proposition A.2

(Convergence of Quantiles). Let {FN}N1 be a sequence of cumulative distribution functions such that FNF for some distribution F. Let QN and Q be the quantile functions of FN and F, respectively. Let p[0,1] be a continuity point of Q and {pN}N1 be a sequence in [0, 1] such that pNp. Then, QN(pN)Q(p).

Proof.

In the proof, we use the characterization of weak convergence in terms of the Lévy distance

dL(F,G)inf{ϵ>0F(xϵ)ϵG(x)F(x+ϵ)+ϵ,xR}
between cdfs F,G:R[0,1]. Convergence in the Lévy distance dL(FN,F)0 is equivalent to weak convergence FNF.

Let ϵ>0. By continuity of Q at p, there exists δ>0 such that |qp|<δ implies |Q(q)Q(p)|<ϵ/2. Let δNdL(FN,F). Choose N0 such that |pNp|<δ/2 and δNmin{ϵ,δ}/2 for all NN0.

We claim that

Q(pNδN)δNQN(pN)Q(pN+δN)+δN.(A.7)

Indeed, for the left inequality of (A.7), let xR be such that FN(x)pN. Then,

F(x+δN)FN(x)δNpNδN,
which implies x+δNQ(pNδN). Taking infimum over such x, we get QN(pN)Q(pNδN)δN.

For the right inequality of (A.7), let xR be such that F(xδN)pN+δN. Then,

FN(x)F(xδN)δNpN,
which implies xQN(pN). Taking infimum over such x, we get Q(pN+δN)+δNQN(pN). This proves the claim.

Let NN0. Then,

Q(p)ϵQ(p)ϵ2δN<Q(pNδN)δNQN(pN),
and
QN(pN)Q(pN+δN)+δN<Q(p)+ϵ2+δNQ(p)+ϵ.

Thus, |QN(pN)Q(p)|<ϵ for all NN0. □

A.2. Proof of Theorem 6

We are now in a position to complete the proof of our main result on performance.

We first establish that θ^N±Q^N(pN±)Nθ*Q(1c) almost surely in P. This follows by invoking Proposition A.2 at each ω in the set where G^NG, which has unit probability by Theorem 5; note that pN±1c, which by hypothesis, is a point of continuity of Q.

Almost sure convergence implies convergence in probability and in distribution. Therefore, the distributions ΘN± of the random variables θ^N± both satisfy ΘN±δθ*, a unit mass at the limit threshold. As a consequence, under Assumption 3, we have convergence of the product measures ΘN±LNδθ*L. We will use this fact to show that both bounds in (A.2) converge to the same limit, and it coincides with the right-hand side of (37).

Define for this purpose the function h:R+2R+ by h(λ,θ)=λG0(θλ). Our next claim is that

(ΘN±LN)h1(δθ*L)h1.(A.8)

This claim is established in the mapping theorem (Billingsley 1999, theorem 2.7) provided that h is measurable and that its set of discontinuities Dh has zero measure in the limit (i.e., (δθ*L)(Dh)=0).

Because the limit measure in θ is a point mass, we have

(δθ*L)(Dh)=L({λ:θ*λDG0}).(A.9)

Also, note that G0 being a distribution function, its set of discontinuities is countable. Therefore, the measure on the right of (A.9) can only be positive if there exists λ0DL (an atom of the measure L) such that θ*λ0DG0. This would imply that θ*DL·DG0 and is ruled out by hypothesis. Therefore, the claim holds.

To finish the theorem, we must go from the convergence in distribution given in (A.8) to a first-moment condition. Here is where uniform integrability will come into play. It is easiest to express this via independent random variables θ^N± and ΛN with respective distributions ΘN± and LN with θ^N±dθ* and ΛNdΛ, the latter with distribution L. Condition (A.8) is equivalent to the statement

h(ΛN,θ^N±)=ΛNG0(θ^N±ΛN)dΛG0(θ*Λ)=h(Λ,θ*).

Because G01, we have h(ΛN,θ^N±)ΛN; now, {ΛN}N1 is uniformly integrable by hypothesis, so the same happens with {h(ΛN,θ^N±)}N1. Now, invoke Billingsley (1999, theorem 3.5) to obtain

00λG0(θλ)LN(dλ)ΘN±(dθ)=E[ΛNG0(θ^N±ΛN)]NE[ΛG0(θ*Λ)]=0λG0(θ*λ)L(dλ).

Thus, both upper and lower bounds in (A.2) converge to the given limit, which concludes the proof.

Endnotes

1 Note that τ0 is a proper random variable because {τ0t}={Φ((t,0])=0} and thus, is measurable for all t0. Similarly, any point τk is a random variable.

2 We specifically avoid using the term cache for our system because in caching, normally only the recently requested objects are stored, and our model aims to generalize these policies.

3 We now make explicit the dependence on the system size N because we will study the limits as N.

4 If θN is not in the range of valid hazard rates, the equivalence also holds by allowing zero (or infinite) timers (i.e., items that are never (or always) stored).

References

  • Baccelli F, Brémaud P (2013) Elements of Queueing Theory (Springer-Verlag, Berlin).Google Scholar
  • Bahat O, Makowski AM (2005) Measuring consistency in TTL-based caches. Performance Evaluation 62(1):439–455.Google Scholar
  • Barrera J, Fontbona J (2010) The limiting move-to-front search-cost in law of large numbers asymptotic regimes. Ann. Appl. Probab. 20(2):722–752.Google Scholar
  • Berger DS, Gland P, Singla S, Ciucu F (2014) Exact analysis of TTL cache networks. Performance Evaluation 79:2–23.Google Scholar
  • Berger DS, Henningsen S, Ciucu F, Schmitt JB (2015) Maximizing cache hit ratios by variance reduction. ACM SIGMETRICS Performance Evaluation Rev. 43(2):57–59.Google Scholar
  • Bianchi G, Detti A, Caponi A, Blefari Melazzi N (2013) Check before storing: What is the performance price of content integrity verification in LRU caching? ACM SIGCOMM Comput. Comm. Rev. 43(3):59–67.Google Scholar
  • Bilal M, Kang SG (2017) A cache management scheme for efficient content eviction and replication in cache networks. IEEE Access 5:1692–1701.Google Scholar
  • Billingsley P (1999) Convergence of Probability Measures, Wiley Series in Probability and Statistics, 2nd ed. (Wiley, Chichester, UK).Google Scholar
  • Borst S, Gupta V, Walid A (2010) Distributed caching algorithms for content distribution networks. 2010 Proc. IEEE INFOCOM (IEEE, Piscataway, NJ), 1–9.Google Scholar
  • Brémaud P (2020) Point Process Calculus in Time and Space (Springer, New York).Google Scholar
  • Che H, Tung Y, Wang Z (2002) Hierarchical web caching systems: Modeling, design and experimental results. IEEE J. Selected Areas Comm. 20(7):1305–1314.Google Scholar
  • Colosimo E, Ferreira F, Oliveira M, Sousa C (2002) Empirical comparisons between Kaplan–Meier and Nelson–Aalen survival function estimators. J. Statist. Comput. Simulation 72(4):299–308.Google Scholar
  • Daley DJ, Vere-Jones D (2003) An Introduction to the Theory of Point Processes: Volume I: Elementary Theory and Methods (Springer, New York).Google Scholar
  • Dan A, Towsley D (1990) An approximate analysis of the LRU and FIFO buffer replacement schemes. Proc. ACM/SIGMETRICS 1990 (ACM, New York), 143–152.Google Scholar
  • Dehghan M, Massoulie L, Towsley D, Menasche DS, Tay YC (2019) A utility optimization approach to network cache design. IEEE/ACM Trans. Networking 27(3):1013–1027.Google Scholar
  • Ferragut A, Carrasco M, Paganini F (2024) Caching or pre-fetching? The role of hazard rates. 2024 60th Annual Allerton Conf. Comm. Control Comput. (IEEE, Piscataway, NJ), 1–8.Google Scholar
  • Ferragut A, Rodriguez I, Paganini F (2016) Optimizing TTL caches under heavy tailed demands. Proc. ACM/SIGMETRICS 2016 (ACM, New York), 101–112.Google Scholar
  • Ferragut A, Rodríguez I, Paganini F (2018) Optimal timer-based caching policies for general arrival processes. Queueing Systems 88(3–4):207–241.Google Scholar
  • Fill JA (1996) Limits and rates of convergence for the distribution of search cost under the move-to-front rule. Theoret. Comput. Sci. 164(1):185–206.Google Scholar
  • Fofack NC, Nain P, Neglia G, Towsley D (2012) Analysis of TTL-based cache networks. Proc. Internat. Conf. Performance Evaluation Methodologies Tools (VALUETOOLS) (IEEE, Piscataway, NJ), 1–10.Google Scholar
  • Fofack NC, Nain P, Neglia G, Towsley D (2014) Performance evaluation of hierarchical TTL-based cache networks. Comput. Networks 65:212–231.Google Scholar
  • Fricker C, Robert P, Roberts J (2012) A versatile and accurate approximation for LRU cache performance. Proc. 24th Internat. Teletraffic Congress (IEEE, Piscataway, NJ), 57–64.Google Scholar
  • Garetto M, Leonardi E, Martina V (2016) A unified approach to the performance analysis of caching systems. ACM Trans. Model. Performance Evaluation Comput. Systems 1(3):1–28.Google Scholar
  • Gast N, Houdt BV (2015) Transient and steady-state regime of a family of list-based cache replacement algorithms. Proc. ACM/SIGMETRICS 2015 (ACM, New York), 123–136.Google Scholar
  • Gelenbe E (1973) A unified approach to the evaluation of a class of replacement algorithms. IEEE Trans. Comput. C-22(6):611–618.Google Scholar
  • Ioannidis S, Yeh E (2016) Adaptive caching networks with optimality guarantees. ACM SIGMETRICS Performance Evaluation Rev. 44(1):113–124.Google Scholar
  • Ioannidis S, Massoulie L, Chaintreau A (2010) Distributed caching over heterogeneous mobile networks. Proc. ACM SIGMETRICS Internat. Conf. Measurement Model. Comput. Systems (ACM, New York), 311–322.Google Scholar
  • Jelenković PR (1999) Asymptotic approximation of the move-to-front search cost distribution and least-recently used caching fault probabilities. Ann. Appl. Probab. 9(2):430–464.Google Scholar
  • Jelenković P, Radovanović A (2003) Asymptotic insensitivity of least-recently-used caching to statistical dependency. Proc. IEEE/INFOCOM 2003 (IEEE, Piscataway, NJ), 438–447.Google Scholar
  • Jelenković PR, Radovanović A (2004) Least-recently-used caching with dependent requests. Theoret. Comput. Sci. 326(1):293–327.Google Scholar
  • Jelenković PR, Radovanović A (2008) The persistent-access-caching algorithm. Random Structures Algorithms 33(2):219–251.Google Scholar
  • Jelenković PR, Radovanović A, Squillante MS (2006) Critical sizing of LRU caches with dependent requests. J. Appl. Probab. 43(4):1013–1027.Google Scholar
  • Jung J, Berger AW, Balakrishnan H (2003) Modeling TTL-based internet caches. Proc. IEEE/INFOCOM 2003 (IEEE, Piscataway, NJ), 417–426.Google Scholar
  • King W (1971) Analysis of paging algorithms. Proc. IFIP Congress 1971 (North-Holland Publishing Company, Amsterdam), 485–490.Google Scholar
  • Martina V, Garetto M, Leonardi E (2014) A unified approach to the performance analysis of caching systems. Proc. IEEE/INFOCOM 2014 (IEEE, Piscataway, NJ), 2040–2048.Google Scholar
  • Panigrahy NK, Li J, Towsley D (2017) Hit rate vs. hit probability based cache utility maximization. ACM SIGMETRICS Performance Evaluation Rev. 45(2):21–23.Google Scholar
  • Panigrahy NK, Li J, Towsley D, Hollot CV (2020) Network cache design under stationary requests: Exact analysis and Poisson approximation. Comput. Networks 180:107379.Google Scholar
  • Panigrahy NK, Nain P, Neglia G, Towsley D (2022) A new upper bound on cache hit probability for non-anticipative caching policies. ACM Trans. Model. Performance Evaluation Comput. Systems 7(2–4):5.Google Scholar
  • Rosensweig EJ, Kurose J, Towsley D (2010) Approximate models for general cache networks. Proc. IEEE/INFOCOM 2010 (IEEE, Piscataway, NJ), 1–9.Google Scholar
  • Shorack GR (1979) The weighted empirical process of row independent random variables with arbitrary distribution functions. Statistica Neerlandica 33(4):169–189.Google Scholar
  • Tanner MA, Wong WH (1983) The estimation of the hazard function from randomly censored data by the kernel method. Ann. Statist. 11(3):989–993.Google Scholar
  • van der Vaart AW (1998) Asymptotic Statistics, Cambridge Series in Statistical and Probabilistic Mathematics, vol. 3 (Cambridge University Press, Cambridge, UK).Google Scholar