Uniformly bounded regret in the multi-secretary problem

In the secretary problem of Cayley (1875) and Moser (1956), $n$ non-negative, independent random variables with a common distribution are sequentially presented to a decision maker who decides when to stop and collect the most recent realization. The goal is to maximize the expected value of the collected element. In the $k$-choice variant, the decision maker is allowed to make $k \leq n$ selections to maximize the expected total value of the selected elements. Assuming that the values are drawn from a known distribution with finite support, we prove that the best regret (the expected gap between the optimal online policy and its offline counterpart in which all $n$ values are made visible at time $0$) is uniformly bounded in the number of candidates $n$ and the budget $k$. Our proof is constructive: we develop an adaptive Budget-Ratio policy that achieves this performance. The policy selects or skips values depending on where the ratio of the residual budget to the remaining time stands relative to multiple thresholds that correspond to middle points of the distribution. We also prove that being adaptive is crucial: in general, the minimal regret among non-adaptive policies grows like the square root of $n$. The difference is the value of adaptiveness.


Introduction
In the classic formulation of the secretary problem, a decision maker (referred to as "she") is sequentially presented with $n$ non-negative, independent values representing the abilities of potential candidates and must select one candidate (referred to as "he"). Every time a new candidate is inspected and his ability is revealed, the decision maker must decide whether to reject or select the candidate, and her decision is irrevocable. If the candidate is selected, then the problem ends; if the candidate is rejected, then he cannot be recalled at a later time. The decision maker knows the number of candidates $n$ and the distribution of the ability values in the population, and her objective is to maximize the probability of selecting the most able candidate. For any given $n$ the problem can be solved by dynamic programming, but there is an asymptotically optimal heuristic that is remarkably elegant. The decision maker observes the abilities of the first $n/e$ candidates and then selects the first subsequent candidate whose ability exceeds that of the best candidate seen so far, or the last candidate if no such candidate exists (see, e.g., Lindley 1961, Chow et al. 1964, Gilbert and Mosteller 1966). Several variations of this simple model have been introduced in the literature, and we refer to Freeman (1983) and Ferguson (1989) for a survey of extensions and references. Relevant to us is the formulation in which the decision maker seeks to maximize the expected ability of the selected candidate, rather than the probability of selecting the best. This problem was first considered by Cayley (1875) and Moser (1956), and it is a special case of the $k$-choice (multi-secretary) problem we study here.
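The stopping heuristic described above can be sketched in a few lines. This is only an illustration: the function name `one_over_e_rule` and the choice to round $n/e$ down are assumptions made here, not part of the original formulation.

```python
import math


def one_over_e_rule(values):
    """Observe the first floor(n/e) values, then stop at the first later
    value that beats all of them; fall back to the last value otherwise."""
    n = len(values)
    cutoff = int(n / math.e)  # assumption: round n/e down
    benchmark = max(values[:cutoff], default=float("-inf"))
    for v in values[cutoff:]:
        if v > benchmark:
            return v
    return values[-1]
```

On an increasing sequence the rule stops right after the observation phase, which illustrates why it maximizes the probability of picking the best candidate rather than the expected ability.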
In our formulation:
• candidate abilities are independent, identically distributed, and supported on a finite set;
• the decision maker is allowed to select up to $k$ candidates ($k$ is the recruiting budget); and
• the decision maker's goal is to maximize the expected total ability of the selected candidates.
This multi-secretary problem has applications in revenue management and auctions, among others.
In the standard capacity-allocation revenue management problem a firm sells $k$ items (e.g., airplane seats) to $n$ customers from a discrete set of fare classes over a finite horizon and wishes to allocate the seats in the best possible way (see, e.g., Kleywegt and Papastavrou 1998, Talluri and van Ryzin 2004). In auctions, the decision maker observes arriving bids and must decide whether to allocate one of the available $k$ items to an arriving customer (see, e.g., Kleinberg 2005, Babaioff et al. 2007).
The performance of any online algorithm for this $k$-choice secretary problem is bounded above by the offline (or posterior) sort: the decision maker waits until all the ability values are presented, sorts them, and then picks the largest $k$ values. As such, we define the regret of an online selection algorithm as the expected gap in performance between the online decision and the offline sort, and we prove that the optimal online algorithm has a regret that is bounded uniformly over $(n, k)$. The constant bound depends only on the cardinality of the support and on the minimal mass value.
Our proof is constructive: we devise an adaptive policy, the Budget-Ratio (BR) policy, that achieves bounded regret. The policy is adaptive in the sense that the actions are adjusted based on the remaining number of candidates to be inspected and on the residual budget. The proof that the BR policy achieves bounded regret is based on a drift argument for the BR process that tracks the ratio between the residual budget and the remaining number of candidates to be inspected. Under our policy, BR sample paths are attracted to and then remain pegged to a certain value; see Figure 1. Drift arguments are typical in the study of stability of queues (see, e.g., Bramson et al. 2008), but the proof here is somewhat subtle: the drift is strong early in the horizon but it weakens as the remaining number of steps decreases. Since "wrong" decisions early in the horizon are more detrimental, this diminishing strength does not compromise the regret.
Figure 1. Budget-Ratio sample path. Simulation of a single sample path of the ratio between the remaining recruiting budget and the number of candidates to be seen. The abilities are drawn from a uniform distribution on $\mathcal{A} = \{0.2, 0.4, \ldots, 1.8, 2.0\}$ and the number of candidates is $n = 1{,}000$. In the left panel the initial recruiting budget is $k = 350$, while in the right panel $k = 320$. In either case, we see the attraction property: once the policy reaches a ratio of $0.35$ it "stays" there. The value of $0.35$ is not accidental. It is one of the thresholds of our Budget-Ratio policy for this uniform distribution.
We also show that adaptivity is key. While non-adaptive policies could have temporal variation in actions (and hence could have different actions towards the end of the horizon), this variation is not enough: the regret of non-adaptive policies is, in general, of the order of $\sqrt{n}$. Specifically, non-adaptive policies tend to be too greedy and run out of budget too soon. Non-adaptive policies introduce independence between decisions and, consequently, "too much" variance into the speed at which the recruiting budget is consumed.
Closely related to our work is the paper of Wu et al. (2015), which offers an elegant adaptive index policy that we revisit in Section 5.1. Under this policy, the ratio of remaining budget to remaining number of steps is a martingale. When the initial ratio of budget to horizon, $k/n$, is safely far from the jumps of the discrete distribution, this martingale property guarantees a bounded regret. In general, however, it is precisely the martingale symmetry that increases the regret. In their proof, Wu et al. (2015) show that their policy achieves, up to a constant, a deterministic-relaxation upper bound. This upper bound is not generally achievable and we must, instead, use the stochastic offline sort as a benchmark. The detailed analysis of the offline sort gives rise to sufficient conditions that, when satisfied by an online policy, guarantee uniformly bounded regret.
Notation. We use $\mathbb{Z}_+$ to denote the non-negative integers and $\mathbb{R}_+$ to denote the non-negative reals. For $j \in \{1, 2, \ldots\}$, we use $[j]$ to denote the set of integers $\{1, \ldots, j\}$ and we set $[j] = \emptyset$ otherwise. Given the real numbers $x, y, z$, we set $(x)_+ = \max\{0, x\}$, $x \wedge y = \min\{x, y\}$, and we write $y = x \pm z$ to mean that $|y - x| \leq z$. Throughout, to simplify notation, we use $M = M(x, y, z)$ to denote a Hardy-style constant dependent on $x$, $y$, and $z$ that may change from one line to the next.

The multi-secretary problem
A decision maker is sequentially presented with $n$ candidates with abilities $X_1, X_2, \ldots, X_n$, and, given a recruiting budget equal to $k$, she can select up to $k$ candidates to maximize the total expected ability of those selected up to and including time $n$. Of course, there is nothing to study if $k > n$: the decision maker can take all candidates, so it suffices to consider pairs $(n, k)$ that belong to the lattice triangle $\mathcal{T} = \{(n, k) \in \mathbb{Z}^2 : 0 \leq k \leq n\}$.
The abilities $X_1, X_2, \ldots, X_n$ are assumed to be independent across candidates and drawn from a common cumulative distribution function $F$ supported on a finite set $\mathcal{A} = \{a_m, a_{m-1}, \ldots, a_1 : 0 < a_m < a_{m-1} < \cdots < a_1\}$ of distinct real numbers. We denote by $(f_m, f_{m-1}, \ldots, f_1)$ the probability mass function with $f_j = \mathbb{P}(X_1 = a_j)$ for all $j \in [m]$, and we let $F$ be the cumulative distribution function and $\bar{F} = 1 - F$ be the survival function. Also, for future reference we choose a value $a_{m+1} < a_m$ with $f_{m+1} = 0$ so that $\bar{F}(a_{m+1}) = 1$.
The selection process unfolds as follows: suppose that at time $t \in [n]$ the residual recruiting budget is $k$ and that the sum of the abilities of the selected candidates up to and including time $t - 1$ is $w$. If the candidate inspected (or "interviewed") at time $t$ has ability $X_t = x$, then the decision maker may either select the candidate, increasing the cumulative ability to $w + x$ and reducing the residual budget to $k - 1$, or reject the candidate, leaving the accrued ability at $w$ and the remaining budget at $k$.
A policy is feasible if the number of selected candidates does not exceed the recruiting budget $k$. It is online if the decision with regard to the $t$th candidate is based only on its ability, the abilities of prior candidates, and the history of the decisions up to time $t$. All decisions are final: if the candidate interviewed at time $t$ is rejected, it is forever lost. Conversely, if the $t$th candidate is selected at time $t$, then that decision cannot be revoked at a later time.
Formally, let $\mathcal{F}_0$ denote the trivial $\sigma$-field and, for $t \in [n]$, let $\mathcal{F}_t = \sigma\{X_1, X_2, \ldots, X_t\}$ be the $\sigma$-field generated by the random variables $\{X_1, X_2, \ldots, X_t\}$. An online policy $\pi$ is a sequence of $\{\mathcal{F}_t : t \in [n]\}$-adapted binary random variables $\sigma^\pi_1, \sigma^\pi_2, \ldots, \sigma^\pi_n$ where $\sigma^\pi_t = 1$ means that the candidate with ability $X_t$ is selected. A feasible online policy requires that the number of selected candidates does not exceed the recruiting budget, i.e., that $\sum_{t \in [n]} \sigma^\pi_t \leq k$, so
$$\Pi(n, k) = \Big\{ (\sigma^\pi_1, \sigma^\pi_2, \ldots, \sigma^\pi_n) \in \{0, 1\}^n : \sigma^\pi_t \in \mathcal{F}_t \text{ for all } t \in [n] \text{ and } \sum_{t \in [n]} \sigma^\pi_t \leq k \Big\}$$
is then the set of all feasible online policies. For $\pi \in \Pi(n, k)$, we let $W^\pi_0, W^\pi_1, \ldots, W^\pi_n$ be the sequence of random variables that track the accumulated ability: we set $W^\pi_0 = 0$, and for $r \in [n]$ we let
$$W^\pi_r = \sum_{t \in [r]} X_t \sigma^\pi_t = W^\pi_{r-1} + X_r \sigma^\pi_r.$$
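The expected ability accrued at time $n$ by policy $\pi$ is then given by
$$V^{\pi}_{\mathrm{on}}(n, k) = \mathbb{E}[W^\pi_n] = \mathbb{E}\Big[\sum_{t \in [n]} X_t \sigma^\pi_t\Big].$$
For each $(n, k) \in \mathcal{T}$, the goal of the decision maker is to maximize the expected value $V^{\pi}_{\mathrm{on}}(n, k)$:
$$V^{*}_{\mathrm{on}}(n, k) = \max_{\pi \in \Pi(n, k)} V^{\pi}_{\mathrm{on}}(n, k).$$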
For completeness, we include in Appendix B an analysis of the dynamic program. The analysis of the Bellman equation confirms an intuitive property of "good" solutions: the optimal action should only depend on the remaining number of steps and the remaining budget and not on the current level of accrued ability. It also allows for a comparison of the Budget-Ratio policy we propose in Section 5 with the optimal policy.
Non-adaptive policies are an interesting subset of feasible online policies. If the residual recruiting budget at time $t$ is positive, then a non-adaptive policy selects the arriving candidate with ability $X_t = a_j$ with probability $p_{j,t} \in [0, 1]$ independently of all the previous actions. The probabilities $p_{j,t}$ are determined in advance: they can vary from one period to the next, but this variation is not adapted in response to the previous selection/rejection decisions. A more complete description of non-adaptive policies appears in Section 4. We let $\Pi_{\mathrm{na}} \subseteq \Pi(n, k)$ be the family of non-adaptive policies and define $V^{*}_{\mathrm{na}}(n, k) = \sup_{\pi \in \Pi_{\mathrm{na}}} V^{\pi}_{\mathrm{on}}(n, k)$ to be the optimal performance among these.
No feasible online policy (adaptive or not) can do better than the offline, full-information counterpart in which all values are presented in advance. The expected ability accrued at time $n$ by the offline problem is given by
$$V^{*}_{\mathrm{off}}(n, k) = \mathbb{E}\Big[\max\Big\{ \sum_{t \in [n]} X_t \sigma_t : (\sigma_1, \ldots, \sigma_n) \in \{0, 1\}^n \text{ and } \sum_{t \in [n]} \sigma_t \leq k \Big\}\Big].$$
Since the offline solution selects the best $k$ candidates for each realization of $X_1, \ldots, X_n$, we have that $V^{\pi}_{\mathrm{on}}(n, k) \leq V^{*}_{\mathrm{off}}(n, k)$ for all $(n, k) \in \mathcal{T}$ and all $\pi \in \Pi(n, k)$.
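The comparison between an online rule and the offline benchmark can be made concrete on a single realization. The sketch below is illustrative only: the names `offline_value`, `online_value`, and `greedy`, and the `decide(x, budget, remaining)` callback signature, are assumptions introduced here. The regret in the text is the expectation of the pathwise gap computed at the end.

```python
import heapq


def offline_value(abilities, k):
    """Offline sort: total ability of the k largest realized values."""
    return sum(heapq.nlargest(k, abilities))


def online_value(abilities, k, decide):
    """Run an online rule that sees one value at a time.

    decide(x, budget, remaining) returns True to select the current value,
    where `remaining` counts the candidates not yet decided (current included).
    """
    budget, total = k, 0.0
    n = len(abilities)
    for t, x in enumerate(abilities):
        if budget > 0 and decide(x, budget, n - t):
            total += x
            budget -= 1
    return total


def greedy(x, budget, remaining):
    """Toy rule: select every candidate while budget remains."""
    return True


sample = [1.0, 3.0, 2.0, 5.0]
gap = offline_value(sample, 2) - online_value(sample, 2, greedy)
```

Because the offline sort picks the best $k$ values of every realization, the pathwise gap is never negative for a feasible online rule.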
We use the value of the offline problem as a benchmark. The optimal online policy trivially achieves the offline benchmark if $k = 0$ or $k = n$. The main results of this paper (Theorem 2 and Corollary 1) are gathered below.
Theorem 1 (The regret of online feasible policies). If $\epsilon = \tfrac{1}{2} \min\{f_m, f_{m-1}, \ldots, f_1\}$, then there exists a policy $\mathrm{br} \in \Pi(n, k)$ and a constant $M = M(\epsilon, m, a_1)$ such that
$$V^{*}_{\mathrm{off}}(n, k) - V^{*}_{\mathrm{on}}(n, k) \leq V^{*}_{\mathrm{off}}(n, k) - V^{\mathrm{br}}_{\mathrm{on}}(n, k) \leq M \quad \text{for all } (n, k) \in \mathcal{T}.$$
Furthermore, if $(f_1 + \epsilon) n \leq k \leq (1 - f_m - \epsilon) n$, then an optimal non-adaptive policy has regret that grows like $\sqrt{n}$. Specifically, there is a constant $M = M(\epsilon, m, a_1, \ldots, a_m)$ such that $M \sqrt{n} \leq V^{*}_{\mathrm{off}}(n, k) - V^{*}_{\mathrm{na}}(n, k)$.
Remark 1 (On the regret with uniform random permutations). We note here that there are regret bounds for a version of this multi-secretary problem in which the values are given by a uniform random permutation of the integers $[n]$ instead of being a random sample. Within this framework, Kleinberg (2005) proves that the minimal regret is of the order of $\sqrt{k}$ and provides an algorithm that achieves this lower bound; see also Louchard and Bruss (2016).
Remark 2 (Fixing $F$ vs. varying $F$). In our result, the distribution $F$ is fixed while the horizon length $n$ and the recruiting budget $k$ are varied. The constant $M$ depends on $F$ through the support, its cardinality, and the minimal mass. If one allows the distribution $F$ to change with $(n, k)$, then the regret may not be bounded. Kleinberg (2017) suggested a discrete distribution on $\mathcal{A} = \{1, 2, 3\}$ with probability mass function that is allowed to depend on $n$ and $k$. Specifically, if the probability mass function is given by $\big(\tfrac{n - k + \sqrt{k}}{n}, \tfrac{\sqrt{k}}{n}, \tfrac{k - 2\sqrt{k}}{n}\big)$ and $k = [n/2]$, then one can show that the optimal online policy is, in order of magnitude, $\sqrt{n}$ away from the offline sort.

The offline problem
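Denoting by $Z^r_j = \sum_{t \in [r]} \mathbb{1}\{X_t = a_j\}$ the number of candidates with ability $a_j$ inspected up to and including time $r$, the offline optimization problem has the compact representation
$$V^{*}_{\mathrm{off}}(n, k) = \mathbb{E}[\varphi(Z^n_1, \ldots, Z^n_m, k)], \tag{1}$$
where, for $(z_1, \ldots, z_m) \in \mathbb{Z}^m$ and $k \in \mathbb{Z}_+$,
$$\varphi(z_1, \ldots, z_m, k) = \max_{s_1, \ldots, s_m} \sum_{j \in [m]} a_j s_j \quad \text{s.t.} \quad 0 \leq s_j \leq z_j \text{ for all } j \in [m], \quad \sum_{j \in [m]} s_j \leq k. \tag{2}$$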
For a given realization of $Z^n_1, \ldots, Z^n_m$, the (trivial) optimal solution is to sort the values and select the $k$ candidates with the largest abilities. That is, one selects $S^n_1 = \min\{Z^n_1, k\}$ candidates with ability $a_1$. Then, if there is any recruiting budget left, one selects candidates with ability $a_2$ until either selecting all of them or exhausting the remaining budget of $k - Z^n_1$, i.e., one selects $S^n_2 = \min\{Z^n_2, (k - Z^n_1)_+\}$ candidates with ability $a_2$. In general, the offline number of selected $a_j$-candidates is given by
$$S^n_j = \min\Big\{Z^n_j, \Big(k - \sum_{i \in [j-1]} Z^n_i\Big)_+\Big\} \quad \text{for all } j \in [m], \tag{3}$$
so that
$$V^{*}_{\mathrm{off}}(n, k) = \sum_{j \in [m]} a_j \mathbb{E}[S^n_j].$$
The offline solution (3) has the appealing property that, up to constant deviations, all the action is in two ability levels. That is, depending on the ratio $k/n$, there is an index $j_0 \in [m]$ such that the offline sort algorithm rejects almost all of the candidates with ability strictly below $a_{j_0+1}$ and selects all the candidates with ability strictly above $a_{j_0}$. The action index $j_0$ depends on the horizon length $n$ and on the recruiting budget $k$, and it is given for any pair $(n, k) \in \mathcal{T}$ by (4). The definition of the index $j_0 = j_0(n, k)$ in (4) suggests a helpful partition of $\mathcal{T}$ into the sets $\mathcal{T}_j = \{(n, k) \in \mathcal{T} : j_0(n, k) = j\}$ for $j \in \{1, 2, \ldots, m\}$.
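The offline sort described above works on the counts alone: level $a_j$ receives $\min\{Z^n_j, (k - \sum_{i<j} Z^n_i)_+\}$ selections. A minimal sketch follows; the function name `offline_counts` and the list layout (index 0 holds the count for the largest ability $a_1$) are assumptions made for illustration.

```python
def offline_counts(z, k):
    """Offline selections per ability level.

    z[j-1] is the number of candidates with ability a_j (a_1 is the largest).
    Level j receives min(Z_j, (k - sum_{i<j} Z_i)_+) selections.
    """
    selected, seen_above = [], 0
    for z_j in z:
        selected.append(min(z_j, max(k - seen_above, 0)))
        seen_above += z_j
    return selected
```

For example, with counts `[3, 4, 5]` and budget `k = 5`, all three top-level candidates are taken, two of the second level, and none of the third.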
Proposition 1 (Offline sort decomposition). Let $\epsilon = \tfrac{1}{2} \min\{f_m, \ldots, f_1\}$ and fix $j \in \{1, \ldots, m\}$. For all $(n, k) \in \mathcal{T}_j$ one has the bounds (5). Consequently, for all $(n, k) \in \mathcal{T}_j$ one has the decomposition (6).
The following lemma will be used repeatedly in the sequel. All lemmas stated in the paper are proved in Appendix A.
Lemma 1. Let $B$ be a binomial random variable with $n$ trials and success probability $p$. Then, for any $\varepsilon > 0$, one has the bound (7).
A first use of Lemma 1 is in the proof of Proposition 1.
Proof of Proposition 1. If $j = 1$, the left inequality of (5) reduces to $-(4\epsilon)^{-1} \leq 0$, and there is nothing to prove. For $1 < j$ and $\iota < j$, we obtain from (3) that [...]. By the definition of $\mathcal{T}_j$ and (4), we have that $\bar{F}(a_j) + \epsilon \leq k/n$, so, since the sum $B = \sum_{i \in [j-1]} Z^n_i$ is binomial with parameters $n$ and $\bar{F}(a_j) = f_{j-1} + \cdots + f_1$, the first inequality in (5) follows from (7).
For the second inequality in (5), notice that if $j \in \{m-1, m\}$ then there is nothing to prove. Otherwise, if $j \leq m - 2$ we have for all $\iota \geq j + 2$ (or, equivalently, for $\iota - 1 \geq j + 1$) that [...]. For $\epsilon = \tfrac{1}{2} \min\{f_m, f_{m-1}, \ldots, f_1\}$ and $(n, k) \in \mathcal{T}_j$ we have that $k/n \leq \bar{F}(a_{j+2}) - \epsilon$, and the second inequality in (5) again follows from (7). The decomposition (6) now immediately follows by recalling that $a_m < a_{m-1} < \cdots < a_1$.
For any feasible online policy $\pi \in \Pi(n, k)$, we let $S^{\pi, r}_j = \sum_{t=1}^{r} \sigma^\pi_t \mathbb{1}\{X_t = a_j\}$ be the number of candidates with ability level $a_j$ that are selected by policy $\pi$ up to and including time $r$, so the expected total ability accrued by policy $\pi$ can be written as
$$V^{\pi}_{\mathrm{on}}(n, k) = \sum_{j \in [m]} a_j \mathbb{E}[S^{\pi, n}_j].$$
Proposition 1 suggests that, to have bounded regret, an online algorithm must be selecting almost all candidates with abilities $a_1, a_2, \ldots, a_{j_0 - 1}$ and rejecting almost all of the candidates with abilities $a_{j_0 + 2}, \ldots, a_m$ if $j_0$ is such that $k/n \in [\bar{F}(a_{j_0}) + \tfrac{1}{2} f_{j_0}, \bar{F}(a_{j_0 + 2}) - \tfrac{1}{2} f_{j_0 + 1})$. This sufficient condition guides us in the development of the Budget-Ratio policy in Section 5.
For any $(n, k) \in \mathcal{T}_j$ the definition (4) of the map $(n, k) \mapsto j_0(n, k)$ gives us that $j_0(n, k) = j$, so for any policy $\pi \in \Pi(n, k)$ and any stopping time $\tau \leq n$ we have the lower bound
$$\sum_{i \in [j-1]} a_i \mathbb{E}[S^{\pi, \tau}_i] + a_j \mathbb{E}[S^{\pi, \tau}_j] + a_{j+1} \mathbb{E}[S^{\pi, \tau}_{j+1}] \leq \sum_{i \in [m]} a_i \mathbb{E}[S^{\pi, n}_i] = V^{\pi}_{\mathrm{on}}(n, k).$$
In turn, for the policy $\pi = \pi(n, k)$ and the stopping time $\tau = \tau(\pi)$ given in the proposition, it [...]. Next, we recall that Proposition 1 gives us a constant $M = M(\epsilon)$ such that [...], and we obtain from property (iv) that $\mathbb{E}[Z^n_i - Z^\tau_i] \leq \mathbb{E}[n - \tau] \leq M$ and $0 \leq \mathbb{E}[S^n_i - S^\tau_i] \leq \mathbb{E}[n - \tau] \leq M$ for all $i \in [m]$. Consequently, the monotonicity $a_m < a_{m-1} < \cdots < a_1$ gives us the further upper bound [...], so when we combine the two inequalities (8) and (9), we obtain $V^{*}_{\mathrm{off}}(n, k) - V^{\pi}_{\mathrm{on}}(n, k) \leq 2 a_1 (m + 1) M$ for all $(n, k) \in \mathcal{T}_j$, just as needed to complete the proof of the proposition.
Remark 3 (A deterministic relaxation). A further upper bound to the multi-secretary problem is provided by an intuitive deterministic relaxation. Given the linear program (2), we consider its optimal value
$$DR(n, k) = \varphi(\mathbb{E}[Z^n_1], \ldots, \mathbb{E}[Z^n_m], k) \tag{10}$$
and we note that its optimal solution is given by
$$s^{*}_j = \min\Big\{\mathbb{E}[Z^n_j], \Big(k - \sum_{i \in [j-1]} \mathbb{E}[Z^n_i]\Big)_+\Big\} = \min\{n f_j, (k - n \bar{F}(a_j))_+\} \quad \text{for all } j \in [m]. \tag{11}$$
Since the expected number of $a_j$-candidates selected by the offline sort, $\hat{s}_j = \mathbb{E}[S^n_j] = \mathbb{E}[\min\{Z^n_j, (k - \sum_{i \in [j-1]} Z^n_i)_+\}]$, is a feasible solution for the deterministic relaxation (10), we immediately have the bound $V^{*}_{\mathrm{off}}(n, k) \leq DR(n, k)$ for all pairs $(n, k) \in \mathcal{T}$.
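The closed-form solution of the deterministic relaxation, $\min\{n f_j, (k - n\bar{F}(a_j))_+\}$ for level $a_j$, is easy to evaluate numerically. The sketch below is a hypothetical helper (the name `deterministic_relaxation` and the argument layout, with index 0 holding $a_1$ and $f_1$, are assumptions); exact fractions are used in the example only to keep the arithmetic free of rounding.

```python
from fractions import Fraction


def deterministic_relaxation(a, f, n, k):
    """Return (s, value): s[j-1] = min(n f_j, (k - n Fbar(a_j))_+) and
    value = sum_j a_j s_j, where Fbar(a_j) = f_1 + ... + f_{j-1}."""
    s, mass_above = [], 0
    for f_j in f:
        s.append(min(n * f_j, max(k - n * mass_above, 0)))
        mass_above += f_j
    return s, sum(a_j * s_j for a_j, s_j in zip(a, s))


# Three ability levels a_1 > a_2 > a_3 with masses 1/4, 1/2, 1/4.
a = [3, 2, 1]
f = [Fraction(1, 4), Fraction(1, 2), Fraction(1, 4)]
s, dr = deterministic_relaxation(a, f, n=100, k=50)
```

In this example the relaxation takes all 25 expected top-level candidates, spends the other 25 units of budget on the second level, and leaves the lowest level untouched.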
As central-limit-theorem intuition suggests, the "cost of randomness" is at most of the order of $\sqrt{n}$. But it does not have to be that large: there is a subset $\mathcal{T}' \subset \mathcal{T}$ for which the difference $DR(n, k) - V^{*}_{\mathrm{off}}(n, k)$ is bounded by a constant that does not depend on $(n, k) \in \mathcal{T}'$. Members $(n, k)$ of $\mathcal{T}'$ are such that $k/n$ is "safely" away from the jump points of the distribution $F$ (Appendix C).
For $(n, k) \in \mathcal{T}'$, benchmarking against the deterministic relaxation is the same as benchmarking against the offline sort (see also Wu et al. 2015). In general this is not the case. For instance, if one takes $k = \mathbb{E}[Z^n_1] = n f_1$ then $DR(n, k) = a_1 n f_1$ but there exists $M < \infty$ such that $a_1 \mathbb{E}[S^n_1] = a_1 \mathbb{E}[\min\{Z^n_1, k\}] \leq a_1 n f_1 - M \sqrt{n} = DR(n, k) - M \sqrt{n}$.
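The square-root gap in this example can be checked numerically by computing $\mathbb{E}[\min\{Z^n_1, k\}]$ exactly from the binomial probability mass function. The helper names below (`expected_min_binomial`, `relaxation_gap`) are illustrative assumptions, and the gap is reported in units of $a_1$.

```python
from math import comb


def expected_min_binomial(n, p, k):
    """E[min(Z, k)] for Z ~ Binomial(n, p), summed over the exact pmf."""
    return sum(
        min(z, k) * comb(n, z) * p**z * (1 - p) ** (n - z)
        for z in range(n + 1)
    )


def relaxation_gap(n, f1):
    """n f_1 - E[min(Z, n f_1)]: the relaxation gap at k = n f_1, per unit of a_1."""
    k = round(n * f1)
    return k - expected_min_binomial(n, f1, k)


gap_100 = relaxation_gap(100, 0.5)
gap_400 = relaxation_gap(400, 0.5)
```

Quadrupling $n$ roughly doubles the gap, which is the $\sqrt{n}$ scaling that the central-limit heuristic predicts.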

The square-root regret of non-adaptive policies
A non-adaptive policy $\pi$ is an online feasible policy that is characterized by a probability matrix $\{p_{j,t} : j \in [m] \text{ and } t \in [n]\}$. The entry $p_{j,t}$ represents the probability of selecting the candidate inspected at time $t$ given that the candidate's ability is $X_t = a_j$ and that the recruiting budget remaining at time $t$ is non-zero. Formally, given $\pi = \{p_{j,t} : j \in [m] \text{ and } t \in [n]\}$, let $B_1, B_2, \ldots, B_n$ be a sequence of independent Bernoulli random variables with success probabilities $q_1, q_2, \ldots, q_n$ such that
$$\mathbb{E}[B_t \mid X_t = a_j] = p_{j,t} \quad \text{and} \quad q_t = \mathbb{E}[B_t] = \sum_{j \in [m]} p_{j,t} f_j.$$
The non-adaptive policy $\pi$ selects candidates until it runs out of budget or reaches the end of the horizon, i.e., up to the stopping time $\nu = \nu(\pi)$ given by
$$\nu = \min\Big\{r \geq 1 : \sum_{t \in [r]} B_t \geq k \text{ or } r \geq n\Big\}.$$
The expected total ability accrued by the non-adaptive policy $\pi$ is then given by [...], and $V^{*}_{\mathrm{na}}(n, k) = \sup_{\pi \in \Pi_{\mathrm{na}}} V^{\pi}_{\mathrm{on}}(n, k)$ is the performance of the best non-adaptive policy.
An intuitive non-adaptive policy is the index policy $\mathrm{id} = \{p_{j,t} : j \in [m] \text{ and } t \in [n]\}$ that takes its probabilities from the solution (11) to the deterministic relaxation (10). If the residual budget at time $t$ is non-zero and if $j_{\mathrm{id}} \in [m]$ is the index such that $\bar{F}(a_{j_{\mathrm{id}}}) \leq k/n < \bar{F}(a_{j_{\mathrm{id}} + 1})$, then the index policy selects an arriving $a_j$-candidate with the probability $p_{j,t}$ given in (12).
In the main result of this section we prove that, for a large range of $(n, k)$ pairs, the regret of non-adaptive policies is generally of the order of $\sqrt{n}$. Viewed in the context of Proposition 2, non-adaptive policies violate property (iv): they run out of budget too early. The proof relies on this "greediness."
Theorem 2 (The regret of non-adaptive policies). For $\epsilon = \tfrac{1}{2} \min\{f_m, f_{m-1}, \ldots, f_1\}$ suppose that $(f_1 + \epsilon) n \leq k \leq (1 - f_m - \epsilon) n$. Then there is a constant $M = M(\epsilon, m, a_1, \ldots, a_m)$ such that $M \sqrt{n} \leq V^{*}_{\mathrm{off}}(n, k) - V^{*}_{\mathrm{na}}(n, k)$.
As one might expect, the non-adaptive (time-homogeneous) index policy in (12) already achieves this order of magnitude and, in this sense, is representative of the performance of general non-adaptive policies.
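Lemma 2 (The regret of the non-adaptive index policy). The non-adaptive index policy $\mathrm{id} = \{p_{j,t} : j \in [m] \text{ and } t \in [n]\}$ has a $\sqrt{n}$ regret. That is, for any $\varepsilon \in (0, 1)$ and all pairs $(n, k)$ such that $\varepsilon \leq k/n$ we have
$$V^{*}_{\mathrm{off}}(n, k) - V^{\mathrm{id}}_{\mathrm{on}}(n, k) \leq DR(n, k) - V^{\mathrm{id}}_{\mathrm{on}}(n, k) \leq \varepsilon^{-1} a_1 \sqrt{n}.$$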
The conditions of Theorem 2 are not necessary, but certain budget ranges do have to be excluded. In the small-budget range with $k \leq (f_1 - \epsilon) n$, for example, the regret is in fact constant. The offline solution mostly takes $a_1$ values. The non-adaptive policy $\hat{\pi}$ that has $p_{1,t} = 1$ for all $t \in [n]$ and $p_{j,t} = 0$ for all $j \neq 1$ and all $t \in [n]$ will achieve a constant regret. Interestingly, in this same range, the index policy may not be as aggressive. For instance, if $k = (f_1 - \epsilon) n$ we see from (12) that the index policy sets $p_{1,t} = 1 - \epsilon / f_1$, spreading out the selection of $a_1$ values throughout the time horizon and having a regret of order $\sqrt{n}$.
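A non-adaptive index policy can be simulated as follows. This is a sketch under a stated assumption: the selection probability for level $a_j$ is taken to be the deterministic-relaxation allocation divided by the expected number of arrivals, $\min\{n f_j, (k - n\bar{F}(a_j))_+\} / (n f_j)$, which agrees with the value $p_{1,t} = 1 - \epsilon/f_1$ quoted above for $k = (f_1 - \epsilon) n$ but is not copied from the source. The function names and the requirement $f_j > 0$ are likewise assumptions.

```python
import random


def index_policy_probs(f, n, k):
    """Time-homogeneous selection probabilities (assumed form, see lead-in).

    f[j-1] is the mass of ability a_j (a_1 largest); every f_j must be > 0.
    """
    probs, mass_above = [], 0.0
    for f_j in f:
        allocation = min(n * f_j, max(k - n * mass_above, 0.0))
        probs.append(allocation / (n * f_j))
        mass_above += f_j
    return probs


def run_non_adaptive(a, f, probs, n, k, rng):
    """Simulate one path; return (total ability, time the policy stopped)."""
    if k == 0:
        return 0.0, 0
    budget, total = k, 0.0
    for t in range(1, n + 1):
        j = rng.choices(range(len(a)), weights=f)[0]  # ability level of arrival
        if rng.random() < probs[j]:                   # independent coin flip
            total += a[j]
            budget -= 1
            if budget == 0:
                return total, t                       # budget exhausted early
    return total, n


probs = index_policy_probs([0.25, 0.5, 0.25], n=100, k=50)
total, stop = run_non_adaptive([3, 2, 1], [0.25, 0.5, 0.25], probs, 100, 50,
                               random.Random(0))
```

Because the coin flips ignore how much budget has already been spent, the stopping time fluctuates on the $\sqrt{n}$ scale, which is the source of the regret discussed in this section.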
In the proof of Theorem 2 we use three auxiliary lemmas, the first of which provides a lower bound for the overshoot of a centered Bernoulli random walk.
The second auxiliary lemma shows that any good non-adaptive policy must be a perturbation of the index policy. Specifically, if $s_j(\pi) = \sum_{t \in [n]} p_{j,t} f_j$ is the expected number of $a_j$ candidates that policy $\pi$ selects under infinite budget, then $s_j(\pi)$ is just a perturbation of $s^{*}_j$, the solution of the deterministic relaxation (10).
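[...] $\epsilon = \tfrac{1}{2} \min\{f_m, f_{m-1}, \ldots, f_1\}$, and take any $(2 \epsilon \min_{j \in [m-1]} |a_j - a_{j+1}|)^{-2} M^2 \leq n$. If $\pi = \{p_{j,t} : j \in [m] \text{ and } t \in [n]\}$ is a non-adaptive policy such that $DR(n, k) - V^{\pi}_{\mathrm{on}}(n, k) \leq M \sqrt{n}$, then $s_j(\pi) = s^{*}_j \pm \{\min_{j \in [m-1]} |a_j - a_{j+1}|\}^{-1} M \sqrt{n}$ for all $j \in [m]$.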
In words, when $k$ is bounded away from both $0$ and $n$, the offline algorithm selects, in expectation, all but a constant number of the highest values and at most a constant number of the lowest values. The optimal non-adaptive policy must do so as well. In particular, it should have $p_{1,t} = 1$ and $p_{m,t} = 0$ in "most" time periods, so that the marginal probability of selection, $q_t$, is safely bounded away from zero and from one. The last auxiliary lemma makes this intuitive idea formal.
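[...] $\epsilon = \tfrac{1}{2} \min\{f_m, f_{m-1}, \ldots, f_1\}$ and suppose that $(f_1 + \epsilon) n \leq k \leq (1 - f_m - \epsilon) n$. Then, there is a constant $M = M(\epsilon, a_1, a_2, a_m) < \infty$ such that an optimal non-adaptive policy must satisfy [...]. Consequently, [...], and one has the lower bound [...].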

Proof of Theorem 2
For any non-adaptive policy $\pi = \{p_{j,t} : j \in [m] \text{ and } t \in [n]\}$ with associated stopping time $\nu = \min\{r \geq 1 : \sum_{t \in [r]} B_t \geq k \text{ or } r \geq n\}$, we have that policy $\pi$ does not make any selection after time $\nu$, and we also have that $Z^\nu_j \leq Z^n_j$ for all $j \in [m]$. Thus, if we recall the linear program (2), use the monotonicity of $\varphi(z_1, \ldots, z_m, \cdot)$ in $(z_1, \ldots, z_m)$, and recall the equivalence (1), we obtain that
$$V^{\pi}_{\mathrm{on}}(n, k) \leq \mathbb{E}[\varphi(Z^\nu_1, \ldots, Z^\nu_m, k)] \leq \mathbb{E}[\varphi(Z^n_1, \ldots, Z^n_m, k)] = V^{*}_{\mathrm{off}}(n, k).$$
For $\epsilon = \tfrac{1}{2} \min\{f_m, f_{m-1}, \ldots, f_1\}$ and $(f_1 + \epsilon) n \leq k \leq (1 - f_m - \epsilon) n$, Hoeffding's inequality (see, e.g., Boucheron et al. 2013, Theorem 2.8) immediately tells us that there is a constant $M = M(\epsilon) < \infty$ such that $n \mathbb{P}\{Z^n_1 > k\} \leq M$, and Wald's lemma gives us the estimate (16), so the proof of the theorem is complete if we can prove an appropriate lower bound for $\mathbb{E}[n - \nu]$.
Lemma 2 tells us that for $f_1 + \epsilon \leq k/n$ the index policy has regret that is bounded above by $(f_1 + \epsilon)^{-1} a_1 \sqrt{n}$, so that it suffices to consider non-adaptive policies $\pi = \{p_{j,t} : j \in [m] \text{ and } t \in [n]\}$ for which $DR(n, k) - V^{\pi}_{\mathrm{on}}(n, k) \leq (f_1 + \epsilon)^{-1} a_1 \sqrt{n}$.
Since $\sum_{j \in [m]} s^{*}_j = k$, we also have the estimate (17). Furthermore, since $0 \leq B_t \leq 1$ for all $t \in [n]$ we know that (18) holds. With $\varsigma^2(\pi) = \sum_{t \in [n]} q_t (1 - q_t)$, the estimate (17) implies that $k \leq \sum_{t \in [n]} q_t + [m M \varsigma(\pi)^{-1} \sqrt{n}]\, \varsigma(\pi)$, and Lemma 6 tells us that there is a constant $M = M(\epsilon, m, a_1, \ldots, a_m)$ such that $m M \varsigma(\pi)^{-1} \sqrt{n} \leq M$. In turn, we also have that $k \leq \sum_{t \in [n]} q_t + M \varsigma(\pi)$, so after we subtract $\sum_{t \in [n]} B_t$ on both sides, change sign, take the positive part, and recall (18), we obtain the lower bound
$$\Big(\sum_{t \in [n]} (B_t - q_t) - M \varsigma(\pi)\Big)_+ \leq \Big(\sum_{t \in [n]} B_t - k\Big)_+ \leq n - \nu.$$
The random variable on the left-hand side is built from a sum of centered independent Bernoullis so, when we take expectations, Lemma 4 tells us that there is a constant $\beta_1 = \beta_1(\epsilon, m, a_1, \ldots, a_m)$ such that [...]. Plugging this last estimate back into (16) gives us that $f_1 (\beta_1 \varsigma(\pi) - 2 - 3\sqrt{2}) \leq \mathbb{E}[Z^n_1 - Z^\nu_1]$, and the theorem then follows after one uses one more time the lower bound for $\varsigma(\pi)$ given in Lemma 6 and chooses $M = M(\epsilon, m, a_1, \ldots, a_m)$ accordingly.

The Budget-Ratio (BR) policy
We now introduce an adaptive online policy that makes selection decisions depending on the ratio between the remaining number of positions to be filled (the remaining budget) and the remaining number of candidates to be inspected (the remaining time), and we refer to this policy as the Budget-Ratio (BR) policy.
With $\pi = \mathrm{br}$, the random variables $\sigma^{\mathrm{br}}_1, \sigma^{\mathrm{br}}_2, \ldots, \sigma^{\mathrm{br}}_n$ give us the sequence of selection decisions under the BR policy (see Section 2), and we let, for $t \in [n]$, $K_t = k - \sum_{s \in [t]} \sigma^{\mathrm{br}}_s$ be the remaining budget after the $t$th decision ($K_0 = k$). We now introduce the thresholds $0 = T_1 < T_2 < \cdots < T_m < T_{m+1} = +\infty$ given by
$$T_j = \tfrac{1}{2} \big(\bar{F}(a_j) + \bar{F}(a_{j+1})\big) \quad \text{for each } j \in \{2, 3, \ldots, m\},$$
so that the Budget-Ratio decision at time $t + 1$ selects $X_{t+1}$ depending on its value and on the position of the ratio $K_t / (n - t)$ relative to these thresholds. Specifically, at each decision time $t + 1 \in \{1, \ldots, n - 1\}$, the BR policy
(i) identifies the index $j \in [m]$ such that $T_j \leq K_t / (n - t) < T_{j+1}$; and
(ii) selects $X_{t+1}$ if and only if $K_t > 0$ and $X_{t+1} \geq a_j$; i.e., it sets $\sigma^{\mathrm{br}}_{t+1} = 1$ if $K_t > 0$ and $X_{t+1} > a_{j+1}$, and $\sigma^{\mathrm{br}}_{t+1} = 0$ otherwise.
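The two steps of the Budget-Ratio rule can be sketched in code. Several details here are assumptions made for illustration: the function names, the convention that lists are ordered from the largest ability $a_1$ downward, the use of exact fractions for the masses, and the reading of step (i) as locating the ratio in the half-open interval $[T_j, T_{j+1})$.

```python
import bisect
import random
from fractions import Fraction


def br_thresholds(f):
    """Thresholds [T_1, ..., T_m]: T_1 = 0 and, for j >= 2,
    T_j = (Fbar(a_j) + Fbar(a_{j+1})) / 2 = Fbar(a_j) + f_j / 2.
    f[j-1] is the mass of a_j, given as a Fraction (a_1 is the largest)."""
    thresholds, mass_above = [Fraction(0)], Fraction(0)
    for j, f_j in enumerate(f, start=1):
        if j >= 2:
            thresholds.append(mass_above + Fraction(f_j) / 2)
        mass_above += Fraction(f_j)
    return thresholds


def br_select(x, budget, remaining, a, thresholds):
    """Select x iff budget > 0 and x >= a_j, where T_j <= budget/remaining < T_{j+1}."""
    if budget <= 0:
        return False
    ratio = Fraction(budget, remaining)
    j = bisect.bisect_right(thresholds, ratio)  # number of thresholds <= ratio
    return x >= a[j - 1]


def run_br(a, f, n, k, rng):
    """Simulate one path; return (total ability, number of selections)."""
    thresholds = br_thresholds(f)
    weights = [float(w) for w in f]
    budget, total = k, 0
    for t in range(n):
        x = rng.choices(a, weights=weights)[0]
        if br_select(x, budget, n - t, a, thresholds):
            total += x
            budget -= 1
    return total, k - budget


a = [3, 2, 1]
f = [Fraction(1, 4), Fraction(1, 2), Fraction(1, 4)]
T = br_thresholds(f)
total, used = run_br(a, f, n=200, k=80, rng=random.Random(0))
```

For a uniform distribution on ten points the same helper returns the thresholds $0, 0.15, 0.25, 0.35, \ldots$, so the value $0.35$ highlighted in the introduction's sample-path figure appears as the fourth threshold.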

Figure 2
The BR policy: thresholds and dynamics. The y-axis has the thresholds of the BR policy for the 5-point distribution on $\mathcal{A} = \{a_5, a_4, a_3, a_2, a_1\}$ with the probability mass function $(f_5, f_4, f_3, f_2, f_1) = (\tfrac{5}{28}, \tfrac{5}{28}, \tfrac{7}{28}, \tfrac{6}{28}, \tfrac{5}{28})$. The plotted series is a sample path realization of the ratio $\{K_t / (n - t) : 0 \leq t \leq n\}$, which enters the "orbit" of the threshold $T_3$ at time $\tau_0$ (so $j(\tau_0) = 3$) and exits at time $\tau$. Up until $\tau_0$ both thresholds $T_3$ and $T_4$ are in play. In this chart we take $\delta = \tfrac{19}{224} < \epsilon = \tfrac{5}{56} = \tfrac{1}{2} \min\{f_5, \ldots, f_1\}$. Notice that $\bar{F}(a_3) = f_1 + f_2$. When $\bar{F}(a_3) < K_t / (n - t) < T_3$, the budget is, in expectation, sufficient to take some $a_3$ values but the policy will not do that until $T_3$ is crossed. This "under-selection" makes $K_t / (n - t)$ drift up toward $T_3$. When $T_3 < K_t / (n - t) < \bar{F}(a_4)$, the budget is, in expectation, insufficient to take all $a_3$ values but the policy does select them. This "over-selection" makes $K_t / (n - t)$ drift down toward $T_3$.
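As a worked check of the numbers in this caption, the thresholds of the 5-point distribution can be computed by hand from $T_j = \tfrac{1}{2}(\bar{F}(a_j) + \bar{F}(a_{j+1}))$, using the convention $\bar{F}(a_6) = 1$. The arithmetic below is a reader's verification under those definitions, not a display taken from the source.

```latex
\begin{align*}
\bar F(a_2) &= \tfrac{5}{28}, \quad
\bar F(a_3) = \tfrac{11}{28}, \quad
\bar F(a_4) = \tfrac{18}{28}, \quad
\bar F(a_5) = \tfrac{23}{28}, \quad
\bar F(a_6) = 1, \\
T_2 &= \tfrac12\big(\tfrac{5}{28} + \tfrac{11}{28}\big) = \tfrac{16}{56}, \quad
T_3 = \tfrac12\big(\tfrac{11}{28} + \tfrac{18}{28}\big) = \tfrac{29}{56}, \\
T_4 &= \tfrac12\big(\tfrac{18}{28} + \tfrac{23}{28}\big) = \tfrac{41}{56}, \quad
T_5 = \tfrac12\big(\tfrac{23}{28} + 1\big) = \tfrac{51}{56}, \\
\epsilon &= \tfrac12 \cdot \tfrac{5}{28} = \tfrac{5}{56} = \tfrac{20}{224}
  > \delta = \tfrac{19}{224}.
\end{align*}
```

In particular $\bar{F}(a_3) = \tfrac{22}{56} < T_3 = \tfrac{29}{56} < \bar{F}(a_4) = \tfrac{36}{56}$, which matches the two drift regions described in the caption.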
Figure 2 gives a graphical representation of the selection regions of the BR policy. The policy has two natural properties: (i) since $T_1 = 0$, the BR policy selects all $a_1$-valued candidates until exhausting the budget; and (ii) since $T_m < 1$ and $T_{m+1} = +\infty$, the BR policy selects all remaining values as soon as the remaining budget is greater than or equal to the remaining number of time periods (i.e., if $n - t \leq K_t$).
If $\tau_0 < n - 2\delta^{-1} - 1$, then $\tau_0$ is the first time that the ratio $K_t / (n - t)$ enters the "orbit" of one of the thresholds; see Figure 2. We denote by $j(\tau_0)$ the index of the threshold that is within $\delta/2$ of the ratio $K_{\tau_0} / (n - \tau_0)$, and we use $T_{j(\tau_0)}$ to denote the value of that threshold. If $\tau_0 = n - 2\delta^{-1} - 1$, then we set $j(\tau_0) = m + 1$ and $T_{j(\tau_0)} = \infty$. For all $t \leq n - 2\delta^{-1} - 1$ the jumps of $K_t / (n - t)$ satisfy the absolute bound [...], so that, on the event $\tau_0 < n - 2\delta^{-1} - 1$, we are guaranteed that $T_{j(\tau_0)}$ is either $T_j$ or $T_{j+1}$ when $j$ is such that $k/n \in [T_j, T_{j+1})$.
After time $\tau_0$, we consider the process
$$Y_u = K_{\tau_0 + u} - T_{j(\tau_0)} (n - \tau_0 - u) \quad \text{for } u \in \{0, 1, \ldots, n - \tau_0\},$$
which serves as a useful vehicle to study the behavior of the budget-ratio process and, in particular, to track the deviations of the budget ratio from the threshold $T_{j(\tau_0)}$. This is because [...]. In words, the ratio is outside of the dark region in Figure 2 if and only if the deviation process $Y_u$ exceeds the "moving target" $\delta (n - \tau_0 - u)$.
The initial condition is in all of the three scenarios such that $k/n$ is already within the orbit of $T_3$, so $\tau_0 = 0$ and $j(\tau_0) = 3$.
[...] and we have the negative-drift property (19). In the case that $Y_u < 0$ and $K_{\tau_0 + u} > 0$, we have that $K_{\tau_0 + u} / (n - \tau_0 - u) < T_{j(\tau_0)}$, so that the BR policy skips all values smaller than or equal to $a_{j(\tau_0)}$, i.e., $\sigma^{\mathrm{br}}_{\tau_0 + u + 1} \leq \mathbb{1}(X_{\tau_0 + u + 1} > a_{j(\tau_0)})$, so that [...] and we have the strictly positive drift (20). For completeness, we also note that if $K_{\tau_0 + u} = 0$, then the drift is simply given by [...]. The bounds (19) and (20) show that, regardless of whether the ratio $K_{\tau_0 + u} / (n - \tau_0 - u)$ is above or below the critical threshold, the BR policy pulls it towards the threshold. This preliminary drift analysis will be useful in showing that the stopping time at which the ratio exits the critical orbit (see again the right side of Figure 2) is suitably large.
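The one-step computation behind the two drift bounds can be sketched as follows. This is a reconstruction from the definitions of $Y_u$ and of the thresholds, not a quotation of the displays (19) and (20): it assumes a finite threshold with index $j = j(\tau_0) \in \{2, \ldots, m\}$, so that $T_j = \bar{F}(a_j) + \tfrac{1}{2} f_j = \bar{F}(a_{j+1}) - \tfrac{1}{2} f_j$, and it uses the shorthand $s = \tau_0 + u$ introduced here for readability.

```latex
\begin{align*}
Y_{u+1} - Y_u
  &= \big(K_{s} - \sigma^{\mathrm{br}}_{s+1}\big) - T_j (n - s - 1)
     - K_{s} + T_j (n - s)
   = T_j - \sigma^{\mathrm{br}}_{s+1}. \\
\intertext{If $Y_u \ge 0$ (so that $K_s/(n-s) \ge T_j$ and all values $\ge a_j$ are selected):}
\mathbb{E}[\,Y_{u+1} - Y_u \mid \mathcal{F}_s\,]
  &\le T_j - \mathbb{P}(X_{s+1} \ge a_j)
   = T_j - \bar F(a_{j+1})
   = -\tfrac12 f_j \le -\epsilon. \\
\intertext{If $Y_u < 0$ and $K_s > 0$ (so that all values $\le a_j$ are skipped):}
\mathbb{E}[\,Y_{u+1} - Y_u \mid \mathcal{F}_s\,]
  &\ge T_j - \mathbb{P}(X_{s+1} > a_j)
   = T_j - \bar F(a_j)
   = \tfrac12 f_j \ge \epsilon. \\
\intertext{If $K_s = 0$, nothing can be selected and}
Y_{u+1} - Y_u &= T_j .
\end{align*}
```

Under these assumptions the drift has magnitude at least $\epsilon = \tfrac{1}{2}\min\{f_m, \ldots, f_1\}$ on either side of the threshold, which is the mean-reverting behavior the argument relies on.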
Theorem 3 (BR stopping time). Let $\epsilon = \tfrac{1}{2} \min\{f_m, f_{m-1}, \ldots, f_1\}$. Then there is a constant $M = M(\epsilon)$ such that for all $(n, k) \in \mathcal{T}$, the stopping time $\tau$ in (22) satisfies the bound $\mathbb{E}[\tau] \geq n - M$.
The proof of Theorem 3 (at the end of this section) is based on the mean-reversion property established above and a Lyapunov function argument. One must be careful in this analysis: when approaching the horizon's end, a small change in $K_t$ can lead to a large change in $K_t / (n - t)$. Viewed in terms of the deviation process $Y_u$, the challenge is that the target $\delta (n - \tau_0 - u)$ is moving and easier to exceed when $u$ is large.
Theorem 3 shows that the BR policy satisfies the sufficient condition (iv) in Proposition 2.
Corollary 1 then proceeds to show that the remaining requirements (i)-(iii) in Proposition 2 are also satisfied.
Corollary 1 (Uniformly bounded regret). Let $\epsilon = \tfrac{1}{2}\min\{f_m, f_{m-1}, \ldots, f_1\}$. Then the BR policy and the stopping time $\tau$ in (22) satisfy the properties in Proposition 2. In particular, there is a constant $M = M(\epsilon)$ such that $V^{*}_{\mathrm{off}}(n,k) - V^{*}_{\mathrm{on}}(n,k) \le 2 a_1 (m+1) M$ for all $(n,k) \in \mathcal{T}$.
The BR policy, while achieving bounded regret, is not the optimal policy. Considering the optimality equation (45) developed in Appendix B, one can see that with two periods to go ($\ell = 2$ there) and one unit of budget ($k = 1$) the optimal action is to take any (and only) values $a_j \ge h_2(1) = g_1(1) - g_1(0) = \mathbb{E}[X_1]$. In this state, the BR policy will instead take all values above the median. Of course, while the BR policy makes some mistakes when approaching the horizon's end, Corollary 1 is evidence that it mostly does the right thing.
Proof of Corollary 1. It suffices to prove that the BR policy satisfies the sufficient conditions stated in Proposition 2. By Theorem 3 the BR policy has the associated stopping time τ in (22) that satisfies condition (iv) in the proposition. Conditions (i)-(iii) are verified by the following sample-path argument.
If $n < 2\delta^{-1} + 1$, then the three conditions are satisfied immediately by choosing $M$ to be a suitable constant. Otherwise, if $n \ge 2\delta^{-1} + 1$, we note that $T_j = \bar F(a_j) + \tfrac{1}{2} f_j = \bar F(a_{j+1}) - \tfrac{1}{2} f_j$ for all $j \in \{2, \ldots, m\}$, so the definition (4) tells us that if $k/n \in [T_j, T_{j+1})$ then $j_0(n,k) = j$ for all $j \in [m]$. Thus, if $j$ is the index such that $k/n \in [T_j, T_{j+1})$, then $j$ is the "action" index identified in Proposition 2, and we verify the three conditions by distinguishing the case $\{\tau_0 < n - 2\delta^{-1} - 1\}$ from $\{\tau_0 = n - 2\delta^{-1} - 1\}$.
As argued earlier, on the event $\{\tau_0 < n - 2\delta^{-1} - 1\}$ the index $j(\tau_0)$ belongs to $\{j, j+1\}$. In turn, for all $t < \tau_0$, the BR policy selects $X_{t+1} \ge a_j$ and skips all $X_{t+1} \le a_{j+1}$. If $j(\tau_0) = j$ then for time indices $t \in [\tau_0, \tau)$ all values $a_{j-1}$ and greater are selected and all values $a_{j+1}$ or smaller are skipped. If $j(\tau_0) = j+1$, then on $t \in [\tau_0, \tau)$ all values $a_j$ and greater are selected and all those smaller than or equal to $a_{j+2}$ are skipped. Thus, on the event $\{\tau_0 < n - 2\delta^{-1} - 1\}$ we have [...] where, recall, $S^{\mathrm{br},\tau}_i$ is the number of $a_i$ candidates selected by time $\tau$. Furthermore, [...] Here, the left equality holds because out of the total budget used by time $\tau$, which is given by $k - K_\tau = \sum_{t \in [\tau]} \sigma^{\mathrm{br}}_t$, the quantity $\sum_{i \in [j-1]} Z^{\tau}_i$ is allocated to values larger than or equal to $a_{j-1}$ and the remainder to $a_j$.
By combining the observations in (24) and (25), we find on the event $\{\tau_0 < n - 2\delta^{-1} - 1\}$ that [...] and [...] where we use the fact that if $x, y, z$ are non-negative numbers then $\min(x - y, z) \ge \min(x, z) - y$.
On the event $\{\tau_0 = n - 2\delta^{-1} - 1\}$ we have that $\tau = \tau_0$ and all values greater than or equal to $a_j$ are selected and all values lower than or equal to $a_{j+1}$ are skipped, so that [...] and $0 = S^{\mathrm{br},\tau}_{j+1} = (k - \sum \cdots$ [...] Finally, since the BR policy selects all remaining values as soon as there is a $t \in [n]$ such that $K_t \ge n - t$, we have that $K_\tau \le n - \tau$ and, consequently, that $-\mathbb{E}[K_\tau] \ge \mathbb{E}[\tau] - n \ge -M$ (30), where the last inequality follows from Theorem 3.
If we now recall the estimates (23), (26) and (27) which hold on the event $\{\tau_0 < n - 2\delta^{-1} - 1\}$ and the relations (28) and (29) which are satisfied on $\{\tau_0 = n - 2\delta^{-1} - 1\}$, take expectations and recall the bound (30), we see that the sufficient conditions (i)-(iii) in Proposition 2 are all satisfied.

5.1. An alternative to Budget-Ratio: the Adaptive-Index policy
In closely related work, Wu et al. (2015) offer an elegant adaptive-index policy that we revisit here.
Given the deterministic relaxation (10), we re-solve in each time period the deterministic problem $\mathrm{DR}(n-t, K_t) = \varphi(\mathbb{E}[Z^{n-t}_1], \ldots, \mathbb{E}[Z^{n-t}_m], K_t)$, and recall from (11) that the optimal solution is given by $s^{*}_{j,t} = \min\{\mathbb{E}[Z^{n-t}_j], (K_t - \sum_{i \in [j-1]} \mathbb{E}[Z^{n-t}_i])_+\} = \min\{f_j(n-t), (K_t - \bar F(a_j)(n-t))_+\}$ for all $j \in [m]$.
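As a minimal sketch of this re-solving step, the function below evaluates the closed-form solution $s^{*}_{j} = \min\{f_j h, (K - \bar F(a_j) h)_+\}$ for a horizon $h$ and budget $K$. The function name `dr_solution` and the three-point distribution in the usage note are hypothetical; values are assumed sorted in decreasing order, so that $\bar F(a_j) = f_1 + \cdots + f_{j-1}$, and the relaxation value is assumed to be $\sum_j a_j s^{*}_j$.

```python
def dr_solution(values, probs, horizon, budget):
    """Closed-form solution of the deterministic relaxation.

    values: a_1 > a_2 > ... > a_m (assumed sorted in decreasing order)
    probs:  f_1, ..., f_m
    Returns (s_star, value) with value = sum_j a_j * s_star[j].
    """
    s_star = []
    tail = 0.0  # running \bar F(a_j) = f_1 + ... + f_{j-1}
    for f in probs:
        s = min(f * horizon, max(budget - tail * horizon, 0.0))
        s_star.append(s)
        tail += f
    value = sum(a * s for a, s in zip(values, s_star))
    return s_star, value
```

For instance, `dr_solution([1.0, 0.5, 0.25], [0.25, 0.5, 0.25], 8, 4)` allocates two units to the top value, two units to the middle value, and none to the lowest one.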
Then, we construct the adaptive (re-optimized) index policy by mimicking the solution of this optimization problem. Specifically, if $s^{*}_{j,t} = f_j(n-t)$ and the candidate inspected at time $t+1$ has ability $a_j$ then that candidate is selected. Otherwise, if $s^{*}_{j,t} = K_t - \bar F(a_j)(n-t) > 0$ then an arriving $a_j$ candidate is selected with probability $\frac{K_t - \bar F(a_j)(n-t)}{f_j(n-t)}$. Finally, if $s^{*}_{j,t} = 0$ then an arriving $a_j$ candidate is rejected.
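The three cases above can be sketched as a short simulation. This is an illustration under stated assumptions rather than the authors' implementation: the function name `adaptive_index_run` is hypothetical, values are assumed sorted in decreasing order, and the selection probability in the intermediate case is taken to be $s^{*}_{j,t}/(f_j(n-t))$, which makes the overall selection probability equal to $K_t/(n-t)$ capped at one.

```python
import random


def adaptive_index_run(values, probs, n, k, rng):
    """Simulate one sample path; return (total selected value, unused budget)."""
    budget, total = k, 0.0
    for t in range(n):
        remaining = n - t                          # periods left, including the next arrival
        j = rng.choices(range(len(values)), weights=probs)[0]
        tail = sum(probs[:j])                      # \bar F(a_j) = f_1 + ... + f_{j-1}
        full = probs[j] * remaining                # f_j (n - t)
        s = min(full, max(budget - tail * remaining, 0.0))   # re-solved s*_{j,t}
        accept_prob = s / full                     # 1 if s = f_j (n - t), 0 if s = 0 (assumed rule)
        if budget > 0 and rng.random() < accept_prob:
            total += values[j]
            budget -= 1
    return total, budget
```

With `k = n` the ratio starts at one, every arriving candidate is selected, and the budget is exhausted exactly at the horizon's end.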
This policy induces a nice martingale structure. Using the notation introduced in Section 2, we let $\sigma^{\mathrm{ai}}_1, \sigma^{\mathrm{ai}}_2, \ldots, \sigma^{\mathrm{ai}}_n$ be the sequence of decisions of the Adaptive-Index policy and let $K^{\mathrm{ai}}_t = k - \sum_{s \in [t]} \sigma^{\mathrm{ai}}_s$ be the associated remaining-budget process for all $t \in [n]$ and with $K^{\mathrm{ai}}_0 = k$. In the statement and in the proof of the next proposition, we use the standard notation $a \wedge b = \min\{a, b\}$.
Proposition 3. Let $\tau' = \inf\{t \in [n] : K^{\mathrm{ai}}_t/(n-t) > 1\}$. Then the stopped Adaptive-Index ratio process $\{R_{t \wedge \tau'} = \frac{K^{\mathrm{ai}}_{t \wedge \tau'}}{n - (t \wedge \tau')} : t \in [n]\}$ is a martingale.

Proof. Since $0 \le R_{t \wedge \tau'} \le 1$ it is trivially true that $\mathbb{E}[|R_{t \wedge \tau'}|] < \infty$ for all $t \in [n]$. Next, if $\mathcal{F}_t$ is the $\sigma$-field generated by the random variables $\sigma^{\mathrm{ai}}_1, \ldots, \sigma^{\mathrm{ai}}_t$, then we have that [...] where $\sigma^{\mathrm{ai}}_{t+1} = 1$ with probability $\frac{K^{\mathrm{ai}}_t}{n-t} \wedge 1$ and it is zero otherwise. Hence, [...] where we use the fact that $\mathbb{1}\{\tau' > t\}$ implies $K^{\mathrm{ai}}_t/(n-t) \le 1$.

Because of this martingale structure, the Adaptive-Index ratio $K^{\mathrm{ai}}_t/(n-t)$ remains "close" to $k/n$ so that this policy, like the BR policy, is careful in utilizing its budget and does not run out of it until (almost) the horizon's end. Furthermore, this martingale property guarantees bounded regret when the initial ratio $k/n$ is safely far from the masses of the discrete distribution (see also Wu et al. 2015, and Section C in this paper).
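The one-step identity behind the martingale property can be verified in exact arithmetic. The sketch below (the function name `next_ratio_mean` is hypothetical) takes a budget $K_t \le n - t$, selects with probability $K_t/(n-t) \wedge 1$, and computes the conditional mean of the next ratio $K_{t+1}/(n-t-1)$.

```python
from fractions import Fraction


def next_ratio_mean(budget, remaining):
    """E[K_{t+1} / (n - t - 1) | K_t = budget], where remaining = n - t >= 2."""
    p = min(Fraction(budget, remaining), Fraction(1))   # probability that the next candidate is selected
    skipped = Fraction(budget, remaining - 1)           # next ratio if the candidate is skipped
    selected = Fraction(budget - 1, remaining - 1)      # next ratio if the candidate is selected
    return (1 - p) * skipped + p * selected


# The conditional mean of the next ratio equals the current ratio.
for remaining in range(2, 40):
    for budget in range(remaining + 1):
        assert next_ratio_mean(budget, remaining) == Fraction(budget, remaining)
```

The check reflects the algebra $(K - K/r)/(r-1) = K/r$ with $r = n - t$.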
In general, however, spending the budget at the right "rate" is not sufficient. It is also important that the budget is spent on the right candidates, and the symmetric martingale structure is too weak for that purpose: when initialized, for example, at $k/n = \bar F(a_j)$ for some $j$, the martingale spends an equal amount of time below and above $\bar F(a_j)$, and the re-optimized index policy takes the values $a_j$ and $a_{j+1}$ in equal proportions. A good policy should start selecting $a_{j+1}$ values only after it has selected all (or most) $a_j$ values first. Figure 4 gives an illustration of this phenomenon. Conversely, if the ratio $k/n$ is safely far from the jumps of the discrete distribution, then the adaptive index policy [...]

[Figure 4 caption fragment: [...] keeping the initial budget at $k/n = \bar F(0.6)$ (a mass point of the distribution). Whereas the Budget-Ratio policy achieves bounded regret, the regret of the Adaptive-Index policy grows with the problem size $n$.]

Proof of Theorem 3
Given the deviation process $Y_u = K_{\tau_0+u} - T_{j(\tau_0)}(n-\tau_0-u)$, $u \in \{0, 1, \ldots, n-\tau_0\}$, the key step in the proof of Theorem 3 is the derivation of an exponential tail bound for the random variable $|Y_u|$ for each $u \in \{0, 1, \ldots, n-\tau_0\}$. As before, we take $\epsilon = \tfrac{1}{2}\min\{f_m, f_{m-1}, \ldots, f_1\}$.
Proposition 4 (Exponential tail bound). Fix $0 < \delta < \epsilon$, $c = e^{2} - 3$ and $0 < \eta < (\epsilon - \delta)/c$. Then there is a constant $M = M(\epsilon)$ such that, for all $0 \le u \le n - \tau_0$, we have the exponential tail bound [...]

Proof. If $\tau_0 = n - 2\delta^{-1} - 1$, then the statement is trivial. Otherwise, for $\tau_0 < n - 2\delta^{-1} - 1$ the proof is an application of the tail bound of Hajek (1982), using the mean-reversal property of the BR policy, to the two processes $\{Y_u : 0 \le u \le n - \tau_0\}$ and $\{-Y_u : 0 \le u \le n - \tau_0\}$.
The choice of $\eta$ in (31) tells us that $\epsilon - \eta c - \delta > 0$, so we can drop the second term in the first exponent on the right-hand side. By setting $M(\epsilon) = \max\{1, \frac{e^{2}}{1-\rho}\}$, we then have [...] The analysis of the sequence $\{-Y_u : 0 \le u \le n - \tau_0\}$ follows a similar logic. For any $a \ge 0$ [...] $\mathbb{E}[Y_{u+1} - Y_u \mid \widehat{\mathcal{F}}_u]\,\mathbb{1}(Y_u < -a, K_{\tau_0+u} = 0)$, so that by (20) and (21) we have [...] for all $a \ge 0$. On the event $\{Y_u < -a, K_{\tau_0+u} = 0\}$, we must have that $j(\tau_0) > 1$. Otherwise, if $j(\tau_0) = 1$, $T_{j(\tau_0)} = 0$ and $Y_u = K_{\tau_0+u} = 0 \ge -a$ by definition. In particular, $T_{j(\tau_0)} \ge \epsilon$ on this event, and it follows that [...] for all $a \ge 0$. It is then easily verified that [...] As before, Hajek (1982, Lemma 2.1 and Theorem 2.3) gives, with $c = e^{2} - 3$, $0 < \eta \le (\epsilon - \delta)/c$, and $\rho = 1 - \eta(\epsilon - \eta c)$, that [...] The statement of the proposition is now the combination of (32) and (33).

The exponential tail bound in Proposition 4 goes a long way toward the proof of Theorem 3, which follows next.
We will represent $\mathbb{E}[\tau]$ as a sum of the tail probabilities which, by Markov's inequality, satisfy the bounds [...] By integrating the exponential tail bound in Proposition 4 for $0 \le u \le t - \tau_0$ (recall that $\tau_0 < t \le n - 2\delta^{-1} - 1 < n$), we obtain [...] for the constant $M = M(\epsilon)$ in that proposition. Since $\tau > \tau_0$ by definition, $\mathbb{P}(\tau > t \mid \widehat{\mathcal{F}}_0) = 1$ for all $t \le \tau_0$ and we also have that [...]
The middle summand on the right-hand side is uniformly bounded, so in summary we have a constant $M = M(\epsilon) < \infty$ such that $\mathbb{E}[\tau \mid \widehat{\mathcal{F}}_0] \ge n - M$, and the proof of the theorem follows after one takes total expectations.

6. Concluding remarks
We have proved that in the multi-secretary problem with independent candidate abilities drawn from a common finite-support distribution, the regret is bounded by a constant, and that this bound is achieved by a multi-threshold policy. In our model, the decision maker knows and makes crucial use of the distribution of candidate abilities. Two obvious extensions to consider are the problem instances in which the ability distribution is continuous and/or unknown to the decision maker.
While one would like to think of the continuous distribution as a "limit" of discrete ones, our analysis does build to a great extent on this discreteness, and our bounds depend on the cardinality of the support. At this point, it is not clear if bounded regret is achievable also with continuous distributions.
For the case of unknown distribution, we conjecture that, with a finite support, the regret should be logarithmic in $n$. Indeed, consider a "stupid" algorithm that uses the first $O(\log(n))$ steps to learn about the distribution (without concern for the objective) and, at the end of the learning period, computes the thresholds and runs with the Budget-Ratio policy thereafter. Simple Chernoff bounds suggest that the likelihood of mis-estimation should be exponentially small. Coupling this with the fact that the performance of the BR policy (specifically the fact that $\mathbb{E}[\tau] \ge n - M$) is insensitive to small perturbations to the thresholds leads to our conjecture.

Proof of Lemma 1. The equality in (7) is obvious, and we focus on proving the inequality. Since $\mathbb{E}[X_n] = pn$ and $(p + \varepsilon)n \le k$, for any $u > 0$ we have the tail bound $\mathbb{P}((X_n - k)_+ \ge u) = \mathbb{P}(X_n - pn \ge k - pn + u) \le \mathbb{P}(X_n - pn \ge \varepsilon n + u)$, and Hoeffding's inequality (see, e.g., Boucheron et al. 2013, Theorem 2.8) [...]
By integrating both sides for $u \in [0, \infty)$ we then obtain for all $(p + \varepsilon)n \le k$ that $\mathbb{E}[(X_n - k)_+] \le \frac{1}{4\varepsilon}$. To prove the second bound in (7), one applies the first bound to the binomial random variable $Y_n = n - X_n$ and the budget $k' = n - k$.

Proof of Lemma 2. The index policy $\mathrm{id} = \{p_{j,t} : j \in [m] \text{ and } t \in [n]\}$ defined by (12) is such that all values greater than or equal to $a_{j_{\mathrm{id}}-1}$ are selected together with a fraction of the $a_{j_{\mathrm{id}}}$ values.
Formally, by Wald's lemma, we have that [...] Recall now that for any policy $\pi$ we have the inequalities $V^{\pi}_{\mathrm{on}}(n,k) \le V^{*}_{\mathrm{off}}(n,k) \le \mathrm{DR}(n,k) = \sum_{j=1}^{j_{\mathrm{id}}-1} a_j f_j n + a_{j_{\mathrm{id}}}(k - n\bar F(a_{j_{\mathrm{id}}}))$,

so that
$V^{*}_{\mathrm{off}}(n,k) - V^{\mathrm{id}}_{\mathrm{on}}(n,k) \le \mathrm{DR}(n,k) - V^{\mathrm{id}}_{\mathrm{on}}(n,k) \le a_1 \mathbb{E}[n - \nu]$, and to bound the regret as desired it suffices to obtain an upper bound for $\mathbb{E}[n - \nu]$.
Let $\{B_t : t \in [n]\}$ be i.i.d. Bernoulli random variables with success probability $q$ [...] If $N_r = \sum_{s \in [r]} (B_s - q)$ is the centered number of candidates the index policy selects by time $r$, we then have for $t \ge 0$ and $q = k/n$ that $n - \nu \ge t$ if and only if $N_{n - \lceil t \rceil} \ge k - q(n - \lceil t \rceil) = q\lceil t \rceil$.
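The quantity $\mathbb{E}[n - \nu]$ can be computed exactly for small instances. The sketch below rests on an assumption: $\nu$ is taken to be the first period by which the Bernoulli($q$) selections reach the budget $k$, capped at $n$. Under that reading, $n - \nu = \sum_{r=1}^{n-1} \mathbb{1}(\nu \le r)$, and $\nu \le r$ exactly when a Binomial($r$, $q$) count is at least $k$. The function name `expected_idle_time` is hypothetical.

```python
from math import comb, sqrt


def expected_idle_time(n, k):
    """E[n - nu] for Bernoulli(q) selections with q = k / n and 1 <= k <= n."""
    q = k / n
    total = 0.0
    for r in range(k, n):  # P(nu <= r) = P(Binomial(r, q) >= k), which is zero for r < k
        total += sum(comb(r, s) * q**s * (1 - q)**(r - s) for s in range(k, r + 1))
    return total


n, k = 100, 50
idle = expected_idle_time(n, k)
# Compare with the order-sqrt(n) bound q^{-1} sqrt(n) derived in this proof.
assert idle <= sqrt(n) / (k / n)
```

The exact sum is practical only for moderate `n`, since the binomial coefficients grow quickly.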
In turn, Kolmogorov's maximal inequality (see, e.g., Billingsley 1995, Theorem 22.4) tells us that for any $t > 0$ [...] It then follows that [...] so that $\mathbb{E}[n - \nu] \le q^{-1}\sqrt{n} \le \varepsilon^{-1}\sqrt{n}$ for any $\varepsilon \in (0,1)$ and all pairs $(n,k)$ such that $\varepsilon \le k/n$.

Proof of Lemma 3. The non-adaptive policy $\widehat{\pi}$ that takes all values $a_1$ and rejects all others achieves bounded regret. To see this, notice that $S^{n}_j = \min\{Z^{n}_j, (k - \sum_{i \in [j-1]} Z^{n}_i)_+\} \le (k - Z^{n}_1)_+$ for all $j \ge 2$.
Since the random variable $Z^{n}_1$ is Binomial with parameters $n$ and $f_1$, and $0 \le k \le n(f_1 - \epsilon)$, Lemma 1 implies that $\mathbb{E}[S^{n}_j] \le \mathbb{E}[(k - Z^{n}_1)_+] \le \frac{1}{4\epsilon}$ for all $j \ge 2$.
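The bound from Lemma 1 used here can be checked numerically. The sketch below computes $\mathbb{E}[(k - Z)_+]$ exactly from the binomial probability mass function and compares it with $1/(4\epsilon)$; the function name `expected_shortfall` and the parameter values are hypothetical choices for illustration.

```python
from math import comb


def expected_shortfall(n, p, k):
    """E[(k - Z)_+] for Z ~ Binomial(n, p), computed exactly from the pmf."""
    return sum((k - z) * comb(n, z) * p**z * (1 - p)**(n - z) for z in range(k))


p, eps = 0.5, 0.1
for n in (10, 50, 200):
    k = int(n * (p - eps))          # any budget with k <= n (p - eps)
    assert expected_shortfall(n, p, k) <= 1 / (4 * eps)
```

The bound does not depend on `n`, which is what makes the resulting regret estimate uniform in the problem size.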
In turn, the value of the offline solution for $k \le n(f_1 - \epsilon)$ satisfies the bound $V^{*}_{\mathrm{off}}(n,k) \le a_1 \mathbb{E}[S^{n}_1] + a_2 \frac{m-1}{4\epsilon} = a_1 \mathbb{E}[\min\{Z^{n}_1, k\}] + a_2 \frac{m-1}{4\epsilon}$.
The non-adaptive policy $\widehat{\pi}$ takes all $a_1$ values and none of the others, until it runs out of budget at time $\nu = \min\{r \ge 1 : \sum_{t \in [r]} B_t \ge k \text{ or } r \ge n\}$, so that $\mathbb{E}[S^{\widehat{\pi},n}_j] = 0$ for all $j \ge 2$ and $\mathbb{E}[S^{\widehat{\pi},n}_1] = \mathbb{E}[\sum_{t \in [\nu]} B_t \mathbb{1}(X_t = a_1)]$.
Furthermore, we have that $S^{\widehat{\pi},n}_1 = Z^{n}_1$ if $Z^{n}_1 < k$, and it equals $k$ otherwise. Thus, $\mathbb{E}[S^{\widehat{\pi},n}_1] = \mathbb{E}[Z^{\nu}_1] = \mathbb{E}[\min\{Z^{n}_1, k\}] = \mathbb{E}[S^{n}_1]$, so we finally have the bound $V^{*}_{\mathrm{off}}(n,k) - a_2 \frac{m-1}{4\epsilon} \le a_1 \mathbb{E}[\min\{Z^{n}_1, k\}] = V^{\widehat{\pi}}_{\mathrm{on}}(n,k)$, just as needed.

Proof of Lemma 4. Let $Z$ denote a normal random variable with mean zero and variance 1.
A version of Stein's lemma (see, e.g., Ross 2011, Theorem 3.6) for the sum of independent (not necessarily identically distributed) random variables tells us that [...] so by the homogeneity of the distance function $d_W$, we also have that [...] The inequality (36) then implies that $\varsigma_n \mathbb{E}[(Z - \Upsilon)_+] - (2 + 3\sqrt{2}) \le \mathbb{E}[(N_n - \Upsilon\varsigma_n)_+]$, and an immediate lower bound for the left-hand side is given by $\varsigma_n \Upsilon \mathbb{P}(Z \ge 2\Upsilon) \le \varsigma_n \mathbb{E}[(Z - \Upsilon)\mathbb{1}(Z \ge 2\Upsilon)] \le \varsigma_n \mathbb{E}[(Z - \Upsilon)_+]$.
The left inequality in (13) then follows by setting $\beta_1 = \Upsilon \mathbb{P}(Z \ge 2\Upsilon)$. For the right inequality we combine the earlier argument with the symmetry of the normal distribution and the Lipschitz-1 continuity of the map $x \mapsto (-x - \Upsilon\varsigma_n)_+$.
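The argument for the second inequality in (14) is standard. Since $\mathbb{E}[N_n] = 0$ and $\mathbb{E}[N_n^2] = \varsigma_n^2$, we have that $\mathbb{E}[(N_n + \Upsilon\varsigma_n)_+^2] \le \mathbb{E}[(N_n + \Upsilon\varsigma_n)^2] = \varsigma_n^2 + \Upsilon^2\varsigma_n^2$, so taking $\beta_2 = 1 + \Upsilon^2$ concludes the proof.

Proof of Lemma 5. Given a non-adaptive policy $\pi = \{p_{j,t} : j \in [m] \text{ and } t \in [n]\}$ and a constant $M < \infty$ we let $\widetilde{M} = \frac{M}{\min_{j \in [m-1]} |a_j - a_{j+1}|}$ and $\iota = \arg\max_{j \in [m]} |s_j(\pi) - s^{*}_j|$, and we show that if [...] then $\mathrm{DR}(n,k) - V^{\pi}_{\mathrm{on}}(n,k) > M\sqrt{n}$.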
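We let $j_{\mathrm{id}} \in [m]$ be the index such that $\bar F(a_{j_{\mathrm{id}}}) \le k/n < \bar F(a_{j_{\mathrm{id}}+1})$. There are two cases to consider: (i) (38) is attained at $\iota \ge j_{\mathrm{id}}+1$ ($k/n < \bar F(a_{j_{\mathrm{id}}+1}) \le \bar F(a_\iota)$), in which case $0 = s^{*}_\iota < s_\iota(\pi)$, and (ii) $\iota \le j_{\mathrm{id}}$ ($\bar F(a_\iota) \le \bar F(a_{j_{\mathrm{id}}}) \le k/n$), in which case $0 < s^{*}_\iota$ and $s_\iota(\pi) < s^{*}_\iota$.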
We begin with the first case. That is, we assume that (38) is attained for some $\iota \ge j_{\mathrm{id}}+1$ when $s^{*}_\iota = 0$, and we obtain that $s_\iota(\pi) \ge s^{*}_\iota + \widetilde{M}\sqrt{n} = \widetilde{M}\sqrt{n}$. To estimate the gap between $\mathrm{DR}(n,k)$ and $V^{\pi}_{\mathrm{on}}(n,k)$, consider a version of the deterministic relaxation (10) that requires the selection of at least $\widetilde{M}\sqrt{n}$ candidates with ability $a_\iota$ out of the $f_\iota n$ available. Since $\widetilde{M} \le 2\epsilon\sqrt{n}$ we have that $\widetilde{M}\sqrt{n} \le f_j n$ for all $j \in [m]$, so we write the optimization problem as [...] and we obtain that $V^{\pi}_{\mathrm{on}}(n,k) \le \mathrm{DRC}(n,k,\iota) \le \mathrm{DR}(n,k)$.
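The unique maximizer $(\check{s}_1, \ldots, \check{s}_m)$ of $\mathrm{DRC}(n,k,\iota)$ is given by [...] so that the difference between the value of the deterministic relaxation and the value of its constrained version is given by [...] Because $j_{\mathrm{id}}-1 < j_{\mathrm{id}} \le \iota-1 < \iota$, the monotonicity $a_\iota < a_{\iota-1} \le a_{j_{\mathrm{id}}} < a_{j_{\mathrm{id}}-1}$ gives us the lower bound $\mathrm{DR}(n,k) - \mathrm{DRC}(n,k,\iota) \ge a_{j_{\mathrm{id}}-1}\widetilde{M}\sqrt{n} - a_\iota\widetilde{M}\sqrt{n} > [a_{\iota-1} - a_\iota]\widetilde{M}\sqrt{n}$, and since $V^{\pi}_{\mathrm{on}}(n,k) \le \mathrm{DRC}(n,k,\iota)$ we have $\mathrm{DR}(n,k) - V^{\pi}_{\mathrm{on}}(n,k) > [a_{\iota-1} - a_\iota]\widetilde{M}\sqrt{n}$ for all $\iota \ge j_{\mathrm{id}}+1$.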
A similar inequality can be obtained for the second case, in which (38) is attained at some index $\iota \le j_{\mathrm{id}}$. In this case we would consider a version of the deterministic relaxation (10) with the additional constraint $s_\iota \le n f_\iota - \widetilde{M}\sqrt{n}$ or $s_{j_{\mathrm{id}}} \le k - n\bar F(a_{j_{\mathrm{id}}}) - \widetilde{M}\sqrt{n}$. This analysis then implies the bound $\mathrm{DR}(n,k) - V^{\pi}_{\mathrm{on}}(n,k) > [a_\iota - a_{\iota+1}]\widetilde{M}\sqrt{n}$ for all $\iota \le j_{\mathrm{id}}$, so if we recall (41) and use the definition of $\widetilde{M}$ in (37) we finally obtain that $\mathrm{DR}(n,k) - V^{\pi}_{\mathrm{on}}(n,k) > M\sqrt{n}$, concluding the proof of the lemma.

Proof of Lemma 6. Fix any non-adaptive policy $\pi \in \Pi_{\mathrm{na}}$ and recall that $S^{\pi,\nu}_1$ is the number of $a_1$-candidates that the policy selects. If $s_j \le k$, then $V^{\pi}_{\mathrm{on}}(n,k) \le a_1 \mathbb{E}[S^{\pi,\nu}_1] + \mathbb{E}[\varphi_{-1}(Z^{n}_2, \ldots, Z^{n}_m, k - S^{\pi,\nu}_1)]$.
We now note that $V^{*}_{\mathrm{off}}(n,k) = a_1 \mathbb{E}[S^{n}_1] + \mathbb{E}[\varphi_{-1}(Z^{n}_2, \ldots, Z^{n}_m, k - S^{n}_1)]$, so when we use this last decomposition in the displayed equation above we conclude that $V^{\pi}_{\mathrm{on}}(n,k) \le V^{*}_{\mathrm{off}}(n,k) - (a_1 - a_2)(\mathbb{E}[S^{n}_1] - \mathbb{E}[S^{\pi,\nu}_1])$.
Let $M = M(\epsilon, a_1, a_2)$ be such that the index policy achieves an $(a_1 - a_2)M\sqrt{n}$ regret (see Lemma 2 with $\varepsilon = f_1 + \epsilon$). If $\pi$ is such that $\mathbb{E}[S^{\pi,\nu}_1] < \mathbb{E}[S^{n}_1] - M\sqrt{n}$, then $V^{\pi}_{\mathrm{on}}(n,k) \le V^{*}_{\mathrm{off}}(n,k) - (a_1 - a_2)M\sqrt{n}$, so that $\pi$ cannot be optimal. In other words, the optimal policy must satisfy $\mathbb{E}[S^{\pi,\nu}_1] \ge \mathbb{E}[S^{n}_1] - M\sqrt{n}$.
Since $p_{1,t} f_1 \le q_t$ for all $t \in [n]$, it follows that there is another constant $M = M(\epsilon, a_1, a_2)$ such that [...] concluding the proof of the left inequality of (15).
The decomposition $V^{*}_{\mathrm{off}}(n,k) = \mathbb{E}[\varphi_{-m}(Z^{n}_1, \ldots, Z^{n}_{m-1}, k)] + a_m \mathbb{E}[S^{n}_m]$ then implies that $V^{\pi}_{\mathrm{on}}(n,k) \le V^{*}_{\mathrm{off}}(n,k) - a_m(\mathbb{E}[S^{n}_m] - \mathbb{E}[S^{\pi,\nu}_m])$.
Here we have that $k \le (1 - f_m - \epsilon)n$ so Lemma 1 tells us that $\mathbb{E}[S^{n}_m] \le (4\epsilon)^{-1}$, and if $M = M(\epsilon, a_m)$ is the constant such that the index policy achieves $a_m M\sqrt{n}$ regret, then we see that policy $\pi$ cannot be optimal if $\mathbb{E}[S^{\pi,\nu}_m] > M\sqrt{n}$. Since $(1 - p_{m,t}) f_m \le 1 - q_t$, this observation then gives us another constant $M = M(\epsilon, a_m)$ such that [...] which completes the proof. $\square$

Proof of Proposition 5. This is an induction proof. For $\ell = 1$, we have by (44) that $v_1(w, 0) = w$ and $v_1(w, k) = w + \sum_{j \in [m]} a_j f_j = w + \mathbb{E}[X_1]$ for all $k \ge 1$.
Taking $g_1(k) = \begin{cases} 0 & \text{if } k = 0 \\ \mathbb{E}[X_1] & \text{if } k \ge 1, \end{cases}$ one has that $v_1(w, k) = w + g_1(k)$ for all $w \ge 0$ and all $k \in \mathbb{Z}_+$.
Next, as induction hypothesis suppose that we have the decomposition $v_{\ell-1}(w, k) = w + g_{\ell-1}(k)$ for all $w \ge 0$ and all $k \in \mathbb{Z}_+$.
Defining recursively $g_\ell(k) = \sum_{j \in [m]} \max\{a_j + g_{\ell-1}(k-1), g_{\ell-1}(k)\} f_j$ for all $k \ge 1$, we then have together with (48) that (47) holds for all $w \ge 0$ and all $k \in \mathbb{Z}_+$. From this argument it also follows that $g_\ell(k)$ satisfies the recursion (45) with the boundary condition (46).
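The recursion for $g_\ell(k)$ translates directly into a small dynamic program. The sketch below is illustrative: the function name `online_values` and the two-point distribution in the usage note are hypothetical, and the boundary condition is assumed to be $g_\ell(0) = 0$ for all $\ell$, which agrees with $g_1(0) = 0$ above.

```python
def online_values(values, probs, horizon, budget):
    """g[l][k]: expected reward with l >= 1 periods to go and k selections left."""
    mean = sum(a * f for a, f in zip(values, probs))        # E[X_1]
    g = [[0.0] * (budget + 1) for _ in range(horizon + 1)]  # g[l][0] = 0 (assumed boundary)
    for k in range(1, budget + 1):
        g[1][k] = mean                                      # g_1(k) = E[X_1] for k >= 1
    for l in range(2, horizon + 1):
        for k in range(1, budget + 1):
            # Select a_j if a_j + g_{l-1}(k-1) beats skipping, which yields g_{l-1}(k).
            g[l][k] = sum(max(a + g[l - 1][k - 1], g[l - 1][k]) * f
                          for a, f in zip(values, probs))
    return g
```

For values `[1.0, 0.5]` with probabilities `[0.5, 0.5]`, the table gives `g[1][1] = 0.75`, so with two periods to go and one unit of budget a value is worth taking only if it is at least $g_1(1) - g_1(0) = \mathbb{E}[X_1] = 0.75$.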
In particular, $0 \le Z^{n}_j - \min\{Z^{n}_j, (k - \sum_{i \in [j-1]} Z^{n}_i)_+\} \le (\sum_{i \in [j]} Z^{n}_i - k)_+$. Taking expectations on both sides, and recalling (49) for $j \le j_{\mathrm{id}}-1$, we have $0 \le s^{*}_j - \mathbb{E}[S^{n}_j] \le \mathbb{E}[(\sum_{i \in [j]} Z^{n}_i - k)_+]$.
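The pointwise inequality above holds for every realization of the counts, so it can be confirmed by enumeration on a small grid. In the sketch below, the function name `gap_bounds_hold` is hypothetical, and `z` plays the role of $(Z^{n}_1, \ldots, Z^{n}_m)$.

```python
from itertools import product


def gap_bounds_hold(z, k):
    """Check 0 <= Z_j - min{Z_j, (k - sum_{i<j} Z_i)_+} <= (sum_{i<=j} Z_i - k)_+ for every j."""
    prefix = 0                                    # running sum_{i<j} Z_i
    for zj in z:
        selected = min(zj, max(k - prefix, 0))    # offline number of a_j selections
        prefix += zj
        if not 0 <= zj - selected <= max(prefix - k, 0):
            return False
    return True


# Exhaustive check over small count vectors and budgets.
assert all(gap_bounds_hold(z, k)
           for z in product(range(5), repeat=3)
           for k in range(14))
```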
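For such $j \le j_{\mathrm{id}}-1$ the sum $\sum_{i \in [j]} Z^{n}_i$ is a Binomial random variable with $n$ trials and success probability $\bar F(a_{j+1}) \le \bar F(a_{j_{\mathrm{id}}})$, so because $(n,k) \in \mathcal{T}'$ and $(\bar F(a_{j_{\mathrm{id}}}) + \epsilon')n \le k$, we obtain from (7) that $s^{*}_j = \mathbb{E}[S^{n}_j] \pm \frac{1}{4\epsilon'}$ for all $j \le j_{\mathrm{id}}-1$. Similarly, [...] so we have the two inequalities [...] Here, the two sums $\sum_{i \in [j_{\mathrm{id}}-1]} Z^{n}_i$ and $\sum_{i \in [j_{\mathrm{id}}]} Z^{n}_i$ are again Binomial random variables with $n$ trials and success probabilities given, respectively, by $\bar F(a_{j_{\mathrm{id}}})$ and $\bar F(a_{j_{\mathrm{id}}+1})$. Taking expectations and using $\bar F(a_{j_{\mathrm{id}}}) + \epsilon' \le k/n \le \bar F(a_{j_{\mathrm{id}}+1}) - \epsilon'$, Lemma 1 guarantees that $-\frac{1}{4\epsilon'} \le k - \sum_{i \in [j_{\mathrm{id}}-1]} \mathbb{E}[Z^{n}_i] - \mathbb{E}[S^{n}_{j_{\mathrm{id}}}] \le \frac{1}{4\epsilon'}$.
In turn, the representation (49) for $s^{*}_{j_{\mathrm{id}}}$ implies that $s^{*}_{j_{\mathrm{id}}} = \mathbb{E}[S^{n}_{j_{\mathrm{id}}}] \pm \frac{1}{4\epsilon'}$.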

then give us that
$\mathrm{DR}(n,k) - V^{*}_{\mathrm{off}}(n,k) \le a_1 \frac{m}{4\epsilon'}$ for all $(n,k) \in \mathcal{T}'$, just as needed to complete the proof of the proposition. $\square$