Universality of Power-of-$d$ Load Balancing in Many-Server Systems

We consider a system of $N$ parallel single-server queues with unit exponential service rates and a single dispatcher where tasks arrive as a Poisson process of rate $\lambda(N)$. When a task arrives, the dispatcher assigns it to a server with the shortest queue among $d(N)$ randomly selected servers ($1 \leq d(N) \leq N$). This load balancing strategy is referred to as a JSQ($d(N)$) scheme, marking that it subsumes the celebrated Join-the-Shortest Queue (JSQ) policy as a crucial special case for $d(N) = N$. We construct a stochastic coupling to bound the difference in the queue length processes between the JSQ policy and a scheme with an arbitrary value of $d(N)$. We use the coupling to derive the fluid limit in the regime where $\lambda(N) / N \to \lambda<1$ as $N \to \infty$ with $d(N) \to\infty$, along with the associated fixed point. The fluid limit turns out not to depend on the exact growth rate of $d(N)$, and in particular coincides with that for the JSQ policy. We further leverage the coupling to establish that the diffusion limit in the critical regime where $(N - \lambda(N)) / \sqrt{N} \to \beta>0$ as $N \to \infty$ with $d(N)/(\sqrt{N} \log (N))\to\infty$ corresponds to that for the JSQ policy. These results indicate that the optimality of the JSQ policy can be preserved at the fluid-level and diffusion-level while reducing the overhead by nearly a factor O($N$) and O($\sqrt{N}/\log(N)$), respectively.


Introduction
In this paper we establish a universality property for a broad class of randomized load balancing schemes in many-server systems. While the specific features of load balancing policies may differ considerably, the principal purpose is to distribute service requests or tasks among servers or distributed resources in parallel-processing systems. Well-designed load balancing schemes provide an effective mechanism for improving relevant performance metrics experienced by users while achieving high resource utilization levels. The analysis and design of load balancing schemes has attracted strong renewed interest in the last several years, mainly motivated by significant challenges involved in assigning tasks (e.g. file transfers, compute jobs, database look-ups) to servers in large-scale data centers.
In the present paper we focus on a basic scenario of sending tasks from a single dispatcher to N parallel queues with identical servers, exponentially distributed service requirements, and a service discipline at each individual server that is oblivious to the actual service requirements (e.g. FCFS). In this canonical case, the so-called Join-the-Shortest-Queue (JSQ) policy has several strong optimality properties, and in particular minimizes the overall mean delay among the class of non-anticipating load balancing policies that do not have any advance knowledge of the service requirements [6,32,35]. (Relaxing any of the three above-mentioned assumptions tends to break the optimality properties of the JSQ policy, and renders the delay-minimizing policy quite complex or even counter-intuitive, see for instance [12,16,33].) In order to implement the JSQ policy, a dispatcher requires instantaneous knowledge of the queue lengths at all the servers, which may give rise to a substantial communication burden, and not be scalable in scenarios with large numbers of servers. The latter issue has motivated consideration of so-called JSQ(d) strategies, where the dispatcher assigns an incoming task to a server with the shortest queue among d servers selected uniformly at random. Mean-field limit theorems in Mitzenmacher [22] and Vvedenskaya et al. [31] indicate that even a value as small as d = 2 yields significant performance improvements in a many-server regime with N → ∞, in the sense that the tail of the queue length distribution at each individual server falls off much more rapidly compared to a strictly random assignment policy (d = 1). This is commonly referred to as the "power-of-two" effect. While these results were originally proved for exponential service requirement distributions, they have been extended to general service requirement distributions in Bramson et al. [3]. 
Analyses of several variants of this model can be found in [5,7,11,18,19].
The diversity parameter d thus induces a fundamental trade-off between the amount of communication overhead and the performance in terms of queue lengths and delays. Specifically, a strictly random assignment policy can be implemented with zero overhead, but for any positive load per server, the probability of non-zero wait and the mean waiting time do not fall to zero as N → ∞. In contrast, a nominal implementation of the JSQ policy (without maintaining state information at the dispatcher) involves O(N) overhead per task, but it can be shown that the probability of non-zero wait and the mean waiting time vanish as N → ∞ for any fixed subcritical load per server. Although JSQ(d) strategies with a fixed parameter d ≥ 2 yield significant performance improvements over purely random task assignment while reducing the communication overhead by a factor O(N) compared to the JSQ policy, the probability of non-zero wait and the mean waiting time do not vanish in the limit. In that sense a fixed value of d is not sufficient to achieve asymptotically optimal performance. This is also reflected by recent results of Gamarnik et al. [10] indicating that in the absence of any memory at the dispatcher the communication overhead per task must grow with N in order to allow a zero mean waiting time in the limit.
In order to gain further insight in the trade-off between performance and communication overhead as governed by the diversity parameter d, we also consider a regime where the number of servers N grows large, but allow the value of d to depend on N, and write d(N) to explicitly reflect that. For convenience, we assume a Poisson arrival process of rate λ(N) and unit-mean exponential service requirements.
We construct a stochastic coupling to bound the difference in the queue length processes between the ordinary JSQ policy and a scheme with an arbitrary value of d(N). We exploit the coupling to obtain the fluid limit in the subcritical regime where λ(N)/N → λ < 1 as N → ∞ with d(N) → ∞, along with the associated fixed point. As it turns out, the fluid limit does not depend on the exact growth rate of d(N), and in particular coincides with that for the JSQ policy. This implies that the overhead of the JSQ policy can be reduced by 'almost' a factor O(N) while maintaining fluid-level optimality. In case of batch arrivals fluid-level optimality can even be achieved with O(1) communication overhead per task.
We further consider the Halfin-Whitt heavy-traffic regime where (N − λ(N))/√N → β > 0 as N → ∞. Recent work of Eschenfeldt & Gamarnik [8] showed that the diffusion-scaled system occupancy state for the ordinary JSQ policy in this regime weakly converges to a two-dimensional reflected Ornstein-Uhlenbeck process. We leverage the above-mentioned coupling to prove that the diffusion limit in case d(N)/(√N log(N)) → ∞ as N → ∞ corresponds to that for the JSQ policy. This indicates that the overhead of the JSQ policy can 'almost' be reduced to O(√N log N) while retaining diffusion-level optimality. The above condition is in fact close to necessary, in the sense that the diffusion-level behavior of the scheme is sub-optimal if d(N)/(√N log N) → 0 as N → ∞. The above results mirror the fluid-level and diffusion-level optimality properties reported in the companion paper [23] for power-of-d(N) strategies in a scenario with N server pools, where each server pool is a collection of servers, each working at unit rate. The coupling developed in [23] has greater hold on the task completions, and provides absolute bounds on the difference of each component of the occupancy states. More specifically, in the infinite-server scenario the task completions depend on the total number of active tasks in the entire system, whereas in the single-server scenario they depend only on the number of non-idle servers. As a result, obtaining a stochastic coupling bound in this paper becomes analytically more challenging, and in contrast with the infinite-server scenario, involves the cumulative loss terms and tail sums of the occupancy states of the ordinary JSQ policy. This imposes the additional challenge of proving the $\ell_1$ convergence of the occupancy state process of the ordinary JSQ policy, as will be described in greater detail later. To the best of our knowledge, this is the first time the transient fluid limit of the ordinary JSQ policy is rigorously established.
The idea of using coupling to prove scaling limits of large-scale parallel-server systems was introduced by the authors in [24]. The coupling method there was much weaker and was useful only for systems starting from specific initial occupancy states and for the particular scaling regime considered in that paper. In contrast, in the current paper we need to develop a much stronger and wider coupling framework involving an intermediate class of schemes as described in Section 2.4 to establish the universality results. In addition, we consider arbitrary starting states and different scaling regimes. Remark 2.6 further discusses the novelty and importance of the current stochastic comparison framework.
The remainder of the paper is organized as follows. In Section 2 we present a detailed model description and state the main results, and in Section 3 we construct a coupling and establish the stochastic ordering relations. Sections 4 and 5 contain the proofs of the fluid and diffusion limit results, respectively. Finally, in Section 6 we make some concluding remarks and briefly comment on future research directions.
Figure 1: The occupancy state of the system. When the servers are arranged in nondecreasing order of their queue lengths, $Q_i$ represents the width of the $i$-th row.

Model description and notation
Consider a system with N parallel single-server queues with identical servers and a single dispatcher. Tasks with unit-mean exponential service requirements arrive at the dispatcher as a Poisson process of rate λ(N), and are instantaneously forwarded to one of the servers. Specifically, when a task arrives, the dispatcher assigns it to a server with the shortest queue among d(N) randomly selected servers (1 ≤ d(N) ≤ N). This load balancing strategy will be referred to as the JSQ(d(N)) scheme, marking that it subsumes the ordinary JSQ policy as a crucial special case for d(N) = N. The buffer capacity at each of the servers is b (possibly infinite), and when a task is assigned to a server with b pending tasks, it is permanently discarded. For any t ≥ 0, let $Q^{d(N)}(t) = (Q_1^{d(N)}(t), Q_2^{d(N)}(t), \ldots, Q_b^{d(N)}(t))$ denote the system occupancy state, where $Q_i^{d(N)}(t)$ is the number of servers under the JSQ(d(N)) scheme with a queue length of i or larger at time t, including the possible task in service, i = 1, . . . , b. Figure 1 provides a schematic diagram of the $Q_i$-values. Throughout we assume that at each arrival epoch the servers are ordered in nondecreasing order of their queue lengths (ties can be broken arbitrarily), and whenever we refer to some ordered server, it should be understood with respect to this prior ordering.
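The model just described can be sketched as a small simulation. The following is a minimal illustration rather than the construction used in the paper: it simulates only the embedded jump chain of the uniformized system (arrivals at total rate λN, one unit-rate service clock per server), samples the d servers without replacement, and takes a finite buffer b. The function name `simulate_jsq_d` and all parameter names are hypothetical.

```python
import random


def simulate_jsq_d(N, d, lam_per_server, b, steps, seed=0):
    """Simulate the jump chain of a JSQ(d) system and return its occupancy state.

    Assumptions of this sketch (not taken from the paper): the arrival rate is
    lam_per_server * N, servers are sampled without replacement, and the
    buffer size b is finite.
    """
    rng = random.Random(seed)
    queues = [0] * N          # queue length at each server, including the task in service
    overflow = 0              # number of discarded tasks (overflow events)
    # Uniformization: each event is an arrival with probability lam*N / (lam*N + N).
    p_arrival = lam_per_server / (lam_per_server + 1.0)
    for _ in range(steps):
        if rng.random() < p_arrival:
            sampled = rng.sample(range(N), d)
            target = min(sampled, key=lambda s: queues[s])  # shortest sampled queue
            if queues[target] >= b:
                overflow += 1                               # task is discarded
            else:
                queues[target] += 1
        else:
            s = rng.randrange(N)                            # service clock of server s rings
            if queues[s] > 0:
                queues[s] -= 1
    # Q[i-1] is the number of servers with queue length i or larger, i = 1..b.
    Q = [sum(1 for x in queues if x >= i) for i in range(1, b + 1)]
    return queues, Q, overflow
```

Setting `d = N` recovers the ordinary JSQ policy and `d = 1` gives purely random assignment, so the same routine can be used to compare the occupancy vectors across the whole range of d.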
We occasionally omit the superscript d(N), and replace it by N, to refer to the N-th system, when the value of d(N) is clear from the context. When a task is discarded, in case of a finite buffer size, we call it an overflow event, and we denote by $L^{d(N)}(t)$ the total number of overflow events under the JSQ(d(N)) scheme up to time t.
A sequence of random variables $\{X_N\}_{N \geq 1}$, for some function $f : \mathbb{R} \to \mathbb{R}_+$, is said to be $O_P(f(N))$ if the sequence of scaled random variables $\{X_N/f(N)\}_{N \geq 1}$ is tight, and is said to be $o_P(f(N))$ if $\{X_N/f(N)\}_{N \geq 1}$ converges to zero in probability. Boldfaced letters are used to denote vectors. We denote by $\ell_1$ the space of all summable sequences. For any set $K$, the closure is denoted by $\overline{K}$. We denote by $D_E[0, \infty)$ the set of all càdlàg (right-continuous with left limits) functions from $[0, \infty)$ to a complete separable metric space $E$, and by '$\xrightarrow{\mathcal{L}}$' convergence in distribution for real-valued random variables and with respect to the Skorohod $J_1$ topology for càdlàg processes.

Fluid-limit results
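In the fluid-level analysis, we consider the subcritical regime where λ(N)/N → λ < 1 as N → ∞. In order to state the results, we first introduce some useful notation. Denote the fluid-scaled system occupancy state by $q^{d(N)}(t) := Q^{d(N)}(t)/N$, i.e., $q_i^{d(N)}(t) = Q_i^{d(N)}(t)/N$ for $i = 1, \ldots, b$, and let
$$S = \Big\{ q \in [0,1]^b : q_i \leq q_{i-1} \text{ for all } i = 2, \ldots, b, \text{ and } \sum_{i=1}^{b} q_i < \infty \Big\}$$
be the set of all possible fluid-scaled occupancy states equipped with the $\ell_1$ topology. For any $q \in S$, denote $m(q) = \min\{i : q_{i+1} < 1\}$, with the convention that $q_{b+1} = 0$ if $b < \infty$. Note that $m(q) < \infty$, since $q \in \ell_1$. If $m(q) = 0$, then define $p_0(q) = 1$ and $p_i(q) = 0$ for all $i \geq 1$. If $m(q) > 0$, distinguish two cases, depending on whether the normalized arrival rate λ is larger than $1 - q_{m(q)+1}$ or not.
(i) If $\lambda < 1 - q_{m(q)+1}$, then define $p_{m(q)-1}(q) = 1$ and $p_i(q) = 0$ for all $i \neq m(q) - 1$.
(ii) If $\lambda > 1 - q_{m(q)+1}$, then define $p_{m(q)-1}(q) = (1 - q_{m(q)+1})/\lambda$, $p_{m(q)}(q) = 1 - p_{m(q)-1}(q)$, and $p_i(q) = 0$ for all $i \neq m(q) - 1, m(q)$.
Note that the assumption λ < 1 ensures that the latter case cannot occur when $m(q) = b < \infty$.
Theorem 2.1. (Universality of fluid limit for JSQ(d(N)) scheme) Assume $q^{d(N)}(0) \to q^\infty$ in $S$ and λ(N)/N → λ < 1 as N → ∞. For the JSQ(d(N)) scheme with d(N) → ∞, any subsequence of the sequence of processes $\{q^{d(N)}(t)\}_{t \geq 0}$ has a further subsequence that converges weakly with respect to the Skorohod $J_1$ topology to the limit $\{q(t)\}_{t \geq 0}$ satisfying the following system of integral equations
$$q_i(t) = q_i(0) + \lambda \int_0^t p_{i-1}(q(s))\,ds - \int_0^t \big(q_i(s) - q_{i+1}(s)\big)\,ds, \qquad i = 1, \ldots, b, \tag{2.1}$$
where the coefficients $p_i(\cdot)$ are as defined earlier.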
The above theorem shows that the fluid-level dynamics do not depend on the specific growth rate of d(N) as long as d(N) → ∞ as N → ∞. In particular, the JSQ(d(N)) scheme with d(N) → ∞ exhibits the same behavior as the ordinary JSQ policy in the limit, and thus achieves fluid-level optimality.
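To make the fluid dynamics concrete, the sketch below computes the assignment fractions and integrates the fluid equations with a plain forward Euler scheme. It rests on two assumptions that are not spelled out in the surviving text: that in the case λ > 1 − q_{m(q)+1} the fraction sent to servers with queue length m(q) − 1 equals (1 − q_{m(q)+1})/λ, and that the fluid limit solves dq_i/dt = λ p_{i−1}(q) − (q_i − q_{i+1}). The function names and the step size are hypothetical choices.

```python
def assignment_fractions(q, lam):
    """Return [p_0, ..., p_b] for the fluid state q = [q_1, ..., q_b], with 0 < lam < 1."""
    b = len(q)
    ext = [1.0] + list(q) + [0.0]          # ext[i] = q_i, with q_0 = 1 and q_{b+1} = 0
    m = next(i for i in range(b + 1) if ext[i + 1] < 1.0)   # m(q) = min{i : q_{i+1} < 1}
    p = [0.0] * (b + 1)
    if m == 0:
        p[0] = 1.0                          # idle servers exist: every task joins one
    else:
        freed = 1.0 - ext[m + 1]            # rate at which length-m(q) servers lose a task
        if lam <= freed:
            p[m - 1] = 1.0
        else:
            p[m - 1] = freed / lam          # assumed form of the split, see lead-in
            p[m] = 1.0 - p[m - 1]
    return p


def euler_fluid_path(q0, lam, horizon, dt=1e-3):
    """Forward-Euler approximation of the fluid state at time `horizon`."""
    q = list(q0)
    b = len(q)
    for _ in range(int(round(horizon / dt))):
        p = assignment_fractions(q, lam)
        nxt = q + [0.0]                     # nxt[b] plays the role of q_{b+1} = 0
        q = [min(1.0, max(0.0, q[j] + dt * (lam * p[j] - (nxt[j] - nxt[j + 1]))))
             for j in range(b)]
    return q
```

Starting from a state with q_1 < 1 and no longer queues, for instance `euler_fluid_path([0.2, 0.0, 0.0], 0.7, 20.0)`, the first coordinate relaxes towards λ while the others stay at zero, which matches the qualitative statement that in the limit essentially no server holds two or more tasks.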
The coefficient $p_i(q)$ represents the instantaneous fraction of incoming tasks assigned to servers with a queue length of exactly i in the fluid-level state $q \in S$. Assuming $m(q) < b$, a strictly positive fraction $1 - q_{m(q)+1}$ of the servers have a queue length of exactly $m(q)$. Since d(N) → ∞, the fraction of incoming tasks that get assigned to servers with a queue length of $m(q) + 1$ or larger is zero: $p_i(q) = 0$ for all $i = m(q) + 1, \ldots, b - 1$. Also, tasks at servers with a queue length of exactly i are completed at (normalized) rate $q_i - q_{i+1}$, which is zero for all $i = 0, \ldots, m(q) - 1$, and hence the fraction of incoming tasks that get assigned to servers with a queue length of $m(q) - 2$ or less is zero as well: $p_i(q) = 0$ for all $i = 0, \ldots, m(q) - 2$. This only leaves the fractions $p_{m(q)-1}(q)$ and $p_{m(q)}(q)$ to be determined. Now observe that the fraction of servers with a queue length of exactly $m(q) - 1$ is zero. If $m(q) = 0$, then clearly the incoming tasks will join an empty queue, and thus $p_{m(q)}(q) = 1$ and $p_i(q) = 0$ for all $i \neq m(q)$. Furthermore, if $m(q) \geq 1$, since tasks at servers with a queue length of exactly $m(q)$ are completed at (normalized) rate $1 - q_{m(q)+1} > 0$, incoming tasks can be assigned to servers with a queue length of exactly $m(q) - 1$ at that rate. We thus need to distinguish between two cases, depending on whether the normalized arrival rate λ is larger than $1 - q_{m(q)+1}$ or not. If $\lambda < 1 - q_{m(q)+1}$, then all the incoming tasks can be assigned to a server with a queue length of exactly $m(q) - 1$, so that $p_{m(q)-1}(q) = 1$ and $p_{m(q)}(q) = 0$. On the other hand, if $\lambda > 1 - q_{m(q)+1}$, then not all incoming tasks can be assigned to servers with a queue length of exactly $m(q) - 1$, and a positive fraction will be assigned to servers with a queue length of exactly $m(q)$:
$$p_{m(q)-1}(q) = \frac{1 - q_{m(q)+1}}{\lambda} < 1 \quad \text{and} \quad p_{m(q)}(q) = 1 - p_{m(q)-1}(q) > 0.$$
It is easily verified that the unique fixed point $q^\star = (q_1^\star, q_2^\star, \ldots, q_b^\star)$ of the system of differential equations in (2.1) is given by
$$q_i^\star = \begin{cases} \lambda, & i = 1, \\ 0, & i = 2, \ldots, b. \end{cases} \tag{2.2}$$
Note that the fixed point in (2.2) is consistent with the results in [22,31,36] for fixed d, where taking d → ∞ yields the same fixed point. However, the results in [22,31,36] for fixed d cannot be directly used to handle joint scalings, and do not yield the universality of the entire fluid-scaled sample path for arbitrary initial states as established in Theorem 2.1. The fixed point in (2.2), in conjunction with the interchange of limits result in Proposition 2.2 below, indicates that in stationarity the fraction of servers with a queue length of two or larger is negligible.
Proposition 2.2. (Interchange of limits) Let $\pi^{d(N)}$ be the stationary measure of the occupancy states of the N-th system. Then, for the JSQ(d(N)) scheme with d(N) → ∞, $\pi^{d(N)}$ converges weakly to $\pi^\star$ as N → ∞, where $\pi^\star = \delta_{q^\star}$ with $\delta_x$ being the Dirac measure concentrated upon $x$, and $q^\star$ as in (2.2).
The above proposition relies on tightness of $\{\pi^{d(N)}\}_{N \geq 1}$ and the global stability of the fixed point, and is proved in Subsection 4.3.
We now consider an extension of the model in which tasks arrive in batches. We assume that the batches arrive as a Poisson process with rate λ(N)/ℓ(N), and have fixed size ℓ(N) > 0, so that the effective total task arrival rate remains λ(N). We will show that even for arbitrarily slowly growing batch size, fluid-level optimality can be achieved with O(1) communication overhead per task. For that, we define the JSQ(d(N)) scheme adapted for batch arrivals. When a batch of size ℓ(N) arrives, the dispatcher samples d(N) ≥ ℓ(N) servers without replacement, and assigns the ℓ(N) tasks to the ℓ(N) servers with the smallest queue length among the sampled servers.
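The batch assignment rule can be written in a few lines. The sketch below is only an illustration of the rule as stated above, assuming an infinite buffer and one task per selected server; the function name `assign_batch` and its parameters are hypothetical.

```python
import random


def assign_batch(queues, batch_size, d, rng):
    """Assign one batch under the batch version of JSQ(d); mutates `queues`.

    Samples d servers without replacement and places one task on each of the
    `batch_size` sampled servers with the smallest queue lengths.
    """
    if not (batch_size <= d <= len(queues)):
        raise ValueError("need batch_size <= d <= number of servers")
    sampled = rng.sample(range(len(queues)), d)
    # Sort the sampled servers by queue length and keep the shortest ones.
    chosen = sorted(sampled, key=lambda s: queues[s])[:batch_size]
    for s in chosen:
        queues[s] += 1
    return chosen
```

The dispatcher polls d servers for a batch of `batch_size` tasks, so the communication overhead per task is `d / batch_size`, which stays bounded whenever d grows proportionally to the batch size.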
then the sequence of processes $\{q^{d(N)}(t)\}_{t \geq 0}$ converges weakly to the limit $\{q(t)\}_{t \geq 0}$ described by (2.3). The fluid limit in (2.3) agrees with the fluid limit of the JSQ(d(N)) scheme if the initial state is taken as in Theorem 2.3. Further observe that the fixed point also coincides with that of the JSQ policy, as given by (2.2). Also, for a fixed ε > 0, the communication overhead per task is on average given by $(1 - \lambda - \varepsilon)^{-1}$, which is O(1). Thus Theorem 2.3 ensures that in case of batch arrivals with growing batch size, fluid-level optimality can be achieved with O(1) communication overhead per task. The result for the fluid-level optimality in stationarity can also be obtained indirectly by exploiting the fluid-limit result in [36]. Specifically, it can be deduced from the result in [36] that for batch arrivals with growing batch size, the JSQ(d(N)) scheme with suitably growing d(N) yields the same fixed point of the fluid limit as described in (2.2).

Diffusion-limit results
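In the diffusion-limit analysis, we consider the Halfin-Whitt regime where $(N - \lambda(N))/\sqrt{N} \to \beta$ as $N \to \infty$ for some positive coefficient β > 0. In order to state the results, we first introduce some useful notation. Let $\bar{Q}^{d(N)}(t) = (\bar{Q}_1^{d(N)}(t), \bar{Q}_2^{d(N)}(t), \ldots, \bar{Q}_b^{d(N)}(t))$ be a properly centered and scaled version of the system occupancy state $Q^{d(N)}(t)$, with
$$\bar{Q}_1^{d(N)}(t) = -\frac{N - Q_1^{d(N)}(t)}{\sqrt{N}}, \qquad \bar{Q}_i^{d(N)}(t) = \frac{Q_i^{d(N)}(t)}{\sqrt{N}}, \quad i = 2, \ldots, b.$$
The reason why $Q_1^{d(N)}(t)$ is centered around $N$ while $Q_i^{d(N)}(t)$, $i = 2, \ldots, b$, are not, is because the fraction of servers with a queue length of exactly one tends to one, whereas the fraction of servers with a queue length of two or more tends to zero as N → ∞.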
for t ≥ 0, where W is the standard Brownian motion and $U_1$ is the unique nondecreasing nonnegative ...
Although (2.4) differs from the diffusion limit obtained for the fully pooled M/M/N model in the Halfin-Whitt regime [13,29,30], it shares similar favorable properties. Observe that $-\bar{Q}_1^{d(N)}(t) = (N - Q_1^{d(N)}(t))/\sqrt{N}$ is the scaled number of vacant servers. Thus, Theorem 2.4 shows that over any finite time horizon, there will be $O_P(\sqrt{N})$ servers with queue length zero and $O_P(\sqrt{N})$ servers with a queue length of two or larger, and hence all but $O_P(\sqrt{N})$ servers have a queue length of exactly one. This diffusion limit is proved in [8] for the ordinary JSQ policy, and its steady-state properties are studied in [1,2,4]. Our contribution is to construct a stochastic coupling and establish that, somewhat remarkably, the diffusion limit is the same for any JSQ(d(N)) scheme, as long as $d(N)/(\sqrt{N} \log(N)) \to \infty$. In particular, the JSQ(d(N)) scheme with $d(N)/(\sqrt{N} \log(N)) \to \infty$ exhibits the same behavior as the ordinary JSQ policy in the limit, and thus achieves diffusion-level optimality. This growth condition for d(N) is not only sufficient, but also nearly necessary, as indicated by the next theorem.
Theorem 2.5, in conjunction with Theorem 2.4, shows that $\sqrt{N} \log N$ is the minimal order of d(N) for the JSQ(d(N)) scheme to achieve diffusion-level optimality.
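The centering and scaling used in this subsection can be illustrated numerically. The sketch below assumes the convention that the first coordinate is minus the number of idle servers divided by √N and that the remaining coordinates are the raw counts divided by √N; both function names are hypothetical.

```python
import math


def halfin_whitt_rate(N, beta):
    """Arrival rate lambda(N) = N - beta * sqrt(N), so that (N - lambda(N)) / sqrt(N) = beta."""
    return N - beta * math.sqrt(N)


def diffusion_scaled(Q, N):
    """Center and scale an occupancy vector Q = [Q_1, ..., Q_b] of an N-server system.

    Assumed convention: the first coordinate is -(N - Q_1) / sqrt(N), i.e. minus
    the scaled number of idle servers; the others are Q_i / sqrt(N).
    """
    root = math.sqrt(N)
    return [-(N - Q[0]) / root] + [x / root for x in Q[1:]]
```

For example, with N = 10000 servers of which 100 are idle and 150 hold at least two tasks, `diffusion_scaled([9900, 150, 0], 10000)` gives first two coordinates −1.0 and 1.5, both of order one, in line with the O_P(√N) counts discussed above.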

Proof strategy
The idea behind the proofs of the asymptotic results for the JSQ(d(N)) scheme in Theorems 2.1 and 2.4 is to (i) prove the fluid limit and exploit the existing diffusion limit result for the ordinary JSQ policy, and then (ii) prove a universality result by establishing that the ordinary JSQ policy and the JSQ(d(N)) scheme coincide under some suitable conditions on d(N). For the ordinary JSQ policy the fluid limit in the subcritical regime is established in Subsection 4.1, and the diffusion limit in the Halfin-Whitt heavy-traffic regime in [8,Theorem 2]. A direct comparison between the JSQ(d(N)) scheme and the ordinary JSQ policy is not straightforward, which is why we introduce the CJSQ(n(N)) class of schemes as an intermediate scenario to establish the universality result.
Just like the JSQ(d(N)) scheme, the schemes in the class CJSQ(n(N)) may be thought of as "sloppy" versions of the JSQ policy, in the sense that tasks are not necessarily assigned to a server with the shortest queue length but to one of the n(N) + 1 lowest ordered servers, as graphically illustrated in Figure 2a. In particular, for n(N) = 0, the class only includes the ordinary JSQ policy. Note that the JSQ(d(N)) scheme is guaranteed to identify the lowest ordered server, but only among a randomly sampled subset of d(N) servers. In contrast, a scheme in the CJSQ(n(N)) class only guarantees that one of the n(N) + 1 lowest ordered servers is selected, but across the entire pool of N servers. We will show that for sufficiently small n(N), any scheme from the class CJSQ(n(N)) is still 'close' to the ordinary JSQ policy. We will further prove that for sufficiently large d(N) relative to n(N) we can construct a scheme called JSQ(n(N), d(N)), belonging to the CJSQ(n(N)) class, which differs 'negligibly' from the JSQ(d(N)) scheme. Therefore, for a 'suitable' choice of d(N) the idea is to produce a 'suitable' n(N). This proof strategy is schematically represented in Figure 2b.
Remark 2.6. As mentioned in the introduction, a coupling method was used in [24] to establish the diffusion limit of the Join-the-Idle Queue (JIQ) policy starting from specific initial occupancy states. Comparing the JIQ and JSQ policies in that scaling regime was much easier when viewed as follows: (i) If there is an idle server in the system, both JIQ and JSQ perform similarly. (ii) When there is no idle server and only O(√N) servers with queue length two, JSQ assigns the arriving task to a server with queue length one. In that case, since JIQ assigns at random, the probability that the task will land on a server with queue length two, and thus that JIQ acts differently than JSQ, is O(1/√N).
Since on any finite time interval the number of times an arrival finds all servers busy is at most O(√N), all the arrivals except O(1) of them are assigned in exactly the same manner in both JIQ and JSQ, which then leads to the same scaling limit for both policies. Note that in the computation of the expected number of events when JIQ and JSQ perform differently, both the specific initial state condition and the scaling regime were crucial. In the current paper the stochastic comparison framework is inherently different. Here the idea pivots on two key observations: (i) For any scheme, if each arrival is assigned to approximately the shortest queue, then the scheme can still retain its optimality on various scales, and (ii) For any two schemes, if on any finite time interval not too many arrivals are assigned to different ordered servers, then they can have the same scaling limits. The combination of the above two ideas provides a much wider coupling framework involving an intermediate class of schemes that enables us to consider arbitrary starting states and different scaling regimes. In addition, the consideration of arbitrary starting states will turn out to be crucial in order to extend the fluid-scale universality result to the steady state.
In the next section we construct a stochastic coupling called S-coupling, which will be the key vehicle in establishing the universality result mentioned above.

Remark 2.7.
Observe that sampling without replacement polls more distinct servers than sampling with replacement, and hence the minimum number of active tasks among the selected servers is stochastically smaller in the case without replacement. As a result, for sufficient conditions as in Theorems 2.1 and 2.4, it is enough to consider sampling with replacement. Also, for notational convenience, in the proof of the almost necessary condition stated in Theorem 2.5 we will assume sampling with replacement, although the proof technique and the result remain valid if the servers are chosen without replacement.
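The comparison between the two sampling modes can be checked with a short exact computation. The sketch below evaluates the probability that none of the d sampled servers is among the r lowest ordered servers, which is the event that JSQ(d) fails to select one of those r servers; taking r = n + 1 gives the probability of landing outside the servers allowed by the CJSQ(n) class. The function names are hypothetical, and the formulas are the elementary ones for uniform sampling.

```python
from math import comb


def miss_prob_with_replacement(N, d, r):
    """P(no draw hits the r lowest ordered servers), d independent uniform draws."""
    return ((N - r) / N) ** d


def miss_prob_without_replacement(N, d, r):
    """Same probability when the d servers are sampled without replacement."""
    # comb(N - r, d) is 0 when d > N - r, i.e. a hit is then certain.
    return comb(N - r, d) / comb(N, d)
```

Since each factor (N − r − j)/(N − j) in the product form of the second probability is at most (N − r)/N, sampling without replacement never misses more often, which is the stochastic comparison used in the remark.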

Coupling and Stochastic Ordering
In this section, we construct a coupling between any scheme from the class CJSQ(n(N)) and the ordinary JSQ policy, which ensures that for sufficiently small n(N), on any finite time interval, the two schemes differ negligibly. This plays an instrumental role in establishing the universality results in Theorems 2.1 and 2.4. All the statements in this section should be understood to apply to the N th system with N servers.

Stack formation and deterministic ordering
In order to prove the stochastic comparisons among the various schemes, as in [24], we describe the many-server system as an ensemble of stacks, in a way that two different ensembles can be ordered. In this formulation, at each step, items are added or removed according to some rule. From a high level, we then show that if two systems follow some specific rules, then at any step, the two ensembles maintain some kind of deterministic ordering. This deterministic ordering turns into an almost sure ordering in the next subsection, when we construct the S-coupling.
Each server along with its queue is thought of as a stack of items, and we always consider the stacks to be arranged in nondecreasing order of their heights. The ensemble of stacks then represents the empirical CDF of the queue length distribution, and the i th horizontal bar corresponds to Q Π i (for some task assignment scheme Π), as depicted in Figure 1. If an arriving item happens to land on a stack which already contains b items, then the item is discarded, and is added to a special stack L Π of discarded items, where it stays forever.
Any two ensembles A and B, each having N stacks and a maximum height b per stack, are said to follow Rule($n_A$, $n_B$, $k$) at some step, if either an item is removed from the $k$-th stack in both ensembles (if nonempty), or an item is added to the $n_A$-th stack in ensemble A and to the $n_B$-th stack in ensemble B.

Proposition 3.1.
For any two ensembles of stacks A and B, as described above, if at any step Rule($n_A$, $n_B$, $k$) is followed for some value of $n_A$, $n_B$, and $k$, with $n_A \leq n_B$, then the following ordering is always preserved: for all $m \leq b$,
$$\sum_{i=m}^{b} Q_i^A + L^A \leq \sum_{i=m}^{b} Q_i^B + L^B. \tag{3.1}$$
This proposition says that, while adding the items to the ordered stacks, if we ensure that in ensemble A the item is always placed to the left of that in ensemble B, and if the items are removed from the same ordered stack in both ensembles, then the aggregate size of the $b - m + 1$ highest horizontal bars as depicted in Figure 1 plus the cumulative number of discarded items is no larger in A than in B throughout.
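Proof of Proposition 3.1. We prove the ordering by forward induction on the time-steps, i.e., we assume that at some step the ordering holds, and show that in the next step it will be preserved. In ensemble Π, where Π = A, B, after applying Rule($n_A$, $n_B$, $k$), the updated lengths of the horizontal bars are denoted by $\tilde{Q}_i^\Pi$, $i \geq 1$. Also, define $I_\Pi(c) := \max\{i : Q_i^\Pi \geq N - c + 1\}$, $c = 1, \ldots, N$, with the convention that $Q_0^\Pi \equiv N$, so that $I_\Pi(c)$ is the height of the $c$-th ordered stack. Now if the rule prescribes removal of an item from the $k$-th stack, then the updated ensemble will have the values
$$\tilde{Q}_i^\Pi = \begin{cases} Q_i^\Pi - 1, & i = I_\Pi(k), \\ Q_i^\Pi, & \text{otherwise}, \end{cases} \tag{3.2}$$
if $I_\Pi(k) \geq 1$; otherwise all the $Q_i^\Pi$-values remain unchanged. On the other hand, if the rule produces the addition of an item to stack $n_\Pi$, then the values will be updated as
$$\tilde{Q}_i^\Pi = \begin{cases} Q_i^\Pi + 1, & i = I_\Pi(n_\Pi) + 1, \\ Q_i^\Pi, & \text{otherwise}, \end{cases} \tag{3.3}$$
if $I_\Pi(n_\Pi) < b$; otherwise the item is discarded and $L^\Pi$ increases by one.
Fix any $m \leq b$. Observe that in any event the $Q_i$-values change by at most one at any step, and hence it suffices to prove the preservation of the ordering in the case when (3.1) holds with equality:
$$\sum_{i=m}^{b} Q_i^A + L^A = \sum_{i=m}^{b} Q_i^B + L^B. \tag{3.4}$$
We distinguish between two cases depending on whether an item is removed or added. First suppose that the rule prescribes removal of an item from the $k$-th stack from both ensembles. Observe from (3.2) that the value of $\sum_{i=m}^{b} Q_i^\Pi + L^\Pi$ changes if and only if $I_\Pi(k) \geq m$. Also, since removal of an item can only decrease the sum, without loss of generality we may assume that $I_B(k) \geq m$, otherwise the right side of (3.4) remains unchanged, and the ordering is trivially preserved. From our initial hypothesis,
$$\sum_{i=m+1}^{b} Q_i^A + L^A \leq \sum_{i=m+1}^{b} Q_i^B + L^B.$$
Together with (3.4), this implies $Q_m^A \geq Q_m^B$. Also,
$$I_B(k) \geq m \implies Q_m^B \geq N - k + 1 \implies Q_m^A \geq N - k + 1 \implies I_A(k) \geq m.$$
Therefore the sum $\sum_{i=m}^{b} Q_i^A + L^A$ also decreases, and the ordering is preserved.
Now suppose that the rule prescribes addition of an item to the respective stacks in both ensembles. From (3.3) we get that after adding an item, the value of $\sum_{i=m}^{b} Q_i^\Pi + L^\Pi$ increases if and only if $I_\Pi(n_\Pi) \geq m - 1$. As in the previous case, we assume (3.4), and since adding an item can only increase the concerned sums, we assume that $I_A(n_A) \geq m - 1$, because otherwise the left side of (3.4) remains unchanged, and the ordering is trivially preserved. Now from our initial hypothesis we have
$$\sum_{i=m-1}^{b} Q_i^A + L^A \leq \sum_{i=m-1}^{b} Q_i^B + L^B,$$
which together with (3.4) gives $Q_{m-1}^A \leq Q_{m-1}^B$. Observe that, since $n_A \leq n_B$,
$$I_A(n_A) \geq m - 1 \implies Q_{m-1}^A \geq N - n_A + 1 \implies Q_{m-1}^B \geq N - n_A + 1 \geq N - n_B + 1 \implies I_B(n_B) \geq m - 1.$$
Hence, the value of $\sum_{i=m}^{b} Q_i^B + L^B$ also increases, and the ordering is preserved.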

Stochastic ordering
We now use the deterministic ordering established in Proposition 3.1 in conjunction with the S-coupling construction to prove a stochastic comparison between the JSQ(d(N)) scheme, a specific scheme from the class CJSQ(n(N)) and the ordinary JSQ policy. As described earlier, the class CJSQ(n(N)) contains all schemes that assign incoming tasks by some rule to any of the n(N) + 1 lowest ordered servers. Observe that when n(N) = 0, the class contains only the ordinary JSQ policy. Also, if $n^{(1)}(N) < n^{(2)}(N)$, then CJSQ($n^{(1)}(N)$) ⊂ CJSQ($n^{(2)}(N)$). Let MJSQ(n(N)) be a particular scheme that always assigns incoming tasks to precisely the (n(N) + 1)-th ordered server. Notice that this scheme is effectively the JSQ policy when the system always maintains n(N) idle servers, or equivalently, uses only N − n(N) servers, and MJSQ(n(N)) ∈ CJSQ(n(N)). For brevity, we suppress n(N) in the notation for the remainder of this subsection. We call any two systems S-coupled, if they have synchronized arrival clocks and departure clocks of the $k$-th longest queue, for $1 \leq k \leq N$ ('S' in the name of the coupling stands for 'Server'). Consider three S-coupled systems following respectively the JSQ policy, any scheme from the class CJSQ, and the MJSQ scheme. Recall that $Q_i^\Pi(t)$ is the number of servers with at least i tasks at time t and $L^\Pi(t)$ is the total number of lost tasks up to time t, for the schemes Π = JSQ, CJSQ, MJSQ. The following proposition provides a stochastic ordering for any scheme in the class CJSQ with respect to the ordinary JSQ policy and the MJSQ scheme.
provided the inequalities hold at time t = 0.
The above proposition has the following immediate corollary, which will be used to prove bounds on the fluid and the diffusion scale.

Corollary 3.3.
In the joint probability space constructed by the S-coupling of the three systems under respectively JSQ, MJSQ, and any scheme from the class CJSQ, the following ordering is preserved almost surely throughout the sample path: for any fixed m ≥ 1 provided the inequalities hold at time t = 0.
Proof of Proposition 3.2. We first S-couple the concerned systems. Let us say that an incoming task is assigned to the $n_\Pi$-th ordered server under scheme Π, Π = JSQ, CJSQ, MJSQ. Then observe that, under the S-coupling, almost surely, $n_{\mathrm{JSQ}} \leq n_{\mathrm{CJSQ}} \leq n_{\mathrm{MJSQ}}$. Therefore, Proposition 3.1 ensures that in the probability space constructed through the S-coupling, the ordering is preserved almost surely throughout the sample path.
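The stack argument can be exercised numerically. The sketch below S-couples three systems on the embedded jump chain: all three see the same arrival epochs and the same potential departure from the k-th ordered stack, while arrivals go to ordered rank 1 (JSQ), a uniformly chosen rank among the n + 1 lowest (one hypothetical member of the CJSQ class), and rank n + 1 (MJSQ). It then checks at every step that the sum of Q_i over i = m, ..., b plus the number of discarded items is ordered across the three systems for every m. All names and parameter values are illustrative assumptions.

```python
import random


def tail_sums(heights, lost, b):
    """Return [T_1, ..., T_b] with T_m = sum_{i=m}^b Q_i + L for stack heights <= b."""
    # A stack of height h contributes max(0, h - m + 1) to sum_{i=m}^b Q_i.
    return [sum(max(0, h - m + 1) for h in heights) + lost for m in range(1, b + 1)]


def add_item(heights, lost, rank, b):
    """Add an item to the rank-th ordered stack (1-indexed); discard it if the stack is full."""
    if heights[rank - 1] >= b:
        return lost + 1
    heights[rank - 1] += 1
    heights.sort()                      # keep stacks in nondecreasing order
    return lost


def remove_item(heights, rank):
    """Remove an item from the rank-th ordered stack if it is nonempty."""
    if heights[rank - 1] > 0:
        heights[rank - 1] -= 1
        heights.sort()


def simulate_s_coupled(N=20, b=3, n=4, lam=0.9, steps=5000, seed=1):
    rng = random.Random(seed)
    names = ("JSQ", "CJSQ", "MJSQ")
    heights = {s: [0] * N for s in names}
    lost = {s: 0 for s in names}
    ordered = True
    for _ in range(steps):
        if rng.random() < lam / (lam + 1.0):          # common arrival epoch
            ranks = {"JSQ": 1, "CJSQ": rng.randint(1, n + 1), "MJSQ": n + 1}
            for s in names:
                lost[s] = add_item(heights[s], lost[s], ranks[s], b)
        else:                                         # common departure clock of rank k
            k = rng.randint(1, N)
            for s in names:
                remove_item(heights[s], k)
        t = {s: tail_sums(heights[s], lost[s], b) for s in names}
        ordered = ordered and all(
            x <= y <= z for x, y, z in zip(t["JSQ"], t["CJSQ"], t["MJSQ"]))
    return ordered, lost
```

With all systems started empty, the returned flag should stay true for every seed, since the ordering is a sample-path property of the coupling and not a statistical one.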
represents the aggregate size of the rightmost k stacks, i.e., the k longest queues. Using this observation, the stochastic majorization property of the JSQ policy as stated in [26,27,28] can be shown following similar arguments as in the proof of Proposition 3.2. Conversely, the stochastic ordering between the JSQ policy and the MJSQ scheme presented in Proposition 3.2 can also be derived from the weak majorization arguments developed in [26,27,28]. But it is only through the stack arguments developed in the previous subsection that we could extend the results to compare any scheme from the class CJSQ with the scheme MJSQ as well as in Proposition 3.2 (ii).
To analyze the JSQ(d(N)) scheme, we need a further stochastic comparison argument. Consider two S-coupled systems following schemes $\Pi_1$ and $\Pi_2$. Fix a specific arrival epoch, and let the arriving task join the $n_{\Pi_i}$-th ordered server in the i-th system following scheme $\Pi_i$, i = 1, 2 (ties can be broken arbitrarily in both systems). We say that at a specific arrival epoch the two systems differ in decision if $n_{\Pi_1} \neq n_{\Pi_2}$, and denote by $\Delta_{\Pi_1,\Pi_2}(t)$ the cumulative number of times the two systems differ in decision up to time t.

Proposition 3.5. For two S-coupled systems under schemes $\Pi_1$ and $\Pi_2$ the following inequality is preserved almost surely:

$$\sum_{i=1}^{b} \big| Q^{\Pi_1}_i(t) - Q^{\Pi_2}_i(t) \big| \leq 2 \Delta_{\Pi_1,\Pi_2}(t) \quad \text{for all } t \geq 0, \tag{3.11}$$

provided the two systems start from the same occupancy state at t = 0, i.e., $Q^{\Pi_1}_i(0) = Q^{\Pi_2}_i(0)$ for all i = 1, 2, . . . , b.

Proof. We will again use forward induction on the event times of arrivals and departures. Let the inequality (3.11) hold at time epoch $t_0$, and let $t_1$ be the next event time. We distinguish between two cases, depending on whether $t_1$ is an arrival epoch or a departure epoch.
If t 1 is an arrival epoch and the systems differ in decision, then observe that the left side of (3.11) can only increase by two. In this case, the right side also increases by two, and the inequality is preserved. Therefore, it is enough to prove that the left side of (3.11) remains unchanged if the two systems do not differ in decision. In that case, assume that both Π 1 and Π 2 assign the arriving task to the k th ordered server. Recall from the proof of Proposition 3.1 the definition of I Π for some scheme Π. If I Π 1 (k) = I Π 2 (k), then the left side of (3.11) clearly remains unchanged. Now, without loss of generality, assume I Π 1 (k) < I Π 2 (k). Therefore, After an arrival, the (I Π 1 (k) + 1)-th term in the left side of (3.11) decreases by one, and the (I Π 2 (k) + 1)-th term may increases by at most one. Thus the inequality is preserved. If t 1 is a departure epoch, then due to the S-coupling, without loss of generality, assume that a potential departure occurs from the k th ordered server. Also note that a departure in either of the two systems can change at most one of the Q i -values. If at i and Q Π 2 i decrease by one, and hence the left side of (3.11) does not change. Otherwise, without loss of generality assume Furthermore, after the departure, Q Π 1 I Π 1 (k) may decrease at most by one. Therefore |Q Π 1 | may increase at most by one, and Q Π 2 I Π 2 (k) decreases by one, thus |Q Π 1 I Π 2 (k) − Q Π 2 I Π 2 (k) | decreases by one. Hence, in total, the left side of (3.11) either remains the same or decreases by one.
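The induction argument above can be checked numerically on a sample path. The following is a minimal sketch, not the paper's construction verbatim: it assumes infinite buffers, a uniformized event clock, JSQ(d) sampling with replacement, and an S-coupling implemented by applying each potential departure to the same ordered position in both systems. All function and parameter names are hypothetical.

```python
import random

def occupancy(queues, levels):
    """Return [Q_1, ..., Q_levels], where Q_i counts servers with at least i tasks."""
    return [sum(1 for x in queues if x >= i) for i in range(1, levels + 1)]

def min_slack_s_coupled(N=20, lam=0.8, d=2, events=5000, seed=1):
    """Simulate S-coupled JSQ and JSQ(d) systems.

    Returns (min over event times of 2*Delta - l1 distance, final Delta).
    A nonnegative first component means the bound held along the whole path.
    """
    rng = random.Random(seed)
    x = [0] * N   # JSQ system, queue lengths kept in ascending order
    y = [0] * N   # JSQ(d) system, queue lengths kept in ascending order
    delta = 0     # number of arrival epochs at which the systems differ in decision
    min_slack = 0
    for _ in range(events):
        if rng.random() < lam / (lam + 1.0):
            # Arrival: JSQ picks ordered position 0, JSQ(d) the best of d uniform draws.
            k1 = 0
            k2 = min(rng.randrange(N) for _ in range(d))
            delta += (k1 != k2)
            x[k1] += 1
            y[k2] += 1
        else:
            # Potential departure from the same ordered position in both systems.
            k = rng.randrange(N)
            if x[k] > 0:
                x[k] -= 1
            if y[k] > 0:
                y[k] -= 1
        x.sort()
        y.sort()
        levels = max(x[-1], y[-1], 1)
        dist = sum(abs(a - b) for a, b in zip(occupancy(x, levels), occupancy(y, levels)))
        min_slack = min(min_slack, 2 * delta - dist)
    return min_slack, delta
```

Since the slack equals zero at time 0, a returned minimum of zero indicates that the sum of absolute differences of the Q-values never exceeded twice the number of differing decisions on that path.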

Comparing the JSQ(d) and CJSQ(n) schemes
We will now introduce the JSQ(n, d) scheme with $n, d \leq N$, which is an intermediate blend between the CJSQ(n) schemes and the JSQ(d) scheme. The JSQ(n, d) scheme will be seen in a moment to be a scheme in the CJSQ(n) class, and also to approximate the JSQ(d) scheme closely. We now specify the JSQ(n, d) scheme. At its first step, just as in the JSQ(d) scheme, it selects the shortest of d random candidates, but it only sends the task to that server's queue if it is one of the n + 1 shortest queues. If it is not, then at the second step it picks one of the n + 1 shortest queues uniformly at random and sends the task to that server's queue. By construction, JSQ(n, d) is therefore a scheme in CJSQ(n).
We now consider two S-coupled systems with a JSQ(d) and a JSQ(n, d) scheme. Assume that at some specific arrival epoch, the incoming task is dispatched to the k th ordered server in the system under the JSQ(d) scheme. If k ∈ {1, 2, . . . , n + 1}, then the system under JSQ(n, d) scheme also assigns the arriving task to the k th ordered server. Otherwise, it dispatches the arriving task uniformly at random among the first (n + 1) ordered servers.
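The coupled dispatch rule described above can be sketched as follows. This is an illustration under stated assumptions: ordered positions run from 1 (shortest queue) to N, the d candidates are drawn uniformly with replacement, and the function names are hypothetical.

```python
import random

def jsq_d_index(N, d, rng):
    """Ordered position chosen by JSQ(d): the lowest of d uniform draws from 1..N."""
    return min(rng.randint(1, N) for _ in range(d))

def jsq_nd_index(k, n, rng):
    """Ordered position chosen by JSQ(n, d), given the coupled JSQ(d) choice k.

    If k is among the n + 1 lowest ordered servers, both schemes agree;
    otherwise JSQ(n, d) redraws uniformly among the n + 1 lowest ordered servers.
    """
    if k <= n + 1:
        return k
    return rng.randint(1, n + 1)

rng = random.Random(0)
N, n, d = 100, 9, 5
differ = 0
for _ in range(10000):
    k = jsq_d_index(N, d, rng)
    differ += (jsq_nd_index(k, n, rng) != k)
print("fraction of arrivals with differing decisions:", differ / 10000)
```

The two systems can differ in decision only when the JSQ(d) choice falls outside the n + 1 lowest ordered servers, which is the event bounded in the next proposition.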
In the next proposition we will bound the number of times these two systems differ in decision on any finite time interval. For any $T \geq 0$, let A(T) and ∆(T) be, respectively, the total number of arrivals to the system and the cumulative number of times that the JSQ(d) scheme and the JSQ(n, d) scheme differ in decision up to time T.
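Proof. Observe that at any arrival epoch, the systems under the JSQ(d) scheme and the JSQ(n, d) scheme will differ in decision only if none of the n lowest ordered servers gets selected by the JSQ(d) scheme. Now, at any arrival epoch, the probability that the JSQ(d) scheme does not select any of the n lowest ordered servers is given by

$$\Big(1 - \frac{n}{N}\Big)^{d}.$$

Since at each arrival epoch, d servers are selected independently, given A(T), the random variable ∆(T) is stochastically dominated by a binomial random variable with parameters A(T) and $(1 - n/N)^{d}$. Therefore, for $T \geq 0$, Markov's inequality yields, for any fixed M > 0,

$$P\big(\Delta(T) \geq M \mid A(T)\big) \leq \frac{A(T)}{M} \Big(1 - \frac{n}{N}\Big)^{d}.$$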
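As a numerical illustration of this proof, the sketch below evaluates the miss probability and the resulting Markov bound, and compares the former with a Monte Carlo estimate. It assumes that the d servers are drawn independently and uniformly with replacement, so that the miss probability equals (1 − n/N)^d; the function names are hypothetical.

```python
import random

def miss_probability(N, n, d):
    """Probability that d independent uniform draws all avoid the n lowest ordered servers."""
    return (1.0 - n / N) ** d

def markov_bound(arrivals, N, n, d, M):
    """Markov upper bound on P(Delta >= M) given the number of arrivals (capped at 1)."""
    return min(1.0, arrivals * miss_probability(N, n, d) / M)

def empirical_miss(N, n, d, trials=20000, seed=0):
    """Monte Carlo estimate of the miss probability; positions 0..n-1 are the n lowest."""
    rng = random.Random(seed)
    misses = sum(
        all(rng.randrange(N) >= n for _ in range(d))
        for _ in range(trials)
    )
    return misses / trials

print(miss_probability(100, 20, 5), empirical_miss(100, 20, 5))
print(markov_bound(arrivals=1000, N=100, n=50, d=20, M=1))
```

The bound is informative only when the product of the number of arrivals and the miss probability is small relative to M, which is what drives the growth conditions on d(N) used later.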

Fluid-Limit Proofs
In this section we prove the fluid-limit results for the JSQ(d(N)) scheme stated in Theorems 2.1 and 2.3. As mentioned in Subsection 2.4, the fluid limit for the ordinary JSQ policy is provided in Subsection 4.1, and in Subsection 4.2 we prove a universality result establishing that under the condition that d(N) → ∞ as N → ∞, the fluid limit for the JSQ(d(N)) scheme coincides with that for the ordinary JSQ policy.

Fluid limit of JSQ
In this section we establish the fluid limit for the ordinary JSQ policy and the interchange of limits result stated in Proposition 2.2. In the proof we will leverage the time scale separation technique developed in [15], suitably extended to an infinite-dimensional space.
As mentioned in the introduction, to the best of our knowledge, this is the first time the transient fluid limit of the ordinary JSQ policy is rigorously established. We also observe that in order to exploit the coupling framework in Section 3.2 and in particular Proposition 3.2, we need convergence of tail-sums. Thus we need to establish the fluid convergence result with respect to the ℓ 1 topology, which makes the analysis technically challenging.
To leverage the time scale separation technique, note that the rate at which incoming tasks join a server with i active tasks is determined only by the process $Z^N(\cdot) = (Z^N_1(\cdot), \ldots, Z^N_b(\cdot))$, where $Z^N_i(t) = N - Q^N_i(t)$ represents the number of servers with fewer than i tasks at time t. Furthermore, the dynamics of the $Z^N(\cdot)$ process can be described as where $e_i$ is the i-th unit vector, and

2)
i = 1, 2, . . . , b, with the convention that $Q^N_{b+1}$ is always taken to be zero, if b < ∞. Observe that in any time interval [t, t + ε] of length ε > 0, the $Z^N(\cdot)$ process experiences O(εN) events (arrivals and departures), while the $q^N(\cdot)$ process can change by only an O(ε) amount. In other words, loosely speaking, in a 'small' neighborhood of time t, the $q_i(t)$'s are constant, while as N → ∞, the process $Z^N(\cdot)$ behaves as a time-scaled version of the following process: Therefore, the $Z^N(\cdot)$ process evolves on a much faster time scale than the $q^N(\cdot)$ process. As a result, in the limit as N → ∞, at each time point t, the $Z^N(\cdot)$ process achieves stationarity depending on the instantaneous value of the $q^N(\cdot)$ process, i.e., a separation of time scales takes place. In order to establish the time-scale separation and the fluid limit results, we first write the evolution of the occupancy states in terms of a suitable random measure (see (4.16)) and establish in Proposition 4.4 that the sequence of joint occupancy processes and random measures is relatively compact. We also characterize the limit of any convergent subsequence, where we invoke arguments analogous to those used in the proofs of [15, Lemma 2] and [15, Theorem 3] to complete the proof of the separation of time scales. The proof of the fluid limit result is then completed by establishing uniqueness of the instantaneous stationary distribution achieved by the fast process, given any fluid-scaled occupancy state.
Denote by $\bar{\mathbb{Z}}_+$ the one-point compactification of the set of nonnegative integers $\mathbb{Z}_+$, i.e., $\bar{\mathbb{Z}}_+ = \mathbb{Z}_+ \cup \{\infty\}$. Equip $\bar{\mathbb{Z}}_+$ with the order topology. Denote $G = \bar{\mathbb{Z}}_+^b$ equipped with the product topology, and with the Borel σ-algebra $\mathcal{G}$. Let us consider the G-valued process $Z^N(s) := (Z^N_i(s))_{i \geq 1}$ as introduced above. Note that for the ordinary JSQ policy, the probability that a task arriving at (say) $t_k$ is assigned to some server with i active tasks is given by where $R_i$ is as in (4.2). We prove the following fluid-limit result for the ordinary JSQ policy. Recall the definition of m(q) in Subsection 2.2. If m(q) > 0, then define and else, define $p_0(q) = 1$ and $p_i(q) = 0$ for all i = 1, . . . , b.
Then any subsequence of the sequence of processes $\{q^N(t)\}_{t \geq 0}$ for the ordinary JSQ policy has a further subsequence that converges weakly with respect to the Skorohod $J_1$ topology to the limit $\{q(t)\}_{t \geq 0}$ satisfying the following system of integral equations where $q(0) = q^{\infty}$ and the coefficients $p_i(\cdot)$ are as defined in (4.4).
The rest of this section is devoted to the proof of Theorem 4.1. First we construct the martingale representation of the occupancy state process $Q^N(\cdot)$. Note that the component $Q^N_i(t)$ satisfies the identity relation $Q^N_i(t) = Q^N_i(0) + A^N_i(t) - D^N_i(t)$, where $A^N_i(t)$ is the number of arrivals during [0, t] to some server with i − 1 active tasks, and $D^N_i(t)$ is the number of departures during [0, t] from some server with i active tasks. We can express $A^N_i(t)$ and $D^N_i(t)$ as where $N_{A,i}$ and $N_{D,i}$ are mutually independent unit-rate Poisson processes, i = 1, 2, . . . , b. Define the following sigma fields with respective compensator and predictable quadratic variation processes given by Therefore, finally we have the following martingale representation of the N-th process: (4.9) In the proposition below, we prove that the martingale part vanishes in $\ell_1$ when scaled by N.

Proposition 4.3 (Convergence of martingales).
Proof. The proof follows using the same line of arguments as in the proof of [21, Theorem 3.13], and hence is sketched only briefly for the sake of completeness. Fix any $T \geq 0$, and observe that

Recall that we denote all the fluid-scaled quantities by their respective small letters, e.g. $q^N(t) := Q^N(t)/N$, componentwise, i.e., $q^N_i(t) := Q^N_i(t)/N$ for $i \geq 1$. Therefore the martingale representation in (4.9) can be written as (4.13) or equivalently, (4.14) Now, we consider the Markov process $(q^N, Z^N)(\cdot)$ defined on S × G. Define a random measure $\alpha^N$ on the measurable space $([0, \infty) \times G, \mathcal{C} \otimes \mathcal{G})$ for $A_1 \in \mathcal{C}$ and $A_2 \in \mathcal{G}$. Then the representation in (4.14) can be written in terms of the random measure as

Proposition 4.4. If $q^N(0) \xrightarrow{\mathcal{L}} q^{\infty} \in S$ as N → ∞, then $\{(q^N(\cdot), \alpha^N)\}_{N \geq 1}$ is a relatively compact sequence in $D_S[0, \infty) \times \mathcal{L}$ and the limit $(q(\cdot), \alpha)$ of any convergent subsequence satisfies

To prove Proposition 4.4, we will verify the relative compactness conditions from [9]. Let (E, r) be a complete and separable metric space. For any $x \in D_E[0, \infty)$, δ > 0 and T > 0, define

$$w'(x, \delta, T) := \inf_{\{t_i\}} \max_{i} \sup_{s, t \in [t_{i-1}, t_i)} r(x(s), x(t)), \tag{4.18}$$

where $\{t_i\}$ ranges over all partitions of the form $0 = t_0 < t_1 < \ldots < t_{n-1} < T \leq t_n$ with $\min_{1 \leq i \leq n} (t_i - t_{i-1}) > \delta$ and $n \geq 1$. Below we state the conditions for the sake of completeness.

Theorem 4.5. [9, Corollary 3.7.4]
Let (E, r) be complete and separable, and let $\{X_n\}_{n \geq 1}$ be a family of processes with sample paths in $D_E[0, \infty)$. Then $\{X_n\}_{n \geq 1}$ is relatively compact if and only if the following two conditions hold: (a) For every η > 0 and rational $t \geq 0$, there exists a compact set $\Gamma_{\eta,t} \subset E$ such that $\liminf_{n \to \infty} P(X_n(t) \in \Gamma_{\eta,t}) \geq 1 - \eta$.
(b) For every η > 0 and T > 0, there exists δ > 0 such that $\limsup_{n \to \infty} P(w'(X_n, \delta, T) \geq \eta) \leq \eta$, where $w'$ is the modulus of continuity defined in (4.18).

In order to prove the relative compactness, we will need the next three lemmas: Lemma 4.6 characterizes the relatively compact subsets of S, Lemma 4.7 provides a necessary and sufficient criterion for a sequence of $\ell_1$-valued random variables to be tight, and Lemma 4.8 is needed to ensure that at all finite times t, the occupancy state process lies in some compact set (possibly depending upon t). Proof. For the if part, fix any K ⊆ S satisfying (4.19). We will show that any sequence $\{x^n\}_{n \geq 1}$ in K has a Cauchy subsequence. Since the $\ell_1$ space is complete, this will then imply that $\{x^n\}_{n \geq 1}$ has a convergent subsequence with the limit in K, which will complete the proof.
To show the existence of a Cauchy subsequence, fix any ε > 0, and choose $k \geq 1$ (depending on ε) such that Now observe that the first coordinates $\{x^n_1\}_{n \geq 1}$ form a sequence in [0, 1], and hence have a convergent subsequence. Along that subsequence, the second coordinates have a further convergent subsequence. Proceeding this way, we obtain a subsequence along which the first k − 1 coordinates converge. Therefore, depending upon ε, an $N' \in \mathbb{N}$ can be chosen such that Therefore, (4.20) and (4.21) yield for all $n \geq \max\{N, N'\}$, along the above suitably constructed subsequence. That the limit point is in S follows from the completeness of the $\ell_1$ space and the fact that S is a closed subset of $\ell_1$. Indeed, since the $\ell_1$ topology is finer than the product topology, any set that is closed with respect to the product topology is closed with respect to the $\ell_1$ topology, and S is closed with respect to the product topology. For the only if part, let K ⊆ S be relatively compact, and on the contrary, assume that there exists an ε > 0, such that Therefore, for each $k \geq 1$, there exists $x^{(k)} \in K$, such that $\sum_{i=k}^{\infty} x^{(k)}_i \geq \varepsilon/2$. Consider any limit point $x^*$ of the sequence $\{x^{(k)}\}_{k \geq 1}$, and note that $\sum_{i=j}^{\infty} x^*_i \geq \varepsilon/2$ for all $j \geq 1$. This contradicts that $x^* \in \ell_1$, and the proof is complete. (ii) $\{X^N\}_{N \geq 1}$ is tight with respect to the $\ell_1$ topology. Proof. To prove (i) ⟹ (ii), for any ε > 0, we will construct a relatively compact set K(ε) such that $P(X^N \notin K(\varepsilon)) < \varepsilon$ for all N.
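Observe from (4.23) that for all ε > 0, there exists an $r(\varepsilon) \geq 1$, such that and with it an $N(\varepsilon) \geq 1$, such that Furthermore, since $X^1, X^2, \ldots, X^{N(\varepsilon)}$ is a finite set of $\ell_1$-valued random variables, there exists $k(\varepsilon) \geq r(\varepsilon)$, such that Thus, there exists an increasing sequence $\{k(n)\}_{n \geq 1}$ such that Define the set K(ε) as

$$K(\varepsilon) := \Big\{ x \in S : \sum_{i \geq k(n)} x_i \leq \frac{\varepsilon}{2^n} \text{ for all } n \geq 1 \Big\}.$$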
Due to Lemma 4.6, we know K(ε) is relatively compact in $\ell_1$. Also, To prove (ii) ⟹ (i), first observe that tightness of a sequence of random variables with respect to the $\ell_1$ topology implies tightness with respect to the product topology. Now assume, contrary to (4.23), that there exists ε > 0 such that Since $\{X^N\}_{N \geq 1}$ is tight with respect to the $\ell_1$ topology, take any convergent subsequence $\{X^{N(n)}\}_{n \geq 1}$ with $X^*$ being a random variable following the limiting measure. In that case, observe that (4.24) implies $P\big(\sum_{i \geq k} X^*_i > \varepsilon/2\big) > \varepsilon$ for all $k \geq 1$, which leads to a contradiction since $X^*$ is an $\ell_1$-valued random variable.

Lemma 4.8. For any $q^{\infty} \in S$, assume that $q^N(0) \xrightarrow{\mathcal{L}} q^{\infty}$, as N → ∞. Then for any $t \geq 0$, there exists $M(t, q^{\infty}) \geq 1$, such that under the JSQ policy, with probability tending to one as N → ∞, no arriving task is assigned to a server with $M(t, q^{\infty}) - 1$ active tasks up to time t.
Proof. Let $A^N(t)$ be the cumulative number of tasks arriving up to time t. Since the arrival rate is λ(N), and λ(N)/N → λ as N → ∞, for any ε > 0, Note that since $q^{\infty} \in S \subset \ell_1$, $M(t, q^{\infty})$ exists and is finite for all $t \geq 0$. We now claim that the probability that in the interval [0, t] a task is assigned to some server with $M(t, q^{\infty}) - 1$ active tasks tends to 0, as N → ∞. Indeed, in order for a task to be assigned to some server with $M(t, q^{\infty}) - 1$ active tasks, all the servers must have at least $M(t, q^{\infty}) - 1$ active tasks. Now, the minimum number of tasks required for this is given by Therefore, the proof is complete by observing that

Proof of Proposition 4.4. The proof goes in two steps. We first prove the relative compactness, and then show that the limit satisfies (4.17). Observe from [9, Proposition 3.2.4] that, to prove the relative compactness of the sequence of processes $\{(q^N(\cdot), \alpha^N)\}_{N \geq 1}$, it is enough to prove relative compactness of the individual components. Note that, from Prohorov's theorem [9, Theorem 3.2.2], $\mathcal{L}$ is compact, since G is compact. Now, relative compactness of $\{\alpha^N\}_{N \geq 1}$ follows from the compactness of $\mathcal{L}$ under the topology of weak convergence of measures and Prohorov's theorem. To claim the relative compactness of $\{q^N(\cdot)\}_{N \geq 1}$, we will verify the conditions of Theorem 4.5.
Observe that Theorem 4.5 (a) requires showing tightness of the sequence $\{q^N(t)\}_{N \geq 1}$ for each fixed (rational) $t \geq 0$. Fix any $t \geq 0$. Due to Lemma 4.8, we know Also, $q^N(0) \xrightarrow{\mathcal{L}} q^{\infty}$ with respect to the $\ell_1$ topology. In particular, $\{q^N(0)\}_{N \geq 1}$ is tight in $\ell_1$. Therefore, using (ii) ⟹ (i) in Lemma 4.7 we obtain, for any ε > 0, Since $q^N(t)$ takes values in $[0, 1]^b$, which is compact with respect to the product topology, $\{q^N(t)\}_{N \geq 1}$ is tight with respect to the product topology. Hence using (i) ⟹ (ii) in Lemma 4.7 we conclude that the sequence $\{q^N(t)\}_{N \geq 1}$ is tight in $\ell_1$. For condition (b), first note that for all i = 1, . . . , b. Thus, Observe that the proof of the relative compactness of $\{q^N(t)\}_{t \geq 0}$ is complete if we show that for any η > 0, there exists a δ > 0 and a finite partition (4.26) Now, (4.25) implies that, for any finite partition $(t_j)_{j=1}^{n}$ of [0, T], where $P(\zeta^N > \eta/2) < \eta$ for all sufficiently large N. Now take δ = η/(4(λ + 1)) and any partition with $\max_j (t_j - t_{j-1}) < \eta/(2(\lambda + 1))$ and $\min_j (t_j - t_{j-1}) > \delta$. Now on the event Therefore, for all sufficiently large N,

To prove that the limit $(q(\cdot), \alpha)$ of any convergent subsequence satisfies (4.17), we will use the continuous-mapping theorem [34, Theorem 3.4.1]. Specifically, we will show that the right side of (4.16) is a continuous map of suitable arguments. Let $\{q(t)\}_{t \geq 0}$ and $\{y(t)\}_{t \geq 0}$ be an S-valued and an $\ell_1$-valued càdlàg function, respectively. Also, let α be a measure on the measurable space $([0, \infty) \times G, \mathcal{C} \otimes \mathcal{G})$. Then for $q^0 \in S$, define for $i \geq 1$, Observe that it is enough to show that $F = (F_1, \ldots, F_b)$ is a continuous operator. Indeed, in that case the right side of (4.16) can be written as $F(q^N, \alpha^N, q^N(0), y^N)$, where $y^N = (y^N_1, \ldots, y^N_b)$ with $y^N_i = (M^N_{A,i} - M^N_{D,i})/N$, and since each argument converges, we will get the convergence to the right side of (4.17). Therefore, we now prove the continuity of F below.
In particular, assume that (a) the sequence of processes $\{(q^N, y^N)\}_{N \geq 1}$ converges to (q, y) with respect to the $\ell_1$ topology, (b) for any fixed $t \geq 0$, the sequence $\{(\alpha^N([0, t], R_i))_{i \geq 1}\}_{N \geq 1}$ in $\ell_1$ converges to $(\alpha([0, t], R_i))_{i \geq 1}$, and (c) the sequence of S-valued random variables $q^N(0)$ converges to q(0) with respect to the $\ell_1$ topology.
(iii) We now claim that for the ε > 0 given above there is an $N_3 \in \mathbb{N}$ such that Observe that we only know the weak convergence of the sequence of measures $\alpha^N$, and therefore we cannot directly make assumption (b) above. We therefore show that assumption (b) is valid in our case and that it follows from weak convergence. Indeed, since $q^{\infty} \in S \subseteq \ell_1$, there exists $\hat{M}(q^{\infty})$ such that $q^{\infty}_{\hat{M}(q^{\infty})} < 1$, and consequently $q^{\infty}_i < 1$ for all $i \geq \hat{M}(q^{\infty})$. Also, due to Lemma 4.8, This implies Also, due to weak convergence of $\alpha^N$, (iv) Finally, due to (c), choose $N_4 \in \mathbb{N}$, such that $\| q^N(0) - q(0) \|_1 < \varepsilon/4$.
Thus the proof of continuity of F is complete.
To characterize the limit in (4.17), for any q ∈ S, define the Markov process Z q on G as where e i is the i th unit vector, i = 1, . . . , b.
Proof of Theorem 4.1. Having proved the relative compactness in Proposition 4.4, it follows from arguments analogous to those used in the proofs of [15, Lemma 2] and [15, Theorem 3] that the limit of any convergent subsequence of the sequence of processes $\{q^N(t)\}_{t \geq 0}$ satisfies for some stationary measure $\pi_{q(t)}$ of the Markov process $Z_{q(t)}$ described in (4.28) satisfying $\pi_q\{Z : Z_i = \infty\} = 1$ if $q_i < 1$. It remains to show that q(t) uniquely determines $\pi_{q(t)}$, and that $\pi_{q(s)}(R_i) = p_{i-1}(q(s))$ as described in (4.4). As mentioned earlier, in this proof we will now assume the specific assignment probabilities in (4.2), corresponding to the ordinary JSQ policy. To see this, fix any $q = (q_1, \ldots, q_b) \in S$. Observe that due to summability of the components of q, there exists $0 \leq m < \infty$ such that $q_{m+1} < 1$ and $q_1 = \ldots = q_m = 1$, with the convention that $q_0 \equiv 1$ and $q_{b+1} \equiv 0$ if b < ∞. In that case, Also, note that $q_i = 1$ forces $dq_i/dt \leq 0$, i.e., $\lambda \pi_q(R_i) \leq q_i - q_{i+1}$ for all i = 1, . . . , m, and in particular $\pi_q(R_i) = 0$ for all i = 1, . . . , m − 1. Thus, Therefore, $\pi_q$ is determined only by the stationary distribution of the m-th component, which can be described as a birth-death process and let $\pi^{(m)}$ be its stationary distribution. Now it is enough to show that $\pi^{(m)}$ is uniquely determined by q. First observe that the process on $\bar{\mathbb{Z}}_+$ described in (4.30) is reducible, and can be decomposed into two irreducible classes given by $\mathbb{Z}_+$ and {∞}, respectively. Therefore, if $\pi^{(m)}(Z = \infty) = 0$ or 1, then it is unique. Indeed, if $\pi^{(m)}(Z = \infty) = 0$, then Z is a birth-death process on $\mathbb{Z}_+$ only, and hence it has a unique stationary distribution. Otherwise, if $\pi^{(m)}(Z = \infty) = 1$, then it is trivially unique. Now we distinguish between two cases depending upon whether $q_m - q_{m+1} \geq \lambda$ or not.
Note that if $q_m - q_{m+1} \geq \lambda$, then $\pi^{(m)}(Z \geq k) = 1$ for all $k \geq 0$. On $\bar{\mathbb{Z}}_+$ this shows that $\pi^{(m)}(Z = \infty) = 1$. Furthermore, if $q_m - q_{m+1} < \lambda$, we will show that $\pi^{(m)}(Z = \infty) = 0$. On the contrary, assume $\pi^{(m)}(Z = \infty) = \varepsilon \in (0, 1]$. Also, let $\hat{\pi}^{(m)}$ be the unique stationary distribution of the birth-death process in (4.30) on $\mathbb{Z}_+$. Therefore, Substituting into the differential form of the fluid equation (4.5) at the given time t, we obtain that $dq_m(t)/dt$ where the last inequality follows since we are considering the case when $q_m - q_{m+1} < \lambda$. Now since $q_m(t) = 1$, this leads to a contradiction for any ε > 0, and hence it must be the case that $\pi^{(m)}(Z = \infty) = 0$. Therefore, for all q ∈ S, $\pi_q$ is uniquely determined by q. Furthermore, we can identify the expression for $\pi_q(R_i)$ as and hence $\pi_{q(s)}(R_i) = p_{i-1}(q(s))$ as claimed.
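The fluid dynamics identified in this proof can be explored numerically with a forward Euler scheme. The sketch below is an illustration under explicit assumptions rather than the paper's definition: it takes the drift to be dq_i/dt = λ p_{i−1}(q) − (q_i − q_{i+1}), and it assumes that when m leading coordinates equal 1, a fraction p_{m−1} = min((1 − q_{m+1})/λ, 1) of arrivals joins servers with m − 1 tasks and the remainder joins servers with m tasks, which is the form suggested by the case distinction on q_m − q_{m+1} versus λ above. The names `drift` and `euler` and the tolerance `tol` are hypothetical.

```python
import math

def drift(q, lam, tol=1e-9):
    """Assumed fluid drift: dq_i/dt = lam * p_{i-1}(q) - (q_i - q_{i+1}), i = 1..b."""
    b = len(q)
    ext = list(q) + [0.0]            # convention q_{b+1} = 0
    m = 0                            # number of leading coordinates equal to 1
    while m < b and ext[m] >= 1.0 - tol:
        m += 1
    p = [0.0] * (b + 1)              # p[i]: fraction of arrivals joining servers with i tasks
    if m == 0:
        p[0] = 1.0
    else:
        p[m - 1] = min((1.0 - ext[m]) / lam, 1.0)
        p[m] = 1.0 - p[m - 1]
    return [lam * p[i - 1] - (ext[i - 1] - ext[i]) for i in range(1, b + 1)]

def euler(q0, lam, T, dt=1e-3):
    """Forward Euler integration of the assumed fluid equations up to time T."""
    q = list(q0)
    for _ in range(int(round(T / dt))):
        dq = drift(q, lam)
        q = [min(1.0, max(0.0, qi + dt * di)) for qi, di in zip(q, dq)]
    return q

lam = 0.7
print(euler([0.0, 0.0, 0.0], lam, T=5.0))   # starting from an empty system
print(euler([1.0, 0.5, 0.0], lam, T=30.0))  # starting with all servers busy
```

Starting from an empty system, only the first coordinate moves, and it follows λ(1 − e^{−t}); starting from a state with all servers busy, the trajectory approaches a point with first coordinate near λ and the remaining coordinates near zero.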

Equivalence on fluid scale
Having proved Theorem 4.1, it suffices to prove the universality property stated in the next proposition. This will complete the proof of Theorem 2.1.

Proposition 4.9.
If d(N) → ∞ as N → ∞, then the JSQ(d(N)) scheme and the ordinary JSQ policy have the same fluid limit.
The proof of the above proposition uses the S-coupling results from Section 3, and consists of three steps: (i) First we show that if n(N)/N → 0 as N → ∞, then the MJSQ(n(N)) scheme has the same fluid limit as the ordinary JSQ policy.
(ii) Then we apply Corollary 3.3 to prove that as long as n(N)/N → 0, any scheme from the class CJSQ(n(N)) has the same fluid limit as the ordinary JSQ policy.
(iii) Next, using Propositions 3.5 and 3.6 we establish that if d(N) → ∞, then for some n(N) with n(N)/N → 0, the JSQ(d(N)) scheme and the JSQ(n(N), d(N)) scheme have the same fluid limit. The proposition then follows by observing that the JSQ(n(N), d(N)) scheme belongs to the class CJSQ(n(N)).
Proof of Proposition 4.9. First, to show Claim (i) above, define $\hat{N} = N - n(N)$ and $\hat{\lambda}(\hat{N}) = \lambda(N)$. Observe that the MJSQ(n(N)) scheme with N servers can be thought of as the ordinary JSQ policy with $\hat{N}$ servers and arrival rate $\hat{\lambda}(\hat{N})$. Also, since n(N)/N → 0,

$$\frac{\hat{\lambda}(\hat{N})}{\hat{N}} = \frac{\lambda(N)}{N - n(N)} \to \lambda \quad \text{as } N \to \infty.$$

Furthermore, observe that the fluid limit of the JSQ policy in Theorem 4.1 as given by (4.5) is characterized by the parameter λ only, and hence the fluid limit of the MJSQ(n(N)) scheme is the same as that of the ordinary JSQ policy. Second, observe from the fluid limit of the JSQ policy that if λ < 1, then for any buffer capacity $b \geq 1$, and any starting state, the fluid-scaled cumulative overflow is negligible, i.e., for any $t \geq 0$, $L^N(t)/N \xrightarrow{P} 0$. Since the above fact is induced by the fluid limit only, the same holds for the MJSQ(n(N)) scheme. Therefore, using the lower and upper bounds in Corollary 3.3 and the tail bound in Proposition 3.2, we obtain Claim (ii) above.
Since $\{A^N(T)/N\}_{N \geq 1}$ is a tight sequence of random variables, we have and hence, $\Delta^N(T)/N \xrightarrow{P} 0$. Therefore, applying the $\ell_1$ distance bound stated in Proposition 3.5, we obtain Claim (iii). The proof is then completed by observing that the JSQ(n(N), d(N)) scheme belongs to the class CJSQ(n(N)).
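To see why a suitable n(N) exists for Claim (iii) whenever d(N) → ∞, one can tabulate a candidate choice. The sketch below is only an illustration: the choice n = ⌈N/√d⌉ is an assumption made here (not necessarily the one used in the paper), and the miss probability is taken to be (1 − n/N)^d, i.e., sampling with replacement. Both n/N and the miss probability shrink as d grows, which is what the fluid-scale argument needs, since the number of arrivals on [0, T] is of order N.

```python
import math

def n_choice(N, d):
    """Candidate n(N) for illustration (an assumption): about N / sqrt(d), capped at N."""
    return min(N, math.ceil(N / math.sqrt(d)))

def fluid_check(N, d):
    """Return (n/N, miss probability) for the candidate n(N)."""
    n = n_choice(N, d)
    return n / N, (1.0 - n / N) ** d

N = 10 ** 6
for d in (4, 16, 64, 256, 1024):
    frac, miss = fluid_check(N, d)
    print(f"d={d:5d}  n/N={frac:.5f}  miss probability={miss:.3e}")
```

Any other choice with n(N)/N → 0 and n(N) d(N)/N → ∞ would serve the same purpose.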
Proof of Theorem 2.3. For any ε > 0, define Now the proof consists of two main steps. First we show that if d(N) ℓ(N) for some ε > 0, then there exists an ε ′ > 0, such that if for some T > 0, È T N ε ′ > T → 1 as N → ∞, then the number of times that the JSQ(d(N)) scheme and the ordinary JSQ policy differ in decision in [0, T ] is o P (N). This then implies that up to such a time T , it is enough to consider the fluid limit of the ordinary JSQ policy with batch arrivals. Second, we show that if the conditions stated in Theorem 2.3 hold, then for any finite time T > 0, This will complete the proof. To prove the first part, consider the JSQ(d(N)) scheme in case of batch arrivals. Choose ε ′ = ε/2, and assume that T > 0 is such that È T N ε ′ > T → 1 as N → ∞. Let I i denote the number of idle servers among d(N) randomly chosen servers for the i th batch arrival, and define W N (t) to be the cumulative number of tasks that have not been assigned to some idle server, up to time t. If A N (t) denotes the number of batch arrivals that occurred up to time t, then Therefore, for c = 1 − λ − ε/2 we have, Now, from [14,20], we know  This implies that whenever for all t T . Now the analysis of the batch arrival with ordinary JSQ policy in Theorem 4.10 below, up to time T , shows that the process q N (t) 0 t T converges to the deterministic limit q(t) 0 t T , described by (2.3).
Therefore, it is enough to show that any T > 0 satisfies the required criterion. This can be seen by observing that for any T 0, and any ε ′ > 0, Therefore the proof is complete. (0) È − → 0 for all i 2, then the sequence of processes q d(N) (t) t 0 converges weakly to the limit q(t) t 0 , described as follows: Proof. Fix any finite time T 0. To analyze the JSQ policy with batch arrivals, observe that before time T , all the arriving tasks join idle servers. Therefore, assuming Q N 2 (0) = 0, for all t T , the evolution for Q N 1 can be written as where A and D are independent unit-rate Poisson processes. Using the random time change of unit-rate Poisson processes [25,Lemma 3.2], and applying the arguments in [25,Lemma 3.4], the above process scaled by N, then admits the martingale decomposition where are square integrable martingales with respective quadratic variation processes given by , it follows that q N 1 (t) t 0 as N → ∞ converges weakly to a deterministic limit described by the integral equation

Global stability and interchange of limits
To prove the interchange of limits result stated in Proposition 2.2, we will establish the global stability of the fixed point, i.e., all fluid paths converge to the fixed point in (2.2) as t → ∞. This is formally stated in the following lemma.
Lemma 4.11. Let q(t) be the fluid limit, i.e., the solution of the dynamical system described by the system of integral equations in (2.1). For any q ∞ ∈ S, if q(0) = q ∞ , then q(t) → q * as t → ∞, where q * is defined as in (2.2).
In case of the JSQ(d) scheme with fixed d, the global stability is proved by constructing a Lyapunov function that measures the 'distance' (in terms of a weighted L 1 -norm) between the trajectory and the fixed point, and that strictly decreases everywhere except at the fixed point, see [21,Theorem 3.6]. In case of the ordinary JSQ policy however, we can exploit a more direct method to establish the global stability, as further detailed below.
Proof of Lemma 4.11. The proof follows in two steps: we will first establish that as t → ∞, q 1 (t) → λ < 1, and then show that q 2 (t) → 0.
On the other hand, we claim that $\limsup_{t \to \infty} q_1(t) \leq \lambda$. Suppose not, i.e., assume $\limsup_{t \to \infty} q_1(t) = \lambda + \varepsilon$ for some ε > 0. Because $q_1(t)$ is non-decreasing when $q_1(t) \leq \lambda$, there must exist a $t_0$ such that $q_1(t) \geq \lambda$ for all $t \geq t_0$. The high-level idea behind the claim is as follows. If $q_1(t)$ were to remain above λ by a non-vanishing margin, then the cumulative number of departures would exceed the cumulative number of arrivals by an infinite amount, which cannot occur since the initial number of tasks is bounded. More formally, This provides a contradiction with $\limsup_{t \to \infty} q_1(t) = \lambda + \varepsilon$, since the rate of decrease of $q_1(t)$ is at most 1. Therefore, $q_1(t) \to \lambda$ as t → ∞.
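As a sanity check on this limit, consider the simplest special case, worked out here under the assumption that $q_1(0) < 1$ and $q_2(0) = 0$, so that every arrival joins an idle server, $q_2(t)$ stays at zero, and the first fluid equation reduces to a linear ordinary differential equation:

```latex
\frac{dq_1(t)}{dt} = \lambda - q_1(t)
\quad \Longrightarrow \quad
q_1(t) = \lambda + \bigl(q_1(0) - \lambda\bigr) e^{-t}
\longrightarrow \lambda
\quad \text{as } t \to \infty .
```

In this case $q_1(t)$ moves monotonically from $q_1(0)$ towards λ and never reaches 1, so the assumed regime persists for all t.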
Proof of Proposition 2.2. The proof follows in two steps: (i) we first establish that the sequence of stationary measures $\{\pi^{d(N)}\}_{N \geq 1}$ is tight, and then (ii) show the interchange of limits.
(i) Observe that if b < ∞, then the space $[0, 1]^b$ is compact, and hence Prohorov's theorem implies that $\{\pi^{d(N)}\}_{N \geq 1}$ is tight. Now assume b = ∞. For any two positive integers $d_1 \leq d_2$, note that at each arrival, the JSQ($d_2$) scheme polls at least as many servers as the JSQ($d_1$) scheme. Thus using the S-coupling and Proposition 3.1, we can conclude for every N, Let $X^N$ and $Y^N$ denote random variables following the stationary distribution of two systems with N servers under the JSQ(d(N)) and JSQ(1) schemes, respectively. We will verify the tightness criterion stated in Lemma 4.7. Note that since $X^N$ takes values in $S \subset [0, 1]^{\infty}$, which is compact with respect to the product topology, Prohorov's theorem implies that $\{X^N\}_{N \geq 1}$ is tight with respect to the product topology. To verify the condition in (4.23), note that the system under the JSQ(1) scheme is essentially a collection of N independent M/M/1 systems. Therefore, for each $k \geq 1$, Since λ < 1, taking the limit k → ∞, the right side of the above inequality tends to zero, and hence, the condition in (4.23) is verified.
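The M/M/1 comparison used in step (i) rests on a geometric tail. As a small illustration, and assuming that under JSQ(1) each server behaves as an M/M/1 queue with load ρ = λ(N)/N < 1 so that the stationary probability of having at least i tasks is ρ^i, the expected tail sum of the occupancy fractions beyond level k has a closed form that vanishes as k grows. The function names are hypothetical.

```python
def mm1_tail_sum(rho, k):
    """Closed form of sum_{i >= k} rho**i, where rho**i = P(queue length >= i) for M/M/1."""
    return rho ** k / (1.0 - rho)

def mm1_tail_sum_truncated(rho, k, terms=2000):
    """Direct (truncated) summation, used only as a cross-check of the closed form."""
    return sum(rho ** i for i in range(k, k + terms))

rho = 0.8
for k in (1, 5, 10, 20, 40):
    print(k, mm1_tail_sum(rho, k), mm1_tail_sum_truncated(rho, k))
```

By Markov's inequality, the probability that the tail sum exceeds ε is at most this expectation divided by ε, which is small uniformly in N once k is large.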
(ii) Now observe that since $\{\pi^{d(N)}\}_{N \geq 1}$ is tight, any subsequence has a convergent further subsequence. Let $\{\pi^{d(N_n)}\}_{n \geq 1}$ be any such convergent subsequence, with $\{N_n\}_{n \geq 1} \subseteq \mathbb{N}$, such that $\pi^{d(N_n)} \xrightarrow{\mathcal{L}} \hat{\pi}$ as n → ∞. We will show that $\hat{\pi}$ is unique and equals the measure $\pi^{\star}$, as defined in the statement of Proposition 2.2. Notice that if $q^{d(N_n)}(0) \sim \pi^{d(N_n)}$, then $q^{d(N_n)}(t) \sim \pi^{d(N_n)}$ for all $t \geq 0$. Thus, $\hat{\pi}$ is an invariant distribution of the deterministic process $\{q(t)\}_{t \geq 0}$. This in conjunction with the global stability in Lemma 4.11 implies that $\hat{\pi}$ must be the fixed point of the fluid limit. Thus, we have shown the convergence of the stationary measure.

Diffusion-Limit Proofs
In this section we prove the diffusion-limit results for the JSQ(d(N)) scheme stated in Theorem 2.4, and the almost necessity condition for diffusion-level optimality stated in Theorem 2.5. As noted in Subsection 2.4, the diffusion limit for the ordinary JSQ policy is obtained in [8,Theorem 2], and characterized by (2.4). Therefore it suffices to prove the universality property stated in the next proposition. The proof of the above proposition follows similar lines as that of Proposition 4.9, leveraging again the S-coupling results from Section 3, and involves three steps: (i) First we show that if n(N)/ √ N → 0 as N → ∞, then the MJSQ(n(N)) scheme has the same diffusion limit as the ordinary JSQ policy.
(ii) Then we use Corollary 3.3 to prove that as long as n(N)/ √ N → 0, any scheme from the class CJSQ(n(N)) has the same diffusion limit as the ordinary JSQ policy.
(iii) Next we establish using Propositions 3.5 and 3.6 that if d(N)/( √ N log(N)) → ∞ as N → ∞, then for some n(N) with n(N)/ √ N → 0, the JSQ(d(N)) scheme and the JSQ(n(N), d(N)) scheme have the same diffusion limit. The proposition then follows by observing that the JSQ(n(N), d(N)) scheme belongs to the class CJSQ(n(N)).
Proof of Proposition 5.1. To show Claim (i) above, define $\hat{N} = N - n(N)$ and $\hat{\lambda}(\hat{N}) = \lambda(N)$. As mentioned earlier, the MJSQ(n(N)) scheme with N servers can be thought of as the ordinary JSQ policy with $\hat{N}$ servers and arrival rate $\hat{\lambda}(\hat{N})$. Also, since $n(N)/\sqrt{N} \to 0$,

$$\frac{\hat{N} - \hat{\lambda}(\hat{N})}{\sqrt{\hat{N}}} = \frac{N - n(N) - \lambda(N)}{\sqrt{N - n(N)}} \to \beta > 0 \quad \text{as } N \to \infty.$$

Furthermore, observe that the diffusion limit of the JSQ policy in [8, Theorem 2] as given in (2.4) is characterized by the parameter β > 0, and hence the diffusion limit of the MJSQ(n(N)) scheme is the same as that of the ordinary JSQ policy. Observe from the diffusion limit of the JSQ policy that if β > 0, then for any buffer capacity $b \geq 2$, and suitable initial state as described in Theorem 2.4, the cumulative overflow is negligible, i.e., for any $t \geq 0$, $L^N(t) \xrightarrow{P} 0$. Indeed observe that if $b \geq 2$, and $\{Q^N_2(0)\}_{N \geq 1}$ is a tight sequence, then the sequence of processes $\{Q^N_2(t)\}_{t \geq 0}$ is stochastically bounded. Therefore, on any finite time interval, there will be only $O_P(\sqrt{N})$ servers with queue length more than one, whereas for an overflow event to occur all the N servers must have at least two pending tasks. Therefore, for any $t \geq 0$, Finally, since the above fact is implied by the diffusion limit only, the same holds for the MJSQ(n(N)) scheme. Therefore, using the lower and upper bounds in Corollary 3.3 we arrive at Claim (ii). (5.1) Since $\{A^N(T)/N\}_{N \geq 1}$ is a tight sequence of random variables, and Therefore, by invoking Proposition 3.5, we obtain Claim (iii). The proof is then completed by observing that the JSQ(n(N), d(N)) scheme belongs to the class CJSQ(n(N)).
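The role of the growth condition on d(N) in Claim (iii) can be illustrated numerically. The sketch below makes several assumptions that are not taken from the text: the example growth rate d(N) = √N (log N)², the candidate choice n(N) = N log N / d(N), real-valued (non-integer) n and d for simplicity, and the miss probability (1 − n/N)^d from sampling with replacement. With these choices n(N)/√N = 1/log N → 0, while the expected number of differing decisions among order-N arrivals, measured on the √N scale, is at most √N · e^{−log N} = 1/√N.

```python
import math

def diffusion_check(N):
    """Return (n/sqrt(N), sqrt(N) * miss probability) for the assumed d(N) and n(N)."""
    d = math.sqrt(N) * math.log(N) ** 2   # example with d / (sqrt(N) log N) -> infinity
    n = N * math.log(N) / d               # candidate n(N), so that n * d / N = log N
    miss = (1.0 - n / N) ** d             # probability of missing the n lowest ordered servers
    return n / math.sqrt(N), math.sqrt(N) * miss

for N in (10 ** 2, 10 ** 4, 10 ** 6, 10 ** 8):
    ratio, scaled_miss = diffusion_check(N)
    print(f"N={N:>9d}  n/sqrt(N)={ratio:.4f}  sqrt(N)*miss={scaled_miss:.3e}")
```

Both columns decrease towards zero, consistent with the requirement that n(N) be negligible on the √N scale while the two schemes rarely differ in decision.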
We next prove that the growth condition $d(N)/(\sqrt{N} \log N) \to \infty$ is nearly necessary: for any $d(N)$ such that $d(N)/(\sqrt{N} \log N) \to 0$ as $N \to \infty$, the diffusion limit of the JSQ($d(N)$) scheme differs from that of the ordinary JSQ policy. Note that it is enough to consider the truncated system where any arrival to a server with at least two tasks is discarded, since the truncated system and the original system have the same diffusion limit [24]. Now consider the JSQ($d(N)$) scheme for some $d(N)$ such that $d(N)/(\sqrt{N} \log N) \to 0$ as $N \to \infty$, and assume, on the contrary, that its diffusion limit coincides with that of the ordinary JSQ policy. The idea is that if the number of idle servers and the number of servers with queue length two are both $O_P(\sqrt{N})$, then in any finite time interval the number of tasks assigned to a server with queue length at least one by the JSQ($d(N)$) scheme with $d(N)/(\sqrt{N} \log N) \to 0$ does not scale with $\sqrt{N}$ but is of larger order, which then immediately proves that the diffusion limit cannot coincide with that of the ordinary JSQ policy.
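The heuristic above can be illustrated with a small calculation. The sketch below is a rough illustration rather than part of the proof. It assumes $\sqrt{N}$ idle servers, sampling with replacement, $\lambda(N) = N - \sqrt{N}$, and $d(N) = k\sqrt{N}\log N$ for a constant $k$. All of these are choices made here; the results in the text concern ratios tending to $0$ or $\infty$, not constant $k$.

```python
import math


def miss_probability(num_servers: float, idle_servers: float, d: float) -> float:
    """P(none of d servers sampled with replacement is idle)."""
    return (1.0 - idle_servers / num_servers) ** d


def busy_assignments_per_sqrt_n(num_servers: int, k: float) -> float:
    """Approximate rate of tasks sent to busy servers, divided by sqrt(N).

    Assumptions (illustrative): sqrt(N) idle servers,
    d = k * sqrt(N) * log(N), and lambda(N) = N - sqrt(N).
    """
    root = math.sqrt(num_servers)
    d = k * root * math.log(num_servers)
    arrival_rate = num_servers - root
    return arrival_rate * miss_probability(num_servers, root, d) / root


if __name__ == "__main__":
    for k in (0.1, 0.25, 0.5, 1.0, 2.0):
        print(k, busy_assignments_per_sqrt_n(10**6, k))
```

Since the miss probability is roughly $N^{-k}$, the printed ratio behaves like $N^{1/2 - k}$: it is large for small $k$ and negligible for large $k$, consistent with the threshold scale $\sqrt{N}\log N$.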
To formalize the above idea, we first define an artificial scheme below, which will serve as an asymptotic lower bound on the number of servers with queue length two in a system following the JSQ($d(N)$) scheme, under the hypothesis that the diffusion limit of the JSQ($d(N)$) scheme coincides with that of the ordinary JSQ policy. For any nonnegative sequence $c(N)$, define a scheme $\Pi(c(N))$ which
(i) At each external arrival, assigns the task to a server having queue length one with probability $(1 - c(N)/N)^{d(N)}$, and else discards it (ties can be broken randomly),
(ii) If a departure occurs from a server with queue length one, immediately makes the server busy with a dummy arrival, i.e., essentially $\Pi(c(N))$ prevents any server from remaining idle.
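The two rules above reduce the state of the $\Pi(c(N))$ system to a single number, the count of servers holding two tasks. A minimal simulation sketch follows. It assumes buffer size two, unit-rate exponential service, and an initial state with no server holding two tasks. The function name, the parameters and the initial state are hypothetical choices made for illustration.

```python
import random


def simulate_pi_scheme(num_servers, arrival_rate, c, d, horizon, seed=0):
    """Simulate the number of servers with two tasks under Pi(c(N)).

    Rule (ii) keeps every server busy, so the state is q2 alone.
    Assumptions: buffer size two, q2 starts at zero.
    """
    rng = random.Random(seed)
    accept_prob = (1.0 - c / num_servers) ** d  # rule (i)
    q2, t = 0, 0.0
    while True:
        # Accepted arrivals need a server with exactly one task.
        up_rate = arrival_rate * accept_prob if q2 < num_servers else 0.0
        # Each server with two tasks completes service at unit rate.
        down_rate = float(q2)
        total = up_rate + down_rate
        if total == 0.0:
            return q2
        t += rng.expovariate(total)
        if t > horizon:
            return q2
        if rng.random() * total < up_rate:
            q2 += 1
        else:
            q2 -= 1


if __name__ == "__main__":
    print(simulate_pi_scheme(10_000, 9_900.0, c=300, d=50, horizon=2.0))
```

Departures from servers with one task do not appear in the code because the dummy arrival restores the state immediately.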
We use a coupling argument to establish Lemma 5.2, which relates the $\Pi(c(N))$ scheme to the JSQ($d(N)$) scheme.
In order to prove Lemma 5.2, we first S-couple the two systems under the schemes $\Pi(c(N))$ and JSQ($d(N)$), respectively. Then, at each external arrival, to assign the task in the two systems in a coupled way, draw a single uniform[0, 1] random variable $U$, independent of any other processes.
• Under the JSQ($d(N)$) scheme, depending on the value of $U$, assign the incoming task to a server with queue length one or to an idle server, and otherwise discard it. This preserves the statistical law of the JSQ($d(N)$) scheme with buffer size $b = 2$: according to this rule, an incoming task is assigned to some server with queue length zero, one, or two with the probabilities prescribed by the JSQ($d(N)$) scheme.
• Under the scheme $\Pi(c(N))$, if $U < (1 - c(N)/N)^{d(N)}$, then assign the incoming task to a server with queue length one, and otherwise discard it. Clearly, the statistical law of the $\Pi(c(N))$ scheme is preserved by this rule.
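For orientation, the assignment probabilities of the JSQ($d(N)$) scheme can be written out under two assumptions that are made here and not taken from the text: the $d(N)$ servers are sampled uniformly with replacement, and $Q_i$ denotes the current number of servers with queue length at least $i$. With buffer size $b = 2$, a task assigned to a server with queue length two is discarded.

```latex
\begin{align*}
\mathbb{P}(\text{queue length zero}) &= 1 - \Bigl(\frac{Q_1}{N}\Bigr)^{d(N)}, \\
\mathbb{P}(\text{queue length one})  &= \Bigl(\frac{Q_1}{N}\Bigr)^{d(N)} - \Bigl(\frac{Q_2}{N}\Bigr)^{d(N)}, \\
\mathbb{P}(\text{queue length two})  &= \Bigl(\frac{Q_2}{N}\Bigr)^{d(N)}.
\end{align*}
```

Under these assumptions, if at most $c(N)$ servers are idle, then $(Q_1/N)^{d(N)} \geq (1 - c(N)/N)^{d(N)}$, which suggests why comparing a single uniform variable $U$ against both thresholds is a natural way to couple the two schemes.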
Proof of Lemma 5.2. Fix any $T \geq 0$. The proof follows in two steps:
(i) First assume that at each external arrival up to time $T$, whenever an incoming task joins a server with queue length one under the $\Pi(c(N))$ scheme, then so does the incoming task under the JSQ($d(N)$) scheme. In that case, since the two systems are S-coupled, by forward induction on event times, it can be seen that $Q_2^{\Pi(c(N))}(t) \leq Q_2^{d(N)}(t)$ for all $t \leq T$.
(ii) Note that the probability that an incoming task joins a server with queue length one is determined by the current occupancy under the JSQ($d(N)$) scheme and equals $(1 - c(N)/N)^{d(N)}$ under the $\Pi(c(N))$ scheme. Informally speaking, due to the above coupling, (5.3) then implies that with high probability, on any finite time interval, whenever an external incoming task joins a server with queue length one under the $\Pi(c(N))$ scheme, then so does the incoming task under the JSQ($d(N)$) scheme. Therefore, from Part (i) above, the claim of the lemma follows.

Now assume, on the contrary, that the suitably scaled occupancy process under the JSQ($d(N)$) scheme converges to the appropriate diffusion limit corresponding to that of the ordinary JSQ policy. We will show that under this hypothesis, the process $\{Q_2^{d(N)}(t)/\sqrt{N}\}_{t \geq 0}$ is not stochastically bounded, which will then lead to a contradiction.
In order to show this, we will choose an appropriate $c(N)$ such that $c(N)/\sqrt{N} \to \infty$ as $N \to \infty$, and the process $\{Q_2^{\Pi(c(N))}(t)/\sqrt{N}\}_{t \geq 0}$ is not stochastically bounded. The conclusion then follows by the application of Lemma 5.2.
Observe that the martingale decomposition (5.4) of the scaled $Q_2^{\Pi(c(N))}(\cdot)$ process consists of a martingale part $M^N(t)/\sqrt{N}$, a term for the tasks assigned to servers with queue length one, and a departure term involving $\int_0^t Q_2^{\Pi(c(N))}(s)\,ds$.

The relevant estimates hold for any $t \geq 0$ and require choosing $g(N)$ such that $g(N)/\omega(N) \to 0$. Note that for any $\omega(N) \to \infty$, this choice of $g(N)$ is feasible (choose $g(N) = \sqrt{\omega(N)}$, say). Furthermore, the process $\{M^N(t)/\sqrt{N}\}_{t \geq 0}$ in (5.4) is stochastically bounded due to the martingale FCLT [9, Theorem 7.1] and our hypothesis. Now we can conclude that for the above choices of $g(N)$ and $\omega(N)$, the process $\{Q_2^{\Pi(c(N))}(t)/\sqrt{N}\}_{t \geq 0}$, and hence the process $\{Q_2^{d(N)}(t)/\sqrt{N}\}_{t \geq 0}$ (due to Lemma 5.2), is not stochastically bounded. Therefore, the limit does not coincide with the limit of the scaled $Q_2^{\mathrm{JSQ}}$-process.
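As a rough guide to the mechanism, the following sketch writes out one plausible form of such a decomposition for the $\Pi(c(N))$ scheme, together with the order of its drift. It is an illustration under stated assumptions, not a restatement of the paper's displays: the parametrization $c(N) = g(N)\sqrt{N}$ and $d(N) = \sqrt{N}\log N/\omega(N)$ is hypothetical, the boundary effect at $Q_2 = N$ is ignored, and $\lambda(N)/N \to 1$ is used.

```latex
\begin{align*}
\frac{Q_2^{\Pi(c(N))}(t)}{\sqrt{N}}
  &= \frac{Q_2^{\Pi(c(N))}(0)}{\sqrt{N}} + \frac{M^N(t)}{\sqrt{N}}
   + \frac{\lambda(N)\,t}{\sqrt{N}}\Bigl(1 - \frac{c(N)}{N}\Bigr)^{d(N)}
   - \frac{1}{\sqrt{N}}\int_0^t Q_2^{\Pi(c(N))}(s)\,ds, \\
\Bigl(1 - \frac{c(N)}{N}\Bigr)^{d(N)}
  &\approx \exp\Bigl(-\frac{c(N)\,d(N)}{N}\Bigr)
   = \exp\Bigl(-\frac{g(N)}{\omega(N)}\log N\Bigr)
   = N^{-g(N)/\omega(N)}, \\
\frac{\lambda(N)}{\sqrt{N}}\Bigl(1 - \frac{c(N)}{N}\Bigr)^{d(N)}
  &\approx N^{1/2 - g(N)/\omega(N)} \to \infty
   \quad \text{if } g(N)/\omega(N) \to 0.
\end{align*}
```

Under these assumptions the arrival drift on the $\sqrt{N}$ scale diverges while the martingale part stays bounded, which suggests why the scaled process cannot remain stochastically bounded.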

Conclusion
In the present paper we have established universality properties for power-of-d load balancing schemes in many-server systems. Specifically, we considered a system of $N$ parallel exponential servers and a single dispatcher which assigns arriving tasks to the server with the shortest queue among $d(N)$ randomly selected servers. We developed a novel stochastic coupling construction to bound the difference in the queue length processes between the JSQ policy ($d(N) = N$) and a scheme with an arbitrary value of $d(N)$. As it turns out, a direct comparison between the JSQ policy and a JSQ($d(N)$) scheme is a significant challenge. Hence, we adopted a two-stage approach based on a novel class of schemes which always assign the incoming task to one of the $n(N) + 1$ servers with the smallest number of tasks. Just like the JSQ($d(N)$) scheme, these schemes may be thought of as 'sloppy' versions of the JSQ policy. Indeed, the JSQ($d(N)$) scheme is guaranteed to identify the server with the minimum number of tasks, but only among a randomly sampled subset of $d(N)$ servers. In contrast, the schemes in the above class only guarantee that one of the $n(N) + 1$ servers with the smallest number of tasks is selected, but across the entire system of $N$ servers. We showed that the system occupancy processes for an intermediate blend of these schemes are simultaneously close on a $g(N)$ scale ($g(N) = N$ or $g(N) = \sqrt{N}$) to both the JSQ policy and the JSQ($d(N)$) scheme for suitably chosen values of $d(N)$ and $n(N)$ as functions of $g(N)$. Based on the latter asymptotic universality, it then sufficed to establish the fluid and diffusion limits for the ordinary JSQ policy. Thus, by deriving the fluid limit of the ordinary JSQ policy and using the above coupling argument, we established the fluid limit of the JSQ($d(N)$) scheme in a regime with $d(N) \to \infty$ as $N \to \infty$, along with the corresponding fixed point. The fluid limit turns out not to depend on the exact growth rate of $d(N)$, and in particular coincides with that for the ordinary JSQ policy. We further leveraged the coupling to prove that the diffusion limit in the Halfin-Whitt regime with $d(N)/(\sqrt{N} \log(N)) \to \infty$ as $N \to \infty$ corresponds to that for the JSQ policy. These results indicate that the optimality of the JSQ policy can be preserved at the fluid level and diffusion level while reducing the overhead by nearly a factor O($N$) and O($\sqrt{N}/\log(N)$), respectively. In future work we plan to extend the results to heterogeneous servers and non-exponential service requirement distributions. We also intend to pursue extensions to network scenarios and server-task compatibility constraints.