Learning-Based Pricing and Matching for Two-Sided Queues

Published Online:https://doi.org/10.1287/stsy.2024.0073

Abstract

We consider a dynamic system with multiple types of customers and servers. Each type of waiting customer or server joins a separate queue, forming a bipartite graph with customer-side queues and server-side queues. The platform can match the servers and customers and then the matched pairs leave the system. The platform will charge a customer a price according to their type when they arrive and will pay a server a price according to their type. The arrival rate of each queue is determined by the price according to some unknown demand or supply functions. Our goal is to design pricing and matching algorithms to maximize the profit of the platform with unknown demand and supply functions while keeping queue lengths of both customers and servers below a predetermined threshold. The difficulties of the problem include simultaneous learning and decision making and the tradeoff between maximizing profit and minimizing queue length. We use a longest-queue-first matching algorithm and propose a learning-based pricing algorithm, which combines gradient-free stochastic projected gradient ascent with bisection search. We prove that our proposed algorithm yields a sublinear regret O˜(T5/6) and queue-length bound O˜(T2/3), where T is the time horizon. We further establish a tradeoff between the regret bound and the queue-length bound.

Funding: This research was supported in part by the National Science Foundation [Grants 2112471, 2207548, 2228974, and 2240981].

Supplemental Material: The online appendix is available at https://doi.org/10.1287/stsy.2024.0073.

1. Introduction

We study pricing and matching in two-sided queueing systems with multiple types of customers and servers. Take ride sharing (Uber, Lyft, etc.) as an example. A ride order (request) can be viewed as a customer, and an available driver can be viewed as a server. A ride order can have multiple types. For example, we can view different numbers of passengers in the order as different customer types. We can also view different pick-up locations of the order as different customer types. A driver can also have multiple types. For example, we can view different vehicle capacities as different server types. We can also view different locations of the drivers as different server types. Each type of waiting ride order or driver joins a separate queue, forming a bipartite graph with customer-side (passenger-side) queues and server-side (driver-side) queues. We consider a discrete-time system. At each time slot, the ride-sharing platform will determine a price for each customer or server type. If a passenger accepts the price, they will be charged and join the queue. Similarly, if a driver accepts the price, they will be paid and join the queue. Therefore, the arrival rate of each type of ride order (or driver) is governed by the corresponding price according to some unknown demand (or supply) function. After that, the platform will match the drivers and ride orders in the queues if their types are compatible. For example, if the customer type is the number of passengers in the ride order and the server type is the capacity of the vehicle, then their types are compatible if the capacity of the vehicle is no less than the number of passengers in the ride order. The matched pairs then leave the system. The same process repeats in the next time slot. The profit of the platform is equal to the income from ride orders minus the cost paid to drivers. Assuming the demand and supply functions are unknown, the goal is to design pricing and matching algorithms to maximize the total profit of the platform over a finite time horizon while keeping the queue lengths of both customers and servers below a predetermined threshold.

Similar pricing and matching problems in two-sided queues have been studied in the literature (Caldentey et al. 2009; Adan and Weiss 2012; Gurvich and Ward 2015; Nguyen and Stolyar 2018; Chen and Hu 2020; Varma et al. 2020, 2023; Hu and Zhou 2022; Vaze and Nair 2022; Aveklouris et al. 2023, 2024). Caldentey et al. (2009), Adan and Weiss (2012), Hu and Zhou (2022), and Aveklouris et al. (2023, 2024) study matching problems in two-sided queueing systems with multiple types of demand and multiple types of supply. Nguyen and Stolyar (2018) study a two-sided queueing system with on-demand servers with the goal of designing adaptive server invitation scheme to minimize waiting times. Gurvich and Ward (2015) study a matching problem for multisided queues. However, these works do not consider pricing. Varma et al. (2020) and Chen and Hu (2020) study pricing and matching problems in two-sided queues, but they consider scenarios with strategic demand side and/or supply side, which is different from our setting, where customers and servers are not strategic, and their arrival rates are based on demand and supply functions. Vaze and Nair (2022) consider a two-sided queueing system where servers arrive at a fixed rate, and the arrival rate of customers is controlled by the price. They consider one-sided pricing and single-type customers and servers, whereas we consider two-sided pricing and matching and multiple customer and server types. The most relevant work in the literature is Varma et al. (2023), whose model is similar to ours. The differences between Varma et al. (2023) and ours include the following:

  • The key difference is that they assume that the platform knows the demand and supply functions, whereas in our setting, the demand and supply functions are unknown.

  • The objective in Varma et al. (2023) is the asymptotic long run expected profit of the platform subtracting a penalty that is linear in the queue length, whereas our objective is to maximize the total profit of the platform over a finite time horizon while keeping the queue lengths below a predetermined threshold during the entire time horizon.

  • Varma et al. (2023) consider a large-scale regime in which they scale the arrival rates and study the profit loss in steady state. We focus on finite-time analysis without scaling the arrival rates.

There are some other related works about learning-based dynamic pricing and matching. Besbes and Zeevi (2015) study dynamic pricing with demand learning but they do not consider queueing. Chen and Shi (2024) study a dual-sourcing inventory system with demand learning, which is different from our two-sided queueing system. Tassiulas and Ephremides (1990) and Srikant and Ying (2014) study MaxWeight algorithms for scheduling in one-sided multiserver queueing systems. Our matching algorithm can also be seen as a MaxWeight algorithm but applied to a two-sided matching setting (Varma et al. 2023). Flaxman et al. (2004), Agarwal et al. (2010), Duchi et al. (2015), Shamir (2017), and Lattimore (2024) study zero-order gradient descent algorithms, some ideas from which are incorporated into parts of our pricing algorithm. There are other online learning algorithms in queueing systems (Chen et al. 2023, 2024; Yang et al. 2023), but they consider only one-sided queueing system.

The key challenge of our setting is that the demand and supply functions are unknown. Learning the demand and supply functions explicitly will not be adaptive if the demand or supply functions change over time. Therefore, we propose to learn and make pricing and matching decisions simultaneously. Although several candidate approaches exist, they are not suitable for our setting. One idea is to formulate the problem into a Markov decision process (MDP) and then use reinforcement learning (RL) to solve the MDP approximately. However, the state space of this MDP increases exponentially with the number of queues (i.e., customer and server types), so it is difficult to solve due to the curse of dimensionality. Another approach is to use bandit algorithms from Lipschitz bandits (Slivkins 2019). For example, one may discretize the price space and apply upper confidence bound (UCB)-based algorithms. However, such approaches fail to balance the matching rates and arrival rates on both sides, making it difficult to control the queue lengths. These algorithms also have a high computational complexity because the number of arms is large. Approaches from bandit convex optimization (Lattimore 2024) aim to solve convex optimization problems with bandit feedback. However, the control variables in our setting are prices and the profit function is not concave in terms of prices. The constraint set representing a set of balanced arrival rates is also not convex in terms of prices.

To deal with this challenge, we propose to use a longest-queue-first matching algorithm and a learning-based pricing algorithm, which combines gradient-free (zero-order) stochastic projected gradient ascent (Agarwal et al. 2010) with bisection search. We also borrow the idea of fluid pricing policy from Varma et al. (2023) to control the queue length by rejecting arrivals according to a predetermined queue-length threshold. In this paper, we prove that our proposed algorithm yields a sublinear regret O˜(T5/6) when compared with a fluid-based baseline and guarantees an anytime queue-length bound O˜(T2/3), where T is the time horizon. Furthermore, we establish a tradeoff between the regret bound and the queue-length bound: For any γ(0,2/3], if we keep the queue lengths of both customers and servers below Θ˜(Tγ), then we have a regret upper bound O˜(T1γ/4). We prove a hardness result: The regret is at least Ω(T1γ) to maintain the queue length below Tγ for a single-link system with demand and supply functions from a specific class, even when the functions are known.

2. Model

We consider a discrete-time system with two-sided queues: a customer side and a server side. There are multiple queues on each side, representing different types of customers and different types of servers. We can also view customers as demand and servers as supply. We model the system by a bipartite graph G(IJ,E), where I={1,2,,I} is the set of customer types and J={1,2,,J} is the set of server types with |I|=I and |J|=J. E is the set of all compatible links, which means that a type i customer can be served by a type j server if and only if (i,j)E. Figure 1 is an example with three types of customers and two types of servers (I=3,J=2). At each time slot, there are arrivals of customers and servers. At time slot t, let Ac,i(t) and As,j(t) denote the number of arrivals of type i customers and type j servers, respectively. Assume that Ac,i(t) and As,j(t) follow independent Bernoulli distributions. Let λi(t)E[Ac,i(t)] and μj(t)E[As,j(t)]. At each time slot, after the arrival of customers and servers, the platform will match the customers and servers in the queues through compatible links. Once a customer is matched with a server, that is, the customer is served by the server, they will leave the system immediately. Let Qc,i(t) and Qs,j(t) denote the queue length of type i customers and that of type j servers, respectively, before arrivals and matching at time slot t. Assume the queues are all empty at t=1.

Figure 1. Model
Note. An example with three types of customers and two types of servers.

There is also a price attached to each type of customer or server. Let pc,i(t) denote the price of the type i customer and ps,j(t) denote the price of the type j server at time slot t. At time slot t, when a customer of type i arrives, they will be charged pc,i(t) units of money by the platform. When a server of type j arrives, they will be paid ps,j(t) units of money by the platform. Note that the prices determine the arrival rates. Increasing the price of type i customers will reduce arrival rate λi(t); increasing the price of type j servers will increase arrival rate μj(t). For a type i customer, the demand function is denoted by Fi:[0,1][pc,i,min,pc,i,max], where the input is an arrival rate λi[0,1], the output is the price pc,i[pc,i,min,pc,i,max] corresponding to this arrival rate, and pc,i,min and pc,i,max are the minimum and maximum prices for type i customer, respectively. Similarly, for type j server, the supply function is denoted by Gj:[0,1][ps,j,min,ps,j,max], where the input is an arrival rate μj[0,1], the output is the price ps,j[ps,j,min,ps,j,max] corresponding to this arrival rate, and ps,j,min and ps,j,max are the minimum and maximum prices for a type j server, respectively. Then we have pc,i(t)=Fi(λi(t)) and ps,j(t)=Gj(μj(t)). We make the following assumption on the demand and supply functions Fi and Gj.

Assumption 1.

For any customer type i, Fi is strictly decreasing, bijective, and LFi-Lipschitz. Let Fi1 denote the inverse function of Fi. Assume that Fi1 is LFi1-Lipschitz. For any server type j, Gj is strictly increasing, bijective, and LGj-Lipschitz. Let Gj1 denote the inverse function of Gj. Assume that Gj1 is LGj1-Lipschitz.

From Assumption 1, we know Fi(1)=pc,i,min,Fi(0)=pc,i,max for any customer type i and Gj(0)=ps,j,min,Gj(1)=ps,j,max for any server type j.

The sequence of events occurring in each time slot is shown in Figure 2. At the beginning of time slot t, we first observe the lengths of all queues Qc,i(t),Qs,j(t) for all i, j. Then the platform runs a pricing algorithm to decide the prices pc,i(t),ps,j(t),i,j. The arrival rates λi(t),μj(t),i,j will be determined by these prices, and then the actual arrivals Ac,i(t),As,j(t),i,j occur. Next, the platform runs a matching algorithm and then those matched customers and servers leave the system.

Figure 2. Timeline in Each Time Slot

At time slot t, the platform makes a profit of

iAc,i(t)pc,i(t)jAs,j(t)ps,j(t)=iAc,i(t)Fi(λi(t))jAs,j(t)Gj(μj(t)).

Our goal is to design an online pricing and matching algorithms to maximize the profit of the platform over T time slots without knowing the demand and supply functions while keeping queue lengths of both customers and servers below a predetermined threshold, that is, to maximize

t=1T[iAc,i(t)Fi(λi(t))jAs,j(t)Gj(μj(t))],(1)
while keeping
maxt=1,,Tmax{maxiQc,i(t),maxjQs,j(t)}
below a predetermined threshold. To quantify the performance of the online algorithm, we compare it with the following optimal solution to a fluid-based optimization problem (Varma et al. 2023):
maxλ,μ,xiλiFi(λi)jμjGj(μj)(2)
s.t.λi=j:(i,j)Exi,j,for all iI,(3)
μj=i:(i,j)Exi,j,for all jJ,(4)
xi,j0,for all (i,j)E,(5)
λi,μj[0,1],for all iI, jJ,(6)
where x(xi,j)(i,j)E, λ(λi)iI, and μ(μj)jJ. Similar to Varma et al. (2023), we make the following assumption.

Assumption 2.

For any customer type i, the revenue function ri(λi)λiFi(λi) is concave. For any server type j, the cost function cj(μj)μjGj(μj) is convex.

From Assumption 2, Objective (2) is concave. Note that the constraints are all affine. Hence, the baseline optimization problem (2)–(6) is concave, and there exists a solution that achieves the optimal value, denoted as (λ*,μ*,x*).

Let λi(t) and μj(t) be the arrival rates at time t under any policy. Then from (1), the expected profit under the policy is

t=1TE[iAc,i(t)Fi(λi(t))jAs,j(t)Gj(μj(t))]=t=1TE[iλi(t)Fi(λi(t))jμj(t)Gj(μj(t))].

Let f* denote the optimal value of the fluid optimization problem (2)–(6). For the profit maximization, we consider the following expected regret with respect to the fluid baseline:

E[R(T)]Tf*t=1TE[iAc,i(t)Fi(λi(t))jAs,j(t)Gj(μj(t))]=t=1T((iλi*Fi(λi*)jμj*Gj(μj*))E[iλi(t)Fi(λi(t))jμj(t)Gj(μj(t))]),(7)
where (λ*,μ*) is an optimal solution to the fluid optimization problem. The following lemma establishes the connection between the optimal expected profit and f* over T time slots.

Lemma 1.

Let Assumption 1 and Assumption 2 hold. Then under any policy, we have

t=1TE[iλi(t)Fi(λi(t))jμj(t)Gj(μj(t))]Tf*i(LFi+pc,i,max)E[Qc,i(T+1)]+j(LGj+ps,j,max)E[Qs,j(T+1)].

The proof of this lemma can be found in Appendix A. Lemma 1 means that the fluid baseline approximates the optimal profit, with an error at most proportional to the queue length at time T. This justifies the use of the regret definition (7). Specifically, let E[Ropt(T)] denote the expected regret with respect to the policy that maximizes the expected profit. Then Lemma 1 implies that

E[Ropt(T)]E[R(T)]i(LFi+pc,i,max)E[Qc,i(T+1)]+j(LGj+ps,j,max)E[Qs,j(T+1)].

As long as the queue lengths E[Qc,i(T+1)] and E[Qs,j(T+1)] are orderwise smaller than E[R(T)], the regret defined in (7) approximates the “true” regret (with respect to the optimal expected profit). We consider policies that make the queue mean rate stable (Neely 2022); that is, under the policy, for all i, j,

limT1TE[Qc,i(T)]=0,limT1TE[Qs,j(T)]=0.

Under any mean rate stable policy, using Lemma 1 and taking limit T+, we obtain

limsupT+1Tt=1TE[iλi(t)Fi(λi(t))jμj(t)Gj(μj(t))]f*,
which means that the optimal value of the fluid optimization problem is an upper bound on the asymptotic time-averaged expected profit under any mean rate stable policy. This further justifies the use of the fluid optimal value as a baseline to define regret.

We make another assumption on the problem.

Assumption 3.

There exists a known positive number amin(0,1) such that there exists an optimal solution λ*,μ* to the fluid optimization problem that satisfies λi*amin and μj*amin for all i and j.

Assumption 3 means that we need the optimal arrival rates to be strictly greater than zero. This assumption is not restrictive, because any queue with an optimal arrival rate of zero, indicating it is redundant, can be easily identified in practice and removed from the problem formulation. Note that amin>0 can be any lower bound on the optimal arrival rates.

Assume that the functions Fi and Gj are unknown. The platform cannot directly control the arrival rates, but they can control the prices of the customers and the servers and the matching algorithm. The platform can also observe the number of arrivals and the queue lengths of all the queues. At each time slot, the platform needs to design a price for each queue (customer or server) and to make a decision on how the customers and servers are matched. The objective is to minimize the regret E[R(T)] while keeping the maximum queue length maxt=1,,Tmax{maxiQc,i(t),maxjQs,j(t)} below a predetermined threshold. This anytime queue length constraint is necessary in certain practical scenarios—for instance, when the system has a finite buffer size for each queue. Moreover, customers’ experience is often affected by the anytime queue length because they typically expect a consistent level of service.

2.1. Notation

We use bold symbols to represent vectors in R|E| or RI or RJ, such as x,λ,μ. We use subscript i or j to indicate an element of a vector, for example, λi is the ith element of λ. Note that we view x=(xi,j)(i,j)E as a vector in R|E| rather than a matrix, and the order of elements in x can be arbitrary as long as the order is fixed. We use subscript c to indicate the customer side and s for the server side. g(T)=O(h(T)) means that there exist positive constants C and T0 such that g(T)Ch(T) for all TT0; g(T)=Θ(h(T)) means that there exist positive constants C1, C2 and T0 such that C1h(T)g(T)C2h(T) for all TT0; and g(T)=Ω(h(T)) means that there exist positive constants C and T0 such that g(T)Ch(T) for all TT0. We omit the base of logarithmic functions in O(·), Θ(·), and Ω(·), for example, Θ(log(T)), because the base does not affect the order. We use log2(·) to denote (log(·))2. We use O˜(·), Θ˜(·), and Ω˜(·) to suppress polylog(T) factors.

3. Summary of Our Main Results

In this section, we provide an overview of the main results in this paper. We use a longest-queue-first matching algorithm, which is a discrete-time version of the matching algorithm in Varma et al. (2023). We propose a learning-based pricing algorithm, which combines gradient-free (zero-order) stochastic projected gradient ascent (Agarwal et al. 2010) with bisection search, illustrated in Section 4 in detail. The proposed pricing algorithm learns from samples and make pricing decisions simultaneously. Under the proposed matching and pricing algorithms, we establish the following profit-regret bound and queue-length bound:

E[R(T)]=O˜(T56),maxt=1,,Tmax{maxiQc,i(t),maxjQs,j(t)}=O˜(T23).

By changing the parameters of the proposed pricing algorithm, we can achieve the following tradeoff between regret and queue length: For any γ[0,2/3], there exists a set of parameters such that the algorithm can achieve:

E[R(T)]=O˜(T1γ4),maxt=1,,Tmax{maxiQc,i(t),maxjQs,j(t)}=O˜(Tγ).

The above results are shown in Figure 3. As the allowable queue length increases, the regret bound that we can achieve becomes better. However, if we allow the queue length to increase over Θ˜(T2/3), the regret bound cannot be further improved and remains O˜(T5/6).

Figure 3. Tradeoff Between Regret and Queue Length

3.1. Hardness Result

We show in the following lemma that in a single-link system, any policy that keeps the anytime queue length bounded by Tγ must incur a regret of at least Ω(T1γ).

Lemma 2.

Consider a single-link system with one customer queue and one server queue with demand function F(·) and supply function G(·). Let Qc(t) and Qs(t) denote the customer-side queue length and server-side queue length, respectively. Consider the matching policy that matches all pairs whenever possible because there is no incentive to delay any match in a single-link system. There exists a class of problem instances that satisfies Assumption 1, Assumption 2, and Assumption 3, such that for any pricing policy satisfying maxt=1,,Tmax{Qc(t),Qs(t)}Tγ, the expected regret satisfies E[R(T)]=Ω(T1γ) for any γ[0,1/2).

The definition of the class of problem instances and the proof of Lemma 2 can be found in Appendix B. Our upper bound results under the proposed algorithm also hold for this problem class. Note that when the queue length threshold is a constant (i.e., γ=0), the regret under our algorithm becomes linear, which is unavoidable by Lemma 2.

4. Algorithm

In this section, we present the proposed matching and pricing algorithms in detail. The matching algorithm is to match the customers on the demand-side queues with the servers on the supply-side queues, and the pricing algorithm is to learn and design the prices without knowing the demand and supply functions.

4.1. Matching Algorithm

We use a longest-queue-first matching algorithm as shown in Algorithm 1 and Figure 4, where we present the matching algorithm for one time slot, and it is the same for all time slots. In the algorithm, for each time slot, we iterate over each queue on both the customer and server sides. For each queue, we check whether there is a new arrival in the queue. If there is a new arrival (a customer or a server), the algorithm will match the arrival with one server (or customer) in the longest compatible queue on the other side, as shown in Figure 4. After matching, we decrease the queue length by one for both matched queues. Then we move on to the next queue and repeat the process.

Figure 4. Example of Longest-Queue-First Matching Algorithm
Notes. At each time slot, for each queue, if there is a new arrival, it will be matched with one server (or customer) in the longest compatible queue on the other side. Then the matched pairs leave the system and we move on to check the next queue and repeat the same process.
Algorithm 1

(Matching Algorithm)

  • 1 At each time slot t, without loss of generality, assume that the arrivals come in order Ac,1(t),Ac,2(t),,Ac,I(t),As,1(t),As,2(t),,As,J(t).

  • 2 Initialize: Q˜c,iQc,i(t) for all i and Q˜s,jQs,j(t) for all j.

  • 3 for i=1 to I do

  • 4  Observe Ac,i(t) and update Q˜c,iQ˜c,i+Ac,i(t);

  • 5  Let ji*(t)argmaxj:(i,j)EQ˜s,j with ties broken arbitrarily;

  • 6  if Ac,i(t)=1 and j:(i,j)EQ˜s,j>0 then

  • 7   Match the arrival Ac,i(t) with a server in queue ji*(t);

  • 8   Then update Q˜c,iQ˜c,i1 and Q˜s,ji*(t)Q˜s,ji*(t)1;

  • 9 for j=1 to J do

  • 10  Observe As,j(t) and update Q˜s,jQ˜s,j+As,j(t);

  • 11  Let ij*(t)=argmaxi:(i,j)EQ˜c,i with ties broken arbitrarily;

  • 12  if As,j(t)=1 and i:(i,j)EQ˜c,i>0 then

  • 13   Match the arrival As,j(t) with a customer in queue ij*(t);

  • 14   Then update Q˜s,jQ˜s,j1 and Q˜c,ij*(t)Q˜c,ij*(t)1;

4.2. Pricing Algorithm

Because the functions Fi,Gj are unknown and the arrival rates cannot be directly controlled, we cannot simply solve the optimization problem (2)–(6) and obtain the prices. In this section, we propose to combine the techniques of gradient-free (zero-order) stochastic projected gradient descent (Agarwal et al. 2010) and bisection search to solve the problem.

Before presenting the details of our pricing algorithm, we begin by providing a simple example to understand the idea of our algorithm. We consider two customer queues, which are connected to one server queue. Let x1,x2 denote the arrival rates of customer queue 1 and customer queue 2, respectively. Then the arrival rate of the server queue should be equal to x1+x2 due to the balance constraints (3) and (4). Let x(k)(x1(k),x2(k)), where k denote the iteration. Suppose the demand and supply functions are known and the gradients g(k) are known; that is, we have first-order access to the profit function f. In this case, we can use the gradient ascent algorithm to solve the problem, as shown in Figure 5(a), where η is the step size. Suppose the demand and supply functions are known but the gradients are unknown; that is, we only have zero-order access to the profit function f. Then we can use the two-point zero-order method (Agarwal et al. 2010) to solve the problem, as shown in Figure 5(b). In the two point method, in each iteration k, we first sample a uniformly random direction u(k) (u(k) is a unit vector) and then calculate the profits using two points, x(k)+δu(k) and x(k)δu(k), where δ is small number. Then we can estimate the gradient and update x(k) using the formula in Figure 5(b). However, if the demand and supply functions and the gradients are all unknown, simply using the two-point method does not work because we even do not have zero-order access to the profit function f. As shown in Figure 5(c), for x=x(k)+δu(k) or x=x(k)δu(k), to estimate the profit f(x), we need to estimate the prices of all queues F1(x1),F2(x2),G1(x1+x2) because f(x)=x1F1(x1)+x2F2(x2)(x1+x2)G1(x1+x2), where x1+x2 is the arrival rate of the server queue. Take F1(x1) as an example. Estimating F1(x1) is equivalent to finding the solution to the equation x1=F11(p) given x1. We propose using bisection search on prices to estimate p such that x1=F11(p), as shown in Figure 5(c), where the horizontal axis is price p, the vertical axis is arrival rate x of the queue, and the green dashed curve is the inverse demand function x=F11(p). We first have the initial knowledge that the price F1(x1) will be between some lower bound and some upper bound, which forms our first search interval. Then we set the price of the queue to be the midpoint of the search interval and run the system for multiple time slots to calculate average number of arrivals per time slot. This is how we get an estimated value of F11(p) for a chosen p. Then we determine the next search interval by comparing it with x1, as shown in Figure 5(c). The bisection search stops when the accuracy is good enough. Note that during the process of running the system, we reject arrivals if the queue length is larger than or equal to a predetermined threshold, to control the queue length. After the bisection search, we obtain estimates of the profits f(x(k)+δu(k)) and f(x(k)δu(k)). Then we can substitute these estimates into the two-point method in Figure 5(b) to get an estimate of the gradient and then update the arrival rate x(k).

Figure 5. Pricing Algorithm
Notes. This is an example with two customer queues connected to one server queue. Let x1 and x2 denote the arrival rates of the two customer queues, respectively. The arrival rate of the server queue should be x1+x2 because of the balance constraint. Let x(k)(x1(k),x2(k)), where k denotes the iteration.

Now we present the details of the pricing algorithm in the general case. The structure of our pricing algorithm is shown in Figure 6. Before presenting the algorithm, we first define some notations and a shrunk feasible set. Let Ec,i{j|(i,j)E} and Es,j{i|(i,j)E} denote the sets of all queues that are connected to customer queue i and server queue j, respectively. By substituting λi,μj and by Assumption 3, the optimization problem (2)–(6) can be equivalently rewritten as

maxxf(x)i(jEc,ixi,j)Fi(jEc,ixi,j)j(iEs,jxi,j)Gj(iEs,jxi,j)(8)
s.t.jEc,ixi,j[amin,1],i=1,,I,(9)
iEs,jxi,j[amin,1],j=1,,J,(10)
xi,j0,(i,j)E.(11)

Figure 6. Structure of the Pricing Algorithm

Note that we view x(xi,j)(i,j)E as a vector in R|E|. Note that the problem (8)–(11) is also concave. Let DR|E| denote the feasible set of the problem (8)–(11), that is,

D{x|xi,j0 for all i,j,jEc,ixi,j[amin,1] for all i,iEs,jxi,j[amin,1] for all j}.(12)

Let Ni,jmax{|Ec,i|,|Es,j|} denote the maximum cardinality of the sets Ec,i and Es,j. Define a shrunk set of D with a parameter δ as follows:

D{x| for all i,j,xi,jamin+12Ni,j(1δr)amin+12Ni,j,jEc,i(xi,jamin+12Ni,j)[(1δr)(jEc,iamin+12Ni,jamin),(1δr)(1jEc,iamin+12Ni,j)],iEs,j(xi,jamin+12Ni,j)[(1δr)(iEs,jamin+12Ni,jamin),(1δr)(1iEs,jamin+12Ni,j)]},(13)
where
rmini,j{1+amin2Ni,j,1|Ec,i|(1jEc,iamin+12Ni,j),1|Es,j|(1iEs,jamin+12Ni,j),1|Ec,i|(jEc,iamin+12Ni,jamin),1|Es,j|(iEs,jamin+12Ni,jamin)},(14)
and δ(0,r). With the definition of D, we can show (Lemma 4 in the Online Appendix) that x+δuD for any xD and any vector u in the unit ball. To ensure that the definition of D is valid, we need the following assumption.

Assumption 4.

Assume that jEc,iamin+12Ni,jamin>0 for all i and iEs,jamin+12Ni,jamin>0 for all j.

As long as amin is small enough, Assumption 4 holds. Because Ni,j|Ec,i| and amin<1, we have

1jEc,iamin+12Ni,j1amin+12>0
for all i. Also, by Assumption 4, jEc,i(amin+1)/(2Ni,j)amin>0 for all i. Similarly, we have 1iEs,j(amin+1)/(2Ni,j)>0 and iEs,j(amin+1)/(2Ni,j)amin>0 for all j. Hence, r>0 and the definition of D is valid. Let xctr((amin+1)/(2Ni,j))(i,j)E. We can see that xctrD, which can be used as the initial point for the pricing algorithm; xctr can be thought of as the “center” of the set D.

Algorithm 2

(Pricing Algorithm)

  • 1 Initialize: Choose an exploration parameter δ(0,r). Choose a step size η(0,1). Choose an accuracy parameter ϵ(0,1/e). Choose a queue length threshold qth. Choose x(1)=xctrD as the initial point. Define ec,i and es,j according to (C.1) and (C.2). Define Nβln(1/ϵ)/ϵ2 and Mlog2(1/ϵ), where β>0 is a constant.

  • 2 Time step counter t1;

  • 3 Outer iteration counter k1;

  • 4 repeat

      // Step 1. generate a random direction and two points

  • 5  Choose a unit vector u(k)R|E| uniformly at random, that is, u(k)2=1;

  • 6  Let xi,j+(k)(x(k)+δu(k))i,j and xi,j(k)(x(k)δu(k))i,j;

  • 7  Let λi+(k)jEc,ixi,j+(k), λi(k)jEc,ixi,j(k), μj+(k)iEs,jxi,j+(k), μj(k)iEs,jxi,j(k);

  • 8  Let λ+(k) be a vector of λi+(k),i=1,I. Define similarly λ(k), μ+(k), and μ(k).

      // Step 2. bisection search to approximate Fi(λi+(k)),Fi(λi(k)),Gj(μj+(k)),Gj(μj(k)).

  • 9  if k=1 then

  • 10   For all i, let p¯c,i+(k,1)=pc,i,min, p¯c,i+(k,1)=pc,i,max;

  • 11   For all j, let p¯s,j+(k,1)=ps,j,min, p¯s,j+(k,1)=ps,j,max;

  • 12   Let q˜th=+; // We do not reject arrivals for k=1.

  • 13  else

  • 14   For all i, let p¯c,i+(k,1)=pc,i+(k1,M)ec,i, p¯c,i+(k,1)=pc,i+(k1,M)+ec,i;

  • 15   For all j, let p¯s,j+(k,1)=ps,j+(k1,M)es,j, p¯s,j+(k,1)=ps,j+(k1,M)+es,j;

  • 16   Let q˜th=qth;

  • 17  Let p¯c+(k,1) be a vector of p¯c,i+(k,1),i=1,I. Define similarly p¯c+(k,1), p¯s+(k,1), p¯s+(k,1).

  • 18  t, pc+(k,M), ps+(k,M)= Bisection (λ+(k),μ+(k),p¯c+(k,1),p¯c+(k,1),p¯s+(k,1),p¯s+(k,1),q˜th,t,M,N,ϵ);

  • 19  Do Line 9–18 for λ(k),μ(k). Denote the counterparts of the prices by p¯c(k,1), p¯c(k,1), pc(k,M), p¯s(k,1), p¯s(k,1), ps(k,M).

    // Step 3. gradient calculation

  • 20  Let g^(k)=|E|2δ[(i=1Iλi+(k)pc,i+(k,M)j=1Jμj+(k)ps,j+(k,M)) (i=1Iλi(k)pc,i(k,M)j=1Jμj(k)ps,j(k,M))]u(k);

      // Step 4. gradient ascent update

  • 21  Projected Gradient Ascent: x(k+1)=ΠD(x(k)+ηg^(k));

  • 22  kk+1;

  • 23 until t>T;

The details of the proposed pricing algorithm are shown in Algorithm 2 and Algorithm 3. In the notation in Algorithms 2 and 3, (k) denotes the kth outer iteration; (k,m) denotes the mth bisection iteration in the kth outer iteration. As shown in Algorithm 2, we first choose δ,η,ϵ, and qth; δ(0,r) is an exploration parameter for estimating the gradient, and it is used to construct the set D. We choose x(1)=xctrD as the initial point of the algorithm; η(0,1) is the step size for the gradient ascent step; and ϵ(0,1/e) is the accuracy for the bisection algorithm. The threshold qth>0 is used to control queue length. We also define ec,i and es,j, which can be later proved to be the upper bounds of the change of the prices in one outer iteration. Note that the parameters δ,η,ϵ,qth can be a function of total number of time steps T and can be optimized later to obtain a sublinear regret. In each outer iteration k, we have four steps.

  • Step 1: We first generate a random direction, that is, a unit vector u(k)R|E| that is uniformly sampled from the unit sphere, and then change x(k) in the direction of u(k) and also in the opposite direction of u(k), generating xi,j+(k) and xi,j(k), as shown in Line 6 of Algorithm 2. We can then calculate the arrival rates λi+(k),μj+(k) and λi(k),μj(k) as shown in Line 7 of Algorithm 2.

  • Step 2: Next, we find the prices that approximate Fi(λi+(k)),Gj(μj+(k)), and Fi(λi(k)),Gj(μj(k)) using the bisection search shown in Algorithm 3, which will be illustrated in detail later in this section. Here, we specify the initial search intervals for the bisection search. If k=1, we start with search intervals [pc,i,min,pc,i,max] and [ps,j,min,ps,j,max], as shown in Lines 10 and 11 of Algorithm 2. Note that we do not reject arrivals in the first iteration because we want to quickly learn the price that can keep the queues balanced. If k>1, we use pc,i+(k1,Mk1+), ps,j+(k1,Mk1+),pc,i(k1,Mk1), and ps,j(k1,Mk1),ec,i,es,j to construct the lower and upper bounds for the bisection algorithm, as shown in Lines 14 and 15 of Algorithm 2, where ec,i=Θ(ηϵ/δ+η+δ+ϵ) and es,j=Θ(ηϵ/δ+η+δ+ϵ), and the exact expressions can be found in Appendix C.

  • Step 3: The bisection algorithm will output approximated prices pc+(k,Mk+), ps+(k,Mk+), pc(k,Mk+), and ps(k,Mk+) that correspond to λ+(k),μ+(k),λ(k),andμ(k), respectively, as shown in Lines 18 and 19 of Algorithm 2. Then we can use these arrival rates and prices to calculate an estimate g^(k) of the gradient of the objective function f(x) in (8), as shown in Line 20 of Algorithm 2.

  • Step 4: With this gradient estimate, we do a projected gradient ascent as shown in Line 21 of Algorithm 2. The algorithm stops until the number of time steps tT.

Algorithm 3

(Bisection (λ+/(k),μ+/(k),p¯c+/(k,1),p¯c+/(k,1),p¯s+/(k,1),p¯s+/(k,1),q˜th,t,M,N,ϵ))

  • 1 for m=1 to M do

  • 2  Let pc,i+/(k,m)=12(p¯c,i+/(k,m)+p¯c,i+/(k,m)) for all i;

  • 3  Let ps,j+/(k,m)=12(p¯s,j+/(k,m)+p¯s,j+/(k,m)) for all j;

  • 4  Let nc,i(k,m)=0 for all i and ns,j(k,m)=0 for all j;

  • 5  repeat

  • 6   for i=1 to I do

  • 7    if Qc,i(t)q˜th then set price pc,i,max for queue i;

  • 8    else set price pc,i+/(k,m) for queue i and nc,i(k,m)nc,i(k,m)+1;

  • 9   for j=1 to J do

  • 10    if Qs,j(t)q˜th then set price ps,j,min for queue j;

  • 11    else set price ps,j+/(k,m) for queue j and ns,j(k,m)ns,j(k,m)+1;

  • 12   Use the above set of prices to run one step of the system;

  • 13   tt+1;

  • 14   Terminate the algorithm when t>T;

  • 15  until for all queues i, nc,i(k,m)N, that is, the price pc,i+/(k,m) is run for at least N times and for all queues j, ns,j(k,m)N, that is, the price ps,j+/(k,m) is run for at least N times;

  • 16  Let tc,i+/(k,m,n) denote the time slot when the price pc,i+/(k,m) is run for the nth time for the customer-side queue i; let ts,j+/(k,m,n) denote the time slot when the price ps,j+/(k,m) is run for the nth time for the server-side queue j;

  • 17  for i=1 to I do

  • 18   Let λ^i+/(k,m)=(1/N)n=1NAc,i(tc,i+/(k,m,n)); // sample average

  • 19   if λ^i+/(k,m)>λi+/(k) then p¯c,i+/(k,m+1)=pc,i+/(k,m), p¯c,i+/(k,m+1)=p¯c,i+/(k,m);

  • 20   else p¯c,i+/(k,m+1)=p¯c,i+/(k,m), p¯c,i+/(k,m+1)=pc,i+/(k,m);

  • 21  for j=1 to J do

  • 22   Let μ^j+/(k,m)=(1/N)n=1NAs,j(ts,j+/(k,m,n)); // sample average

  • 23   if μ^j+/(k,m)>μj+/(k) then p¯s,j+/(k,m+1)=p¯s,j+/(k,m), p¯s,j+/(k,m+1)=ps,j+/(k,m);

  • 24   else p¯s,j+/(k,m+1)=ps,j+/(k,m), p¯s,j+/(k,m+1)=p¯s,j+/(k,m);

  • 25 Let pc+/(k,M) be a vector of pc,i+/(k,M),i=1,,I and define similarly ps+/(k,M).

  • 26 return t, pc+/(k,M), ps+/(k,M)

The proposed bisection algorithm is shown in Algorithm 3, which is to find the prices pc+/,ps+/ such that λi+/(k)Fi1(pc,i+/) and μj+/(k)Gj1(ps,j+/) for some given target arrival rates λi+/(k),μj+/(k) with initial intervals [p¯c,i+/(k,1),p¯ci+/(k,1)] and [p¯s,j+/(k,1),p¯s,j+/(k,1)] for all i, j. In each bisection iteration m, we first calculate the midpoints of the intervals, pc,i+/(k,m) and ps,j+/(k,m) for all i, j just like the standard bisection search, as shown in Lines 2 and 3 of Algorithm 3. For each queue, if the queue length is less than q˜th, we will set the price with pc,i+/(k,m) or ps,j+/(k,m) that was just calculated; otherwise, we will temporarily reject new arrivals into the queue to control the queue length by using the prices pc,i,max or ps,j,min. We will use this set of prices to run the system for a number of time slots such that every price has at least N i.i.d. samples, as shown in Lines 4–15 of Algorithm 3. Next, with the samples we obtained, we can calculate an estimate of the arrival rate corresponding to the price we set for each queue, λ^i+/(k,m) or μ^j+/(k,m), as shown in Lines 18 and 22 of Algorithm 3. Then we can update the lower and upper bounds of the searching intervals by comparing λ^i+/(k,m), μ^j+/(k,m) with λi+/(k,m), μj+/(k,m), as shown in Lines 19 and 20 and Lines 23 and 24 of Algorithm 3. We conduct M=log2(1/ϵ) bisection iterations to obtain sufficiently high accuracy.

5. Main Result

The main result of the paper is shown in Theorem 1.

Theorem 1.

Let Assumptions 14 hold. Under Algorithm 1, Algorithm 2, and Algorithm 3, if ϵ<δ, 2ec,ipc,i,maxpc,i,min, and 2es,jps,j,maxps,j,min, we have

E[R(T)]=O(log4(1/ϵ)ϵ4qth+Tqth+T2ϵβ2+1+Tϵβ21log(1/ϵ)+Tmax{log2(1/ϵ)ϵ2,qth}(η+δ)qth+Tϵδ+log2(1/ϵ)ηϵ2),(15)
and the queue length
maxt=1,,Tmax{maxiQc,i(t),maxjQs,j(t)}max{2log2(1/ϵ)βln(1/ϵ)ϵ2,qth}.(16)

Based on Theorem 1, we optimize the regret bound by setting the parameters ϵ,η,δ,qth to be a function of T. Assume that qth and β are large enough. Then we can first focus on minimizing the order of the following term:

Tη+Tδ+Tϵδ+log2(1/ϵ)ηϵ2.(17)

If we ignore logarithmic order, the optimal solution is ϵ=T1/3, η=δ=T1/6, and the optimal order of (17) is O˜(T5/6). By setting β=5 and qth=Θ(T2/3), we obtain E[R(T)]=O˜(T5/6). We summarize the above result as the following corollary.

Corollary 1.

Let Assumptions 14 hold. Under Algorithm 1, Algorithm 2, and Algorithm 3, with parameters ϵ=T1/3, η=δ=T1/6, qth=T2/3, β=5, for sufficiently large T such that δ<r, ϵ<1/e, 2ec,ipc,i,maxpc,i,min, and 2es,jps,j,maxps,j,min, we can achieve a sublinear regret as

E[R(T)]=O˜(T56),
and the queue length
maxt=1,,Tmax{maxiQc,i(t),maxjQs,j(t)}=O˜(T23).

From (15) and (16) in Theorem 1, we can see that there is a tradeoff between minimizing the regret and minimizing the queue length. Firstly, we should not set qth to be greater than the order of T2/3 because it will not benefit the regret bound orderwisely because qth=T2/3 is large enough to achieve the optimal order of (17). However, we can reduce qth so that the queue-length bound can be improved, although we may suffer a larger regret. Suppose we want the queue-length bound to be the order of O˜(Tγ), where γ<2/3. Then we need to set qth=Θ˜(Tγ) and ϵΘ˜(Tγ/2) by (16). From (15), we need to set the parameters ϵ,η,δ,β to optimize the regret bound. Assume β is large enough. Then optimizing the regret bound is equivalent to minimizing the order of the following term (ignoring logarithmic terms):

1ϵ4Tγ+T1γ+Tη+Tδ+Tϵδ+1ηϵ2.(18)

The optimal solution is ϵ=Tγ/2, η=δ=Tγ/4, and the optimal order of (18) is T1γ/4. By setting β=4/γ1, we obtain E[R(T)]=O˜(T1γ/4). We summarize the above tradeoff as the following corollary.

Corollary 2.

Let Assumptions 14 hold. For any γ(0,23], under Algorithm 1, Algorithm 2, and Algorithm 3, with parameters ϵ=Tγ/2, η=δ=Tγ/4, qth=Tγ, β=4/γ1, for sufficiently large T such that δ<r, ϵ<1/e, 2ec,ipc,i,maxpc,i,min, and 2es,jps,j,maxps,j,min, we can achieve a sublinear regret as

E[R(T)]=O˜(T1γ/4),
and the queue length
maxt=1,,Tmax{maxiQc,i(t),maxjQs,j(t)}=O˜(Tγ).

From Corollary 1 and Corollary 2, when the allowable queue length increases (up to Θ˜(T2/3)), the regret bound that we can achieve becomes better. However, even if we allow the queue length to increase over Θ˜(T2/3), the regret bound cannot be improved and remains O˜(T5/6).

6. Proof Roadmap

In this section, we present a proof roadmap for Theorem 1. The complete proof of Theorem 1 can be found in the Online Appendix. In the proof roadmap, we will consider the choice of parameters as in Corollary 1 to make the proof idea clear, that is, ϵ=T1/3, η=δ=T1/6, qth=T2/3, β=5.

For the first outer iteration, we have no control of the queue length, so the queue length will possibly increase up to MN=Θ˜(T2/3). Also, because the queue length is controlled by the threshold qth=T2/3 after the first outer iteration, the queue-length bound is O˜(T2/3).

For the regret bound, the proof roadmap is shown in Figure 7. We first define an event C, which means that the estimations of all the arrival rates are ϵ-accurate. Then the regret can be divided into the regret when the event C occurs and the regret when the event C does not occur. We show in Lemma 5 in the Online Appendix that the complement of the event, Cc, happens with low probability. Intuitively, because for each estimation we use N=O˜(1/ϵ2) samples, we can obtain ϵ accuracy with high probability by Hoeffding’s inequality. Note that Lemma 5 cannot be proved by simply applying Hoeffding’s inequality, and we will discuss the challenge in Section 7.

Figure 7. Proof Roadmap

Therefore, the regret when the event C does not happen can be bounded. For the regret when the event C happens, that is, the estimations are ϵ-accurate, we further divide it into two parts. One part is the regret when the lengths of all the queues are less than the threshold qth so that there is no queue that is rejecting arrivals, and the other part is the regret induced by rejecting arrivals when there exists a queue whose length is greater than or equal to the threshold qth. For the latter part, we can bound it by the expected number of time slots when a queue is rejecting arrivals. We show in Lemma 6 in the Online Appendix that this can be bounded by O(T5/6). The proof is using Lyapunov drift analysis with a Lyapunov function Vc(t)iQc,i2(t) for the customer side and Vs(t)jQs,j2(t) for the server side. The challenge of the proof will be discussed in Section 7.

For the part of regret when there is no queue that is rejecting arrivals, regret is induced by zero-order optimization process. We show in the Online Appendix that this part of regret can be bounded by O˜(T5/6). The intuition is as follows. If there is only one run of the system in each outer iteration, then from the literature of two-point method of zero-order optimization (Agarwal et al. 2010), the regret will be bounded by O(K), where K is the number of outer iterations. In our algorithm, there are Θ˜(T2/3) runs of the systems in each outer iterations, and there are Θ˜(T1/3) outer iterations. Hence, the regret will be bounded by Θ˜(T2/3)O(T1/3)=O˜(T5/6). However, the proof needs additional steps because, even in the same outer iteration, the prices and arrival rates are changing. In fact, we need to connect the actual prices to the prices corresponding to x(k). Moreover, there can be dependence between the arrival rate estimation and the randomness of the optimization algorithm, which we will discuss in Section 7.

7. Challenges in the Proof

In this section, we discuss the challenges in the proof of Theorem 1 and the methods we use to overcome them.

The first challenge is to prove that the event Cc occurs with low probability (Lemma 5 in the Online Appendix). The event C is defined as

C{for all outer iteration k=1,,T/(2MN), all bisection iteration m=1,,M, all i,j,and all arrival ratesλi(tc,i+(k,m,n)),λi(tc,i(k,m,n)),μj(ts,j+(k,m,n)),μj(ts,j(k,m,n))[0,1],|1Nn=1NAc,i(tc,i+(k,m,n))λi(tc,i+(k,m,n))|<ϵ,|1Nn=1NAc,i(tc,i(k,m,n))λi(tc,i(k,m,n))|<ϵ,|1Nn=1NAs,j(ts,j+(k,m,n))μj(ts,j+(k,m,n))|<ϵ,|1Nn=1NAs,j(ts,j(k,m,n))μj(ts,j(k,m,n))|<ϵ},
where tc,i+(k,m,n) is the time slot when the price pc,i+(k,m) is run for the nth time for the customer-side queue i, and similarly, we define tc,i(k,m,n), ts,j+(k,m,n), and ts,j(k,m,n). In the definition, we require that the accurate estimation holds for any arrival rates within [0, 1] because we want to ensure the independence between the event C and the randomness of u(k) in Algorithm 2, which we will discuss later in this section. To prove that the event C holds with high probability, simply using Hoeffding’s inequality with the union bound does not work because arrival rates are continuous random variables within [0, 1]. Therefore, we use a discretization idea that is similar to the covering argument for linear bandits (Lattimore and Szepesvári 2020). We define a set S[0,1], a discretized version of [0, 1] with resolution ϵ/2, and rewrite the first inequality in the definition of C as |(1/N)n=1N𝟙{Uc,i+(k,m,n)λ}λ|<ϵ, where Uc,i+(k,m,n) is a random variable uniformly distributed in [0, 1] that is generated independently at the very beginning of the process, and similarly, we can rewrite other inequalities in the definition of C. We can bound the quantization error and then use Hoeffding’s inequality and the union bound over λS[0,1] to bound the probability of Cc.

The second challenge is to prove the upper bound on the expected number of time slots when the length of one of the queues is greater than or equal to the threshold qth (Lemma 6 in the Online Appendix). As mentioned in the proof roadmap in the previous section, we use Lyapunov drift analysis with quadratic Lyapunov functions. The challenge is how we make sure that the actual arrival rates are balanced or almost balanced. We know that x(k)D, so the corresponding arrival rates are balanced. Because x+(k) and x(k) are δ-close to x(k), their corresponding arrival rates are almost balanced. We want to prove that the actual arrival rates are close enough to the almost balanced arrival rates corresponding to x+(k) or x(k). To show this, we use an important property that the true price corresponding to x+(k) or x(k) is within the search interval of the bisection algorithm with high probability and an important fact that the width of the search interval is small enough for k2. Therefore, the actual arrival rates are also almost balanced for k2.

The third challenge is bounding the regret induced by zero-order optimization conditioned on ϵ-accurate arrival rate estimations (Lemma 7 in the Online Appendix). As mentioned in the proof roadmap in the previous section, the challenge is that there could be dependence between the estimation of arrival rates and the randomness (u(k)) in the optimization algorithm. By defining the event C such that the ϵ-accuracy is required for all arrival rates in [0, 1], we can show that the event C is independent of u(k). Note that if we change the definition of C such that only the estimations of the actual arrival rates are ϵ-accurate, this independence does not hold. This independence between C and u(k) is essential so that, conditioned on the event C, we can use the proof idea of two-point method of zero-order optimization to bound the regret.

8. Numerical Results

In this section, we present numerical results of the proposed algorithm and compare its performance with that of the UCB algorithm with uniform discretization designed for Lipschitz bandits (Slivkins 2019).

We consider a system with three types of customers (I=3) and three types of servers (J=3). The compatibility graph is given by E={(1,1),(1,2),(1,3),(2,1),(2,2),(3,2),(3,3)}. The demand and supply functions are given by Fi(λi)=2(1λi) for all i=1,2,3 and Gj(μj)=2μj for all j=1,2,3. We compare the following algorithms.

  • Proposed algorithm: The parameters are set to qth=t2/3, β=1.0, ϵ=1.0×t1/3, δ=0.2×t1/6, η=0.1×t1/6, and ec,i=es,j=8max{ϵ,δ,η}.

  • Discretization-UCB: We first discretize the price space uniformly and then apply UCB algorithm as in Slivkins (2019). We gradually refine the discretization so that the discretization error decreases at the rate of T1/(I+J+2), which theoretically achieves the optimal regret order for Lipschitz bandits according to Slivkins (2019).

  • Discretization-UCB with w>0: As mentioned in the introduction, the discretization-UCB algorithm fails to balance the matching rates and arrival rates, making it difficult to control the queue length. Therefore, we modify this algorithm by adding a penalty/bonus term such that the reward of pulling an arm at time t becomes

    (profit at time t)w(iQc,i(t+1)+jQs,j(t+1)iQc,i(t)jQs,j(t)),

where the additional term is the increase/decrease of the total queue length. Note that when w=0, it reduces to the original discretization-UCB algorithm.

For all the algorithms, we set a queue length threshold of qth=t2/3 for fair comparison, which can be viewed as the buffer size of each queue. All the algorithms use the longest-queue-first matching policy.

Figure 8 presents the results of the profit regret and the maximum queue lengths. Note that both the x axis and y axis of the figures are in log scale. We can see that the proposed algorithm significantly outperforms the discretization-UCB algorithm (w=0) in terms of both profit regret and maximum queue length. The maximum queue length under the discretization-UCB algorithm grows at the same rate as the threshold t2/3, meaning that the queue lengths consistently hit the buffer size. This indicates that the discretization-UCB algorithm fails to control the queue lengths. The regret under the discretization-UCB algorithm grows almost linearly with t, because rejecting arrivals whenever the queue length hits the buffer size (threshold) incurs a constant regret, and the queue lengths hit the threshold frequently. The modified discretization-UCB algorithms with w=1.0 and w=2.0 achieve a better maximum queue length performance. However, their profit regret remains significantly worse than that of the proposed algorithm. The reason could be that the additional penalty/bonus affects the exploration of the arms (prices).

Figure 8. Comparison
Note. The shaded areas represent 95% confidence intervals, calculated from 10 runs.

Sensitivity analysis of the proposed algorithm to its parameters, along with additional numerical results, can be found in the Online Appendix.

9. Conclusions

We studied the pricing and matching problem for two-sided queueing systems with unknown demand and supply functions. We proposed a novel learning-based pricing algorithm that combines zero-order stochastic projected gradient ascent with bisection search. We proved that the proposed algorithm yields a sublinear regret O˜(T5/6) and guarantees an anytime queue-length bound O˜(T2/3) and established a tradeoff between the regret bound and the queue-length bound: O˜(T1γ/4) versus O˜(Tγ) for γ(0,2/3]. In addition to ride-sharing platforms, the proposed algorithm also has potential applications in online marketplaces and healthcare systems.

Appendix A. Proof of Lemma 1

Fix any policy. By the concavity in Assumption 2, we have

1Tt=1TE[iλi(t)Fi(λi(t))jμj(t)Gj(μj(t))]E[i(1Tt=1Tλi(t))Fi(1Tt=1Tλi(t))j(1Tt=1Tμj(t))Gj(1Tt=1Tμj(t))].(A.1)

By Jensen’s inequality, we have

E[i(1Tt=1Tλi(t))Fi(1Tt=1Tλi(t))j(1Tt=1Tμj(t))Gj(1Tt=1Tμj(t))]i(1Tt=1TE[λi(t)])Fi(1Tt=1TE[λi(t)])j(1Tt=1TE[μj(t)])Gj(1Tt=1TE[μj(t)]).(A.2)

Combining (A.1) and (A.2), we have

t=1TE[iλi(t)Fi(λi(t))jμj(t)Gj(μj(t))]T[i(1Tt=1TE[λi(t)])Fi(1Tt=1TE[λi(t)])j(1Tt=1TE[μj(t)])Gj(1Tt=1TE[μj(t)])].(A.3)

Note that the dynamics of the queue length of type i customers is

Qc,i(t+1)=Qc,i(t)+Ac,i(t)j:(i,j)EXi,j(t),(A.4)
where Xi,j(t) denotes the number of matches on link (i,j) at time t. Taking expectation on both sides, we have
E[Qc,i(t+1)]=E[Qc,i(t)]+E[λi(t)]j:(i,j)EE[Xi,j(t)].

Taking time average on both sides, we have

1Tt=1TE[Qc,i(t+1)]=1Tt=1TE[Qc,i(t)]+1Tt=1TE[λi(t)]j:(i,j)E1Tt=1TE[Xi,j(t)],
which implies that
1Tt=1TE[λi(t)]j:(i,j)E1Tt=1TE[Xi,j(t)]=1Tt=1T(E[Qc,i(t+1)]E[Qc,i(t)])=1TE[Qc,i(T+1)].(A.5)

Similar argument holds for the server side and hence

1Tt=1TE[μj(t)]i:(i,j)E1Tt=1TE[Xi,j(t)]=1TE[Qs,j(T+1)].(A.6)

From (A.5) and (A.6), we know that the tuple

((1Tt=1TE[λi(t)]1TE[Qc,i(T+1)])iI,(1Tt=1TE[μj(t)]1TE[Qs,j(T+1)])jJ,(1Tt=1TE[Xi,j(t)])(i,j)E)
satisfies Constraints (3)–(6) and is therefore a feasible solution to the fluid optimization problem. Hence, the objective value of this feasible solution must be less than or equal to f*, that is,
i(1Tt=1TE[λi(t)]1TE[Qc,i(T+1)])Fi(1Tt=1TE[λi(t)]1TE[Qc,i(T+1)])j(1Tt=1TE[μj(t)]1TE[Qs,j(T+1)])Gj(1Tt=1TE[μj(t)]1TE[Qs,j(T+1)])f*.(A.7)

By the Lipschitz property in Assumption 1, we know that the function λiFi(λi) is Lipschitz in λi, and the function μjGj(μj) is Lipschitz in μj. This can be shown as below:

|(λi+Δ)Fi(λi+Δ)λiFi(λi)||λi(Fi(λi+Δ)Fi(λi))|+|ΔFi(λi+Δ)|(LFi+pc,i,max)|Δ|.

Similarly,

|(μj+Δ)Gj(μj+Δ)μjGj(μj)|(LGj+ps,j,max)|Δ|.

Hence, we have

i(1Tt=1TE[λi(t)])Fi(1Tt=1TE[λi(t)])j(1Tt=1TE[μj(t)])Gj(1Tt=1TE[μj(t)])i(1Tt=1TE[λi(t)]1TE[Qc,i(T+1)])Fi(1Tt=1TE[λi(t)]1TE[Qc,i(T+1)])j(1Tt=1TE[μj(t)]1TE[Qs,j(T+1)])Gj(1Tt=1TE[μj(t)]1TE[Qs,j(T+1)])+i(LFi+pc,i,max)1TE[Qc,i(T+1)]+j(LGj+ps,j,max)1TE[Qs,j(T+1)].(A.8)

Combining (A.3), (A.7), and (A.8), we have

t=1TE[iλi(t)Fi(λi(t))jμj(t)Gj(μj(t))]Tf*+i(LFi+pc,i,max)E[Qc,i(T+1)]+j(LGj+ps,j,max)E[Qs,j(T+1)].

Lemma 1 is proved.

Appendix B. Proof of Lemma 2

For the single-link system, we will omit the subscripts i, j in the notations because there is only one customer queue and one server queue. The fluid optimization problem (2)–(6) reduces to

maxxxF(x)xG(x)s.t.x[0,1].(B.1)

Let f(x)xF(x)xG(x). Consider a class of problem instances (i.e., a class of functions F, G) satisfying Assumptions 13 for which there exists an optimal solution x* to Problem (B.1) satisfying the following properties:

  • (A) x*{0,1}.

  • (B) There exist constants νr>0, νc>0, and νr(x*)c(x*) such that for any x[0,1],

    r(x*)r(x)ν(x*x)+νr|xx*|c(x)c(x*)ν(xx*)+νc|xx*|,

where r(x)xF(x), c(x)xG(x), and r(·) and c(·) represent subdifferentials of functions r and c, respectively.

We will first show that this class of problem instances is not empty. Consider the following functions F and G:

F(x){32x,if x[0,12]32x+12x,if x(12,1]G(x){x2,if x[0,12]x18x,if x(12,1].(B.2)

We can show that this problem instance (B.2) satisfies the requirement. First note that both F and G are strictly decreasing, bijective, and Lipschitz in [0, 1]. Also, the inverse functions of F and G are also Lipschitz because the derivatives/subderivatives of F and G are strictly greater than zero. Hence, Assumption 1 holds. By definition, we obtain

r(x)={3x2x2,if x[0,12]3x2x2+12,if x(12,1]c(x)={x22,if x[0,12]x218,if x(12,1].

By definition, we can show that r is concave and c is convex. Hence, Assumption 2 holds. The subdifferentials of r and c are as follows:

r(x)={{34x},if x[0,12)[12,1],if x=12{322x},if x(12,1]c(x)={x,if x[0,12)[12,1],if x=122x,if x(12,1].

Also, we have

(rc)(x)={{35x},if x[0,12)[12,12],if x=12{324x},if x(12,1].

Because 0(rc)(1/2), x*=1/2 is an optimal solution to Problem (B.1). Hence, Assumption 3 and Property (A) hold. Next, we will prove Property (B). Let νr=1/4, νc=1/4, and ν=3/4. Note that νr(x*)c(x*)=[1/2,1]. Consider two cases, xx*=1/2 and x>x*=1/2.

  • xx*=1/2: For r, we have

    r(x*)r(x)=13x+2x2,ν(x*x)+νr|xx*|=12x.

Then

r(x*)r(x)[ν(x*x)+νr|xx*|]=122x+2x2=(2x12)20.

Hence, r(x*)r(x)ν(x*x)+νr|xx*|. For c, we have

c(x)c(x*)=12x218,ν(xx*)+νc|xx*|=12(12x).

Then

c(x)c(x*)[ν(xx*)+νc|xx*|]=12x212x+18=(x2122)20.

Hence, c(x)c(x*)ν(xx*)+νc|xx*|.

  • x>x*=1/2: For r, we have

    r(x*)r(x)=123x2+x2,ν(x*x)+νr|xx*|=12(x12).

Then

r(x*)r(x)[ν(x*x)+νr|xx*|]=14x+x2=(x12)20.

Hence, r(x*)r(x)ν(x*x)+νr|xx*|. For c, we have

c(x)c(x*)=x214,ν(xx*)+νc|xx*|=x12.

Then

c(x)c(x*)[ν(xx*)+νc|xx*|]=x2x+14=(x12)20.

Hence, c(x)c(x*)ν(xx*)+νc|xx*|.

Combining the above two cases, we have proved Property (B). Hence, this problem instance (B.2) belongs to the defined class of problem instances. Therefore, the defined class of problem instances is not empty.

Next, we will show that, for this class of problem instances, any policy that keeps the anytime maximum queue length bounded by Tγ must incur a regret of at least Ω(T1γ). Fix any γ[0,1/2). Fix any policy such that maxt=1,,Tmax{Qc(t),Qs(t)}Tγ. From the regret definition (7), we have

E[R(T)]=t=1T(E[x*F(x*)λ(t)F(λ(t))]+E[μ(t)G(μ(t))x*G(x*)])=t=1T(E[r(x*)r(λ(t))]+E[c(μ(t))c(x*)]).(B.3)

From (B.3) and Property (B), we can obtain the following lower bound:

E[R(T)]t=1T(E[ν(x*λ(t))+νr|λ(t)x*|]+E[ν(μ(t)x*)+νc|μ(t)x*|])=t=1TνE[μ(t)λ(t)]+t=1TE[νr|λ(t)x*|+νc|μ(t)x*|].(B.4)

From the queue dynamics, we have

Qc(t+1)=Qc(t)+Ac(t)X(t).

Taking expectation on both sides, we have

E[Qc(t+1)]=E[Qc(t)]+E[λ(t)]E[X(t)],
where X(t) is the number of matches on the link at time t. Similarly, we have
E[Qs(t+1)]=E[Qs(t)]+E[μ(t)]E[X(t)].

Hence, we have

E[Qc(t+1)]E[Qs(t+1)]=E[Qc(t)]E[Qs(t)]+E[λ(t)μ(t)].

Define Q(t)Qc(t)Qs(t). Then, we have

E[Q(t+1)]=E[Q(t)]+E[λ(t)μ(t)].

Summing both sides from t=1 to t=T and noting that Q(1)=Qc(1)Qs(1)=0, we obtain

E[Q(T+1)]=t=1TE[λ(t)μ(t)].(B.5)

From (B.4) and (B.5), we obtain

E[R(T)]|ν||E[Q(T+1)]|+t=1TE[νr|λ(t)x*|+νc|μ(t)x*|].(B.6)

By Lemma 3 in the Online Appendix, we know that either Qc(T+1)=0 or Qs(T+1)=0. Hence, |E[Q(T+1)]|E[|Q(T+1)|]E[max{Qc(T+1),Qs(T+1)}]Tγ under the policy. Hence, (B.6) can be further bounded by

E[R(T)]|ν|Tγ+t=1TE[νr|λ(t)x*|+νc|μ(t)x*|]|ν|Tγ+min{νc,νr}t=1TE[|λ(t)x*|+|μ(t)x*|].(B.7)

By the triangle inequality, the term t=1TE[|λ(t)x*|+|μ(t)x*|] in (B.7) can be bounded by

t=1TE[|λ(t)x*|+|μ(t)x*|]12t=1TE[|λ(t)μ(t)|]+12t=1TE[|μ(t)x*|+|λ(t)x*|].(B.8)

Next, we will bound the term t=1TE[|λ(t)μ(t)|] in (B.8) using the Lyapunov drift method. Consider the drift E[Q2(t+1)Q2(t)]. By the queue dynamics, we can obtain

E[Q2(t+1)Q2(t)]=E[(Q(t)+Ac(t)As(t))2Q2(t)]=E[(Ac(t)As(t))2]+2E[Q(t)(Ac(t)As(t))]=E[(Ac(t)As(t))2]+2E[Q(t)E[Ac(t)As(t)|Q(t),λ(t),μ(t)]]=E[(Ac(t)As(t))2]+2E[Q(t)(λ(t)μ(t))],
where the third line uses the law of iterated expectation. Summing both sides from t=1 to t=T, we obtain
E[Q2(T+1)]=t=1TE[(Ac(t)As(t))2]+2E[t=1TQ(t)(λ(t)μ(t))]t=1TE[(Ac(t)As(t))2]2E[maxt=1,,T|Q(t)|t=1T|λ(t)μ(t)|].(B.9)

For the variance term E[(Ac(t)As(t))2] in (B.9), we have

E[(Ac(t)As(t))2]=E[Ac2(t)]+E[As2(t)]2E[Ac(t)As(t)]=E[E[Ac2(t)|λ(t)]]+E[E[As2(t)|μ(t)]]2E[E[Ac(t)As(t)|λ(t),μ(t)]]=E[λ(t)+μ(t)2λ(t)μ(t)],
where the second equality uses the law of iterated expectation and the last equality uses the independence between Ac(t) and As(t) conditioned on λ(t) and μ(t). Adding and subtracting x* and 2x*μ(t), we obtain
E[(Ac(t)As(t))2]=E[λ(t)x*+x*+μ(t)2x*μ(t)+2(x*λ(t))μ(t)]=E[x*+μ(t)2x*μ(t)]+E[λ(t)x*+2(x*λ(t))μ(t)]E[x*+μ(t)2x*μ(t)]3E[|λ(t)x*|].

Adding and subtracting x* and 2(x*)2, we further obtain

E[(Ac(t)As(t))2]x*+x*2(x*)2+E[μ(t)x*+2x*(x*μ(t))]3E[|λ(t)x*|]2x*(1x*)3E[|μ(t)x*|]3E[|λ(t)x*|].(B.10)

Hence, from (B.9) and (B.10), we have

E[Q2(T+1)]t=1T2x*(1x*)3t=1TE[|μ(t)x*|+|λ(t)x*|]2E[maxt=1,,T|Q(t)|t=1T|λ(t)μ(t)|].

Rearranging terms and noting that maxt=1,,T|Q(t)|maxt=1,,Tmax{Qc(t),Qs(t)}Tγ, we obtain

E[t=1T|λ(t)μ(t)|]t=1T2x*(1x*)2Tγ32Tγt=1TE[|μ(t)x*|+|λ(t)x*|]Tγ2.

Because x*{0,1} by Property (A), we have 2x*(1x*)>0. Hence, t=1T2x*(1x*)=Θ(T). Hence, we have

E[t=1T|λ(t)μ(t)|]Θ(T1γ)32Tγt=1TE[|μ(t)x*|+|λ(t)x*|]Tγ2=Θ(T1γ)32Tγt=1TE[|μ(t)x*|+|λ(t)x*|],(B.11)
where the last line holds since γ<1/2. From (B.8) and (B.11) and noting that E[t=1T|λ(t)μ(t)|]0, we have
t=1TE[|λ(t)x*|+|μ(t)x*|]max{Θ(T1γ)34Tγt=1TE[|μ(t)x*|+|λ(t)x*|],0}+12t=1TE[|μ(t)x*|+|λ(t)x*|]=max{Θ(T1γ)+(1234Tγ)t=1TE[|μ(t)x*|+|λ(t)x*|],12t=1TE[|μ(t)x*|+|λ(t)x*|]}miny:0y2T{max{Θ(T1γ)+(1234Tγ)y,y2}},
where in the last inequality we substitute t=1TE[|μ(t)x*|+|λ(t)x*|] with y and then take the minimum. If γ>0, then the right-hand side will be of order miny:0y2T{max{Θ(T1γ)+Θ(y),Θ(y)}}=miny:0y2TΘ(T1γ+y)=Θ(T1γ). If γ=0, then the right-hand side will be of order miny:0y2T{max{Θ(T)Θ(y),Θ(y)}}=Θ(T). Hence, we obtain
t=1TE[|λ(t)x*|+|μ(t)x*|]Θ(T1γ).(B.12)

From (B.7) and (B.12), we have

E[R(T)]|ν|Tγ+min{νc,νr}Θ(T1γ)=Θ(T1γ)Θ(Tγ)=Θ(T1γ),
where the last equality is by γ<1/2. Lemma 2 is proved.

Appendix C. Exact Expressions of ec,i and es,j in Algorithm 2

ec,i=2ηϵ|E|3/2LFiδ[i=1ILFi(1+LFi1(pc,i,maxpc,i,min))+j=1JLGj(1+LGj1(ps,j,maxps,j,min))]+2ϵLFi(1+LFi1(pc,i,maxpc,i,min))+η|E|3/2LFi(i=1I|Ec,i|(LFi+pc,i,max)+j=1J|Es,j|(LGj+ps,j,max))+2δ|E|1/2LFi=Θ(ηϵδ+η+δ+ϵ),(C.1)
es,j=2ηϵ|E|3/2LGjδ[i=1ILFi(1+LFi1(pc,i,maxpc,i,min))+j=1JLGj(1+LGj1(ps,j,maxps,j,min))]+2ϵLGj(1+LGj1(ps,j,maxps,j,min))+η|E|3/2LGj(i=1I|Ec,i|(LFi+pc,i,max)+j=1J|Es,j|(LGj+ps,j,max))+2δ|E|1/2LGj=Θ(ηϵδ+η+δ+ϵ).(C.2)

References

  • Adan I, Weiss G (2012) Exact FCFS matching rates for two infinite multitype sequences. Oper. Res. 60(2):475–489.LinkGoogle Scholar
  • Agarwal A, Dekel O, Xiao L (2010) Optimal algorithms for online convex optimization with multi-point bandit feedback. Kalai AT, Mohri M, eds. Proc. 23rd Annual Conf. Learn. Theory (ACM, New York), 28–40.Google Scholar
  • Aveklouris A, Puha AL, Ward AR (2023) A fluid approximation for a matching model with general reneging distributions. Queueing Systems 106:199–238.Google Scholar
  • Aveklouris A, DeValve L, Stock M, Ward A (2024) Matching impatient and heterogeneous demand and supply. Oper. Res. 73(3):1637–1658.LinkGoogle Scholar
  • Besbes O, Zeevi A (2015) On the (surprising) sufficiency of linear models for dynamic pricing with demand learning. Management Sci. 61(4):723–739.LinkGoogle Scholar
  • Caldentey R, Kaplan EH, Weiss G (2009) FCFS infinite bipartite matching of servers and customers. Adv. Appl. Probability 41(3):695–730.Google Scholar
  • Chen Y, Hu M (2020) Pricing and matching with forward-looking buyers and sellers. Manufacturing Service Oper. Management 22(4):717–734.LinkGoogle Scholar
  • Chen B, Shi C (2024) Tailored base-surge policies in dual-sourcing inventory systems with demand learning. Oper. Res. 73(4):1723–1743.Google Scholar
  • Chen X, Liu Y, Hong G (2023) Online learning and optimization for queues with unknown demand curve and service distribution. Preprint, submitted March 6, https://arxiv.org/abs/2303.03399.Google Scholar
  • Chen X, Liu Y, Hong G (2024) An online learning approach to dynamic pricing and capacity sizing in service systems. Oper. Res. 72(6):2677–2697.LinkGoogle Scholar
  • Duchi JC, Jordan MI, Wainwright MJ, Wibisono A (2015) Optimal rates for zero-order convex optimization: The power of two function evaluations. IEEE Trans. Inform. Theory 61(5):2788–2806.Google Scholar
  • Flaxman AD, Kalai AT, McMahan HB (2004) Online convex optimization in the bandit setting: Gradient descent without a gradient. Preprint, submitted August 2, https://arxiv.org/abs/cs/0408007.Google Scholar
  • Gurvich I, Ward A (2015) On the dynamic control of matching queues. Stochastic Systems 4(2):479–523.LinkGoogle Scholar
  • Hu M, Zhou Y (2022) Dynamic type matching. Manufacturing Service Oper. Management 24(1):125–142.LinkGoogle Scholar
  • Lattimore T (2024) Bandit convex optimisation. Preprint, submitted February 9, https://arxiv.org/abs/2402.06535.Google Scholar
  • Lattimore T, Szepesvári C (2020) Bandit Algorithms (Cambridge University Press, Cambridge, UK).Google Scholar
  • Neely M (2022) Stochastic Network Optimization with Application to Communication and Queueing Systems (Springer, Cham, Switzerland).Google Scholar
  • Nguyen LM, Stolyar AL (2018) A queueing system with on-demand servers: Local stability of fluid limits. Queueing Systems 89:243–268.Google Scholar
  • Shamir O (2017) An optimal algorithm for bandit and zero-order convex optimization with two-point feedback. J. Machine Learn. Res. 18(52):1–11.Google Scholar
  • Slivkins A (2019) Introduction to multi-armed bandits. Foundations Trends Machine Learn. 12(1–2):1–286.Google Scholar
  • Srikant R, Ying L (2014) Communication Networks: An Optimization, Control and Stochastic Networks Perspective (Cambridge University Press, Cambridge, UK).Google Scholar
  • Tassiulas L, Ephremides A (1990) Stability properties of constrained queueing systems and scheduling policies for maximum throughput in multihop radio networks. Proc. 29th IEEE Conf. Decision Control (IEEE, Piscataway, NJ), 2130–2132.Google Scholar
  • Varma SM, Castro F, Maguluri ST (2020) Near optimal control in ride hailing platforms with strategic servers. Preprint, submitted August 9, https://arxiv.org/abs/2008.03762.Google Scholar
  • Varma SM, Bumpensanti P, Maguluri ST, Wang H (2023) Dynamic pricing and matching for two-sided queues. Oper. Res. 71(1):83–100.LinkGoogle Scholar
  • Vaze R, Nair J (2022) Non-asymptotic near optimal algorithms for two sided matchings. Proc. 20th Internat. Sympos. Modeling Optim. Mobile Ad Hoc Wireless Networks (IEEE, Piscataway, NJ), 17–24.Google Scholar
  • Yang Z, Srikant R, Ying L (2023) Learning while scheduling in multi-server systems with unknown statistics: Maxweight with discounted UCB. Ruiz F, Dy J, van de Meent J-W, eds. Proc. Internat. Conf. Artificial Intelligence Statist. (PMLR, New York), 4275–4312.Google Scholar