UvA-DARE (Digital Academic Repository) Detecting Markov Chain Instability Detecting Markov Chain Instability: A Monte Carlo Approach

. We devise a Monte Carlo based method for detecting whether a non-negative Markov chain is stable for a given set of parameter values. More precisely, for a given subset of the parameter space, we develop an algorithm that is capable of deciding whether the set has a subset of positive Lebesgue measure for which the Markov chain is unstable. The approach is based on a variant of simulated annealing, and consequently only mild assumptions are needed to obtain performance guarantees. The theoretical underpinnings of our algorithm are based on a result stating that the stability of a set of parameters can be phrased in terms of the stability of a single Markov chain that searches the set for unstable parameters. Our framework leads to a procedure that is capable of performing statistically rigorous tests for instability, which has been extensively tested using several examples of standard and non-standard queueing networks.


Introduction
The stability of a Markov chain is arguably among its most important properties. For example, in queueing applications it offers the guarantee that service has been sufficiently provisioned to cope with the load imposed on the network in the long run. For this reason the assessment of the stability of Markov chains has long been an area of intense research. The objective is often to determine the set of parameter values for which the system's state does not diverge, referred to as the stability region, of a Markov chain. For many relatively standard Markov chains the stability region is easily expressed in terms of quantities related to the transition probabilities. However, despite a substantial and growing literature, for a large class of systems determining the stability region has appeared a subtle and highly non-trivial task. Importantly, various (at first sight) counterintuitive results have been found; in particular, for specific queueing models "naïvely conjectured" conditions turn out to be insufficient to ensure stability.
More specifically, initial results, for example those by Jackson (1963), Baskett et al. (1975), and Kelly (1975), suggested that the stability of queueing networks would be determined by the network's subcritical region (i.e., the set of parameters for which the nominal load at each queue is less than 1). This conjecture was later proven incorrect by a series of counterexamples that showed instability can occur with subcritical parameters when seemingly benign work conserving rules are applied. Early examples include those of Lu and Kumar (1991), Rybko and Stolyar (1993), and Kumar and Seidman (1990). In these examples the instability is typically essentially caused by the priority rules that apply between the customer classes. Similar effects can, however, be constructed in first-in first-out queueing networks with customer classes that have strongly differing mean service requirements, see for example Bramson (1994). It was thus realized that, at first sight counterintuitively, decreasing the mean service requirement of certain job classes can in fact induce instability. As a consequence the stability region need not be monotone (nor convex) in its parameters. For example, Bordenave et al. provide an instance of a non-convex stability region in Bordenave et al. (2012). There are various other examples of Stochastic Systems, 2017, vol. 7, no. 2, pp. 289-314, © 2017 queueing networks with unusually shaped stability regions. In MacPhee et al. (2007) provide an example with a "thick null recurrent set," in Baccelli and Bonald (1999) show that for certain TCP models the stability region is of a fractal nature, and in Nazarathy et al. (2015) investigate a case where the stability region is conjectured to consist of disjoint parts.
To avoid determining the stability regions of queueing networks on a case-by-case basis, various general approaches have been proposed. Perhaps the most straightforward among these amounts to determining the invariant measure of the number of customers; when this allows a normalization, then a stationary distribution exists. This approach works for a set of classical models, relying on concepts such as product form and (quasi-)reversibility (Kelly 1979), but unfortunately not for many (sometimes just slightly more complex) other systems.
An arguably more robust approach to determining stability is to construct an appropriate Lyapunov function, and then apply the Foster-Lyapunov theorem. Along these lines Tassiulas and Ephremides, for example, use a quadratic Lyapunov function to find a series of policies which are stable in a wide variety of settings (Tassiulas and Ephremides 1992). Constructing an appropriate Lyapunov function is often specific to the application at hand, but the approach can be simplified by studying the fluid model associated with the queueing network. Such a fluid approach was first described by Rybko and Stolyar (1993) and was developed in a general form by Dai (1995); a textbook treatment is presented in Bramson (2008). Importantly, for specific models this approach can help determine the conditions under which there is stability, but it does not instantly provide a stability condition for a given network at hand. Thus far, no general framework has been developed that is capable of deciding whether a given Markov chain is stable or not.
The objective of this paper is to develop a general simulation based approach to determining when a given Markov chain should be classified as unstable. A first paper to consider this approach is Wieland et al. (2003). Then, concurrently to our work, Leahu and Mandjes (2016) proposes a simulation based method for determining the stability region of a multiclass queueing network with respect to its arrival rate, when it is possible to verify that the stability region satisfies particular stochastic monotonicity properties. Given the variety of models and counter-examples discussed above, we place importance on the generality of settings to which our algorithm is suitable. To apply our algorithm we do not place structural assumptions on the stability region, we do not restrict parameters of interest, and we do not restrict the mechanism from which the simulations are derived. We focus on providing theoretical guarantees on the performance of our method for a broad class of models where simulations can exhibit either positive or negative drift.
In particular, rather than specific parameter choices, we are interested in the stability classification of parameter sets. The distinguishing features are: (i) that the methodology can be easily used for a relatively broad class of systems, and (ii) that the technique is based on Monte Carlo simulation. Clearly, it is straightforward to develop a simulation-based method that can speculatively test stability for a single parameter setting. It is nevertheless far from obvious how an algorithm should be set up that can identify whether a system is unstable for any of the parameter values within a given set.
This paper resolves this issue by proposing a simulated annealing (Kirkpatrick and Vecchi 1983) based algorithm that systematically searches the parameter set, and determines whether it contains a subset of positive measure consisting only of unstable parameter values. Instead of having to perform a series of simulations to answer the stability identification question, our algorithm performs a single simulation run of a process that encompasses both the queueing network and the parameter set, which is provably capable of finding positive measurable subsets of unstable parameters. That is, the output of our algorithm is a statistical statement that provides explicit asymptotic performance guarantees. We view our work as a substantive pioneering study on the simulation based computation of the stability region of Markov chains.
The framework we propose has the major advantages over existing ones of being broadly applicable and relying only on mild modeling assumptions. Our method provably provides the correct outcome if the Markov chain has bounded increments. Another significant advantage of the approach is that the annealing algorithm can essentially be performed separately from the simulation of the queueing network; as a consequence, the program can be organized with an inner loop (simulating the queueing network with given parameter values using a rather complex simulator) and an outer loop (simulating the annealing step). It thus enables us to computationally determine the stability region for (i) relatively straightforward models with non standard features for which this has not been identified in closed form, but also for (ii) larger, realistic models capturing application-specific details.
We now proceed by providing an informal description of the setting we consider as well as our algorithm. The key object in this paper is the collection of Markov chains each of them evolving on the state space , where ⊂ I is a set of parameter values. For instance, λ ∈ could parametrize the arrival rates of a queueing network consisting of I queues. Our algorithm detects if there is a subset¯ ⊂ of positive measure for which the Markov chains (X (λ) : λ ∈¯ ) are unstable. Importantly, for reasons that will become clear, we use a definition of "stability" that differs slightly from those in common use. We essentially define stability of an individual chain through a Lyapunov drift condition imposed on a function f . For example, f (x) could give the total number of jobs in the queueing network when it is in state x. Informally speaking, if for f the process f (X (λ) ) has negative drift above some finite threshold, then we call X (λ) f -stable. Alternatively, if f (X (λ) ) has positive drift above some finite threshold, then we call X (λ) f -unstable. If there exists a¯ ⊂ of positive Lebesgue measure such that X (λ) is f -unstable for all λ ∈¯ , then we call f -unstable, and otherwise we call stable. The main idea behind the algorithm is that it generates a discrete time process with state space ( , ). Given an initial state (x, λ), a new parameter proposal γ is chosen uniformly from . The Markov chain X (γ) then evolves for τ(x) time units starting from initial state x. In our implementation τ(x) is chosen to be proportional to f (x). Denoting X (γ) τ(x) : y, the next state of the bivariate process is subsequently chosen by comparing the proposed state (y, γ) with the current state (x, λ) according to the Metropolis rule: Here [z] − : min{0, z} and η is a positive tuning parameter for the algorithm. The above iteration is motivated by the simulated annealing algorithm initially proposed by Kirkpatrick and Vecchi (1983). The main advantage of this type of update is the relative generality of optimization problems that it can provably handle, while still being superior to exhaustive search methods. A key distinction between our method and the typical implementation of simulated annealing is that our cooling schedule is achieved using a combination of the fixed parameter η and the parameter τ(x), that varies with the state of the Markov chain.
In addition to the global search algorithm just described, we will also study a local search version. This version is different in two respects. Firstly, the new parameter proposal γ is sampled uniformly from the neighborhood of the current parameter, a set we denote by λ . Secondly, we allow X (λ) and X (γ) to evolve for τ(x) time units starting from initial state x, let X (λ) τ(x) : y and then apply the above Metropolis rule with x replaced by y . Our key theorems apply to both versions of the algorithm and we explore differences in performance of the two methods through examples.
After having pointed out how the algorithm works, we now provide results that separate its sample paths into either the stable or unstable regimes. Let S k (Y k , Λ k ) be the state of the bivariate process achieved after k iterations of the above rule (1) and let T k be the total amount of time that the algorithm has run for by the kth iteration. The first main theoretical contribution of this paper, later stated formally in Theorem 3.4, shows under mild conditions on X, that if there does not exist a subset of unstable parameters¯ ⊂ with positive Lebesgue measure, then almost surely Importantly, the second main theoretical contribution of this paper, later stated formally in Theorem 3.5, shows that there exists a true "dichotomy" since if there does exist an unstable set of parameters¯ ⊂ with positive Lebesgue measure, then the process (Y k : k ∈ + ) diverges, in the sense that, almost surely The bound (3) relies on a martingale argument in combination with an application of the Azuma-Hoeffding inequality. The bound (2) is proven by a coupling argument: as it turns out, in the stable situation the process f (Y) can be majorized by a Markov chain (W k : k ∈ + ) that has an asymptotic drift of zero. An advantage of the coupling approach used to prove (2) is that the Markov process W is easily simulated, and can therefore be used to provide probabilistic bounds on the likelihood of instability. We use this approach to perform rigorous statistical tests for instability of the underlying set . There are many potential approaches to the adaptation of our theoretical results to a practical test for instability. The approach we suggest is to compare f (Y k ) with the quantiles of W k . Specifically, we let the process f (Y) evolve according to the rule given in (1) until the total number of steps in taken between steps of Y exceeds some predetermined level k * , at which point we compare f (Y k ) with the 1 − α quantile of W k , with α being the desired confidence level. If f (Y k ) exceeds this Stochastic Systems, 2017, vol. 7, no. 2, pp. 289-314, © 2017 quantile then we obtain a strong rigorous statistical statement of instability, whereas otherwise we fail to reject the "null hypothesis" of stability.
We provide five example applications of our algorithm. We apply it to a system consisting of a set of parallel queues studied by Tassiulas and Ephremides (1993), a tandem queueing system, the celebrated Rybko-Stolyar network (Rybko and Stolyar 1993), a network of input queued switches studied by Andrews and Zhang (2003), and a broken diamond random access network (RAN) recently studied by Ghaderi et al. (2014).
Since the stability region is well known for the parallel and tandem systems, these are ideal examples on which to verify that the algorithm performs as expected. We use the tandem system to show that although our results are in a discrete time setting, we are still able to effectively study continuous time systems using a jump chain associated with the process. Additionally, this network also highlights that we are able to test multidimensional parameter sets for instability, and suggests that we are able to relax the Markov assumption. The Rybko-Stolyar network is a popular example of a system with oscillating queue sizes. Not only does our analysis confirm existing theoretical results that give sufficient conditions for stability of this system, it also provides a statistically rigorous statement that these conditions are also necessary. The network of input queued switches allows us to show that our algorithm provides interesting results for systems with high dimensional state spaces and complex dynamics. In addition, we are able to use our methodology to show that this is an example of a system where the longest queue first policy is not maximally stable. Our final example, the RAN of Ghaderi et al., is currently a hot topic of research in the applied probability community. This system exhibits oscillatory queue size sample path behavior reminiscent of the Rybko-Stolyar network, but in a higher dimensional setting. We are able to expand on the theoretical results of Ghaderi et al. (2014) by providing more specific (statistical) information about which parameter sets are unstable. Throughout this section results are given in terms of both the global and local versions of the algorithm. For some of the models (parallel, tandem, Rybko-Stolyar), the local algorithm appears to perform better, while for others (switches, RAN) the global algorithm appears to be superior.
The remainder of this paper is structured as follows. In Section 2 we give a formal description of our framework and the assumptions imposed. Section 3 presents the algorithm and states our main results, i.e., Theorems 3.5 and 3.4. In Section 4 detailed proofs are given (of our main results, propositions, and lemmas). We then provide a range of case studies in Section 5 that demonstrate the algorithm's potential. Section 6 presents concluding remarks as well as an outlook on future research.

Framework
In this section we present the setup considered in the paper. The object of study is the irreducible Markov chain X (λ) that is parametrized by parameter λ ∈ ; these parameters can, for example, be thought of as the arrival or service rates in a queueing network. It is assumed throughout that is a closed subset of I + , for some I ∈ , with finite positive Lebesgue measure (which is denoted | |). The Markov chain, which may represent the evolution of the population of a queueing network, attains values in : J + : {0, 1, . . .} J , for some J ∈ . As pointed out in the introduction, the main goal of this paper is to devise a procedure that identifies if a parameter set contains any unstable parameters. Put more precisely, the algorithm verifies whether or not there is a subset¯ of such that for all λ ∈¯ the associated Markov chain is unstable. Further, for each λ ∈ , we let λ be the neighborhood of λ. As is commonly assumed for local search algorithms, we assume that λ ∈ λ and for any λ 1 , λ n ∈ there is a sequence of neighborhoods with λ k+1 ∈ k for k 1, . . . , n − 1.
We will work extensively with a Lyapunov function that maps the state of the Markov chain to a nonnegative real number, that is a monotone function f : → [0, ∞). In our queueing network example, f (x) could represent the sum of the queue sizes within the network (that is, the total network population). We assume that f is unbounded in the sense that lim inf It is assumed throughout that for all λ the process f (X (λ) ) has bounded increments, implying there exists a constant φ f > 0 such that We now provide the formal definitions of stability and instability, as used in this paper.
Definition 2.1. Given f , we say that the set of parameters is f -stable if there exists δ > 0, σ > 0, and κ > 0 such that for all x such that |x| ≥ κ, for all λ ∈ , and all k ≥ σ.
Similarly, the set of parameters is f -unstable if there exists a set¯ ⊂ of positive measure, δ > 0, σ > 0, and κ > 0 such that for all x such that |x| ≥ κ, for all λ ∈¯ , and all k ≥ σ.
For a given value of λ, the conditions (5) and (6) are Lyapunov conditions for which one can obtain positive recurrence or transience of the Markov chain X (λ) , respectively (see for instance Hairer 2010). We remark that a countable state space Markov chain is positive recurrent if and only if there exists a Lyapunov function f for which it is f -stable, see Meyn and Tweedie (2012, Theorem 11.0.1). In our definition of f -stable we consider over set of Markov chains for which the same choice of f provides positive recurrent. In this sense, the definitions of "stable" and "unstable" then ask whether or not the Markov chains X (λ) are positive recurrent for parameters λ in . For our simulations we use the L 1 norm, i.e., the sum of queue sizes, though of course other functions might be considered.
We further remark that if a fluid limit,f (X (λ) ), exists for each (rescaled) process, then the above conditions (5) and (6) imply that In other words, (5) and (6) respectively imply fluid stability and fluid instability (see for instance Bramson 1994). In general, fluid stability and instability are not equivalent to the positive recurrence and transience of an underlying Markov process. Nevertheless, the Lyapunov analysis of fluid models remains one of the most widely deployed and established devices used to determine the positive recurrence and transience of Markov processes. Similarly, our work provides a broadly applicable technique that may be used to determine the positive recurrence and transience of families of Markov processes. Now that we have introduced our framework, the next section describes our algorithm, as well as the main results upon which the algorithm is based.

Implementation and Main Results
In this section we explicitly give our algorithm and provide a detailed discussion of the choices underlying it. We then give in Theorems 3.4 and 3.5 our main theoretical contribution. We follow this up with a suggested method of using our results to implement actual tests for instability.

Algorithm
We now describe our algorithm for identifying whether a parameter set is unstable. Our approach is based on the principle of searching the relevant parameter set for a parameter choice that maximizes the drift of the Markov process under consideration. As such, well known optimization algorithms provide an ideal source of inspiration for potential methods. As previously mentioned, the approach taken in this paper is based on the well known simulated annealing optimization algorithm. While many other optimization techniques have found acceptance through testing against well known "hard" problems, the simulated annealing algorithm has shown itself to be amenable to rigorous results on performance guarantees.
In this section we provide a detailed description of our algorithm, and through the use of two theorems provide guarantees on its asymptotic performance. A key advantage of this strong theoretical grounding is that the machinery used to provide these theoretical guarantees also allows us to develop a hypothesis test that outputs a statistical statement of whether or not a Markov chain is stable given a particular parameter set. In this section we also include an illustration of the algorithm and its output in the context of a single server discrete time queueing system. In later sections we demonstrate the algorithm's potential through a series of experiments concerning more complex systems.
We assume that τ(x) c f (x) + d, where c, d ∈ (0, ∞) are chosen by the algorithm's user. Note that this implies τ(x) → ∞ as |x| → ∞ and that τ has bounded increments. Finally, let T k give the time that our chain has been running for at the k-th step, that is, We now have all of the machinery needed to give both versions of our algorithm. Stochastic Systems, 2017, vol. 7, no. 2, pp. 289-314, © 2017 The Author(s)

Algorithm 3.1 (Global search algorithm)
(iv) If stopping condition is met, then stop, else set k k + 1 and return to (i).
As outlined in the introduction, each step of the global search algorithm compares x, as sampled in the previous step, with a new value y, sampled using a uniformly at random selected parameter γ from with runtime τ(x) and initial state x. The state is then updated according to the Metropolis rule (7).
Recalling from Section 2 that λ is a neighborhood of λ in , the local search version operates as follows.

Algorithm 3.2 (Local search algorithm)
(v) If stopping condition is met, then stop, else set k k + 1 and return to (i).
The local search algorithm compares states (x , λ) and (y, γ) where x is sampled by running the current parameter λ for a further τ steps and (y, γ) is sampled by running a neighboring parameter γ ∈ λ for the same number of steps. These states are then compared according to the Metropolis rule (8).
As we will discuss in more detail, the relative performance of the global search and local search differ depending on the model and setting to which they apply. Under general modeling assumptions both algorithms converge to a behavior that only accept unstable parameters in . The Global Search Algorithm proposes parameters uniformly at random and thus asymptotically will only accept parameters uniformly at random in the unstable set¯ . This is useful if one wants to identify the region of instability, in addition to determining if instability occurs. The Local Search Algorithm proposes two neighboring parameters and compares them simultaneously. In this way the Local Search Algorithm applies a hill-climbing heuristic. In this sense it is more aggressive in approaching regions of instability, but will not identify the entire unstable region.
When analyzing Algorithm 3.1, we assume that is a general measurable set and that¯ is a set with positive Lebesgue measure, while for the local search Algorithm 3.2 we place some restrictions. We assume the following: In a queueing setting x 0 may, for example, correspond to a state where all queues are idle. An important feature of our work is that we do not place structural conditions on such as convexity or monotonicity. Since examples of Markov processes violating these conditions frequently occur in both theory and practice, by avoiding such conditions our work is widely applicable. Another key feature is that we do not assume knowledge of the process generating each sample path is available, we only require samples of the state description in response to parameter choices. This further extends the set of models that may be analyzed using our method, since practical simulators (although Markovian) are often not generated from a simple closed form Markovian descriptor (transition matrix or infinitesimal generator), but rather come in the form of a "black box" that provides outputs in response to parameter inputs.
Note that we have not provided an explicit stopping condition for the algorithm yet. Since our results are asymptotic in the number of steps k, it may be sensible to run the algorithm until some large k, chosen based on CPU time limitations. An alternative may be to dictate a particular total budget of time that the algorithm may evolve in . To do this, choose a k * and run the algorithm until T k > k * . In either case it may not be obvious whether the sample path belongs to the stable or unstable regimes, an issue that we address with a test for instability in Section 3.3.
We now briefly address some of the choices we made in the design of the algorithm. Firstly, note that the τ function we introduced has replaced the cooling schedule from the traditional simulated annealing algorithm. The functional form of τ ensures that when Y k is large the subsequent Λ k+1 proposal is given an increased opportunity to demonstrate that it has higher drift. Since a large Y k hints that an unstable parameter choice has been recently chosen this helps to ensure that CPU budget is expended comparing parameter choices which appear to be unstable.
The conditioning in step (ii) of the algorithm on X (γ) 0 x, rather than starting each new sample from X (γ) 0 equal to zero, is intentionally designed to allow the system to build up to a size where instability properties become evident. That is, as per Definition 6, the drift properties we are seeking only become evident after |x| > κ has occurred. Forcing the system to reach |x| > κ in a single X (γ) τ(x) sample may result in the algorithm inefficiently repeating "burn in" time.

Main Results
Our main theoretical contributions are Theorems 3.5 and 3.4 below. These demonstrate the stability of a set can be summarized asymptotic sample path behavior of the process f (Y)/T. In essence the stability of a parameterized family of Markov processes can be summarized by the stability of a single Markov process, as generated by Algorithm 3.1 or Algorithm 3.2.
The main results of this paper are as follows: Theorem 3.4. If the set is stable then, almost surely, Theorem 3.4 shows that when is stable, the sample path of f (Y)/T converges to 0.
Theorem 3.5. If the set is unstable then, almost surely, Theorem 3.5 shows that when is unstable, the sample path of f (Y)/T eventually never returns to 0. In practical use it is f (Y k )/T k , for some large k, that is observed, rather than its limiting value. It is therefore not possible to directly apply the theorems. Instead, when f (Y)/T appears to converge to 0, the contrapositive of Theorem 3.5 provides evidence that the parameter set is not unstable. Conversely, when f (Y)/T appears to diverge, converge to a positive constant, or fluctuate within a set that does not contain 0, the contrapositive of Theorem 3.4 provides evidence that the parameter set is not stable.
We now include a short example to illustrate these theorems. Consider a simple discrete time queueing system where an arrival occurs at the beginning of each time slot with probability p ∈ [0, 1], and then subsequently, if the queue is non-empty, a service occurs with probability 0.5. Clearly, so long as the queue is non-empty the expected change in queue size between time periods is p − 0.5. Hence, for p < 0.5 the system is L 1 -stable with κ σ 1 and δ 0.5 − p. Figure 1 illustrates Theorems 3.5 and 3.4 using the sample path behavior of f (Y)/T for this simple system. The sample path corresponding to p sampled from [0, 0.4] appear to converge towards 0, providing evidence that this set is not unstable. Similarly, the sample path corresponding to p sampled from [0, 0.6] appears to remain constant at approximately 10 −2 in the global case and appears to diverge in the local case, providing evidence that this set is not stable.
We remark that the behavior of the unstable sample path substantially differs between the global and local versions of the algorithm. In the global case the unstable sample path quickly separates from the stable sample path and appears to tend towards some constant value. For the local algorithm, however, the stable and unstable sample paths appear highly similar until suddenly the unstable sample path rapidly increases. This suggests that if n is not large enough, the local algorithm may perform very poorly, however for n large it may perform vastly better.
It is not necessarily clear in finite time if a sample path of f (Y)/T belongs to the regime of Theorem 3.4 or Theorem 3.5. In the next subsection we address this issue by presenting a test for instability. Stochastic Systems, 2017, vol. 7, no. 2, pp. 289-314, © 2017 The Author(s)

A Test for Instability
We now provide a method to test, statistically, whether or not a parameter set is unstable. Here one could consider a null-hypothesis which states that the parameter set is stable for some given δ, cf. (5). Given this and the simulated model, we can construct a closed form family of random variables Z(w), w ≥ 0 (given by Lemma 4.1 in Section 4.1) such that Z(w) stochastically majorizes the increments of f (Y k ). With this choice of Z(w), we can then define a Markov chain (W k : k ∈ + ) according to the recursion The following proposition will be proven to show that there is a coupling where the Markov chain ( Since Theorem 3.5 says that f (Y) will diverge in the unstable case, we suggest comparing f (Y k ), as outputted by Algorithm 3.1, with the quantiles of W k . In particular, let q Note that, given a problem instance chosen according to Definition 2.1 and (4) (that is, particular values of φ f , δ, σ, and κ), the quantiles of q (α) k can be estimated quickly and easily through Monte Carlo simulations of the W process. If k then we suggest concluding that the parameter set is f -unstable for that problem instance. Otherwise we suggest that there is not enough evidence to make a conclusion either way.
To illustrate this approach we return to the simple example introduced in the previous section. In Figure 2 estimated q (0.05) curves with δ 0.05 and δ 0.01, τ(x) 0.5x + 1, and σ κ Y 0 1 are compared with estimated mean curves for the f (Y) process with [0, ] for 0.6, 0.55, 0.5, 0.45, and 0.4. As expected the q (0.05) curves bound the mean curves of f (Y) for < 0.5, while for > 0.5 the mean curves of f (Y) appear to eventually exceed the q (0.05) curve. With reference to Definition 2.1, δ is the downward drift that a process must exhibit in order to be stable. As can be seen here, it is to be expected that q (0.05) curves generated using a particular δ value bound those generated using higher values of δ. Since we test for a stabilizing drift up to δ, it is desirable to use a δ which is as low as possible. In Section 5 we investigate further the trade-off between simulation run-time and δ that users of our algorithm must keep in mind. Note that the global and local q (0.05) curves are nearly indistinguishable from each other here.
In some instances, obtaining long sample paths of f (Y) may be a computationally intensive task. We now describe an approach to managing the user's simulation budget, but various other approaches could be taken. In order to achieve a significance level of at least α, we propose choosing a simulation budget k * , to take the first sample point f (Y k ) such that T k+1 > k * and to then compare this f (Y k ) with an estimate of q (α) k . If f (Y k ) exceeds q (α) k then we suggest rejecting the "null hypothesis" of stability, and otherwise we suggest concluding that there is not enough evidence to make a conclusion. This is the approach that we take in Section 5. Note that this does not involve comparing the "test statistic" f (Y k ) with its distribution, but rather we compare it with a distribution which is stochastically dominant. Assuming that the 1 − α quantile estimate for W k is accurate, asymptotically in k * the significance level will in fact tend to 0 for all α > 0, and never exceed α. We summarize the above discussion in Algorithm 3.7, below. Note that this algorithm is just one of many potential extensions of our basic Algorithm 3.1, for which we give specific theoretical results, and an input to this algorithm is an appropriate (q (α) : k ∈ + ) estimate.

Algorithm 3.7 (Stability test algorithm)
(iv) If T k + τ(Y k ) > k * , then proceed to (v), else set k k + 1 and return to (i). ( In Section 4 we prove the results presented above, and then in the Section 5 we will demonstrate the algorithm's potential on some more complex systems.

Proofs
We first prove Theorem 3.4 in Section 4.1, which applies to the stable regime, in the context of the global search algorithm and provide a remark on the minor modifications to this proof that would be needed to show the local search case. We then prove Theorem 3.5, which applies to the unstable case, for the global search and local search algorithms in Sections 4.2 and 4.3 respectively.

Stable Parameter Set
In this subsection we prove Theorem 3.4. In what follows, we first give a formal definition of the random variables Z(w). Then, to prove Theorem 3.4, we require Proposition 3.6, given in Section 3, and Proposition 4.2, given below. The first of these propositions shows the existence of a process that majorizes any Y process generated from a stable parameter set. The second proposition then shows that this majorizing process has an asymptotic drift of zero, which leads to the result of the theorem. In order to obtain Proposition 3.6 we require Lemma 4.1, given next, and Lemma A.1, which is a simple technical lemma that can be found in the appendix. Lemma 4.1 explicitly provides a level dependent random variable that bounds the jumps of the f (Y) process and Lemma A.1 gives a useful monotonicity property for these jumps. Proposition 4.2 is proven by contradiction and depends on Lemma A.2 which states that the sequence of random variables defined in Lemma 4.1 are square integrable and tend to an expectation of zero as the level diverges.
In this section all of the proofs are performed in the context of the global search algorithm, however at the end of the section we remark on the minor modification required to adapt the proof to the local search algorithm context.
We develop a process W that stochastically majorizes any f (Y) process generated from a stable parameter set. Recall φ f , δ and σ from Definition 2.1. We define the function n such that n(w) is the smallest integer such that σn(w) ≥ τ(x) when the underlying process is in a state x f −1 (w). We bound the jumps of f (X (λ) ), for a given stable λ. The following lemma provides this bound.

Lemma 4.1.
If λ is f -stable, then there exists random variables (Z(w): w ≥ 0) and a constant w * such that, for all x with where, for σ, κ and δ as given in (5), Z(w) is a random variable with distribution where . The proof of this lemma is straightforward, yet, somewhat technical; a proof is given in the appendix. The form of the expression for (Z(w) ≥ z) given above can be understood as follows. By (5) the stable downward drift condition only applies when the chain has run for at least σ steps, so we consider the process on steps of size σ and ensure that we take enough of these steps, n(w), to exceed τ(x). The maximum is a result of the trivial upper bound on probabilities, and the 1 + n(w) exponential terms correspond to a union bound using an equivalent number of applications of the Azuma-Hoeffding inequality.
The downward drift condition requires |x| > κ, and so we apply Azuma-Hoeffding to different martingales depending on whether the sample path of interest enters the states {x: |x| < κ} or not. The first exponential term corresponds to sample paths that never enter |x| < κ, and so the martingale we use does not include κ and has steps which are bounded by (φ f + δ)σ. The remaining exponential terms correspond to sample paths that hit the level κ. We associate with these sample paths a martingale which reflects the fact that for such sample paths there be must be an excursion from κ to z + f (x) that can be stopped just before this excursion occurs. As such these remaining exponentials do not rely on δ.
From Lemma 4.1 we can prove Proposition 3.6.
Proof of Proposition 3.6. The inequality in Lemma 4.1 bounds the upward movement of the Markov process X λ .
For w 0 f (x 0 ), we see that Namely . Therefore the bound (15) holds by Lemma 4.1 for z > 0. Further, for z ≤ 0, the right-hand side of (15) is equal to 1, so the bound trivially holds.
Lemma A.1, stated and proved in the appendix, assists with the coupling of W and Y by providing a monotonicity property for the transitions of W. Specifically, for constants v, w with w * ≤ v ≤ w, we have that Combining together (15) and (16), we have that whenever f (x 0 ) ≤ w 0 . A direct consequence of this inequality is that there is a coupling of f (Y k ) and W k where, provided f (Y 0 ) ≤ W 0 , then f (Y k ) ≤ W k for all k. This short, but standard, argument is presented in the next paragraph. Let and U be an independent uniform [0, 1] random variable. The distribution of F −1 Y, x 0 (U) and F −1 Z, w 0 (U) are respectively versions of f (Y 1 ) and W 1 for initial values Y 0 x 0 and W 0 w 0 (see e.g., Williams 1991, Section 3.12). Thus Notice that once f (Y 1 ) is determined, we can extend the coupling to determine Y 1 . To do this, we take an independent random variable with distribution

Now, from inequality (17) it is clear that
for all values of u. Thus under this coupling Continuing inductively, we have f (Y k ) ≤ W k for all k, as claimed.
We now analyze the chain W k . As the following lemma states, we find that its asymptotic drift is zero, whenever the parameter set is stable.
Proof. It is a straightforward calculation to show that Z(w) is L 2 bounded in w and that ƐZ(w) → 0 as w → ∞. This is shown in Lemma A.2 in the appendix. We analyze the martingale Since Z(w) is L 2 bounded, M k is an L 2 martingale (with unbounded variation). Further, such L 2 martingales obey the strong law of large numbers, that is For instance, see Williams (1991, Section 12.14) for a proof. We therefore have lim sup where the first equality holds due to (18) and the final equality holds due to (19). We now note that the inequality (20) can only hold when lim k→∞ W k /k 0. To see this, note that if lim sup k W k /k were positive then W k must diverge. However, as was shown in Lemma A.2, we also have that Ɛ[Z(W n−1 ) | W n−1 ] → 0 as W n−1 → ∞. Thus the average of these terms must be zero, that is which is a contradiction. Thus, lim sup k→∞ W k /k 0, as required.
The proof of Theorem 3.4 is now an application of Propositions 3.6 and 4.2.

Proof of Theorem 3.4. For
The time increment τ(x) is bounded below, so T k ≥ γk for some positive constant γ.

Remark 4.3 (Adapting the Proof to Local-Search).
We briefly remark one way in which the above argument can be adapted to the local-search case. Firstly, we note that local search (in the worse case) will choose the maximum of two independent simulation runs. Given that both parameters γ and λ are stable and given Lemma 4.1, we then have that the local search update can be bounded as follows In the second inequality above, we apply Lemma 4.1 to obtain two i.i.d. copies of Z( f (x)). From this one can see that the result of the proof of Theorem 3.4 follows by replacing (12) with for two iid versions of Z. This gives one straightforward way of adapting the proof of Theorem 3.4. Other methods with tighter bounds are also possible.

Proof of Theorem 3.5 for the Global Search Algorithm
In this subsection we prove Theorem 3.5 for the global search algorithm. The proof relies on two lemmas, Lemmas 4.4, and 4.5. In Lemma 4.4 we bound the drift of f (Y) and show that f (Y) can be used to construct a submartingale. This follows from the fact that unstable parameters choices significantly increase the drift, while stable parameter choices do not significantly decrease it. In Lemma 4.5 we bound the moments of this submartingale. Then, using standard martingale arguments we show that every time the submartingale exceeds some level, with positive probability it stays above this level forever. Since f (Y) is an irreducible Markov chain our divergence result then follows. In the following we use the notation [x] + : max{x, 0} and ∆ f (x, y) : f (y) − f (x).

Lemma 4.4.
If is unstable, then there exist constants κ ≥ 0 and a > 0 such that for all x with |x| ≥ κ Proof. Aside from the simulation in between Y samples, our algorithm consists of two random steps: (i) the selection of Λ k+1 , and (ii) the random state update rule (7). Upon conditioning on these two steps the expected change in f can be calculated as follows Denote p : |¯ |/| | ∈ (0, 1], where¯ is the set for which X λ is unstable (cf. (6)). Now split the above integral by distinguishing between: (i) µ ∈¯ , and (ii) µ ∈ \¯ . It is readily verified that for all z ∈ the function The stated bound is trivial for z ≥ 0, and for z < 0 simply note that z exp(−η[−z] + ) is minimized at z −η −1 . For µ ∈¯ we use the lower bound of z in (23), In the second inequality above, we apply the assumption that our Markov chain is unstable on (cf. (6)). For µ ∈ \¯ , we use the lower bound of −(eµ) −1 in (23). This yields Combining Equation (22) with Inequalities (24) and (25) yields Since τ(x) → ∞ as |x| → ∞. There exists κ > 0 such that p δ τ(x) − (1 − p)(eµ) −1 ≥ pδτ(x)/2 for all |x| > κ. Letting a pδ/2, we have the result.
Let k κ be the hitting time for f (Y) on the states {x: |x| ≤ κ}. An immediate consequence of the above proof is that the process forms a submartingale. This in itself is not sufficient to prove f (Y k ) diverges in the sense of Theorem 3.5. However, this is possible when we bound the moments of F k as follows. In the following let S k (Y k , Λ k ).

Lemma 4.5.
If is unstable, then there exist an r > 0 such that for Proof. The change in the process from F k−1 to F k is achieved by a process with bounded increments. Upon applying the Azuma-Hoeffding inequality, we have that Now consider the following sequence of inequalities: In the first inequality above, we apply the bound that Ɛ[F k − F k−1 ] ≥ aτ k from Lemma 4.4. In the second inequality, for values of z such that −aτ k − r −1 log z ≥ 0 we bound the integrand from above by 1, which results in the exp(−raτ k ) term appearing. In the final inequality we apply (28).
We now show that the right hand side of the expression above is strictly less than 1 for a suitable choice of r, and τ k suitably large: The final inequality follows by integrating over rather than + and by noting that the integral of exp(−y 2 ) over is equal to √ π. Observe that there exists r > 0 such that r 2 /4 − ra < 0. Thus for this choice of r, for all τ k suitably large, we have that, as desired, We can now prove Theorem 3.5 using well known martingale arguments. Stochastic Systems, 2017, vol. 7, no. 2, pp. 289-314, © 2017 Proof of Theorem 3.5. We first apply standard stopping arguments to (27) to show that if Y 0 is such that |Y 0 | > κ, then there is positive probability that F k will not go negative, namely, for some K > 0. We do so by investigating the probability of its complement. Let T be the first time when F k < 0 occurs for k ≥ 0, which is a stopping time. Using Lemma 4.5, recalling that where K : min y: | y|>κ { f (y)} is a positive constant since f is positive and f (x) → ∞ as |x| → ∞. The first two equalities above apply our stopping time definition and an exponential change of variable. The first inequality above applies Markov's inequality, the second applies Fatou's lemma and the third is the optional stopping theorem (see e.g., Williams 1991, Section 10.10) applied to our supermartingale. The next step is to show using the Strong Markov Property, that at every time when |Y | > κ holds, there is a positive probability that the process F k remains positive for all remaining time. Due to irreducibility |Y k | > κ occurs infinitely often, and so eventually it will be that F k > 0 for all time. We now argue this point more formally. Let 0 be the first time that |Y k | > κ holds. For n ≥ 1, let which is the process F started from time n−1 . Let σ n be the first time after n−1 when F (n) k < 0 holds, and let n be the first time after σ n that |Y k | > κ holds. Since our Markov chain is irreducible it must be that if σ n is finite, then n+1 is finite. By this and (29) we have Thus, upon noting that σ n cannot possibly be finite if σ n−1 is not, we have (σ n < ∞) ≤ exp(−rK) (σ n−1 < ∞) < · · · < exp(−nrK). k < 0, infinitely often ) 0. Thus, there exists a k such that for all k ≥ k , we have that

Now, note that
as required.

Proof of Theorem 3.5 for the Local Search Algorithm
We now prove Theorem 3.5 under the premise that the local search algorithm, Algorithm 3.2, is applied. First, we consider the situation where the local search algorithm must compare an unstable parameter λ with a stable parameter γ. The following lemma will be used to show that the probability of the Metropolis rule, (8), selecting γ will be a low probability event.

Lemma 4.6. For the events
with λ ∈¯ and γ ¯ there exists positive constants β 1 and β 2 such that Proof. The bound (30) is a consequence of the Azuma-Hoeffding Inequality. In particular, t∧τ * ) is a submartingale with bounded increments and drift δ. Thus we can directly apply the Azuma-Hoeffding Inequality to obtain (30).
The bound (31) is a direct consequence of Lemma 4.1. In particular, taking w f (x) and z (δ/2) f (x), the terms in the exponential in statement (14) of Lemma 4.1 are such that Further, n( f (x)) O(τ(x)). This in turn implies that there are constants β 1 and β 2 such that (31) holds.
We let (Y, Λ) (x, λ) be the initial state of Algorithm 3.2, we let γ ¯ be the parameter selected in Step (i) of Algorithm 3.2, and we let (Y , Λ ) the state of Algorithm 3.2 after its first iteration. Given this notation, the following lemma, which is a consequence of the above result, shows that with high probability Λ λ and that over this step τ(x) is increased by a positive fraction.
Lemma 4.7. There exists positive constants ε, β 3 , and β 4 such that Proof. Let A and B be the events specified in Lemma 4.6, above. Given the event A c , for Λ λ we have that f (Y ) ≥ (1 + 3δ/4) f (x). Since f (x) Θ(τ(x)), for an appropriate choice of ε > 0 (dependent only on δ), we have that τ(Y ) ≥ (1 + ε)τ(x). Now given this choice of ε the following equalities hold, The second inequality follows from definition of the Metropolis rule, (8), and from Lemma 4.6. From this it is clear there are appropriate constants β 3 and β 4 , as required.

Proof of Theorem 3.5 for Local-Search Algorithm.
We see that under Assumption 3.3, the local search algorithm is such that the process Λ k will eventually visit a state in¯ . To see this note that, from any state (Y k , Λ k ) (x, λ) with λ Λ , by irreducibility and positive recurrence of X (λ) and the fact λ ∈ λ , there is a positive probability of reaching state (x 0 , λ). Further, by (9) there is a positive probability of reaching a state (x 0 , µ) for any µ ∈¯ . From that state, again by the irreducibility of X (µ) , there is a positive probability of reaching a state x with τ(x ) > τ for any specified value of τ. Once such a state is reached we now show that there is a positive probability of Λ k remaining inΛ indefinitely. Let E k be the following event Then, by Lemma 4.7, Thus for suitably large initial values of τ 0 we have that Hence, eventually it must occur that the algorithm evolves only according to unstable parameter choices.

Examples
This section presents five example applications of the algorithm, where each example is designed to highlight aspects of the algorithm's implementation and use. More specifically, we subsequently consider a network of parallel queues, a tandem queueing system, the Rybko-Stolyar network, a network of input queued switches, and a random access network (RAN). Before proceeding with the examples we briefly provide some details on algorithm parameter choices used in this section. The η parameter scales the effect of the difference between outcomes sampled from consecutive draws from Λ n samples. Taking η → ∞ is akin to adopting a greedy hill climbing approach, while η → 0 is equivalent to moving at random. Hence the choice of η allows for a trade off between moving towards more unstable parameter choices and becoming trapped in local (stable) optimizers. Throughout this section we have chosen η 1 as a balance between these two extremes. For the local search algorithm, associated with each λ ∈ is a neighborhood λ from which the next parameter candidate will be selected. Choosing these neighborhoods to be large will explore more aggressively, while smaller neighborhoods will increase the effect of local gradient information. Throughout our illustrations we take λ to be a ball centered at λ with radius 0.01 that intersects with . Finally, throughout the section we use (A) as an indicator variable for the algorithm declaring the set A unstable.

Parallel Queues with Randomly Varying Connectivity
For our first example we extend the illustrative example used in Section 3. Consider a system where N parallel queues compete for the service of a single server. Time is slotted, and in each time slot t ∈ + queue i ∈ {1, . . . , N } is connected to the server with probability 0.8. Similarly, at the beginning of each time slot an arrival occurs at each queue with probability p ∈ [0, 1], so that there are at most N arrivals to the system in any particular time slot. After the arrivals have occurred and connectivity is determined, the longest non-zero queue that is connected to the server is reduced by one with probability 4 5 -a policy called longest queue first (LQF). The system is therefore a discrete time Markov chain X (p) taking values in N + . We illustrate this system in Figure 3. The stability region for this irreducible Markov chain is known. In particular, from Corollary 1 in Tassiulas and Ephremides (1993) we have that for any p ≤ * , where * 4 5 the limiting distribution of X (p) exists, and otherwise does not. Therefore any ⊂ [0, ∞) that shares an intersection with [ * , ∞) of positive measure, is unstable under our Definition 2.1. Taking N 4 and to be of the form [0, ), we therefore have instability for approximately those instances when > 0.2, that is * ≈ 0.1997. Furthermore, Theorem 1 in Tassiulas and Ephremides (1993) shows that the system is stable under the LQF policy for the network's subcritical region-a property known as maximal stability. This property is well known to hold for single-hop networks under LQF and its generalization the Max Weight-α algorithm (see e.g., McKeown et al. 1999, Tassiulas andEphremides 1992).
In Figure 4 we give the proportion of simulation runs out of 1,000 where the parameter set [0, 0.3] is declared unstable by the local and global algorithms as k * is increased. Recall that k * is the total number of steps the algorithm is permitted to take in before a value of f (Y k ) is compared to q (α) k . Now, the greatest change in f occurs when there are no services and all queues experience an arrival, so that φ 4. We assume κ 4 and σ 1. It can be seen that longer simulation runs are more likely to declare the system unstable, with an apparent almost sure declaration of instability in the limit. In this case, the local algorithm approaches this limit far more rapidly than the global algorithm. Figure 5 explores the effect of the chosen δ on an unstable declaration for an unstable parameter set. The figure gives the proportion of simulation runs out of 100 where the parameter set [0, 0.21] is declared unstable by the  local and global algorithms. Recall that the definition of stability we use compares the drift of the process under consideration with a linear function that depends on δ. As discussed in Section 3, with reference to Figure 2, if a parameter is unstable for a particular δ, then this implies instability for all higher values of δ. This is because a W process parameterized by a particular δ will stochastically dominate all W processes parameterized by higher choices of δ. Figure 5 demonstrates that this occurs for both the global and local search algorithms.
Again we see that the local algorithm appears to perform better-in this example it has detected lower values of downward drift when k * 10 5 , 10 6 . Figure 6 explores the effect of the chosen δ on an unstable declaration for a stable parameter set. The figure gives the proportion of simulation runs out of 100 where the parameter set [0, 0.19] is declared unstable by the global algorithm. The algorithm rejects the null hypothesis of stability once δ reaches approximately 0.28. This indicates that f (Y 10 5 ) exceeds the (estimated) 95-th percentile of W 10 5 parameterized with δ 0.3, but is bounded by the 95-th percentile of a W 10 5 parameterized by δ 0.25. For [0, 0.19] rejecting the null hypothesis of downward drift greater than 0.28 is not unexpected. Over the range of δ considered the local algorithm did not make a declaration of instability; hence the local algorithm nonetheless appears to perform better. In Figure 7 we give the proportion of simulation runs out of 100 where the parameter set [0, ] is declared unstable for a range of . It can be seen that longer simulation runs declare the system unstable for a larger proportion of the values that give an unstable . The figure provides evidence that in the discrete time case the algorithm is performing as it is intended to, in the next section we move to a continuous time example.

Tandem Queues
Our next example is the tandem queueing system. We will contrast the results for a Markov system consisting of two M/M/1 queues with a system that has renewal arrivals and i.i.d. service times at both nodes (which is not Markov). In the former system jobs arrive to a server according to a Poisson process with rate one, they are then processed one at a time, first come first served (FCFS), with Exp(µ −1 1 ) service times, before being sent to a subsequent server where they are again processed one at a time, FCFS, with service time Exp(µ −1 2 ). It is well known that the output from the first server to the second corresponds to a Poisson process with rate min{1, µ −1 1 }. Consequently, the system is L 1 -stable for (µ 1 , µ 2 ) ∈ [0, 1] 2 , and L 1 -unstable otherwise. In the latter system we assume the times between arrivals to the first server are Erlang distributed with rate parameter 1/2 and shape parameter 2. Jobs are also served FCFS and must pass through the first server before being sent to the second. In this case the service times are Weibull distributed with shape parameter 2, so that they have distribution function (1 − exp(−(x/µ) k ) for x ≥ 0, with k 2 and scale parameters µ µ 1 and µ µ 2 for the first and second server, respectively. Note that in both cases the mean time between arrivals is 1, that the mean service times are µ 1 and µ 2 for the former case, and are Γ(1.5)µ 1 ≈ 0.8862 µ 1 and Γ(1.5)µ 2 ≈ 0.8862 µ 2 in the latter case.
To apply our discrete time framework to these continuous time systems, we have used the embedded process corresponding to the sequence of states recorded immediately after each jump (which is Markovian for the M/M/1 system, and non Markovian for the system with renewal arrivals and i.i.d. service times). In Figures 8  and 9 we are testing parameter sets of the form (µ 1 , µ 2 ) ∈ [0, ] 2 , and as such sets with > 1 are L 1 -unstable in the Markov case and approximately > 1.1284 (0.8862) −1 in the non Markov case. In both the global and local cases it is clear that the test converges to an accurate declaration of instability over ∈ (0.5, 1.5) as k * → ∞.  Figure 9. Non-Markovian tandem system L 1 -stability tests with α 0.05 for (µ 1 , µ 2 ) sampled from sets of the form [0, ] 2 with τ(x) 0.5|x| + 1, δ 0.05, σ 1, κ 1, 0.01 and k * 10 5 (dotted), 10 6 (dashed), 10 7 (solid). The figures provides evidence that it is possible to relax the discrete time and Markov assumptions we made in the theoretical development of our algorithm. Further, we are stretching the original modeling framework since there is no fixed σ after which the systems exhibit unstable behavior. The required number of steps before an upward drift is expected to occur depends on the system state. Consequently, over short time periods, unstable parameter choices may appear stable, e.g., in the Markov system, when the second server has a very large queue but a parameter selection with µ 1 > 1 and µ 2 < 2 − µ 1 is made. Nonetheless, asymptotically both systems are expected to become infinitely large due to the first queue being unstable, and through the τ function our algorithm is able to maintain accurate prediction. Due to this, in systems of this kind the choice of c in the τ function may have an important impact on the algorithm's performance.

Global
In Figure 10 we perform instability tests on [0, 1.2] for a range of c. For the global algorithm the choice of c can have a substantial impact on performance, for k * 10 6 a high value of c is required to obtain a high level of accuracy. For the local algorithm, however, the choice of c does not appear to have as much of an effect as the choice of k * . This suggests that if k * is limited by computational resources, then it is preferable to use the global algorithm with a high c-particularly if the system is suspected of exhibiting oscillatory behavior.

Rybko-Stolyar Queueing Network
The Rybko-Stolyar queueing network, displayed in Figure 11, was introduced in Rybko and Stolyar (1993) as an example of a work-conserving queueing network that can be unstable for sub-critical parameter choices. To the best of our knowledge, matching necessary and sufficient conditions for instability are not known.
This queueing network consists of two stations, each with a single server, which we call the left and right stations. All customers served at the left station require Exp(µ l ) service time and all customers served at the right station require Exp(µ r ) service time. There are two classes of customers. The first class enters the network according to a Poisson process at rate λ where it is served at the left station before proceeding to the right station to be served, and from here it departs the network. Jobs from the second class also enter the network at rate λ, are served at the right station, proceed to be served at the left station, and then depart from the Stochastic Systems, 2017, vol. 7, no. 2, pp. 289-314, © 2017 The Author(s) Figure 11. The Rybko-Stolyar network. r l Figure 12. Rybko-Stolyar system L 1 -stability tests with α 0.05 for µ l sampled from sets of the form [ , + 1] for k * 10 5 (dotted), 10 6 (dashed), 10 7 (solid) with λ 1, µ r 4, τ(x) 0.5|x| + 1, δ 0.05, φ κ σ 1 and 0.01. network. Within each customer class the customers are served on a FCFS basis. Between the customer classes, however, there is priority: jobs being served at their second station (bold in Figure 11) have priority over jobs being served at their first station. In Rybko and Stolyar (1993) it is shown that for λ equal to one and µ r > 0, a sufficient condition for instability is µ l < 2. In Figure 12 we consider the situation where µ l is sampled from sets of the form ( , + 1) for ∈ (1, 3), with λ 1 and µ r 4. Due to the result from Rybko and Stolyar (1993) we expect that ∈ (1, 2) will be returned as unstable by the algorithm. This occurs for k * equal to 10 7 . Interestingly, for > 2 we never reject the null hypothesis of stability, suggesting that µ l < 2 is also a necessary condition for instability with λ 1 and µ r > 0. In this case the local algorithm appears to outperform the global algorithm. The estimates for the local algorithm do, however, exhibit a large amount of variance (over the 100 sample paths used to generate the figure).

A Switch Network
Our next example is a network of input-queued switches which was investigated by Andrews and Zhang (2003). This discrete time model provides an example where the LQF policy is not maximally stable. In this simulation study, we are able to demonstrate the use of our algorithm on a model which exhibits complex queueing dynamics on a 52 dimensional state space. Again, unlike the parallel queue or tandem models considered earlier, the explicit form of the stability region of this model is unknown.
The model we are considering is illustrated in Figure 13. It has four main switches with labels A, B, C, and D and four auxiliary switches with labels A , B , C , and D . Each of the main switches has ten external input queues to which a packet arrival occurs instantaneously at the beginning of each time slot independently and with probability r/30.
Packets are given a type according to the switch at which they first arrive, for example packets starting at A are of type 1; packets are routed through the network according to their type. After these arrivals the longest of the 12 queues at each main switch and of the three queues at each auxiliary switch sends a single packet to the corresponding input queue of another switch or are removed from the system (as designated by Figure 13). Packets sent in a time slot arrive at their destination at the beginning of the next time slot.
In Figure 14 we test for L 1 -instability in r on parameter sets of the form s [0.5, ]. Due to the large size of the system we have chosen δ 5. We set φ 40, τ(x) 0.5 |x| + 1, and κ σ 1. Although the stability region for this model is not yet known, this figure provides strong (statistical) evidence that the set [0, 0.95] is unstable. We have thus demonstrated that our algorithm can be used to provide statistical evidence that the LQF policy is not necessarily maximally stable in multi-hop settings. In this case the global algorithm appears to perform much better, suggesting that k * 10 7 is not great enough for the local algorithm to start performing well. It may be the case the ratio |¯ |/ has a substantial impact on performance in finite time. Figure 15 explores this relationship by testing for stability of [ , 0.95] over a variety of (k * , ) combinations. Intuitively, this ratio should have a greater impact on performance of the local algorithm than the global algorithm. Instead, the figure indicates highly similar (poor) performance over the varying combinations of (k * , ), with some degradation of accuracy for very low k * . While for the global algorithm fixing either k * or and then increasing the other leads to substantial increases in accuracy. Figure 15. Network of input queued switches L 1 -stability tests for r sampled from sets of the form [ , 0.95] for k * ∈ (0, 7 · 10 6 ], τ(x) 0.5|x| + 1, δ 5, φ 40, κ σ 1, and 0.01.

A Broken Diamond Random Access Network
So far we have presented classical examples that facilitated the assessment of the algorithm's performance. In our final example we address a contemporary area of research initiated by Ghaderi et al. (2014), exploring the stability properties of a wireless network with a queue-based random-access algorithm. We focus on a network consisting of nodes {1, 2, . . . , 6}, some of which are connected by edges, as depicted in Figure 16 (where it is remarked that Ghaderi et al. (2014) is set in a more general context). In our model we assume that nodes which are connected by an edge interfere with each other, that is, they cannot transmit simultaneously.
In this continuous time model, packets arrive to node i according to a Poisson process with rate λ i and take Exp(µ i ) time to transmit, so that the traffic intensity at node i is ρ i λ i /µ i . Let U(t) ∈ {0, 1} 6 be a vector of indicator variables representing which nodes are active at time t and X(t) ∈ {0, 1, . . .} 6 be a vector representing the number of packets at each node at time t.
In order to fully describe the evolution of this process, we must specify how nodes decide when to attempt transmission of packets. Whenever a node is not being interfered with it will wait an Exp(ν i ) amount of back-off period. At this point, it will then begin transmitting with probability φ i (X i (t)), where φ i (0) 0, and otherwise it will begin another back-off period with the same distribution. After each successful transmission, node i will release the medium and begin a back-off period with probability ψ i (X i (t − )), with ψ i (1) 1 for all i, and otherwise begin another transmission.
It is easy to see that (X, U) is a Markov process evolving according to the rates given in Table 1. Note that hereū i 0 indicates that none of the neighbors of i is transmitting.
In order to perform our stability test, we assume κ φ σ 1. In Figure 17 we test for L 1 instability with δ 0.05. Looking at Figure 17, which uses our simulation based stability test, we are able to say, with a strong Table 1. Transition rates of the random access network network in Figure 16.

Transition
Rate States (x, u) → (x + e i , u) λ i All (x, u) → (x, u + e i ) ν i φ i (x i ) x i > 0, u i 0,ū i 0 (x, u) → (x − e i , u) µ i (1 − ψ i (x i )) x i ≥ 1, u i 1 (x, u) → (x − e i , u − e i ) µ i ψ i (x i ) x i ≥ 1, u i 1 Figure 17. Broken diamond random access network L 1 -stability tests with α 0.05 for sets of the form [0, ] for k * 10 5 (dotted), 10 6 (dashed), 10 7 (solid), τ(x) 0.5|x| + 1, δ 0.05, φ κ σ 1, and 0.01. statistically firm footing, that there exists a constant ρ * (κ,α) < 1, such that for all ρ ∈ (ρ * (κ,α), ] the network is unstable for a range of in approximately [0.6, 1]. This statement expands on the statement of the theorem (for a particular choice of parameters) by allowing for more information to be gained about what values are likely to be possible for ρ * (κ,α). Of course, our statement does not rule out perverse behavior such as the network suddenly exploding after 10 7 jumps of the process. It can however be very quickly and easily applied to similar or even vastly more complex networks. We note that the global algorithm is in this case enabling us to make this strong statement, while the local algorithm algorithm only allows us to make the statement for a substantially reduced set of .

Concluding Remarks
The main contribution of this paper concerned the development of an automated procedure that determines if, for a specified set of parameter values, a given Markov chain is unstable. A distinctive feature of our work is that our method is simulation based, and in addition broadly applicable and straightforward to implement. It provides statistical statements on the stability of the parameter set, but, notably, we have succeeded in providing explicit performance guarantees. Some of our experiments show that our technique provides us with useful insights for models for which the stability set has not been characterized so far.
Our paper can be considered as a pioneering study on this topic, and various extensions and improvements are envisaged. An important first branch of research could relate to relaxing the assumptions imposed, such as the fact that we restrict ourselves to the class of Markov processes and the bounded step size assumption. Experiments that we performed for non Markovian tandem queues indicated that the approach still provides us with the correct result, if we perform our algorithm as if the underlying system is Markovian. In order to remove the bounded step size assumption it would be necessary to use a concentration inequality that is stronger than Azuma-Hoeffding. Additionally, our experiments contrasted global and local search versions of the algorithm. We obtained mixed results on performance, and were unable to declare either version superior to the other. Determining conditions that point towards which of these versions should be used in different circumstances remains to be a challenge.
The objective of a second branch of research could be to enhance our procedure such that it can identify, in case instability is detected, which components of the multi-dimensional Markov chain are unstable. A third branch is of an empirical nature, and relates to models of which the stability region is not yet known. By performing systematic simulation studies one could possibly state conjectures.
Since the increments of f (X (λ) ) are bounded by φ f we have that