Queueing, Predictions, and Large Language Models: Challenges and Open Problems
Published Online: 22 Jul 2025. https://doi.org/10.1287/stsy.2025.0106