Open Access

Queueing, Predictions, and Large Language Models: Challenges and Open Problems

Michael Mitzenmacher
Computer Science Department, Harvard University, Cambridge, Massachusetts 02138

Rana Shahout (Corresponding Author)
https://orcid.org/0000-0002-9254-8529
Computer Science Department, Harvard University, Cambridge, Massachusetts 02138

Published Online: 22 Jul 2025
https://doi.org/10.1287/stsy.2025.0106


Volume 15, Issue 3

September 2025

Pages 195-272

Article Information

Received: March 07, 2025
Accepted: June 16, 2025
Published Online: July 22, 2025

Cite as

Michael Mitzenmacher, Rana Shahout (2025) Queueing, Predictions, and Large Language Models: Challenges and Open Problems. Stochastic Systems 15(3):195-219.

https://doi.org/10.1287/stsy.2025.0106
